[Generative] Captioning and tagging images with Llama

2024. 12. 27. 23:24 · Developers Space [Shorts] / Vision & Audio

1. What? (Phenomenon)

LLaMA (Large Language Model Meta AI) is an LLM first released by Meta in February 2023. Unlike GPT, its weights are openly available, so it is widely used in research.

** LLaMA: Open and Efficient Foundation Language Models (arxiv'23)

** Llama 2: Open Foundation and Fine-Tuned Chat Models (arxiv'23)

** The Llama 3 Herd of Models (arxiv'24)

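As of December 2024, Llama 3.3 is the latest model, and the family shares the following characteristics.

Each version is offered in a range of parameter sizes (1B, 8B, 11B, 70B, 90B, 405B, etc.).
As mentioned above, the weights are openly available. Llama 2 and later are distributed under Meta's own community license rather than a standard open-source license such as GPLv3, and can be used free of charge for research and commercial purposes subject to that license's conditions.
In addition to text-only LLMs, Meta has released multimodal models that accept images as input (Llama 3.2 Vision, referred to as Llama-V in this post), which can be used for tasks such as: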
- VQA(Visual Question Answering) & Visual Reasoning
- DocVQA(Document Visual Question Answering)
- Image Captioning
- Image-Text Retrieval
- Visual Grounding

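For reference, the similarly named LLaVA (Large Language and Vision Assistant) comes from a paper by researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University. It adapts an existing LLM to multimodal input with a method called Visual Instruction Tuning, and it is not a Meta project.

** Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant) (NeurIPS'23)

In this post, I use Llama-V to show Captioning, which produces a fitting sentence for a given image, and Tagging, which produces a list of tags for it.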

2. Why? (Cause)

3. How? (Solution)

Let's download the model and get started.

Step 0. Downloading the model

First, download the Llama model. There are two ways to get it:

Method 1. Use Hugging Face: the weights are in Hugging Face format and can be used directly with the transformers library, which is convenient. You must request and be granted access before downloading.
** https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
** https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct
Method 2. Download directly from Meta's website: the download consists of the files listed below in the original PyTorch checkpoint format, so you have to load the state dict yourself. Access must also be requested and granted here.
** https://www.llama.com/llama-downloads/
- checklist.chk
- consolidated.00.pth
- params.json
- tokenizer.model

# Install Download Tool
pip3 install llama-stack

# Show all list
llama model list

# Download
MODEL_ID=Llama3.2-11B-Vision
llama model download --source meta --model-id $MODEL_ID

I downloaded the model with Method 1, so the rest of this post follows that route. If you used Method 2, the script below can convert the weights to Hugging Face format.

** https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

** Note that, as of this writing, the convert script above does not appear to support the Llama-V models, so I recommend downloading with Method 1.
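
The Method 1 download can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed, that you are logged in with a token, and that access to the gated repository has already been granted. `target_dir_for`, `download_model`, and the `./models` directory are hypothetical names chosen for illustration.

```python
from pathlib import Path


def target_dir_for(repo_id: str, root: str = "./models") -> Path:
    """Map a Hub repo ID such as 'meta-llama/Llama-3.2-11B-Vision' to a local folder."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str, root: str = "./models") -> Path:
    """Download every file of a (gated) Hub repo into a local folder."""
    # Imported lazily so the path helper works even without the package installed.
    from huggingface_hub import snapshot_download

    local_dir = target_dir_for(repo_id, root)
    # Relies on the locally stored login token; gated repos reject anonymous requests.
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


# Example call (needs network access and an approved access request):
# download_model("meta-llama/Llama-3.2-11B-Vision-Instruct")
```

The returned folder can then be passed as the local `model_id` path in the steps that follow.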

Step 1. Llama Vision model

First, let's run the Llama Vision model. The code below loads it into transformers either by passing the Hugging Face model ID or by pointing to a local directory that holds the model in Hugging Face format.

** https://huggingface.co/meta-llama/Llama-3.2-11B-Vision

import torch
from transformers import AutoProcessor
from transformers import MllamaForConditionalGeneration

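# Option A: download from the Hugging Face Hub by model ID
model_id = "meta-llama/Llama-3.2-11B-Vision"
# Option B: load a local copy in Hugging Face format (overrides Option A; remove if unused)
model_id = "/Path/To/Llama-3.2-11B-Vision"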

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to('cuda:0')
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
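
Before moving the model to `cuda:0`, it can help to check whether the weights fit on the GPU. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 11 billion parameters taken from the model name (the exact count differs) and counts the weights alone, ignoring activations and the KV cache. `weight_memory_gib` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal size assumed from the "11B" in the model name.
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(11e9, dtype):.1f} GiB")
```

Under these assumptions, float16 needs roughly 20 GiB for the weights, about half of what float32 would need.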

The following code loads the image.

import glob
from PIL import Image

dataset_root = "/Path/To/Folder"
image_path_list = glob.glob(f"{dataset_root}/**/*.jpg", recursive=True)

image_path = image_path_list[0]

image = Image.open(image_path).convert("RGB")
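
The snippet above takes only the first `.jpg` file and raises an `IndexError` if the folder contains none. A slightly more defensive sketch, using only the standard library, collects several extensions and fails with a clearer message. `list_images` and the extension set are assumptions made for illustration, not part of the original code.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images(dataset_root: str) -> list[Path]:
    """Recursively collect image files under dataset_root, sorted for reproducibility."""
    paths = sorted(
        p for p in Path(dataset_root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    if not paths:
        raise FileNotFoundError(f"No images found under {dataset_root}")
    return paths
```

With this helper, `image_path = list_images(dataset_root)[0]` would stand in for the `glob` call, and looping over the full list would caption a whole folder.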

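Next, write the prompt. As explained later, the model loaded here is a pre-trained model that has not been instruction-tuned, so the prompt has to be phrased as the beginning of a sentence for the model to continue, as follows.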

tag_message = "<|image|><|begin_of_text|>Tags with tagging which are separated with commas are : "
tag_message2 = "<|image|><|begin_of_text|>20 tags with tagging which are separated with commas are : "
caption_message = "<|image|><|begin_of_text|>Description with keywords including genre and style is : " 

message = tag_message
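
Because the pre-trained model simply continues text, each prompt above consists of the image token, the begin-of-text token, and then the start of a sentence for the model to finish. The small sketch below makes that pattern explicit. `build_completion_prompt` and `build_tag_prompt` are hypothetical helpers, and the sentence wording is copied from the prompts above only as an example.

```python
IMAGE_TOKEN = "<|image|>"
BOS_TOKEN = "<|begin_of_text|>"


def build_completion_prompt(sentence_start: str) -> str:
    """Prefix an unfinished sentence with the special tokens used in the prompts above."""
    return f"{IMAGE_TOKEN}{BOS_TOKEN}{sentence_start}"


def build_tag_prompt(num_tags: int) -> str:
    """Request a fixed number of comma-separated tags by starting the sentence."""
    return build_completion_prompt(
        f"{num_tags} tags with tagging which are separated with commas are : "
    )


print(build_tag_prompt(20))
```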

Now pass the loaded image and the text to the processor declared above to build the model inputs. Printing the keys of the resulting inputs gives the following.

inputs = processor(images=image, text=message, return_tensors="pt").to(model.device)
for k in inputs.keys():
    print(k)

input_ids
attention_mask
pixel_values
aspect_ratio_ids
aspect_ratio_mask
cross_attention_mask

The inputs are then used to generate and post-process the output as follows. The results are examined together further below.

kwargs = {
    'max_new_tokens': 1024,
    **inputs
}

generated_ids = model.generate(**kwargs)

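# Method 1: decode and drop special tokens
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Method 2: decode and keep special tokens (overwrites the Method 1 result; use one or the other)
generated_text = processor.decode(generated_ids[0])

# Post-processing: keep the text after the last header token and before the first <|eot_id|>, then drop newlines
generated_text = generated_text.split('<|end_header_id|>')[-1].split('<|eot_id|>')[0].replace('\n', '')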

print("\n\nResult::")
print(generated_text)
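
`generate` returns the prompt tokens followed by the newly generated ones, so the decoded text starts with the prompt. One common way to keep only the continuation is to drop the first `prompt_len` tokens of each row before decoding. The sketch below shows the slicing on plain Python lists so that it runs without a GPU. `strip_prompt_tokens` is a hypothetical helper, and with the real objects the prompt length would be `inputs["input_ids"].shape[-1]`.

```python
def strip_prompt_tokens(generated_ids, prompt_len: int):
    """Return, per sequence, only the tokens produced after the prompt."""
    return [row[prompt_len:] for row in generated_ids]


# Toy example: a 3-token prompt followed by 2 generated tokens.
prompt_ids = [101, 7, 8]
generated_ids = [prompt_ids + [42, 43]]
new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)  # [[42, 43]]
```

With tensors, the equivalent slice is `generated_ids[:, prompt_len:]`, and the result can be decoded with `skip_special_tokens=True` in place of the string splitting.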

Step 2. Llama Vision Instruct

As you will notice when trying it, the model above is a pre-trained model without instruction tuning, so it can be hard to get the output you want.

To use a model that has been tuned on vision-related instructions, we switch to the model below.

** https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct

The code for loading the model is the same as above.

import torch
from transformers import AutoProcessor
from transformers import MllamaForConditionalGeneration

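# Option A: download from the Hugging Face Hub by model ID
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# Option B: load a local copy in Hugging Face format (overrides Option A; remove if unused)
model_id = "/Path/To/Llama-3.2-11B-Vision-Instruct"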

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to('cuda:0')
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

The same image is used as before.

import glob
from PIL import Image

dataset_root = "/Path/To/Folder"
image_path_list = glob.glob(f"{dataset_root}/**/*.jpg", recursive=True)

image_path = image_path_list[0]

image = Image.open(image_path).convert("RGB")

Now generate the tags. Because this model is instruction-tuned, the request has to be written as a message from the user role, as below. The results for both tag_message and caption_message are shown further down.

tag_message = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please give me 10 tags with tagging which are separated with commas."}
    ]}
]
caption_message = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe with keywords including genre, style and what the picture intend to show."}
    ]}
]

message = tag_message

Next, build the inputs from the text and image above, again using the processor declared earlier.

input_text = processor.apply_chat_template(message, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)
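
`apply_chat_template` turns the message list into a single prompt string. The sketch below hand-writes an approximation of that string, based on the raw output shown later in this post, purely to illustrate why `add_special_tokens=False` is passed afterwards: the rendered prompt already begins with `<|begin_of_text|>`. `render_prompt` is a hypothetical function, not the library's implementation, and the real template may differ in details such as whitespace.

```python
def render_prompt(messages, add_generation_prompt: bool = True) -> str:
    """Approximate the Llama 3 chat layout for a list of role/content messages."""
    out = "<|begin_of_text|>"
    for message in messages:
        out += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        for part in message["content"]:
            # Image parts become a placeholder token; text parts are copied as-is.
            out += "<|image|>" if part["type"] == "image" else part["text"]
        out += "<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header signals that the model should answer next.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


tag_message = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please give me 10 tags with tagging which are separated with commas."},
    ]}
]
print(render_prompt(tag_message))
```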

Generation and post-processing work the same way as before.

kwargs = {
    'max_new_tokens': 1024,
    **inputs
}

generated_ids = model.generate(**kwargs)

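# Method 1: decode and drop special tokens
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Method 2: decode and keep special tokens (overwrites the Method 1 result; use one or the other)
generated_text = processor.decode(generated_ids[0])

# Post-processing: keep the text after the last header token and before the first <|eot_id|>, then drop newlines
generated_text = generated_text.split('<|end_header_id|>')[-1].split('<|eot_id|>')[0].replace('\n', '')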

print("\n\nResult::")
print(generated_text)

The result for tag_message is shown below. Note that although the post-processing step in the code above strips the special tokens, this is the raw output before that step.

Result::
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>Please give me 10 tags with tagging which are separated with commas.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
contemporary art, abstract expressionism, oil painting, painting, art, black painting, anselm kiefer, modern art, abstract.<|eot_id|>
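
To use the tags programmatically, the raw string can be reduced to the assistant's reply and split on commas. Below is a minimal sketch applied to the raw output above. `extract_reply` and `parse_tags` are hypothetical helpers, and the cleanup rules (lower-casing, stripping a trailing period, dropping duplicates) are assumptions about what is wanted.

```python
def extract_reply(decoded: str) -> str:
    """Keep the text between the last header token and the following <|eot_id|>."""
    reply = decoded.split("<|end_header_id|>")[-1].split("<|eot_id|>")[0]
    return reply.strip()


def parse_tags(reply: str) -> list[str]:
    """Split a comma-separated reply into unique, lower-cased tags in original order."""
    tags = []
    for raw in reply.split(","):
        tag = raw.strip().rstrip(".").lower()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


raw_output = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "<|image|>Please give me 10 tags with tagging which are separated with commas."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    "contemporary art, abstract expressionism, oil painting, painting, art, "
    "black painting, anselm kiefer, modern art, abstract.<|eot_id|>"
)
print(parse_tags(extract_reply(raw_output)))
```

Counting the parsed list also makes it easy to check whether the model returned the number of tags that was requested.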

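The result for caption_message is below, again without the special tokens stripped. Note that the prompt in this output is a shorter variant of the caption_message defined above.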

Result::
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>Describe with keywords including genre and style.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
This image depicts a striking abstract painting featuring a predominantly black background with a textured, scratched effect, punctuated by a vibrant orange and green spot in the top right corner. The dark, almost black hue is punctuated by subtle white and yellow specks, adding depth and visual interest to the composition. The painting's style is reminiscent of Expressionism, with bold, expressive brushstrokes and a focus on conveying emotion and mood through color and texture. The overall effect is one of dynamic energy and tension, as if the artist has distilled the essence of a powerful experience into a single, captivating image.<|eot_id|>

** https://blog.ori.co/how-to-run-llama3.2-on-a-cloud-gpu-with-transformers

** https://huggingface.co/docs/transformers/v4.47.1/model_doc/mllama
