I ran the inference script, which is the same as the one in the README:
```python
# For T5-based model
from model.instructblip import (
    InstructBlipConfig,
    InstructBlipModel,
    InstructBlipPreTrainedModel,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)
from PIL import Image
import torch

model_type = "instructblip"
model_ckpt = "/nas/wutao/llms/MMICL-Instructblip-T5-xxl"
processor_ckpt = "/nas/wutao/llms/instructblip-flan-t5-xxl"
config = InstructBlipConfig.from_pretrained(model_ckpt)
device = torch.device('cuda:0')

if 'instructblip' in model_type:
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to(device, dtype=torch.bfloat16)

# "图" is the placeholder character that stands in for each image's visual tokens
image_placeholder = "图"
sp = [image_placeholder] + [f"<image{i}>" for i in range(20)]
processor = InstructBlipProcessor.from_pretrained(processor_ckpt)
# Keep any extra special tokens the tokenizer already carries beyond ours
sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})
if model.qformer.embeddings.word_embeddings.weight.shape[0] != len(processor.qformer_tokenizer):
    model.qformer.resize_token_embeddings(len(processor.qformer_tokenizer))

# Each image is represented by 32 placeholder tokens in the prompt
replace_token = "".join(32 * [image_placeholder])

image = Image.open("images/flamingo_photo.png")
image1 = Image.open("images/flamingo_cartoon.png")
image2 = Image.open("images/flamingo_3d.png")
images = [image, image1, image2]

prompt = [f'Use the image 0: <image0>{replace_token}, image 1: <image1>{replace_token} and image 2: <image2>{replace_token} as a visual aids to help you answer the question. Question: Give the reason why image 0, image 1 and image 2 are different? Answer:']
prompt = " ".join(prompt)

inputs = processor(images=images, text=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
# Mask marking every image slot as valid
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
# Add a batch dimension: (batch, num_images, C, H, W)
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs = inputs.to('cuda:0')

outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    max_length=80,
    min_length=50,
    num_beams=8,
    set_min_padding_size=False,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
```
And I got bad output:
```
2023-12-19 15:39:32.621621: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621690: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
image 0 is a flamingo, image 1 is a flamingo and image 2 is a polygonal flamingo,...,...,...,...,...,...,...,
```
What could be the problem? I ran the code on a single A100, and my pip packages have the same versions as in environment.yml.
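For reference, a quick sanity check to confirm the installed versions really match environment.yml (checking torch and transformers here is just an assumption about the most likely suspects):

```python
# Quick sanity check: print the versions of the packages most likely to
# affect generation, to compare against the pins in environment.yml.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("cuda available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())  # relevant since the model runs in bfloat16
```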
Besides, I got a lower score on the MME benchmark:
```
=========== Perception ===========
total score: 1313.6530612244896
existence score: 175.0
count score: 146.66666666666666
position score: 70.0
color score: 160.0
posters score: 115.98639455782313
celebrity score: 125.0
scene score: 155.0
landmark score: 131.0
artwork score: 110.0
OCR score: 125.0
=========== Cognition ===========
total score: 275.3571428571429
```
The prompt is "Use the image 0: {replace_token} as a visual aid to help you answer the questions accurately. Question: {question}", which was mentioned in previous issues; see the sketch below for how it gets filled in.
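For concreteness, this is how that single-image template is instantiated (`build_mme_prompt` is a hypothetical helper, and the question string is just an example; `replace_token` is the same 32-copy placeholder as in the script above):

```python
# Hypothetical helper: fill the single-image MME prompt template.
image_placeholder = "图"
replace_token = "".join(32 * [image_placeholder])

def build_mme_prompt(question: str) -> str:
    return (f"Use the image 0: {replace_token} as a visual aid "
            f"to help you answer the questions accurately. "
            f"Question: {question}")

prompt = build_mme_prompt("Is there a dog in the image?")  # example question
```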
How can I solve this problem?
Based on my usage, MMICL seems inclined to produce overly short sentences, which makes it lose most of its complex-reasoning capability.
If min_length is increased to force it to produce longer sentences, it resorts to filling them with meaningless punctuation and words. For example: "a flamingo standing in the water, with a reflection of the sky in the water. it is a beautiful image of a... ... ... ... ... ... ... [Images of... the... [Flamingo]... [Images of... [Flam the first bird in... [Images of... the first bird in."
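One thing worth trying (a sketch only; I have not verified that it fixes MMICL): drop the hard min_length and instead discourage the repetition directly with standard Hugging Face generate() options, reusing the `model`, `processor`, and `inputs` from the script above:

```python
# Alternative decoding settings: let the model stop when it is done,
# and penalize the repetitive filler instead of forcing a minimum length.
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    num_beams=8,
    max_length=80,
    # no min_length here
    no_repeat_ngram_size=3,   # block verbatim 3-gram repeats
    repetition_penalty=1.2,   # mildly penalize already-generated tokens
    set_min_padding_size=False,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```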