Bad inference output #27

Open
McCarrtney opened this issue Dec 19, 2023 · 1 comment

@McCarrtney
I ran the inference script, which is the same as the one in the README:

# For T5-based model
from model.instructblip import (
    InstructBlipConfig,
    InstructBlipModel,
    InstructBlipPreTrainedModel,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)
import datasets
import json
import transformers
from PIL import Image
import torch
import os

model_type = "instructblip"
model_ckpt = "/nas/wutao/llms/MMICL-Instructblip-T5-xxl"
processor_ckpt = "/nas/wutao/llms/instructblip-flan-t5-xxl"
config = InstructBlipConfig.from_pretrained(model_ckpt)

device = torch.device('cuda:0')

# Load the model in bfloat16 on a single GPU
if 'instructblip' in model_type:
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to(device, dtype=torch.bfloat16)

# Register the image placeholder and the per-image special tokens with the tokenizer
image_placeholder = "图"
sp = [image_placeholder] + [f"<image{i}>" for i in range(20)]
processor = InstructBlipProcessor.from_pretrained(
    processor_ckpt
)
sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})
if model.qformer.embeddings.word_embeddings.weight.shape[0] != len(processor.qformer_tokenizer):
    model.qformer.resize_token_embeddings(len(processor.qformer_tokenizer))
# Each image is represented by 32 copies of the placeholder token
replace_token = "".join(32 * [image_placeholder])



image = Image.open("images/flamingo_photo.png")
image1 = Image.open("images/flamingo_cartoon.png")
image2 = Image.open("images/flamingo_3d.png")

images = [image, image1, image2]
prompt = [f'Use the image 0: <image0>{replace_token}, image 1: <image1>{replace_token} and image 2: <image2>{replace_token} as a visual aids to help you answer the question. Question: Give the reason why image 0, image 1 and image 2 are different? Answer:']

prompt = " ".join(prompt)

inputs = processor(images=images, text=prompt, return_tensors="pt")

# Cast pixel values to bfloat16 and add the batch and image-mask dimensions MMICL expects
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

inputs = inputs.to('cuda:0')
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    max_length=80,
    min_length=50,
    num_beams=8,
    set_min_padding_size=False,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

And got bad output:

2023-12-19 15:39:32.621621: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621690: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
image 0 is a flamingo, image 1 is a flamingo and image 2 is a polygonal flamingo,...,...,...,...,...,...,...,

What's the problem? I ran the code on a single A100, and my pip packages have the same versions as those in environment.yml.
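For reference, a minimal sanity check along these lines is what I used to compare my environment against environment.yml (it only prints versions; nothing here is specific to MMICL):

# Print the versions of the packages relevant to this script,
# to compare against the pins in environment.yml.
import torch
import transformers
import PIL

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())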
Besides, I got lower scores on the MME benchmark:
=========== Perception ===========
total score: 1313.65

existence score: 175.00
count score: 146.67
position score: 70.00
color score: 160.00
posters score: 115.99
celebrity score: 125.00
scene score: 155.00
landmark score: 131.00
artwork score: 110.00
OCR score: 125.00

=========== Cognition ===========
total score: 275.36

commonsense_reasoning score: 117.86
numerical_calculation score: 42.50
text_translation score: 55.00
code_reasoning score: 60.00
The prompt I used is "Use the image 0: {replace_token} as a visual aid to help you answer the questions accurately. Question: {question}", which had been mentioned in previous issues.
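Concretely, I format each MME question with that template roughly like this (a sketch; build_mme_prompt is just an illustrative helper, the MME data loading is omitted, and replace_token is the 32-copy image placeholder built in the script above):

# Sketch: per-question prompt construction for the MME evaluation.
def build_mme_prompt(question: str) -> str:
    return (
        f"Use the image 0: {replace_token} as a visual aid to help you "
        f"answer the questions accurately. Question: {question}"
    )

# hypothetical example question in MME's yes/no format
prompt = build_mme_prompt("Is there a dog in this image? Please answer yes or no.")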

How to solve this problem?

@Jianzhao-Huang

I've faced a similar issue to yours.

In my experience, MMICL tends to produce overly short sentences, which makes it lose most of its complex reasoning capability.

If the min_length is increased to force it to produce longer sentences, it resorts to filling them with meaningless punctuation and words. For example, "a flamingo standing in the water, with a reflection of the sky in the water. it is a beautiful image of a... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [Images of... the... [Flamingo]... [Images of... [Flam the first bird in... [Images of... the first bird in."
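For what it's worth, the workaround I have been experimenting with is to drop min_length and discourage repetition instead, reusing the inputs from the script above (the exact values below are guesses, not tuned for MMICL):

# Hypothetical alternative generation settings: no forced minimum length;
# block repeated 3-grams and apply a mild repetition penalty instead.
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    max_length=80,
    num_beams=8,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    set_min_padding_size=False,
)

Whether that actually recovers the longer reasoning is a separate question.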

Any idea about this?
@HaozheZhao
