Bad inference output #27

Open
McCarrtney opened this issue Dec 19, 2023 · 1 comment

@McCarrtney
I ran the inference script, which is the same as the one in the README:

# For T5-based model
from model.instructblip import (
    InstructBlipConfig,
    InstructBlipModel,
    InstructBlipPreTrainedModel,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)
import datasets
import json
import transformers
from PIL import Image
import torch
import os

model_type = "instructblip"
model_ckpt = "/nas/wutao/llms/MMICL-Instructblip-T5-xxl"
processor_ckpt = "/nas/wutao/llms/instructblip-flan-t5-xxl"
config = InstructBlipConfig.from_pretrained(model_ckpt)

device = torch.device('cuda:0')

# Load the model in bfloat16 on a single GPU
if 'instructblip' in model_type:
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to(device, dtype=torch.bfloat16)

# Register the image placeholder and the per-image special tokens with the tokenizer
image_placeholder = "图"
sp = [image_placeholder] + [f"<image{i}>" for i in range(20)]
processor = InstructBlipProcessor.from_pretrained(
    processor_ckpt
)
sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})
if model.qformer.embeddings.word_embeddings.weight.shape[0] != len(processor.qformer_tokenizer):
    model.qformer.resize_token_embeddings(len(processor.qformer_tokenizer))
# Each image is represented by 32 copies of the placeholder token
replace_token = "".join(32 * [image_placeholder])



image = Image.open("images/flamingo_photo.png")
image1 = Image.open("images/flamingo_cartoon.png")
image2 = Image.open("images/flamingo_3d.png")

images = [image, image1, image2]
prompt = [f'Use the image 0: <image0>{replace_token}, image 1: <image1>{replace_token} and image 2: <image2>{replace_token} as a visual aids to help you answer the question. Question: Give the reason why image 0, image 1 and image 2 are different? Answer:']

prompt = " ".join(prompt)

inputs = processor(images=images, text=prompt, return_tensors="pt")

# Cast pixel values to bfloat16 and add the batch and image-mask dimensions MMICL expects
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

inputs = inputs.to('cuda:0')
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    max_length=80,
    min_length=50,
    num_beams=8,
    set_min_padding_size=False,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

And got bad output:

2023-12-19 15:39:32.621621: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621690: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-19 15:39:32.621699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
image 0 is a flamingo, image 1 is a flamingo and image 2 is a polygonal flamingo,...,...,...,...,...,...,...,

What's the problem? I ran the code on a single A100, and my pip packages have the same versions as those in environment.yml.
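For reference, a minimal sanity check along these lines is what I used to compare my environment against environment.yml (it only prints versions; nothing here is specific to MMICL):

# Print the versions of the packages relevant to this script,
# to compare against the pins in environment.yml.
import torch
import transformers
import PIL

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())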
Besides, I got lower scores on the MME benchmark:
=========== Perception ===========
total score: 1313.65

existence score: 175.00
count score: 146.67
position score: 70.00
color score: 160.00
posters score: 115.99
celebrity score: 125.00
scene score: 155.00
landmark score: 131.00
artwork score: 110.00
OCR score: 125.00

=========== Cognition ===========
total score: 275.36

commonsense_reasoning score: 117.86
numerical_calculation score: 42.50
text_translation score: 55.00
code_reasoning score: 60.00
The prompt I used is "Use the image 0: {replace_token} as a visual aid to help you answer the questions accurately. Question: {question}", which had been mentioned in previous issues.
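Concretely, I format each MME question with that template roughly like this (a sketch; build_mme_prompt is just an illustrative helper, the MME data loading is omitted, and replace_token is the 32-copy image placeholder built in the script above):

# Sketch: per-question prompt construction for the MME evaluation.
def build_mme_prompt(question: str) -> str:
    return (
        f"Use the image 0: {replace_token} as a visual aid to help you "
        f"answer the questions accurately. Question: {question}"
    )

# hypothetical example question in MME's yes/no format
prompt = build_mme_prompt("Is there a dog in this image? Please answer yes or no.")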

How to solve this problem?

@Jianzhao-Huang

I've faced a similar issue to yours.

In my experience, MMICL tends to produce overly short sentences, which makes it lose most of its complex reasoning capability.

If the min_length is increased to force it to produce longer sentences, it resorts to filling them with meaningless punctuation and words. For example, "a flamingo standing in the water, with a reflection of the sky in the water. it is a beautiful image of a... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [ellipsis]... [Images of... the... [Flamingo]... [Images of... [Flam the first bird in... [Images of... the first bird in."
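For what it's worth, the workaround I have been experimenting with is to drop min_length and discourage repetition instead, reusing the inputs from the script above (the exact values below are guesses, not tuned for MMICL):

# Hypothetical alternative generation settings: no forced minimum length;
# block repeated 3-grams and apply a mild repetition penalty instead.
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
    do_sample=False,
    max_length=80,
    num_beams=8,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    set_min_padding_size=False,
)

Whether that actually recovers the longer reasoning is a separate question.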

Any idea about this?
@HaozheZhao
