
Conversation

@Solus-sano

While validating OK-VQA results I noticed that BLIP-2 sometimes produces captions that are clearly unrelated to the input image.

For example, when using BLIP-2 to caption the first image in OK-VQA (COCO_val2014_000000297147.jpg), the result is "a black and white image of a sculpture".

After tracing the preprocessing steps I found that the same image is rescaled twice:

  • torchvision.transforms.ToTensor() converts the PIL.Image to float32 and divides every pixel by 255, putting values in [0, 1].
  • Blip2Processor (BlipImageProcessor) then performs its own rescaling, dividing by 255 again and pushing values into [0, 0.0039].

The model therefore “sees” an almost-black image, which explains the degraded caption quality. Transformers even emits the warning:

It looks like you are trying to rescale already rescaled images.
If the input images have pixel values between 0 and 1, set `do_rescale=False`
to avoid rescaling them again.

Proposed Change

  • Set do_rescale=False when calling Blip2Processor (see the sketch below).
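
Below is a minimal sketch of both the double rescaling and the proposed fix. The checkpoint name "Salesforce/blip2-opt-2.7b" is only an assumption for illustration, and do_resize/do_normalize are disabled here purely so the raw value range stays visible:

    # Illustrative sketch, not the repository's code: shows how ToTensor() plus
    # the processor's default do_rescale=True divides by 255 twice.
    from PIL import Image
    from torchvision import transforms
    from transformers import Blip2Processor

    image = Image.open("COCO_val2014_000000297147.jpg").convert("RGB")

    # ToTensor() already scales pixel values into [0, 1].
    tensor = transforms.ToTensor()(image)
    print(tensor.max())  # ~1.0

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Default behaviour: the image processor divides by 255 again, collapsing
    # the values into roughly [0, 0.0039] (effectively an almost-black image).
    twice = processor.image_processor(
        tensor, do_resize=False, do_normalize=False, return_tensors="pt"
    )
    print(twice["pixel_values"].max())

    # Proposed fix: skip the second division with do_rescale=False.
    once = processor.image_processor(
        tensor, do_resize=False, do_rescale=False, do_normalize=False, return_tensors="pt"
    )
    print(once["pixel_values"].max())  # back to ~1.0

With the defaults, transformers also prints the "already rescaled images" warning quoted above.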

@ControlNet
Owner

ControlNet commented Jun 29, 2025

I used exactly the same image and config, and it seems to work well for me without do_rescale=False. Also, there is no warning about "rescale already rescaled images".

           DEBUG    ---------------Sending code to executor:---------------              reasoner.py:113
           DEBUG    image_patch.simple_query("What is this?", qa_mode=False)             reasoner.py:114
[14:18:51] DEBUG    ---------------Received result from executor:---------------         reasoner.py:118
           DEBUG    ExecutionResult(type='continue', feedbacks=['The caption for image patch original_image is: a motorcycle is parked on the side of the road'], variables=[], variable_names=[], final_result='')    reasoner.py:119

But I checked the BLIP-2 processor config and the transformers documentation, and it does indeed divide by 255. It looks like you're absolutely correct.

So I'm not sure whether this change will work well across different transformers versions. Before I merge the PR, could you please share your versions of Python, torch, and transformers? My current versions are:

  • Python 3.11.13
  • torch 2.1.2
  • transformers 4.43.4

@Solus-sano
Author

Hi, first of all my apologies for raising the earlier alarm before I had walked through the full code path.
After double-checking, I see that self.forward() does indeed rescale the tensors back to [0, 255]:

    if len(image) > 0 and 'float' in str(image[0].dtype) and image[0].max() <= 1:
        # float tensors in [0, 1] (e.g. from ToTensor()) are scaled back up to [0, 255]
        image = [im * 255 for im in image]

I had only used self.qa() while debugging, which is where I saw the warning.
So the double rescale I thought I had caught is not an issue here.
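
In other words, the value ranges along that path work out like this (a toy trace with a random tensor; the names and shapes are illustrative, not the repository's actual variables):

    # Toy trace of the value ranges; shapes and names are illustrative only.
    import torch

    x = torch.rand(3, 224, 224)   # after ToTensor(): float values in [0, 1]
    x = x * 255                   # self.forward() scales back up to [0, 255]
    x = x / 255                   # the processor's do_rescale step divides by 255 once
    assert 0.0 <= x.min() and x.max() <= 1.0  # the model still receives [0, 1]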

My current versions are:

  • Python 3.11.13
  • torch 2.1.2
  • transformers 4.52.4

However, I am having difficulty reproducing the reported results (~26.8% accuracy on OK-VQA, far below the reported 46-48%). Following suggestions from a related issue thread, I optimized the code-generation prompt. This noticeably reduced the number of "null" and "continue" outputs and improved the accuracy to 34.2%, but that is still considerably lower than the expected performance. Upon further investigation, I observed that in several bad cases the captions generated by BLIP-2 are inaccurate and deviate significantly from the actual image content.

Could you please provide some advice to help me resolve this discrepancy? I am considering several possibilities:

  • Could my BLIP-2 checkpoint be corrupted? (A quick standalone sanity check is sketched after this list.)
  • Would further optimization of the prompt be sufficient to reach the reported accuracy?
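
For the first point, this is the kind of standalone check I have in mind (the checkpoint name is an assumption; I would substitute whichever one the repository actually loads):

    # Caption one image with BLIP-2 outside the OK-VQA pipeline; a sensible caption
    # here would suggest the checkpoint itself is not corrupted.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    model_id = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint
    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("COCO_val2014_000000297147.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated = model.generate(**inputs, max_new_tokens=30)
    print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())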

@ControlNet
Owner

> After double-checking, I see that self.forward() does indeed rescale the tensors back to [0, 255]

Thank you for confirming it. I will close the pull request as it is not required.

> Following suggestions from a related issue thread, I optimized the code-generation prompt. This noticeably reduced the number of "null" and "continue" outputs and improved the accuracy to 34.2%.

The non-RL version depends heavily on prompt quality, which affects performance significantly.

> Could my BLIP-2 checkpoint be corrupted?

I don't think that is the problem.

> Would further optimization of the prompt be sufficient to reach the reported accuracy?

I think yes.

Also, I should note that the reported numbers in the paper are calculated following previous works, i.e. excluding runtime failures. Besides, the non-RL version is a few percentage points worse than the RL version.
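
For illustration only, with made-up numbers rather than results from the paper, excluding runtime failures shifts the headline accuracy like this:

    # Made-up numbers to illustrate the metric difference, not results from the paper.
    correct, failures, total = 420, 120, 1000

    acc_over_all = correct / total                # counts failed runs as wrong
    acc_excluding = correct / (total - failures)  # convention followed by previous works

    print(f"over all samples:           {acc_over_all:.1%}")   # 42.0%
    print(f"excluding runtime failures: {acc_excluding:.1%}")  # 47.7%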

ControlNet closed this Jun 30, 2025