
Conversation

@Solus-sano

While validating OK-VQA results I noticed that BLIP-2 sometimes produces captions that are clearly unrelated to the input image.

For example, when using BLIP-2 to caption the first image in OK-VQA (COCO_val2014_000000297147.jpg), the result is "a black and white image of a sculpture".

After tracing the preprocessing steps I found that the same image is rescaled twice:

  • torchvision.transforms.ToTensor() converts the PIL.Image to float32 and divides every pixel by 255, putting values in [0, 1].
  • Blip2Processor (BlipImageProcessor) then performs its own rescaling, dividing by 255 again and pushing values into [0, 0.0039].

The model therefore “sees” an almost-black image, which explains the degraded caption quality. Transformers even emits the warning:

It looks like you are trying to rescale already rescaled images.
If the input images have pixel values between 0 and 1, set `do_rescale=False`
to avoid rescaling them again.

Proposed Change

  • Set do_rescale=False when calling Blip2Processor (see the sketch below).
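
Below is a minimal sketch of both the double rescaling and the proposed fix. The checkpoint name "Salesforce/blip2-opt-2.7b" is only an assumption for illustration, and do_resize/do_normalize are disabled here purely so the raw value range stays visible:

    # Illustrative sketch, not the repository's code: shows how ToTensor() plus
    # the processor's default do_rescale=True divides by 255 twice.
    from PIL import Image
    from torchvision import transforms
    from transformers import Blip2Processor

    image = Image.open("COCO_val2014_000000297147.jpg").convert("RGB")

    # ToTensor() already scales pixel values into [0, 1].
    tensor = transforms.ToTensor()(image)
    print(tensor.max())  # ~1.0

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Default behaviour: the image processor divides by 255 again, collapsing
    # the values into roughly [0, 0.0039] (effectively an almost-black image).
    twice = processor.image_processor(
        tensor, do_resize=False, do_normalize=False, return_tensors="pt"
    )
    print(twice["pixel_values"].max())

    # Proposed fix: skip the second division with do_rescale=False.
    once = processor.image_processor(
        tensor, do_resize=False, do_rescale=False, do_normalize=False, return_tensors="pt"
    )
    print(once["pixel_values"].max())  # back to ~1.0

With the defaults, transformers also prints the "already rescaled images" warning quoted above.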

@ControlNet
Owner

ControlNet commented Jun 29, 2025

I used exactly the same image and config, and it seems to work well for me without do_rescale=False. Also, there is no warning about "rescale already rescaled images".

           DEBUG    ---------------Sending code to executor:---------------              reasoner.py:113
           DEBUG    image_patch.simple_query("What is this?", qa_mode=False)             reasoner.py:114
[14:18:51] DEBUG    ---------------Received result from executor:---------------         reasoner.py:118
           DEBUG    ExecutionResult(type='continue', feedbacks=['The caption for image patch original_image is: a motorcycle is parked on the side of the road'], variables=[], variable_names=[], final_result='')    reasoner.py:119

But I checked the BLIP-2 processor config and the transformers documentation, and it does indeed divide by 255. It looks like you're absolutely correct.

So I'm not sure whether this change will work well across different transformers versions. Before I merge the PR, could you please share your versions of Python, torch, and transformers? My current versions are:

  • Python 3.11.13
  • torch 2.1.2
  • transformers 4.43.4

@Solus-sano
Author

Hi, first of all my apologies for raising the earlier alarm before I had walked through the full code path.
After double-checking, I see that self.forward() does indeed rescale the tensors back to [0, 255]:

    if len(image) > 0 and 'float' in str(image[0].dtype) and image[0].max() <= 1:
        # float tensors in [0, 1] (e.g. from ToTensor()) are scaled back up to [0, 255]
        image = [im * 255 for im in image]

I had only used self.qa() while debugging, which is where I saw the warning.
So the double rescale I thought I had caught is not an issue here.
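
In other words, the value ranges along that path work out like this (a toy trace with a random tensor; the names and shapes are illustrative, not the repository's actual variables):

    # Toy trace of the value ranges; shapes and names are illustrative only.
    import torch

    x = torch.rand(3, 224, 224)   # after ToTensor(): float values in [0, 1]
    x = x * 255                   # self.forward() scales back up to [0, 255]
    x = x / 255                   # the processor's do_rescale step divides by 255 once
    assert 0.0 <= x.min() and x.max() <= 1.0  # the model still receives [0, 1]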

My current versions are:

  • Python 3.11.13
  • torch 2.1.2
  • transformers 4.52.4

However, I am having difficulty reproducing the reported results (~26.8% accuracy on OK-VQA, far below the reported 46-48%). Following suggestions from a related issue thread, I optimized the code-generation prompt. This noticeably reduced the number of "null" and "continue" outputs and improved the accuracy to 34.2%, but that is still considerably lower than the expected performance. Upon further investigation, I observed that in several bad cases the captions generated by BLIP-2 are inaccurate and deviate significantly from the actual image content.

Could you please provide some advice to help me resolve this discrepancy? I am considering several possibilities:

  • Could my BLIP-2 checkpoint be corrupted? (A quick standalone sanity check is sketched after this list.)
  • Would further optimization of the prompt be sufficient to reach the reported accuracy?
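
For the first point, this is the kind of standalone check I have in mind (the checkpoint name is an assumption; I would substitute whichever one the repository actually loads):

    # Caption one image with BLIP-2 outside the OK-VQA pipeline; a sensible caption
    # here would suggest the checkpoint itself is not corrupted.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    model_id = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint
    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("COCO_val2014_000000297147.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated = model.generate(**inputs, max_new_tokens=30)
    print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())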

@ControlNet
Owner

> After double-checking, I see that self.forward() does indeed rescale the tensors back to [0, 255]

Thank you for confirming it. I will close the pull request as it is not required.

> Following suggestions from a related issue thread, I optimized the code-generation prompt. This noticeably reduced the number of "null" and "continue" outputs and improved the accuracy to 34.2%.

The non-RL version depends heavily on prompt quality, which affects performance significantly.

> Could my BLIP-2 checkpoint be corrupted?

I don't think that is the problem.

> Would further optimization of the prompt be sufficient to reach the reported accuracy?

I think yes.

Also, I should note that the reported numbers in the paper are calculated following previous works, i.e. excluding runtime failures. Besides, the non-RL version is a few percentage points worse than the RL version.
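
For illustration only, with made-up numbers rather than results from the paper, excluding runtime failures shifts the headline accuracy like this:

    # Made-up numbers to illustrate the metric difference, not results from the paper.
    correct, failures, total = 420, 120, 1000

    acc_over_all = correct / total                # counts failed runs as wrong
    acc_excluding = correct / (total - failures)  # convention followed by previous works

    print(f"over all samples:           {acc_over_all:.1%}")   # 42.0%
    print(f"excluding runtime failures: {acc_excluding:.1%}")  # 47.7%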

ControlNet closed this Jun 30, 2025