Hi, thank you for your great work on TokenPacker!
I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.
### Hardware Setup

### Results Comparison
- Row 1: Results reported in the paper
- Row 2: Results from the released checkpoint
- Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |
### Experiment Details
- **Exp 1**
  - Pretrain: LR = 1e-3, batch size = 256 (32 × 4 GPUs, grad_accum = 2; see the batch-size sketch after this list)
  - Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
  - Results far from the paper/released checkpoint
- **Exp 2** (following Issue #12)
- **Exp 3**
  - Same as Exp 2, but batch size = 64 (16 × 4 GPUs, grad_accum = 1)
  - Still far from the expected results
- **Exp 4**
  - Same as Exp 1, but with the `deepspeed` seed and dataset seed set to 2024
  - Still not close to the paper/released checkpoint
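To make the batch-size accounting above explicit, here is a minimal Python sketch of how I compute the effective batch sizes (the helper name is mine, not from the TokenPacker code base):

```python
# Effective batch size = per-GPU batch * number of GPUs * gradient-accumulation steps.
def effective_batch_size(per_gpu: int, gpus: int, grad_accum: int) -> int:
    return per_gpu * gpus * grad_accum

print(effective_batch_size(32, 4, 2))  # Exp 1 pretrain: 256
print(effective_batch_size(16, 4, 2))  # Exp 1 instruction FT: 128
print(effective_batch_size(16, 4, 1))  # Exp 3 instruction FT: 64
```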
### Questions
- Could you clarify:
  - the exact learning-rate schedule and batch-size settings used in pretraining/finetuning?
  - whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce the results?
- Could you also provide the pretraining dataset JSON and the instruction-tuning dataset JSON?
- I noticed that in `sunshine-lwt/TokenPacker-HD-7b-9patch-144token`, the instruction-tuning `trainer_state.json` reports a global step of 11,627. With a batch size of 128, that implies about 11,627 × 128 = 1,488,256 samples, while the Mini-Gemini instruction-tuning dataset actually contains 1,511,341 samples, so roughly 23k samples appear to be missing (see the check after this list).
- Could you provide the exact JSON datasets used, so the reproduction is faithful?
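For reference, the arithmetic behind the missing-samples estimate (a minimal sketch; the effective instruction-tuning batch size of 128 is my assumption based on the settings above):

```python
# Consistency check between trainer_state.json and the instruction-tuning dataset size.
global_steps = 11_627          # from trainer_state.json of the released checkpoint
effective_batch_size = 128     # assumed instruction-tuning batch size
dataset_size = 1_511_341       # size of the Mini-Gemini instruction-tuning JSON

seen_samples = global_steps * effective_batch_size
print(seen_samples)                 # 1,488,256 samples implied by the step count
print(dataset_size - seen_samples)  # 23,085 samples unaccounted for
```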