| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| CHAIR evaluation on the MSCOCO dataset with InstructBLIP and MiniGPT-4. Lower CHAIR scores indicate fewer hallucinations. | ||||
| CHAIR (MSCOCO) | CHAIR_S | 32.3 | 8.5 | -23.8 |
| CHAIR (MSCOCO) | CHAIR_I | 12.8 | 3.5 | -9.3 |
| CHAIR (MSCOCO) | CHAIR_S | 29.2 | 21.6 | -7.6 |
| POPE benchmark evaluation (Random split), measuring accuracy on object hallucination questions. | ||||
| POPE (Random) | Accuracy | 88.57 | 91.13 | +2.56 |
| POPE (Random) | Accuracy | 86.9 | 89.2 | +2.3 |
| GPT-4V evaluation of open-ended generation quality. | ||||
| LLaVA-Bench | Score (1-10) | 5.6 | 7.2 | +1.6 |
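
To make the CHAIR rows easier to read, here is a minimal sketch of how the two CHAIR variants are commonly defined: CHAIR_S is the share of captions that mention at least one object absent from the image, and CHAIR_I is the share of mentioned object instances that are absent. The function name `chair_scores`, the input layout (one pair of object sets per caption), and the toy data are assumptions for illustration only, not the paper's evaluation code. Real evaluations also need an object extractor and synonym mapping, which this sketch leaves out.

```python
from typing import Iterable, Set, Tuple


def chair_scores(samples: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (CHAIR_S, CHAIR_I) as fractions in [0, 1].

    Each sample is a hypothetical pair (mentioned_objects, ground_truth_objects)
    for one generated caption. Objects are assumed to be already extracted and
    normalized to a shared vocabulary.
    """
    captions = 0
    hallucinated_captions = 0
    mentions = 0
    hallucinated_mentions = 0

    for mentioned, truth in samples:
        # Objects named in the caption but not present in the image annotations.
        hallucinated = mentioned - truth
        captions += 1
        hallucinated_captions += 1 if hallucinated else 0
        mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)

    # Guard against empty input so the sketch never divides by zero.
    chair_s = hallucinated_captions / captions if captions else 0.0
    chair_i = hallucinated_mentions / mentions if mentions else 0.0
    return chair_s, chair_i


# Toy data (made up): the second caption mentions a "sofa" that is not in the image.
toy = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "person"}),
    ({"cat", "sofa"}, {"cat"}),
]
chair_s, chair_i = chair_scores(toy)
print(f"CHAIR_S = {chair_s:.2f}, CHAIR_I = {chair_i:.2f}")
```

The function returns fractions. Scores such as those in the table are usually reported multiplied by 100, which is assumed here as well.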