Accuracy scores on the DrawEduMath dataset.
# | Model | Date | Synthetic QA | Teacher QA |
1 | Claude 3.7 Sonnet | 2025-03-05 | 0.779 | 0.700 |
2 | Gemini Flash 2.0 | 2025-03-11 | 0.797 | 0.696 |
3 | Claude 3.5 Sonnet | 2024-10-15 | 0.715 | 0.657 |
4 | Qwen 2.5 VL 72B Instruct | 2025-03-21 | 0.788 | 0.658 |
5 | Qwen VL Max | 2025-03-12 | 0.754 | 0.644 |
6 | Gemma 3 27B | 2025-03-26 | 0.706 | 0.633 |
7 | GPT-4o | 2024-10-15 | 0.722 | 0.628 |
8 | Gemini 1.5 Pro | 2024-10-11 | 0.646 | 0.490 |
9 | Llama 3.2-11B V | 2024-10-15 | 0.388 | 0.296 |
The leaderboard scores are based on judgements, made by a Mixtral 8x22B model, of how similar each VLM's answers are to the gold answers.
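As a rough illustration of how such judgements could be turned into an accuracy score, the sketch below assumes the judge model emits one binary verdict per answer (similar to the gold answer or not) and that accuracy is the fraction of positive verdicts. The binary verdict format, the function name `accuracy`, and the example data are all assumptions for illustration; the leaderboard does not state its exact aggregation.

```python
from typing import Iterable


def accuracy(judgements: Iterable[bool]) -> float:
    """Return the fraction of answers judged similar to the gold answer.

    Assumption: each judgement is a binary verdict from the judge model.
    """
    verdicts = list(judgements)
    if not verdicts:
        raise ValueError("no judgements to score")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for one model on four questions (not real data).
example_verdicts = [True, True, False, True]
print(f"{accuracy(example_verdicts):.3f}")  # prints 0.750
```

Under this assumption, a table entry such as 0.700 would mean the judge accepted 70% of that model's answers on the corresponding question set.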
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.