Accuracy scores on the DrawEduMath dataset.
| # | Model | Date | Teacher QA | Synthetic QA |
|---|-------|------|------------|--------------|
| 1 | Gemini Pro 2.5 Preview | 2025-04-07 | 0.762 | 0.847 |
| 3 | GPT 4.5 Preview | 2025-04-04 | 0.730 | 0.839 |
| 4 | GPT 4.1 | 2025-04-19 | 0.723 | 0.823 |
| 5 | OpenAI o4-mini | 2025-04-18 | 0.721 | 0.830 |
| 6 | Llama 4 Scout | 2025-04-18 | 0.713 | 0.736 |
| 7 | Claude 3.7 Sonnet | 2025-03-05 | 0.700 | 0.779 |
| 8 | Gemini Flash 2.0 | 2025-03-11 | 0.696 | 0.797 |
| 9 | Llama 4 Maverick (FP8) | 2025-04-18 | 0.677 | 0.749 |
| 10 | Qwen 2.5 VL 72B Instruct | 2025-03-21 | 0.658 | 0.788 |
| 11 | Claude 3.5 Sonnet | 2024-10-15 | 0.657 | 0.715 |
| 12 | Qwen VL Max | 2025-03-12 | 0.644 | 0.754 |
| 13 | Gemma 3 27B | 2025-03-26 | 0.633 | 0.706 |
| 14 | GPT-4o | 2024-10-15 | 0.628 | 0.722 |
| 15 | Phi 4 Multimodal Instruct | 2025-04-05 | 0.548 | 0.595 |
| 16 | Gemini 1.5 Pro | 2024-10-11 | 0.490 | 0.646 |
| 17 | Llama 3.2-11B V | 2024-10-15 | 0.296 | 0.388 |
Leaderboard scores are computed by comparing each VLM's answers against the gold answers, with a Mixtral 8x22B model judging whether the two are similar.
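For illustration, here is a minimal Python sketch of this style of LLM-as-judge scoring. The prompt wording, the `call_mixtral` stub, and the yes/no verdict format are assumptions made for the sketch, not the actual evaluation code behind the leaderboard.

```python
# Minimal sketch of LLM-as-judge similarity scoring (assumptions noted above).

JUDGE_PROMPT = """You are grading a model's answer against a gold answer.
Question: {question}
Gold answer: {gold}
Model answer: {prediction}
Does the model answer convey the same meaning as the gold answer? Reply "yes" or "no"."""


def call_mixtral(prompt: str) -> str:
    """Hypothetical stub: send `prompt` to whatever endpoint serves Mixtral 8x22B
    and return its text reply."""
    raise NotImplementedError


def judge_answer(question: str, gold: str, prediction: str) -> bool:
    """Ask the judge model whether the prediction matches the gold answer."""
    verdict = call_mixtral(
        JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction)
    )
    return verdict.strip().lower().startswith("yes")


def accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose prediction the judge accepts as matching."""
    hits = sum(
        judge_answer(ex["question"], ex["gold"], ex["prediction"]) for ex in examples
    )
    return hits / len(examples)
```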
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.