Accuracy scores on the DrawEduMath dataset.
| # | Model | Date | Teacher QA | Synthetic QA |
|---|-------|------|------------|--------------|
| 1 | Gemini Pro 2.5 Preview | 2025-04-07 | 0.762 | 0.847 |
| 3 | GPT 4.5 Preview | 2025-04-04 | 0.730 | 0.839 |
| 4 | GPT 4.1 | 2025-04-19 | 0.723 | 0.823 |
| 5 | OpenAI o4-mini | 2025-04-18 | 0.721 | 0.830 |
| 6 | Llama 4 Scout | 2025-04-18 | 0.713 | 0.736 |
| 7 | Claude 3.7 Sonnet | 2025-03-05 | 0.700 | 0.779 |
| 8 | Gemini Flash 2.0 | 2025-03-11 | 0.696 | 0.797 |
| 9 | Llama 4 Maverick (FP8) | 2025-04-18 | 0.677 | 0.749 |
| 10 | Qwen 2.5 VL 72B Instruct | 2025-03-21 | 0.658 | 0.788 |
| 11 | Claude 3.5 Sonnet | 2024-10-15 | 0.657 | 0.715 |
| 12 | Qwen VL Max | 2025-03-12 | 0.644 | 0.754 |
| 13 | Gemma 3 27B | 2025-03-26 | 0.633 | 0.706 |
| 14 | GPT-4o | 2024-10-15 | 0.628 | 0.722 |
| 15 | Phi 4 Multimodal Instruct | 2025-04-05 | 0.548 | 0.595 |
| 16 | Gemini 1.5 Pro | 2024-10-11 | 0.490 | 0.646 |
| 17 | Llama 3.2-11B V | 2024-10-15 | 0.296 | 0.388 |
Leaderboard scores are computed by comparing each VLM's answers against the gold answers, with a Mixtral 8x22B model judging whether the two are similar.
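For illustration, here is a minimal Python sketch of this style of LLM-as-judge scoring. The prompt wording, the `call_mixtral` stub, and the yes/no verdict format are assumptions made for the sketch, not the actual evaluation code behind the leaderboard.

```python
# Minimal sketch of LLM-as-judge similarity scoring (assumptions noted above).

JUDGE_PROMPT = """You are grading a model's answer against a gold answer.
Question: {question}
Gold answer: {gold}
Model answer: {prediction}
Does the model answer convey the same meaning as the gold answer? Reply "yes" or "no"."""


def call_mixtral(prompt: str) -> str:
    """Hypothetical stub: send `prompt` to whatever endpoint serves Mixtral 8x22B
    and return its text reply."""
    raise NotImplementedError


def judge_answer(question: str, gold: str, prediction: str) -> bool:
    """Ask the judge model whether the prediction matches the gold answer."""
    verdict = call_mixtral(
        JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction)
    )
    return verdict.strip().lower().startswith("yes")


def accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose prediction the judge accepts as matching."""
    hits = sum(
        judge_answer(ex["question"], ex["gold"], ex["prediction"]) for ex in examples
    )
    return hits / len(examples)
```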
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.