DrawEduMath

Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images

¹Worcester Polytechnic Institute, ²University of California, Berkeley,
³Insource Services Inc, ⁴Teaching Lab, ⁵Allen Institute for AI
NeurIPS 2024, Math AI Workshop
DrawEduMath dataset creation

DrawEduMath is a dataset of images of students' handwritten responses to math problems, each paired with a teacher's description. Each image in the dataset concatenates a math problem on the left with a student response on the right. Teachers describe the student's response to the problem, and then a model (GPT-4o is shown here) writes QA pairs extracted from facets of that description.

Introduction

In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems.

Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, as well as 4,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. We also find that synthetic QAs, though imperfect, can yield model rankings similar to those from teacher-written QAs. We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.

Leaderboard on DrawEduMath

Accuracy scores on the DrawEduMath dataset.

#  Model              Date        Synthetic QA  Teacher QA
1  GPT-4o             2024-10-15  0.722         0.628
2  Claude 3.5 Sonnet  2024-10-15  0.715         0.657
3  Gemini 1.5 Pro     2024-10-11  0.646         0.490
4  Llama 3.2-11B V    2024-10-15  0.388         0.296

Leaderboard scores are based on judgments from the Mixtral 8x22B model.
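As a rough illustration of how an accuracy score can be derived from judge outputs, the sketch below assumes each model answer has already been given a binary correct/incorrect verdict by the judge model. The `Judgement` record, its field names, and the sample verdicts are hypothetical; they are not the leaderboard's actual result format or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Judgement:
    """One judged answer. Field names are hypothetical."""
    qa_type: str      # "synthetic" or "teacher"
    is_correct: bool  # binary verdict assumed to come from the judge model


def accuracy_by_qa_type(judgements):
    """Return the fraction of answers judged correct for each QA type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for j in judgements:
        total[j.qa_type] += 1
        correct[j.qa_type] += int(j.is_correct)
    return {qa_type: correct[qa_type] / total[qa_type] for qa_type in total}


# Made-up verdicts, for illustration only.
sample = [
    Judgement("synthetic", True),
    Judgement("synthetic", True),
    Judgement("synthetic", False),
    Judgement("synthetic", True),
    Judgement("teacher", True),
    Judgement("teacher", False),
]
scores = accuracy_by_qa_type(sample)
print(scores)
```

Keeping the two QA types separate mirrors the two score columns in the leaderboard, so a model's synthetic and teacher accuracies can be reported side by side.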

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
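One way to check how closely the Synthetic QA and Teacher QA columns agree on model ordering is a pairwise rank correlation. The sketch below computes Kendall's tau over the four leaderboard rows above; the choice of Kendall's tau is an assumption made for illustration and is not necessarily the comparison used by the authors.

```python
from itertools import combinations

# Accuracy scores copied from the leaderboard table above.
synthetic_qa = {
    "GPT-4o": 0.722,
    "Claude 3.5 Sonnet": 0.715,
    "Gemini 1.5 Pro": 0.646,
    "Llama 3.2-11B V": 0.388,
}
teacher_qa = {
    "GPT-4o": 0.628,
    "Claude 3.5 Sonnet": 0.657,
    "Gemini 1.5 Pro": 0.490,
    "Llama 3.2-11B V": 0.296,
}


def kendall_tau(a, b):
    """Kendall's tau-a over the keys of `a` (assumes `b` has the same keys)."""
    pairs = list(combinations(sorted(a), 2))
    concordant = discordant = 0
    for m1, m2 in pairs:
        # A positive product means both columns order the pair the same way.
        product = (a[m1] - a[m2]) * (b[m1] - b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


tau = kendall_tau(synthetic_qa, teacher_qa)
print(f"Kendall's tau: {tau:.3f}")
```

With these four models, only the GPT-4o and Claude 3.5 Sonnet pair swaps order between the two columns, so five of the six model pairs are ranked the same way.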

DrawEduMath Dataset

Overview

DrawEduMath consists of 2,030 images of U.S.-based students’ handwritten math responses to 188 math problems spanning Grade 2 through high school. These images were initially collected on the ASSISTments online learning platform, where students receive feedback from teachers on assigned work. The problems that accompany each student response are drawn from three overlapping open educational resources (OER): Eureka Math, Open Up Resources, and Illustrative Math.

You can download the dataset from Hugging Face Datasets.

data-overview

Key data statistics pertaining to students' math images included in DrawEduMath.

data-composition

Key data statistics pertaining to the collection of teachers’ language for DrawEduMath. Word counts and text lengths are determined using whitespace-delimited tokens.
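A minimal sketch of whitespace-based counting, assuming that "whitespace-delimited tokens" means splitting on any run of spaces, tabs, or newlines. The function name and the sample description are hypothetical and are not taken from the dataset.

```python
def whitespace_word_count(text: str) -> int:
    """Count tokens separated by any run of whitespace."""
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


# Made-up teacher description, for illustration only.
description = "The student  shaded four\tthirds.\n"
print(whitespace_word_count(description))
```

Under this scheme, punctuation stays attached to the neighboring word, so "thirds." counts as a single token.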

Examples

Examples of teachers’ answers to a question about possible errors in students’ responses to math problems. All three hand-drawn student responses address the same math problem, shown on the left, which asks students to draw and shade units on fraction strips to show 4 thirds.


Statistics

Experiment Results

Results on Existing Vision Language Models

grade-lv

BibTeX

@inproceedings{baral2024drawedumath,
  author    = {Baral, Sami and Li, Lucy and Knight, Ryan and Ng, Alice and Soldaini, Luca and Heffernan, Neil and Lo, Kyle},
  title     = {DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images},
  booktitle = {The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24},
  year      = {2024}
}