Analysis · June 8, 2026 · 2 min read

MIT Dataset Exposes Why LLMs Fail at Collaborative Math

CrowdMath, a new dataset of 164 expert-annotated math discussions from MIT PRIMES–Art of Problem Solving, reveals a critical gap: models predict the next post with 83–88% accuracy but struggle to identify what each contribution actually does in a proof.

Our Take

LLMs can follow the local flow of collaborative problem-solving but cannot grasp the functional role of individual steps—a gap that separates pattern-matching from genuine mathematical reasoning.

Why it matters

Most benchmarks test isolated problems with clean answers. Real research is messy, iterative, and depends on recognizing flawed logic, partial progress, and proof completion in context. This dataset forces the field to measure what matters.

Do this week

Research teams: audit your math reasoning evals now to see if you're only testing answer accuracy rather than error detection and proof synthesis across multi-turn discussions.

A Dataset of Real Mathematical Collaboration

Researchers at MIT released CrowdMath, a curated dataset of 164 discussion threads from the MIT PRIMES–Art of Problem Solving collaborative research program spanning 2016 to 2025. Each thread traces the path from an open problem statement to a completed proof, with expert annotations marking the functional role of each post: partial progress, proof completion, erroneous reasoning, or error identification.
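
To make the annotation scheme concrete, here is a minimal sketch of how one annotated thread could be represented in Python. The class and field names (`Role`, `Post`, `Thread`, `author`, `text`, `problem`) are illustrative assumptions, not the dataset's actual schema; only the four role labels come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The four functional roles named in the article."""
    PARTIAL_PROGRESS = "partial_progress"
    PROOF_COMPLETION = "proof_completion"
    ERRONEOUS_REASONING = "erroneous_reasoning"
    ERROR_IDENTIFICATION = "error_identification"


@dataclass
class Post:
    author: str  # hypothetical field
    text: str    # hypothetical field
    role: Role   # expert annotation for this post


@dataclass
class Thread:
    problem: str  # the open problem statement (hypothetical field name)
    posts: list[Post] = field(default_factory=list)

    def role_counts(self) -> dict[Role, int]:
        """Count posts per role, reporting zero for roles that never occur."""
        counts = {role: 0 for role in Role}
        for post in self.posts:
            counts[post.role] += 1
        return counts


# Toy thread with invented content, for illustration only.
thread = Thread(
    problem="Toy problem statement",
    posts=[
        Post("a", "Lemma 1 holds for n = 2.", Role.PARTIAL_PROGRESS),
        Post("b", "So it holds for all n.", Role.ERRONEOUS_REASONING),
        Post("c", "The induction step is missing.", Role.ERROR_IDENTIFICATION),
    ],
)
```

Keeping the role as a closed enumeration rather than free text means a classifier's output can be checked against a fixed label set.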

The dataset stems from actual peer-reviewed work. Discussions on the AoPS platform have produced published results, which grounds CrowdMath in what mathematicians actually do rather than in synthetic problem sets.

Two Different Tasks, Only One Gets Solved

Six frontier models (tested but not named in the abstract) were benchmarked on two tasks. First: predicting the next post in a discussion. Here, models achieve 83–88% accuracy, suggesting they can follow the logical progression of a mathematical argument. That sounds promising until you examine the second task.

Post-role classification, which asks whether a contribution is partial progress, proof completion, erroneous reasoning, or error identification, yields a macro-F1 of only 0.42 for the best model. This is not a minor gap. It means state-of-the-art LLMs can often guess what comes next in a conversation but cannot reliably identify what each step contributes to the overall solution.
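
Macro-F1 is the unweighted mean of per-class F1 scores, so a model that handles a common role well but misses rarer ones scores low even when raw accuracy looks acceptable. The sketch below computes it from scratch for the four roles. The label strings and the toy predictions are invented for illustration and are not drawn from CrowdMath or from the benchmarked models.

```python
ROLES = ["partial_progress", "proof_completion",
         "erroneous_reasoning", "error_identification"]


def f1_for_class(gold, pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is zero (or undefined, scored as zero)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=ROLES):
    """Unweighted mean of per-class F1 over all labels."""
    return sum(f1_for_class(gold, pred, lab) for lab in labels) / len(labels)


def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data: seven partial-progress posts plus one post of each other role.
gold = ["partial_progress"] * 7 + ROLES[1:]
# A degenerate classifier that always predicts the majority role.
pred = ["partial_progress"] * 10

print(accuracy(gold, pred))            # 0.7
print(round(macro_f1(gold, pred), 3))  # 0.206
```

In this toy case the always-majority classifier is right 70% of the time yet earns a macro-F1 of about 0.21, because three of the four roles contribute an F1 of zero.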

This distinction matters because next-post prediction is a superficial task. A model can succeed by learning statistical patterns of how discussions typically flow. Understanding functional role requires genuine comprehension of mathematical significance. A post that says "this approach won't work because X" is functionally distinct from "here's a partial result" or "I've completed the proof," even if the surface language is similar.

What to Do About It

If you are building math reasoning systems, benchmarking only on well-specified problems with final answers misses the failure mode exposed here. Your model might score well on competition problems or step-by-step proof tasks while remaining unable to track the evolution of a real proof discussion, spot errors as they emerge, or recognize when a gap has been filled.

The dataset is publicly available. Use it to test whether your fine-tuned or prompted models can identify post roles, not just predict continuation. If they cannot, you have identified a real limitation in reasoning that no amount of scaling on closed-problem benchmarks will fix.
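
One way to run such a check is to tally which roles a model confuses with which. The following is a minimal sketch under stated assumptions: the `(context, post, gold_role)` input shape, the role strings, and the `keyword_baseline` placeholder are all hypothetical, and the examples are invented rather than taken from the dataset. Swap the placeholder for a call to your own model.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# (earlier posts in the thread, post to classify, gold role): assumed shape.
Example = Tuple[List[str], str, str]


def confusion_counts(
    examples: Iterable[Example],
    classify: Callable[[List[str], str], str],
) -> Counter:
    """Tally (gold_role, predicted_role) pairs over all examples."""
    counts: Counter = Counter()
    for context, post, gold in examples:
        counts[(gold, classify(context, post))] += 1
    return counts


def keyword_baseline(context: List[str], post: str) -> str:
    """Placeholder classifier; replace with a call to your own model."""
    lowered = post.lower()
    if "mistake" in lowered or "does not follow" in lowered:
        return "error_identification"
    if "completes the proof" in lowered:
        return "proof_completion"
    return "partial_progress"


# Invented examples, for illustration only.
examples = [
    ([], "The case n = 2 follows by direct computation.",
     "partial_progress"),
    (["earlier post"], "Hence every n works.",
     "erroneous_reasoning"),
    (["earlier post", "earlier post"],
     "There is a mistake: the bound fails for n = 3.",
     "error_identification"),
    (["earlier post"], "This completes the proof.",
     "proof_completion"),
]

counts = confusion_counts(examples, keyword_baseline)
for (gold, predicted), n in sorted(counts.items()):
    print(f"{gold:>22} -> {predicted:<22} {n}")
```

Off-diagonal pairs, such as erroneous reasoning labeled as partial progress in this toy run, show where a model reads the surface of a post but misses its function.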
