Our Take
LLMs can follow the local flow of collaborative problem-solving but cannot grasp the functional role of individual steps—a gap that separates pattern-matching from genuine mathematical reasoning.
Why it matters
Most benchmarks test isolated problems with clean answers. Real research is messy, iterative, and depends on recognizing flawed logic, partial progress, and proof completion in context. This dataset forces the field to measure what matters.
Do this week
Research teams: audit your math reasoning evals now to see if you're only testing answer accuracy rather than error detection and proof synthesis across multi-turn discussions.
A Dataset of Real Mathematical Collaboration
Researchers at MIT released CrowdMath, a curated dataset of 164 discussion threads from the MIT PRIMES–Art of Problem Solving collaborative research program spanning 2016 to 2025. Each thread traces the path from an open problem statement to a completed proof, with expert annotations marking the functional role of each post: partial progress, proof completion, erroneous reasoning, or error identification.
The dataset stems from actual peer-reviewed work. Discussions on the AoPS platform have produced published results, making CrowdMath grounded in what mathematicians actually do rather than synthetic problem sets.
Two Different Tasks, Only One Gets Solved
Six frontier models (tested but not named in the abstract) were benchmarked on two tasks. First: predicting the next post in a discussion. Here, models achieve 83–88% accuracy, suggesting they can follow the logical progression of a mathematical argument. That sounds promising until you examine the second task.
Post-role classification—identifying whether a contribution is partial progress, a complete proof, an error, or error detection—yields only 0.42 macro-F1 for the best model. This is not a minor gap. It means state-of-the-art LLMs can often guess what comes next in a conversation but cannot reliably understand what each step contributes to the overall solution.
This distinction matters because next-post prediction is a superficial task. A model can succeed by learning statistical patterns of how discussions typically flow. Understanding functional role requires genuine comprehension of mathematical significance. A post that says "this approach won't work because X" is functionally distinct from "here's a partial result" or "I've completed the proof," even if the surface language is similar.
What to Do About It
If you are building math reasoning systems, benchmarking only on well-specified problems with final answers misses the failure mode exposed here. Your model might score well on competition problems or step-by-step proof tasks while remaining unable to track the evolution of a real proof discussion, spot errors as they emerge, or recognize when a gap has been filled.
The dataset is publicly available. Use it to test whether your fine-tuned or prompted models can identify post roles, not just predict continuation. If they cannot, you have identified a real limitation in reasoning that no amount of scaling on closed-problem benchmarks will fix.