MIT Dataset Exposes Why LLMs Fail at Collaborative Math

A Dataset of Real Mathematical Collaboration

Researchers at MIT released CrowdMath, a curated dataset of 164 discussion threads from the MIT PRIMES–Art of Problem Solving collaborative research program spanning 2016 to 2025. Each thread traces the path from an open problem statement to a completed proof, with expert annotations marking the functional role of each post: partial progress, proof completion, erroneous reasoning, or error identification.

The dataset stems from actual peer-reviewed work. Discussions on the AoPS platform have produced published results, which grounds CrowdMath in what mathematicians actually do rather than in synthetic problem sets.

Two Different Tasks, Only One Gets Solved

Six frontier models (tested but not named in the abstract) were benchmarked on two tasks. First: predicting the next post in a discussion. Here, models achieve 83–88% accuracy, suggesting they can follow the logical progression of a mathematical argument. That sounds promising until you examine the second task.

Post-role classification (identifying whether a contribution is partial progress, a complete proof, an error, or error detection) yields only 0.42 macro-F1 for the best model. This is not a minor gap. It means state-of-the-art LLMs can often guess what comes next in a conversation but cannot reliably tell what each step contributes to the overall solution.
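To see why a macro-F1 score can sit far below plain accuracy, here is a minimal sketch of the metric in pure Python. The role label strings and the toy gold/predicted lists are made up for illustration; they are not taken from CrowdMath, and the paper's exact scoring script may differ.

```python
from collections import Counter

# Hypothetical label strings for the four annotated roles (names are assumptions).
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]

def macro_f1(gold, pred, labels=ROLES):
    """Unweighted mean of per-class F1, so rare roles count as much as common ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted role p, but it was wrong
            fn[g] += 1  # gold role g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Convention used here: a class with no support and no predictions scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Toy, made-up data: a classifier that always answers the majority role.
gold = ["partial_progress"] * 8 + ["proof_completion", "error_identification"]
pred = ["partial_progress"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case the always-majority classifier is right 80% of the time yet earns a macro-F1 of about 0.22, because three of the four roles contribute an F1 of zero. A low macro-F1 therefore points at the rarer roles, such as errors and error identification, rather than at the common case.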

This distinction matters because next-post prediction can be solved superficially. A model can succeed by learning statistical patterns of how discussions typically flow. Identifying a post's functional role requires understanding its mathematical significance. A post that says "this approach won't work because X" is functionally distinct from "here's a partial result" or "I've completed the proof," even if the surface language is similar.

What to Do About It

If you are building math reasoning systems, benchmarking only on well-specified problems with final answers misses the failure mode exposed here. Your model might score well on competition problems or step-by-step proof tasks while remaining unable to track the evolution of a real proof discussion, spot errors as they emerge, or recognize when a gap has been filled.

The dataset is publicly available. Use it to test whether your fine-tuned or prompted models can identify post roles, not just predict the next post. If they cannot, you have identified a real limitation in reasoning that scaling on closed-problem benchmarks alone is unlikely to fix.
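One way to run such a check is to walk each thread in order, ask the model for a role given the discussion so far, and tally gold versus predicted roles. The sketch below assumes a hypothetical record layout (each post is a dict with "text" and "role" keys) and a hypothetical `classify(history, text)` callable; the released dataset's actual schema may differ. The keyword baseline and the toy thread are invented stand-ins so the example runs without a model.

```python
from collections import Counter

def confusion_counts(threads, classify):
    """Count (gold_role, predicted_role) pairs over every annotated post."""
    counts = Counter()
    for thread in threads:
        history = []  # earlier posts in the thread, passed as context
        for post in thread:
            predicted = classify(history, post["text"])
            counts[(post["role"], predicted)] += 1
            history.append(post["text"])
    return counts

def keyword_baseline(history, text):
    """Stand-in for a model call: a crude surface-pattern classifier."""
    lowered = text.lower()
    if "mistake" in lowered or "fails" in lowered:
        return "error_identification"
    if "completes the proof" in lowered:
        return "proof_completion"
    return "partial_progress"

# Made-up toy thread, not real CrowdMath content.
toy_threads = [[
    {"text": "The bound holds for n = 1 and n = 2.", "role": "partial_progress"},
    {"text": "So by induction it holds for all n.", "role": "erroneous_reasoning"},
    {"text": "The inductive step fails when n is even.", "role": "error_identification"},
    {"text": "Handling even n separately completes the proof.", "role": "proof_completion"},
]]

counts = confusion_counts(toy_threads, keyword_baseline)
for (gold_role, predicted_role), n in sorted(counts.items()):
    print(f"{gold_role:>22} -> {predicted_role:<22} {n}")
```

The off-diagonal entries are the interesting ones. Here the surface-pattern baseline labels the flawed induction step as partial progress, which is the kind of confusion the article describes: the post reads like progress but functions as an error.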
