GPT-4.1 Mini Tops English Test for Financial Cause-Effect QA

GPT-4.1 Mini Edges Out Competitors on English Financial Extraction

Team HSA_CORAL submitted a comparison of three modeling families to the FinCausal 2026 shared task, which asked systems to extract cause-effect relations from financial narratives in English and Spanish using extractive question answering. The three approaches were: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning.

GPT-4.1 Mini, fine-tuned on combined English and Spanish training data, tied for the highest score on the English subtask (4.8140 on the shared task's LLM-as-judge metric) and ranked third on Spanish (4.7753). The key finding: across all three modeling families, supervised fine-tuning delivered larger performance gains than prompt refinement or few-shot demonstrations alone.

Fine-Tuning Wins, But the Metric Itself Is the Risk

The result is consistent with an established pattern: task-specific adaptation beats zero-shot or few-shot prompting for financial NLP. Cause-effect extraction in earnings reports, SEC filings, and analyst narratives requires domain knowledge and precision that general instruction-tuned models often lack. Combining English and Spanish training data during fine-tuning enabled cross-lingual transfer, helping the model perform credibly in both languages, although its Spanish score (4.7753, third place) trails its English score (4.8140, tied first).

However, the shared task scores systems with an "LLM-as-judge" metric, and no independent human annotators or external benchmarks are mentioned. The winning scores therefore depend on whether the judge model (GPT-4 or whichever model the organizers used) agrees with the extracted causal relations. That circularity is a known weakness of this kind of evaluation: a system can be optimized toward the judge without getting better at the underlying task. A team deploying this approach in production would need human-annotated validation on its own financial narratives before trusting the numbers.

Evaluate on Your Own Data Before Deploying

If you are responsible for extracting causality from financial text, the takeaway is clear: prompting alone is insufficient, and fine-tuning is worth the engineering cost. But shared-task rankings are a poor proxy for production performance. The gap between English (tied first) and Spanish (third place) also signals that multilingual transfer is not automatic; performance on underrepresented languages may lag significantly even after fine-tuning. Test on your own annotated dataset, measure how closely your judge's scores agree with human annotators, and do not assume that a shared-task win transfers to your domain or your financial narrative corpus.
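One way to check a judge against human annotators is to score the same extractions both ways and compare. The sketch below is illustrative only and is not the shared task's evaluation code: the function names, the `tolerance` parameter, and the assumption that both the judge and the annotators produce numeric scores on the same scale are all hypothetical choices made for this example. It reports a Spearman rank correlation, the share of items where the two scores fall within a tolerance of each other, and the judge's average bias.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def judge_vs_human(judge_scores, human_scores, tolerance=0.5):
    """Compare judge and human scores for the same items (hypothetical helper)."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    pairs = list(zip(judge_scores, human_scores))
    return {
        # Do judge and humans order the extractions the same way?
        "spearman": spearman(judge_scores, human_scores),
        # Share of items where the two scores are within `tolerance`.
        "within_tolerance": mean(1.0 if abs(j - h) <= tolerance else 0.0 for j, h in pairs),
        # Positive values mean the judge scores higher than humans on average.
        "mean_gap": mean(j - h for j, h in pairs),
    }


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the shared task.
    judge = [5, 5, 4, 3, 5, 2]
    human = [5, 4, 4, 2, 3, 2]
    print(judge_vs_human(judge, human))
```

A low rank correlation or a large positive `mean_gap` on your own corpus would suggest the judge is rewarding something human reviewers do not, which is the failure mode to rule out before relying on judge-based scores.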
