Our Take
Supervised fine-tuning on combined English and Spanish data outperforms prompting, but the shared-task metric (LLM-as-judge) is opaque, so practitioners need independent benchmarks before betting on this approach for production financial extraction.
Why it matters
Financial narrative processing demands precision that general-purpose LLMs often miss. This task-specific study shows that fine-tuning moves the needle, but the Spanish gap (third place) and the reliance on an LLM-as-judge metric leave real-world applicability unproven.
Do this week
Finance teams: Before adopting multilingual QA for earnings narratives, request reproducible benchmarks against your own causality annotations. Shared-task scores assigned by an LLM judge don't transfer cleanly to your data.
GPT-4.1 Mini Ties for the Top Score on English Financial Extraction
Team HSA_CORAL submitted a comparison of three modeling families to the FinCausal 2026 shared task, which asked systems to extract cause-effect relations from financial narratives in English and Spanish using extractive question answering. The three approaches were: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning.
GPT-4.1 Mini, fine-tuned on combined English and Spanish training data, tied for the highest score on the English subtask (4.8140 on the shared task's LLM-as-judge metric) and ranked third on Spanish (4.7753). The key finding: across all three modeling families, supervised fine-tuning delivered larger performance gains than prompt refinement or few-shot demonstrations alone.
Fine-Tuning Wins, But the Metric Itself Is the Risk
The result is consistent with an established pattern: task-specific adaptation beats zero-shot or few-shot prompting for financial NLP. Cause-effect extraction in earnings reports, SEC filings, and analyst narratives requires domain knowledge and precision that general instruction-tuned models often lack. Combining English and Spanish training data during fine-tuning appears to have enabled cross-lingual transfer, helping the model perform credibly on both languages despite the Spanish gap.
However, the shared task uses an LLM-as-judge metric. No independent human annotators or external benchmarks are mentioned. The winning scores therefore depend on how closely the judge model, which is not identified here, agrees with the extracted causal relations. That dependence is a known weakness of LLM-judged evaluation: a system can be optimized toward the judge's preferences without improving on the underlying task. A team deploying this approach in production would need human-annotated validation on its own financial narratives before trusting the numbers.
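One judge-independent check is to score extracted spans directly against human-annotated gold spans. The sketch below is not the shared task's metric. It assumes SQuAD-style exact match and token-overlap F1, and the record fields (lang, predicted, gold) and function names are hypothetical. Scores are averaged per language so that a gap like the English/Spanish one would be visible.

```python
from collections import Counter
import string

# Assumption: ASCII punctuation plus Spanish inverted marks is enough to normalize.
_PUNCT = string.punctuation + "¿¡"


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", _PUNCT))
    return cleaned.split()


def exact_match(predicted: str, gold: str) -> float:
    """1.0 if the normalized spans are identical, else 0.0."""
    return float(normalize(predicted) == normalize(gold))


def token_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two spans."""
    pred_tokens = normalize(predicted)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_by_language(examples: list[dict]) -> dict[str, dict[str, float]]:
    """Average exact match and F1 separately for each language code."""
    totals: dict[str, dict[str, float]] = {}
    for ex in examples:
        bucket = totals.setdefault(ex["lang"], {"n": 0, "em": 0.0, "f1": 0.0})
        bucket["n"] += 1
        bucket["em"] += exact_match(ex["predicted"], ex["gold"])
        bucket["f1"] += token_f1(ex["predicted"], ex["gold"])
    return {
        lang: {"n": b["n"], "em": b["em"] / b["n"], "f1": b["f1"] / b["n"]}
        for lang, b in totals.items()
    }


if __name__ == "__main__":
    # Toy records for illustration only; replace with your own annotations.
    toy = [
        {"lang": "en", "predicted": "rising interest rates", "gold": "Rising interest rates."},
        {"lang": "es", "predicted": "tipos", "gold": "la subida de tipos"},
    ]
    print(score_by_language(toy))
```

Token overlap rewards partially correct spans, which matters for extractive cause and effect answers where boundaries are often debatable. It says nothing about whether the causal link itself is right, so it complements human review rather than replacing it.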
Evaluate on Your Own Data Before Deploying
If you are responsible for extracting causality from financial text, the takeaway is that prompting alone is unlikely to be enough, and fine-tuning is probably worth the engineering cost. But shared-task rankings are a poor proxy for production performance. The gap between English (tied for first) and Spanish (third place) also signals that multilingual transfer is not automatic; performance on underrepresented languages may lag even after fine-tuning. Test on your own annotated dataset, measure how closely any LLM judge you rely on agrees with human annotators, and do not assume that a shared-task win transfers to your domain or your financial narrative corpus.
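A minimal sketch of the judge-versus-human check, assuming you have a numeric judge score and a numeric human rating for the same set of extractions. It computes Spearman rank correlation in plain Python. The choice of Spearman and all function names are assumptions for illustration, not part of the shared task.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; undefined if either list is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation between judge scores and human ratings."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("both raters must score the same items")
    if len(judge_scores) < 2:
        raise ValueError("need at least two scored items")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Toy scores for illustration only; substitute your own paired ratings.
    judge = [5, 4, 4, 2, 1]
    human = [5, 3, 4, 2, 2]
    print(f"Spearman correlation: {spearman(judge, human):.3f}")
```

A low correlation on your own narratives suggests the judge is rewarding something other than what your annotators consider a correct causal extraction. Running the check separately for English and Spanish items would show whether the judge is less reliable in one language.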