📁 Benchmark Data | 📖 arXiv | 🛠️ Evaluation Framework
| 📂 Dataset | 📝 Description |
|---|---|
| FinNI-eval | Evaluation set for the FinNI subtask within the FinTagging benchmark. |
| FinCL-eval | Evaluation set for the FinCL subtask within the FinTagging benchmark. |
| FinTagging_Original | Original benchmark dataset without preprocessing, suitable for custom research. Annotated data (benchmark_ground_truth_pipeline.json) is provided in the "annotation" folder. |
| FinTagging_BIO | BIO-format dataset tailored for token-level tagging with BERT-series models (see the loading sketch below). |
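For quick orientation, here is a minimal loading sketch for FinTagging_BIO. The file name and the CoNLL-style layout (one tab-separated token/tag pair per line, blank lines between sentences) are assumptions about the released format, not a guaranteed schema; check the actual files before use.

```python
from transformers import AutoTokenizer  # pip install transformers

def read_bio(path):
    """Parse CoNLL-style BIO data: one "token<TAB>tag" pair per line,
    with blank lines separating sentences (assumed layout)."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            tok, tag = line.split("\t")
            tokens.append(tok)
            tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

# Hypothetical path; substitute the actual file from FinTagging_BIO.
sents = read_bio("FinTagging_BIO/test.txt")
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
enc = tokenizer(sents[0][0], is_split_into_words=True,
                truncation=True, return_tensors="pt")
```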
We benchmarked 10 cutting-edge LLMs and 3 advanced PTMs on FinTagging:
- 🌐 GPT-4o — OpenAI’s multimodal flagship model with structured output support.
- 🚀 DeepSeek-V3 — A Mixture-of-Experts (MoE) model with efficient inference via Multi-head Latent Attention (MLA).
- 🧠 Qwen2.5 Series — Multilingual models optimized for reasoning, coding, and math. Here, we assessed the 14B, 1.5B, and 0.5B Instruct models.
- 🦙 Llama-3 Series — Meta’s open-source instruction-tuned models for long context. Here, we assessed the Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct models.
- 🧭 DeepSeek-R1 Series — RL-tuned first-gen reasoning models with zero-shot strength. Here, we only assessed the DeepSeek-R1-Distill-Qwen-32B model.
- 🧪 Gemma-2 Model — Google’s instruction-tuned model family with open weights. Here, we only assessed the gemma-2-27b-it model.
- 💎 Fino1-8B — Our in-house financial LLM with strong reasoning capability.
- 🏛️ BERT-large — The classic transformer encoder for language understanding.
- 📉 FinBERT — A financial domain-tuned BERT for sentiment analysis.
- 🧾 SECBERT — A BERT model pre-trained on SEC filings for financial disclosure tasks.
- Local Model Inference: Conducted via FinBen (vLLM framework).
- We provide task-specific evaluation scripts through our forked version of the FinBen framework, available at: https://github.com/Yan2266336/FinBen.
- For the FinNI task, you can directly execute the provided script to evaluate a variety of LLMs, including both local and API-based models.
- For the FinCL task, first run the retrieval script from the repository to obtain US-GAAP candidate concepts. Then use our provided prompts to construct instruction-style inputs, and apply the reranking method implemented in the forked FinBen to identify the most appropriate US-GAAP concept (see the sketch after this list).
- Note: Running the retrieval script requires a local installation of Elasticsearch. We provide our index document on Google Drive: https://drive.google.com/file/d/1cyMONjP9WdHtD8-WGezmgh_LNhbY3qtR/view?usp=drive_link. However, you can construct your own index instead of using ours.
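As a concrete illustration of the FinCL retrieve-then-rerank pipeline above, the sketch below queries a local Elasticsearch index for candidate US-GAAP concepts and assembles an instruction-style rerank prompt. The index name (`us_gaap_concepts`), field names, and prompt wording are illustrative assumptions, not the exact ones used in our scripts; see the forked FinBen repository for the authoritative implementation.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_candidates(fact_description: str, k: int = 10) -> list[str]:
    """Full-text search over concept definitions. The index and field
    names ("us_gaap_concepts", "definition", "concept") are assumptions."""
    resp = es.search(
        index="us_gaap_concepts",
        query={"match": {"definition": fact_description}},
        size=k,
    )
    return [hit["_source"]["concept"] for hit in resp["hits"]["hits"]]

def build_rerank_prompt(fact: str, context: str, candidates: list[str]) -> str:
    """Instruction-style input for the LLM reranker (illustrative wording)."""
    lines = "\n".join(f"- {c}" for c in candidates)
    return (
        "Select the single US-GAAP concept that best matches the fact.\n"
        f"Fact: {fact}\nContext: {context}\nCandidates:\n{lines}"
    )

candidates = retrieve_candidates("net revenue recognized during the period")
prompt = build_rerank_prompt("$12.3 million", "Total net revenue was ...",
                             candidates)
```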
🥇 = best, 🥈 = second-best, 🥉 = third-best. A short sketch of how the Macro and Micro columns differ follows the table.
| Category | Models | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|---|
| Closed-source LLM | GPT-4o | 0.0764 🥈 | 0.0576 🥈 | 0.0508 🥈 | 0.0947 | 0.0788 | 0.0860 |
| Open-source LLMs | DeepSeek-V3 | 0.0813 🥇 | 0.0696 🥇 | 0.0582 🥇 | 0.1058 | 0.1217 🥉 | 0.1132 🥉 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.0482 🥉 | 0.0288 🥉 | 0.0266 🥉 | 0.0692 | 0.0223 | 0.0337 |
| | Qwen2.5-14B-Instruct | 0.0423 | 0.0256 | 0.0235 | 0.0197 | 0.0133 | 0.0159 |
| | gemma-2-27b-it | 0.0430 | 0.0273 | 0.0254 | 0.0519 | 0.0453 | 0.0483 |
| | Llama-3.1-8B-Instruct | 0.0287 | 0.0152 | 0.0137 | 0.0462 | 0.0154 | 0.0231 |
| | Llama-3.2-3B-Instruct | 0.0182 | 0.0109 | 0.0083 | 0.0151 | 0.0102 | 0.0121 |
| | Qwen2.5-1.5B-Instruct | 0.0180 | 0.0079 | 0.0069 | 0.0248 | 0.0060 | 0.0096 |
| | Qwen2.5-0.5B-Instruct | 0.0014 | 0.0003 | 0.0004 | 0.0047 | 0.0001 | 0.0002 |
| Financial LLM | Fino1-8B | 0.0299 | 0.0146 | 0.0140 | 0.0355 | 0.0133 | 0.0193 |
| Fine-tuned PLMs | BERT-large | 0.0135 | 0.0200 | 0.0126 | 0.1397 🥈 | 0.1145 🥈 | 0.1259 🥈 |
| | FinBERT | 0.0088 | 0.0143 | 0.0087 | 0.1293 🥉 | 0.0963 | 0.1104 |
| | SECBERT | 0.0308 | 0.0483 | 0.0331 | 0.2144 🥇 | 0.2146 🥇 | 0.2145 🥇 |
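The Macro columns average precision/recall/F1 over concept types, so rare concepts weigh as much as frequent ones; the Micro columns pool every individual decision before scoring, so frequent concepts dominate. This is why a model can rank very differently under the two averages. A toy illustration of the distinction (the labels below are made up, not drawn from the benchmark, and the benchmark's own scorer operates on matched fact/concept pairs rather than flat labels):

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold / predicted concept labels for a handful of facts.
gold = ["Revenue", "Revenue", "NetIncome", "Assets", "Assets", "Assets"]
pred = ["Revenue", "NetIncome", "NetIncome", "Assets", "Assets", "Revenue"]

# "macro" averages per-class P/R/F1; "micro" pools all decisions first.
for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average=avg, zero_division=0
    )
    print(f"{avg}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```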
If you find our benchmark useful, please cite:
```bibtex
@misc{wang2025fintaggingllmreadybenchmarkextracting,
      title={FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information},
      author={Yan Wang and Yang Ren and Lingfei Qian and Xueqing Peng and Keyi Wang and Yi Han and Dongji Feng and Xiao-Yang Liu and Jimin Huang and Qianqian Xie},
      year={2025},
      eprint={2505.20650},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20650},
}
```