📁 Benchmark Data | 📖 arXiv | 🛠️ Evaluation Framework
| 📂 Dataset | 📝 Description |
|---|---|
| FinNI-eval | Evaluation set for the FinNI subtask within the FinTagging benchmark. |
| FinCL-eval | Evaluation set for the FinCL subtask within the FinTagging benchmark. |
| FinTagging_Original | Original benchmark dataset without preprocessing, suitable for custom research. Annotated data (benchmark_ground_truth_pipeline.json) is provided in the "annotation" folder. |
| FinTagging_BIO | BIO-format dataset tailored for token-level tagging with BERT-series models (see the loading sketch below). |
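For quick orientation, here is a minimal loading sketch for FinTagging_BIO. The file name and the CoNLL-style layout (one tab-separated token/tag pair per line, blank lines between sentences) are assumptions about the released format, not a guaranteed schema; check the actual files before use.

```python
from transformers import AutoTokenizer  # pip install transformers

def read_bio(path):
    """Parse CoNLL-style BIO data: one "token<TAB>tag" pair per line,
    with blank lines separating sentences (assumed layout)."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            tok, tag = line.split("\t")
            tokens.append(tok)
            tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

# Hypothetical path; substitute the actual file from FinTagging_BIO.
sents = read_bio("FinTagging_BIO/test.txt")
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
enc = tokenizer(sents[0][0], is_split_into_words=True,
                truncation=True, return_tensors="pt")
```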
We benchmarked 10 cutting-edge LLMs and 3 advanced PTMs on FinTagging:
- 🌐 GPT-4o — OpenAI’s multimodal flagship model with structured output support.
- 🚀 DeepSeek-V3 — A Mixture-of-Experts (MoE) model with efficient inference via Multi-head Latent Attention (MLA).
- 🧠 Qwen2.5 Series — Multilingual models optimized for reasoning, coding, and math. Here, we assessed the 14B, 1.5B, and 0.5B Instruct models.
- 🦙 Llama-3 Series — Meta’s open-source instruction-tuned models for long context. Here, we assessed the Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct models.
- 🧭 DeepSeek-R1 Series — RL-tuned first-gen reasoning models with zero-shot strength. Here, we only assessed the DeepSeek-R1-Distill-Qwen-32B model.
- 🧪 Gemma-2 Model — Google’s instruction-tuned model family with open weights. Here, we only assessed the gemma-2-27b-it model.
- 💎 Fino1-8B — Our in-house financial LLM with strong reasoning capability.
- 🏛️ BERT-large — The classic transformer encoder for language understanding.
- 📉 FinBERT — A financial domain-tuned BERT for sentiment analysis.
- 🧾 SECBERT — A BERT model pre-trained on SEC filings for financial disclosure tasks.
- Local Model Inference: Conducted via FinBen (vLLM framework).
- We provide task-specific evaluation scripts through our forked version of the FinBen framework, available at: https://github.com/Yan2266336/FinBen.
- For the FinNI task, you can directly execute the provided script to evaluate a variety of LLMs, including both local and API-based models.
- For the FinCL task, first run the retrieval script from the repository to obtain US-GAAP candidate concepts. Then use our provided prompts to construct instruction-style inputs, and apply the reranking method implemented in the forked FinBen to identify the most appropriate US-GAAP concept (see the sketch after this list).
- Note: Running the retrieval script requires a local installation of Elasticsearch. We provide our index document on Google Drive: https://drive.google.com/file/d/1cyMONjP9WdHtD8-WGezmgh_LNhbY3qtR/view?usp=drive_link. However, you can construct your own index instead of using ours.
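As a concrete illustration of the FinCL retrieve-then-rerank pipeline above, the sketch below queries a local Elasticsearch index for candidate US-GAAP concepts and assembles an instruction-style rerank prompt. The index name (`us_gaap_concepts`), field names, and prompt wording are illustrative assumptions, not the exact ones used in our scripts; see the forked FinBen repository for the authoritative implementation.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_candidates(fact_description: str, k: int = 10) -> list[str]:
    """Full-text search over concept definitions. The index and field
    names ("us_gaap_concepts", "definition", "concept") are assumptions."""
    resp = es.search(
        index="us_gaap_concepts",
        query={"match": {"definition": fact_description}},
        size=k,
    )
    return [hit["_source"]["concept"] for hit in resp["hits"]["hits"]]

def build_rerank_prompt(fact: str, context: str, candidates: list[str]) -> str:
    """Instruction-style input for the LLM reranker (illustrative wording)."""
    lines = "\n".join(f"- {c}" for c in candidates)
    return (
        "Select the single US-GAAP concept that best matches the fact.\n"
        f"Fact: {fact}\nContext: {context}\nCandidates:\n{lines}"
    )

candidates = retrieve_candidates("net revenue recognized during the period")
prompt = build_rerank_prompt("$12.3 million", "Total net revenue was ...",
                             candidates)
```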
🥇 = best, 🥈 = second-best, 🥉 = third-best. A short sketch of how the Macro and Micro columns differ follows the table.
| Category | Models | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|---|
| Closed-source LLM | GPT-4o | 0.0764 🥈 | 0.0576 🥈 | 0.0508 🥈 | 0.0947 | 0.0788 | 0.0860 |
| Open-source LLMs | DeepSeek-V3 | 0.0813 🥇 | 0.0696 🥇 | 0.0582 🥇 | 0.1058 | 0.1217 🥉 | 0.1132 🥉 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.0482 🥉 | 0.0288 🥉 | 0.0266 🥉 | 0.0692 | 0.0223 | 0.0337 |
| | Qwen2.5-14B-Instruct | 0.0423 | 0.0256 | 0.0235 | 0.0197 | 0.0133 | 0.0159 |
| | gemma-2-27b-it | 0.0430 | 0.0273 | 0.0254 | 0.0519 | 0.0453 | 0.0483 |
| | Llama-3.1-8B-Instruct | 0.0287 | 0.0152 | 0.0137 | 0.0462 | 0.0154 | 0.0231 |
| | Llama-3.2-3B-Instruct | 0.0182 | 0.0109 | 0.0083 | 0.0151 | 0.0102 | 0.0121 |
| | Qwen2.5-1.5B-Instruct | 0.0180 | 0.0079 | 0.0069 | 0.0248 | 0.0060 | 0.0096 |
| | Qwen2.5-0.5B-Instruct | 0.0014 | 0.0003 | 0.0004 | 0.0047 | 0.0001 | 0.0002 |
| Financial LLM | Fino1-8B | 0.0299 | 0.0146 | 0.0140 | 0.0355 | 0.0133 | 0.0193 |
| Fine-tuned PLMs | BERT-large | 0.0135 | 0.0200 | 0.0126 | 0.1397 🥈 | 0.1145 🥈 | 0.1259 🥈 |
| | FinBERT | 0.0088 | 0.0143 | 0.0087 | 0.1293 🥉 | 0.0963 | 0.1104 |
| | SECBERT | 0.0308 | 0.0483 | 0.0331 | 0.2144 🥇 | 0.2146 🥇 | 0.2145 🥇 |
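The Macro columns average precision/recall/F1 over concept types, so rare concepts weigh as much as frequent ones; the Micro columns pool every individual decision before scoring, so frequent concepts dominate. This is why a model can rank very differently under the two averages. A toy illustration of the distinction (the labels below are made up, not drawn from the benchmark, and the benchmark's own scorer operates on matched fact/concept pairs rather than flat labels):

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold / predicted concept labels for a handful of facts.
gold = ["Revenue", "Revenue", "NetIncome", "Assets", "Assets", "Assets"]
pred = ["Revenue", "NetIncome", "NetIncome", "Assets", "Assets", "Revenue"]

# "macro" averages per-class P/R/F1; "micro" pools all decisions first.
for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average=avg, zero_division=0
    )
    print(f"{avg}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```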
If you find our benchmark useful, please cite:
```bibtex
@misc{wang2025fintaggingllmreadybenchmarkextracting,
      title={FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information},
      author={Yan Wang and Yang Ren and Lingfei Qian and Xueqing Peng and Keyi Wang and Yi Han and Dongji Feng and Xiao-Yang Liu and Jimin Huang and Qianqian Xie},
      year={2025},
      eprint={2505.20650},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20650},
}
```