About T²-RAGBench
T²-RAGBench is a realistic and rigorous benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on financial documents that combine text and tables. It contains 32,908 question-context-answer triples drawn from 9,095 real-world financial reports, with a focus on numerical reasoning and retrieval robustness.
The benchmark comprises four subsets derived from financial datasets:
- FinQA – Single-turn numerical QA from financial reports
- ConvFinQA – Multi-turn QA (only the first turn is used)
- VQAonBD – Table-only data with mapped source PDFs
- TAT-DQA – Independent QA set focusing on numerical reasoning
| Subset | Domain | # Documents | # QA Pairs | Avg. Tokens/Doc | Avg. Tokens/Question |
|---|---|---|---|---|---|
| FinQA | Finance | 2,789 | 8,281 | 950.4 | 39.2 |
| ConvFinQA | Finance | 1,806 | 3,458 | 890.9 | 30.9 |
| VQAonBD | Finance | 1,777 | 9,820 | 460.3 | 43.5 |
| TAT-DQA | Finance | 2,723 | 11,349 | 915.3 | 31.7 |
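Each benchmark entry pairs a question with its source context and a gold answer. As a minimal sketch of how such a triple might be represented in code (the field names here are illustrative, not the official dataset schema):

```python
from dataclasses import dataclass

@dataclass
class RAGTriple:
    question: str  # numerical question about a financial report
    context: str   # text-and-table passage from the source document
    answer: str    # gold (typically numeric) answer
    subset: str    # one of: FinQA, ConvFinQA, VQAonBD, TAT-DQA

# Hypothetical example entry, for illustration only.
example = RAGTriple(
    question="What was the percentage change in revenue from 2019 to 2020?",
    context="Revenue | 2019: $1,200M | 2020: $1,380M",
    answer="15%",
    subset="FinQA",
)
print(example.subset)  # FinQA
```

In a RAG setting the `context` field serves as the oracle passage: retrieval methods are scored on whether they recover it, and generators on whether they produce `answer` from it.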
For more details on the benchmark, please refer to our paper or code, or email us at t2ragbench@gmail.com.
News
May 2025:
- 🚀 We officially released T²-RAGBench and the accompanying baseline results!
- 📝 The paper was released on arXiv.
- 💻 The code was published on Hugging Face.
- 📈 Extensive evaluations of RAG methods, including Base-RAG, BM25, HyDE, and summarization strategies, are now available.
Citation
@article{krempel2025t2ragbench,
  title   = {T2-RAGBench: A Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation},
  author  = {Krempel, Johannes and Shvets, Elena and Leveling, Johannes and Ney, Hermann},
  journal = {ACL Findings},
  year    = {2025}
}
Submission
We warmly welcome submissions to our leaderboard, including both your own methods and results showcasing the latest system performance! You may submit results for all subsets or for a single one. Please refer to the Submission Guidelines below for details and send your results as instructed to t2ragbench@gmail.com.
For each subset, NM (answer accuracy) and MRR@3 (retrieval quality: mean reciprocal rank of the gold context within the top-3 retrieved documents) are reported.

| Date | Generator | Retriever | Retrieval Method | FinQA NM | FinQA MRR@3 | ConvFinQA NM | ConvFinQA MRR@3 | VQAonBD NM | VQAonBD MRR@3 | TAT-DQA NM | TAT-DQA MRR@3 | W Avg Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| May 19, 2025 | QwQ-32B (baseline) | - | Oracle Context | 72.4 | 100 | 85.4 | 100 | 69.6 | 100 | 71.1 | 100 | 72.5 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | - | Oracle Context | 79.4 | 100 | 75.8 | 100 | 68.7 | 100 | 69.2 | 100 | 72.3 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Base-RAG | 39.5 | 38.7 | 47.4 | 42.2 | 40.5 | 46.9 | 29.6 | 25.2 | 37.2 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Hybrid BM25 | 41.7 | 40.0 | 50.3 | 43.5 | 42.2 | 43.8 | 37.4 | 29.2 | 41.3 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Reranker | 32.4 | 29.0 | 37.3 | 32.3 | 34.8 | 39.3 | 27.0 | 22.8 | 31.8 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | HyDE | 38.4 | 35.4 | 44.8 | 39.8 | 35.1 | 39.2 | 26.7 | 20.8 | 34.0 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Summarization | 27.3 | 47.3 | 35.2 | 52.1 | 10.6 | 35.1 | 14.6 | 24.7 | 18.8 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | SumContext | 47.2 | 47.3 | 55.5 | 52.1 | 32.5 | 35.4 | 29.1 | 24.8 | 37.4 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Base-RAG | 39.6 | 38.7 | 48.7 | 42.4 | 41.7 | 46.9 | 27.9 | 25.2 | 37.1 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Hybrid BM25 | 41.8 | 39.8 | 51.6 | 43.6 | 43.5 | 44.0 | 37.2 | 29.3 | 41.7 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Reranker | 30.8 | 29.0 | 37.5 | 32.7 | 34.6 | 39.2 | 25.6 | 22.9 | 30.8 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | HyDE | 36.8 | 35.4 | 45.7 | 39.9 | 35.9 | 38.4 | 24.7 | 20.7 | 33.3 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Summarization | 26.9 | 47.2 | 35.6 | 52.2 | 10.7 | 35.4 | 13.9 | 24.7 | 18.5 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | SumContext | 45.6 | 47.3 | 56.9 | 52.2 | 33.1 | 35.4 | 27.3 | 24.7 | 36.7 |
| May 19, 2025 | QwQ-32B (baseline) | - | Pretrained-Only | 7.5 | - | 2.4 | - | 1.7 | - | 4.4 | - | 4.2 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | - | Pretrained-Only | 7.9 | - | 2.8 | 0 | 1.54 | - | 3.7 | - | 3.9 |
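The MRR@3 retrieval scores reported above follow the standard mean-reciprocal-rank definition. A minimal sketch of the computation (the function names and toy data are illustrative, not the benchmark's evaluation code):

```python
def mrr_at_k(ranked_doc_ids, gold_doc_id, k=3):
    """Reciprocal rank of the gold context within the top-k results; 0 if absent."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id == gold_doc_id:
            return 1.0 / rank
    return 0.0

def mean_mrr_at_k(runs, k=3):
    """Average reciprocal rank over (ranking, gold) pairs for all queries."""
    return sum(mrr_at_k(ranking, gold, k) for ranking, gold in runs) / len(runs)

# Toy example: gold context at rank 1, rank 2, and missing from the top 3.
runs = [
    (["d1", "d2", "d3"], "d1"),  # reciprocal rank 1.0
    (["d4", "d5", "d6"], "d5"),  # reciprocal rank 0.5
    (["d7", "d8", "d9"], "d0"),  # reciprocal rank 0.0
]
print(mean_mrr_at_k(runs))  # 0.5
```

Scores in the leaderboard are this quantity expressed as a percentage, so 100 under Oracle Context simply means the gold context is always provided at rank 1.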