T²-RAGBench

Evaluating Retrieval-Augmented Generation Models for Text-Table QA

About T²-RAGBench

T²-RAGBench is a realistic and rigorous benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on financial documents that combine text and tables. It contains 32,908 question-context-answer triples drawn from 9,095 real-world financial reports, focusing on numerical reasoning and retrieval robustness.

The benchmark comprises four subsets, each derived from an existing financial QA dataset:

  • FinQA – Single-turn numerical QA from financial reports
  • ConvFinQA – Multi-turn QA (only the first turn is used)
  • VQAonBD – Table-only data with mapped source PDFs
  • TAT-DQA – Independent QA set focusing on numerical reasoning

| Subset    | Domain  | # Documents | # QA Pairs | Avg. Tokens/Doc | Avg. Tokens/Question |
|-----------|---------|-------------|------------|-----------------|----------------------|
| FinQA     | Finance | 2,789       | 8,281      | 950.4           | 39.2                 |
| ConvFinQA | Finance | 1,806       | 3,458      | 890.9           | 30.9                 |
| VQAonBD   | Finance | 1,777       | 9,820      | 460.3           | 43.5                 |
| TAT-DQA   | Finance | 2,723       | 11,349     | 915.3           | 31.7                 |
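
To get a feel for the data, here is a minimal sketch of loading one subset with the Hugging Face datasets library and inspecting a question-context-answer triple. The repository ID, config name, split, and field names are illustrative assumptions, not the official schema; see the Hugging Face page for the exact identifiers.

    # Minimal sketch, assuming a Hugging Face datasets layout; the repo ID,
    # config name, split, and field names below are placeholders.
    from datasets import load_dataset

    ds = load_dataset("t2ragbench/T2-RAGBench", "FinQA", split="test")  # hypothetical IDs

    example = ds[0]
    print(example["question"])  # assumed: a numerical question
    print(example["context"])   # assumed: report text plus a linearized table
    print(example["answer"])    # assumed: the gold (typically numerical) answer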

For more details on the benchmark, please refer to our paper or code, or email us at t2ragbench@gmail.com.

News

  • June 2025:

    📝 The paper was released on arXiv.

    💻 The code was published on Hugging Face.

  • May 2025:

    🚀 We officially released T²-RAGBench and the accompanying baseline results!

    📈 Extensive evaluations of RAG methods, including Base-RAG, Hybrid BM25, HyDE, and summarization strategies, are now available.

Citation

    @article{krempel2025t2ragbench,
      title={T2-RAGBench: A Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation},
      author={Krempel, Johannes and Shvets, Elena and Leveling, Johannes and Ney, Hermann},
      journal={ACL Findings},
      year={2025}
    }

Submission

We warmly welcome submissions to our leaderboard, whether of your own methods or of results showcasing the latest system performance. You may submit results for all subsets or for a single one. Please refer to the Submission Guidelines below for details and send your results to t2ragbench@gmail.com.

Leaderboard

| Date | Generator | Retriever | Retrieval Method | FinQA NM / MRR@3 | ConvFinQA NM / MRR@3 | VQAonBD NM / MRR@3 | TAT-DQA NM / MRR@3 | W. Avg. Total |
|------|-----------|-----------|------------------|------------------|----------------------|--------------------|--------------------|---------------|
| May 19, 2025 | QwQ-32B (baseline) | - | Oracle Context | 72.4 / 100 | 85.4 / 100 | 69.6 / 100 | 71.1 / 100 | 72.5 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | - | Oracle Context | 79.4 / 100 | 75.8 / 100 | 68.7 / 100 | 69.2 / 100 | 72.3 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Base-RAG | 39.5 / 38.7 | 47.4 / 42.2 | 40.5 / 46.9 | 29.6 / 25.2 | 37.2 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Hybrid BM25 | 41.7 / 40.0 | 50.3 / 43.5 | 42.2 / 43.8 | 37.4 / 29.2 | 41.3 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Reranker | 32.4 / 29.0 | 37.3 / 32.3 | 34.8 / 39.3 | 27.0 / 22.8 | 31.8 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | HyDE | 38.4 / 35.4 | 44.8 / 39.8 | 35.1 / 39.2 | 26.7 / 20.8 | 34.0 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | Summarization | 27.3 / 47.3 | 35.2 / 52.1 | 10.6 / 35.1 | 14.6 / 24.7 | 18.8 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | Multilingual-e5-large-instruct (baseline) | SumContext | 47.2 / 47.3 | 55.5 / 52.1 | 32.5 / 35.4 | 29.1 / 24.8 | 37.4 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Base-RAG | 39.6 / 38.7 | 48.7 / 42.4 | 41.7 / 46.9 | 27.9 / 25.2 | 37.1 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Hybrid BM25 | 41.8 / 39.8 | 51.6 / 43.6 | 43.5 / 44.0 | 37.2 / 29.3 | 41.7 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Reranker | 30.8 / 29.0 | 37.5 / 32.7 | 34.6 / 39.2 | 25.6 / 22.9 | 30.8 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | HyDE | 36.8 / 35.4 | 45.7 / 39.9 | 35.9 / 38.4 | 24.7 / 20.7 | 33.3 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | Summarization | 26.9 / 47.2 | 35.6 / 52.2 | 10.7 / 35.4 | 13.9 / 24.7 | 18.5 |
| May 19, 2025 | QwQ-32B (baseline) | Multilingual-e5-large-instruct (baseline) | SumContext | 45.6 / 47.3 | 56.9 / 52.2 | 33.1 / 35.4 | 27.3 / 24.7 | 36.7 |
| May 19, 2025 | QwQ-32B (baseline) | - | Pretrained-Only | 7.5 / - | 2.4 / - | 1.7 / - | 4.4 / - | 4.2 |
| May 19, 2025 | LLaMA 3.3-70B (baseline) | - | Pretrained-Only | 7.9 / - | 2.8 / - | 1.54 / - | 3.7 / - | 3.9 |
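
For orientation, NM scores answer correctness (whether the generated number matches the gold answer) and MRR@3 scores retrieval quality (the reciprocal rank of the gold context among the top-3 retrieved documents, averaged over questions). The sketch below illustrates this reading; the official evaluation script may parse and normalize numbers differently, so the tolerance and string fallback here are assumptions.

    # Illustrative metric sketches, not the official evaluation code.
    def number_match(pred: str, gold: str, tol: float = 1e-4) -> bool:
        """Assumed NM: parsed numbers match within a small tolerance."""
        try:
            return abs(float(pred.replace(",", "")) - float(gold.replace(",", ""))) <= tol
        except ValueError:
            # Fallback assumption: exact string match for non-numeric answers.
            return pred.strip() == gold.strip()

    def mrr_at_3(ranked_doc_ids: list[str], gold_doc_id: str) -> float:
        """Reciprocal rank of the gold context within the top 3, else 0.
        Leaderboard MRR@3 would be the mean of this over all questions."""
        for rank, doc_id in enumerate(ranked_doc_ids[:3], start=1):
            if doc_id == gold_doc_id:
                return 1.0 / rank
        return 0.0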
⚙ This website is based on the layout of Bird-Bench and TableBench, adapted for T²-RAGBench!