🧠 Why do LLMs struggle with inferential questions? Because not all context is equally helpful—some sentences guide reasoning, others just add noise.
💡 What if we could measure how useful a sentence is for reasoning? This work shows that convergence—how much a sentence narrows down possible answers—plays a key role in improving inferential QA.
Large Language Models (LLMs) are powerful—but they still struggle with inferential questions 🤔 (those where answers must be reasoned, not directly found).
💡 In this project, we introduce convergence as a signal that measures how well a sentence (hint) narrows down possible answers.
- ✅ High-convergence sentences → better QA performance
- 📊 Convergence > cosine similarity for passage selection
- 🧠 Ordering sentences by convergence → even better results
```
├── dataset                        # Data preparation and evaluation utilities
│   ├── compute_similarities.py    # Computes cosine similarity scores
│   ├── dataset_final.tar.gz       # Ready-to-use final dataset for experiments
│   ├── make.sh                    # Rebuilds the dataset pipeline from scratch
│   ├── make_dataset.py            # Creates dataset with convergence annotations
│   ├── merge.py                   # Merges intermediate outputs into final dataset
│   └── qa.py                      # Runs QA evaluation pipeline
│
└── experiments                    # Experiment scripts used in the paper
    ├── convergence_vs_cosine.py   # Compares convergence vs cosine similarity
    └── order.py                   # Tests effect of sentence ordering
```

📍 Preprocessed dataset included:
`dataset/dataset_final.tar.gz`
- ✅ Recommended: extract this archive (`tar -xzf dataset/dataset_final.tar.gz`) and use it directly
- ⚠️ Optional: rebuild it from scratch if needed
The dataset is derived from hint-based QA data (TriviaHG) and is designed for inferential question answering. Unlike standard QA datasets, the answer must be inferred by combining hints, not extracted from a single sentence.
Each hint has a convergence score, measuring how well it narrows down candidate answers:
- 🟢 High → strongly filters incorrect answers
- 🟡 Medium → partially informative
- 🔴 Low → weak or ambiguous
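The actual scoring is implemented in `dataset/make_dataset.py`; as a rough, illustrative sketch of the idea only (not the paper's exact metric), a hint's convergence can be thought of as the fraction of candidate answers it rules out:

```python
def convergence_score(candidates_before, candidates_after):
    """Toy convergence: fraction of candidate answers a hint eliminates.

    Illustrative simplification, not the metric used in the paper.
    """
    if not candidates_before:
        return 0.0
    eliminated = len(candidates_before) - len(candidates_after)
    return eliminated / len(candidates_before)


# Hypothetical example: the hint "it is a European capital" keeps
# four of ten candidate answers, eliminating the other six.
before = {"Paris", "Vienna", "Rome", "Oslo", "Tokyo",
          "Lima", "Cairo", "Delhi", "Quito", "Hanoi"}
after = {"Paris", "Vienna", "Rome", "Oslo"}
print(convergence_score(before, after))  # 0.6
```

A high-convergence hint shrinks the candidate set sharply; a low-convergence hint leaves it almost unchanged.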
To run the experiments:

```bash
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

You can reproduce the paper in two ways:
Option 1: use the prepared dataset. This is the easiest and recommended way. Follow these steps:
```bash
git clone https://github.com/DataScienceUIBK/Context-Convergence-Inferential-QA.git
cd Context-Convergence-Inferential-QA
pip install termcolor
```

You do not need to recreate the dataset for the experiments.
```bash
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

This reproduces the main experimental setup using the prepared data.
Option 2: rebuild the dataset from scratch. Use this only if you want to regenerate the dataset yourself. Before rebuilding anything, complete the following steps:
Make sure HintEval is installed correctly: 👉 https://hinteval.readthedocs.io/
Then:
```bash
git clone https://github.com/DataScienceUIBK/Context-Convergence-Inferential-QA.git
cd Context-Convergence-Inferential-QA
pip install -r requirements.txt
cd dataset
bash make.sh
```

This rebuilds the dataset step by step using the scripts in `dataset/`.
```bash
cd ..
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

- Use the provided dataset if you want the closest match to the reported results.
- Rebuilding the dataset is mainly for transparency and regeneration.
- Make sure HintEval is installed correctly before rebuilding.
The experiments in this repository correspond to the two main studies in the paper:
- convergence vs cosine similarity
- sentence ordering by convergence
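For context on the first study: the cosine-similarity baseline ranks hint sentences by their embedding similarity to the question. A minimal sketch of that baseline, using a simple bag-of-words representation instead of the embeddings computed in `dataset/compute_similarities.py`, with hypothetical question/hint strings:

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse bag-of-words count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Rank candidate hints by similarity to the question (most similar first).
question = Counter("which european capital hosts the eiffel tower".split())
hints = [
    "this city is famous for its coffee houses",
    "the eiffel tower stands in this european capital",
]
ranked = sorted(
    hints,
    key=lambda h: cosine_similarity(question, Counter(h.split())),
    reverse=True,
)
print(ranked[0])  # the hint sharing the most words with the question
```

The paper's finding is that this kind of surface similarity can rank a hint highly even when it barely narrows the answer space, which is where convergence-based selection does better.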
- 🟢 Convergence is a strong relevance signal
- 📈 High-convergence passages → better accuracy
- ❌ Cosine similarity alone is not a reliable selection signal
- 🔝 Ordering by convergence improves performance
- 🧭 LLMs prioritize earlier information
MIT License — see LICENSE.