🧠 Why do LLMs struggle with inferential questions? Because not all context is equally helpful: some sentences guide reasoning, while others just add noise.
💡 What if we could measure how useful a sentence is for reasoning? This work shows that convergence, i.e. how much a sentence narrows down the set of possible answers, plays a key role in improving inferential QA.
Large Language Models (LLMs) are powerful, but they still struggle with inferential questions 🤔 (those whose answers must be reasoned out rather than directly found).
💡 In this project, we introduce convergence as a signal that measures how well a sentence (hint) narrows down the possible answers. A minimal sketch of the idea follows the list below.
- ✅ High-convergence sentences → better QA performance
- 📊 Convergence > cosine similarity for passage selection
- 🧠 Ordering sentences by convergence → even better results
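To make the signal concrete, here is a minimal, hypothetical sketch of a convergence-style score, defined as the fraction of candidate answers a hint rules out. All names are illustrative; the scores used in this repository are produced by the pipeline in `dataset/make_dataset.py`.

```python
def convergence_score(candidates, is_ruled_out):
    """Toy convergence score: the fraction of candidates a hint eliminates.

    `candidates` is a pool of plausible answers; `is_ruled_out(c)` returns
    True if the hint is incompatible with candidate `c`. Illustrative only;
    see dataset/make_dataset.py for the real annotation pipeline.
    """
    if not candidates:
        return 0.0
    eliminated = sum(1 for c in candidates if is_ruled_out(c))
    return eliminated / len(candidates)


# A hint that rules out 3 of 4 candidate answers converges strongly.
candidates = ["Paris", "Lyon", "Berlin", "Madrid"]
hint_rules_out = lambda c: c != "Paris"  # pretend the hint only fits "Paris"
print(convergence_score(candidates, hint_rules_out))  # 0.75
```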
Repository layout:

```
├── dataset                       # Data preparation and evaluation utilities
│   ├── compute_similarities.py   # Computes cosine similarity scores
│   ├── dataset_final.tar.gz      # Ready-to-use final dataset for experiments
│   ├── make.sh                   # Rebuilds the dataset pipeline from scratch
│   ├── make_dataset.py           # Creates dataset with convergence annotations
│   ├── merge.py                  # Merges intermediate outputs into final dataset
│   └── qa.py                     # Runs QA evaluation pipeline
│
└── experiments                   # Experiment scripts used in the paper
    ├── convergence_vs_cosine.py  # Compares convergence vs cosine similarity
    └── order.py                  # Tests effect of sentence ordering
```

📦 Preprocessed dataset included: `dataset/dataset_final.tar.gz`

- ✅ Recommended: use this directly
- ⚠️ Optional: rebuild from scratch if needed
The dataset is derived from hint-based QA data (TriviaHG) and is designed for inferential question answering. Unlike standard QA datasets, the answer must be inferred by combining hints rather than extracted from a single sentence.
Each hint has a convergence score, measuring how well it narrows down candidate answers:
- 🟢 High → strongly filters incorrect answers
- 🟡 Medium → partially informative
- 🔴 Low → weak or ambiguous
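To get started with the provided archive, unpack it next to the scripts that consume it. The sketch below uses only the standard library; the archive path comes from this repository, but inspect the extracted files yourself before relying on any particular layout inside.

```python
import tarfile

# Unpack the ready-to-use dataset shipped with the repository.
with tarfile.open("dataset/dataset_final.tar.gz", "r:gz") as tar:
    tar.extractall(path="dataset/")
```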
Run the experiments:

```bash
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

You can reproduce the paper in two ways.

Option 1: Use the provided dataset. This is the easiest and recommended way. Follow these steps:

```bash
git clone https://github.com/DataScienceUIBK/Context-Convergence-Inferential-QA.git
cd Context-Convergence-Inferential-QA
pip install termcolor
```

You do not need to recreate the dataset for the experiments.

```bash
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

This reproduces the main experimental setup using the prepared data.
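For intuition about the baseline: the cosine-similarity selection that `convergence_vs_cosine.py` compares against boils down to the standard formula below. This is a self-contained sketch with made-up embedding vectors, not the script's actual code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; in practice they come from a sentence-embedding model.
question = np.array([0.2, 0.8, 0.1])
hint = np.array([0.3, 0.7, 0.2])
print(cosine_similarity(question, hint))  # ~0.98: highly similar vectors
```

The point of the comparison is that two sentences can be highly similar in embedding space while contributing little to narrowing down the answer.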
Option 2: Rebuild the dataset from scratch. Use this only if you want to regenerate the dataset yourself. Before rebuilding anything, complete the following steps.
Make sure HintEval is installed correctly: 👉 https://hinteval.readthedocs.io/
Then:
```bash
git clone https://github.com/DataScienceUIBK/Context-Convergence-Inferential-QA.git
cd Context-Convergence-Inferential-QA
pip install -r requirements.txt
cd dataset
bash make.sh
```

This rebuilds the dataset step by step using the scripts in `dataset/`. Then run the experiments:

```bash
cd ..
python experiments/convergence_vs_cosine.py
python experiments/order.py
```

Notes:

- Use the provided dataset if you want the closest match to the reported results.
- Rebuilding the dataset is mainly for transparency and regeneration.
- Make sure HintEval is installed correctly before rebuilding.
The experiments in this repository correspond to the two main studies in the paper:
- convergence vs cosine similarity
- sentence ordering by convergence
- 🟢 Convergence is a strong relevance signal
- 📈 High-convergence passages → better accuracy
- ❌ Cosine similarity is not reliable
- 📊 Ordering by convergence improves performance
- 🧠 LLMs prioritize earlier information (see the sketch below)
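This last point suggests a simple recipe when building prompts: put the highest-convergence hints first. A minimal sketch, assuming each hint is paired with a precomputed convergence score; the hints and scores are illustrative, and the paper's actual experiment lives in `experiments/order.py`.

```python
# Hints paired with hypothetical convergence scores.
hints = [
    ("It is a European capital.", 0.35),
    ("Its most famous landmark is the Eiffel Tower.", 0.95),
    ("It lies on the Seine.", 0.70),
]

# Highest-convergence hints first, since LLMs weight earlier context more.
ordered = sorted(hints, key=lambda h: h[1], reverse=True)
prompt_context = "\n".join(text for text, _ in ordered)
print(prompt_context)
```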
MIT License. See LICENSE.