# BEIR Vector Search Benchmark for Dgraph

This directory contains a comprehensive benchmark suite for testing Dgraph's vector search capabilities using the [BEIR](https://github.com/beir-cellar/beir) (Benchmarking IR) dataset.

## Overview

The BEIR benchmark allows you to:
- Test Dgraph's HNSW vector index implementation
- Compare performance across different Dgraph versions
- Evaluate retrieval quality using standard IR metrics (NDCG, MAP, Recall, Precision)
- Experiment with different HNSW parameters

## Prerequisites

- **Docker**: For running Dgraph containers
- **uv**: Python package manager ([installation guide](https://github.com/astral-sh/uv))
- **Python 3.12+**

### Installing uv

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

## Project Structure

```
beir/
├── .env.example            # Example environment configuration
├── docker-compose.yml      # Dgraph container configuration
├── pyproject.toml          # Python dependencies (managed by uv)
├── README.md               # This file
│
├── config.py               # Configuration management
├── embeddings.py           # Embedding generation with caching
├── dgraph_client.py        # Dgraph client wrapper
├── evaluate.py             # Main evaluation script
├── view_results.py         # Terminal-based results viewer with ASCII charts
├── insert_benchmark.py     # Standalone insertion benchmark
├── generate_test_query.py  # Generate test queries for debugging
├── run_benchmark.sh        # Automated benchmark runner
│
├── benchmarks/             # Reference benchmark thresholds (JSON)
├── data/                   # BEIR datasets (auto-downloaded)
└── results/                # Evaluation results (CSV)
```

## Quick Start

### 1. Install Dependencies

```bash
uv sync
```

This will create a virtual environment and install all required dependencies.

### 2. Configure Environment

Copy the example environment file:

```bash
cp .env.example .env
```

Edit `.env` to configure your benchmark. Key settings:

```bash
# Dgraph version to test
DGRAPH_VERSION=v25.1.0             # Docker image tag (e.g., v25.1.0, local)

# Dataset and embedding model
DATASET_NAME=scifact               # BEIR dataset (scifact, nfcorpus, fiqa, etc.)
EMBEDDING_MODEL=all-mpnet-base-v2  # Sentence transformer model

# HNSW index parameters
HNSW_METRIC=euclidean              # euclidean, cosine, or dotproduct
HNSW_EF_SEARCH=32                  # Search-time beam width
HNSW_EF_CONSTRUCTION=64            # Build-time beam width

# Optional description for this run
DESC="Testing v25.1.0 with default params"
```

See `.env.example` for all available options, including experimental features.

### 3. Run the Benchmark

The easiest way to run the benchmark is with the provided shell script:

```bash
./run_benchmark.sh
```

This will:
1. Start Dgraph using Docker Compose
2. Download the BEIR dataset (cached in `./data/`)
3. Generate embeddings (cached in `~/.cache/beir_embeddings/`)
4. Insert documents and run vector search queries
5. Evaluate results and append to `./results/benchmark_results.csv`
6. Stop Dgraph

To preserve run history, please don't remove existing rows from `./results/benchmark_results.csv` when committing new results.

To keep Dgraph running after the benchmark for manual testing:

```bash
KEEP_RUNNING=true ./run_benchmark.sh
```

**Note on embedding caching:** Embeddings are cached in `~/.cache/beir_embeddings/` based on dataset name and model. If you change the `EMBEDDING_MODEL`, you must manually delete the cached embeddings for that dataset:

```bash
# Clear all cached embeddings
rm -rf ~/.cache/beir_embeddings/

# Or clear for a specific dataset
rm ~/.cache/beir_embeddings/scifact_*.npz
```
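
The caching scheme above can be sketched as follows. This is an illustration only: the `<dataset>_<model>.npz` file naming is an assumption inferred from the `rm` pattern above, and the real logic lives in `embeddings.py`.

```python
# Sketch of the embedding cache (assumed layout:
# ~/.cache/beir_embeddings/<dataset>_<model>.npz).
# See embeddings.py for the actual naming scheme.
from pathlib import Path

import numpy as np

CACHE_DIR = Path.home() / ".cache" / "beir_embeddings"


def cache_path(dataset: str, model: str) -> Path:
    return CACHE_DIR / f"{dataset}_{model}.npz"


def load_or_compute(dataset: str, model: str, compute):
    """Return cached embeddings if present; otherwise compute and cache them."""
    path = cache_path(dataset, model)
    if path.exists():
        return np.load(path)["embeddings"]
    embeddings = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, embeddings=embeddings)
    return embeddings
```

Because the cache key includes the model name, embeddings for different models coexist; the manual cleanup above is only needed to reclaim disk space or force regeneration.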

### 4. View Results

Results are appended to `./results/benchmark_results.csv`. Use the `view_results.py` script to view them:

```bash
# View summary table (default)
uv run python view_results.py

# View with ASCII bar charts for all metrics
uv run python view_results.py --bars

# View a specific metric
uv run python view_results.py --bars --metric ndcg

# View timing comparison
uv run python view_results.py --timing

# View only the last N results
uv run python view_results.py --last 5

# Filter by platform
uv run python view_results.py --platform Dgraph

# Show raw table
uv run python view_results.py --raw
```

Example output:
```
Loaded 5 result(s) from benchmark_results.csv

====================================================================================================
SUMMARY - NDCG@k
====================================================================================================
Run                               |       @1 |       @3 |       @5 |      @10 |     @100
----------------------------------+----------+----------+----------+----------+---------
Dgraph v25.1.0                    |   0.6167 |   0.6310 |   0.6466 |   0.6806 |   0.7353
Dgraph v25.1.0 (higher ef)        |   0.6200 |   0.6389 |   0.6521 |   0.6852 |   0.7398
```

## BEIR Datasets

The framework supports any BEIR dataset. Recommended datasets for testing:

| Dataset | Documents | Queries | Domain | Size |
|---------|-----------|---------|--------|------|
| `scifact` | 5.2K | 300 | Scientific Claims | Small |
| `nfcorpus` | 3.6K | 323 | Medical | Small |
| `fiqa` | 57K | 648 | Financial | Medium |
| `arguana` | 8.7K | 1406 | Argumentation | Medium |
| `trec-covid` | 171K | 50 | COVID-19 | Large |

Datasets are automatically downloaded on first use and cached in `./data/`.

## Configuration Options

### Environment Variables

All configuration can be set via the `.env` file or environment variables:

```bash
# Dgraph Configuration
DGRAPH_VERSION=v25.0.0
DGRAPH_HOST=localhost
DGRAPH_PORT=9080

# BEIR Configuration
DATASET_NAME=scifact
EMBEDDING_MODEL=all-MiniLM-L6-v2
BATCH_SIZE=32

# HNSW Index Configuration
HNSW_METRIC=euclidean
HNSW_MAX_LEVELS=3
HNSW_EF_SEARCH=40
HNSW_EF_CONSTRUCTION=100

# Evaluation Configuration
K_VALUES=1,3,5,10,100
DESC=  # Optional: a description for this benchmark run
```
### Parameter Reference

#### HNSW Parameters

- **metric**: Distance metric (`euclidean`, `cosine`, or `dotproduct`)
- **maxLevels**: Maximum number of layers in the HNSW graph
- **efSearch**: Size of the dynamic candidate list during search (higher = more accurate, slower queries)
- **efConstruction**: Size of the dynamic candidate list during construction (higher = better index quality, slower build)
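
For reference, recent Dgraph versions declare a vector predicate and its HNSW index in the DQL schema roughly like this (the predicate name `embedding` is illustrative, and the exact set of index options your Dgraph version accepts may differ from the parameters above; `dgraph_client.py` generates the actual schema):

```
embedding: float32vector @index(hnsw(metric: "euclidean")) .
```

A top-k vector search then uses the `similar_to` function, for example:

```
{
  results(func: similar_to(embedding, 10, "[0.12, 0.08, 0.33]")) {
    uid
  }
}
```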

#### Evaluation Options

- **K_VALUES**: Comma-separated list of k-values for metrics (default: `1,3,5,10,100`)
- **DESC**: Optional description for the benchmark run (e.g., `"Testing with higher efConstruction"`)
  - Included in results JSON and plot legends
  - Useful for tracking configuration changes across runs
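
A minimal sketch of how these settings might be read (the function names here are illustrative; `config.py` holds the real implementation):

```python
# Illustrative config loading -- the real logic lives in config.py.
import os


def get_k_values(default: str = "1,3,5,10,100") -> list[int]:
    """Parse K_VALUES (e.g. "1,3,5,10,100") into a sorted list of unique ints."""
    raw = os.getenv("K_VALUES", default)
    return sorted({int(part) for part in raw.split(",") if part.strip()})


def get_desc() -> str:
    """Optional free-form description attached to the benchmark run."""
    return os.getenv("DESC", "").strip()
```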

## Evaluation Metrics

The benchmark reports standard IR metrics:

- **NDCG@k**: Normalized Discounted Cumulative Gain measures ranking quality.
  It compares your ranking to an ideal ranking where all relevant documents are at the top, while discounting gains at lower-ranked positions. Higher NDCG@k means relevant documents are not just retrieved, but also ranked closer to the top of the list.
- **MAP@k**: Mean Average Precision measures precision across recall levels.
  It averages the precision at every rank position where a relevant document appears, then averages this over all queries. Higher MAP@k indicates that relevant documents tend to appear earlier and more consistently in the ranked results.
- **Recall@k**: Fraction of relevant documents retrieved in the top k.
  It focuses on coverage: of all relevant documents for a query, how many are found within the first `k` results. Higher Recall@k means the system misses fewer relevant documents, even if some are ranked lower.
- **Precision@k**: Fraction of retrieved documents that are relevant.
  It focuses on purity: among the first `k` results, how many are actually relevant to the query. Higher Precision@k means users see fewer non-relevant documents at the top of the ranking.

In this benchmark, common settings are `k=10` to capture the quality of the top search results and `k=100` to understand overall coverage. If your use case is user-facing search where only the first page matters, prioritize **Precision@k** and **NDCG@k**. If you care more about finding as many relevant items as possible (e.g., analysis or recall-oriented workflows), prioritize **Recall@k** and **MAP@k**.
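
For binary relevance judgments, three of these metrics can be sketched in a few lines. This is a simplified illustration only; the benchmark itself uses BEIR's official evaluation, which also handles graded relevance:

```python
# Simplified IR metrics for binary relevance judgments.
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```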

## Performance Analysis

Results include timing information:
- Embedding generation time
- Document insertion time
- Total search time and average per-query time

Example output:
```
NDCG@k:
  NDCG@1: 0.6890
  NDCG@10: 0.7234

Timing:
  embedding_time_seconds: 45.23
  insert_time_seconds: 12.67
  search_time_seconds: 8.92
  avg_query_time_seconds: 0.0297
```

## Running Manually

You can also run the benchmark manually, without the shell script:

```bash
# Start Dgraph
docker-compose up -d

# Wait for Dgraph to be ready
curl http://localhost:8080/health

# Run the evaluation
uv run python evaluate.py

# Stop Dgraph (-v also removes the data volumes)
docker-compose down -v
```

## Troubleshooting

### Docker Issues

If Dgraph fails to start:
```bash
# Check container logs
docker-compose logs

# Clean restart
docker-compose down -v
docker-compose up -d
```

### Memory Issues

For large datasets, you may need to increase Docker's memory limit or use a smaller dataset/batch size.

### Connection Issues

Ensure ports 8080 and 9080 are not already in use:
```bash
lsof -i :8080
lsof -i :9080
```

## Extending the Benchmark

### Adding New Datasets

Change `DATASET_NAME` in `.env` to any BEIR dataset name. The dataset will be downloaded automatically on the next run.

### Custom Embedding Models

Change `EMBEDDING_MODEL` to any sentence-transformers model:
- `all-MiniLM-L6-v2` (384 dimensions, fast)
- `all-mpnet-base-v2` (768 dimensions, better quality)
- `multi-qa-MiniLM-L6-cos-v1` (384 dimensions, optimized for QA)

## References

- [BEIR Benchmark](https://github.com/beir-cellar/beir)
- [Dgraph Documentation](https://dgraph.io/docs/)
- [Sentence Transformers](https://www.sbert.net/)
- [HNSW Algorithm](https://arxiv.org/abs/1603.09320)

## License

This benchmark suite follows the same license as the parent repository.