Commit d8023f7

feat: Add BEIR vector search benchmark suite (#84)
1 parent c5cfbba commit d8023f7

20 files changed: 8044 additions & 0 deletions

README.md

Lines changed: 5 additions & 0 deletions
@@ -3,3 +3,8 @@
Please ensure you have [Git LFS](https://git-lfs.github.com/) installed before you clone this repository.

There's currently around 450MB worth of data in this repository.

To clone the repo with pointers to LFS-tracked files (without downloading their contents):

```sh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/dgraph-io/dgraph-benchmarks.git
```

vector/beir/.env.example

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Dgraph Configuration
DGRAPH_VERSION=v25.1.0 # any Dgraph version, or "local" if you have a local Docker build
DGRAPH_HOST=localhost
DGRAPH_PORT=9080

# BEIR Configuration
DATASET_NAME=scifact
EMBEDDING_MODEL=all-mpnet-base-v2 # or other models from Hugging Face, e.g. all-MiniLM-L6-v2
BATCH_SIZE=32

# HNSW Index Configuration
HNSW_METRIC=euclidean # e.g. euclidean, cosine, or dotproduct
HNSW_MAX_LEVELS=3
HNSW_EF_SEARCH=32
HNSW_EF_CONSTRUCTION=64
HNSW_INDEX_TYPE=hnsw # e.g. hnsw or partionedhnsw (sic); experimental, not yet released
HNSW_NUM_CLUSTERS=50 # used by partitioned HNSW variants; experimental, not yet released

# Evaluation Configuration
K_VALUES=1,3,5,10,100
# Set this to add a description to the results CSV
DESC=

# v25.2.0 similar_to configuration
USE_NEW_SIMILAR_TO_SYNTAX=false
SEARCH_EF=0 # Override efSearch at query time (0 = use index default); higher values increase search time but improve recall
SEARCH_DISTANCE_THRESHOLD=0.0 # Filter by distance threshold (0.0 = no filtering)
vector/beir/.gitignore

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
.venv/
.uv/

# BEIR data
data/

# Embedding cache
embeddings_cache/
*.npz

# Test files
test_query.dql

# Environment
.env

# IDE
.vscode/
.idea/
*.swp
*.swo

# Jupyter
.ipynb_checkpoints/

# OS
.DS_Store
Thumbs.db

vector/beir/README.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
1+
# BEIR Vector Search Benchmark for Dgraph
2+
3+
This directory contains a comprehensive benchmark suite for testing Dgraph's vector search capabilities using the [BEIR](https://github.com/beir-cellar/beir) (Benchmarking IR) dataset.
4+
5+
## Overview
6+
7+
The BEIR benchmark allows you to:
8+
- Test Dgraph's HNSW vector index implementation
9+
- Compare performance across different Dgraph versions
10+
- Evaluate retrieval quality using standard IR metrics (NDCG, MAP, Recall, Precision)
11+
- Experiment with different HNSW parameters
12+
13+
## Prerequisites
14+
15+
- **Docker**: For running Dgraph containers
16+
- **uv**: Python package manager ([installation guide](https://github.com/astral-sh/uv))
17+
- **Python 3.12+**
18+
19+
### Installing uv
20+
21+
```bash
22+
curl -LsSf https://astral.sh/uv/install.sh | sh
23+
```
24+
25+
## Project Structure
26+
27+
```
28+
beir/
29+
├── .env.example # Example environment configuration
30+
├── docker-compose.yml # Dgraph container configuration
31+
├── pyproject.toml # Python dependencies (managed by uv)
32+
├── README.md # This file
33+
34+
├── config.py # Configuration management
35+
├── embeddings.py # Embedding generation with caching
36+
├── dgraph_client.py # Dgraph client wrapper
37+
├── evaluate.py # Main evaluation script
38+
├── view_results.py # Terminal-based results viewer with ASCII charts
39+
├── insert_benchmark.py # Standalone insertion benchmark
40+
├── generate_test_query.py # Generate test queries for debugging
41+
├── run_benchmark.sh # Automated benchmark runner
42+
43+
├── benchmarks/ # Reference benchmark thresholds (JSON)
44+
├── data/ # BEIR datasets (auto-downloaded)
45+
└── results/ # Evaluation results (CSV)
46+
```
47+
48+
## Quick Start
49+
50+
### 1. Install Dependencies
51+
52+
```bash
53+
uv sync
54+
```
55+
56+
This will create a virtual environment and install all required dependencies.
57+
58+
### 2. Configure Environment
59+
60+
Copy the example environment file:
61+
62+
```bash
63+
cp .env.example .env
64+
```
65+
66+
Edit `.env` to configure your benchmark. Key settings:
67+
68+
```bash
69+
# Dgraph version to test
70+
DGRAPH_VERSION=v25.1.0 # Docker image tag (e.g., v25.1.0, local)
71+
72+
# Dataset and embedding model
73+
DATASET_NAME=scifact # BEIR dataset (scifact, nfcorpus, fiqa, etc.)
74+
EMBEDDING_MODEL=all-mpnet-base-v2 # Sentence transformer model
75+
76+
# HNSW index parameters
77+
HNSW_METRIC=euclidean # euclidean, cosine, or dotproduct
78+
HNSW_EF_SEARCH=32 # Search-time beam width
79+
HNSW_EF_CONSTRUCTION=64 # Build-time beam width
80+
81+
# Optional description for this run
82+
DESC="Testing v25.1.0 with default params"
83+
```
84+
85+
See `.env.example` for all available options including experimental features.
86+
87+
### 3. Run the Benchmark
88+
89+
The easiest way to run the benchmark is with the provided shell script:
90+
91+
```bash
92+
./run_benchmark.sh
93+
```
94+
95+
This will:
96+
1. Start Dgraph using Docker Compose
97+
2. Download the BEIR dataset (cached in `./data/`)
98+
3. Generate embeddings (cached in `~/.cache/beir_embeddings/`)
99+
4. Insert documents and run vector search queries
100+
5. Evaluate results and append to `./results/benchmark_results.csv`
101+
6. Stop Dgraph
102+
103+
For historical reasons, please don't commit removed existing rows from `./results/benchmark_results.csv`.
104+
105+
To keep Dgraph running after the benchmark for manual testing:
106+
107+
```bash
108+
KEEP_RUNNING=true ./run_benchmark.sh
109+
```
110+
111+
**Note on Embedding Caching:** Embeddings are cached in `~/.cache/beir_embeddings/` based on dataset name and model. If you change the `EMBEDDING_MODEL`, you must manually delete the cached embeddings for that dataset:
112+
113+
```bash
114+
# Clear all cached embeddings
115+
rm -rf ~/.cache/beir_embeddings/
116+
117+
# Or clear for a specific dataset
118+
rm ~/.cache/beir_embeddings/scifact_*.npz
119+
```
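The load-or-compute caching pattern described above can be sketched in a few lines. The helper name and the `<dataset>_<model>.npz` file-naming convention here are illustrative assumptions; the real logic lives in `embeddings.py`:

```python
from pathlib import Path

import numpy as np


def load_or_compute_embeddings(dataset: str, model_name: str, compute_fn,
                               cache_dir: str = "~/.cache/beir_embeddings"):
    """Return cached embeddings if present; otherwise compute and cache them."""
    cache_path = Path(cache_dir).expanduser() / f"{dataset}_{model_name}.npz"
    if cache_path.exists():
        return np.load(cache_path)["embeddings"]
    embeddings = compute_fn()  # e.g. a sentence-transformers encode() call
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(cache_path, embeddings=embeddings)
    return embeddings
```

Because the cache key ignores everything except dataset and model, stale files must be deleted by hand after a model change, which is exactly why the `rm` commands above are needed.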
### 4. View Results

Results are appended to `./results/benchmark_results.csv`. Use the `view_results.py` script to view them:

```bash
# View summary table (default)
uv run python view_results.py

# View with ASCII bar charts for all metrics
uv run python view_results.py --bars

# View a specific metric
uv run python view_results.py --bars --metric ndcg

# View timing comparison
uv run python view_results.py --timing

# View only the last N results
uv run python view_results.py --last 5

# Filter by platform
uv run python view_results.py --platform Dgraph

# Show raw table
uv run python view_results.py --raw
```

Example output:

```
Loaded 5 result(s) from benchmark_results.csv

====================================================================================================
SUMMARY - NDCG@k
====================================================================================================
Run                              |       @1 |       @3 |       @5 |      @10 |     @100
---------------------------------+----------+----------+----------+----------+---------
Dgraph v25.1.0                   |   0.6167 |   0.6310 |   0.6466 |   0.6806 |   0.7353
Dgraph v25.1.0 (higher ef)       |   0.6200 |   0.6389 |   0.6521 |   0.6852 |   0.7398
```

## BEIR Datasets

The framework supports any BEIR dataset. Recommended datasets for testing:

| Dataset | Documents | Queries | Domain | Size |
|---------|-----------|---------|--------|------|
| `scifact` | 5.2K | 300 | Scientific Claims | Small |
| `nfcorpus` | 3.6K | 323 | Medical | Small |
| `fiqa` | 57K | 648 | Financial | Medium |
| `arguana` | 8.7K | 1406 | Argumentation | Medium |
| `trec-covid` | 171K | 50 | COVID-19 | Large |

Datasets are automatically downloaded on first use and cached in `./data/`.

## Configuration Options

### Environment Variables

All configuration can be set via the `.env` file or environment variables:

```bash
# Dgraph Configuration
DGRAPH_VERSION=v25.0.0
DGRAPH_HOST=localhost
DGRAPH_PORT=9080

# BEIR Configuration
DATASET_NAME=scifact
EMBEDDING_MODEL=all-MiniLM-L6-v2
BATCH_SIZE=32

# HNSW Index Configuration
HNSW_METRIC=euclidean
HNSW_MAX_LEVELS=3
HNSW_EF_SEARCH=40
HNSW_EF_CONSTRUCTION=100

# Evaluation Configuration
K_VALUES=1,3,5,10,100
DESC=  # Optional: add a description for this benchmark run
```

### Parameter Details

#### HNSW Parameters

- **metric**: Distance metric (`euclidean`, `cosine`, `dotproduct`)
- **maxLevels**: Maximum number of layers in the HNSW graph
- **efSearch**: Size of the dynamic candidate list during search (higher = more accurate, slower)
- **efConstruction**: Size of the dynamic candidate list during construction (higher = better index quality, slower build)
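A side note on metric choice: for unit-normalized embeddings, all three metrics induce the same nearest-neighbor ranking, because squared Euclidean distance and cosine distance then differ only by a constant factor. A quick numpy check of that identity (not part of the suite):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit-normalize

dot = float(a @ b)                     # dotproduct similarity
cosine_dist = 1.0 - dot                # cosine distance (vectors are unit length)
eucl_sq = float(np.sum((a - b) ** 2))  # squared euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2*(a . b) = 2 * cosine_distance
assert abs(eucl_sq - 2.0 * cosine_dist) < 1e-12
```

For unnormalized embeddings the metrics genuinely differ, so the choice of `HNSW_METRIC` should match how the embedding model's vectors are intended to be compared.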
#### Evaluation Options

- **K_VALUES**: Comma-separated list of k-values for metrics (default: `1,3,5,10,100`)
- **DESC**: Optional description for the benchmark run (e.g., `"Testing with higher efConstruction"`)
  - Included in results JSON and plot legends
  - Useful for tracking configuration changes across runs
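As an illustration of how a comma-separated setting like `K_VALUES` can be turned into usable values (the helper below is hypothetical; the benchmark's actual parsing lives in `config.py`):

```python
import os


def parse_k_values(default: str = "1,3,5,10,100") -> list[int]:
    """Parse the K_VALUES env var into a sorted, deduplicated list of ints."""
    raw = os.environ.get("K_VALUES", default)
    return sorted({int(k.strip()) for k in raw.split(",") if k.strip()})
```

With `K_VALUES` unset this yields `[1, 3, 5, 10, 100]`; sorting and deduplication make the setting forgiving of whitespace and repeated entries.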
## Evaluation Metrics

The benchmark reports standard IR metrics:

- **NDCG@k**: Normalized Discounted Cumulative Gain measures ranking quality.
  It compares your ranking to an ideal ranking where all relevant documents are at the top, while discounting gains from lower-ranked positions. Higher NDCG@k means relevant documents are not just retrieved, but also ranked closer to the top of the list.
- **MAP@k**: Mean Average Precision measures precision across recall levels.
  It averages the precision at every rank position where a relevant document appears, then averages this over all queries. Higher MAP@k indicates that relevant documents tend to appear earlier and more consistently in the ranked results.
- **Recall@k**: The fraction of relevant documents retrieved in the top k.
  It focuses on coverage: out of all relevant documents for a query, how many are found within the first `k` results. Higher Recall@k means the system is missing fewer relevant documents, even if some are ranked lower.
- **Precision@k**: The fraction of retrieved documents that are relevant.
  It focuses on purity: among the first `k` results, how many are actually relevant to the query. Higher Precision@k means users see fewer non-relevant documents in the top part of the ranking.

In this benchmark, common settings are `k=10` to capture the quality of the top search results and `k=100` to understand overall coverage. If your use case is user-facing search where only the first page matters, prioritize **Precision@k** and **NDCG@k**. If you care more about finding as many relevant items as possible (e.g., analysis or recall-oriented workflows), prioritize **Recall@k** and **MAP@k**.
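To make Precision@k and Recall@k concrete, here is a minimal self-contained sketch of the two definitions using toy document IDs (the benchmark itself relies on BEIR's evaluation tooling rather than these helpers):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


retrieved = ["d1", "d7", "d3", "d9", "d2"]  # ranked search results
relevant = {"d1", "d2", "d5"}               # ground-truth judgments

print(precision_at_k(retrieved, relevant, 5))  # 0.4  (2 of the 5 results are relevant)
print(recall_at_k(retrieved, relevant, 5))     # 0.666... (2 of the 3 relevant docs found)
```

Note how the same two hits produce different numbers: precision divides by `k`, recall divides by the total number of relevant documents.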
## Performance Analysis

Results include timing information:

- Embedding generation time
- Document insertion time
- Total search time and average per-query time

Example output:

```
NDCG@k:
  NDCG@1: 0.6890
  NDCG@10: 0.7234

Timing:
  embedding_time_seconds: 45.23
  insert_time_seconds: 12.67
  search_time_seconds: 8.92
  avg_query_time_seconds: 0.0297
```
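As a quick sanity check on these fields: `avg_query_time_seconds` is simply total search time divided by the query count, and the example numbers above are consistent with scifact's 300 queries (see the dataset table):

```python
search_time_seconds = 8.92  # from the example output above
num_queries = 300           # scifact query count

avg_query_time = search_time_seconds / num_queries
print(round(avg_query_time, 4))  # 0.0297, matching avg_query_time_seconds
```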
## Running Manually

You can also run the benchmark manually, without the shell script:

```bash
# Start Dgraph
docker-compose up -d

# Wait for Dgraph to be ready
curl http://localhost:8080/health

# Run the evaluation
uv run python evaluate.py

# Stop Dgraph
docker-compose down -v
```

## Troubleshooting

### Docker Issues

If Dgraph fails to start:

```bash
# Check container logs
docker-compose logs

# Clean restart
docker-compose down -v
docker-compose up -d
```

### Memory Issues

For large datasets, you may need to increase Docker's memory limits or use a smaller dataset/batch size.

### Connection Issues

Ensure ports 8080 and 9080 are not already in use:

```bash
lsof -i :8080
lsof -i :9080
```

## Extending the Benchmark

### Adding New Datasets

Simply change `DATASET_NAME` in `.env` to any BEIR dataset name. The dataset will be downloaded automatically.

### Custom Embedding Models

Change `EMBEDDING_MODEL` to any sentence-transformers model:

- `all-MiniLM-L6-v2` (384 dim, fast)
- `all-mpnet-base-v2` (768 dim, better quality)
- `multi-qa-MiniLM-L6-cos-v1` (384 dim, optimized for QA)

## References

- [BEIR Benchmark](https://github.com/beir-cellar/beir)
- [Dgraph Documentation](https://dgraph.io/docs/)
- [Sentence Transformers](https://www.sbert.net/)
- [HNSW Algorithm](https://arxiv.org/abs/1603.09320)

## License

This benchmark suite follows the same license as the parent repository.

vector/beir/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
"""BEIR benchmark suite for Dgraph vector search."""

__version__ = "0.1.0"
