Benchmarking Recall and Latency
Source: vignettes/benchmarking-recall-and-latency.Rmd
bigANNOY includes exported benchmark helpers so you can measure several related things with the same interface:
- index build time
- search time
- optional Euclidean recall against an exact bigKNN baseline
- comparison against direct RcppAnnoy
- scaling with data volume and generated index size
This vignette shows how to use those helpers for both quick one-off runs and small parameter sweeps.
What the Benchmark Helpers Do
The package currently exports four benchmark functions:
- benchmark_annoy_bigmatrix() for one build-and-search configuration
- benchmark_annoy_recall_suite() for a grid of n_trees and search_k settings on the same dataset
- benchmark_annoy_vs_rcppannoy() for a direct comparison between the package’s bigmemory workflow and a dense RcppAnnoy baseline
- benchmark_annoy_volume_suite() for scaling studies across larger synthetic data sizes
These helpers can work with:
- synthetic data generated on the fly
- user-supplied dense matrices
- big.matrix inputs, descriptors, descriptor paths, and external pointers
They can also write summaries to CSV so results can be saved outside the current R session, and the comparison helpers add byte-oriented fields for the reference data, query data, Annoy index file, and total persisted artifacts.
Create a Benchmark Workspace
We will write any temporary benchmark files into a dedicated directory so the workflow is easy to inspect.
bench_dir <- tempfile("bigannoy-benchmark-")
dir.create(bench_dir, recursive = TRUE, showWarnings = FALSE)
bench_dir
#> [1] "/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmpbvo2EP/bigannoy-benchmark-31cd4781d2d"
A Single Synthetic Benchmark Run
The simplest benchmark call uses synthetic data. This is useful when
you want a quick sense of how build and search times respond to
n_trees, search_k, and the problem
dimensions.
single_csv <- file.path(bench_dir, "single.csv")
single <- benchmark_annoy_bigmatrix(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 3L,
n_trees = 10L,
search_k = 50L,
exact = FALSE,
path_dir = bench_dir,
output_path = single_csv,
load_mode = "eager"
)
single$summary
#> metric backend filebacked self_search load_mode n_ref n_query n_dim k
#> 1 euclidean cpp FALSE FALSE eager 200 20 6 3
#> n_trees search_k build_threads build_elapsed search_elapsed exact_elapsed
#> 1 10 50 -1 0.203 0 NA
#> recall_at_k index_id
#> 1 NA annoy-20260327005516-1c5bd3cc4ee2
The returned object contains more than just the summary row.
names(single)
#> [1] "summary" "params" "index_path" "metadata_path"
#> [5] "exact_available" "validation"
single$params
#> metric backend filebacked self_search load_mode n_ref n_query n_dim k
#> 1 euclidean cpp FALSE FALSE eager 200 20 6 3
#> n_trees search_k build_threads
#> 1 10 50 -1
single$exact_available
#> [1] FALSE
Because exact = FALSE, the benchmark skips the exact bigKNN comparison and focuses only on the approximate Annoy path.
Validation Is Part of the Benchmark Workflow
The benchmark helpers also validate the built Annoy index before measuring the search step. That helps ensure the timing result corresponds to a usable, reopenable index rather than a partially successful build.
single$validation$valid
#> [1] TRUE
single$validation$checks[, c("check", "passed", "severity")]
#> check passed severity
#> 1 index_file TRUE error
#> 2 metric TRUE error
#> 3 dimensions TRUE error
#> 4 items TRUE error
#> 5 file_size TRUE error
#> 6 file_md5 TRUE error
#> 7 file_mtime TRUE warning
#> 8 load TRUE error
The same summary is also written to CSV when output_path is supplied.
read.csv(single_csv, stringsAsFactors = FALSE)
#> metric backend filebacked self_search load_mode n_ref n_query n_dim k
#> 1 euclidean cpp FALSE FALSE eager 200 20 6 3
#> n_trees search_k build_threads build_elapsed search_elapsed exact_elapsed
#> 1 10 50 -1 0.203 0 NA
#> recall_at_k index_id
#> 1 NA annoy-20260327005516-1c5bd3cc4ee2
External-Query Versus Self-Search Benchmarks
One subtle but important detail is how synthetic data generation works:
- if x = NULL and query is omitted, the benchmark generates a separate synthetic query matrix
- if x = NULL and query = NULL is supplied explicitly, the benchmark runs self-search on the reference matrix
That difference is reflected in the self_search and
n_query fields.
external_run <- benchmark_annoy_bigmatrix(
n_ref = 120L,
n_query = 12L,
n_dim = 5L,
k = 3L,
n_trees = 8L,
exact = FALSE,
path_dir = bench_dir
)
self_run <- benchmark_annoy_bigmatrix(
n_ref = 120L,
query = NULL,
n_dim = 5L,
k = 3L,
n_trees = 8L,
exact = FALSE,
path_dir = bench_dir
)
shape_cols <- c("self_search", "n_ref", "n_query", "k")
rbind(
external = external_run[["summary"]][, shape_cols],
self = self_run[["summary"]][, shape_cols]
)
#> self_search n_ref n_query k
#> external FALSE 120 12 3
#> self TRUE 120 120 3
That distinction matters when you are benchmarking workflows that mirror either training-set neighbour search or truly external query traffic.
Benchmark a Recall Suite Across Parameter Grids
For tuning work, a single benchmark point is usually not enough. The
suite helper runs a grid of n_trees and
search_k values on the same dataset so you can compare
trade-offs more systematically.
suite_csv <- file.path(bench_dir, "suite.csv")
suite <- benchmark_annoy_recall_suite(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 3L,
n_trees = c(5L, 10L),
search_k = c(-1L, 50L),
exact = FALSE,
path_dir = bench_dir,
output_path = suite_csv,
load_mode = "eager"
)
suite$summary
#> metric backend filebacked self_search load_mode n_ref n_query n_dim k
#> 1 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 2 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 3 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 4 euclidean cpp FALSE FALSE eager 200 20 6 3
#> n_trees search_k build_threads build_elapsed search_elapsed exact_elapsed
#> 1 5 -1 -1 0.005 0.000 NA
#> 2 5 50 -1 0.005 0.001 NA
#> 3 10 -1 -1 0.006 0.000 NA
#> 4 10 50 -1 0.006 0.000 NA
#> recall_at_k index_id
#> 1 NA annoy-20260327005516-8ca097928d75
#> 2 NA annoy-20260327005516-8ca097928d75
#> 3 NA annoy-20260327005516-1c5bd3cc4ee2
#> 4 NA annoy-20260327005516-1c5bd3cc4ee2
Each row corresponds to one (n_trees, search_k) configuration on the same underlying benchmark dataset.
The saved CSV contains the same summary table.
read.csv(suite_csv, stringsAsFactors = FALSE)
#> metric backend filebacked self_search load_mode n_ref n_query n_dim k
#> 1 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 2 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 3 euclidean cpp FALSE FALSE eager 200 20 6 3
#> 4 euclidean cpp FALSE FALSE eager 200 20 6 3
#> n_trees search_k build_threads build_elapsed search_elapsed exact_elapsed
#> 1 5 -1 -1 0.005 0.000 NA
#> 2 5 50 -1 0.005 0.001 NA
#> 3 10 -1 -1 0.006 0.000 NA
#> 4 10 50 -1 0.006 0.000 NA
#> recall_at_k index_id
#> 1 NA annoy-20260327005516-8ca097928d75
#> 2 NA annoy-20260327005516-8ca097928d75
#> 3 NA annoy-20260327005516-1c5bd3cc4ee2
#> 4 NA annoy-20260327005516-1c5bd3cc4ee2
Optional Exact Recall Against bigKNN
For Euclidean workloads, the benchmark helpers can optionally compare
Annoy results against the exact bigKNN baseline and
report:
- exact_elapsed
- recall_at_k
That comparison is only available when the runtime package
bigKNN is installed.
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
exact_run <- benchmark_annoy_bigmatrix(
n_ref = 150L,
n_query = 15L,
n_dim = 5L,
k = 3L,
n_trees = 10L,
search_k = 50L,
metric = "euclidean",
exact = TRUE,
path_dir = bench_dir
)
exact_run$exact_available
exact_run$summary[, c("build_elapsed", "search_elapsed", "exact_elapsed", "recall_at_k")]
} else {
"Exact baseline example skipped because bigKNN is not installed."
}
#> build_elapsed search_elapsed exact_elapsed recall_at_k
#> 1 0.006 0 0.003 0.9555556
This is the most direct way to answer the practical question, “How much search speed am I buying, and what recall do I lose in return?”
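Conceptually, recall_at_k is the average overlap between the approximate and exact top-k neighbour sets, averaged over queries. The base-R sketch below illustrates that computation on two small hypothetical neighbour-id matrices (one row of k ids per query); it is not the package’s internal implementation, just the idea behind the number.

```r
# Hypothetical neighbour-id matrices: one row per query, k columns.
# `approx_ids` stands in for the Annoy result, `exact_ids` for the exact baseline.
approx_ids <- rbind(c(1L, 2L, 3L), c(4L, 5L, 7L))
exact_ids  <- rbind(c(1L, 2L, 3L), c(4L, 5L, 6L))

k <- ncol(exact_ids)
per_query_overlap <- vapply(
  seq_len(nrow(exact_ids)),
  function(i) length(intersect(approx_ids[i, ], exact_ids[i, ])) / k,
  numeric(1)
)
mean(per_query_overlap)
#> [1] 0.8333333
```

Here the first query recovers all 3 exact neighbours and the second recovers 2 of 3, so recall at k is (1 + 2/3) / 2.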
Benchmark User-Supplied Data
Synthetic data is convenient, but real benchmarking usually needs real data. Both the single-run and suite helpers can accept user-supplied reference and query inputs.
ref <- matrix(rnorm(80 * 4), nrow = 80, ncol = 4)
query <- matrix(rnorm(12 * 4), nrow = 12, ncol = 4)
user_run <- benchmark_annoy_bigmatrix(
x = ref,
query = query,
k = 3L,
n_trees = 12L,
search_k = 40L,
exact = FALSE,
filebacked = TRUE,
path_dir = bench_dir,
load_mode = "eager"
)
user_run$summary[, c(
"filebacked",
"self_search",
"n_ref",
"n_query",
"n_dim",
"build_elapsed",
"search_elapsed"
)]
#> filebacked self_search n_ref n_query n_dim build_elapsed search_elapsed
#> 1 TRUE FALSE 80 12 4 0.005 0
When filebacked = TRUE, dense reference inputs are first converted into a file-backed big.matrix before the Annoy build starts. That can be useful when you want the benchmark workflow to resemble the package’s real persisted data path more closely.
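For intuition, that conversion is roughly what the bigmemory package does with as.big.matrix(); the sketch below shows a standalone file-backed conversion. The backing and descriptor file names are illustrative, not the names the benchmark helper actually uses.

```r
library(bigmemory)

# Convert a dense matrix into a file-backed big.matrix, roughly mirroring
# what filebacked = TRUE does before the Annoy build. Paths are illustrative.
ref <- matrix(rnorm(80 * 4), nrow = 80, ncol = 4)
backing_dir <- tempfile("bigannoy-backing-")
dir.create(backing_dir)

ref_big <- as.big.matrix(
  ref,
  type           = "double",
  backingfile    = "ref.bin",
  descriptorfile = "ref.desc",
  backingpath    = backing_dir
)
ref_big[1:2, ]  # reads now go through the file-backed storage
```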
Compare bigANNOY with Direct RcppAnnoy
When you want to understand the cost of the
bigmemory-oriented wrapper itself, the most useful
benchmark is not an exact Euclidean baseline. It is a direct comparison
with plain RcppAnnoy, using the same synthetic dataset, the
same metric, the same n_trees, and the same
search_k.
That is what benchmark_annoy_vs_rcppannoy()
provides.
compare_csv <- file.path(bench_dir, "compare.csv")
compare_run <- benchmark_annoy_vs_rcppannoy(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 3L,
n_trees = 10L,
search_k = 50L,
exact = FALSE,
path_dir = bench_dir,
output_path = compare_csv,
load_mode = "eager"
)
compare_run$summary[, c(
"implementation",
"reference_storage",
"n_ref",
"n_query",
"n_dim",
"total_data_bytes",
"index_bytes",
"build_elapsed",
"search_elapsed"
)]
#> implementation reference_storage n_ref n_query n_dim total_data_bytes
#> 1 bigANNOY bigmatrix 200 20 6 10560
#> 2 RcppAnnoy dense_matrix 200 20 6 10560
#> index_bytes build_elapsed search_elapsed
#> 1 35840 0.006 0.001
#> 2 35840 0.003 0.001
This benchmark is useful for a different question from the earlier exact baseline:
- benchmark_annoy_bigmatrix() asks how approximate Annoy behaves on a given dataset and, optionally, how much recall it loses against exact bigKNN
- benchmark_annoy_vs_rcppannoy() asks how much overhead or benefit comes from the package’s bigmemory and persistence workflow relative to direct RcppAnnoy
The output also includes data-volume fields:
- ref_bytes: estimated bytes in the reference matrix
- query_bytes: estimated bytes in the query matrix
- total_data_bytes: reference plus effective query volume
- index_bytes: bytes in the saved Annoy index
- metadata_bytes: bytes in the sidecar metadata file
- artifact_bytes: persisted Annoy artifacts written by the workflow
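Assuming double-precision storage at 8 bytes per value, the data-volume fields follow a simple rows × columns × 8 estimate, which is consistent with the figures in the comparison run above:

```r
ref_bytes   <- 200 * 6 * 8   # 9600, as reported for the 200 x 6 reference matrix
query_bytes <- 20 * 6 * 8    # 960, as reported for the 20 x 6 query matrix
ref_bytes + query_bytes      # 10560, matching total_data_bytes
```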
The generated CSV contains the same comparison table.
read.csv(compare_csv, stringsAsFactors = FALSE)[, c(
"implementation",
"ref_bytes",
"query_bytes",
"index_bytes",
"metadata_bytes",
"artifact_bytes"
)]
#> implementation ref_bytes query_bytes index_bytes metadata_bytes
#> 1 bigANNOY 9600 960 35840 1188
#> 2 RcppAnnoy 9600 960 35840 0
#> artifact_bytes
#> 1 37028
#> 2 35840
In practice, the comparison table helps answer two operational questions:
- Is bigANNOY close enough to plain RcppAnnoy on build and search speed for this workload?
- How large is the persisted Annoy index relative to the input data volume?
Benchmark Scaling by Data Volume
A single comparison point is useful, but it does not tell you whether
the wrapper overhead stays modest as the problem gets larger. The volume
suite runs the same bigANNOY versus RcppAnnoy
comparison across a grid of synthetic data sizes.
volume_csv <- file.path(bench_dir, "volume.csv")
volume_run <- benchmark_annoy_volume_suite(
n_ref = c(200L, 500L),
n_query = 20L,
n_dim = c(6L, 12L),
k = 3L,
n_trees = 10L,
search_k = 50L,
exact = FALSE,
path_dir = bench_dir,
output_path = volume_csv,
load_mode = "eager"
)
volume_run$summary[, c(
"implementation",
"n_ref",
"n_dim",
"total_data_bytes",
"index_bytes",
"build_elapsed",
"search_elapsed"
)]
#> implementation n_ref n_dim total_data_bytes index_bytes build_elapsed
#> 1 bigANNOY 200 6 10560 35840 0.006
#> 2 RcppAnnoy 200 6 10560 35840 0.003
#> 3 bigANNOY 200 12 21120 36864 0.006
#> 4 RcppAnnoy 200 12 21120 36864 0.003
#> 5 bigANNOY 500 6 24960 89440 0.010
#> 6 RcppAnnoy 500 6 24960 89440 0.008
#> 7 bigANNOY 500 12 49920 99072 0.010
#> 8 RcppAnnoy 500 12 49920 99072 0.008
#> search_elapsed
#> 1 0.001
#> 2 0.001
#> 3 0.001
#> 4 0.001
#> 5 0.000
#> 6 0.000
#> 7 0.000
#> 8 0.001
This kind of table is especially useful when you want to prepare a more formal benchmark note for a package release or for internal performance regression tracking:
- it shows how build time changes as reference size grows
- it shows how query time changes as dimension grows
- it shows whether index size scales roughly as expected with data volume
- it makes the bigANNOY versus direct RcppAnnoy gap visible across more than one benchmark point
Interpreting the Main Summary Columns
The most useful summary fields are:
- build_elapsed: time spent creating the Annoy index
- search_elapsed: time spent running the search step
- exact_elapsed: time spent on the exact Euclidean baseline, when available
- recall_at_k: average overlap with the exact top-k neighbours
- implementation: whether the row came from bigANNOY or direct RcppAnnoy
- n_trees: index quality/size control at build time
- search_k: query effort control at search time
- self_search: whether the benchmark searched the reference rows against themselves
- filebacked: whether dense reference data was converted into a file-backed big.matrix
- ref_bytes, query_bytes, and index_bytes: the rough data and artifact volume associated with the benchmark
In practice:
- raise search_k first when recall is too low
- increase n_trees when higher search budgets alone are not enough
- compare search_elapsed and recall_at_k together instead of optimizing either one in isolation
- use benchmark_annoy_vs_rcppannoy() when you want to reason about package overhead rather than approximate-versus-exact quality
- use benchmark_annoy_volume_suite() when you need a more formal scaling table for release notes or internal reports
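One way to apply the “compare recall and latency together” advice is to filter a suite summary to rows that meet a recall target and then take the fastest search among them. The data frame below is a hypothetical suite result used purely for illustration:

```r
# Hypothetical suite summary: among rows meeting a recall target,
# pick the configuration with the fastest search.
suite_summary <- data.frame(
  n_trees        = c(5, 5, 10, 10),
  search_k       = c(-1, 50, -1, 50),
  search_elapsed = c(0.001, 0.002, 0.001, 0.003),
  recall_at_k    = c(0.82, 0.91, 0.90, 0.97)
)

candidates <- suite_summary[suite_summary$recall_at_k >= 0.9, ]
candidates[which.min(candidates$search_elapsed), ]
# selects the n_trees = 10, search_k = -1 row in this toy data
```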
Installed Benchmark Runner
The package also installs a command-line benchmark script. That is convenient when you want to run a benchmark outside an interactive R session or save CSV output from shell scripts.
The installed path is:
system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")
#> [1] "/private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpLS2XEU/temp_libpath123416a4ba7e8/bigANNOY/benchmarks/benchmark_annoy.R"
Example single-run command:
Rscript "$(R -q -e 'cat(system.file(\"benchmarks\", \"benchmark_annoy.R\", package = \"bigANNOY\"))')" \
--mode=single \
--n_ref=5000 \
--n_query=500 \
--n_dim=50 \
--k=20 \
--n_trees=100 \
--search_k=5000 \
--load_mode=eager
Example suite command:
Rscript "$(R -q -e 'cat(system.file(\"benchmarks\", \"benchmark_annoy.R\", package = \"bigANNOY\"))')" \
--mode=suite \
--n_ref=5000 \
--n_query=500 \
--n_dim=50 \
--k=20 \
--suite_trees=10,50,100 \
--suite_search_k=-1,2000,10000 \
--output_path=/tmp/bigannoy_suite.csv
Example direct-comparison command:
Rscript "$(R -q -e 'cat(system.file(\"benchmarks\", \"benchmark_annoy.R\", package = \"bigANNOY\"))')" \
--mode=compare \
--n_ref=5000 \
--n_query=500 \
--n_dim=50 \
--k=20 \
--n_trees=100 \
--search_k=5000 \
--load_mode=eager
Example volume-suite command:
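The source does not show the volume-suite invocation itself. A plausible sketch is below, assuming the runner accepts a `--mode=volume` option and comma-separated size grids mirroring the R-level benchmark_annoy_volume_suite() arguments; both the mode name and the grid flags are assumptions, not confirmed flags.

```shell
# Assumed invocation: --mode=volume and comma-separated n_ref/n_dim grids
# are guesses modeled on the other modes shown above.
Rscript "$(R -q -e 'cat(system.file(\"benchmarks\", \"benchmark_annoy.R\", package = \"bigANNOY\"))')" \
  --mode=volume \
  --n_ref=1000,5000 \
  --n_dim=25,50 \
  --n_query=500 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --output_path=/tmp/bigannoy_volume.csv
```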
Recommended Workflow
A practical tuning workflow usually looks like this:
- start with a small single benchmark to confirm dimensions and plumbing
- switch to a suite over a small n_trees by search_k grid
- enable exact Euclidean benchmarking when bigKNN is available
- compare recall and latency together
- repeat the same workflow on user-supplied data before drawing conclusions
Recap
bigANNOY’s benchmark helpers are designed to make
performance work part of the normal package workflow, not a separate ad
hoc script:
- benchmark_annoy_bigmatrix() for one configuration
- benchmark_annoy_recall_suite() for parameter sweeps
- benchmark_annoy_vs_rcppannoy() for direct implementation comparison
- benchmark_annoy_volume_suite() for speed and size scaling studies
- optional exact recall against bigKNN
- CSV output for saved summaries
- support for both synthetic and user-supplied data
The next vignette to read after this one is usually Metrics and Tuning, which goes deeper on how to choose metrics and search/build controls.