bigANNOY exposes two kinds of choices that matter in
practice:
- the metric, which defines what “near” means
- the tuning controls, which trade build cost, search cost, and search quality against one another
This vignette walks through both with small concrete examples and then ends with a lightweight tuning workflow you can reuse on your own data.
A Small Dataset for Metric Comparisons
To make metric behavior easier to see, we will use a tiny reference set with a few deliberately different vector directions and magnitudes.
tune_dir <- tempfile("bigannoy-tuning-")
dir.create(tune_dir, recursive = TRUE, showWarnings = FALSE)
ref_labels <- c(
"unit_x",
"double_x",
"unit_y",
"tilted_x",
"unit_z",
"diag_xy"
)
ref_dense <- matrix(
c(
1.0, 0.0, 0.0,
2.0, 0.0, 0.0,
0.0, 1.0, 0.0,
0.8, 0.2, 0.0,
0.0, 0.0, 1.0,
1.0, 1.0, 0.0
),
ncol = 3,
byrow = TRUE
)
query_dense <- matrix(
c(
1.0, 0.0, 0.0,
0.9, 0.1, 0.0
),
ncol = 3,
byrow = TRUE
)
# as.big.matrix() is provided by the bigmemory package
ref_big <- as.big.matrix(ref_dense)
data.frame(
index = seq_along(ref_labels),
label = ref_labels,
ref_dense,
row.names = NULL
)
#> index label X1 X2 X3
#> 1 1 unit_x 1.0 0.0 0
#> 2 2 double_x 2.0 0.0 0
#> 3 3 unit_y 0.0 1.0 0
#> 4 4 tilted_x 0.8 0.2 0
#> 5 5 unit_z 0.0 0.0 1
#> 6 6 diag_xy 1.0 1.0 0

Supported Metrics
bigANNOY currently supports:
- `"euclidean"`
- `"angular"`
- `"manhattan"`
- `"dot"`
The most important rule of thumb is that distances are only directly comparable within the same metric. A Euclidean distance and an angular distance are not on the same scale and should not be interpreted as if they meant the same thing.
Compare Metrics on the Same Queries
Here is the same search performed under all four metrics.
metric_table <- do.call(
rbind,
lapply(c("euclidean", "angular", "manhattan", "dot"), function(metric) {
index_path <- file.path(tune_dir, sprintf("%s.ann", metric))
idx <- annoy_build_bigmatrix(
ref_big,
path = index_path,
metric = metric,
n_trees = 20L,
seed = 123L,
load_mode = "eager"
)
res <- annoy_search_bigmatrix(
idx,
query = query_dense,
k = 2L,
search_k = 100L
)
data.frame(
metric = metric,
q1_top1 = ref_labels[res$index[1, 1]],
q1_distance = round(res$distance[1, 1], 3),
q2_top1 = ref_labels[res$index[2, 1]],
q2_distance = round(res$distance[2, 1], 3),
stringsAsFactors = FALSE
)
})
)
metric_table
#> metric q1_top1 q1_distance q2_top1 q2_distance
#> 1 euclidean unit_x 0 tilted_x 0.141
#> 2 angular unit_x 0 unit_x 0.111
#> 3 manhattan unit_x 0 tilted_x 0.200
#> 4 dot double_x 2 double_x 1.800

Even on this toy example, the metric choice changes how rows are ranked.
The practical interpretation is:
- use `"euclidean"` when straight-line distance in the original space is what you care about, and especially when you want the most direct comparison with bigKNN
- use `"angular"` when vector direction matters more than magnitude
- use `"manhattan"` when coordinatewise absolute deviations are a more natural notion of difference than Euclidean distance
- use `"dot"` when inner-product-style ranking is closer to the scoring rule you want
For non-Euclidean metrics, treat the returned distance
matrix as the Annoy-backend distance for that metric rather than as
something you can compare directly to Euclidean values.
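That said, the angular values can still be sanity-checked with plain geometry: upstream Annoy reports angular distance as sqrt(2 * (1 - cosine similarity)). The base-R sketch below reproduces the angular `q2_distance` from the table above by hand.

```r
# Annoy's angular distance is sqrt(2 * (1 - cosine similarity))
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

q2     <- c(0.9, 0.1, 0.0)  # second query row
unit_x <- c(1.0, 0.0, 0.0)  # its angular nearest neighbour above

sqrt(2 * (1 - cos_sim(q2, unit_x)))  # approximately 0.111, as in the table
```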
Build-Time Controls
The most important build-time controls are:
- `n_trees`
- `seed`
- `build_threads`
- `block_size`
- `load_mode`
n_trees
n_trees is the main quality-versus-build-cost knob at
index build time.
- more trees usually improve search quality
- more trees usually increase build time and index size
- very small tree counts are useful for quick experiments but usually not appropriate for final production settings
seed
seed makes index construction reproducible. This is
especially useful when you are benchmarking different settings and want
to reduce one source of variation between runs.
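To see that guarantee concretely, the sketch below (reusing the builder and search helpers shown above; the file names are illustrative) builds the toy index twice with the same seed and checks that the neighbour lists agree.

```r
idx_a <- annoy_build_bigmatrix(
  ref_big, path = file.path(tune_dir, "seed-a.ann"),
  metric = "euclidean", n_trees = 10L, seed = 42L
)
idx_b <- annoy_build_bigmatrix(
  ref_big, path = file.path(tune_dir, "seed-b.ann"),
  metric = "euclidean", n_trees = 10L, seed = 42L
)

# Same data + same seed should give identical neighbour lists
identical(
  annoy_search_bigmatrix(idx_a, query = query_dense, k = 3L)$index,
  annoy_search_bigmatrix(idx_b, query = query_dense, k = 3L)$index
)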
build_threads
build_threads is passed to the native C++ backend.
- `-1L` means "use Annoy's default"
- positive integers request an explicit build-thread count
- the debug-only R backend ignores this control
block_size
block_size controls how many rows are processed per
streamed block while building and searching. This is mostly an
execution-behavior knob, not a quality knob.
- smaller blocks can reduce transient memory pressure
- larger blocks can reduce overhead in some workloads
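As a sketch (assuming the build helper accepts a `block_size` argument, per the control described above), an explicit block size can be set per call rather than via the session default:

```r
idx_blocks <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(tune_dir, "blocks.ann"),
  metric = "euclidean",
  n_trees = 10L,
  seed = 123L,
  block_size = 2L  # tiny blocks: more streaming steps, lower peak memory
)
```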
load_mode
load_mode controls session behavior, not search
quality:
- `"lazy"` delays opening the native handle until the first search
- `"eager"` opens the handle immediately
Here is a simple side-by-side example.
lazy_index <- annoy_build_bigmatrix(
ref_big,
path = file.path(tune_dir, "lazy.ann"),
metric = "euclidean",
n_trees = 8L,
seed = 123L,
load_mode = "lazy"
)
eager_index <- annoy_build_bigmatrix(
ref_big,
path = file.path(tune_dir, "eager.ann"),
metric = "euclidean",
n_trees = 25L,
seed = 123L,
load_mode = "eager"
)
c(
lazy_loaded = annoy_is_loaded(lazy_index),
eager_loaded = annoy_is_loaded(eager_index)
)
#> lazy_loaded eager_loaded
#> FALSE TRUE

Query-Time Controls
The most important search-time controls are:
- `k`
- `search_k`
- `block_size`
- `prefault`
k
k is simply the number of neighbours you want returned.
It changes the shape of the result and the amount of work the search
must do.
search_k
search_k is the main quality-versus-search-cost knob at
query time.
- larger values usually improve search quality
- larger values usually increase search time
- `-1L` lets Annoy use its default search budget
When you start tuning, this is usually the first knob to increase.
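A quick way to act on that advice is to sweep `search_k` on a fixed index while holding everything else constant. The sketch below reuses the helpers from earlier on the toy data.

```r
sweep_idx <- annoy_build_bigmatrix(
  ref_big, path = file.path(tune_dir, "sweep.ann"),
  metric = "euclidean", n_trees = 10L, seed = 123L
)

# Same index, same queries: only the search budget changes
for (sk in c(-1L, 50L, 200L)) {
  res <- annoy_search_bigmatrix(
    sweep_idx, query = query_dense, k = 2L, search_k = sk
  )
  cat("search_k =", sk, "-> q1 top-1:", ref_labels[res$index[1, 1]], "\n")
}
```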
block_size
At search time, block_size controls how many query rows
are processed per block. As with build-time blocking, this affects
execution behavior more than quality.
prefault
prefault controls how the persisted Annoy index is
loaded by the native backend. It can be useful for repeated search
workloads on some platforms, but it is not guaranteed to have the same
effect everywhere.
reopened <- annoy_open_index(
eager_index$path,
prefault = TRUE,
load_mode = "eager"
)
result <- annoy_search_bigmatrix(
reopened,
query = query_dense,
k = 2L,
search_k = 100L,
prefault = TRUE
)

Because `prefault` depends on platform and OS support, it is best treated as a workload-specific optimization rather than as a universal default.
Use the Benchmark Helpers to Tune n_trees and search_k
Once you know which metric is appropriate, the next question is
usually how far to push n_trees and
search_k.
The benchmark helpers are the easiest way to study that trade-off.
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
tuning_suite <- benchmark_annoy_recall_suite(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 3L,
n_trees = c(5L, 20L),
search_k = c(-1L, 50L, 200L),
metric = "euclidean",
exact = TRUE,
path_dir = tune_dir
)
tuning_suite$summary[, c(
"n_trees",
"search_k",
"build_elapsed",
"search_elapsed",
"recall_at_k"
)]
} else {
tuning_suite <- benchmark_annoy_recall_suite(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 3L,
n_trees = c(5L, 20L),
search_k = c(-1L, 50L, 200L),
metric = "euclidean",
exact = FALSE,
path_dir = tune_dir
)
tuning_suite$summary[, c(
"n_trees",
"search_k",
"build_elapsed",
"search_elapsed"
)]
}
#> n_trees search_k build_elapsed search_elapsed recall_at_k
#> 1 5 -1 0.006 0.000 0.6833333
#> 2 5 50 0.006 0.000 0.9666667
#> 3 5 200 0.006 0.001 1.0000000
#> 4 20 -1 0.010 0.001 0.9666667
#> 5 20 50 0.010 0.001 0.9666667
#> 6 20 200 0.010 0.000 1.0000000

That table is the practical center of most tuning work:
- if recall is available, compare it against search time
- if recall is not available yet, compare build and search timing first
- only benchmark metrics against each other when those metrics make sense for the same modelling problem
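One way to turn the summary into a decision is to pick the cheapest setting that clears a recall target. The sketch below assumes the `recall_at_k` column is present, i.e. the exact baseline was available.

```r
target_recall <- 0.95
candidates <- subset(tuning_suite$summary, recall_at_k >= target_recall)

# Cheapest acceptable setting: lowest search time, then lowest build time
candidates[order(candidates$search_elapsed, candidates$build_elapsed), ][1, ]
```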
Package-Level Defaults
bigANNOY also exposes a few package options that are
useful in repeated tuning sessions.
list(
block_size_default = getOption("bigANNOY.block_size", 1024L),
progress_default = getOption("bigANNOY.progress", FALSE),
backend_default = getOption("bigANNOY.backend", "cpp")
)
#> $block_size_default
#> [1] 1024
#>
#> $progress_default
#> [1] FALSE
#>
#> $backend_default
#> "cpp"

In practice:
- set `options(bigANNOY.block_size = ...)` when you want a session-wide block size default
- set `options(bigANNOY.progress = TRUE)` when you want progress messages during long runs
- keep the native C++ backend as the default for real performance work
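Setting these is ordinary `options()` use; the snippet below (option names as printed above) changes the defaults for a session and restores them afterwards.

```r
old <- options(
  bigANNOY.block_size = 4096L,
  bigANNOY.progress   = TRUE
)
getOption("bigANNOY.block_size")

# ... tuning runs here ...

options(old)  # restore the previous defaults
```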
A Practical Tuning Pattern
A useful workflow is:
- choose the metric that best matches the meaning of similarity in your data
- start with a moderate `n_trees` and a modest `search_k`
- benchmark a small grid of `n_trees` by `search_k`
- increase `search_k` first if quality is too low
- rebuild with more trees when higher search budgets alone are not enough
- revisit `block_size`, `load_mode`, and `prefault` only after the main quality-versus-latency trade-off is understood
Recap
The most important ideas in bigANNOY tuning are:
- metric choice comes first
- `n_trees` mostly controls build-time quality investment
- `search_k` mostly controls query-time quality investment
- `block_size`, `load_mode`, and `prefault` mostly affect execution behavior rather than neighbour semantics
- Euclidean tuning is the easiest place to start when you want an exact baseline with bigKNN
The next vignette after this one is usually Validation and Sharing Indexes, which focuses on sidecar metadata, persisted files, and safe reuse across sessions.