Approximate nearest-neighbour search for bigmemory matrices with Annoy
Frédéric Bertrand
The bigANNOY package provides approximate nearest-neighbour search specialised for bigmemory::big.matrix objects through persisted Annoy indexes. It keeps the reference data in bigmemory storage during build and query workflows, supports repeated-query sessions through explicit open/load helpers, and can stream neighbour indices and distances directly into destination big.matrix objects.
Current features include:
- native C++ bigmemory-backed build and search paths, with an R backend kept as a debug-only fallback,
- persisted Annoy indexes plus sidecar metadata for safe reopen and validation,
- Euclidean, angular, Manhattan, and dot-product Annoy metrics,
- self-search and external-query workflows on dense matrices, big.matrix objects, descriptors, descriptor paths, and external pointers,
- streamed output into file-backed or in-memory big.matrix destinations,
- explicit lifecycle helpers such as annoy_open_index(), annoy_load_bigmatrix(), annoy_is_loaded(), annoy_close_index(), and annoy_validate_index(), and
- benchmark helpers that can compare approximate Euclidean search against the exact bigKNN baseline when bigKNN is available.
These workflows make bigANNOY useful both as a standalone approximate search package and as the ANN side of an exact-versus-approximate evaluation pipeline built around bigKNN.
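The four metrics listed among the features can be made concrete with a few lines of base R. This is only a sketch on two hypothetical vectors, u and v, and it does not call bigANNOY. The angular formula follows the definition documented by the upstream Annoy library (Euclidean distance between normalised vectors); how bigANNOY scales or reports dot-product scores is not stated here, so the last value is simply the raw inner product.

```r
# Two small illustrative vectors (hypothetical values, not package data).
u <- c(1, 2, 3)
v <- c(2, 0, 1)

# Euclidean: straight-line distance.
euclidean <- sqrt(sum((u - v)^2))

# Manhattan: sum of absolute coordinate differences.
manhattan <- sum(abs(u - v))

# Angular, as documented by Annoy: sqrt(2 * (1 - cos(u, v))).
cosine  <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
angular <- sqrt(2 * (1 - cosine))

# Dot product: raw inner product (larger means more similar).
dot <- sum(u * v)

c(euclidean = euclidean, manhattan = manhattan, angular = angular, dot = dot)
```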
Installation
The package is currently easiest to install from GitHub:
# install.packages("remotes")
remotes::install_github("fbertran/bigANNOY")
If you prefer a local source install, clone the repository and run:
Options
The package defines a small set of runtime options:
| Option | Default value | Description |
|---|---|---|
| bigANNOY.block_size | 1024L | Default number of rows processed per build/search block. |
| bigANNOY.progress | FALSE | Emit simple progress messages during long-running builds, searches, and benchmarks. |
| bigANNOY.backend | "cpp" | Backend request. "cpp" uses the native compiled backend, "auto" falls back when compiled symbols are not loaded, and "r" forces the debug-only R backend. |
All options can be changed with options() at runtime. For example, options(bigANNOY.block_size = 2048L) increases the default block size used by the build and search helpers.
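A common pattern is to change options temporarily and restore them afterwards. The sketch below uses only base R. The block_ranges() helper is hypothetical and not part of bigANNOY; it only illustrates what a block size means for a matrix with a given number of rows, and makes no claim about how the package iterates internally.

```r
# options() returns the previous values invisibly, so they can be restored later.
old <- options(bigANNOY.block_size = 2048L, bigANNOY.progress = TRUE)

getOption("bigANNOY.block_size")

# Hypothetical helper (assumption, not a bigANNOY function): split n_rows
# into consecutive row ranges of at most block_size rows each.
block_ranges <- function(n_rows, block_size) {
  starts <- seq.int(1L, n_rows, by = block_size)
  ends <- pmin(starts + block_size - 1L, n_rows)
  cbind(start = starts, end = ends)
}

# With 5000 rows and a block size of 2048, the last block is shorter.
blocks <- block_ranges(5000L, getOption("bigANNOY.block_size"))
blocks

# Put the earlier settings back.
options(old)
```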
Examples
The examples below use a small Euclidean reference matrix so the returned neighbours are easy to inspect.
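For comparison, the exact Euclidean neighbours of this small data set can be worked out by brute force in base R. The sketch below repeats the reference and query values used in the examples; exact_knn() is a hypothetical helper written for illustration and is not part of bigANNOY or bigKNN.

```r
# Same values as the reference and query matrices in the examples.
reference <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 2, 2), ncol = 2, byrow = TRUE)
query <- matrix(c(0.1, 0.1, 1.8, 1.9), ncol = 2, byrow = TRUE)

# Hypothetical helper: exact k nearest neighbours by exhaustive search.
exact_knn <- function(reference, query, k) {
  index <- matrix(NA_integer_, nrow(query), k)
  distance <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Euclidean distance from query row i to every reference row.
    d <- sqrt(rowSums(sweep(reference, 2, query[i, ])^2))
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance)
}

exact <- exact_knn(reference, query, k = 2L)
exact$index
round(exact$distance, 3)
```

The first query point is closest to reference row 1, and rows 2 and 3 are equidistant from it, so either may legitimately appear as its second neighbour. The second query point is closest to row 5, followed by row 4.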
Build and query an Annoy index
library(bigmemory)
library(bigANNOY)
reference <- as.big.matrix(matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
))
query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)
index <- annoy_build_bigmatrix(
  reference,
  path = tempfile(fileext = ".ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)
result <- annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  search_k = 100L
)
result$index
round(result$distance, 3)
Reopen and validate a persisted index
reopened <- annoy_open_index(index$path, load_mode = "lazy")
annoy_is_loaded(reopened)
report <- annoy_validate_index(
  reopened,
  strict = TRUE,
  load = TRUE
)
report$valid
annoy_is_loaded(reopened)
Stream results into bigmemory outputs
index_store <- big.matrix(nrow(query), 2L, type = "integer")
distance_store <- big.matrix(nrow(query), 2L, type = "double")
annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  xpIndex = index_store,
  xpDistance = distance_store
)
bigmemory::as.matrix(index_store)
round(bigmemory::as.matrix(distance_store), 3)
Benchmark approximate Euclidean search
benchmark_annoy_bigmatrix(
  n_ref = 2000L,
  n_query = 200L,
  n_dim = 20L,
  k = 10L,
  n_trees = 50L,
  search_k = 1000L,
  metric = "euclidean",
  exact = TRUE
)
If bigKNN is installed, the Euclidean benchmark helpers also report exact search timing and recall against the exact baseline.
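Recall against an exact baseline is usually taken to mean the share of true nearest neighbours that the approximate search also returned. The sketch below shows one common way to compute it in base R from two neighbour-index matrices. The recall_at_k() helper and the small matrices are hypothetical; the exact definition used by the package's benchmark helpers is not documented here and may differ.

```r
# Hypothetical helper: fraction of exact neighbours recovered by the
# approximate search, pooled over all query rows.
recall_at_k <- function(approx_idx, exact_idx) {
  stopifnot(identical(dim(approx_idx), dim(exact_idx)))
  hits <- vapply(seq_len(nrow(exact_idx)), function(i) {
    # Order within a row does not matter, only set membership.
    length(intersect(approx_idx[i, ], exact_idx[i, ]))
  }, integer(1))
  sum(hits) / length(exact_idx)
}

# Illustrative neighbour indices for 2 queries with k = 3 (made-up values).
exact_idx <- matrix(c(1L, 2L, 3L,
                      5L, 4L, 2L), nrow = 2, byrow = TRUE)
approx_idx <- matrix(c(1L, 3L, 4L,
                       5L, 4L, 2L), nrow = 2, byrow = TRUE)

recall <- recall_at_k(approx_idx, exact_idx)
recall
```

Here the approximate result misses one of the six exact neighbours, so the recall is 5/6.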
Installed Benchmark Runner
An installed command-line benchmark script is also available at:
system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")
Example single-run command:
Vignettes
The package now ships with focused vignettes for the main workflows:
- getting-started-bigannoy
- persistent-indexes-and-lifecycle
- file-backed-bigmemory-workflows
- benchmarking-recall-and-latency
- metrics-and-tuning
- validation-and-sharing-indexes
- bigannoy-vs-bigknn
Together they cover the basic ANN workflow, loaded-index lifecycle, file-backed bigmemory usage, benchmarking and recall evaluation, tuning, validation and sharing of persisted indexes, and the relationship between approximate bigANNOY search and exact bigKNN search.