Approximate nearest-neighbour search for bigmemory matrices with Annoy

Frédéric Bertrand

The bigANNOY package provides approximate nearest-neighbour search specialised for bigmemory::big.matrix objects through persisted Annoy indexes. It keeps the reference data in bigmemory storage during build and query workflows, supports repeated-query sessions through explicit open/load helpers, and can stream neighbour indices and distances directly into destination big.matrix objects.

Current features include:

native C++ bigmemory-backed build and search paths, with an R backend kept as a debug-only fallback,
persisted Annoy indexes plus sidecar metadata for safe reopen and validation,
Euclidean, angular, Manhattan, and dot-product Annoy metrics,
self-search and external-query workflows on dense matrices, big.matrix objects, descriptors, descriptor paths, and external pointers,
streamed output into file-backed or in-memory big.matrix destinations,
explicit lifecycle helpers such as annoy_open_index(), annoy_load_bigmatrix(), annoy_is_loaded(), annoy_close_index(), and annoy_validate_index(), and
benchmark helpers that can compare approximate Euclidean search against the exact bigKNN baseline when bigKNN is available.

These workflows make bigANNOY useful both as a standalone approximate search package and as the ANN side of an exact-versus-approximate evaluation pipeline built around bigKNN.
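For orientation, the four metrics listed above can be written out in base R. This is only a sketch of the distance definitions as the upstream Annoy library documents them (angular as sqrt(2 * (1 - cos(u, v))), dot as the raw inner product); it is not bigANNOY's implementation. The helper name annoy_style_distance is hypothetical, and whether bigANNOY reports dot-product scores in exactly this form is an assumption to check against the package documentation.

```r
# Hypothetical helper (not part of bigANNOY): reference definitions of the
# four Annoy metric names for a pair of numeric vectors.
annoy_style_distance <- function(u, v,
                                 metric = c("euclidean", "angular",
                                            "manhattan", "dot")) {
  metric <- match.arg(metric)
  switch(metric,
    euclidean = sqrt(sum((u - v)^2)),
    manhattan = sum(abs(u - v)),
    angular = {
      # Euclidean distance between the two vectors after unit normalisation.
      cos_uv <- sum(u * v) / sqrt(sum(u^2) * sum(v^2))
      sqrt(max(0, 2 * (1 - cos_uv)))
    },
    # Inner product: larger values mean "closer", unlike the other three.
    dot = sum(u * v)
  )
}

u <- c(1, 0)
v <- c(0, 1)
metric_values <- vapply(
  c("euclidean", "angular", "manhattan", "dot"),
  function(m) annoy_style_distance(u, v, m),
  numeric(1)
)
metric_values
```

For these two orthogonal unit vectors the Euclidean and angular values coincide, which is a quick way to see that the angular metric ignores vector length.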

Installation

The package is currently easiest to install from GitHub:

# install.packages("remotes")
remotes::install_github("fbertran/bigANNOY")

If you prefer a local source install, clone the repository and run:

R CMD build bigANNOY
R CMD INSTALL bigANNOY_0.3.0.tar.gz

Options

The package defines a small set of runtime options:

Option	Default value	Description
`bigANNOY.block_size`	`1024L`	Default number of rows processed per build/search block.
`bigANNOY.progress`	`FALSE`	Emit simple progress messages during long-running builds, searches, and benchmarks.
`bigANNOY.backend`	`"cpp"`	Backend request. `"cpp"` uses the native compiled backend, `"auto"` falls back to the R backend when compiled symbols are not loaded, and `"r"` forces the debug-only R backend.

All options can be changed with options() at runtime. For example, options(bigANNOY.block_size = 2048L) increases the default block size used by the build and search helpers.
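A minimal sketch of changing the options temporarily, using only base R's options() and getOption(). The option names come from the table above; the chosen values are illustrative.

```r
# options() returns the previous values of the options it changes,
# which makes it easy to restore them afterwards.
old <- options(
  bigANNOY.block_size = 2048L,
  bigANNOY.progress = TRUE
)

# Inspect the value that build/search helpers would now pick up.
block_size_now <- getOption("bigANNOY.block_size")
block_size_now

# ... run builds or searches here ...

# Restore the earlier settings (options that were unset become unset again).
options(old)
```

Capturing and restoring the old values keeps a script from leaking its settings into the rest of the session.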

Examples

The examples below use a small Euclidean reference matrix so the returned neighbours are easy to inspect.

Build and query an Annoy index

library(bigmemory)
library(bigANNOY)

reference <- as.big.matrix(matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
))

query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)

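# Build the index and persist it to disk. In Annoy, more trees generally
# improve recall at the cost of a larger index and a longer build.
index <- annoy_build_bigmatrix(
  reference,
  path = tempfile(fileext = ".ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

# Query the k = 2 nearest neighbours of each query row. A larger search_k
# inspects more candidates, trading speed for accuracy.
result <- annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  search_k = 100L
)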

result$index
round(result$distance, 3)
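Because the example is tiny, the approximate result can be checked by hand against a brute-force search. The sketch below uses base R only and does not call bigANNOY; the helper name exact_knn is hypothetical, and comparing its output with result$index assumes that bigANNOY reports 1-based row indices.

```r
# Same reference and query points as above, as ordinary R matrices.
ref <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 2, 2), ncol = 2, byrow = TRUE)
qry <- matrix(c(0.1, 0.1, 1.8, 1.9), ncol = 2, byrow = TRUE)

# Hypothetical helper: exact Euclidean k-nearest neighbours by brute force.
exact_knn <- function(ref, qry, k) {
  idx <- matrix(NA_integer_, nrow(qry), k)
  dst <- matrix(NA_real_, nrow(qry), k)
  for (i in seq_len(nrow(qry))) {
    # Distance from query row i to every reference row.
    d <- sqrt(rowSums(sweep(ref, 2, qry[i, ])^2))
    ord <- order(d)[seq_len(k)]
    idx[i, ] <- ord
    dst[i, ] <- d[ord]
  }
  list(index = idx, distance = dst)
}

exact <- exact_knn(ref, qry, k = 2L)
exact$index
round(exact$distance, 3)
```

Reference rows 2 and 3 are equidistant from the first query point, so either one is a valid second neighbour there; a difference in that single position does not indicate an error.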

Reopen and validate a persisted index

reopened <- annoy_open_index(index$path, load_mode = "lazy")

annoy_is_loaded(reopened)

report <- annoy_validate_index(
  reopened,
  strict = TRUE,
  load = TRUE
)

report$valid
annoy_is_loaded(reopened)

Stream results into bigmemory outputs

index_store <- big.matrix(nrow(query), 2L, type = "integer")
distance_store <- big.matrix(nrow(query), 2L, type = "double")

annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  xpIndex = index_store,
  xpDistance = distance_store
)

bigmemory::as.matrix(index_store)
round(bigmemory::as.matrix(distance_store), 3)
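The destinations above live in memory. A file-backed variant can be sketched with bigmemory's filebacked.big.matrix() and attach.big.matrix(); the directory and file names below are illustrative, and the assumption is that such matrices are passed as xpIndex and xpDistance in the same way as the in-memory ones.

```r
library(bigmemory)

# Illustrative output directory for the backing and descriptor files.
out_dir <- tempfile("bigannoy-out-")
dir.create(out_dir)

n_query <- 2L
k <- 2L

# One integer matrix for neighbour indices, one double matrix for distances.
index_store <- filebacked.big.matrix(
  nrow = n_query, ncol = k, type = "integer",
  backingfile = "nn_index.bin",
  descriptorfile = "nn_index.desc",
  backingpath = out_dir
)
distance_store <- filebacked.big.matrix(
  nrow = n_query, ncol = k, type = "double",
  backingfile = "nn_distance.bin",
  descriptorfile = "nn_distance.desc",
  backingpath = out_dir
)

# Assumed usage, mirroring the in-memory call above:
# annoy_search_bigmatrix(index, query = query, k = k,
#                        xpIndex = index_store, xpDistance = distance_store)

# Later, possibly in another session, reattach through the descriptor file.
index_again <- attach.big.matrix(file.path(out_dir, "nn_index.desc"))
dim(index_again)
```

Keeping the descriptor files next to the backing files means the results can be reopened without holding them in R memory.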

Benchmark approximate Euclidean search

benchmark_annoy_bigmatrix(
  n_ref = 2000L,
  n_query = 200L,
  n_dim = 20L,
  k = 10L,
  n_trees = 50L,
  search_k = 1000L,
  metric = "euclidean",
  exact = TRUE
)

If bigKNN is installed, the Euclidean benchmark helpers also report exact search timing and recall against the exact baseline.
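To make the recall figure concrete, here is a base R sketch of one common definition of recall at k: the share of exact neighbours that also appear in the approximate result, pooled over all queries. The helper name recall_at_k and the toy index matrices are hypothetical, and the benchmark helpers may compute recall differently.

```r
# Hypothetical helper: fraction of exact neighbours recovered by the
# approximate search, with one row per query and k columns.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(seq_len(nrow(exact_index)), function(i) {
    # Order within a row does not matter, only set membership.
    length(intersect(approx_index[i, ], exact_index[i, ]))
  }, integer(1))
  sum(hits) / length(exact_index)
}

# Toy data: two queries, k = 3.
exact_index <- matrix(c(1L, 2L, 3L,
                        4L, 5L, 6L), nrow = 2, byrow = TRUE)
approx_index <- matrix(c(1L, 3L, 9L,
                         6L, 5L, 4L), nrow = 2, byrow = TRUE)

recall <- recall_at_k(approx_index, exact_index)
recall
```

Here the first query recovers two of its three exact neighbours and the second recovers all three, so the pooled recall is 5/6.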

Installed Benchmark Runner

An installed command-line benchmark script is also available at:

system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")

Example single-run command:

Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=single \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager

Vignettes

The package ships with focused vignettes for the main workflows:

getting-started-bigannoy
persistent-indexes-and-lifecycle
file-backed-bigmemory-workflows
benchmarking-recall-and-latency
metrics-and-tuning
validation-and-sharing-indexes
bigannoy-vs-bigknn

Together they cover the basic ANN workflow, loaded-index lifecycle, file-backed bigmemory usage, benchmarking and recall evaluation, tuning, validation and sharing of persisted indexes, and the relationship between approximate bigANNOY search and exact bigKNN search.