Getting Started with bigANNOY
bigANNOY is an approximate nearest-neighbour package for
bigmemory::big.matrix data. It builds a persisted Annoy
index from a reference matrix, searches that index with either
self-search or external queries, and returns results in a shape aligned
with bigKNN.
This vignette walks through the first workflow most users need:
- create a small reference matrix
- build an index on disk
- run self-search and external-query search
- inspect the returned neighbours and distances
- reopen and validate the index in a later step
The examples are intentionally small, but the same API is designed
for larger file-backed big.matrix inputs.
Create a Small Reference Matrix
bigANNOY is built around
bigmemory::big.matrix, so we will start from a dense matrix
and convert it into a big.matrix.
# Attach the packages used throughout this vignette
library(bigANNOY)
library(bigmemory)
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
#> [1] 8 4
The reference matrix has 8 rows and 4 columns. Each row is a candidate neighbour in the final search results.
Build the First Annoy Index
annoy_build_bigmatrix() streams the reference rows into
a persisted Annoy index and writes a sidecar metadata file next to
it.
index_path <- tempfile(fileext = ".ann")
index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 20L,
  metric = "euclidean",
  seed = 123L,
  load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmp3YPF1B/file321022f28c75.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmp3YPF1B/file321022f28c75.ann.meta
#> index_id: annoy-20260327005524-d4a614637b45
#> metric: euclidean
#> trees: 20
#> items: 8
#> dimension: 4
#> build_seed: 123
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 2816
#> file_md5: d4a614637b45839a9eb126d130a96397
#> prefault: FALSE
A few details are worth noticing:
- the Annoy index lives on disk at index$path
- metadata is written to index$metadata_path
- load_mode = "lazy" means the object is initially metadata-only
- the native handle is loaded automatically on first search
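The lazy behaviour described in the list above can be pictured with a small base-R sketch. This is not bigANNOY's implementation: make_lazy_handle() is a hypothetical helper, and the "handle" is simply the raw bytes of a temporary file standing in for a native Annoy handle.

```r
# Hypothetical sketch of lazy loading: the object starts as metadata only
# and reads the underlying file the first time it is actually needed.
make_lazy_handle <- function(path) {
  state <- new.env(parent = emptyenv())
  state$handle <- NULL  # nothing loaded yet
  list(
    path = path,
    is_loaded = function() !is.null(state$handle),
    get = function() {
      if (is.null(state$handle)) {
        # First use: read the file contents (stand-in for a native handle)
        state$handle <- readBin(path, what = "raw", n = file.size(path))
      }
      state$handle
    }
  )
}

# A 16-byte temporary file plays the role of the index on disk
demo_path <- tempfile(fileext = ".bin")
writeBin(as.raw(1:16), demo_path)

lazy <- make_lazy_handle(demo_path)
loaded_before <- lazy$is_loaded()  # metadata only at this point
payload <- lazy$get()              # first access triggers the load
loaded_after <- lazy$is_loaded()   # the handle is now cached for reuse
```

The point of the pattern is that creating the object stays cheap, and the cost of loading is paid once, by whichever call needs the data first.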
You can check the current loaded state directly.
annoy_is_loaded(index)
#> [1] FALSE
Run a Self-Search
With query = NULL, annoy_search_bigmatrix()
searches the indexed reference rows against themselves. In self-search
mode, the nearest neighbour for each row is another row, not the row
itself.
self_result <- annoy_search_bigmatrix(
  index,
  k = 2L,
  search_k = 100L
)
self_result$index
#>      [,1] [,2]
#> [1,]    2    3
#> [2,]    1    3
#> [3,]    2    1
#> [4,]    5    6
#> [5,]    4    6
#> [6,]    5    4
#> [7,]    8    4
#> [8,]    7    4
round(self_result$distance, 3)
#>      [,1]  [,2]
#> [1,]  0.2 0.346
#> [2,]  0.2 0.200
#> [3,]  0.2 0.346
#> [4,]  0.2 0.346
#> [5,]  0.2 0.200
#> [6,]  0.2 0.346
#> [7,]  0.2 4.000
#> [8,]  0.2 3.904
Because the first search loads the lazy index, the handle is now available for reuse.
annoy_is_loaded(index)
#> [1] TRUE
The result object follows the same high-level shape as
bigKNN:
str(self_result, max.level = 1)
#> List of 8
#> $ index : int [1:8, 1:2] 2 1 2 5 4 5 8 7 3 3 ...
#> $ distance: num [1:8, 1:2] 0.2 0.2 0.2 0.2 0.2 ...
#> $ k : int 2
#> $ metric : chr "euclidean"
#> $ n_ref : int 8
#> $ n_query : int 8
#> $ exact : logi FALSE
#> $ backend : chr "annoy"
In particular:
- index is a 1-based integer matrix
- distance is a double matrix
- k, metric, n_ref, and n_query describe the search
- exact is always FALSE for bigANNOY
- backend is "annoy"
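For a matrix this small, the self-search output shown earlier can be cross-checked by brute force in base R. The sketch below is an independent check rather than part of the bigANNOY API. It rebuilds the same reference values, computes every pairwise Euclidean distance with dist(), and excludes each row from its own neighbour list, as self-search does.

```r
# Same values as the reference matrix used in this vignette
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)

# Full pairwise distance matrix: fine for 8 rows, not for millions
d <- as.matrix(dist(ref_dense, method = "euclidean"))
diag(d) <- Inf  # self-search never returns the row itself

k <- 2L
exact_index <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))
exact_distance <- t(apply(d, 1, function(row) sort(row)[seq_len(k)]))
dimnames(exact_index) <- NULL
dimnames(exact_distance) <- NULL

exact_index
round(exact_distance, 3)
```

Rows 2 and 5 each have two neighbours at the same distance, so the order of those tied neighbours can differ between methods. Comparing the distance matrices is the more robust check.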
Search with an External Query Matrix
In practice, searching with external queries is often the more common workflow. Here we build a small dense query matrix with one row close to each of the three clusters in the reference data.
query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)
query_result <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 3L,
  search_k = 100L
)
query_result$index
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    5    4    6
#> [3,]    7    8    4
round(query_result$distance, 3)
#>       [,1]  [,2]  [,3]
#> [1,] 0.100 0.100 0.265
#> [2,] 0.087 0.132 0.240
#> [3,] 0.100 0.100 3.951
The three query rows each return three approximate neighbours from the indexed reference matrix. For small examples like this one, the results will typically look exact, but the important point is that the API stays the same for larger problems where approximate search is preferable.
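To see what "look exact" means here, the exact distances for the three query rows can be computed directly in base R. The helper dist_to_ref() below is a hypothetical name introduced only for this sketch; it measures the Euclidean distance from one query row to every reference row.

```r
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)
query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)

# Hypothetical helper: distance from one query vector to each reference row
dist_to_ref <- function(q, ref) sqrt(rowSums(sweep(ref, 2, q)^2))

k <- 3L
# One row per query, one column per reference row
cross <- t(apply(query_dense, 1, dist_to_ref, ref = ref_dense))
exact_query_distance <- t(apply(cross, 1, function(row) sort(row)[seq_len(k)]))

round(exact_query_distance, 3)
```

The rounded exact distances agree with the approximate distances printed above, which is what one would expect when search_k is large relative to an 8-row index.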
Tune the Main Search Controls
Two arguments matter most when you begin tuning:
- n_trees controls index quality and index size at build time
- search_k controls search effort at query time
As a starting point:
- increase search_k first if recall looks too low
- rebuild with more n_trees when query-time tuning alone is not enough
- keep metric = "euclidean" when you want the most direct comparison with bigKNN
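Deciding whether recall "looks too low" requires a number to look at. A common definition is the fraction of true k nearest neighbours that the approximate search recovers, averaged over queries. The sketch below implements that in base R; recall_at_k() is a hypothetical helper, not a bigANNOY function, and the approximate result is invented for illustration.

```r
# Fraction of exact neighbours recovered per query, averaged over queries.
# Both inputs are n_query x k matrices of 1-based reference row ids.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(
    seq_len(nrow(exact_index)),
    function(i) length(intersect(approx_index[i, ], exact_index[i, ])),
    integer(1)
  )
  mean(hits / ncol(exact_index))
}

# Exact neighbours for three queries with k = 3
exact_index <- matrix(
  c(1L, 2L, 3L,
    5L, 4L, 6L,
    7L, 8L, 4L),
  ncol = 3, byrow = TRUE
)
# Made-up approximate result: row 2 is reordered, row 3 misses one neighbour
approx_index <- matrix(
  c(1L, 2L, 3L,
    4L, 5L, 6L,
    7L, 8L, 5L),
  ncol = 3, byrow = TRUE
)

recall <- recall_at_k(approx_index, exact_index)
recall
```

Because the comparison uses set intersection, a reordered neighbour list still counts as a full hit; only missing neighbours reduce the score.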
The package also supports "angular",
"manhattan", and "dot" metrics, but Euclidean
is usually the easiest place to begin.
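As a reminder of what the four metric names measure, the sketch below computes each quantity for one pair of vectors in base R. The angular formula follows the definition in the Annoy documentation, sqrt(2 * (1 - cosine)). How bigANNOY scales or signs the values it reports for "dot" is not shown in this vignette, so treat that line as the raw inner product only.

```r
a <- c(1, 2, 2)
b <- c(2, 0, 1)

euclidean <- sqrt(sum((a - b)^2))  # straight-line distance
manhattan <- sum(abs(a - b))       # sum of absolute coordinate differences
dot <- sum(a * b)                  # raw inner product (a similarity, not a distance)

cosine <- dot / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# Annoy's angular distance: Euclidean distance between the normalised vectors
angular <- sqrt(2 * (1 - cosine))

c(euclidean = euclidean, manhattan = manhattan, dot = dot, angular = angular)
```

Angular distance ignores vector length, which is why it behaves differently from Euclidean distance on unnormalised data.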
Stream Results into big.matrix Outputs
For larger workloads, you may not want to keep neighbour matrices in
ordinary R memory. bigANNOY can write directly into
destination big.matrix objects.
index_out <- big.matrix(nrow(query_dense), 2L, type = "integer")
distance_out <- big.matrix(nrow(query_dense), 2L, type = "double")
streamed <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 2L,
  xpIndex = index_out,
  xpDistance = distance_out
)
bigmemory::as.matrix(index_out)
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    5    4
#> [3,]    7    8
round(bigmemory::as.matrix(distance_out), 3)
#>       [,1]  [,2]
#> [1,] 0.100 0.100
#> [2,] 0.087 0.132
#> [3,] 0.100 0.100
The returned object still reports the same metadata, but the actual
neighbour matrices live in the destination big.matrix
containers.
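In the example above, both destinations have one row per query and one column per neighbour. A small pre-flight check can catch a mismatched container before a long search starts. The helper below is hypothetical and not part of bigANNOY. It relies only on dim(), so the same call should work for ordinary matrices and for big.matrix objects; plain matrices are used here as stand-ins.

```r
# Hypothetical helper: confirm both destinations are n_query x k
check_destinations <- function(index_out, distance_out, n_query, k) {
  expected <- c(as.integer(n_query), as.integer(k))
  ok_index <- identical(as.integer(dim(index_out)), expected)
  ok_distance <- identical(as.integer(dim(distance_out)), expected)
  if (!ok_index || !ok_distance) {
    stop("destinations must both be ", n_query, " x ", k, call. = FALSE)
  }
  invisible(TRUE)
}

n_query <- 3L
k <- 2L
# Plain matrices standing in for the big.matrix destinations
index_dest <- matrix(NA_integer_, n_query, k)
distance_dest <- matrix(NA_real_, n_query, k)

shapes_ok <- check_destinations(index_dest, distance_dest, n_query, k)

# A container sized for k = 3 is rejected when k = 2 is requested
wrong_dest <- matrix(NA_real_, n_query, 3L)
mismatch <- tryCatch(
  check_destinations(index_dest, wrong_dest, n_query, k),
  error = function(e) conditionMessage(e)
)
mismatch
```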
Reopen and Validate a Persisted Index
One of the main v3 improvements is explicit index lifecycle support. You can close a loaded handle, reopen the same index from disk, and validate its metadata before reuse.
annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE
reopened <- annoy_open_index(index$path, load_mode = "eager")
annoy_is_loaded(reopened)
#> [1] TRUE
Validation checks the recorded metadata against the current Annoy file and can also verify that the index loads successfully.
validation <- annoy_validate_index(reopened, strict = TRUE, load = TRUE)
validation$valid
#> [1] TRUE
validation$checks[, c("check", "passed", "severity")]
#>        check passed severity
#> 1 index_file   TRUE    error
#> 2     metric   TRUE    error
#> 3 dimensions   TRUE    error
#> 4      items   TRUE    error
#> 5  file_size   TRUE    error
#> 6   file_md5   TRUE    error
#> 7 file_mtime   TRUE  warning
#> 8       load   TRUE    error
This is especially helpful when you want to reuse an index across
sessions or share the .ann file and its .meta
sidecar with someone else.
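The file_size and file_md5 checks in the table above can be illustrated with base R alone. The sketch below is not how annoy_validate_index() is implemented. It uses a temporary file as a stand-in for an index, records its size and MD5 digest, and shows both checks failing once the file changes. The names check_file() and recorded are hypothetical.

```r
# Temporary file standing in for an .ann index
index_file <- tempfile(fileext = ".ann")
writeBin(as.raw(seq_len(64)), index_file)

# Stand-in for the values a metadata sidecar would record at build time
recorded <- list(
  file_size = file.size(index_file),
  file_md5 = unname(tools::md5sum(index_file))
)

# Hypothetical checker: compare the file on disk with the recorded values
check_file <- function(path, recorded) {
  data.frame(
    check = c("index_file", "file_size", "file_md5"),
    passed = c(
      file.exists(path),
      isTRUE(file.size(path) == recorded$file_size),
      isTRUE(unname(tools::md5sum(path)) == recorded$file_md5)
    ),
    stringsAsFactors = FALSE
  )
}

before <- check_file(index_file, recorded)

# Simulate a modified file by appending one byte, then check again
con <- file(index_file, open = "ab")
writeBin(as.raw(0), con)
close(con)
after <- check_file(index_file, recorded)

before
after
```

A digest comparison catches content changes that leave the size untouched, which is why recording both values is useful when an index is copied between machines.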
What Inputs Are Accepted?
For the quick start above we used:
- a big.matrix reference
- a dense matrix query
- in-memory big.matrix destinations for streamed outputs
The package also accepts:
- external pointers to big.matrix objects
- big.matrix descriptor objects
- descriptor file paths
- query = NULL for self-search
That broader file-backed workflow is covered in the dedicated
vignette on bigmemory persistence and descriptors.
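For orientation, the sketch below shows how the bigmemory side of those inputs is typically produced: a file-backed big.matrix, its descriptor object, and its descriptor file. It uses only bigmemory functions (as.big.matrix(), describe(), attach.big.matrix()); the file names are arbitrary choices for this example. Passing any of these to annoy_build_bigmatrix() follows from the list above but is not demonstrated here.

```r
library(bigmemory)

# Directory that will hold the backing file and its descriptor
backing_dir <- tempfile("bigannoy-ref-")
dir.create(backing_dir)

ref_dense <- matrix(seq_len(32) / 10, ncol = 4)

# File-backed big.matrix: the data live in ref.bin, described by ref.desc
ref_fb <- as.big.matrix(
  ref_dense,
  backingfile = "ref.bin",
  backingpath = backing_dir,
  descriptorfile = "ref.desc"
)

# A descriptor object identifies the same data without copying it
ref_desc <- describe(ref_fb)

# The descriptor file path is enough to re-attach the matrix later
desc_path <- file.path(backing_dir, "ref.desc")
ref_again <- attach.big.matrix(desc_path)

dim(ref_again)
```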
Recap
You have now seen the full first-run workflow:
- create a big.matrix reference
- build a persisted Annoy index
- search it in self-search and external-query modes
- stream results into destination big.matrix objects when needed
- reopen, validate, and reuse the index
From here, the most useful next steps are:
- Persistent Indexes and Lifecycle for eager/lazy loading and explicit close and reopen workflows
- File-Backed bigmemory Workflows for descriptor files and on-disk matrices
- Benchmarking Recall and Latency for benchmark_annoy_bigmatrix() and benchmark_annoy_recall_suite()