bigANNOY is an approximate nearest-neighbour package for bigmemory::big.matrix data. It builds a persisted Annoy index from a reference matrix, searches that index with either self-search or external queries, and returns results in a shape aligned with bigKNN.

This vignette walks through the first workflow most users need:

  1. create a small reference matrix
  2. build an index on disk
  3. run self-search and external-query search
  4. inspect the returned neighbours and distances
  5. reopen and validate the index in a later step

The examples are intentionally small, but the same API is designed for larger file-backed big.matrix inputs.

Load the Packages

library(bigANNOY)
library(bigmemory)

Create a Small Reference Matrix

bigANNOY is built around bigmemory::big.matrix, so we will start from a dense matrix and convert it into a big.matrix.

ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)

ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
#> [1] 8 4

The reference matrix has 8 rows and 4 columns. Each row is a candidate neighbour in the final search results.

Build the First Annoy Index

annoy_build_bigmatrix() streams the reference rows into a persisted Annoy index and writes a sidecar metadata file next to it.

index_path <- tempfile(fileext = ".ann")

index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 20L,
  metric = "euclidean",
  seed = 123L,
  load_mode = "lazy"
)

index
#> <bigannoy_index>
#>   path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmp3YPF1B/file321022f28c75.ann
#>   metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmp3YPF1B/file321022f28c75.ann.meta
#>   index_id: annoy-20260327005524-d4a614637b45
#>   metric: euclidean
#>   trees: 20
#>   items: 8
#>   dimension: 4
#>   build_seed: 123
#>   build_threads: -1
#>   build_backend: cpp
#>   load_mode: lazy
#>   loaded: FALSE
#>   file_size: 2816
#>   file_md5: d4a614637b45839a9eb126d130a96397
#>   prefault: FALSE

A few details are worth noticing:

  • the Annoy index lives on disk at index$path
  • metadata is written to index$metadata_path
  • load_mode = "lazy" means the object is initially metadata-only
  • the native handle is loaded automatically on first search

You can check the current loaded state directly.

annoy_is_loaded(index)
#> [1] FALSE
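
The lazy-loading behaviour described above can be pictured with a toy base R sketch. This is not how bigANNOY is implemented; new_lazy_handle(), is_loaded(), ensure_loaded(), and use_handle() are hypothetical names invented for illustration, and an .rds file stands in for the Annoy index. The sketch only shows the general pattern: the object starts with cheap metadata, and the expensive payload is read the first time it is needed.

```r
# Toy illustration of lazy loading; NOT bigANNOY's implementation.
new_lazy_handle <- function(path) {
  handle <- new.env()
  handle$path <- path   # cheap metadata, available immediately
  handle$data <- NULL   # expensive payload, not read yet
  handle
}

is_loaded <- function(handle) !is.null(handle$data)

# Hypothetical loader: reads the payload only if it is not already in memory.
ensure_loaded <- function(handle) {
  if (!is_loaded(handle)) {
    handle$data <- readRDS(handle$path)
  }
  invisible(handle)
}

# First use triggers the load; later uses reuse the cached payload.
use_handle <- function(handle) {
  ensure_loaded(handle)
  nrow(handle$data)
}

# An .rds file stands in for a persisted index with 8 items.
path <- tempfile(fileext = ".rds")
saveRDS(matrix(0, nrow = 8, ncol = 4), path)

h <- new_lazy_handle(path)
before <- is_loaded(h)    # nothing has been read yet
n_items <- use_handle(h)  # first use reads the file
after <- is_loaded(h)     # payload is now cached
```

Because environments have reference semantics in R, the cached payload stays attached to the handle between calls, which is the same effect the vignette shows when annoy_is_loaded() flips after the first search.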

With query = NULL, annoy_search_bigmatrix() searches the indexed reference rows against themselves. In self-search mode each row is excluded from its own neighbour list, so the nearest neighbour reported for a row is always a different row.

self_result <- annoy_search_bigmatrix(
  index,
  k = 2L,
  search_k = 100L
)

self_result$index
#>      [,1] [,2]
#> [1,]    2    3
#> [2,]    1    3
#> [3,]    2    1
#> [4,]    5    6
#> [5,]    4    6
#> [6,]    5    4
#> [7,]    8    4
#> [8,]    7    4
round(self_result$distance, 3)
#>      [,1]  [,2]
#> [1,]  0.2 0.346
#> [2,]  0.2 0.200
#> [3,]  0.2 0.346
#> [4,]  0.2 0.346
#> [5,]  0.2 0.200
#> [6,]  0.2 0.346
#> [7,]  0.2 4.000
#> [8,]  0.2 3.904

Because the first search loads the lazy index, the handle is now available for reuse.

annoy_is_loaded(index)
#> [1] TRUE
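
The self-search distances shown above can be cross-checked against an exact brute-force computation. The sketch below uses only base R (no bigANNOY calls) and repeats the reference matrix so that it runs on its own; exact_distance is a name introduced here for illustration. It assumes the whole distance matrix fits in memory, which holds for 8 rows but not for the large inputs the package targets.

```r
# Exact Euclidean self-search in base R, for comparison only.
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)

k <- 2L

# Full pairwise distance matrix; fine for 8 rows, not for large inputs.
d <- as.matrix(dist(ref_dense, method = "euclidean"))

# Self-search excludes each row from its own neighbour list.
diag(d) <- Inf

# Sorted distances to the k nearest other rows, one row per reference row.
exact_distance <- t(apply(d, 1, function(row) sort(row)[seq_len(k)]))
dimnames(exact_distance) <- NULL

round(exact_distance, 3)
```

Comparing round(exact_distance, 3) with round(self_result$distance, 3) is a quick sanity check on small data. Distances are compared rather than indices because tied neighbours, such as rows 1 and 3 relative to row 2, may legitimately appear in either order.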

The result object follows the same high-level shape as bigKNN:

str(self_result, max.level = 1)
#> List of 8
#>  $ index   : int [1:8, 1:2] 2 1 2 5 4 5 8 7 3 3 ...
#>  $ distance: num [1:8, 1:2] 0.2 0.2 0.2 0.2 0.2 ...
#>  $ k       : int 2
#>  $ metric  : chr "euclidean"
#>  $ n_ref   : int 8
#>  $ n_query : int 8
#>  $ exact   : logi FALSE
#>  $ backend : chr "annoy"

In particular:

  • index is a 1-based integer matrix
  • distance is a double matrix
  • k, metric, n_ref, and n_query describe the search
  • exact is always FALSE for bigANNOY
  • backend is "annoy"
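
Since index and distance are plain matrices of the same shape, they are easy to reshape for downstream work. The following sketch flattens them into a long table with one row per (query, rank) pair. The self_result list built here is a hand-made stand-in that copies the values printed above, so the block runs without the package; the neighbours data frame and its column names are assumptions made for illustration.

```r
# Hand-built stand-in with the same fields as the documented result object;
# the values are copied from the self-search output shown earlier.
self_result <- list(
  index = matrix(
    c(2L, 1L, 2L, 5L, 4L, 5L, 8L, 7L,
      3L, 3L, 1L, 6L, 6L, 4L, 4L, 4L),
    ncol = 2
  ),
  distance = matrix(
    c(0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2,
      0.346, 0.200, 0.346, 0.346, 0.200, 0.346, 4.000, 3.904),
    ncol = 2
  ),
  k = 2L,
  n_query = 8L
)

# as.vector() walks a matrix column by column, so rank 1 for every query
# comes first, followed by rank 2 for every query.
neighbours <- data.frame(
  query     = rep(seq_len(self_result$n_query), times = self_result$k),
  rank      = rep(seq_len(self_result$k), each = self_result$n_query),
  neighbour = as.vector(self_result$index),
  distance  = as.vector(self_result$distance)
)

# Reorder so each query's neighbours sit together, nearest first.
neighbours <- neighbours[order(neighbours$query, neighbours$rank), ]
rownames(neighbours) <- NULL

head(neighbours, 4)
```

A long table like this is convenient for joins against row-level metadata or for filtering by a distance threshold.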

Search with an External Query Matrix

In practice, external queries are usually the more common workflow. Here we build a small dense query matrix whose three rows sit close to the first, second, and third clusters in the reference data.

query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)

query_result <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 3L,
  search_k = 100L
)

query_result$index
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    5    4    6
#> [3,]    7    8    4
round(query_result$distance, 3)
#>       [,1]  [,2]  [,3]
#> [1,] 0.100 0.100 0.265
#> [2,] 0.087 0.132 0.240
#> [3,] 0.100 0.100 3.951

The three query rows each return three approximate neighbours from the indexed reference matrix. For small examples like this one, the results will typically match an exact search, but the important point is that the API stays the same for larger problems where approximate search is preferable.

Tune the Main Search Controls

Two arguments matter most when you begin tuning:

  • n_trees controls index quality and index size at build time
  • search_k controls search effort at query time

As a starting point:

  • increase search_k first if recall looks too low
  • rebuild with more n_trees when query-time tuning alone is not enough
  • keep metric = "euclidean" when you want the most direct comparison with bigKNN

The package also supports "angular", "manhattan", and "dot" metrics, but Euclidean is usually the easiest place to begin.
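
The tuning advice above depends on having a recall figure to look at. A minimal way to get one on small data is to compare the approximate neighbour indices with an exact brute-force answer. The sketch below uses base R only; exact_neighbours() and recall_at_k() are hypothetical helpers, not bigANNOY functions, and approx_index is copied by hand from the query_result$index output shown earlier.

```r
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)

query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)

# Neighbour indices copied from the documented query_result$index output.
approx_index <- matrix(
  c(1L, 2L, 3L,
    5L, 4L, 6L,
    7L, 8L, 4L),
  ncol = 3,
  byrow = TRUE
)

# Hypothetical helper: exact k nearest reference rows for one query row.
exact_neighbours <- function(q, ref, k) {
  d <- sqrt(rowSums(sweep(ref, 2, q)^2))  # Euclidean distance to every row
  order(d)[seq_len(k)]
}

# Hypothetical helper: share of approximate neighbours that also appear in
# the exact top-k set, averaged over all query rows.
recall_at_k <- function(approx_index, query, ref) {
  k <- ncol(approx_index)
  hits <- vapply(seq_len(nrow(query)), function(i) {
    exact <- exact_neighbours(query[i, ], ref, k)
    sum(approx_index[i, ] %in% exact)
  }, integer(1))
  sum(hits) / (nrow(query) * k)
}

recall <- recall_at_k(approx_index, query_dense, ref_dense)
recall
```

Recall is computed on sets rather than on ordered positions, so tied neighbours do not count as misses. On larger data the brute-force step becomes the bottleneck, so it is usually run on a sample of query rows only.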

Stream Results into big.matrix Outputs

For larger workloads, you may not want to keep neighbour matrices in ordinary R memory. bigANNOY can write directly into destination big.matrix objects.

index_out <- big.matrix(nrow(query_dense), 2L, type = "integer")
distance_out <- big.matrix(nrow(query_dense), 2L, type = "double")

streamed <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 2L,
  xpIndex = index_out,
  xpDistance = distance_out
)

bigmemory::as.matrix(index_out)
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    5    4
#> [3,]    7    8
round(bigmemory::as.matrix(distance_out), 3)
#>       [,1]  [,2]
#> [1,] 0.100 0.100
#> [2,] 0.087 0.132
#> [3,] 0.100 0.100

The returned object still reports the same metadata, but the actual neighbour matrices live in the destination big.matrix containers.

Reopen and Validate a Persisted Index

One of the main v3 improvements is explicit index lifecycle support. You can close a loaded handle, reopen the same index from disk, and validate its metadata before reuse.

annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE

reopened <- annoy_open_index(index$path, load_mode = "eager")
annoy_is_loaded(reopened)
#> [1] TRUE

Validation checks the recorded metadata against the current Annoy file and can also verify that the index loads successfully.

validation <- annoy_validate_index(reopened, strict = TRUE, load = TRUE)

validation$valid
#> [1] TRUE
validation$checks[, c("check", "passed", "severity")]
#>        check passed severity
#> 1 index_file   TRUE    error
#> 2     metric   TRUE    error
#> 3 dimensions   TRUE    error
#> 4      items   TRUE    error
#> 5  file_size   TRUE    error
#> 6   file_md5   TRUE    error
#> 7 file_mtime   TRUE  warning
#> 8       load   TRUE    error

This is especially helpful when you want to reuse an index across sessions or share the .ann file and its .meta sidecar with someone else.
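
When validation runs inside a script, it can help to turn the checks table into an explicit go/no-go decision. The sketch below works on a hand-built data frame with the same three columns as validation$checks; summarise_checks() is a hypothetical helper, and file_mtime is flipped to FALSE purely to exercise the warning branch. How annoy_validate_index() itself reacts to a failed check under strict = TRUE is not shown in this vignette, so nothing here should be read as describing that behaviour.

```r
# Stand-in for validation$checks, using the columns shown above.
# file_mtime is set to FALSE on purpose so the example has something to report.
checks <- data.frame(
  check = c("index_file", "metric", "dimensions", "items",
            "file_size", "file_md5", "file_mtime", "load"),
  passed = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
  severity = c("error", "error", "error", "error",
               "error", "error", "warning", "error"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split failed checks by severity.
summarise_checks <- function(checks) {
  failed <- checks[!checks$passed, , drop = FALSE]
  list(
    ok = !any(failed$severity == "error"),  # only error-level failures block reuse
    errors = failed$check[failed$severity == "error"],
    warnings = failed$check[failed$severity == "warning"]
  )
}

status <- summarise_checks(checks)

# Refuse to continue on error-level failures; report warnings for review.
if (!status$ok) {
  stop("index validation failed: ", paste(status$errors, collapse = ", "))
}
status$warnings
```

Treating warning-level failures as reportable but non-blocking mirrors the severity column in the table: a changed modification time alone may simply mean the file was copied, whereas a size or checksum mismatch suggests a different file.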

What Inputs Are Accepted?

For the quick start above we used:

  • a big.matrix reference
  • a dense matrix query
  • in-memory big.matrix destinations for streamed outputs

The package also accepts:

  • external pointers to big.matrix objects
  • big.matrix descriptor objects
  • descriptor file paths
  • query = NULL for self-search

That broader file-backed workflow is covered in the dedicated vignette on bigmemory persistence and descriptors.

Recap

You have now seen the full first-run workflow:

  1. create a big.matrix reference
  2. build a persisted Annoy index
  3. search it in self-search and external-query modes
  4. stream results into destination big.matrix objects when needed
  5. reopen, validate, and reuse the index

From here, the most useful next steps are: