Persistent Indexes and Lifecycle
Source: vignettes/persistent-indexes-and-lifecycle.Rmd
bigANNOY v3 adds explicit index lifecycle support around
persisted Annoy files. That makes it possible to:
- build an index once and reopen it later
- choose whether a reopened index should load eagerly or lazily
- check whether a native handle is currently live
- close loaded handles explicitly in long sessions
- validate the Annoy file against its recorded metadata before reuse
This vignette focuses on those operational workflows rather than on search quality or benchmark tuning.
Why Lifecycle Management Matters
Annoy indexes are stored on disk. In practice, that means the useful object is not just the result of a single build call, but a persisted pair:
- the .ann index file
- the .meta sidecar metadata file
The bigannoy_index object returned by
bigANNOY is a session-level wrapper around those files. It
remembers the key metadata and can optionally hold a live native handle
for faster repeated searches within the same R session.
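Because the two files travel together, a small guard before reopening can be useful. The helper below is a hypothetical base R sketch, not a bigANNOY function; it assumes the sidecar sits at the index path plus ".meta", which is the convention used for metadata_path later in this vignette.

```r
# Hypothetical helper (not a bigANNOY function): report whether both halves
# of the persisted pair exist. Assumes the sidecar is "<index path>.meta".
persisted_pair_status <- function(index_path) {
  metadata_path <- paste0(index_path, ".meta")
  c(
    index = file.exists(index_path),
    metadata = file.exists(metadata_path)
  )
}

# Demonstration with empty placeholder files in a temporary directory
demo_dir <- tempfile("pair-demo-")
dir.create(demo_dir)
demo_index <- file.path(demo_dir, "demo.ann")

file.create(demo_index)                    # index present, sidecar missing
status_before <- persisted_pair_status(demo_index)

file.create(paste0(demo_index, ".meta"))   # now add the sidecar
status_after <- persisted_pair_status(demo_index)
```

A check like this only confirms that the files are present; it says nothing about whether their contents still agree, which is what validation is for.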
Build an Index in Lazy Mode
We will create a small reference matrix, write the Annoy index into a temporary directory, and keep the returned object in lazy mode so the first search is what loads the live handle.
library(bigANNOY)
library(bigmemory)  # provides as.big.matrix()
artifact_dir <- file.path(tempdir(), "bigannoy-lifecycle")
dir.create(artifact_dir, recursive = TRUE, showWarnings = FALSE)
ref_dense <- matrix(
c(
0.0, 0.1, 0.2,
0.1, 0.0, 0.1,
0.2, 0.1, 0.0,
1.0, 1.1, 1.2,
1.1, 1.0, 1.1,
1.2, 1.1, 1.0
),
ncol = 3,
byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)
index_path <- file.path(artifact_dir, "ref.ann")
metadata_path <- paste0(index_path, ".meta")
index <- annoy_build_bigmatrix(
ref_big,
path = index_path,
n_trees = 25L,
metric = "euclidean",
seed = 123L,
load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBy4zub/bigannoy-lifecycle/ref.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBy4zub/bigannoy-lifecycle/ref.ann.meta
#> index_id: annoy-20260327005528-2a8b6582c143
#> metric: euclidean
#> trees: 25
#> items: 6
#> dimension: 3
#> build_seed: 123
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 2968
#> file_md5: 2a8b6582c143e941abc77e79789e227e
#> prefault: FALSE
The returned object points to the persisted files, but the native handle is not loaded yet.
annoy_is_loaded(index)
#> [1] FALSE
file.exists(index$path)
#> [1] TRUE
file.exists(index$metadata_path)
#> [1] TRUE
Inspect the Sidecar Metadata
The sidecar metadata file is meant to support safe reopen and validation workflows. It records the metric, dimension, item count, build settings, and a small file signature for the persisted Annoy file.
metadata <- read.dcf(index$metadata_path)
metadata[, c(
"index_id",
"metric",
"n_dim",
"n_ref",
"n_trees",
"build_seed",
"build_backend",
"file_size",
"file_md5"
)]
#> index_id metric n_dim n_ref n_trees
#> [1,] "annoy-20260327005528-2a8b6582c143" "euclidean" "3" "6" "25"
#> [2,] "annoy-20260327005528-2a8b6582c143" "euclidean" "3" "6" "25"
#> [3,] "annoy-20260327005528-2a8b6582c143" "euclidean" "3" "6" "25"
#> build_seed build_backend file_size file_md5
#> [1,] "123" "cpp" "2968" "2a8b6582c143e941abc77e79789e227e"
#> [2,] "123" "cpp" "2968" "2a8b6582c143e941abc77e79789e227e"
#> [3,] "123" "cpp" "2968" "2a8b6582c143e941abc77e79789e227e"
The important point is not the exact formatting of the metadata file, but that the persisted index is now self-describing enough to be reopened and checked in later sessions.
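Since read.dcf() returns a character matrix, downstream code usually wants one record with typed fields. The sketch below shows one way to do that in base R. It writes a mock sidecar so the example runs on its own; the field names follow the output above, but the mock values are placeholders and the real file may contain more fields.

```r
# Mock sidecar for illustration only; not produced by bigANNOY.
mock_meta_path <- tempfile(fileext = ".meta")
write.dcf(
  data.frame(
    metric = "euclidean",
    n_dim = "3",
    n_ref = "6",
    n_trees = "25",
    stringsAsFactors = FALSE
  ),
  file = mock_meta_path
)

# read.dcf() returns a character matrix with one row per record
raw_meta <- read.dcf(mock_meta_path)

# Take the first record and convert the count-like fields to integers
first_record <- raw_meta[1, ]
typed_meta <- list(
  metric = unname(first_record[["metric"]]),
  n_dim = as.integer(first_record[["n_dim"]]),
  n_ref = as.integer(first_record[["n_ref"]]),
  n_trees = as.integer(first_record[["n_trees"]])
)
```

Converting once up front keeps later comparisons, such as checking n_dim against the number of columns in a query matrix, free of character-versus-integer surprises.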
Lazy Loading Versus Eager Loading
There are two lifecycle modes:
- "lazy" keeps only metadata in memory until the first search
- "eager" loads a native handle immediately when the index object is created or reopened
The index we just built is lazy.
annoy_is_loaded(index)
#> [1] FALSE
The first search loads the handle automatically.
first_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
annoy_is_loaded(index)
#> [1] TRUE
first_result$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 3
#> [3,] 2 1
#> [4,] 5 6
#> [5,] 4 6
#> [6,] 5 4
round(first_result$distance, 3)
#> [,1] [,2]
#> [1,] 0.173 0.283
#> [2,] 0.173 0.173
#> [3,] 0.173 0.283
#> [4,] 0.173 0.283
#> [5,] 0.173 0.173
#> [6,] 0.173 0.283
Once the handle is loaded, repeated searches in the same session can reuse it.
second_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
identical(first_result$index, second_result$index)
#> [1] TRUE
all.equal(first_result$distance, second_result$distance)
#> [1] TRUE
Validate Without Loading
Validation and loading are related, but they are not the same thing. Sometimes you want to confirm that the metadata and file signature still look right without paying the cost of loading the native handle yet.
annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE
validation_no_load <- annoy_validate_index(
index,
strict = TRUE,
load = FALSE
)
validation_no_load$valid
#> [1] TRUE
validation_no_load$checks[, c("check", "passed", "severity")]
#> check passed severity
#> 1 index_file TRUE error
#> 2 metric TRUE error
#> 3 dimensions TRUE error
#> 4 items TRUE error
#> 5 file_size TRUE error
#> 6 file_md5 TRUE error
#> 7 file_mtime TRUE warning
annoy_is_loaded(index)
#> [1] FALSE
Because load = FALSE, the validation report checks the
recorded metadata against the current file without changing the loaded
state of the object.
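To make the idea of a signature check concrete, here is a minimal base R sketch that compares a recorded size and MD5 against the file currently on disk. The helper name and its arguments are hypothetical, and this is not how bigANNOY itself is implemented; it only illustrates why a checksum is recorded alongside the file size.

```r
# Hypothetical helper: compare a recorded size and MD5 with the file on disk.
# Illustration only; this is not bigANNOY's implementation.
signature_matches <- function(path, recorded_size, recorded_md5) {
  current_size <- file.size(path)
  current_md5 <- unname(tools::md5sum(path))
  c(
    file_size = isTRUE(current_size == recorded_size),
    file_md5 = isTRUE(current_md5 == recorded_md5)
  )
}

# Demonstration on a placeholder file (not a real Annoy index)
demo_file <- tempfile(fileext = ".ann")
writeBin(as.raw(1:16), demo_file)
recorded_size <- file.size(demo_file)
recorded_md5 <- unname(tools::md5sum(demo_file))

unchanged <- signature_matches(demo_file, recorded_size, recorded_md5)

# Simulate the file being overwritten with different bytes of the same length
writeBin(as.raw(16:1), demo_file)
altered <- signature_matches(demo_file, recorded_size, recorded_md5)
```

In the altered case the size still matches while the checksum does not, which is the situation a size-only check would miss.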
Validate and Load Explicitly
If you do want validation to also confirm that the Annoy index can be
opened successfully, set load = TRUE.
validation_with_load <- annoy_validate_index(
index,
strict = TRUE,
load = TRUE
)
validation_with_load$valid
#> [1] TRUE
tail(validation_with_load$checks[, c("check", "passed", "severity")], 2L)
#> check passed severity
#> 7 file_mtime TRUE warning
#> 8 load TRUE error
annoy_is_loaded(index)
#> [1] TRUE
This is a useful pattern before long-running queries or before handing a reopened index to downstream analysis code.
Close a Loaded Handle Explicitly
Explicit close support is helpful in long R sessions, in tests, and in code that wants deterministic control over when handles are released.
annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE
The persisted .ann file is still there, so the next
search can load it again.
reload_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
annoy_is_loaded(index)
#> [1] TRUE
reload_result$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 3
#> [3,] 2 1
#> [4,] 5 6
#> [5,] 4 6
#> [6,] 5 4
Reopen the Same Index in a New Object
The more important persistence workflow is reopening the same files
into a new bigannoy_index object. This is what a later R
session would typically do.
annoy_open_index() and
annoy_load_bigmatrix() both support this pattern. The main
distinction is semantic: annoy_load_bigmatrix() is a
friendlier name when you are thinking in terms of bigmemory
workflows, while annoy_open_index() makes the
persisted-index lifecycle more explicit.
reopened_lazy <- annoy_open_index(
path = index$path,
load_mode = "lazy"
)
reopened_eager <- annoy_load_bigmatrix(
path = index$path,
load_mode = "eager"
)
annoy_is_loaded(reopened_lazy)
#> [1] FALSE
annoy_is_loaded(reopened_eager)
#> [1] TRUE
The eager reopen path loads immediately. The lazy reopen path waits until first use.
reopened_result <- annoy_search_bigmatrix(
reopened_lazy,
k = 2L,
search_k = 100L
)
annoy_is_loaded(reopened_lazy)
#> [1] TRUE
reopened_result$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 3
#> [3,] 2 1
#> [4,] 5 6
#> [5,] 4 6
#> [6,] 5 4
Lifecycle State Lives in the Session Object
The persisted files are shared, but loaded-state tracking is per-object and per-session. Closing one in-memory object does not invalidate another object that already opened the same index.
annoy_close_index(reopened_lazy)
c(
original = annoy_is_loaded(index),
reopened_lazy = annoy_is_loaded(reopened_lazy),
reopened_eager = annoy_is_loaded(reopened_eager)
)
#> original reopened_lazy reopened_eager
#> TRUE FALSE TRUE
This is a useful mental model:
- the .ann file is the durable asset
- the bigannoy_index object is the session-level controller
- the loaded handle is cached inside that controller only for the current session
What Happens If Validation Fails?
In normal workflows,
annoy_validate_index(..., strict = TRUE) is the safest
default because it stops immediately when critical checks fail. If you
want a diagnostic report instead of an error, use
strict = FALSE.
report <- annoy_validate_index(
reopened_eager,
strict = FALSE,
load = FALSE
)
report$valid
#> [1] TRUE
report$checks[, c("check", "passed", "severity")]
#> check passed severity
#> 1 index_file TRUE error
#> 2 metric TRUE error
#> 3 dimensions TRUE error
#> 4 items TRUE error
#> 5 file_size TRUE error
#> 6 file_md5 TRUE error
#> 7 file_mtime TRUE warning
That pattern is especially helpful when you are writing higher-level code that wants to display a validation report before deciding whether to rebuild or reload an index.
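As a sketch of what such higher-level code might look like, the hypothetical helper below separates failed checks by severity. It assumes the report is a list with a logical valid element and a checks data frame with check, passed, and severity columns, as the printed output above suggests. A hand-built mock report stands in for a real failing one so the example runs without an index.

```r
# Hypothetical helper (not part of bigANNOY): list failed checks by severity.
# Assumes `report$checks` is a data frame with `check`, `passed`, `severity`.
summarise_validation <- function(report) {
  failed <- report$checks[!report$checks$passed, , drop = FALSE]
  list(
    valid = isTRUE(report$valid),
    failed_errors = failed$check[failed$severity == "error"],
    failed_warnings = failed$check[failed$severity == "warning"]
  )
}

# Mock report for illustration; in practice this would come from
# annoy_validate_index(..., strict = FALSE)
mock_report <- list(
  valid = FALSE,
  checks = data.frame(
    check = c("index_file", "file_md5", "file_mtime"),
    passed = c(TRUE, FALSE, FALSE),
    severity = c("error", "error", "warning"),
    stringsAsFactors = FALSE
  )
)

summary_out <- summarise_validation(mock_report)
```

A caller could then treat any entry in failed_errors as a reason to rebuild, while surfacing failed_warnings as a note to the user.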
Recommended Workflow
For most projects, a sensible lifecycle pattern looks like this:
- build the index once with annoy_build_bigmatrix()
- keep the .ann file and the .meta file together
- reopen with annoy_open_index() or annoy_load_bigmatrix() in later sessions
- run annoy_validate_index() before important downstream work
- use lazy loading for lighter startup or eager loading for repeated search sessions
- call annoy_close_index() when you want explicit control over loaded handles
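The steps above can be combined into one wrapper for later sessions. The function below is a hypothetical sketch rather than part of bigANNOY; it uses only the calls and arguments shown in this vignette, and relies on on.exit() so the handle is released even if validation or the search fails.

```r
# Hypothetical wrapper (not part of bigANNOY): reopen, validate, search,
# and release the native handle on exit.
search_persisted_index <- function(path, k = 2L, search_k = 100L) {
  index <- bigANNOY::annoy_open_index(path = path, load_mode = "lazy")

  # Close the handle on exit, but only if one was actually loaded
  on.exit(
    if (bigANNOY::annoy_is_loaded(index)) bigANNOY::annoy_close_index(index),
    add = TRUE
  )

  # strict = TRUE stops on critical failures; load = TRUE also confirms
  # that the index file can be opened
  bigANNOY::annoy_validate_index(index, strict = TRUE, load = TRUE)

  bigANNOY::annoy_search_bigmatrix(index, k = k, search_k = search_k)
}
```

With the index built earlier in this vignette, search_persisted_index(index_path) would be the call for a later session. Closing on exit trades the speed of a reused handle for deterministic cleanup, so this shape suits one-off queries more than repeated searches.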
Recap
bigANNOY v3 turns persisted Annoy files into a more
explicit lifecycle:
- build once, reopen later
- choose eager or lazy loading
- test loaded state with annoy_is_loaded()
- close handles with annoy_close_index()
- validate persisted files with annoy_validate_index()
The next vignette to read after this one is usually File-Backed bigmemory Workflows, which focuses on descriptor files, file-backed matrices, and streamed output destinations.