Validation and Sharing Indexes
Persisted indexes are most useful when they can be reopened safely later or shared with collaborators without guessing how they were created.
bigANNOY v3 addresses that problem with two ideas:
- each Annoy index has a sidecar metadata file
- persisted indexes can be checked with annoy_validate_index() before use
This vignette focuses on those operational safeguards.
Create a Small Persisted Example
We will build a small Euclidean Annoy index and keep all of its files inside a temporary working directory.
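# as.big.matrix() comes from bigmemory; the annoy_*() functions come from bigANNOY
library(bigmemory)
library(bigANNOY)
share_dir <- tempfile("bigannoy-share-")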
dir.create(share_dir, recursive = TRUE, showWarnings = FALSE)
ref_dense <- matrix(
c(
0.0, 0.0,
1.0, 0.0,
0.0, 1.0,
1.0, 1.0
),
ncol = 2,
byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)
index_path <- file.path(share_dir, "ref.ann")
index <- annoy_build_bigmatrix(
ref_big,
path = index_path,
n_trees = 20L,
metric = "euclidean",
seed = 77L,
load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpS6pDWM/bigannoy-share-32733dac60b5/ref.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpS6pDWM/bigannoy-share-32733dac60b5/ref.ann.meta
#> index_id: annoy-20260327005530-c5f081efa033
#> metric: euclidean
#> trees: 20
#> items: 4
#> dimension: 2
#> build_seed: 77
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 1056
#> file_md5: c5f081efa03365a2af6692ec9950b0f9
#> prefault: FALSE
At this point the key persisted assets are:
- the Annoy index file at index$path
- the sidecar metadata file at index$metadata_path
What the Metadata Records
The metadata file is a small DCF (Debian Control File) document, readable with read.dcf(), that records enough information to make later reopen and validation steps safer.
metadata <- read.dcf(index$metadata_path)
metadata[, c(
"metadata_version",
"package_version",
"annoy_version",
"index_id",
"metric",
"n_dim",
"n_ref",
"n_trees",
"build_seed",
"build_threads",
"build_backend",
"file_size",
"file_mtime",
"file_md5",
"load_mode",
"index_file"
)]
#> metadata_version package_version annoy_version
#> [1,] "3" "0.3.0" "0"
#> [2,] "3" "0.3.0" "0"
#> [3,] "3" "0.3.0" "23"
#> index_id metric n_dim n_ref n_trees
#> [1,] "annoy-20260327005530-c5f081efa033" "euclidean" "2" "4" "20"
#> [2,] "annoy-20260327005530-c5f081efa033" "euclidean" "2" "4" "20"
#> [3,] "annoy-20260327005530-c5f081efa033" "euclidean" "2" "4" "20"
#> build_seed build_threads build_backend file_size file_mtime
#> [1,] "77" "-1" "cpp" "1056" "2026-03-27T00:55:30Z"
#> [2,] "77" "-1" "cpp" "1056" "2026-03-27T00:55:30Z"
#> [3,] "77" "-1" "cpp" "1056" "2026-03-27T00:55:30Z"
#> file_md5 load_mode index_file
#> [1,] "c5f081efa03365a2af6692ec9950b0f9" "lazy" "ref.ann"
#> [2,] "c5f081efa03365a2af6692ec9950b0f9" "lazy" "ref.ann"
#> [3,] "c5f081efa03365a2af6692ec9950b0f9" "lazy" "ref.ann"
The most important fields operationally are:
- metric, n_dim, and n_ref, which describe what the index represents
- file_size, file_mtime, and file_md5, which summarize the current Annoy file
- index_file, which records the expected basename of the .ann file
- index_id, which gives the persisted artifact a stable identifier
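The signature fields can also be checked by hand with base R. The following is a minimal sketch, not the package's own validation logic: it uses a stand-in binary file and a hand-written single-record sidecar (both hypothetical) instead of a real Annoy index, and recomputes file_size and file_md5 with file.size() and tools::md5sum().

```r
# Stand-in artifacts: a small binary file playing the role of the .ann file
# and a hand-written DCF sidecar (hypothetical; not produced by bigANNOY).
demo_dir <- tempfile("sidecar-demo-")
dir.create(demo_dir)
ann_path <- file.path(demo_dir, "demo.ann")
meta_path <- paste0(ann_path, ".meta")
writeBin(as.raw(1:64), ann_path)

write.dcf(
  data.frame(
    file_size = as.character(file.size(ann_path)),
    file_md5 = unname(tools::md5sum(ann_path)),
    stringsAsFactors = FALSE
  ),
  file = meta_path
)

# Read one field from the first record of a DCF matrix.
field <- function(dcf, name) unname(dcf[1, name])

# Recompute the signature of the file on disk and compare with the record.
recorded <- read.dcf(meta_path)
signature_ok <- c(
  file_size = field(recorded, "file_size") == as.character(file.size(ann_path)),
  file_md5 = field(recorded, "file_md5") == unname(tools::md5sum(ann_path))
)
signature_ok

# Overwriting the file with different bytes of the same length keeps
# file_size intact but breaks the recorded checksum.
writeBin(as.raw(64:1), ann_path)
md5_still_ok <- field(recorded, "file_md5") == unname(tools::md5sum(ann_path))
md5_still_ok
```

The second half shows why a checksum is recorded in addition to the size: a file can change without changing length.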
Validate Before You Use a Persisted Index
The safest default is to validate a reopened or long-lived index before using it for important downstream work.
validation <- annoy_validate_index(
index,
strict = TRUE,
load = TRUE
)
validation$valid
#> [1] TRUE
validation$checks[, c("check", "passed", "severity")]
#> check passed severity
#> 1 index_file TRUE error
#> 2 metric TRUE error
#> 3 dimensions TRUE error
#> 4 items TRUE error
#> 5 file_size TRUE error
#> 6 file_md5 TRUE error
#> 7 file_mtime TRUE warning
#> 8 load TRUE error
With strict = TRUE, any failed error-severity check
raises an R error immediately. With load = TRUE, validation also
confirms that the index can actually be opened successfully.
What Counts as an Error Versus a Warning
Not every check has the same severity:
- checksum and file-size mismatches are treated as errors
- metric, dimension, and item-count mismatches are treated as errors
- file modification time is currently treated as a warning
That distinction is visible in the validation report.
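Because the report is an ordinary data frame, failed checks can be triaged by severity with base R subsetting. The sketch below uses a mock table with the same check, passed, and severity columns shown above; the rows and the failures are made up for illustration and do not come from a real validation run.

```r
# Mock report with the same columns as validation$checks
# (rows are illustrative; file_md5 and file_mtime are forced to fail).
checks <- data.frame(
  check = c("index_file", "metric", "file_size", "file_md5", "file_mtime"),
  passed = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  severity = c("error", "error", "error", "error", "warning"),
  stringsAsFactors = FALSE
)

# Keep only the failed rows, then split them by severity.
failed <- checks[!checks$passed, ]
blocking <- failed$check[failed$severity == "error"]
advisory <- failed$check[failed$severity == "warning"]

blocking
advisory
```

A caller might refuse to proceed when blocking is non-empty and merely log the entries in advisory.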
Reopen the Index as a Separate Session Object
In a later R session, you would normally reattach the persisted index
with annoy_open_index() or
annoy_load_bigmatrix().
reopened <- annoy_open_index(
path = index$path,
load_mode = "lazy"
)
annoy_is_loaded(reopened)
#> [1] FALSE
annoy_validate_index(reopened, strict = TRUE, load = TRUE)$valid
#> [1] TRUE
annoy_is_loaded(reopened)
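#> [1] TRUE
Note that validating with load = TRUE also loads the lazily opened index, which is why annoy_is_loaded() changes from FALSE to TRUE.
This gives you a clean session-level controller around the same persisted files. The reopened object can now be searched, validated again, or explicitly closed.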
Sharing Checklist
When sharing an index with another user, machine, or later analysis step, keep the following artifacts together:
- the .ann file
- the .meta sidecar file
- any bigmemory descriptor files needed to reconstruct the reference or query workflow around the index
In practice, it is best to think of the .ann and
.meta files as one unit.
Simulate Sharing by Copying the Persisted Files
To mimic transferring an index to another location, we will copy both files into a separate directory and reopen the copy.
shared_dir <- tempfile("bigannoy-shared-copy-")
dir.create(shared_dir, recursive = TRUE, showWarnings = FALSE)
shared_index_path <- file.path(shared_dir, basename(index$path))
shared_metadata_path <- file.path(shared_dir, basename(index$metadata_path))
file.copy(index$path, shared_index_path, overwrite = TRUE)
#> [1] TRUE
file.copy(index$metadata_path, shared_metadata_path, overwrite = TRUE)
#> [1] TRUE
shared <- annoy_open_index(
path = shared_index_path,
load_mode = "lazy"
)
shared_report <- annoy_validate_index(
shared,
strict = TRUE,
load = TRUE
)
shared_report$valid
#> [1] TRUE
This is the basic “ship the index and reopen it elsewhere” workflow.
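The two file.copy() calls can be wrapped so that the pair always travels together. The helper below is a hypothetical sketch in base R, not part of bigANNOY: copy_index_pair() and its arguments are invented for illustration, and the default sidecar name (the index path plus ".meta") is assumed from the paths printed earlier. It refuses to copy an incomplete pair and compares MD5 checksums after copying.

```r
# Hypothetical helper: copy an index file and its sidecar into `to_dir`
# and confirm that the copies are byte-identical to the originals.
copy_index_pair <- function(ann_path, to_dir,
                            meta_path = paste0(ann_path, ".meta")) {
  sources <- c(index = ann_path, metadata = meta_path)
  absent <- !file.exists(sources)
  if (any(absent)) {
    stop("Missing file(s): ", paste(sources[absent], collapse = ", "))
  }
  dir.create(to_dir, recursive = TRUE, showWarnings = FALSE)
  targets <- file.path(to_dir, basename(sources))
  names(targets) <- names(sources)
  copied <- file.copy(sources, targets, overwrite = TRUE)
  same_md5 <- unname(tools::md5sum(sources)) == unname(tools::md5sum(targets))
  verified <- copied & same_md5
  if (!all(verified)) {
    stop("Copy could not be verified for: ",
         paste(targets[!verified], collapse = ", "))
  }
  targets
}

# Stand-in files so the sketch runs without a real index.
src_dir <- tempfile("pair-src-")
dir.create(src_dir)
writeBin(as.raw(1:32), file.path(src_dir, "demo.ann"))
writeLines("index_file: demo.ann", file.path(src_dir, "demo.ann.meta"))

copied_paths <- copy_index_pair(
  file.path(src_dir, "demo.ann"),
  to_dir = tempfile("pair-dst-")
)
unname(basename(copied_paths))
```

The checksum comparison only guards the transfer itself; the copied index should still be reopened and validated as shown above.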
Non-Strict Validation for Diagnostics
Sometimes you do not want an immediate error. You want a report first so you can inspect what failed and decide whether to stop, rebuild, or repair the metadata.
To demonstrate that path, we will deliberately corrupt the copied metadata by replacing the recorded checksum with a wrong value.
bad_metadata <- read.dcf(shared_metadata_path)
bad_metadata[1L, "file_md5"] <- "corrupted"
write.dcf(as.data.frame(bad_metadata, stringsAsFactors = FALSE), file = shared_metadata_path)
shared_bad <- annoy_open_index(shared_index_path, load_mode = "lazy")
bad_report <- annoy_validate_index(
shared_bad,
strict = FALSE,
load = FALSE
)
bad_report$valid
#> [1] FALSE
bad_report$checks[, c("check", "passed", "severity")]
#> check passed severity
#> 1 index_file TRUE error
#> 2 metric TRUE error
#> 3 dimensions TRUE error
#> 4 items TRUE error
#> 5 file_size TRUE error
#> 6 file_md5 FALSE error
#> 7 file_mtime TRUE warning
This pattern is especially helpful in higher-level tools that want to show a validation report instead of terminating immediately.
Strict Validation as a Gate
For production-style workflows, strict = TRUE is usually
the better default because it turns a failed validation into an
immediate hard stop.
strict_error <- tryCatch(
{
annoy_validate_index(shared_bad, strict = TRUE, load = FALSE)
NULL
},
error = function(e) conditionMessage(e)
)
strict_error
#> [1] "Recorded file checksum matches the current Annoy file checksum."
The exact message may vary depending on which error-severity check fails first, but the key point is that the corrupted metadata is no longer silently accepted.
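The same tryCatch() idea can be packaged as a gate that runs downstream work only when validation does not raise an error. This is a sketch under stated assumptions: run_if_valid(), fake_validate(), and fake_action() are hypothetical names, and the validator and action are passed in as plain functions so the example runs without a real index.

```r
# Hypothetical gate: run `action` only if `validate` does not raise an error.
# Both arguments are ordinary functions supplied by the caller.
run_if_valid <- function(index, validate, action) {
  failure <- tryCatch(
    {
      validate(index)
      NULL
    },
    error = function(e) conditionMessage(e)
  )
  if (!is.null(failure)) {
    return(list(ok = FALSE, reason = failure, result = NULL))
  }
  list(ok = TRUE, reason = NULL, result = action(index))
}

# Stand-in validator and action so the sketch is self-contained.
fake_validate <- function(index) {
  if (!isTRUE(index$intact)) stop("checksum mismatch")
  invisible(TRUE)
}
fake_action <- function(index) paste("searched", index$name)

good <- run_if_valid(list(name = "good", intact = TRUE), fake_validate, fake_action)
bad <- run_if_valid(list(name = "bad", intact = FALSE), fake_validate, fake_action)

good$result
bad$reason
```

With bigANNOY, the validate argument could be function(ix) annoy_validate_index(ix, strict = TRUE, load = TRUE), with the action being whatever search or analysis step follows.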
A Common Sharing Pitfall: Renaming Only the .ann File
The metadata records the expected basename of the Annoy file in
index_file. That means you should generally keep the
.ann file and the .meta file paired and
consistent.
If you rename the .ann file without updating or
regenerating the metadata, annoy_open_index() will reject
the mismatch.
renamed_path <- file.path(shared_dir, "renamed.ann")
file.copy(shared_index_path, renamed_path, overwrite = TRUE)
#> [1] TRUE
rename_error <- tryCatch(
{
annoy_open_index(renamed_path, metadata_path = shared_metadata_path)
NULL
},
error = function(e) conditionMessage(e)
)
rename_error
#> [1] "Annoy metadata does not match the supplied index file path"
That guard is useful because it prevents accidentally pairing the wrong Annoy file with the wrong metadata file.
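A pairing problem like this can be spotted before opening anything by comparing each sidecar's index_file field with the basename of the file next to it. The audit below is a hypothetical base R sketch: audit_pairs() is not a bigANNOY function, it assumes sidecars are named by appending ".meta" to the index path (as in the paths printed earlier), and it reads only the first DCF record. The demo files are empty stand-ins.

```r
# Hypothetical audit: for every .ann file in `dir`, look for "<file>.meta"
# and compare its recorded index_file with the actual basename.
audit_pairs <- function(dir) {
  ann_files <- list.files(dir, pattern = "\\.ann$", full.names = TRUE)
  status <- vapply(ann_files, function(ann) {
    meta_path <- paste0(ann, ".meta")
    if (!file.exists(meta_path)) return("no sidecar")
    meta <- read.dcf(meta_path, fields = "index_file")
    recorded <- unname(meta[1, "index_file"])
    if (is.na(recorded)) return("index_file not recorded")
    if (recorded == basename(ann)) "paired" else "basename mismatch"
  }, character(1))
  data.frame(
    file = basename(ann_files),
    status = unname(status),
    stringsAsFactors = FALSE
  )
}

# Stand-in directory: one good pair, one renamed index whose sidecar still
# names the old file, and one index with no sidecar at all.
audit_dir <- tempfile("pair-audit-")
dir.create(audit_dir)
invisible(file.create(file.path(audit_dir, c("ref.ann", "renamed.ann", "orphan.ann"))))
writeLines("index_file: ref.ann", file.path(audit_dir, "ref.ann.meta"))
writeLines("index_file: ref.ann", file.path(audit_dir, "renamed.ann.meta"))

audit <- audit_pairs(audit_dir)
audit
```

Since annoy_open_index() also accepts an explicit metadata_path, sidecars stored elsewhere would need a different lookup rule than the one assumed here.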
Recommended Sharing Pattern
For practical collaboration, a good pattern is:
- build the index with annoy_build_bigmatrix()
- keep the generated .ann file and .meta file together
- move or copy them as a pair
- reopen with annoy_open_index() or annoy_load_bigmatrix()
- run annoy_validate_index() before important analysis
- only trust the index for downstream search once validation passes
If your larger workflow depends on file-backed bigmemory
data, keep the descriptor files alongside the matrices they describe as
well.
Recap
bigANNOY v3 makes persisted indexes safer to reuse and
share by giving them:
- a sidecar metadata file
- a stable index identifier
- recorded file signatures and build settings
- explicit validation with strict and non-strict modes
The practical takeaway is simple: treat the .ann file
and the .meta file as a pair, reopen them intentionally,
and validate before you trust them.