File-Backed bigmemory Workflows
Source: vignettes/file-backed-bigmemory-workflows.Rmd
One of the main goals of bigANNOY is to work comfortably
with bigmemory data that already lives on disk. Instead of
forcing a large reference matrix through dense in-memory copies, the
package can build and query Annoy indexes directly from file-backed
big.matrix objects and their descriptors.
This vignette focuses on the most common disk-oriented workflows:
- building from a file-backed reference matrix
- querying with descriptor objects and descriptor file paths
- streaming neighbour results into file-backed destination matrices
- working with separated-column big.matrix query layouts
Create a Small File-Backed Workspace
For reproducibility, we will create all backing files inside a temporary directory. In real work this would usually be a project directory or a shared data location.
library(bigANNOY)
library(bigmemory)
workspace_dir <- tempfile("bigannoy-filebacked-")
dir.create(workspace_dir, recursive = TRUE, showWarnings = FALSE)
make_filebacked_matrix <- function(values, type, backingpath, name) {
  bm <- filebacked.big.matrix(
    nrow = nrow(values),
    ncol = ncol(values),
    type = type,
    backingfile = sprintf("%s.bin", name),
    descriptorfile = sprintf("%s.desc", name),
    backingpath = backingpath
  )
  bm[,] <- values
  bm
}
Build a File-Backed Reference Matrix
We will create a reference dataset and store it in a file-backed
big.matrix. The corresponding descriptor file is what lets
later R sessions reattach to the same on-disk data.
ref_dense <- matrix(
c(
0.0, 0.0,
5.0, 0.0,
0.0, 5.0,
5.0, 5.0,
9.0, 9.0
),
ncol = 2,
byrow = TRUE
)
ref_fb <- make_filebacked_matrix(
values = ref_dense,
type = "double",
backingpath = workspace_dir,
name = "ref"
)
ref_desc <- describe(ref_fb)
ref_desc_path <- file.path(workspace_dir, "ref.desc")
file.exists(ref_desc_path)
#> [1] TRUE
dim(ref_fb)
#> [1] 5 2
At this point we have:
- a file-backed data file at ref.bin
- a descriptor file at ref.desc
- a big.matrix object currently attached in this R session
Build an Annoy Index from a Descriptor Path
The simplest persisted workflow is to build directly from the
descriptor file path instead of from the live big.matrix
object. That mirrors how later sessions typically work.
index_path <- file.path(workspace_dir, "ref.ann")
index <- annoy_build_bigmatrix(
x = ref_desc_path,
path = index_path,
n_trees = 25L,
metric = "euclidean",
seed = 99L,
load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmpv5dHE7/bigannoy-filebacked-31fc114bd668/ref.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//Rtmpv5dHE7/bigannoy-filebacked-31fc114bd668/ref.ann.meta
#> index_id: annoy-20260327005522-c2eab887babb
#> metric: euclidean
#> trees: 25
#> items: 5
#> dimension: 2
#> build_seed: 99
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 2496
#> file_md5: c2eab887babb44656824ba64e545d6c9
#> prefault: FALSE
This pattern is useful because the build call no longer depends on a particular in-memory object being alive. As long as the descriptor can be reattached, the reference matrix can be used.
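To see why a descriptor path is enough, the following sketch uses only bigmemory to create a tiny file-backed matrix, drop the live handle, and reattach from the .desc file, much as a later session would. The directory and the file names demo.bin and demo.desc are illustrative choices, not part of bigANNOY.

```r
library(bigmemory)

# Illustrative workspace; a real project would use a stable directory.
demo_dir <- tempfile("descriptor-demo-")
dir.create(demo_dir)

values <- matrix(c(0, 0, 5, 0, 0, 5), ncol = 2, byrow = TRUE)

bm <- filebacked.big.matrix(
  nrow = nrow(values),
  ncol = ncol(values),
  type = "double",
  backingfile = "demo.bin",
  descriptorfile = "demo.desc",
  backingpath = demo_dir
)
bm[, ] <- values

# Drop the live handle, as if the original session had ended.
rm(bm)
invisible(gc())

# Reattach using nothing but the descriptor file path.
reattached <- attach.big.matrix(file.path(demo_dir, "demo.desc"))
reattached_values <- reattached[, ]
```

If the backing file and descriptor are moved together, the same call should work from the new location, which is why the two files are best kept side by side.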
Accepted File-Oriented Input Forms
For x, query, xpIndex, and xpDistance, bigANNOY accepts several
bigmemory-oriented forms:
- a live big.matrix
- an external pointer to a big.matrix
- a big.matrix.descriptor object
- a descriptor file path
For queries only, a dense numeric matrix is also accepted.
That flexibility matters most in persisted workflows where one part of the pipeline writes descriptors and another part reattaches them later.
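As a rough illustration of what these four forms have in common, the sketch below resolves each of them to a live big.matrix using bigmemory alone. The helper as_live_big_matrix() is hypothetical and is not how bigANNOY is known to handle its inputs; it only shows that every form ultimately points at the same data.

```r
library(bigmemory)

# Hypothetical helper (not part of bigANNOY): turn any of the four
# bigmemory-oriented forms into a live big.matrix.
as_live_big_matrix <- function(x) {
  if (inherits(x, "big.matrix")) {
    return(x)                                        # already live
  }
  if (inherits(x, "big.matrix.descriptor")) {
    return(attach.big.matrix(x))                     # descriptor object
  }
  if (is.character(x) && length(x) == 1L) {
    return(attach.big.matrix(x))                     # descriptor file path
  }
  if (base::typeof(x) == "externalptr") {
    return(methods::new("big.matrix", address = x))  # external pointer
  }
  stop("unsupported input form")
}

# Illustrative file-backed matrix to resolve in four different ways.
demo_dir <- tempfile("forms-demo-")
dir.create(demo_dir)
bm <- filebacked.big.matrix(
  nrow = 2, ncol = 2, type = "double",
  backingfile = "forms.bin",
  descriptorfile = "forms.desc",
  backingpath = demo_dir
)
bm[, ] <- matrix(c(0.2, 0.1, 4.7, 5.1), ncol = 2, byrow = TRUE)

forms <- list(
  live = bm,
  pointer = bm@address,
  descriptor = describe(bm),
  path = file.path(demo_dir, "forms.desc")
)
resolved <- lapply(forms, as_live_big_matrix)

# Every form should expose the same values as the original matrix.
all_same <- all(vapply(
  resolved,
  function(m) identical(m[, ], bm[, ]),
  logical(1)
))
```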
Query with a File-Backed big.matrix
Now we will create a file-backed query matrix and search the persisted Annoy index against it.
query_dense <- matrix(
c(
0.2, 0.1,
4.7, 5.1
),
ncol = 2,
byrow = TRUE
)
query_fb <- make_filebacked_matrix(
values = query_dense,
type = "double",
backingpath = workspace_dir,
name = "query"
)
query_result_big <- annoy_search_bigmatrix(
index,
query = query_fb,
k = 2L,
search_k = 100L
)
query_result_big$index
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
round(query_result_big$distance, 3)
#> [,1] [,2]
#> [1,] 0.224 4.801
#> [2,] 0.316 4.701
The query matrix itself is file-backed, but the search call looks the
same as it would for an in-memory big.matrix.
Query with a Descriptor Object and a Descriptor Path
The same persisted query data can be supplied through its descriptor object or through the descriptor file path. This is often the most convenient way to reattach query data across sessions.
query_desc <- describe(query_fb)
query_desc_path <- file.path(workspace_dir, "query.desc")
query_result_desc <- annoy_search_bigmatrix(
index,
query = query_desc,
k = 2L,
search_k = 100L
)
query_result_path <- annoy_search_bigmatrix(
index,
query = query_desc_path,
k = 2L,
search_k = 100L
)
query_result_desc$index
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
query_result_path$index
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
These should match the result obtained from the live
big.matrix query.
Stream Results into File-Backed Destination Matrices
Large search results can be expensive to keep in ordinary R memory.
To avoid that, bigANNOY can stream neighbour ids and
distances directly into destination big.matrix objects.
For file-backed workflows, this means you can keep both the inputs and the outputs on disk.
index_store <- filebacked.big.matrix(
nrow = nrow(query_dense),
ncol = 2L,
type = "integer",
backingfile = "nn_index.bin",
descriptorfile = "nn_index.desc",
backingpath = workspace_dir
)
distance_store <- filebacked.big.matrix(
nrow = nrow(query_dense),
ncol = 2L,
type = "double",
backingfile = "nn_distance.bin",
descriptorfile = "nn_distance.desc",
backingpath = workspace_dir
)
streamed_result <- annoy_search_bigmatrix(
index,
query = query_desc,
k = 2L,
xpIndex = describe(index_store),
xpDistance = file.path(workspace_dir, "nn_distance.desc")
)
bigmemory::as.matrix(index_store)
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
round(bigmemory::as.matrix(distance_store), 3)
#> [,1] [,2]
#> [1,] 0.224 4.801
#> [2,] 0.316 4.701
The important practical details are:
- xpIndex must be integer-compatible
- xpDistance must be double-compatible
- both destination matrices must have shape n_query x k
- xpDistance can only be supplied when xpIndex is also supplied
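The constraints above can be checked before launching a long search. The sketch below is a hypothetical pre-flight helper, check_destinations(), built on bigmemory's typeof() and dim() methods. It assumes that "integer-compatible" and "double-compatible" mean big.matrix types "integer" and "double" respectively, which may be stricter than what bigANNOY actually requires.

```r
library(bigmemory)

# Hypothetical pre-flight check (not part of bigANNOY).
# Assumption: destinations are big.matrix objects of type "integer"
# (ids) and "double" (distances); bigANNOY may accept other types.
check_destinations <- function(n_query, k,
                               index_store = NULL,
                               distance_store = NULL) {
  if (is.null(index_store) && !is.null(distance_store)) {
    stop("a distance destination requires an index destination")
  }
  expected <- c(n_query, k)
  if (!is.null(index_store)) {
    stopifnot(
      identical(typeof(index_store), "integer"),
      all(dim(index_store) == expected)
    )
  }
  if (!is.null(distance_store)) {
    stopifnot(
      identical(typeof(distance_store), "double"),
      all(dim(distance_store) == expected)
    )
  }
  invisible(TRUE)
}

# In-memory destinations keep the example short; file-backed ones
# are checked in exactly the same way.
idx_ok <- big.matrix(nrow = 2, ncol = 2, type = "integer")
dist_ok <- big.matrix(nrow = 2, ncol = 2, type = "double")

ok <- check_destinations(2, 2, idx_ok, dist_ok)

# Supplying only the distance destination is rejected.
bad <- try(check_destinations(2, 2, distance_store = dist_ok), silent = TRUE)
```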
Reattach the Output Files Later
Because the result matrices are file-backed, they can be reattached
later in the same way as any other bigmemory artifact.
index_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_index.desc"))
distance_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_distance.desc"))
bigmemory::as.matrix(index_store_again)
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
round(bigmemory::as.matrix(distance_store_again), 3)
#> [,1] [,2]
#> [1,] 0.224 4.801
#> [2,] 0.316 4.701
That is useful in longer pipelines where one step performs ANN search and a later step consumes the neighbour graph or distance matrix.
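As one example of such a later step, the sketch below turns an n_query x k neighbour matrix into a long edge list in base R. The values are copied from the output shown above; in a real pipeline they would come from the reattached matrices, and a very large result would more sensibly be converted in row blocks. The column names are illustrative.

```r
# Neighbour ids and distances as printed above: one row per query,
# one column per neighbour rank.
nn_index <- matrix(c(1L, 2L,
                     4L, 3L), ncol = 2, byrow = TRUE)
nn_distance <- matrix(c(0.224, 4.801,
                        0.316, 4.701), ncol = 2, byrow = TRUE)

n_query <- nrow(nn_index)
k <- ncol(nn_index)

# as.vector() walks the matrices column by column, so the query id
# cycles fastest and the rank changes once per column.
edges <- data.frame(
  query = rep(seq_len(n_query), times = k),
  rank = rep(seq_len(k), each = n_query),
  neighbour = as.vector(nn_index),
  distance = as.vector(nn_distance)
)

# Sort so that each query's neighbours appear together, nearest first.
edges <- edges[order(edges$query, edges$rank), ]
rownames(edges) <- NULL
```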
Separated-Column Query Matrices
bigANNOY also supports separated-column big.matrix layouts, in which
bigmemory allocates each column separately instead of storing the
whole matrix in one contiguous block. These are not necessarily
file-backed, but they are common in bigmemory workflows, so it is
worth knowing that they can be queried in the same way.
query_sep <- big.matrix(
nrow = nrow(query_dense),
ncol = ncol(query_dense),
type = "double",
separated = TRUE
)
query_sep[,] <- query_dense
sep_result <- annoy_search_bigmatrix(
index,
query = describe(query_sep),
k = 2L,
search_k = 100L
)
sep_result$index
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 4 3
round(sep_result$distance, 3)
#> [,1] [,2]
#> [1,] 0.224 4.801
#> [2,] 0.316 4.701
For the same query values, the separated-column result should match the ordinary file-backed query result.
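The layout difference is invisible at the R level, which the following bigmemory-only sketch tries to make concrete. It builds one contiguous and one separated matrix from the same values and inspects them with is.separated(); the object names are illustrative.

```r
library(bigmemory)

values <- matrix(c(0.2, 0.1, 4.7, 5.1), ncol = 2, byrow = TRUE)

# Same shape and type, different internal layout.
contiguous <- big.matrix(nrow = 2, ncol = 2, type = "double")
separated <- big.matrix(nrow = 2, ncol = 2, type = "double",
                        separated = TRUE)
contiguous[, ] <- values
separated[, ] <- values

# Only the layout flag differs ...
layout_flags <- c(
  contiguous = is.separated(contiguous),
  separated = is.separated(separated)
)

# ... while ordinary indexing returns the same values from both.
same_values <- isTRUE(all.equal(contiguous[, ], separated[, ]))
```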
Persisted Reference, Persisted Index, Persisted Outputs
Taken together, the main file-backed pattern looks like this:
- store the reference data in a file-backed big.matrix
- keep the descriptor alongside the backing file
- build the Annoy index from the descriptor path
- query using either a live big.matrix, a descriptor object, or a descriptor path
- write neighbour results into file-backed destination matrices when result size matters
This is often the most practical way to use bigANNOY in
large-data settings, because every major artifact in the workflow can be
reopened later.
Practical Tips
- Keep descriptor files with their corresponding backing files.
- Keep the .ann file with its .meta sidecar file.
- Use descriptor paths when you want to decouple one R session from another.
- Use streamed outputs when n_query x k is too large to hold comfortably in ordinary R matrices.
- Use the lifecycle helpers from the persistence vignette when you want to reopen and validate the Annoy index itself across sessions.
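A quick size estimate can help decide whether streamed outputs are worth setting up. The sketch below assumes 4-byte integer ids and 8-byte double distances, which matches the "integer" and "double" destination types used earlier; the helper result_bytes() and the workload of ten million queries with 100 neighbours each are purely illustrative.

```r
# Hypothetical helper: bytes needed for n_query x k results, assuming
# 4-byte integer ids and 8-byte double distances.
result_bytes <- function(n_query, k) {
  c(
    index = n_query * k * 4,
    distance = n_query * k * 8
  )
}

# Illustrative workload: 10 million queries, 100 neighbours each.
est <- result_bytes(n_query = 1e7, k = 100)

# Convert to GiB for readability.
est_gib <- round(est / 1024^3, 2)
```

For this workload the ids alone come to roughly 3.7 GiB and the distances to roughly 7.5 GiB, before counting any temporary copies R might make, so file-backed destinations are the safer choice.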
Recap
This vignette covered the main bigmemory persistence
features in bigANNOY:
- file-backed reference matrices
- descriptor-object and descriptor-path queries
- streamed file-backed outputs
- reattachment of persisted outputs
- separated-column query support
The natural next vignette after this one is Benchmarking Recall and Latency, which shows how to evaluate these workflows against runtime and quality targets.