Prepared References for Repeated Exact Search
Source: vignettes/bigknn-prepared-search.Rmd

Prepared references let bigKNN cache metric-specific
information about a fixed reference matrix and reuse it across later
exact searches. They are the right tool when the reference data stays
put but queries arrive in batches over time.
This article walks through that pattern end to end:
- build a file-backed reference matrix
- prepare it once for cosine distance
- reuse the prepared object across multiple query batches
- stream prepared results into destination big.matrix objects
- persist the prepared cache to disk and reload it later
When prepared references help
Prepared references are most useful when:
- the reference matrix stays fixed
- you need to answer many exact query batches
- you want to persist the cache between sessions
They do not change the search result. The advantage is that repeated searches can reuse cached row-wise quantities instead of recomputing them every time.
Build a File-Backed Reference
For this vignette we will use a file-backed big.matrix,
because persisted prepared caches are easiest to demonstrate when the
reference can be reattached through files on disk.
scratch_dir <- file.path(tempdir(), "bigknn-prepared-search")
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)
reference_points <- data.frame(
id = paste0("r", 1:8),
x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)
reference <- filebacked.big.matrix(
nrow = nrow(reference_points),
ncol = 3,
type = "double",
backingfile = "reference.bin",
descriptorfile = "reference.desc",
backingpath = scratch_dir
)
reference[,] <- as.matrix(reference_points[c("x1", "x2", "x3")])
query_batch_a <- matrix(
c(1.1, 1.2, 0.5,
2.7, 2.2, 1.4),
ncol = 3,
byrow = TRUE
)
query_batch_b <- matrix(
c(3.6, 3.1, 1.9,
1.5, 1.8, 0.8),
ncol = 3,
byrow = TRUE
)
query_ids_a <- c("a1", "a2")
query_ids_b <- c("b1", "b2")
reference_points
#> id x1 x2 x3
#> 1 r1 1 1 0.5
#> 2 r2 1 2 0.5
#> 3 r3 2 1 1.0
#> 4 r4 2 2 1.0
#> 5 r5 3 2 1.5
#> 6 r6 3 3 1.5
#> 7 r7 4 3 2.0
#> 8 r8 4 4 2.5

All rows are non-zero, which matters because cosine distance requires non-zero reference and query vectors.
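You can confirm this requirement with a quick base-R check before preparing a cosine reference. This sketch rebuilds the same eight reference rows as a plain matrix so it runs standalone:

```r
# Cosine distance is undefined for zero-norm vectors, so check the rows
# before preparing a cosine reference. Plain base R, using the same
# values as reference_points above.
ref <- matrix(
  c(1, 1, 0.5,
    1, 2, 0.5,
    2, 1, 1.0,
    2, 2, 1.0,
    3, 2, 1.5,
    3, 3, 1.5,
    4, 3, 2.0,
    4, 4, 2.5),
  ncol = 3, byrow = TRUE
)
row_norms <- sqrt(rowSums(ref^2))
any(row_norms == 0)
#> [1] FALSE
```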
Building a prepared reference with knn_prepare_bigmatrix()
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
prepared
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE

Internally, a prepared object stores:
- the external pointer to the reference matrix
- the chosen metric
- a metric-specific numeric row_cache
- cached dimensions and execution metadata
The print method keeps that summary compact:
summary(prepared)
#> $metric
#> [1] "cosine"
#>
#> $block_size
#> [1] 1024
#>
#> $n_ref
#> [1] 8
#>
#> $n_col
#> [1] 3
#>
#> $validated
#> [1] TRUE
#>
#> $cache_path
#> NULL
length(prepared$row_cache)
#> [1] 8
head(prepared$row_cache, 4)
#> [1] 1.500000 2.291288 2.449490 3.000000

For cosine distance, row_cache contains row-wise
quantities that are reused during later searches. In normal workflows
you rarely need to manipulate it directly; it is included here so you
can see that a prepared object is more than just a wrapper around the
original big.matrix.
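The printed cache values happen to match the row-wise Euclidean norms of the reference, which you can reproduce in base R. This is an observation about the output above, not a documented guarantee of the cache layout:

```r
# Row-wise L2 norms of the first four reference rows; these match the
# head(prepared$row_cache, 4) output shown above.
ref <- matrix(
  c(1, 1, 0.5,
    1, 2, 0.5,
    2, 1, 1.0,
    2, 2, 1.0),
  ncol = 3, byrow = TRUE
)
round(sqrt(rowSums(ref^2)), 6)
#> [1] 1.500000 2.291288 2.449490 3.000000
```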
Reusing it with knn_search_prepared()
batch_a_result <- knn_search_prepared(
prepared,
query = query_batch_a,
k = 2,
exclude_self = FALSE
)
batch_b_result <- knn_search_prepared(
prepared,
query = query_batch_b,
k = 2,
exclude_self = FALSE
)
batch_a_result
#> <bigknn_knn_result>
#> metric: cosine
#> k: 2
#> queries: 2
#> references: 8
#> backend: bruteforce
knn_table(batch_a_result, query_ids = query_ids_a, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 a1 1 r6 0.00172560
#> 2 a1 2 r1 0.00172560
#> 3 a2 1 r7 0.00069773
#> 4 a2 2 r5 0.00399290
knn_table(batch_b_result, query_ids = query_ids_b, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 b1 1 r7 0.0019579
#> 2 b1 2 r8 0.0029916
#> 3 b2 1 r6 0.0037227
#> 4 b2 2 r1 0.0037227

The result contract is the same as knn_bigmatrix(). The
difference is that the reference preparation step has already been done,
so you can reuse the same prepared object across many query
batches.
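As a sanity check on those numbers, the reported distances are consistent with cosine distance defined as one minus cosine similarity. A base-R recomputation for query a1 against its nearest neighbor r6, with values taken from the tables above (the helper function here is illustrative, not part of the package):

```r
# 1 - cosine similarity, computed directly in base R.
cosine_distance <- function(q, r) {
  1 - sum(q * r) / (sqrt(sum(q^2)) * sqrt(sum(r^2)))
}

# Query a1 = (1.1, 1.2, 0.5) and reference r6 = (3, 3, 1.5).
round(cosine_distance(c(1.1, 1.2, 0.5), c(3, 3, 1.5)), 7)
#> [1] 0.0017256
```

This agrees with the distance reported for a1's nearest neighbor in the table above.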
To make that explicit, we can compare a prepared search with the one-shot API:
direct_batch_a <- knn_bigmatrix(
reference,
query = query_batch_a,
k = 2,
metric = "cosine",
exclude_self = FALSE
)
identical(batch_a_result$index, direct_batch_a$index)
#> [1] TRUE
all.equal(batch_a_result$distance, direct_batch_a$distance)
#> [1] TRUE

Prepared search is therefore an ergonomics and performance feature, not a different search algorithm.
Streaming prepared results with knn_search_stream_prepared()
If you want the prepared search to write directly into destination
big.matrix objects, use
knn_search_stream_prepared(). This is helpful when the
query set is larger or when you want to keep results in shared-memory or
file-backed structures instead of dense R matrices.
index_store <- big.matrix(nrow(query_batch_b), 2, type = "integer")
distance_store <- big.matrix(nrow(query_batch_b), 2, type = "double")
streamed_batch_b <- knn_search_stream_prepared(
prepared,
query = query_batch_b,
xpIndex = index_store,
xpDistance = distance_store,
k = 2,
exclude_self = FALSE
)
bigmemory::as.matrix(streamed_batch_b$index)
#> [,1] [,2]
#> [1,] 7 8
#> [2,] 6 1
round(bigmemory::as.matrix(streamed_batch_b$distance), 6)
#> [,1] [,2]
#> [1,] 0.001958 0.002992
#> [2,] 0.003723 0.003723
all.equal(bigmemory::as.matrix(streamed_batch_b$distance), batch_b_result$distance)
#> [1] TRUE

The neighbor indices and distances are the same as the in-memory prepared search; the only difference is where the results land.
Persisting caches with cache_path
Prepared references can be serialized with cache_path,
which is useful when a project repeatedly opens the same file-backed
reference over many sessions.
cache_path <- file.path(scratch_dir, "prepared-cosine-cache.rds")
prepared_cached <- knn_prepare_bigmatrix(
reference,
metric = "cosine",
cache_path = cache_path
)
prepared_cached
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE
#> cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpiUFJSb/bigknn-prepared-search/prepared-cosine-cache.rds
file.exists(cache_path)
#> [1] TRUE

Persisted prepared references are especially helpful for long-running projects and reproducible pipelines.
Reloading with knn_load_prepared()
loaded <- knn_load_prepared(cache_path)
loaded
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE
#> cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpiUFJSb/bigknn-prepared-search/prepared-cosine-cache.rds

knn_load_prepared() restores the cached metadata and
reattaches the underlying big.matrix through its stored
descriptor. That means the prepared cache is tied to the original
reference backing files: if those files move or disappear, the cache can
no longer be reattached.
Validating with knn_validate_prepared()
Validation is usually worth calling after loading a cache from disk, or any time you want to confirm that the descriptor, cached dimensions, and row cache still match the underlying reference.
isTRUE(knn_validate_prepared(loaded))
#> [1] TRUE

Once the cache has been loaded and validated, it behaves like any other prepared reference:
loaded_batch_b <- knn_search_prepared(
loaded,
query = query_batch_b,
k = 2,
exclude_self = FALSE
)
identical(loaded_batch_b$index, batch_b_result$index)
#> [1] TRUE
all.equal(loaded_batch_b$distance, batch_b_result$distance)
#> [1] TRUE

Common failure modes and how to avoid them
- Reusing a cache after the reference data changed: rebuild the prepared object whenever the underlying reference matrix is modified.
- Missing or moved backing files: persisted caches rely on the stored big.matrix descriptor, so the reference files need to remain accessible.
- Zero-norm rows with cosine distance: keep validate = TRUE when preparing cosine references so incompatible rows are caught early.
- Overusing one-shot search: if the reference is fixed and many query batches are coming, switch from repeated knn_bigmatrix() calls to knn_prepare_bigmatrix() plus knn_search_prepared().
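These guardrails can be bundled into one small helper. A hypothetical sketch: the helper name and its fallback logic are illustrative and not part of the package API; only the knn_prepare_bigmatrix(), knn_load_prepared(), and knn_validate_prepared() calls come from this vignette:

```r
# Hypothetical helper: reuse a persisted prepared reference when it is
# present and still valid, otherwise rebuild it and refresh the cache
# file. The control flow is illustrative, not a bigKNN function.
prepare_or_load <- function(reference, cache_path, metric = "cosine") {
  if (file.exists(cache_path)) {
    loaded <- tryCatch(knn_load_prepared(cache_path), error = function(e) NULL)
    if (!is.null(loaded) && isTRUE(knn_validate_prepared(loaded))) {
      return(loaded)
    }
  }
  # Missing, stale, or invalid cache: prepare again and persist anew.
  knn_prepare_bigmatrix(reference, metric = metric, cache_path = cache_path)
}
```

Calling prepare_or_load(reference, cache_path) at the top of a pipeline then gives you a valid prepared object whether or not an earlier session left a usable cache behind.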
Prepared references are a small API feature with a big practical payoff: you do the setup work once, and then exact search against the same reference becomes easier to repeat, easier to stream, and easier to persist.