Prepared references let bigKNN cache metric-specific information about a fixed reference matrix and reuse it across later exact searches. They are the right tool when the reference data stays put but queries arrive in batches over time.

This article walks through that pattern end to end:

  • build a file-backed reference matrix
  • prepare it once for cosine distance
  • reuse the prepared object across multiple query batches
  • stream prepared results into destination big.matrix objects
  • persist the prepared cache to disk and reload it later

When prepared references help

Prepared references are most useful when:

  • the reference matrix stays fixed
  • you need to answer many exact query batches
  • you want to persist the cache between sessions

They do not change the search result. The advantage is that repeated searches can reuse cached row-wise quantities instead of recomputing them every time.
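To make the idea concrete, here is a minimal base-R sketch of the caching pattern (not bigKNN code): the row norms of a fixed reference are computed once and reused for every query batch, which is exactly the kind of row-wise quantity a prepared reference stores for cosine distance.

```r
set.seed(1)
ref <- matrix(runif(8 * 3), nrow = 8)

# Computed once, reused for every batch -- analogous to a prepared row cache.
ref_norms <- sqrt(rowSums(ref^2))

cosine_distances <- function(query_batch, ref, ref_norms) {
  query_norms <- sqrt(rowSums(query_batch^2))
  sims <- (query_batch %*% t(ref)) / outer(query_norms, ref_norms)
  1 - sims  # one row per query, one column per reference
}

batch1 <- matrix(runif(2 * 3), nrow = 2)
d <- cosine_distances(batch1, ref, ref_norms)  # reuses ref_norms, no recomputation
```

Every subsequent batch pays only for its own query norms and the dot products; the reference-side work is amortized across all batches.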

Build a file-backed reference

For this vignette we will use a file-backed big.matrix, because persisted prepared caches are easiest to demonstrate when the reference can be reattached through files on disk.

scratch_dir <- file.path(tempdir(), "bigknn-prepared-search")
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)

reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
  x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
  x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)

reference <- filebacked.big.matrix(
  nrow = nrow(reference_points),
  ncol = 3,
  type = "double",
  backingfile = "reference.bin",
  descriptorfile = "reference.desc",
  backingpath = scratch_dir
)

reference[,] <- as.matrix(reference_points[c("x1", "x2", "x3")])

query_batch_a <- matrix(
  c(1.1, 1.2, 0.5,
    2.7, 2.2, 1.4),
  ncol = 3,
  byrow = TRUE
)

query_batch_b <- matrix(
  c(3.6, 3.1, 1.9,
    1.5, 1.8, 0.8),
  ncol = 3,
  byrow = TRUE
)

query_ids_a <- c("a1", "a2")
query_ids_b <- c("b1", "b2")

reference_points
#>   id x1 x2  x3
#> 1 r1  1  1 0.5
#> 2 r2  1  2 0.5
#> 3 r3  2  1 1.0
#> 4 r4  2  2 1.0
#> 5 r5  3  2 1.5
#> 6 r6  3  3 1.5
#> 7 r7  4  3 2.0
#> 8 r8  4  4 2.5

No row is the zero vector, which matters because cosine distance is undefined for zero-norm reference or query vectors.
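Before preparing a cosine reference of your own, it can be worth checking for zero rows explicitly. This is a plain base-R check, not part of the bigKNN API:

```r
# A cosine reference must contain no all-zero rows.
row_norms <- sqrt(rowSums(reference[,]^2))
stopifnot(all(row_norms > 0))
```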

Building a prepared reference with knn_prepare_bigmatrix()

prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
prepared
#> <bigknn_prepared>
#>   metric: cosine
#>   block_size: 1024
#>   shape: 8 x 3
#>   validated: TRUE

Internally, a prepared object stores:

  • the external pointer to the reference matrix
  • the chosen metric
  • a metric-specific numeric row_cache
  • cached dimensions and execution metadata

The summary() method exposes the same information as a plain list:

summary(prepared)
#> $metric
#> [1] "cosine"
#> 
#> $block_size
#> [1] 1024
#> 
#> $n_ref
#> [1] 8
#> 
#> $n_col
#> [1] 3
#> 
#> $validated
#> [1] TRUE
#> 
#> $cache_path
#> NULL
length(prepared$row_cache)
#> [1] 8
head(prepared$row_cache, 4)
#> [1] 1.500000 2.291288 2.449490 3.000000

For cosine distance, row_cache contains row-wise quantities that are reused during later searches. In normal workflows you rarely need to manipulate it directly; it is included here so you can see that a prepared object is more than just a wrapper around the original big.matrix.
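Judging from the values above, the cosine row cache in this build holds the Euclidean norm of each reference row: row 1 is (1, 1, 0.5), whose norm is sqrt(2.25) = 1.5. Treat this as an internal detail that may change between versions, but you can confirm it for the current object:

```r
# Compare the cached values against freshly computed row norms.
all.equal(prepared$row_cache, sqrt(rowSums(reference[,]^2)))
```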

Reusing the prepared object with knn_search_prepared()

batch_a_result <- knn_search_prepared(
  prepared,
  query = query_batch_a,
  k = 2,
  exclude_self = FALSE
)

batch_b_result <- knn_search_prepared(
  prepared,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)

batch_a_result
#> <bigknn_knn_result>
#>   metric: cosine
#>   k: 2
#>   queries: 2
#>   references: 8
#>   backend: bruteforce
knn_table(batch_a_result, query_ids = query_ids_a, ref_ids = reference_points$id)
#>   query rank neighbor   distance
#> 1    a1    1       r6 0.00172560
#> 2    a1    2       r1 0.00172560
#> 3    a2    1       r7 0.00069773
#> 4    a2    2       r5 0.00399290
knn_table(batch_b_result, query_ids = query_ids_b, ref_ids = reference_points$id)
#>   query rank neighbor  distance
#> 1    b1    1       r7 0.0019579
#> 2    b1    2       r8 0.0029916
#> 3    b2    1       r6 0.0037227
#> 4    b2    2       r1 0.0037227

The result contract is the same as knn_bigmatrix(). The difference is that the reference preparation step has already been done, so you can reuse the same prepared object across many query batches.

To make that explicit, we can compare a prepared search with the one-shot API:

direct_batch_a <- knn_bigmatrix(
  reference,
  query = query_batch_a,
  k = 2,
  metric = "cosine",
  exclude_self = FALSE
)

identical(batch_a_result$index, direct_batch_a$index)
#> [1] TRUE
all.equal(batch_a_result$distance, direct_batch_a$distance)
#> [1] TRUE

Prepared search is therefore an ergonomics and performance feature, not a different search algorithm.
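If you want to measure the payoff on your own data, a rough timing comparison is easy to sketch. The batch sizes below are hypothetical, and on a reference this small the difference will be negligible, so substitute a realistically sized matrix:

```r
batches <- replicate(50, matrix(runif(100 * 3), ncol = 3), simplify = FALSE)

# One-shot API: reference-side work is redone on every call.
t_direct <- system.time(
  for (q in batches) {
    knn_bigmatrix(reference, query = q, k = 2,
                  metric = "cosine", exclude_self = FALSE)
  }
)

# Prepared API: preparation happens once, outside the loop.
prep <- knn_prepare_bigmatrix(reference, metric = "cosine")
t_prepared <- system.time(
  for (q in batches) {
    knn_search_prepared(prep, query = q, k = 2, exclude_self = FALSE)
  }
)
```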

Streaming prepared results with knn_search_stream_prepared()

If you want the prepared search to write directly into destination big.matrix objects, use knn_search_stream_prepared(). This is helpful when the query set is larger or when you want to keep results in shared-memory or file-backed structures instead of dense R matrices.

index_store <- big.matrix(nrow(query_batch_b), 2, type = "integer")
distance_store <- big.matrix(nrow(query_batch_b), 2, type = "double")

streamed_batch_b <- knn_search_stream_prepared(
  prepared,
  query = query_batch_b,
  xpIndex = index_store,
  xpDistance = distance_store,
  k = 2,
  exclude_self = FALSE
)

bigmemory::as.matrix(streamed_batch_b$index)
#>      [,1] [,2]
#> [1,]    7    8
#> [2,]    6    1
round(bigmemory::as.matrix(streamed_batch_b$distance), 6)
#>          [,1]     [,2]
#> [1,] 0.001958 0.002992
#> [2,] 0.003723 0.003723
all.equal(bigmemory::as.matrix(streamed_batch_b$distance), batch_b_result$distance)
#> [1] TRUE

The neighbor indices and distances are the same as the in-memory prepared search; the only difference is where the results land.
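The destination matrices do not have to live in memory. A file-backed variant of the same call might look like this (the backing file names are illustrative):

```r
index_fb <- filebacked.big.matrix(
  nrow(query_batch_b), 2, type = "integer",
  backingfile = "batch-b-index.bin",
  descriptorfile = "batch-b-index.desc",
  backingpath = scratch_dir
)
distance_fb <- filebacked.big.matrix(
  nrow(query_batch_b), 2, type = "double",
  backingfile = "batch-b-distance.bin",
  descriptorfile = "batch-b-distance.desc",
  backingpath = scratch_dir
)

# Results are written straight into the file-backed destinations.
knn_search_stream_prepared(
  prepared,
  query = query_batch_b,
  xpIndex = index_fb,
  xpDistance = distance_fb,
  k = 2,
  exclude_self = FALSE
)
```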

Persisting caches with cache_path

Prepared references can be serialized with cache_path, which is useful when a project repeatedly opens the same file-backed reference over many sessions.

cache_path <- file.path(scratch_dir, "prepared-cosine-cache.rds")

prepared_cached <- knn_prepare_bigmatrix(
  reference,
  metric = "cosine",
  cache_path = cache_path
)

prepared_cached
#> <bigknn_prepared>
#>   metric: cosine
#>   block_size: 1024
#>   shape: 8 x 3
#>   validated: TRUE
#>   cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpiUFJSb/bigknn-prepared-search/prepared-cosine-cache.rds
file.exists(cache_path)
#> [1] TRUE

Persisted prepared references are especially helpful for long-running projects and reproducible pipelines.
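A common pipeline shape is prepare-if-missing: reuse the cache when it exists, build and persist it otherwise. A sketch, assuming the reference backing files are already in place:

```r
get_prepared <- function(reference, cache_path) {
  if (file.exists(cache_path)) {
    knn_load_prepared(cache_path)
  } else {
    knn_prepare_bigmatrix(reference, metric = "cosine", cache_path = cache_path)
  }
}

prepared <- get_prepared(reference, cache_path)
```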

Reloading with knn_load_prepared()

loaded <- knn_load_prepared(cache_path)
loaded
#> <bigknn_prepared>
#>   metric: cosine
#>   block_size: 1024
#>   shape: 8 x 3
#>   validated: TRUE
#>   cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpiUFJSb/bigknn-prepared-search/prepared-cosine-cache.rds

knn_load_prepared() restores the cached metadata and reattaches the underlying big.matrix through its stored descriptor. That means the prepared cache is tied to the original reference backing files: if those files move or disappear, the cache can no longer be reattached.

Validating with knn_validate_prepared()

It is usually worth validating after loading a cache from disk, or any time you want to confirm that the descriptor, cached dimensions, and row cache still match the underlying reference.

isTRUE(knn_validate_prepared(loaded))
#> [1] TRUE
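A defensive loading pattern rebuilds the cache whenever validation fails, for example after the reference was regenerated:

```r
loaded <- knn_load_prepared(cache_path)
if (!isTRUE(knn_validate_prepared(loaded))) {
  # Stale or mismatched cache: rebuild it and overwrite the file on disk.
  loaded <- knn_prepare_bigmatrix(reference, metric = "cosine",
                                  cache_path = cache_path)
}
```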

Once the cache has been loaded and validated, it behaves like any other prepared reference:

loaded_batch_b <- knn_search_prepared(
  loaded,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)

identical(loaded_batch_b$index, batch_b_result$index)
#> [1] TRUE
all.equal(loaded_batch_b$distance, batch_b_result$distance)
#> [1] TRUE

Common failure modes and how to avoid them

  • Reusing a cache after the reference data changed: rebuild the prepared object whenever the underlying reference matrix is modified.
  • Missing or moved backing files: persisted caches rely on the stored big.matrix descriptor, so the reference files need to remain accessible.
  • Zero-norm rows with cosine distance: keep validate = TRUE when preparing cosine references so incompatible rows are caught early.
  • Overusing one-shot search: if the reference is fixed and many query batches are coming, switch from repeated knn_bigmatrix() calls to knn_prepare_bigmatrix() plus knn_search_prepared().
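Putting those rules together, the recommended shape for a fixed reference with recurring query batches looks roughly like this (receive_next_batch() is a placeholder for however your batches arrive):

```r
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine",
                                  cache_path = cache_path)

repeat {
  batch <- receive_next_batch()  # placeholder: returns NULL when done
  if (is.null(batch)) break
  result <- knn_search_prepared(prepared, query = batch,
                                k = 2, exclude_self = FALSE)
  # ... consume result$index / result$distance ...
}
```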

Prepared references are a small API feature with a big practical payoff: you do the setup work once, and then exact search against the same reference becomes easier to repeat, easier to stream, and easier to persist.