Exact nearest-neighbour search and graph construction for bigmemory matrices
Frédéric Bertrand
The bigKNN package provides exact nearest-neighbour and radius-search routines specialised for bigmemory::big.matrix objects. It keeps the reference data in bigmemory storage, supports repeated-query workflows through prepared references, streams results into destination big.matrix objects when needed, and builds exact neighbour graphs directly from the same reference matrices.
In addition to exact k-NN and radius search, the package includes:
- prepared references for repeated exact search,
- execution plans for memory-aware block sizing and thread policy,
- streamed and resumable exact jobs for large outputs,
- exact kNN, mutual kNN, shared-nearest-neighbour, and radius graph builders,
- sparse-query support through the R API, and
- exact evaluation helpers for approximate-search candidate sets.
These workflows make bigKNN useful both as a standalone exact-search package and as an exact ground-truth backend for approximate methods implemented elsewhere.
Installation
The package is currently easiest to install from GitHub:
# install.packages("remotes")
remotes::install_github("fbertran/bigKNN")If you prefer a local source install, clone the repository and run:
Options
The package defines a small set of runtime options:
| Option | Default value | Description |
|---|---|---|
bigKNN.block_size |
1024L |
Default number of rows processed per query/reference block. |
bigKNN.progress |
FALSE |
Emit simple progress messages during long-running neighbour computations. |
bigKNN.num_threads |
1L |
Default thread request used by knn_plan_bigmatrix() when num_threads is not set explicitly. |
All options can be changed with options(). For example, options(bigKNN.block_size = 2048L) increases the default streaming/search block size.
Examples
The examples below use a small 2D reference so the returned neighbours are easy to inspect.
library(bigmemory)
library(bigKNN)
#> Error in `library()`:
#> ! there is no package called 'bigKNN'
reference <- as.big.matrix(matrix(
c(1, 1,
2, 1,
1, 2,
2, 2,
3, 2,
4, 3),
ncol = 2,
byrow = TRUE
))
query <- matrix(
c(1.2, 1.1,
2.8, 2.2),
ncol = 2,
byrow = TRUE
)Exact k-NN and radius search
knn_result <- knn_bigmatrix(
reference,
query = query,
k = 2,
exclude_self = FALSE
)
#> Error in `knn_bigmatrix()`:
#> ! could not find function "knn_bigmatrix"
knn_result$index
#> Error:
#> ! object 'knn_result' not found
round(knn_result$distance, 3)
#> Error:
#> ! object 'knn_result' not found
radius_result <- radius_bigmatrix(
reference,
query = query,
radius = 1.15,
exclude_self = FALSE
)
#> Error in `radius_bigmatrix()`:
#> ! could not find function "radius_bigmatrix"
radius_result$n_match
#> Error:
#> ! object 'radius_result' not found
radius_result$offset
#> Error:
#> ! object 'radius_result' not foundPrepared references, plans, and streamed output
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
#> Error in `knn_prepare_bigmatrix()`:
#> ! could not find function "knn_prepare_bigmatrix"
prepared_search <- knn_search_prepared(
prepared,
query = query,
k = 2,
exclude_self = FALSE
)
#> Error in `knn_search_prepared()`:
#> ! could not find function "knn_search_prepared"
prepared_search$index
#> Error:
#> ! object 'prepared_search' not found
round(prepared_search$distance, 4)
#> Error:
#> ! object 'prepared_search' not found
plan <- knn_plan_bigmatrix(
reference,
metric = "euclidean",
memory_budget = "64KB",
num_threads = 2L,
progress = FALSE
)
#> Error in `knn_plan_bigmatrix()`:
#> ! could not find function "knn_plan_bigmatrix"
plan
#> Error:
#> ! object 'plan' not found
index_store <- big.matrix(nrow(query), 2, type = "integer")
distance_store <- big.matrix(nrow(query), 2, type = "double")
knn_stream_bigmatrix(
reference,
query = query,
xpIndex = index_store,
xpDistance = distance_store,
k = 2,
plan = plan,
exclude_self = FALSE
)
#> Error in `knn_stream_bigmatrix()`:
#> ! could not find function "knn_stream_bigmatrix"
bigmemory::as.matrix(index_store)
#> [,1] [,2]
#> [1,] NA NA
#> [2,] NA NA
round(bigmemory::as.matrix(distance_store), 3)
#> [,1] [,2]
#> [1,] NA NA
#> [2,] NA NAExact graph construction
knn_graph_bigmatrix(
reference,
k = 1,
format = "edge_list",
symmetrize = "union",
include_distance = FALSE
)
#> Error in `knn_graph_bigmatrix()`:
#> ! could not find function "knn_graph_bigmatrix"Evaluation helpers for approximate search
bigKNN can also act as the exact baseline for approximate-search pipelines. The usual pattern is:
- compute exact neighbours on an evaluation subset with
knn_bigmatrix(), - compare candidate indices with
recall_against_exact(), and - rerank wider candidate pools exactly with
rerank_candidates_bigmatrix().
exact <- knn_bigmatrix(
reference,
query = query,
k = 2,
exclude_self = FALSE
)
#> Error in `knn_bigmatrix()`:
#> ! could not find function "knn_bigmatrix"
candidate_pool <- rbind(
c(6L, exact$index[1, ]),
c(1L, exact$index[2, ])
)
#> Error:
#> ! object 'exact' not found
recall_against_exact(exact, candidate_pool[, 1:2, drop = FALSE], k = 2)
#> Error in `recall_against_exact()`:
#> ! could not find function "recall_against_exact"
reranked <- rerank_candidates_bigmatrix(
reference,
query = query,
candidate_index = candidate_pool,
top_k = 2
)
#> Error in `rerank_candidates_bigmatrix()`:
#> ! could not find function "rerank_candidates_bigmatrix"
reranked$index
#> Error:
#> ! object 'reranked' not found
recall_against_exact(exact, reranked, k = 2)
#> Error in `recall_against_exact()`:
#> ! could not find function "recall_against_exact"Vignettes
The package now ships with focused vignettes for the main workflows:
bigknn-quickstartbigknn-prepared-searchbigknn-streaming-and-plansbigknn-graphsbigknn-evaluating-approximate-searchbigknn-resumable-jobs
Together they cover the basic exact-search API, repeated-query preparation, streamed outputs, exact graph construction, exact-vs-ANN evaluation, and checkpointed long-running jobs.