Skip to contents

Exact nearest-neighbour search and graph construction for bigmemory matrices

Frédéric Bertrand

The bigKNN package provides exact nearest-neighbour and radius-search routines specialised for bigmemory::big.matrix objects. It keeps the reference data in bigmemory storage, supports repeated-query workflows through prepared references, streams results into destination big.matrix objects when needed, and builds exact neighbour graphs directly from the same reference matrices.

In addition to exact k-NN and radius search, the package includes:

  • prepared references for repeated exact search,
  • execution plans for memory-aware block sizing and thread policy,
  • streamed and resumable exact jobs for large outputs,
  • exact kNN, mutual kNN, shared-nearest-neighbour, and radius graph builders,
  • sparse-query support through the R API, and
  • exact evaluation helpers for approximate-search candidate sets.

These workflows make bigKNN useful both as a standalone exact-search package and as an exact ground-truth backend for approximate methods implemented elsewhere.

Installation

The package is currently easiest to install from GitHub:

# install.packages("remotes")
remotes::install_github("fbertran/bigKNN")

If you prefer a local source install, clone the repository and run:

R CMD build bigKNN
R CMD INSTALL bigKNN_0.3.0.tar.gz

Options

The package defines a small set of runtime options:

Option Default value Description
bigKNN.block_size 1024L Default number of rows processed per query/reference block.
bigKNN.progress FALSE Emit simple progress messages during long-running neighbour computations.
bigKNN.num_threads 1L Default thread request used by knn_plan_bigmatrix() when num_threads is not set explicitly.

All options can be changed with options(). For example, options(bigKNN.block_size = 2048L) increases the default streaming/search block size.

Examples

The examples below use a small 2D reference so the returned neighbours are easy to inspect.

library(bigmemory)
library(bigKNN)
#> Error in `library()`:
#> ! there is no package called 'bigKNN'

reference <- as.big.matrix(matrix(
  c(1, 1,
    2, 1,
    1, 2,
    2, 2,
    3, 2,
    4, 3),
  ncol = 2,
  byrow = TRUE
))

query <- matrix(
  c(1.2, 1.1,
    2.8, 2.2),
  ncol = 2,
  byrow = TRUE
)
knn_result <- knn_bigmatrix(
  reference,
  query = query,
  k = 2,
  exclude_self = FALSE
)
#> Error in `knn_bigmatrix()`:
#> ! could not find function "knn_bigmatrix"

knn_result$index
#> Error:
#> ! object 'knn_result' not found
round(knn_result$distance, 3)
#> Error:
#> ! object 'knn_result' not found

radius_result <- radius_bigmatrix(
  reference,
  query = query,
  radius = 1.15,
  exclude_self = FALSE
)
#> Error in `radius_bigmatrix()`:
#> ! could not find function "radius_bigmatrix"

radius_result$n_match
#> Error:
#> ! object 'radius_result' not found
radius_result$offset
#> Error:
#> ! object 'radius_result' not found

Prepared references, plans, and streamed output

prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
#> Error in `knn_prepare_bigmatrix()`:
#> ! could not find function "knn_prepare_bigmatrix"
prepared_search <- knn_search_prepared(
  prepared,
  query = query,
  k = 2,
  exclude_self = FALSE
)
#> Error in `knn_search_prepared()`:
#> ! could not find function "knn_search_prepared"

prepared_search$index
#> Error:
#> ! object 'prepared_search' not found
round(prepared_search$distance, 4)
#> Error:
#> ! object 'prepared_search' not found

plan <- knn_plan_bigmatrix(
  reference,
  metric = "euclidean",
  memory_budget = "64KB",
  num_threads = 2L,
  progress = FALSE
)
#> Error in `knn_plan_bigmatrix()`:
#> ! could not find function "knn_plan_bigmatrix"

plan
#> Error:
#> ! object 'plan' not found

index_store <- big.matrix(nrow(query), 2, type = "integer")
distance_store <- big.matrix(nrow(query), 2, type = "double")

knn_stream_bigmatrix(
  reference,
  query = query,
  xpIndex = index_store,
  xpDistance = distance_store,
  k = 2,
  plan = plan,
  exclude_self = FALSE
)
#> Error in `knn_stream_bigmatrix()`:
#> ! could not find function "knn_stream_bigmatrix"

bigmemory::as.matrix(index_store)
#>      [,1] [,2]
#> [1,]   NA   NA
#> [2,]   NA   NA
round(bigmemory::as.matrix(distance_store), 3)
#>      [,1] [,2]
#> [1,]   NA   NA
#> [2,]   NA   NA

Exact graph construction

knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list",
  symmetrize = "union",
  include_distance = FALSE
)
#> Error in `knn_graph_bigmatrix()`:
#> ! could not find function "knn_graph_bigmatrix"

bigKNN can also act as the exact baseline for approximate-search pipelines. The usual pattern is:

  1. compute exact neighbours on an evaluation subset with knn_bigmatrix(),
  2. compare candidate indices with recall_against_exact(), and
  3. rerank wider candidate pools exactly with rerank_candidates_bigmatrix().
exact <- knn_bigmatrix(
  reference,
  query = query,
  k = 2,
  exclude_self = FALSE
)
#> Error in `knn_bigmatrix()`:
#> ! could not find function "knn_bigmatrix"

candidate_pool <- rbind(
  c(6L, exact$index[1, ]),
  c(1L, exact$index[2, ])
)
#> Error:
#> ! object 'exact' not found

recall_against_exact(exact, candidate_pool[, 1:2, drop = FALSE], k = 2)
#> Error in `recall_against_exact()`:
#> ! could not find function "recall_against_exact"

reranked <- rerank_candidates_bigmatrix(
  reference,
  query = query,
  candidate_index = candidate_pool,
  top_k = 2
)
#> Error in `rerank_candidates_bigmatrix()`:
#> ! could not find function "rerank_candidates_bigmatrix"

reranked$index
#> Error:
#> ! object 'reranked' not found
recall_against_exact(exact, reranked, k = 2)
#> Error in `recall_against_exact()`:
#> ! could not find function "recall_against_exact"

Vignettes

The package now ships with focused vignettes for the main workflows:

  • bigknn-quickstart
  • bigknn-prepared-search
  • bigknn-streaming-and-plans
  • bigknn-graphs
  • bigknn-evaluating-approximate-search
  • bigknn-resumable-jobs

Together they cover the basic exact-search API, repeated-query preparation, streamed outputs, exact graph construction, exact-vs-ANN evaluation, and checkpointed long-running jobs.