
bigKNN provides exact nearest-neighbour and radius-search routines that work directly on bigmemory::big.matrix references. This quickstart walks through a small end-to-end example: create a reference matrix, run exact k-nearest neighbour and radius queries, and interpret the objects that come back.

Create a Small Reference Matrix

The reference data for bigKNN lives in a bigmemory::big.matrix. For this quickstart we will use six points in two dimensions and keep separate labels so that the returned row indices are easy to read.

library(bigKNN)
library(bigmemory)

reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(1, 2, 1, 2, 3, 4),
  x2 = c(1, 1, 2, 2, 2, 3)
)

query_points <- data.frame(
  id = c("q1", "q2"),
  x1 = c(1.2, 2.8),
  x2 = c(1.1, 2.2)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])

reference_points
#>   id x1 x2
#> 1 p1  1  1
#> 2 p2  2  1
#> 3 p3  1  2
#> 4 p4  2  2
#> 5 p5  3  2
#> 6 p6  4  3
query_points
#>   id  x1  x2
#> 1 q1 1.2 1.1
#> 2 q2 2.8 2.2

In the examples below, reference is the big.matrix searched by bigKNN, while query_matrix is an ordinary dense R matrix. The same APIs also accept queries stored in another big.matrix, and the v3 search paths accept sparse query matrices as well.

Running a Self-Search

With query = NULL, knn_bigmatrix() performs a self-search. By default exclude_self = TRUE in that case, so each row does not return itself as its own nearest neighbour.

self_knn <- knn_bigmatrix(reference, k = 2)
self_knn
#> <bigknn_knn_result>
#>   metric: euclidean
#>   k: 2
#>   queries: 6
#>   references: 6
#>   backend: bruteforce

The raw result stores two n_query x k matrices:

  • index: 1-based row indices into the reference matrix
  • distance: the corresponding exact distances
self_knn$index
#>      [,1] [,2]
#> [1,]    2    3
#> [2,]    1    4
#> [3,]    1    4
#> [4,]    2    3
#> [5,]    4    2
#> [6,]    5    4
round(self_knn$distance, 3)
#>       [,1]  [,2]
#> [1,] 1.000 1.000
#> [2,] 1.000 1.000
#> [3,] 1.000 1.000
#> [4,] 1.000 1.000
#> [5,] 1.000 1.414
#> [6,] 1.414 2.236
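Because the search is exact, these distances can be cross-checked with base R's dist() on the same six points. This is a plain-R sanity check, not part of the bigKNN API:

```r
# Recompute the two smallest pairwise distances per row with base R.
pts <- cbind(x1 = c(1, 2, 1, 2, 3, 4),
             x2 = c(1, 1, 2, 2, 2, 3))
d <- as.matrix(dist(pts))   # full 6 x 6 Euclidean distance matrix
diag(d) <- Inf              # mimic exclude_self = TRUE
t(apply(d, 1, function(row) round(sort(row)[1:2], 3)))
```

The result matches round(self_knn$distance, 3) row for row.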

A small helper table makes the same result easier to read:

knn_table(self_knn, query_ids = reference_points$id, ref_ids = reference_points$id)
#>    query rank neighbor distance
#> 1     p1    1       p2    1.000
#> 2     p1    2       p3    1.000
#> 3     p2    1       p1    1.000
#> 4     p2    2       p4    1.000
#> 5     p3    1       p1    1.000
#> 6     p3    2       p4    1.000
#> 7     p4    1       p2    1.000
#> 8     p4    2       p3    1.000
#> 9     p5    1       p4    1.000
#> 10    p5    2       p2    1.414
#> 11    p6    1       p5    1.414
#> 12    p6    2       p4    2.236

Searching New Query Points

To search new observations against the same reference, pass them through the query argument. Here we ask for the three nearest reference rows for two new points.

query_knn <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  exclude_self = FALSE
)

query_knn
#> <bigknn_knn_result>
#>   metric: euclidean
#>   k: 3
#>   queries: 2
#>   references: 6
#>   backend: bruteforce
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
#>   query rank neighbor distance
#> 1    q1    1       p1   0.2236
#> 2    q1    2       p2   0.8062
#> 3    q1    3       p3   0.9220
#> 4    q2    1       p5   0.2828
#> 5    q2    2       p4   0.8246
#> 6    q2    3       p6   1.4422

The returned indices are still row numbers in reference. That makes the object compact and easy to use in later workflows, while a lookup table like reference_points can translate those indices into human-readable labels when needed.
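That translation is just vector indexing in base R. As a sketch, with the index values shown above written out literally:

```r
# query_knn$index from the example above, transcribed by hand.
idx <- rbind(c(1, 2, 3),   # q1: nearest three reference rows
             c(5, 4, 6))   # q2: nearest three reference rows
ids <- paste0("p", 1:6)    # reference_points$id

# Indexing a character vector by a matrix keeps the matrix shape.
matrix(ids[idx], nrow = nrow(idx))
#>      [,1] [,2] [,3]
#> [1,] "p1" "p2" "p3"
#> [2,] "p5" "p4" "p6"
```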

Radius Search

Radius search returns every neighbour within a fixed distance threshold instead of a fixed k. This is useful when the local neighbourhood size should vary by query.

radius_result <- radius_bigmatrix(
  reference,
  query = query_matrix,
  radius = 1.15,
  exclude_self = FALSE
)

radius_result
#> <bigknn_radius_result>
#>   metric: euclidean
#>   radius: 1.15
#>   queries: 2
#>   references: 6
#>   matches: 5
radius_result$n_match
#> [1] 3 2
radius_result$offset
#> [1] 1 4 6

radius_bigmatrix() uses a flattened output format:

  • index and distance hold all matches back-to-back
  • n_match gives the number of matches per query
  • offset tells you where each query’s slice starts and ends

For query i, the matching rows live in index[offset[i]:(offset[i + 1] - 1)], with the same slice in distance.
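The slicing rule can be exercised with plain vectors transcribed from the example output above (radius_slice() does the same arithmetic for you):

```r
# Flattened fields transcribed from radius_result above.
index    <- c(1, 2, 3, 5, 4)
distance <- c(0.2236, 0.8062, 0.9220, 0.2828, 0.8246)
offset   <- c(1, 4, 6)          # length n_query + 1

# Matches for query 2 via the offset arithmetic.
rows <- offset[2]:(offset[3] - 1)
data.frame(index = index[rows], distance = distance[rows])
#>   index distance
#> 1     5   0.2828
#> 2     4   0.8246
```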

radius_slice(radius_result, 1, reference_points$id)
#>   neighbor distance
#> 1       p1   0.2236
#> 2       p2   0.8062
#> 3       p3   0.9220
radius_slice(radius_result, 2, reference_points$id)
#>   neighbor distance
#> 1       p5   0.2828
#> 2       p4   0.8246

If you only need the counts, count_within_radius_bigmatrix() avoids returning the flattened neighbour vectors.

count_within_radius_bigmatrix(
  reference,
  query = query_matrix,
  radius = 1.15,
  exclude_self = FALSE
)
#> [1] 3 2
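As with the nearest-neighbour distances, the counts can be verified in base R (a plain-R cross-check, not a bigKNN call):

```r
ref <- cbind(c(1, 2, 1, 2, 3, 4), c(1, 1, 2, 2, 2, 3))
qry <- rbind(c(1.2, 1.1), c(2.8, 2.2))

# For each query row, count reference rows within the radius.
sapply(seq_len(nrow(qry)), function(i) {
  d <- sqrt(colSums((t(ref) - qry[i, ])^2))
  sum(d <= 1.15)
})
#> [1] 3 2
```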

Choosing a Metric

bigKNN currently supports three exact metrics:

  • "euclidean" for ordinary Euclidean distance
  • "sqeuclidean" for squared Euclidean distance
  • "cosine" for cosine distance, defined as 1 - cosine similarity

The squared Euclidean metric preserves the same neighbour ordering as ordinary Euclidean distance but reports squared values. Cosine distance can select a different neighbour because it compares direction rather than absolute position.
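The ordering claim is easy to check: squaring is monotone on non-negative distances, so ranking by squared distance never reorders neighbours. A quick plain-R illustration, using q1's six Euclidean distances to the reference rows:

```r
d <- c(0.2236, 0.8062, 0.9220, 1.2042, 2.0125, 3.3838)
stopifnot(identical(order(d), order(d^2)))   # same neighbour ranking
```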

metric_summary <- do.call(rbind, lapply(
  c("euclidean", "sqeuclidean", "cosine"),
  function(metric) {
    result <- knn_bigmatrix(
      reference,
      query = query_matrix,
      k = 1,
      metric = metric,
      exclude_self = FALSE
    )

    data.frame(
      metric = metric,
      query = query_points$id,
      nearest = reference_points$id[result$index[, 1]],
      distance = signif(result$distance[, 1], 4),
      row.names = NULL
    )
  }
))

metric_summary
#>        metric query nearest  distance
#> 1   euclidean    q1      p1 0.2236000
#> 2   euclidean    q2      p5 0.2828000
#> 3 sqeuclidean    q1      p1 0.0500000
#> 4 sqeuclidean    q2      p5 0.0800000
#> 5      cosine    q1      p1 0.0009438
#> 6      cosine    q2      p6 0.0002524

In this example, Euclidean and squared Euclidean agree on the nearest row for each query, while cosine distance picks p6 for q2 because p6 lies in almost the same direction from the origin, even though it is farther away. Cosine distance requires non-zero rows in both the reference and the query data.
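The cosine figures can be reproduced by hand from the stated definition, 1 - cosine similarity (plain R, using q1 and p1 from above):

```r
q1 <- c(1.2, 1.1)
p1 <- c(1, 1)

# 1 - (x . y) / (||x|| * ||y||)
cos_dist <- 1 - sum(q1 * p1) / sqrt(sum(q1^2) * sum(p1^2))
signif(cos_dist, 4)
#> [1] 0.0009438
```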

Where to Go Next

This vignette focused on the smallest useful workflow. For larger or repeated jobs, the next places to look are: