bigKNN provides exact nearest-neighbour and
radius-search routines that work directly on
bigmemory::big.matrix references. This quickstart walks
through a small end-to-end example: create a reference matrix, run exact
k-nearest neighbour and radius queries, and interpret the
objects that come back.
Create a Small Reference Matrix
The reference data for bigKNN lives in a
bigmemory::big.matrix. For this quickstart we will use six
points in two dimensions and keep separate labels so that the returned
row indices are easy to read.
reference_points <- data.frame(
id = paste0("p", 1:6),
x1 = c(1, 2, 1, 2, 3, 4),
x2 = c(1, 1, 2, 2, 2, 3)
)
query_points <- data.frame(
id = c("q1", "q2"),
x1 = c(1.2, 2.8),
x2 = c(1.1, 2.2)
)
reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])
reference_points
#> id x1 x2
#> 1 p1 1 1
#> 2 p2 2 1
#> 3 p3 1 2
#> 4 p4 2 2
#> 5 p5 3 2
#> 6 p6 4 3
query_points
#> id x1 x2
#> 1 q1 1.2 1.1
#> 2 q2 2.8 2.2In the examples below, reference is the
big.matrix searched by bigKNN, while
query_matrix is an ordinary dense R matrix. The same APIs
also accept queries stored in another big.matrix, and the
v3 search paths accept sparse query matrices as well.
First Exact k-Nearest-Neighbour Search
With query = NULL, knn_bigmatrix() performs
a self-search. By default exclude_self = TRUE in that case,
so each row does not return itself as its own nearest neighbour.
self_knn <- knn_bigmatrix(reference, k = 2)
self_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 2
#> queries: 6
#> references: 6
#> backend: bruteforceThe raw result stores two n_query x k matrices:
-
index: 1-based row indices into the reference matrix -
distance: the corresponding exact distances
self_knn$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 4
#> [3,] 1 4
#> [4,] 2 3
#> [5,] 4 2
#> [6,] 5 4
round(self_knn$distance, 3)
#> [,1] [,2]
#> [1,] 1.000 1.000
#> [2,] 1.000 1.000
#> [3,] 1.000 1.000
#> [4,] 1.000 1.000
#> [5,] 1.000 1.414
#> [6,] 1.414 2.236A small helper table makes the same result easier to read:
knn_table(self_knn, query_ids = reference_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 p1 1 p2 1.000
#> 2 p1 2 p3 1.000
#> 3 p2 1 p1 1.000
#> 4 p2 2 p4 1.000
#> 5 p3 1 p1 1.000
#> 6 p3 2 p4 1.000
#> 7 p4 1 p2 1.000
#> 8 p4 2 p3 1.000
#> 9 p5 1 p4 1.000
#> 10 p5 2 p2 1.414
#> 11 p6 1 p5 1.414
#> 12 p6 2 p4 2.236Searching New Query Points
To search new observations against the same reference, pass them
through the query argument. Here we ask for the three
nearest reference rows for two new points.
query_knn <- knn_bigmatrix(
reference,
query = query_matrix,
k = 3,
exclude_self = FALSE
)
query_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 2
#> references: 6
#> backend: bruteforce
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 q1 1 p1 0.2236
#> 2 q1 2 p2 0.8062
#> 3 q1 3 p3 0.9220
#> 4 q2 1 p5 0.2828
#> 5 q2 2 p4 0.8246
#> 6 q2 3 p6 1.4420The returned indices are still row numbers in reference.
That makes the object compact and easy to use in later workflows, while
a lookup table like reference_points can translate those
indices into human-readable labels when needed.
First Radius Search
Radius search returns every neighbour within a fixed distance
threshold instead of a fixed k. This is useful when the
local neighbourhood size should vary by query.
radius_result <- radius_bigmatrix(
reference,
query = query_matrix,
radius = 1.15,
exclude_self = FALSE
)
radius_result
#> <bigknn_radius_result>
#> metric: euclidean
#> radius: 1.15
#> queries: 2
#> references: 6
#> matches: 5
radius_result$n_match
#> [1] 3 2
radius_result$offset
#> [1] 1 4 6radius_bigmatrix() uses a flattened output format:
-
indexanddistancehold all matches back-to-back -
n_matchgives the number of matches per query -
offsettells you where each query’s slice starts and ends
For query i, the matching rows live in
index[offset[i]:(offset[i + 1] - 1)], with the same slice
in distance.
radius_slice(radius_result, 1, reference_points$id)
#> neighbor distance
#> 1 p1 0.2236
#> 2 p2 0.8062
#> 3 p3 0.9220
radius_slice(radius_result, 2, reference_points$id)
#> neighbor distance
#> 1 p5 0.2828
#> 2 p4 0.8246If you only need the counts,
count_within_radius_bigmatrix() avoids returning the
flattened neighbour vectors.
count_within_radius_bigmatrix(
reference,
query = query_matrix,
radius = 1.15,
exclude_self = FALSE
)
#> [1] 3 2Choosing a Metric
bigKNN currently supports three exact metrics:
-
"euclidean"for ordinary Euclidean distance -
"sqeuclidean"for squared Euclidean distance -
"cosine"for cosine distance, defined as1 - cosine similarity
The squared Euclidean metric preserves the same neighbour ordering as ordinary Euclidean distance, but reports squared values. Cosine distance can choose a different neighbour because it prefers similar direction rather than similar absolute location.
metric_summary <- do.call(rbind, lapply(
c("euclidean", "sqeuclidean", "cosine"),
function(metric) {
result <- knn_bigmatrix(
reference,
query = query_matrix,
k = 1,
metric = metric,
exclude_self = FALSE
)
data.frame(
metric = metric,
query = query_points$id,
nearest = reference_points$id[result$index[, 1]],
distance = signif(result$distance[, 1], 4),
row.names = NULL
)
}
))
metric_summary
#> metric query nearest distance
#> 1 euclidean q1 p1 0.2236000
#> 2 euclidean q2 p5 0.2828000
#> 3 sqeuclidean q1 p1 0.0500000
#> 4 sqeuclidean q2 p5 0.0800000
#> 5 cosine q1 p1 0.0009438
#> 6 cosine q2 p6 0.0002524In this example, Euclidean and squared Euclidean agree on the nearest row for each query, while cosine distance can favour a different point because its direction is more similar. Cosine distance requires non-zero rows in both the reference and the query data.
Where to Go Next
This vignette focused on the smallest useful workflow. For larger or repeated jobs, the next places to look are:
-
knn_prepare_bigmatrix()andknn_search_prepared()for repeated exact queries against the same reference -
knn_plan_bigmatrix(),knn_stream_bigmatrix(), andradius_stream_bigmatrix()for memory-aware and streamed workflows -
knn_graph_bigmatrix(),mutual_knn_graph_bigmatrix(),snn_graph_bigmatrix(), andradius_graph_bigmatrix()for exact graph construction -
recall_against_exact()andrerank_candidates_bigmatrix()whenbigKNNis being used as the exact ground-truth engine for approximate search