bigKNN is a natural exact backend for evaluating approximate nearest-neighbour systems. It gives you two especially useful tools:

  • exact top-k neighbours that act as trusted ground truth
  • exact reranking of approximate candidate sets against the original big.matrix reference

This vignette stays package-independent on purpose. Instead of depending on a specific ANN backend, it uses a small exact reference set plus a hand-built candidate matrix that behaves like approximate output.

Why exact ground truth matters

Approximate search is usually evaluated against an exact baseline. Without that baseline, it is hard to answer basic questions:

  • Are the returned neighbours actually correct?
  • Is the approximate system missing true near neighbours, or just ranking them oddly?
  • Does widening the ANN candidate pool improve top-k quality enough to justify the extra cost?

Exact search is also useful as a debugging tool. If an ANN system suddenly behaves differently after a tuning change, exact neighbours give you a stable reference point for comparing what changed.

Build a small exact evaluation set

We will use eight reference rows and three queries. The example is small so the exact neighbours and the approximate candidate sets remain easy to inspect.

library(bigKNN)
library(bigmemory)

reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(0.0, 0.2, 1.0, 1.2, 3.0, 3.2, 4.0, 4.2),
  x2 = c(0.0, 0.1, 0.9, 1.0, 3.0, 3.1, 4.0, 4.1),
  x3 = c(0.5, 0.4, 1.2, 1.1, 2.8, 2.9, 3.8, 3.9)
)

query_points <- data.frame(
  id = paste0("q", 1:3),
  x1 = c(0.1, 1.1, 3.1),
  x2 = c(0.0, 1.0, 3.0),
  x3 = c(0.45, 1.15, 2.85)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2", "x3")]))
query_matrix <- as.matrix(query_points[c("x1", "x2", "x3")])

reference_points
#>   id  x1  x2  x3
#> 1 r1 0.0 0.0 0.5
#> 2 r2 0.2 0.1 0.4
#> 3 r3 1.0 0.9 1.2
#> 4 r4 1.2 1.0 1.1
#> 5 r5 3.0 3.0 2.8
#> 6 r6 3.2 3.1 2.9
#> 7 r7 4.0 4.0 3.8
#> 8 r8 4.2 4.1 3.9
query_points
#>   id  x1 x2   x3
#> 1 q1 0.1  0 0.45
#> 2 q2 1.1  1 1.15
#> 3 q3 3.1  3 2.85

Producing exact neighbours with bigKNN

The exact reference result is just the exact k-NN output for the same reference, query set, metric, and k that you want to evaluate.

exact <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  metric = "euclidean",
  exclude_self = FALSE
)

exact
#> <bigknn_knn_result>
#>   metric: euclidean
#>   k: 3
#>   queries: 3
#>   references: 8
#>   backend: bruteforce
neighbor_table(
  exact$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = exact$distance
)
#>   query rank neighbor distance
#> 1    q1    1       r1   0.1118
#> 2    q1    2       r2   0.1500
#> 3    q1    3       r3   1.4773
#> 4    q2    1       r4   0.1118
#> 5    q2    2       r3   0.1500
#> 6    q2    3       r2   1.4773
#> 7    q3    1       r5   0.1118
#> 8    q3    2       r6   0.1500
#> 9    q3    3       r7   1.6470

This object is the ground truth for the rest of the vignette. Every later comparison asks how closely an approximate candidate set matches these exact top three neighbours.
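For intuition, the same exact top-3 result can be reproduced with a brute-force distance computation in base R. This sketch uses plain matrices rather than the package's big.matrix backend; ref and qry simply restate the vignette's data:

```r
# Brute-force exact k-NN in base R: compute every query-to-reference
# Euclidean distance, then keep the k smallest per query.
ref <- rbind(
  c(0.0, 0.0, 0.5), c(0.2, 0.1, 0.4), c(1.0, 0.9, 1.2), c(1.2, 1.0, 1.1),
  c(3.0, 3.0, 2.8), c(3.2, 3.1, 2.9), c(4.0, 4.0, 3.8), c(4.2, 4.1, 3.9)
)
qry <- rbind(c(0.1, 0.0, 0.45), c(1.1, 1.0, 1.15), c(3.1, 3.0, 2.85))
k <- 3

exact_idx <- t(apply(qry, 1, function(q) {
  d <- sqrt(colSums((t(ref) - q)^2))  # distance from q to every reference row
  order(d)[1:k]                       # indices of the k nearest references
}))
exact_idx  # rows: 1 2 3 / 4 3 2 / 5 6 7, matching the exact output above
```

On this data, the brute-force indices agree with the exact neighbour table shown above, which is exactly what a trusted ground-truth computation should do.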

Measuring recall with recall_against_exact()

Suppose another package returns the following approximate top-3 neighbour index matrix:

approx_top3 <- rbind(
  c(8, 3, 1),
  c(7, 4, 2),
  c(6, 2, 5)
)

neighbor_table(
  approx_top3,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
#>   query rank neighbor
#> 1    q1    1       r8
#> 2    q1    2       r3
#> 3    q1    3       r1
#> 4    q2    1       r7
#> 5    q2    2       r4
#> 6    q2    3       r2
#> 7    q3    1       r6
#> 8    q3    2       r2
#> 9    q3    3       r5

Each query still recovers some correct neighbours, but one true top-3 neighbour is missing for every row. We can quantify that with recall_against_exact():

recall_before <- recall_against_exact(exact, approx_top3, k = 3)

recall_before
#> <bigknn_recall>
#>   overall: 0.6667
#>   k: 3
#>   queries: 3
data.frame(
  query = query_points$id,
  recall = recall_before$per_query,
  row.names = NULL
)
#>   query    recall
#> 1    q1 0.6666667
#> 2    q2 0.6666667
#> 3    q3 0.6666667

Recall is set-based rather than rank-based:

  • it checks whether the exact top-k ids appear in the approximate top-k
  • it does not care whether those ids are in the same internal order
  • overall is the mean of the per-query recalls

That makes recall a good first coverage metric, but not a full ranking metric. An ANN system can have excellent recall while still misordering the neighbours within the recovered set.
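As a sketch of that set-based definition, per-query recall is just the size of the overlap between the exact and approximate top-k id sets, divided by k. The id matrices here restate the exact and approximate top-3 from above:

```r
exact_top3  <- rbind(c(1, 2, 3), c(4, 3, 2), c(5, 6, 7))  # exact ids from above
approx_top3 <- rbind(c(8, 3, 1), c(7, 4, 2), c(6, 2, 5))  # approximate ids

# Per-query recall: overlap of exact and approximate top-k, divided by k.
# Order within each row is deliberately ignored.
per_query <- sapply(seq_len(nrow(exact_top3)), function(i) {
  length(intersect(exact_top3[i, ], approx_top3[i, ])) / ncol(exact_top3)
})
per_query        # 2/3 for each query
mean(per_query)  # overall recall: the mean of the per-query values
```

This reproduces the 0.6667 per-query and overall values reported by recall_against_exact() above.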

Reranking candidate sets with rerank_candidates_bigmatrix()

Reranking is useful when an approximate system returns a wider candidate pool than the final k you care about. The idea is:

  1. let the ANN stage generate a shortlist
  2. compute exact distances only within that shortlist
  3. sort the shortlist exactly and keep the best top_k

Here is a 5-candidate pool that contains the true exact top-3 ids for every query, but not always in the first three slots:

candidate_pool <- rbind(
  c(8, 3, 1, 2, 6),
  c(7, 4, 2, 3, 1),
  c(6, 2, 5, 7, 1)
)

neighbor_table(
  candidate_pool,
  query_ids = query_points$id,
  ref_ids = reference_points$id
)
#>    query rank neighbor
#> 1     q1    1       r8
#> 2     q1    2       r3
#> 3     q1    3       r1
#> 4     q1    4       r2
#> 5     q1    5       r6
#> 6     q2    1       r7
#> 7     q2    2       r4
#> 8     q2    3       r2
#> 9     q2    4       r3
#> 10    q2    5       r1
#> 11    q3    1       r6
#> 12    q3    2       r2
#> 13    q3    3       r5
#> 14    q3    4       r7
#> 15    q3    5       r1

Now rerank those candidates exactly against the reference matrix:

reranked <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = candidate_pool,
  metric = "euclidean",
  top_k = 3
)

reranked
#> <bigknn_knn_result>
#>   metric: euclidean
#>   k: 3
#>   queries: 3
#>   references: 8
#>   backend: candidate-rerank
neighbor_table(
  reranked$index,
  query_ids = query_points$id,
  ref_ids = reference_points$id,
  distance = reranked$distance
)
#>   query rank neighbor distance
#> 1    q1    1       r1   0.1118
#> 2    q1    2       r2   0.1500
#> 3    q1    3       r3   1.4773
#> 4    q2    1       r4   0.1118
#> 5    q2    2       r3   0.1500
#> 6    q2    3       r2   1.4773
#> 7    q3    1       r5   0.1118
#> 8    q3    2       r6   0.1500
#> 9    q3    3       r7   1.6470

The reranked result now matches the exact top-3 output, because the true neighbours were present somewhere in the candidate pool.
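The rerank mechanics can be mirrored in base R to see why this works. This sketch restates the data as plain matrices and computes exact distances only inside each query's shortlist, then keeps the best top_k:

```r
ref <- rbind(
  c(0.0, 0.0, 0.5), c(0.2, 0.1, 0.4), c(1.0, 0.9, 1.2), c(1.2, 1.0, 1.1),
  c(3.0, 3.0, 2.8), c(3.2, 3.1, 2.9), c(4.0, 4.0, 3.8), c(4.2, 4.1, 3.9)
)
qry  <- rbind(c(0.1, 0.0, 0.45), c(1.1, 1.0, 1.15), c(3.1, 3.0, 2.85))
pool <- rbind(c(8, 3, 1, 2, 6), c(7, 4, 2, 3, 1), c(6, 2, 5, 7, 1))
top_k <- 3

reranked_idx <- t(sapply(seq_len(nrow(qry)), function(i) {
  cand <- pool[i, ]
  # Exact Euclidean distances, but only to the shortlisted reference rows.
  diffs <- ref[cand, , drop = FALSE] -
    matrix(qry[i, ], length(cand), ncol(qry), byrow = TRUE)
  d <- sqrt(rowSums(diffs^2))
  cand[order(d)][1:top_k]  # exact sort within the shortlist, keep top_k
}))
reranked_idx  # rows: 1 2 3 / 4 3 2 / 5 6 7, same as the exact top-3
```

Only length(cand) distances are computed per query, which is the whole appeal of the shortlist-then-rerank pattern.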

Comparing approximate candidates against exact results

The usual evaluation pattern is:

  1. compute exact neighbours once on an evaluation subset
  2. compare ANN output to exact truth with recall_against_exact()
  3. rerank a wider candidate pool exactly
  4. compare recall again after reranking

recall_after <- recall_against_exact(exact, reranked, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from 5 candidates"),
  overall_recall = c(recall_before$overall, recall_after$overall),
  row.names = NULL
)
#>                              stage overall_recall
#> 1                Approximate top-3      0.6666667
#> 2 Reranked top-3 from 5 candidates      1.0000000

In this example, the approximate top-3 recall is 2/3, while reranking the 5-wide candidate pool recovers perfect top-3 recall.

Suggested workflow alongside a separate ANN package

A practical package-independent workflow looks like this:

  1. Choose an evaluation subset of queries and references.
  2. Run knn_bigmatrix() once to compute exact top-k neighbours on that subset.
  3. Run your ANN package with the same metric definition and collect its candidate indices as a matrix or big.matrix.
  4. Measure ANN recall with recall_against_exact().
  5. If the ANN backend produces a wider candidate pool than final k, rerank it exactly with rerank_candidates_bigmatrix().
  6. Compare recall before and after reranking, and decide whether the candidate width is worth the cost.

This keeps responsibilities clean:

  • the ANN package handles fast candidate generation
  • bigKNN supplies exact truth and exact candidate reranking

Interpreting recall and reranking results

There are two distinct questions to keep apart.

First: does the approximate stage cover the correct neighbours at all?

  • low recall means true neighbours are missing from the approximate top-k
  • widening the ANN candidate pool is often the first fix

Second: if the correct ids are present somewhere in the candidate pool, are they ranked well enough for the final top-k?

  • reranking can repair ordering within the supplied candidate set
  • reranking can also improve top-k coverage when true neighbours are present just outside the ANN top-k

The wider the candidate pool, the better reranking can work, but the more exact distance calculations it has to do.
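That trade-off is easy to budget: for q queries and a w-wide pool, reranking costs roughly q × w exact distance evaluations, versus q × n for a full exact search over n references. A quick arithmetic sketch with hypothetical workload sizes:

```r
q <- 1000  # queries (illustrative)
n <- 1e6   # reference rows (illustrative)

# Exact distances computed at rerank time for a few pool widths.
budget <- sapply(c(10, 50, 200), function(w) q * w)
budget           # 1e4, 5e4, 2e5 exact distance evaluations
budget / (q * n) # tiny fractions of the full exact-search cost
```

Even the 200-wide pool costs only a small fraction of a full exact search here, which is why rerank-after-shortlist is usually affordable.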

Limits and caveats

Reranking improves ordering only within the candidates you provide. It cannot invent missing neighbours. If the candidate pool is already too narrow, recall may stay capped even after exact reranking:

reranked_limited <- rerank_candidates_bigmatrix(
  reference,
  query = query_matrix,
  candidate_index = approx_top3,
  metric = "euclidean",
  top_k = 3
)

recall_after_limited <- recall_against_exact(exact, reranked_limited, k = 3)

data.frame(
  stage = c("Approximate top-3", "Reranked top-3 from same 3 candidates"),
  overall_recall = c(recall_before$overall, recall_after_limited$overall),
  row.names = NULL
)
#>                                   stage overall_recall
#> 1                     Approximate top-3      0.6666667
#> 2 Reranked top-3 from same 3 candidates      0.6666667

Other important caveats:

  • exact search still has a real computational cost, so evaluation often uses a representative subset rather than the entire production workload
  • exact and approximate systems must use the same metric definition if recall is to mean anything
  • recall_against_exact() measures set overlap, not rank quality
  • sparse query inputs are supported in bigKNN, but currently densified on the R side before exact computation

Used this way, bigKNN becomes a clean exact companion to approximate search: it tells you what the right neighbours are, how often the ANN system finds them, and whether a wider candidate pool can be turned into a better final result by exact reranking.