bigANNOY Versus bigKNN

bigANNOY and bigKNN are meant to complement each other, not compete for the same role.

bigKNN gives you exact Euclidean neighbours
bigANNOY gives you fast approximate neighbours through persisted Annoy indexes

That makes them a natural pair:

use bigKNN when exactness is the requirement
use bigANNOY when scale and latency matter more than perfect exactness
use them together when you want a ground-truth baseline for evaluating an approximate workflow

This vignette explains how to think about that split and how to compare the two packages in practice.

The Core Difference

At a high level, the packages answer slightly different questions.

bigKNN asks:

what are the exact Euclidean nearest neighbours of each query row?

bigANNOY asks:

which reference rows are very likely to be among the nearest neighbours of each query row, according to an approximate Annoy index?

That distinction has consequences:

exact search is the correctness baseline
approximate search trades a small amount of accuracy for lower latency and better scaling
exact search is usually the right benchmark target for Euclidean workloads
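
To make "exact" concrete, the following is a minimal base R sketch of brute-force Euclidean search: every query row is compared with every reference row and the k smallest distances are kept. It is an illustration only and is not presented as how bigKNN is implemented; the helper name `exact_knn_bruteforce` and the toy matrices are assumptions made for this example.

```r
# Illustrative brute-force exact Euclidean k-NN in base R.
# `exact_knn_bruteforce` is a hypothetical helper, not part of bigKNN.
exact_knn_bruteforce <- function(ref, query, k) {
  stopifnot(is.matrix(ref), is.matrix(query),
            ncol(ref) == ncol(query), k >= 1L, k <= nrow(ref))
  index <- matrix(NA_integer_, nrow(query), k)
  distance <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Euclidean distance from query row i to every reference row.
    diffs <- sweep(ref, 2L, query[i, ])
    d <- sqrt(rowSums(diffs^2))
    # Keep the k closest reference rows, nearest first (1-based row ids).
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance, k = k)
}

# Toy data, invented for illustration.
ref <- rbind(c(0, 0), c(1, 0), c(0, 3), c(5, 5))
query <- rbind(c(0.9, 0), c(0, 2))
res <- exact_knn_bruteforce(ref, query, k = 2L)
res$index
res$distance
```

The cost grows with the product of the number of query rows and reference rows, which is why an approximate index becomes attractive as the reference set grows.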

When To Use Which Package

Use bigKNN when:

exact Euclidean neighbours are required
the result itself is a scientific or statistical reference quantity
you need a benchmark ground truth for recall measurement
approximation is not acceptable for the downstream task

Use bigANNOY when:

query latency matters more than exactness
reference data is large enough that approximate search is operationally attractive
you want a persisted Annoy index that can be reopened and reused
a small loss in recall is acceptable in exchange for speed

In other words:

bigKNN is the answer when the question is “what is exactly correct?”
bigANNOY is the answer when the question is “what is fast enough while still good enough?”

Shared Result Shape

One of the most useful design choices in bigANNOY is that its result object is intentionally aligned with bigKNN.

The returned components are conceptually parallel:

index
distance
k
metric
n_ref
n_query
exact
backend

For bigANNOY, exact = FALSE and backend = "annoy".

That shared shape matters because it makes these workflows much simpler:

row-by-row comparison of neighbour ids
inspection of distance matrices under the same indexing conventions
recall-at-k comparisons against an exact Euclidean baseline
reuse of the same downstream code for exact and approximate results
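
Before comparing two results row by row, it can help to confirm that they describe the same search problem. The sketch below relies only on the component names listed above; `results_comparable` is a hypothetical helper, and the two mock result lists contain made-up values that exist only to exercise it.

```r
# Hypothetical helper: TRUE when two result lists can be compared row by row.
# all.equal() is used so that an integer k and a double k still match.
results_comparable <- function(a, b) {
  fields <- c("k", "metric", "n_ref", "n_query")
  same_fields <- vapply(fields, function(f) {
    isTRUE(all.equal(a[[f]], b[[f]]))
  }, logical(1L))
  same_dims <- identical(dim(a$index), dim(b$index))
  all(same_fields) && same_dims
}

# Mock results with invented neighbour ids (not output from either package).
mock_approx <- list(
  index = matrix(c(1L, 2L, 3L, 4L), nrow = 2L),
  k = 2L, metric = "euclidean", n_ref = 10L, n_query = 2L,
  exact = FALSE, backend = "annoy"
)
mock_exact <- utils::modifyList(
  mock_approx,
  list(exact = TRUE, backend = "bruteforce")
)

results_comparable(mock_approx, mock_exact)
```

The `exact` and `backend` components are deliberately left out of the check, since those are expected to differ between the two packages.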

Load the Packages You Need

This vignette always uses bigANNOY. The bigKNN parts are optional and only run when bigKNN is installed.

library(bigANNOY)
library(bigmemory)

A Small Comparison Dataset

We will create a small reference matrix and a separate query matrix. This is large enough to show the workflow clearly without making the vignette slow.

compare_dir <- tempfile("bigannoy-vs-bigknn-")
dir.create(compare_dir, recursive = TRUE, showWarnings = FALSE)

ref_dense <- matrix(rnorm(120 * 6), nrow = 120, ncol = 6)
query_dense <- matrix(rnorm(15 * 6), nrow = 15, ncol = 6)

ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
#> [1] 120   6
dim(query_dense)
#> [1] 15  6

Approximate Search with bigANNOY

bigANNOY first builds an Annoy index and then searches that persisted index.

annoy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(compare_dir, "ref.ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

approx_result <- annoy_search_bigmatrix(
  annoy_index,
  query = query_dense,
  k = 5L,
  search_k = 100L
)

names(approx_result)
#> [1] "index"    "distance" "k"        "metric"   "n_ref"    "n_query"  "exact"   
#> [8] "backend"
approx_result$exact
#> [1] FALSE
approx_result$backend
#> [1] "annoy"
approx_result$index[1:3, ]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  107   58   97  102   95
#> [2,]   47   21   43   92   33
#> [3,]   85  111   62   89    8
round(approx_result$distance[1:3, ], 3)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,] 1.973 2.083 2.116 2.122 2.139
#> [2,] 1.736 2.032 2.088 2.394 2.497
#> [3,] 0.945 1.120 1.267 1.292 1.315

This is the standard approximate Euclidean workflow in bigANNOY.

Exact Search with bigKNN When Available

If bigKNN is installed, the exact Euclidean comparison is straightforward because the result structure is deliberately similar.

if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  # Look up knn_bigmatrix() in the bigKNN namespace without attaching the package.
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

  list(
    names = names(exact_result),
    exact = exact_result$exact,
    backend = exact_result$backend,
    index_head = exact_result$index[1:3, ],
    distance_head = round(exact_result$distance[1:3, ], 3)
  )
} else {
  "bigKNN is not installed in this session, so the exact comparison example is skipped."
}
#> $names
#> [1] "index"    "distance" "k"        "metric"   "n_ref"    "n_query"  "exact"   
#> [8] "backend" 
#> 
#> $exact
#> [1] TRUE
#> 
#> $backend
#> [1] "bruteforce"
#> 
#> $index_head
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  107   58   97  102   95
#> [2,]   47   21   43   92   33
#> [3,]   85  111   62   73   89
#> 
#> $distance_head
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,] 1.973 2.083 2.116 2.122 2.139
#> [2,] 1.736 2.032 2.088 2.394 2.497
#> [3,] 0.945 1.120 1.267 1.274 1.292

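The exact result has the same high-level structure, but exact is now TRUE and backend is "bruteforce", which identifies the exact search path. In the output shown above, the first two query rows match the approximate result exactly. The third row differs: the exact search finds reference row 73 at distance 1.274, which the approximate search missed.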

What Does “Aligned Result Shape” Buy You?

The aligned result shape means you can compare exact and approximate neighbour sets directly when metric = "euclidean" and both were run with the same k.

When bigKNN is available, a simple overlap-style recall comparison looks like this:

if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
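  # Look up knn_bigmatrix() without attaching bigKNN, then recompute the exact
  # baseline so that this chunk runs on its own.
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))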

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

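  # Per-query recall: the share of the exact top-5 ids that the approximate
  # search also returned. The mean over all queries is recall@5.
  recall_at_5 <- mean(vapply(seq_len(nrow(query_dense)), function(i) {
    length(intersect(approx_result$index[i, ], exact_result$index[i, ])) / 5
  }, numeric(1L)))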

  recall_at_5
} else {
  "Recall example skipped because bigKNN is not installed."
}
#> [1] 0.9866667

That is the core evaluation pattern:

bigKNN provides the exact answer
bigANNOY provides the approximate answer
the overlap between the two tells you how much quality you are giving up
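
A single mean can hide which queries lost neighbours. The sketch below keeps the per-query values so that the affected rows can be inspected. `recall_per_query` is a hypothetical helper name; the two small matrices are the first three rows of the approximate and exact index outputs printed earlier in this vignette.

```r
# Hypothetical helper: per-query recall between two neighbour-id matrices.
recall_per_query <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  k <- ncol(exact_index)
  vapply(seq_len(nrow(exact_index)), function(i) {
    # Order within a row does not matter, only set membership.
    length(intersect(approx_index[i, ], exact_index[i, ])) / k
  }, numeric(1L))
}

# First three rows of the index matrices printed earlier in this vignette.
approx_head <- rbind(
  c(107, 58, 97, 102, 95),
  c(47, 21, 43, 92, 33),
  c(85, 111, 62, 89, 8)
)
exact_head <- rbind(
  c(107, 58, 97, 102, 95),
  c(47, 21, 43, 92, 33),
  c(85, 111, 62, 73, 89)
)

per_query <- recall_per_query(approx_head, exact_head)
per_query             # 1.0 1.0 0.8
which(per_query < 1)  # 3
```

Only the third query falls short, because reference row 73 is missing from its approximate neighbour set.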

Why bigANNOY Still Matters When bigKNN Exists

If exact search exists, why use approximate search at all?

Because operationally, the best answer is not always the exact answer.

bigANNOY adds capabilities that solve a different problem:

persisted Annoy indexes that can be reopened across sessions
approximate search that can answer queries with lower latency than an exhaustive scan, which matters for latency-sensitive workloads
control over the build/search trade-off through n_trees and search_k
file-backed and descriptor-oriented workflows around bigmemory

So the two packages fit a common progression:

use bigKNN to establish correctness and a benchmark baseline
use bigANNOY to explore how much latency you can save
compare recall against the exact baseline
choose the operating point that is acceptable for the application

Benchmark Integration

The benchmark helpers in bigANNOY already support this pairing directly for Euclidean workloads. If bigKNN is available, they can report exact timing and recall automatically.

bench <- benchmark_annoy_bigmatrix(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 5L,
  n_trees = 20L,
  search_k = 100L,
  metric = "euclidean",
  exact = length(find.package("bigKNN", quiet = TRUE)) > 0L,
  path_dir = compare_dir,
  load_mode = "eager"
)

bench$summary[, c(
  "metric",
  "n_trees",
  "search_k",
  "build_elapsed",
  "search_elapsed",
  "exact_elapsed",
  "recall_at_k"
)]
#>      metric n_trees search_k build_elapsed search_elapsed exact_elapsed
#> 1 euclidean      20      100         0.009          0.001             0
#>   recall_at_k
#> 1        0.99

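This is usually the easiest way to decide whether an approximate search configuration is worth adopting. At this toy size the timings are too small to be informative (exact_elapsed is reported as 0), so the comparison becomes meaningful only on realistically sized data.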
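
Once several configurations have been benchmarked, choosing between them can be reduced to a simple rule: take the fastest configuration whose recall meets a target. The sketch below assumes a data frame with one row per configuration and the same column names as the benchmark summary above, for instance built by row-binding several summaries. `pick_operating_point` is a hypothetical helper, and the numbers in `candidates` are placeholders invented for illustration, not measured results.

```r
# Hypothetical helper: fastest configuration that meets a recall target.
# Column names follow the benchmark summary shown in this vignette.
pick_operating_point <- function(summary_df, min_recall) {
  ok <- summary_df[summary_df$recall_at_k >= min_recall, , drop = FALSE]
  if (nrow(ok) == 0L) {
    return(NULL)  # no configuration is good enough
  }
  ok[which.min(ok$search_elapsed), , drop = FALSE]
}

# Placeholder values, invented for illustration only (not benchmark output).
candidates <- data.frame(
  n_trees = c(10L, 20L, 50L),
  search_k = c(50L, 100L, 500L),
  search_elapsed = c(0.001, 0.002, 0.008),
  recall_at_k = c(0.90, 0.97, 1.00)
)

best <- pick_operating_point(candidates, min_recall = 0.95)
best
```

Returning NULL when no row qualifies makes the failure explicit, which is usually preferable to silently falling back to the most accurate configuration.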

A Practical Decision Framework

Here is a simple way to decide between the two packages for a Euclidean workflow.

Start with bigKNN when:

you need the exact answer
you are still defining the benchmark target
you do not yet know how much approximation your downstream task tolerates

Move toward bigANNOY when:

exact search is too slow for the intended query workload
you want a persisted index that can be reopened repeatedly
you have measured acceptable recall relative to the exact baseline

Keep both in the workflow when:

you want to monitor approximation quality over time
you benchmark new n_trees or search_k settings
you need a trustworthy exact baseline for evaluation or regression tests

Important Boundaries

There are also a few boundaries worth keeping clear:

bigKNN is the exact baseline only for Euclidean search
bigANNOY supports additional Annoy metrics beyond Euclidean
recall comparisons against bigKNN only make sense for Euclidean workloads
an approximate result can be operationally excellent even when it is not exactly identical to the true top-k

That last point is easy to forget. The question is not whether approximate search is exact. The question is whether the approximation quality is good enough for the application you care about.
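
Set overlap is not the only way to judge that quality. A complementary view is the rank-wise ratio of approximate to exact distances, which shows how much farther the substituted neighbours are. The sketch below applies it to the third query row from the outputs printed earlier, using the rounded distances shown there; `distance_ratio` is a hypothetical helper name.

```r
# Hypothetical helper: rank-wise ratio of approximate to exact distances.
# A ratio of 1 means the approximate neighbour at that rank is as close as
# the exact one. Ratios are undefined where an exact distance is 0.
distance_ratio <- function(approx_distance, exact_distance) {
  stopifnot(identical(dim(approx_distance), dim(exact_distance)))
  approx_distance / exact_distance
}

# Third query row from the outputs printed earlier (rounded to 3 decimals).
approx_d <- matrix(c(0.945, 1.120, 1.267, 1.292, 1.315), nrow = 1L)
exact_d <- matrix(c(0.945, 1.120, 1.267, 1.274, 1.292), nrow = 1L)

ratios <- distance_ratio(approx_d, exact_d)
round(ratios, 3)
```

For this row the first three ranks are identical, and the neighbours at ranks 4 and 5 are roughly 1.4% and 1.8% farther away than the exact ones, even though recall for the row is only 0.8.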

Recap

The best way to think about the pair is:

bigKNN gives you exact Euclidean truth
bigANNOY gives you fast approximate search on top of persisted Annoy indexes
the shared result shape makes comparison practical
the benchmark helpers let you quantify the trade-off instead of guessing

If you are beginning a new Euclidean workflow, a strong default is to start with bigKNN as the baseline, then move to bigANNOY once latency, scale, or persisted-index workflows become the limiting factor.