bigANNOY and bigKNN are meant to complement
each other, not compete for the same role.
- bigKNN gives you exact Euclidean neighbours
- bigANNOY gives you fast approximate neighbours through persisted Annoy indexes

That makes them a natural pair:

- use bigKNN when exactness is the requirement
- use bigANNOY when scale and latency matter more than perfect exactness
- use them together when you want a ground-truth baseline for evaluating an approximate workflow
This vignette explains how to think about that split and how to compare the two packages in practice.
The Core Difference
At a high level, the packages answer slightly different questions.
bigKNN asks:
- what are the exact Euclidean nearest neighbours of each query row?
bigANNOY asks:
- what are the most likely nearest neighbours of each query row, found through an approximate Annoy index?
That distinction has consequences:
- exact search is the correctness baseline
- approximate search is the operational speed/scale path
- exact search is usually the right benchmark target for Euclidean workloads
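To make the "correctness baseline" concrete, here is a minimal base-R sketch of what exact Euclidean search computes. The `exact_knn` helper below is purely illustrative and is not part of either package; a real workload would use bigKNN, which does this blockwise over big.matrix data.

```r
# Illustrative only: brute-force exact Euclidean k nearest neighbours,
# the quantity that bigKNN computes and bigANNOY approximates.
exact_knn <- function(ref, query, k) {
  index    <- matrix(0L, nrow(query), k)
  distance <- matrix(0,  nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Euclidean distance from query row i to every reference row
    diff <- ref - matrix(query[i, ], nrow(ref), ncol(ref), byrow = TRUE)
    d    <- sqrt(rowSums(diff^2))
    ord  <- order(d)[seq_len(k)]
    index[i, ]    <- ord
    distance[i, ] <- d[ord]
  }
  list(index = index, distance = distance)
}
```

Because every reference row is scanned for every query row, the answer is exact by construction; the cost is the full `n_ref * n_query` distance computation that approximate indexes avoid.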
When To Use Which Package
Use bigKNN when:
- exact Euclidean neighbours are required
- the result itself is a scientific or statistical reference quantity
- you need a benchmark ground truth for recall measurement
- approximation is not acceptable for the downstream task
Use bigANNOY when:
- query latency matters more than exactness
- reference data is large enough that approximate search is operationally attractive
- you want a persisted Annoy index that can be reopened and reused
- a small loss in recall is acceptable in exchange for speed
In other words:

- bigKNN is the answer when the question is “what is exactly correct?”
- bigANNOY is the answer when the question is “what is fast enough while still good enough?”
Shared Result Shape
One of the most useful design choices in bigANNOY is
that its result object is intentionally aligned with
bigKNN.
The returned components are conceptually parallel:
- index
- distance
- k
- metric
- n_ref
- n_query
- exact
- backend

For bigANNOY, exact = FALSE and backend = "annoy".
That shared shape matters because it makes these workflows much simpler:
- row-by-row comparison of neighbour ids
- inspection of distance matrices under the same indexing conventions
- recall-at-k comparisons against an exact Euclidean baseline
- swapping exact and approximate results into the same downstream code more easily
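Because both packages return an index matrix under the same indexing conventions, recall-at-k reduces to row-wise overlap. The `recall_at_k` helper below is an illustrative base-R sketch, not a function exported by either package:

```r
# Illustrative helper: mean row-wise recall-at-k between an
# approximate and an exact neighbour-index matrix of the same shape.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(dim(approx_index) == dim(exact_index))
  k <- ncol(exact_index)
  per_row <- vapply(seq_len(nrow(exact_index)), function(i) {
    length(intersect(approx_index[i, ], exact_index[i, ])) / k
  }, numeric(1L))
  mean(per_row)
}
```

Perfect agreement gives a recall of 1; each missed true neighbour in a row lowers that row's contribution by 1/k.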
Load the Packages You Need
This vignette always uses bigANNOY. The
bigKNN parts are optional and only run when
bigKNN is installed.
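A setup chunk along the following lines is assumed for the rest of the vignette; bigmemory supplies `as.big.matrix()` used below, and bigKNN is only probed, never required:

```r
# Setup sketch: bigANNOY plus bigmemory (for as.big.matrix);
# bigKNN is optional and only detected, not attached.
library(bigANNOY)
library(bigmemory)

has_bigknn <- length(find.package("bigKNN", quiet = TRUE)) > 0L
```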
A Small Comparison Dataset
We will create a small reference matrix and a separate query matrix. This is large enough to show the workflow clearly without making the vignette slow.
compare_dir <- tempfile("bigannoy-vs-bigknn-")
dir.create(compare_dir, recursive = TRUE, showWarnings = FALSE)
ref_dense <- matrix(rnorm(120 * 6), nrow = 120, ncol = 6)
query_dense <- matrix(rnorm(15 * 6), nrow = 15, ncol = 6)
ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
#> [1] 120 6
dim(query_dense)
#> [1] 15 6

Approximate Search with bigANNOY
bigANNOY first builds an Annoy index and then searches
that persisted index.
annoy_index <- annoy_build_bigmatrix(
ref_big,
path = file.path(compare_dir, "ref.ann"),
metric = "euclidean",
n_trees = 20L,
seed = 123L,
load_mode = "eager"
)
approx_result <- annoy_search_bigmatrix(
annoy_index,
query = query_dense,
k = 5L,
search_k = 100L
)
names(approx_result)
#> [1] "index" "distance" "k" "metric" "n_ref" "n_query" "exact"
#> [8] "backend"
approx_result$exact
#> [1] FALSE
approx_result$backend
#> [1] "annoy"
approx_result$index[1:3, ]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 107 58 97 102 95
#> [2,] 47 21 43 92 33
#> [3,] 85 111 62 89 8
round(approx_result$distance[1:3, ], 3)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.973 2.083 2.116 2.122 2.139
#> [2,] 1.736 2.032 2.088 2.394 2.497
#> [3,] 0.945 1.120 1.267 1.292 1.315

This is the standard approximate Euclidean workflow in bigANNOY.
Exact Search with bigKNN When Available
If bigKNN is installed, the exact Euclidean comparison
is straightforward because the result structure is deliberately
similar.
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))
exact_result <- knn_bigmatrix(
ref_big,
query = query_dense,
k = 5L,
metric = "euclidean",
block_size = 64L,
exclude_self = FALSE
)
list(
names = names(exact_result),
exact = exact_result$exact,
backend = exact_result$backend,
index_head = exact_result$index[1:3, ],
distance_head = round(exact_result$distance[1:3, ], 3)
)
} else {
"bigKNN is not installed in this session, so the exact comparison example is skipped."
}
#> $names
#> [1] "index" "distance" "k" "metric" "n_ref" "n_query" "exact"
#> [8] "backend"
#>
#> $exact
#> [1] TRUE
#>
#> $backend
#> [1] "bruteforce"
#>
#> $index_head
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 107 58 97 102 95
#> [2,] 47 21 43 92 33
#> [3,] 85 111 62 73 89
#>
#> $distance_head
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.973 2.083 2.116 2.122 2.139
#> [2,] 1.736 2.032 2.088 2.394 2.497
#> [3,] 0.945 1.120 1.267 1.274 1.292

The exact result uses the same high-level structure, but now exact is expected to be TRUE and the backend identifies the exact search path.
What Does “Aligned Result Shape” Buy You?
The aligned result shape means you can compare exact and approximate
neighbour sets directly when metric = "euclidean" and both
were run with the same k.
When bigKNN is available, a simple overlap-style recall
comparison looks like this:
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))
exact_result <- knn_bigmatrix(
ref_big,
query = query_dense,
k = 5L,
metric = "euclidean",
block_size = 64L,
exclude_self = FALSE
)
recall_at_5 <- mean(vapply(seq_len(nrow(query_dense)), function(i) {
length(intersect(approx_result$index[i, ], exact_result$index[i, ])) / 5
}, numeric(1L)))
recall_at_5
} else {
"Recall example skipped because bigKNN is not installed."
}
#> [1] 0.9866667

That is the core evaluation pattern:

- bigKNN provides the exact answer
- bigANNOY provides the approximate answer
- the overlap between the two tells you how much quality you are giving up
Why bigANNOY Still Matters When bigKNN Exists
If exact search exists, why use approximate search at all?
Because operationally, the best answer is not always the exact answer.
bigANNOY adds capabilities that solve a different
problem:
- persisted Annoy indexes that can be reopened across sessions
- approximate search that can be much more attractive for latency-sensitive workloads
- control over the build/search trade-off through n_trees and search_k
- file-backed and descriptor-oriented workflows around bigmemory
So the two packages fit a common progression:
- use bigKNN to establish correctness and a benchmark baseline
- use bigANNOY to explore how much latency you can save
- compare recall against the exact baseline
- choose the operating point that is acceptable for the application
Benchmark Integration
The benchmark helpers in bigANNOY already support this
pairing directly for Euclidean workloads. If bigKNN is
available, they can report exact timing and recall automatically.
bench <- benchmark_annoy_bigmatrix(
n_ref = 200L,
n_query = 20L,
n_dim = 6L,
k = 5L,
n_trees = 20L,
search_k = 100L,
metric = "euclidean",
exact = length(find.package("bigKNN", quiet = TRUE)) > 0L,
path_dir = compare_dir,
load_mode = "eager"
)
bench$summary[, c(
"metric",
"n_trees",
"search_k",
"build_elapsed",
"search_elapsed",
"exact_elapsed",
"recall_at_k"
)]
#> metric n_trees search_k build_elapsed search_elapsed exact_elapsed
#> 1 euclidean 20 100 0.009 0.001 0
#> recall_at_k
#> 1 0.99

This is usually the easiest way to decide whether an approximate search configuration is worth adopting.
A Practical Decision Framework
Here is a simple way to decide between the two packages for a Euclidean workflow.
Start with bigKNN when:
- you need the exact answer
- you are still defining the benchmark target
- you do not yet know how much approximation your downstream task tolerates
Move toward bigANNOY when:
- exact search is too slow for the intended query workload
- you want a persisted index that can be reopened repeatedly
- you have measured acceptable recall relative to the exact baseline
Keep both in the workflow when:
- you want to monitor approximation quality over time
- you benchmark new n_trees or search_k settings
- you need a trustworthy exact baseline for evaluation or regression tests
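The framework above can be condensed into a small decision rule. The function and the recall threshold below are hypothetical illustrations, not part of either package; the right threshold depends entirely on your application:

```r
# Illustrative decision rule (threshold is hypothetical, tune per application):
# prefer the approximate backend only when measured recall stays acceptable
# and it actually delivers a speed win over exact search.
choose_backend <- function(recall_at_k, approx_elapsed, exact_elapsed,
                           min_recall = 0.95) {
  if (recall_at_k < min_recall) {
    "bigKNN"    # approximation loses too much quality
  } else if (approx_elapsed < exact_elapsed) {
    "bigANNOY"  # good enough and faster
  } else {
    "bigKNN"    # no speed win, so keep the exact answer
  }
}
```

Feeding this the recall and timing columns from the benchmark summary gives you a defensible, repeatable choice instead of a gut call.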
Important Boundaries
There are also a few boundaries worth keeping clear:
- bigKNN is the exact baseline only for Euclidean search
- bigANNOY supports additional Annoy metrics beyond Euclidean
- recall comparisons against bigKNN only make sense for Euclidean workloads
- an approximate result can be operationally excellent even when it is not exactly identical to the true top-k
That last point is easy to forget. The question is not whether approximate search is exact. The question is whether the approximation quality is good enough for the application you care about.
Recap
The best way to think about the pair is:
- bigKNN gives you exact Euclidean truth
- bigANNOY gives you fast approximate search on top of persisted Annoy indexes
- the shared result shape makes comparison practical
- the benchmark helpers let you quantify the trade-off instead of guessing
If you are beginning a new Euclidean workflow, a strong default is to
start with bigKNN as the baseline, then move to
bigANNOY once latency, scale, or persisted-index workflows
become the limiting factor.