bigKNN can build exact neighbour graphs directly from
bigmemory::big.matrix references. These graph builders are
useful when the exact neighbourhood structure is the real output you
care about, not only the raw search result.
Typical uses include:
- clustering and community discovery
- manifold learning and neighborhood-preserving embeddings
- graph-based preprocessing for downstream models
- exact ground truth for evaluating approximate graph builders
This vignette introduces the graph families available in
bigKNN, the symmetrization rules they use, and the output
formats that make them easy to inspect or hand off downstream.
Build a Small Reference Set
We will work with six points arranged as two well-separated local groups. That geometry keeps the graph outputs easy to read while still showing meaningful differences between directed, mutual, shared-nearest-neighbour, and radius graphs.
reference_points <- data.frame(
id = paste0("p", 1:6),
x1 = c(0.0, 0.3, 1.2, 4.0, 4.3, 5.2),
x2 = c(0.0, 0.0, 0.0, 4.0, 4.1, 4.0)
)
reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
reference_points
#> id x1 x2
#> 1 p1 0.0 0.0
#> 2 p2 0.3 0.0
#> 3 p3 1.2 0.0
#> 4 p4 4.0 4.0
#> 5 p5 4.3 4.1
#> 6 p6 5.2 4.0The returned graph objects always refer to rows of the reference
matrix. In other words, vertices 1:6 correspond to
p1:p6.
Directed k-NN graphs with
knn_graph_bigmatrix()
A directed k-nearest-neighbour graph stores one outgoing
edge per retained neighbour for every row. With
include_distance = TRUE, each edge value is the exact
distance between the source row and the neighbour row.
directed_knn <- knn_graph_bigmatrix(
reference,
k = 1,
format = "edge_list",
symmetrize = "none"
)
clean_graph(directed_knn)
#> from to distance
#> 1 1 2 0.3000000
#> 2 2 1 0.3000000
#> 3 3 2 0.9000000
#> 4 4 5 0.3162278
#> 5 5 4 0.3162278
#> 6 6 5 0.9055385Here:
-
fromis the source row -
tois the chosen neighbour row -
distanceis the exact Euclidean distance
This is the most literal graph form: every query row keeps its own directed outgoing edges. If one row points to another but not vice versa, both facts are visible.
Mutual k-NN graphs with
mutual_knn_graph_bigmatrix()
Mutual k-NN graphs keep only reciprocal neighbour
relationships. They are often sparser and more conservative than
directed k-NN graphs, because one-sided edges are
dropped.
mutual_knn <- mutual_knn_graph_bigmatrix(
reference,
k = 1,
format = "edge_list"
)
clean_graph(mutual_knn)
#> from to distance
#> 1 1 2 0.3000000
#> 2 4 5 0.3162278In this example, only the pairs (1, 2) and
(4, 5) survive. Vertices 3 and 6
each point toward their nearest local anchor, but that preference is not
returned in the other direction, so those edges disappear in the mutual
graph.
mutual_knn_graph_bigmatrix() is equivalent to calling
knn_graph_bigmatrix(..., symmetrize = "mutual"):
identical(
clean_graph(mutual_knn),
clean_graph(knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUEShared-nearest-neighbour graphs with
snn_graph_bigmatrix()
Shared-nearest-neighbour graphs connect pairs of rows that have overlapping exact neighbour sets. Instead of storing distances, they store a similarity-like weight derived from neighbourhood overlap.
bigKNN currently supports two weight definitions:
-
weight = "count": the number of shared neighbours -
weight = "jaccard": shared neighbours divided by the union size
snn_count <- snn_graph_bigmatrix(
reference,
k = 2,
weight = "count",
format = "edge_list"
)
snn_jaccard <- snn_graph_bigmatrix(
reference,
k = 2,
weight = "jaccard",
format = "edge_list"
)
merge(
clean_graph(snn_count),
clean_graph(snn_jaccard),
by = c("from", "to"),
suffixes = c("_count", "_jaccard")
)
#> from to weight_count weight_jaccard
#> 1 1 2 1 0.3333333
#> 2 1 3 1 0.3333333
#> 3 2 3 1 0.3333333
#> 4 4 5 1 0.3333333
#> 5 4 6 1 0.3333333
#> 6 5 6 1 0.3333333The edge set is the same for both weight choices in this example, but the weights differ:
-
countemphasizes absolute overlap -
jaccardnormalizes overlap by neighbourhood size
That normalization can be helpful when you want weights that are easier to compare across regions of the graph with different neighbour-set structure.
Radius graphs with radius_graph_bigmatrix()
Radius graphs use a fixed distance threshold instead of a fixed
k. That means the number of neighbours can vary from row to
row.
radius_directed <- radius_graph_bigmatrix(
reference,
radius = 1.1,
format = "edge_list",
symmetrize = "none"
)
radius_union <- radius_graph_bigmatrix(
reference,
radius = 1.1,
format = "edge_list",
symmetrize = "union"
)
clean_graph(radius_directed)
#> from to distance
#> 1 1 2 0.3000000
#> 2 2 1 0.3000000
#> 3 2 3 0.9000000
#> 4 3 2 0.9000000
#> 5 4 5 0.3162278
#> 6 5 4 0.3162278
#> 7 5 6 0.9055385
#> 8 6 5 0.9055385
clean_graph(radius_union)
#> from to distance
#> 1 1 2 0.3000000
#> 2 2 3 0.9000000
#> 3 4 5 0.3162278
#> 4 5 6 0.9055385Compared with fixed-k graphs:
- radius graphs encode a distance scale directly
- dense regions can produce more edges than sparse regions
- isolated rows may have very few or no neighbours at the chosen radius
In this self-search setting with a symmetric metric,
symmetrize = "union" and symmetrize = "mutual"
coincide because any radius edge that appears in one direction also
appears in the other:
identical(
clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "union")),
clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUESymmetrization options and edge semantics
For k-NN and radius graph builders,
symmetrize controls how directed edges are collapsed:
-
"none"keeps directed edges as-is -
"union"keeps a pair if either direction appears -
"mutual"keeps a pair only when both directions appear
The difference is easiest to see on k = 1:
graph_none <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "none")
graph_union <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "union")
graph_mutual <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual")
data.frame(
symmetrize = c("none", "union", "mutual"),
n_edges = c(nrow(clean_graph(graph_none)), nrow(clean_graph(graph_union)), nrow(clean_graph(graph_mutual))),
row.names = NULL
)
#> symmetrize n_edges
#> 1 none 6
#> 2 union 4
#> 3 mutual 2When distances are stored, bigKNN keeps the minimum
directed distance for the collapsed undirected pair. When unit weights
are stored instead (include_distance = FALSE), the retained
value is the maximum weight, which remains 1 for unweighted
graphs.
Converting outputs with as_edge_list(),
as_triplet(), and as_sparse_matrix()
Different downstream tasks prefer different graph formats:
- edge lists are easiest to inspect and export
- triplet form is a compact sparse representation in R lists
- sparse matrices are convenient for linear algebra and adjacency-style operations
triplet_graph <- as_triplet(graph_mutual)
sparse_graph <- as_sparse_matrix(
knn_graph_bigmatrix(
reference,
k = 1,
format = "edge_list",
symmetrize = "union",
include_distance = FALSE
)
)
triplet_graph
#> $i
#> [1] 1 4
#>
#> $j
#> [1] 2 5
#>
#> $x
#> [1] 0.3000000 0.3162278
#>
#> $Dim
#> [1] 6 6
#>
#> attr(,"class")
#> [1] "bigknn_graph_triplet" "bigknn_graph"
#> attr(,"value_name")
#> [1] "distance"
#> attr(,"symmetric")
#> [1] TRUE
class(sparse_graph)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Matrix::summary(sparse_graph)
#> 6 x 6 sparse Matrix of class "dgCMatrix", with 8 entries
#> i j x
#> 1 2 1 1
#> 2 1 2 1
#> 3 3 2 1
#> 4 2 3 1
#> 5 5 4 1
#> 6 4 5 1
#> 7 6 5 1
#> 8 5 6 1One subtle but important point: symmetric edge lists and triplets
store each undirected pair once, while the sparse matrix representation
stores both matrix entries (i, j) and (j, i).
That is why the sparse summary above shows twice as many non-zero
entries as the symmetrized edge list.
You can round-trip back to an edge list when needed:
clean_graph(as_edge_list(sparse_graph))
#> from to weight
#> 2 1 2 1
#> 1 2 1 1
#> 4 2 3 1
#> 3 3 2 1
#> 6 4 5 1
#> 5 5 4 1
#> 8 5 6 1
#> 7 6 5 1Using graph outputs downstream
Once a graph is in sparse-matrix form, ordinary matrix operations become useful immediately. For an unweighted symmetric adjacency, row sums give vertex degrees:
degree <- Matrix::rowSums(sparse_graph)
data.frame(
vertex = reference_points$id,
degree = as.numeric(degree),
row.names = NULL
)
#> vertex degree
#> 1 p1 1
#> 2 p2 2
#> 3 p3 1
#> 4 p4 1
#> 5 p5 2
#> 6 p6 1The same graph could also be handed off to a separate graph package after conversion:
- use
as_edge_list()for packages that ingest edge tables - use
as_sparse_matrix()for adjacency-based workflows - use
as_triplet()when you want a lightweight sparse representation without constructing a matrix object yet
That keeps bigKNN focused on exact search and exact
graph construction, without forcing a particular downstream graph
ecosystem on the user.