Exact Graph Construction from big.matrix Data

bigKNN can build exact neighbour graphs directly from bigmemory::big.matrix references. These graph builders are useful when the exact neighbourhood structure is the real output you care about, not only the raw search result.

Typical uses include:

clustering and community discovery
manifold learning and neighborhood-preserving embeddings
graph-based preprocessing for downstream models
exact ground truth for evaluating approximate graph builders

This vignette introduces the graph families available in bigKNN, the symmetrization rules they use, and the output formats that make them easy to inspect or hand off downstream.

Build a Small Reference Set

We will work with six points arranged as two well-separated local groups. That geometry keeps the graph outputs easy to read while still showing meaningful differences between directed, mutual, shared-nearest-neighbour, and radius graphs.

reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(0.0, 0.3, 1.2, 4.0, 4.3, 5.2),
  x2 = c(0.0, 0.0, 0.0, 4.0, 4.1, 4.0)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))

reference_points
#>   id  x1  x2
#> 1 p1 0.0 0.0
#> 2 p2 0.3 0.0
#> 3 p3 1.2 0.0
#> 4 p4 4.0 4.0
#> 5 p5 4.3 4.1
#> 6 p6 5.2 4.0

The returned graph objects always refer to rows of the reference matrix. In other words, vertices 1:6 correspond to p1:p6.

Directed `k`-NN graphs with `knn_graph_bigmatrix()`

A directed k-nearest-neighbour graph stores one outgoing edge per retained neighbour for every row. With include_distance = TRUE, each edge value is the exact distance between the source row and the neighbour row.

directed_knn <- knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list",
  symmetrize = "none"
)

clean_graph(directed_knn)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  1 0.3000000
#> 3    3  2 0.9000000
#> 4    4  5 0.3162278
#> 5    5  4 0.3162278
#> 6    6  5 0.9055385

Here:

from is the source row
to is the chosen neighbour row
distance is the exact Euclidean distance

This is the most literal graph form: every query row keeps its own directed outgoing edges. If one row points to another but not vice versa, both facts are visible.

Mutual `k`-NN graphs with `mutual_knn_graph_bigmatrix()`

Mutual k-NN graphs keep only reciprocal neighbour relationships. They are often sparser and more conservative than directed k-NN graphs, because one-sided edges are dropped.

mutual_knn <- mutual_knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list"
)

clean_graph(mutual_knn)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    4  5 0.3162278

In this example, only the pairs (1, 2) and (4, 5) survive. Vertices 3 and 6 each point toward their nearest local anchor, but that preference is not returned in the other direction, so those edges disappear in the mutual graph.

mutual_knn_graph_bigmatrix() is equivalent to calling knn_graph_bigmatrix(..., symmetrize = "mutual"):

identical(
  clean_graph(mutual_knn),
  clean_graph(knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUE

Shared-nearest-neighbour graphs with `snn_graph_bigmatrix()`

Shared-nearest-neighbour graphs connect pairs of rows that have overlapping exact neighbour sets. Instead of storing distances, they store a similarity-like weight derived from neighbourhood overlap.

bigKNN currently supports two weight definitions:

weight = "count": the number of shared neighbours
weight = "jaccard": shared neighbours divided by the union size

snn_count <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "count",
  format = "edge_list"
)

snn_jaccard <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "jaccard",
  format = "edge_list"
)

merge(
  clean_graph(snn_count),
  clean_graph(snn_jaccard),
  by = c("from", "to"),
  suffixes = c("_count", "_jaccard")
)
#>   from to weight_count weight_jaccard
#> 1    1  2            1      0.3333333
#> 2    1  3            1      0.3333333
#> 3    2  3            1      0.3333333
#> 4    4  5            1      0.3333333
#> 5    4  6            1      0.3333333
#> 6    5  6            1      0.3333333

The edge set is the same for both weight choices in this example, but the weights differ:

count emphasizes absolute overlap
jaccard normalizes overlap by neighbourhood size

That normalization can be helpful when you want weights that are easier to compare across regions of the graph with different neighbour-set structure.

Radius graphs with `radius_graph_bigmatrix()`

Radius graphs use a fixed distance threshold instead of a fixed k. That means the number of neighbours can vary from row to row.

radius_directed <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "none"
)

radius_union <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "union"
)

clean_graph(radius_directed)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  1 0.3000000
#> 3    2  3 0.9000000
#> 4    3  2 0.9000000
#> 5    4  5 0.3162278
#> 6    5  4 0.3162278
#> 7    5  6 0.9055385
#> 8    6  5 0.9055385
clean_graph(radius_union)
#>   from to  distance
#> 1    1  2 0.3000000
#> 2    2  3 0.9000000
#> 3    4  5 0.3162278
#> 4    5  6 0.9055385

Compared with fixed-k graphs:

radius graphs encode a distance scale directly
dense regions can produce more edges than sparse regions
isolated rows may have very few or no neighbours at the chosen radius

In this self-search setting with a symmetric metric, symmetrize = "union" and symmetrize = "mutual" coincide because any radius edge that appears in one direction also appears in the other:

identical(
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "union")),
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "mutual"))
)
#> [1] TRUE

Symmetrization options and edge semantics

For k-NN and radius graph builders, symmetrize controls how directed edges are collapsed:

"none" keeps directed edges as-is
"union" keeps a pair if either direction appears
"mutual" keeps a pair only when both directions appear

The difference is easiest to see on k = 1:

graph_none <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "none")
graph_union <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "union")
graph_mutual <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual")

data.frame(
  symmetrize = c("none", "union", "mutual"),
  n_edges = c(nrow(clean_graph(graph_none)), nrow(clean_graph(graph_union)), nrow(clean_graph(graph_mutual))),
  row.names = NULL
)
#>   symmetrize n_edges
#> 1       none       6
#> 2      union       4
#> 3     mutual       2

When distances are stored, bigKNN keeps the minimum directed distance for the collapsed undirected pair. When unit weights are stored instead (include_distance = FALSE), the retained value is the maximum weight, which remains 1 for unweighted graphs.

Converting outputs with `as_edge_list()`, `as_triplet()`, and `as_sparse_matrix()`

Different downstream tasks prefer different graph formats:

edge lists are easiest to inspect and export
triplet form is a compact sparse representation in R lists
sparse matrices are convenient for linear algebra and adjacency-style operations

triplet_graph <- as_triplet(graph_mutual)
sparse_graph <- as_sparse_matrix(
  knn_graph_bigmatrix(
    reference,
    k = 1,
    format = "edge_list",
    symmetrize = "union",
    include_distance = FALSE
  )
)

triplet_graph
#> $i
#> [1] 1 4
#> 
#> $j
#> [1] 2 5
#> 
#> $x
#> [1] 0.3000000 0.3162278
#> 
#> $Dim
#> [1] 6 6
#> 
#> attr(,"class")
#> [1] "bigknn_graph_triplet" "bigknn_graph"        
#> attr(,"value_name")
#> [1] "distance"
#> attr(,"symmetric")
#> [1] TRUE
class(sparse_graph)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Matrix::summary(sparse_graph)
#> 6 x 6 sparse Matrix of class "dgCMatrix", with 8 entries
#>   i j x
#> 1 2 1 1
#> 2 1 2 1
#> 3 3 2 1
#> 4 2 3 1
#> 5 5 4 1
#> 6 4 5 1
#> 7 6 5 1
#> 8 5 6 1

One subtle but important point: symmetric edge lists and triplets store each undirected pair once, while the sparse matrix representation stores both matrix entries (i, j) and (j, i). That is why the sparse summary above shows twice as many non-zero entries as the symmetrized edge list.

You can round-trip back to an edge list when needed:

clean_graph(as_edge_list(sparse_graph))
#>   from to weight
#> 2    1  2      1
#> 1    2  1      1
#> 4    2  3      1
#> 3    3  2      1
#> 6    4  5      1
#> 5    5  4      1
#> 8    5  6      1
#> 7    6  5      1

Using graph outputs downstream

Once a graph is in sparse-matrix form, ordinary matrix operations become useful immediately. For an unweighted symmetric adjacency, row sums give vertex degrees:

degree <- Matrix::rowSums(sparse_graph)

data.frame(
  vertex = reference_points$id,
  degree = as.numeric(degree),
  row.names = NULL
)
#>   vertex degree
#> 1     p1      1
#> 2     p2      2
#> 3     p3      1
#> 4     p4      1
#> 5     p5      2
#> 6     p6      1

The same graph could also be handed off to a separate graph package after conversion:

use as_edge_list() for packages that ingest edge tables
use as_sparse_matrix() for adjacency-based workflows
use as_triplet() when you want a lightweight sparse representation without constructing a matrix object yet

That keeps bigKNN focused on exact search and exact graph construction, without forcing a particular downstream graph ecosystem on the user.

Build a Small Reference Set

Directed k-NN graphs with knn_graph_bigmatrix()

Mutual k-NN graphs with mutual_knn_graph_bigmatrix()

Shared-nearest-neighbour graphs with snn_graph_bigmatrix()

Radius graphs with radius_graph_bigmatrix()