k-Nearest-Neighbor Statistics — API

Warning

This module is in active development and has not yet been validated against reference simulations. Results and APIs may change without notice. Do not use for scientific analysis. See Estimators in development for the current status and what is needed before promotion to stable.

The k-nearest-neighbor cumulative distribution function (kNN-CDF) is a summary statistic that is sensitive to all connected N-point correlation functions of the underlying field, rather than to the two-point function alone. Banerjee & Abel (2021) report tighter cosmological parameter constraints from the kNN-CDF than from two-point statistics measured over the same scales, with the gain most pronounced on small and intermediate scales.

Mathematical Definition

For a galaxy catalogue with positions \(\{\mathbf{x}_i\}\) and a set of \(N_q\) random query points \(\{\mathbf{q}_j\}\) drawn uniformly from the survey volume:

\[d_k(\mathbf{q}_j) \;=\; \text{distance from } \mathbf{q}_j \text{ to its } k\text{-th nearest galaxy}\]

The kNN-CDF is the empirical distribution of these distances:

\[F_k(r) \;=\; \frac{1}{N_q} \#\!\left\{ j : d_k(\mathbf{q}_j) \le r \right\}\]
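The two definitions above can be sketched with scipy.spatial.cKDTree, the documented fallback backend, on a mock uniform box. The helper name `empirical_knn_cdf`, the box size, and the point counts are illustrative assumptions and are not part of sum_stat.

```python
import numpy as np
from scipy.spatial import cKDTree


def empirical_knn_cdf(data_xyz, query_xyz, k_values, r_eval):
    """Fraction of query points whose k-th nearest data point lies within r.

    Hypothetical helper for illustration; not the sum_stat implementation.
    """
    k_values = np.asarray(k_values, dtype=int)
    tree = cKDTree(data_xyz)
    # Distances to the 1st..k_max-th neighbours, shape (N_q, k_max).
    dist, _ = tree.query(query_xyz, k=int(k_values.max()))
    dist = dist.reshape(len(query_xyz), -1)   # k_max = 1 returns a 1-D array
    d_k = dist[:, k_values - 1]               # shape (N_q, n_k)
    # F_k(r) = #{j : d_k(q_j) <= r} / N_q, evaluated at every r.
    return (d_k[:, :, None] <= r_eval[None, None, :]).mean(axis=0)


rng = np.random.default_rng(0)
box = 100.0                                    # assumed box side [Mpc]
data = rng.uniform(0.0, box, size=(2000, 3))   # mock "galaxies"
query = rng.uniform(0.0, box, size=(5000, 3))  # random query points
r_eval = np.logspace(0.0, 1.5, 10)

F_k = empirical_knn_cdf(data, query, [1, 2, 3], r_eval)
print(F_k.shape)
```

Each row of `F_k` is non-decreasing in r, and at fixed r the rows decrease with k, because the k-th neighbour is never closer than the (k-1)-th.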

For a homogeneous Poisson point process with mean number density \(\bar{n}\), the kNN-CDF is given by the regularised lower incomplete gamma function (Erlang distribution):

\[F_k^{\mathrm{Pois}}(r) \;=\; \frac{\gamma\!\left(k,\; \bar{n}\,\tfrac{4\pi}{3}\,r^3\right)}{\Gamma(k)}\]

The deviation \(F_k(r) - F_k^{\mathrm{Pois}}(r)\) encodes clustering beyond a Poisson baseline. Larger values of \(F_k\) at fixed \(r\) indicate overdensities (galaxies closer than expected); smaller values indicate voids.
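As a quick numerical check of the Poisson reference, the sketch below evaluates the formula with scipy.special.gammainc, which is already the regularised lower incomplete gamma function. The function name `poisson_knn_cdf` and the density value are assumptions chosen for illustration.

```python
import numpy as np
from scipy.special import gammainc


def poisson_knn_cdf(k, r, n_bar):
    """Erlang CDF P(k, n_bar * 4*pi/3 * r^3), P = regularised lower incomplete gamma."""
    mean_count = n_bar * (4.0 * np.pi / 3.0) * np.asarray(r, dtype=float) ** 3
    return gammainc(k, mean_count)   # scipy's gammainc already includes 1 / Gamma(k)


n_bar = 1.0e-3                        # assumed number density [Mpc^-3]
r = np.logspace(0.0, 1.5, 6)          # radii [Mpc]
F1 = poisson_knn_cdf(1, r, n_bar)
F2 = poisson_knn_cdf(2, r, n_bar)

# For k = 1 the expression reduces to the void probability form 1 - exp(-n_bar * V).
V = 4.0 * np.pi / 3.0 * r**3
print(np.allclose(F1, 1.0 - np.exp(-n_bar * V)))
```

The k = 1 case is a useful sanity test because it has a closed form that does not involve the gamma function at all.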

kNN sphere volume at a query point:

\[V_k(\mathbf{q}) \;=\; \tfrac{4\pi}{3}\,d_k(\mathbf{q})^3\]

This quantity is used for 2-D and 3-D density maps; see knn_volume_map().
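A minimal sketch of the volume computation, again using scipy.spatial.cKDTree on mock positions. The array names and sizes are assumptions; this is not the body of knn_volume_map().

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 50.0, size=(1000, 3))   # mock galaxy positions [Mpc]
query = rng.uniform(0.0, 50.0, size=(200, 3))   # mock query points [Mpc]
k_values = np.array([1, 2, 3, 4])

# Distances to the 1st..4th neighbours, shape (N_q, 4).
dist, _ = cKDTree(data).query(query, k=int(k_values.max()))
d_k = dist[:, k_values - 1]                     # select the requested orders
V_k = 4.0 * np.pi / 3.0 * d_k**3                # sphere volumes [Mpc^3]
print(V_k.shape)
```

Small volumes mark query points in dense regions, large volumes mark query points in sparse regions.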

Backend: pyfnntw

Nearest-neighbor queries use pyfnntw (v0.4.1), a Python binding for the fnntw Rust crate — a parallel, cache-optimised kd-tree. The tree is built with leafsize=32 and queries run across all available CPUs automatically.

If pyfnntw is not installed, the implementation falls back to scipy.spatial.cKDTree. Install the primary backend with:

pip install pyfnntw     # requires the Rust toolchain (rustup.rs)
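To see which backend would be picked up in a given environment, one option is to test whether the package is importable. The helper `pick_knn_backend` is hypothetical and only mirrors the fallback rule described above; it does not call into sum_stat or pyfnntw.

```python
import importlib.util


def pick_knn_backend():
    """Return "pyfnntw" if that package is importable, otherwise "scipy"."""
    if importlib.util.find_spec("pyfnntw") is not None:
        return "pyfnntw"
    return "scipy"


backend = pick_knn_backend()
print(f"kNN backend: {backend}")
```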

Quick-Start Example

import numpy as np
import sum_stat as ss
from astropy.cosmology import FlatLambdaCDM

cosmo = FlatLambdaCDM(H0=67.74, Om0=0.3089)

# --- Load catalogues ---
gal  = ss.GalaxyCatalogue(ra=..., dec=..., redshift=...)
rand = ss.GalaxyCatalogue(ra=..., dec=..., redshift=...)  # survey randoms

# --- Compute kNN-CDF for k = 1 … 5 ---
k_values = np.arange(1, 6)
r_bins   = np.logspace(-1, 1.5, 21)   # 0.1 – 31.6 Mpc

r_centres, F_k, F_k_poisson = ss.knn_cdf(
    gal, cosmo, k_values, r_bins,
    rand=rand,       # query points drawn from randoms
    n_query=100_000,
)
# F_k.shape = (5, 20)

# --- Cross-kNN between two populations ---
# gal_a, gal_b: two GalaxyCatalogue instances supplied by the caller (not defined in this snippet)
r_c, F_k_a, F_k_b = ss.cross_knn_cdf(
    gal_a, gal_b, cosmo, k_values, r_bins, rand=rand,
)

# --- Density map (V_k at a regular grid) ---
from sum_stat.knn import comoving_xyz
gal_xyz   = comoving_xyz(gal, cosmo)          # (N_gal, 3) [Mpc]
grid_xyz  = ...                                # your custom lattice
vols = ss.knn_volume_map(gal, grid_xyz, [1, 2, 3, 4], cosmo)

# --- Write to HDF5 ---
with ss.SummaryStatWriter("results.h5") as w:
    w.write_knn(
        "knn/BGS_bright",
        r_centres, F_k, F_k_poisson, r_bins, k_values.astype(float),
        cosmo,
        {"survey": "DESI-BGS", "n_query": 100_000},
    )

Output HDF5 Schema

knn/{sample_name}/
├── attrs: estimator="knn-cdf", n_query, n_gal, survey, …
├── r_centres            [unit: "Mpc"]
├── bin_edges            [unit: "Mpc"]
├── k_values             [unit: "dimensionless"]
├── F_k                  [shape: (n_k, n_r), unit: "dimensionless"]
├── F_k_poisson          [shape: (n_k, n_r), unit: "dimensionless"]
└── cosmology/           H0, h, Om0, Ob0, Ok0
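The layout can be read back with h5py. The sketch below first writes a small mock file that follows the schema, so that it runs on its own, and then reads it. It assumes that units are stored in a dataset attribute named `unit`; that attribute name, the sample name, and the placeholder values are assumptions rather than documented behaviour of SummaryStatWriter.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "mock_results.h5")

# Write a mock file that follows the documented layout (values are placeholders).
with h5py.File(path, "w") as f:
    grp = f.create_group("knn/mock_sample")
    grp.attrs["estimator"] = "knn-cdf"
    ds = grp.create_dataset("bin_edges", data=np.logspace(-1, 1.5, 21))
    ds.attrs["unit"] = "Mpc"                  # assumed attribute name
    ds = grp.create_dataset("F_k", data=np.zeros((5, 20)))
    ds.attrs["unit"] = "dimensionless"

# Read the group back.
with h5py.File(path, "r") as f:
    grp = f["knn/mock_sample"]
    estimator = grp.attrs["estimator"]
    F_k = grp["F_k"][...]                     # load the full array into memory
    unit = grp["bin_edges"].attrs["unit"]

print(estimator, F_k.shape, unit)
```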

API Reference

sum_stat.knn_cdf(gal, cosmo, k_values, r_bins, rand=None, n_query=100000, n_bar=None, seed=42, workers=-1)[source]

k-nearest-neighbor CDF statistic F_k(r).

For each of n_query random query points drawn from the survey volume, the distance d_k to the k-th nearest galaxy is recorded. The empirical CDF F_k(r) = fraction of query points with d_k ≤ r is returned together with the Poisson (Erlang) reference F_k^Pois(r).

Parameters:
  • gal (GalaxyCatalogue) – Galaxy catalogue (ra, dec, redshift required).

  • cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.

  • k_values (array_like of int) – Neighbor orders to compute, e.g. np.arange(1, 6).

  • r_bins (ndarray, shape (n_r+1,)) – Separation bin edges [Mpc].

  • rand (GalaxyCatalogue, optional) – Random catalogue tracing the survey geometry. Query points are drawn from this catalogue. If None, uniform points inside the bounding box of galaxy positions are used — suitable only for simulation boxes.

  • n_query (int) – Number of query points.

  • n_bar (float, optional) – Effective galaxy number density [Mpc⁻³] used for the Poisson reference. If None, estimated as sum(weights) / V_bounding_box. Pass this explicitly when the bounding box is a poor proxy for the survey volume.

  • seed (int) – RNG seed for query-point sampling.

  • workers (int) – Parallel workers for the scipy fallback kd-tree (-1 = all CPUs). Ignored when pyfnntw is installed.

Returns:

  • r_centres (ndarray, shape (n_r,)) – Geometric-mean separation bin centres [Mpc].

  • F_k (ndarray, shape (n_k, n_r)) – Empirical kNN-CDF for each k in k_values.

  • F_k_poisson (ndarray, shape (n_k, n_r)) – Erlang (Poisson) reference CDF for each k.

Return type:

tuple[ndarray, ndarray, ndarray]
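Two of the quantities described above are easy to reproduce by hand: the geometric-mean bin centres and the bounding-box density estimate used when n_bar is None. The sketch below uses mock Cartesian positions and unit weights, both of which are assumptions for illustration.

```python
import numpy as np

r_bins = np.logspace(-1, 1.5, 21)                 # 21 edges -> 20 bins [Mpc]
r_centres = np.sqrt(r_bins[:-1] * r_bins[1:])     # geometric mean of each edge pair

# Bounding-box density estimate: sum(weights) / V_bounding_box.
rng = np.random.default_rng(2)
xyz = rng.uniform(0.0, 200.0, size=(10_000, 3))   # mock positions [Mpc]
weights = np.ones(len(xyz))
box_volume = np.prod(xyz.max(axis=0) - xyz.min(axis=0))
n_bar = weights.sum() / box_volume                # [Mpc^-3]
print(r_centres.shape, n_bar)
```

For a survey footprint that fills only part of its bounding box, this estimate is biased low, which is why passing n_bar explicitly is recommended in that case.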

References

Banerjee & Abel (2021), arXiv:2007.13342.
Banerjee et al. (2021), MNRAS 504, 2911.
Banerjee et al. (2023), MNRAS 519, 4856.
Yuan et al. (2023), MNRAS 522, 3935.
Gao et al. (2025), MNRAS 543, 3409.
Obreschkow et al. (2025), arXiv:2502.09709.

sum_stat.cross_knn_cdf(gal_a, gal_b, cosmo, k_values, r_bins, rand=None, n_query=100000, seed=42, workers=-1)[source]

Cross-kNN CDFs from shared query points to two galaxy populations.

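Computes F_k^A(r) and F_k^B(r) from the same set of query points: the fraction of query points whose k-th nearest neighbour in population A (or B) lies within distance r. Sharing the query points is what allows the two sets of distances to be combined into a joint distribution, which captures cross-correlations between the two populations; the function itself returns the two marginal CDFs.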

Parameters:
  • gal_a (GalaxyCatalogue) – First galaxy population.

  • gal_b (GalaxyCatalogue) – Second galaxy population.

  • cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.

  • k_values (array_like of int) – Neighbor orders to compute.

  • r_bins (ndarray, shape (n_r+1,)) – Separation bin edges [Mpc].

  • rand (GalaxyCatalogue, optional) – Random catalogue tracing the survey geometry.

  • n_query (int) – Number of query points.

  • seed (int) – RNG seed.

  • workers (int) – Parallel workers for the scipy fallback kd-tree.

Returns:

  • r_centres (ndarray, shape (n_r,)) – Geometric-mean bin centres [Mpc].

  • F_k_a (ndarray, shape (n_k, n_r)) – kNN-CDF towards population A.

  • F_k_b (ndarray, shape (n_k, n_r)) – kNN-CDF towards population B.

Return type:

tuple[ndarray, ndarray, ndarray]
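The shared-query-point construction can be sketched as follows with scipy.spatial.cKDTree on two mock, mutually independent populations. The joint CDF computed here (both k-th neighbours within r of the same query point) is an assumption about how the marginals would be combined downstream; it is not a value returned by cross_knn_cdf.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
box = 100.0                                        # assumed box side [Mpc]
pop_a = rng.uniform(0.0, box, size=(1500, 3))      # mock population A
pop_b = rng.uniform(0.0, box, size=(800, 3))       # mock population B
query = rng.uniform(0.0, box, size=(4000, 3))      # shared query points
k = 2
r_eval = np.logspace(0.5, 1.5, 8)

# k-th neighbour distance from every query point to each population.
d_a = cKDTree(pop_a).query(query, k=k)[0][:, k - 1]
d_b = cKDTree(pop_b).query(query, k=k)[0][:, k - 1]

in_a = d_a[:, None] <= r_eval[None, :]             # (N_q, n_r) booleans
in_b = d_b[:, None] <= r_eval[None, :]
F_a = in_a.mean(axis=0)                            # marginal CDF towards A
F_b = in_b.mean(axis=0)                            # marginal CDF towards B
F_joint = (in_a & in_b).mean(axis=0)               # both within r at the same query point

# For independent populations F_joint stays close to F_a * F_b.
excess = F_joint - F_a * F_b
print(np.abs(excess).max())
```

A joint CDF can only be formed when both distances are measured from the same query points, which is the reason for sharing them.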

References

Banerjee & Abel (2021), arXiv:2007.13342, §4 (joint kNN). Banerjee et al. (2021), MNRAS 504, 2911.

sum_stat.knn_volume_map(gal, query_xyz, k_values, cosmo, workers=-1)[source]

kNN sphere volumes V_k(x) = 4π/3 · d_k(x)³ at each query point.

Reproduces the VolumekNN function from the external kNN_CDFs library used in the GAMA analysis scripts. The volumes encode the local galaxy density field and are useful for 2-D and 3-D density maps.

Parameters:
  • gal (GalaxyCatalogue) – Galaxy catalogue (ra, dec, redshift required).

  • query_xyz (ndarray, shape (N_q, 3)) – Cartesian comoving coordinates of the query points [Mpc]. Build these from a regular grid using comoving_xyz() or a custom lattice (see GAMA_gal_all_volumekNN.py for an example).

  • k_values (list of int) – Neighbor orders, e.g. [1, 2, 3, 4].

  • cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.

  • workers (int) – Parallel workers for the scipy fallback kd-tree.

Returns:

volumes (ndarray, shape (N_q, len(k_values))) – kNN sphere volume at each query point for each k [Mpc³].

Return type:

ndarray

Examples

>>> from sum_stat.knn import comoving_xyz, knn_volume_map
>>> gal_xyz = comoving_xyz(gal, cosmo)           # galaxy positions, used here as the query points
>>> vols = knn_volume_map(gal, gal_xyz, [1, 2, 3, 4], cosmo)
>>> print(vols.shape)   # (N_gal, 4)
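The example above passes the galaxy positions themselves as query points. If a regular lattice is wanted instead, one possible construction is sketched below; `regular_lattice` is a hypothetical helper, and the mock array stands in for the output of comoving_xyz(gal, cosmo).

```python
import numpy as np


def regular_lattice(xyz, n_per_axis):
    """Regular lattice spanning the bounding box of `xyz`, shape (n_per_axis**3, 3)."""
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    axes = [np.linspace(lo[i], hi[i], n_per_axis) for i in range(3)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])


rng = np.random.default_rng(4)
gal_xyz = rng.uniform(-50.0, 50.0, size=(500, 3))   # stand-in for comoving_xyz(gal, cosmo)
grid_xyz = regular_lattice(gal_xyz, n_per_axis=16)
print(grid_xyz.shape)
```

The resulting (N_q, 3) array has the shape that the query_xyz argument expects. Lattice points that fall outside the survey footprint would need to be masked separately.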

References

Banerjee & Abel (2021), arXiv:2007.13342.

sum_stat.knn_poisson_cdf(k, r_edges, n_bar)[source]

Poisson (Erlang) reference CDF for the k-th nearest neighbor distance.

For a homogeneous Poisson point process with number density n_bar, the CDF of the distance from a random query point to its k-th nearest neighbor is the regularised lower incomplete gamma function:

\[F_k^{\mathrm{Pois}}(r) = \frac{\gamma(k,\; \bar{n} \cdot \tfrac{4\pi}{3} r^3)}{\Gamma(k)}\]
Parameters:
  • k (int) – Neighbor order (≥ 1).

  • r_edges (jax.Array, shape (n_r+1,)) – Bin edges [Mpc]. Evaluated at right edges r_edges[1:].

  • n_bar (float) – Mean galaxy number density [Mpc⁻³].

Returns:

F_k_poisson (jax.Array, shape (n_r,)) – Poisson reference CDF at each bin right-edge.

Return type:

Array

References

Banerjee & Abel (2021), arXiv:2007.13342, eq. (2).

sum_stat.comoving_xyz(cat, cosmo)[source]

Convert RA/Dec/z to Cartesian comoving coordinates.

Parameters:
  • cat (GalaxyCatalogue) – Catalogue with ra, dec, redshift attributes.

  • cosmo (FlatLambdaCDM) – Cosmology for redshift → comoving distance.

Returns:

xyz (ndarray, shape (N, 3)) – Cartesian comoving coordinates [Mpc].

Return type:

ndarray
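The underlying geometry is the standard spherical-to-Cartesian map. The sketch below takes radial distances directly, so that it needs only NumPy; with astropy they could be obtained from cosmo.comoving_distance(z). The axis convention (x towards RA = 0, Dec = 0 and z towards the north pole) and the function name `sky_to_cartesian` are assumptions, not necessarily what comoving_xyz() uses.

```python
import numpy as np


def sky_to_cartesian(ra_deg, dec_deg, dist):
    """Map (RA, Dec, radial distance) to Cartesian (x, y, z); angles in degrees."""
    ra = np.radians(ra_deg)
    dec = np.radians(dec_deg)
    x = dist * np.cos(dec) * np.cos(ra)
    y = dist * np.cos(dec) * np.sin(ra)
    z = dist * np.sin(dec)
    return np.column_stack([x, y, z])


ra = np.array([0.0, 90.0, 45.0])          # degrees
dec = np.array([0.0, 0.0, 90.0])          # degrees
dist = np.array([100.0, 200.0, 300.0])    # placeholder comoving distances [Mpc]
xyz = sky_to_cartesian(ra, dec, dist)
print(np.round(xyz, 6))
```

The Euclidean norm of each output row equals the input distance, which gives a simple consistency check for any convention.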

References

  • Banerjee A. & Abel T. (2021), Nearest neighbour distributions: new statistical measures for cosmological clustering, MNRAS 500, 5479. arXiv:2007.13342

  • Banerjee A., Abel T. & Neyrinck M. (2021), MNRAS 504, 2911.

  • Banerjee A. & Abel T. (2023), MNRAS 519, 4856.

  • Yuan S. et al. (2023), MNRAS 522, 3935.

  • Gao Y. et al. (2025), MNRAS 543, 3409.

  • Obreschkow D. et al. (2025), arXiv:2502.09709.