k-Nearest-Neighbor Statistics — API
Warning
This module is in active development and has not yet been validated against reference simulations. Results and APIs may change without notice. Do not use for scientific analysis. See Estimators in development for the current status and what is needed before promotion to stable.
The k-nearest-neighbor cumulative distribution function (kNN-CDF) is a flexible summary statistic that is sensitive to all connected N-point correlation functions of the underlying field within a single estimator. Banerjee & Abel (2021) report that it constrains cosmological parameters more tightly than traditional two-point statistics, especially on small and intermediate scales.
Mathematical Definition
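For a galaxy catalogue with positions \(\{\mathbf{x}_i\}\) and a set of \(N_q\) random query points \(\{\mathbf{q}_j\}\) drawn uniformly from the survey volume, let \(d_k(\mathbf{q}_j)\) be the distance from query point \(\mathbf{q}_j\) to its \(k\)-th nearest galaxy:
\[d_k(\mathbf{q}_j) = k\text{-th smallest element of } \{\, |\mathbf{q}_j - \mathbf{x}_i| \,\}_i\]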
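The kNN-CDF is the empirical distribution of these distances, i.e. the fraction of query points whose distance \(d_k\) to the \(k\)-th nearest galaxy is at most \(r\):
\[F_k(r) = \frac{1}{N_q} \sum_{j=1}^{N_q} \mathbb{1}\!\left[d_k(\mathbf{q}_j) \le r\right]\]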
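The definition above can be sketched with SciPy's kd-tree. This is a minimal illustration on a mock periodic-free box, not the module's implementation; the helper name `empirical_knn_cdf` and the mock data are assumptions made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree


def empirical_knn_cdf(data_xyz, query_xyz, k_values, r_grid):
    """Fraction of query points whose k-th nearest data point lies within r.

    Hypothetical helper for illustration; returns an array of shape (n_k, n_r).
    """
    k_values = np.asarray(k_values)
    r_grid = np.asarray(r_grid, dtype=float)
    tree = cKDTree(data_xyz)
    # Distances to the 1st .. k_max-th neighbours, sorted by increasing distance.
    dist, _ = tree.query(query_xyz, k=int(k_values.max()))
    dist = dist.reshape(len(query_xyz), -1)   # stays 2-D even when k_max == 1
    d_k = dist[:, k_values - 1]               # (N_q, n_k)
    # F_k(r) is the mean over query points of the indicator 1[d_k <= r].
    return (d_k[:, :, None] <= r_grid[None, None, :]).mean(axis=0)


rng = np.random.default_rng(0)
data = rng.uniform(0.0, 100.0, size=(2000, 3))     # mock positions [Mpc]
queries = rng.uniform(0.0, 100.0, size=(5000, 3))  # uniform query points
r_grid = np.linspace(1.0, 20.0, 15)
F_k = empirical_knn_cdf(data, queries, [1, 2, 3], r_grid)
print(F_k.shape)
```

By construction each row is non-decreasing in r, and at fixed r the CDF decreases with k, since the (k+1)-th neighbour is never closer than the k-th.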
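For a homogeneous Poisson point process with mean number density \(\bar{n}\), the kNN-CDF is given by the regularised lower incomplete gamma function (Erlang distribution):
\[F_k^{\mathrm{Pois}}(r) = \frac{\gamma\!\left(k,\; \bar{n} \cdot \tfrac{4\pi}{3} r^3\right)}{\Gamma(k)}\]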
The deviation \(F_k(r) - F_k^{\mathrm{Pois}}(r)\) encodes clustering beyond a Poisson baseline. Larger values of \(F_k\) at fixed \(r\) indicate overdensities (galaxies closer than expected); smaller values indicate voids.
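For integer \(k\), the Poisson reference has an elementary closed form, which may help when checking an implementation by hand. Writing \(V = \tfrac{4\pi}{3} r^3\) for the sphere volume, \(F_k^{\mathrm{Pois}}(r)\) is the probability that a Poisson count with mean \(\bar{n} V\) is at least \(k\):

```latex
\begin{aligned}
F_k^{\mathrm{Pois}}(r)
  &= \frac{\gamma(k,\, \bar{n} V)}{\Gamma(k)}
   = 1 - e^{-\bar{n} V} \sum_{i=0}^{k-1} \frac{(\bar{n} V)^i}{i!},
   \qquad V = \tfrac{4\pi}{3} r^3, \\
F_1^{\mathrm{Pois}}(r)
  &= 1 - e^{-\bar{n} V}.
\end{aligned}
```

The \(k = 1\) case is one minus the void probability of a sphere of radius \(r\).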
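kNN sphere volume at a query point \(\mathbf{x}\), where \(r_k(\mathbf{x})\) is the distance from \(\mathbf{x}\) to its \(k\)-th nearest galaxy:
\[V_k(\mathbf{x}) = \frac{4\pi}{3}\, r_k(\mathbf{x})^3\]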
This quantity is used for 2-D and 3-D density maps; see knn_volume_map().
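As a small worked illustration of the sphere volume, the sketch below computes \(V_k\) for a single query point with SciPy's kd-tree. The helper name `knn_sphere_volumes` and the toy coordinates are assumptions for the example; it is not the code behind knn_volume_map().

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_sphere_volumes(data_xyz, query_xyz, k_values):
    """V_k = 4*pi/3 * r_k**3 for each query point and each k (hypothetical helper)."""
    k_values = np.asarray(k_values)
    tree = cKDTree(data_xyz)
    r, _ = tree.query(query_xyz, k=int(k_values.max()))
    r = r.reshape(len(query_xyz), -1)   # keep a 2-D shape when k_max == 1
    r_k = r[:, k_values - 1]            # (N_q, n_k)
    return 4.0 * np.pi / 3.0 * r_k**3


# Four points along the x-axis at distances 1, 2, 3, 4 from the origin.
data = np.array([[1.0, 0.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [3.0, 0.0, 0.0],
                 [4.0, 0.0, 0.0]])
vols = knn_sphere_volumes(data, np.zeros((1, 3)), [1, 2])
# r_1 = 1 and r_2 = 2, so the volumes are 4*pi/3 and 32*pi/3.
print(vols)
```

A small \(V_k\) means the k-th neighbour is close, i.e. the local density is high; \(k / V_k\) is a commonly used local density estimate.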
Backend: pyfnntw
Nearest-neighbor queries use
pyfnntw (v0.4.1), a Python binding for
the fnntw Rust crate — a parallel,
cache-optimised kd-tree. The tree is built with leafsize=32 and queries
run across all available CPUs automatically.
If pyfnntw is not installed, the implementation falls back to
scipy.spatial.cKDTree. Install the primary backend with:
pip install pyfnntw # requires the Rust toolchain (rustup.rs)
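The fallback described above might look roughly like the following sketch. It only detects whether pyfnntw is importable and exercises the SciPy path; the pyfnntw query call itself is deliberately omitted, and the function name `query_scipy` is an assumption for the example.

```python
import importlib.util

import numpy as np
from scipy.spatial import cKDTree

# True when the optional primary backend can be imported.
HAVE_PYFNNTW = importlib.util.find_spec("pyfnntw") is not None
backend = "pyfnntw" if HAVE_PYFNNTW else "scipy.spatial.cKDTree"


def query_scipy(data_xyz, query_xyz, k, workers=-1):
    """Fallback k-nearest-neighbour query; workers=-1 uses all CPUs."""
    tree = cKDTree(data_xyz, leafsize=32)
    return tree.query(query_xyz, k=k, workers=workers)


rng = np.random.default_rng(0)
data = rng.uniform(0.0, 10.0, size=(500, 3))
queries = rng.uniform(0.0, 10.0, size=(50, 3))
dist, idx = query_scipy(data, queries, k=3)
print(backend, dist.shape)
```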
Quick-Start Example
import numpy as np
import sum_stat as ss
from astropy.cosmology import FlatLambdaCDM
cosmo = FlatLambdaCDM(H0=67.74, Om0=0.3089)
# --- Load catalogues ---
gal = ss.GalaxyCatalogue(ra=..., dec=..., redshift=...)
rand = ss.GalaxyCatalogue(ra=..., dec=..., redshift=...) # survey randoms
# --- Compute kNN-CDF for k = 1 … 5 ---
k_values = np.arange(1, 6)
r_bins = np.logspace(-1, 1.5, 21) # 0.1 – 31.6 Mpc
r_centres, F_k, F_k_poisson = ss.knn_cdf(
    gal, cosmo, k_values, r_bins,
    rand=rand,  # query points drawn from randoms
    n_query=100_000,
)
# F_k.shape = (5, 20)
# --- Cross-kNN between two populations ---
# gal_a, gal_b: two GalaxyCatalogue instances, built like `gal` above
r_c, F_k_a, F_k_b = ss.cross_knn_cdf(
    gal_a, gal_b, cosmo, k_values, r_bins, rand=rand,
)
# --- Density map (V_k at a regular grid) ---
from sum_stat.knn import comoving_xyz
gal_xyz = comoving_xyz(gal, cosmo) # (N_gal, 3) [Mpc]
grid_xyz = ... # your custom lattice
vols = ss.knn_volume_map(gal, grid_xyz, [1, 2, 3, 4], cosmo)
# --- Write to HDF5 ---
with ss.SummaryStatWriter("results.h5") as w:
    w.write_knn(
        "knn/BGS_bright",
        r_centres, F_k, F_k_poisson, r_bins, k_values.astype(float),
        cosmo,
        {"survey": "DESI-BGS", "n_query": 100_000},
    )
Output HDF5 Schema
knn/{sample_name}/
├── attrs: estimator="knn-cdf", n_query, n_gal, survey, …
├── r_centres [unit: "Mpc"]
├── bin_edges [unit: "Mpc"]
├── k_values [unit: "dimensionless"]
├── F_k [shape: (n_k, n_r), unit: "dimensionless"]
├── F_k_poisson [shape: (n_k, n_r), unit: "dimensionless"]
└── cosmology/ H0, h, Om0, Ob0, Ok0
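A file with this layout could be read back with h5py along the following lines. The sketch first writes a small stand-in file so that it runs on its own. Storing units as a `unit` attribute on each dataset and the cosmology values as group attributes are assumptions made here; the schema above does not say how the writer stores them.

```python
import os
import tempfile

import h5py
import numpy as np

r_bins = np.logspace(-1, 1.5, 21)
r_centres = np.sqrt(r_bins[:-1] * r_bins[1:])
k_values = np.arange(1, 6, dtype=float)
F_k = np.zeros((5, 20))  # placeholder values, not a real measurement

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write a stand-in file following the layout above (assumed attribute names).
with h5py.File(path, "w") as f:
    grp = f.create_group("knn/demo_sample")
    grp.attrs["estimator"] = "knn-cdf"
    grp.attrs["n_query"] = 100_000
    for name, arr, unit in [
        ("r_centres", r_centres, "Mpc"),
        ("bin_edges", r_bins, "Mpc"),
        ("k_values", k_values, "dimensionless"),
        ("F_k", F_k, "dimensionless"),
        ("F_k_poisson", F_k, "dimensionless"),
    ]:
        dset = grp.create_dataset(name, data=arr)
        dset.attrs["unit"] = unit
    cosmo_grp = grp.create_group("cosmology")
    cosmo_grp.attrs["H0"] = 67.74
    cosmo_grp.attrs["Om0"] = 0.3089

# Read it back.
with h5py.File(path, "r") as f:
    grp = f["knn/demo_sample"]
    F_read = grp["F_k"][:]
    r_unit = grp["r_centres"].attrs["unit"]
    estimator = grp.attrs["estimator"]

print(F_read.shape, r_unit, estimator)
```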
API Reference
- sum_stat.knn_cdf(gal, cosmo, k_values, r_bins, rand=None, n_query=100000, n_bar=None, seed=42, workers=-1)
k-nearest-neighbor CDF statistic F_k(r).
For each of n_query random query points drawn from the survey volume, the distance d_k to the k-th nearest galaxy is recorded. The empirical CDF F_k(r) = fraction of query points with d_k ≤ r is returned together with the Poisson (Erlang) reference F_k^Pois(r).
- Parameters:
gal (GalaxyCatalogue) – Galaxy catalogue (ra, dec, redshift required).
cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.
k_values (array_like of int) – Neighbor orders to compute, e.g. np.arange(1, 6).
r_bins (ndarray, shape (n_r+1,)) – Separation bin edges [Mpc].
rand (GalaxyCatalogue, optional) – Random catalogue tracing the survey geometry. Query points are drawn from this catalogue. If None, uniform points inside the bounding box of galaxy positions are used, which is suitable only for simulation boxes.
n_query (int) – Number of query points.
n_bar (float, optional) – Effective galaxy number density [Mpc⁻³] used for the Poisson reference. If None, estimated as sum(weights) / V_bounding_box. Pass this explicitly when the bounding box is a poor proxy for the survey volume.
seed (int) – RNG seed for query-point sampling.
workers (int) – Parallel workers for the scipy fallback kd-tree (-1 = all CPUs). Ignored when pyfnntw is installed.
- Returns:
r_centres (ndarray, shape (n_r,)) – Geometric-mean separation bin centres [Mpc].
F_k (ndarray, shape (n_k, n_r)) – Empirical kNN-CDF for each k in k_values.
F_k_poisson (ndarray, shape (n_k, n_r)) – Erlang (Poisson) reference CDF for each k.
References
Banerjee & Abel (2021), arXiv:2007.13342.
Banerjee et al. (2021), MNRAS 504, 2911.
Banerjee et al. (2023), MNRAS 519, 4856.
Yuan et al. (2023), MNRAS 522, 3935.
Gao et al. (2025), MNRAS 543, 3409.
Obreschkow et al. (2025), arXiv:2502.09709.
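Two of the quantities documented for knn_cdf, the geometric-mean bin centres and the default bounding-box density estimate, are simple enough to sketch directly. The helper names below are hypothetical and the code only illustrates the stated formulas; it is not taken from the module.

```python
import numpy as np


def geometric_bin_centres(r_bins):
    """Geometric mean of adjacent bin edges, shape (n_r,)."""
    r_bins = np.asarray(r_bins, dtype=float)
    return np.sqrt(r_bins[:-1] * r_bins[1:])


def bounding_box_density(xyz, weights=None):
    """sum(weights) / V_bounding_box, with unit weights by default."""
    xyz = np.asarray(xyz, dtype=float)
    if weights is None:
        weights = np.ones(len(xyz))
    volume = np.prod(xyz.max(axis=0) - xyz.min(axis=0))
    return np.sum(weights) / volume


r_bins = np.logspace(-1, 1.5, 21)
r_centres = geometric_bin_centres(r_bins)

# Eight points on the corners of a cube of side 10 Mpc: 8 / 1000 Mpc^-3.
corners = np.array([[x, y, z] for x in (0, 10) for y in (0, 10) for z in (0, 10)])
n_bar = bounding_box_density(corners)
print(r_centres[0], n_bar)
```

For a survey footprint that fills only part of its bounding box, this estimate is biased low, which is why the n_bar argument can be passed explicitly.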
- sum_stat.cross_knn_cdf(gal_a, gal_b, cosmo, k_values, r_bins, rand=None, n_query=100000, seed=42, workers=-1)
Cross-kNN CDFs from shared query points to two galaxy populations.
Computes F_k^A(r) and F_k^B(r) from the same set of query points: the fraction of query points whose k-th nearest neighbour in population A (or B) lies within distance r. The joint distribution captures cross-correlations between the two populations.
- Parameters:
gal_a (GalaxyCatalogue) – First galaxy population.
gal_b (GalaxyCatalogue) – Second galaxy population.
cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.
k_values (array_like of int) – Neighbor orders to compute.
r_bins (ndarray, shape (n_r+1,)) – Separation bin edges [Mpc].
rand (GalaxyCatalogue, optional) – Random catalogue tracing the survey geometry.
n_query (int) – Number of query points.
seed (int) – RNG seed.
workers (int) – Parallel workers for the scipy fallback kd-tree.
- Returns:
r_centres (ndarray, shape (n_r,)) – Geometric-mean bin centres [Mpc].
F_k_a (ndarray, shape (n_k, n_r)) – kNN-CDF towards population A.
F_k_b (ndarray, shape (n_k, n_r)) – kNN-CDF towards population B.
References
Banerjee & Abel (2021), arXiv:2007.13342, §4 (joint kNN). Banerjee et al. (2021), MNRAS 504, 2911.
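The idea of measuring two populations from one shared set of query points can be sketched as follows. The mock populations and the helper `knn_cdf_from_queries` are assumptions for illustration; the sketch returns the two marginal CDFs only and uses SciPy's kd-tree rather than the module's backend.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_cdf_from_queries(data_xyz, query_xyz, k, r_grid):
    """Empirical CDF of the k-th neighbour distance for one population."""
    dist, _ = cKDTree(data_xyz).query(query_xyz, k=k)
    d_k = dist if k == 1 else dist[:, -1]   # last column is the k-th neighbour
    return (d_k[:, None] <= r_grid[None, :]).mean(axis=0)


rng = np.random.default_rng(2)
pop_a = rng.uniform(0.0, 50.0, size=(3000, 3))    # denser mock population
pop_b = rng.uniform(0.0, 50.0, size=(500, 3))     # sparser mock population
queries = rng.uniform(0.0, 50.0, size=(4000, 3))  # one query set for both
r_grid = np.linspace(0.5, 10.0, 12)

F_a = knn_cdf_from_queries(pop_a, queries, 2, r_grid)
F_b = knn_cdf_from_queries(pop_b, queries, 2, r_grid)
print(F_a.shape, F_b.shape)
```

Because both CDFs use the same query points, the per-query distances to A and B are paired, which is what allows joint statistics to be formed from them.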
- sum_stat.knn_volume_map(gal, query_xyz, k_values, cosmo, workers=-1)
kNN sphere volumes V_k(x) = 4π/3 · r_k(x)³ at each query point.
Reproduces the VolumekNN function from the external kNN_CDFs library used in the GAMA analysis scripts. The volumes encode the local galaxy density field and are useful for 2-D and 3-D density maps.
- Parameters:
gal (GalaxyCatalogue) – Galaxy catalogue (ra, dec, redshift required).
query_xyz (ndarray, shape (N_q, 3)) – Cartesian comoving coordinates of the query points [Mpc]. Build these from a regular grid using comoving_xyz() or a custom lattice (see GAMA_gal_all_volumekNN.py for an example).
k_values (list of int) – Neighbor orders, e.g. [1, 2, 3, 4].
cosmo (FlatLambdaCDM) – Cosmology for comoving distance conversion.
workers (int) – Parallel workers for the scipy fallback kd-tree.
- Returns:
volumes (ndarray, shape (N_q, len(k_values))) – kNN sphere volume at each query point for each k [Mpc³].
Examples
>>> from sum_stat.knn import comoving_xyz, knn_volume_map
>>> gal_xyz = comoving_xyz(gal, cosmo)  # galaxy positions, reused here as query points
>>> vols = knn_volume_map(gal, gal_xyz, [1, 2, 3, 4], cosmo)
>>> print(vols.shape)  # (N_gal, 4)
References
Banerjee & Abel (2021), arXiv:2007.13342.
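A custom lattice of query points, as mentioned for query_xyz, could be built with NumPy along these lines. The mock box, the grid size, and the projection along z are assumptions for the example, and the volumes are computed with SciPy directly rather than through knn_volume_map().

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
gal_xyz = rng.uniform(0.0, 100.0, size=(5000, 3))  # mock positions [Mpc]

# Regular lattice of query points covering the same box.
n_side = 16
axis = np.linspace(0.0, 100.0, n_side)
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
grid_xyz = np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])  # (n_side**3, 3)

# kNN sphere volumes at every lattice point for k = 1..4.
k_values = np.array([1, 2, 3, 4])
dist, _ = cKDTree(gal_xyz).query(grid_xyz, k=int(k_values.max()))
volumes = 4.0 * np.pi / 3.0 * dist[:, k_values - 1] ** 3  # (N_q, n_k) [Mpc^3]

# Reshape the k = 1 volumes back onto the lattice and average along z
# to obtain a simple 2-D map.
vol_cube = volumes[:, 0].reshape(n_side, n_side, n_side)
map_2d = vol_cube.mean(axis=2)
print(volumes.shape, map_2d.shape)
```

Using `indexing="ij"` keeps the flattened order consistent with the later reshape, so `vol_cube[i, j, l]` corresponds to the lattice point `(axis[i], axis[j], axis[l])`.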
- sum_stat.knn_poisson_cdf(k, r_edges, n_bar)
Poisson (Erlang) reference CDF for the k-th nearest neighbor distance.
For a homogeneous Poisson point process with number density n_bar, the CDF of the distance from a random query point to its k-th nearest neighbor is the regularised lower incomplete gamma function:
\[F_k^{\mathrm{Pois}}(r) = \frac{\gamma(k,\; \bar{n} \cdot \tfrac{4\pi}{3} r^3)}{\Gamma(k)}\]
- Parameters:
k (int) – Neighbor order (≥ 1).
r_edges (jax.Array, shape (n_r+1,)) – Bin edges [Mpc]. Evaluated at right edges r_edges[1:].
n_bar (float) – Mean galaxy number density [Mpc⁻³].
- Returns:
F_k_poisson (jax.Array, shape (n_r,)) – Poisson reference CDF at each bin right-edge.
- Return type:
Array
References
Banerjee & Abel (2021), arXiv:2007.13342, eq. (2).
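The Poisson reference can be reproduced in a few lines. The sketch below uses SciPy's `gammainc`, which is already the regularised lower incomplete gamma function, instead of JAX; the function name `poisson_knn_cdf` is an assumption and the code is an illustration of the formula, not the module's source.

```python
import numpy as np
from scipy.special import gammainc


def poisson_knn_cdf(k, r_edges, n_bar):
    """Erlang reference CDF evaluated at the right bin edges r_edges[1:]."""
    r = np.asarray(r_edges, dtype=float)[1:]
    mean_count = n_bar * 4.0 * np.pi / 3.0 * r**3  # expected points in the sphere
    return gammainc(k, mean_count)                 # gamma(k, x) / Gamma(k)


r_edges = np.logspace(-1, 1.5, 21)
n_bar = 1e-2  # mock density [Mpc^-3]
F1 = poisson_knn_cdf(1, r_edges, n_bar)
F2 = poisson_knn_cdf(2, r_edges, n_bar)

# For k = 1 the result reduces to 1 - exp(-n_bar * V).
V = 4.0 * np.pi / 3.0 * r_edges[1:] ** 3
print(np.allclose(F1, 1.0 - np.exp(-n_bar * V)))
```

JAX exposes a function of the same name as `jax.scipy.special.gammainc`, so a JAX version would presumably differ mainly in the array type.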
- sum_stat.comoving_xyz(cat, cosmo)
Convert RA/Dec/z to Cartesian comoving coordinates.
- Parameters:
cat (GalaxyCatalogue) – Catalogue with ra, dec, redshift attributes.
cosmo (FlatLambdaCDM) – Cosmology for redshift → comoving distance.
- Returns:
xyz (ndarray, shape (N, 3)) – Cartesian comoving coordinates [Mpc].
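The geometric part of this conversion can be sketched with NumPy alone. The axis convention (x towards RA = 0, Dec = 0; z towards the north pole) and the helper name `spherical_to_cartesian` are assumptions, since the module's convention is not stated here. The comoving distance is taken as an input; with astropy it could be obtained from `cosmo.comoving_distance(z).value`.

```python
import numpy as np


def spherical_to_cartesian(ra_deg, dec_deg, distance):
    """RA/Dec in degrees plus a radial distance -> (N, 3) Cartesian array.

    Hypothetical helper; assumes x points to (RA, Dec) = (0, 0) and z to Dec = +90.
    """
    ra = np.deg2rad(np.atleast_1d(ra_deg))
    dec = np.deg2rad(np.atleast_1d(dec_deg))
    d = np.atleast_1d(distance).astype(float)
    x = d * np.cos(dec) * np.cos(ra)
    y = d * np.cos(dec) * np.sin(ra)
    z = d * np.sin(dec)
    return np.column_stack([x, y, z])


# Three test directions at comoving distances of 100, 1 and 5 Mpc.
xyz = spherical_to_cartesian([0.0, 90.0, 0.0], [0.0, 0.0, 90.0], [100.0, 1.0, 5.0])
print(np.round(xyz, 6))
```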
References
Banerjee A. & Abel T. (2021), k-Nearest Neighbour Statistics of the Matter Distribution, MNRAS 500, 5479. arXiv:2007.13342.
Banerjee A., Abel T. & Neyrinck M. (2021), MNRAS 504, 2911.
Banerjee A. & Abel T. (2023), MNRAS 519, 4856.
Yuan S. et al. (2023), MNRAS 522, 3935.
Gao Y. et al. (2025), MNRAS 543, 3409.
Obreschkow D. et al. (2025), arXiv:2502.09709.