Overview

sum_stat measures summary statistics from galaxy and weak-lensing catalogues: luminosity function, stellar mass function, completeness curves, clustering and galaxy-galaxy lensing. It leverages Corrfunc and dsigma for clustering and lensing summary statistics, respectively. It writes self-describing HDF5 files, and is consumed by the companion gga_model package for parametric fitting. Galaxy-sample weights (from imaging systematics) produced by the companion sys_mapping package can be passed.

Estimator maturity

The table below shows which estimators are production-ready and which are still in active development. Only the four stable estimators should be used for scientific analysis; the others may change without notice.

Estimator

Module

Status

Notes

SMF φ(M★)

sum_stat.lf_smf

Stable

Validated against COSMOS, GAMA, and Uchuu mocks

wp(rp)

sum_stat.twopcf

Stable

Validated against Uchuu mocks and BGS

w(θ)

sum_stat.twopcf

Stable

Validated against Uchuu mocks and BGS

ΔΣ(rp)

sum_stat.lensing

Stable

Validated on HSC Y3 / DES Y3 / KiDS sources

LF φ(M)

sum_stat.lf_smf

Development

Same estimators as SMF; not yet validated against reference datasets

ξ(r), ξ(s)

sum_stat.twopcf

Development

API and output format may change

C, P(k)

sum_stat.powspec

Development

Not yet validated against literature

kNN-CDF

sum_stat.knn

Development

Not yet validated against simulations

n(z)

sum_stat.nz

Development

Utility; no validation suite yet

Summary statistics

A summary statistic compresses a galaxy survey into a low-dimensional quantity that retains the cosmological information relevant to a given test. The page Estimators from discrete galaxy surveys describes how each estimator functions.

One-point statistics — density functions (stable: SMF)

One-point function estimators are in sum_stat.lf_smf.

The stellar mass function (SMF) estimators below are production-ready. The luminosity function (LF) estimators use the same code but are listed separately under Statistics in development because they lack a completed validation suite.

Statistic

Function

Backend · Reference

Required inputs

CPU time (200k / 1M / 3M gal)

SMF φ(M) — 1/Vmax

stellar_mass_function_vmax()

JAX; Schmidt (1968)

cat: z, m, w; areasr, cosmo

< 1 s / 1 s / 3 s

SMF φ(M) — SWML

stellar_mass_function_swml()

iterative; Efstathiou et al. (1988)

cat: z, m, w; m★,lim, areasr, cosmo

< 1 s / 1 s / ~3 s

SMF Φ(>M) — C cumulative

cumulative_stellar_mass_function_cminus()

product O(N²); Lynden-Bell (1971)

cat: z, m, w; m★,lim, areasr, cosmo

40 s / ~15 min / ~2 h

M★–z independence τ

efron_petrosian_tau()

rank O(N²); Efron & Petrosian (1992)

cat: z, m; m★,lim, cosmo

~2 min / ~50 min / ~8 h

M★–z independence Tc/Tv

rauzy_completeness()

rank O(N²); Rauzy (2001)

cat: z, m; m★,lim, cosmo

~2 min / ~50 min / ~8 h

Completeness scan Tc(m★), Tv(m★)

completeness_scan()

rank O(N² × n_test); Johnston+ (2007), Teodoro+ (2010), Johnston+ (2012)

mapp, z, m; maglim, cosmo; optional bright limit & S/N target

~2 min / ~50 min / ~8 h (per scan point)

Input notationcat: fields are attributes of GalaxyCatalogue that must be set; the remaining items are function arguments. z = redshift; M = abs_mag; m = log10_mstar; w = weight; areasr = survey solid angle [sr]; maglim = apparent-magnitude limit; m★,lim = per-galaxy mass completeness limit [log₁₀(M/M)].

Two-point statistics (stable: WPRP, WTHETA, DeltaSigma)

Pair-counting assumes N_rand = 5 × N_gal. ΔΣ timing is source-catalogue dominated (≈ 35 M for HSC Y3, ≈ 100 M for DES Y3).

Statistic

Function

Backend · Reference

Required inputs

CPU time (200k / 1M / 3M gal)

Angular w(θ)

w_theta()

treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004)

cat: α, δ (gal + rand)

8 s / 40 s / ~2 min

Projected wp(rp)

wp()

Corrfunc DDrppi_mocks + LS; Davis & Peebles (1983), Sinha & Garrison (2020)

cat: α, δ, z (gal + rand); πmax, cosmo

~3 min / ~1 h / ~10 h

ESD ΔΣ(rp)

delta_sigma()

dsigma; Sheldon et al. (2004), Mandelbaum et al. (2005), Lange & Huang (2022)

cat: α, δ, z; shape cat, cosmo

hours (source-cat dominated)

Input notation — α, δ = ra, dec [deg]; z = redshift; shape cat = ShapeCatalogue (HSC Y3 or DES Y3).

Statistics in development

Warning

The following statistics are not yet production-ready. APIs and output formats may change without notice and results have not been validated against reference data. They are disabled by default in the pipeline scripts (COMPUTE_STATS["multipoles"] = False, COMPUTE_STATS["cl"] = False). Do not use them for scientific analysis. See Estimators in development for the detailed status and roadmap.

kNN CPU times assume pyfnntw (Rust parallel kd-tree) on 16 cores; query points are drawn from the randoms catalogue.

Statistic

Function

Backend · Reference

Required inputs

CPU time (200k / 1M / 3M gal)

LF φ(M) — 1/Vmax

luminosity_function_vmax()

JAX; Schmidt (1968)

cat: z, M, w; areasr, cosmo

< 1 s / 1 s / 3 s

LF φ(M) — SWML

luminosity_function_swml()

iterative; Efstathiou et al. (1988)

cat: z, M, w; maglim, areasr, cosmo

< 1 s / 1 s / ~3 s

LF Φ(<M) — C cumulative

cumulative_luminosity_function_cminus()

product O(N²); Lynden-Bell (1971)

cat: z, M, w; maglim, areasr, cosmo

40 s / ~15 min / ~2 h

3D ξ(r)

xi_r()

treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004)

cat: α, δ, z (gal + rand); cosmo

~8 s / ~40 s / ~2 min

Multipoles ξ(s)

xi_multipoles()

treecorr + JAX Legendre; Hamilton (1992)

cat: α, δ, z (gal + rand); cosmo

~8 s / ~40 s / ~2 min

Angular C

cl_angular()

healpy pseudo-C; Hivon et al. (2002), Górski et al. (2005)

cat: α, δ; nside, ℓ-bins

4 s / 4 s / 6 s

3D P(k)

pk3d(), pk_multipoles()

JAX FFT; Feldman et al. (1994)

cat: α, δ, z; cosmo

< 1 s / ~1 s / ~3 s

kNN-CDF Fk(r)

knn_cdf()

pyfnntw + JAX; Banerjee & Abel (2021)

cat: z, α, δ; rand; k, r_bins, cosmo

< 10 s / < 1 min / < 5 min

Cross-kNN FkA,B(r)

cross_knn_cdf()

pyfnntw + JAX; Banerjee & Abel (2021)

cat A & B: z, α, δ; rand; k, r_bins, cosmo

< 20 s / < 2 min / < 10 min

kNN volume map Vk(x)

knn_volume_map()

pyfnntw; Banerjee & Abel (2021)

cat: z, α, δ; query grid xyz; k, cosmo

depends on grid size

See k-Nearest-Neighbor Statistics — API for the kNN mathematical definition, HDF5 schema, and paper references.

JAX design

Selected estimators (1/Vmax sums, Legendre decomposition, power-spectrum post-processing) are implemented as JAX-native @jax.jit kernels, which enables:

  • Automatic differentiationjax.grad and jax.hessian work through JAX kernels, enabling Fisher matrices and MCMC gradients in downstream modelling.

  • Vectorisationjax.vmap is available for batching over parameter grids.

The heavy pair-counting backends (Corrfunc C++, treecorr C++, dsigma Cython) and the iterative estimators (SWML, C⁻) are NumPy/SciPy based and are called outside JAX boundaries.

Unit conventions

All outputs use physical (h-free) Mpc by default. This follows the astropy convention: comoving_distance() and comoving_volume() return Mpc and Mpc³ at face value with no implicit h factor.

Quantity

Unit

Notes

Comoving distances

Mpc

physical, not h⁻¹ Mpc

Comoving volumes (Vmax)

Mpc³

physical, not h⁻³ Mpc³

Luminosity function φ(M)

Mpc⁻³ mag⁻¹

physical, not h³ Mpc⁻³ mag⁻¹

Stellar mass function φ(M)

Mpc⁻³ dex⁻¹

physical, not h³ Mpc⁻³ dex⁻¹

ESD ΔΣ(rp)

M pc⁻²

h-independent

Angular 2PCF w(θ)

dimensionless

h-independent

3D 2PCF ξ(r), multipoles ξ(s), wp(rp)

Mpc (separation), dimensionless (amplitude)

physical Mpc

Wavenumber k (power spectrum only)

h/Mpc

sole exception: follows Fourier simulation convention

Power spectrum P(k)

(Mpc/h)³

sole exception; matches nbody/simulation convention

The value h = H₀/100 is stored in every HDF5 output file alongside H₀ under the cosmology/h dataset so that the power-spectrum axes can be read without arithmetic.

Script organisation

Production measurements use standalone Python scripts in scripts/. See Running the measurement scripts for commands.

measure_joint_sumstat.py is the unified entry point for all joint measurements.

Script

Surveys

Purpose

scripts/measure_joint_sumstat.py

bgs · mock · custom

Unified joint measurement: any combination of SMF, LF, WP, ESD_HSC, ESD_DES, ESD_KIDS, WTHETA, KNN; configurable bins; jackknife or bootstrap; crash-recovery checkpoints

scripts/plot_joint_sumstat.py

bgs · mock · custom

Diagnostic plots for the HDF5 output of measure_joint_sumstat.py

Output format

The unified script measure_joint_sumstat.py writes all active statistics into a single HDF5 file per sample × weight variant. Bin edges are fully configurable via --bins-* (see Bin specification). Weight variants produce separate files; the variant suffix is appended to the filename (e.g. …-sys-comb.h5; uniform has no suffix).

HDF5 schema (measure_joint_sumstat.py):

{sample_id}_joint_{stats_slug}[-{weight}].h5
├── smf/{sample_id}/
│   ├── cosmology/        H0, h, Om0, Ob0, Ok0
│   ├── bin_edges         [dex]
│   ├── log10mstar_centres  [dex]
│   ├── phi               [Mpc⁻³ dex⁻¹]
│   ├── phi_err           [Mpc⁻³ dex⁻¹]
│   └── cov               [Mpc⁻⁶ dex⁻²]
├── lf/{sample_id}/
│   ├── M_centres         [mag]
│   ├── phi               [Mpc⁻³ mag⁻¹]
│   ├── phi_err           [Mpc⁻³ mag⁻¹]
│   └── cov
├── twopcf/wp_{sample_id}/
│   ├── sep_centres       [Mpc]
│   ├── xi                [Mpc]
│   └── cov
├── twopcf/wtheta_{sample_id}/
│   ├── sep_centres       [arcmin]
│   ├── xi                [dimensionless]
│   └── cov
├── esd/{sample_id}_{HSC|DES|KIDS}/
│   ├── rp_centres        [Mpc]
│   ├── delta_sigma       [M_sun/pc²]
│   └── cov
├── knn/{sample_id}/
│   ├── r_centres         [Mpc]
│   ├── k_values          [dimensionless]
│   ├── F_k               [dimensionless]  shape (n_k, n_r)
│   └── F_k_poisson       [dimensionless]  Erlang reference
├── joint_covariance/
│   ├── data_vector       concatenated statistic vector
│   ├── err               sqrt(diag(cov))
│   ├── cov               full joint matrix
│   ├── subsamples        JK or bootstrap draws  [N_iters × N_bins]
│   └── attrs: statistics, slice_*, covariance_method
└── _checkpoint/
    └── subsamples        in-progress iterations (NaN = not yet done)

Every dataset carries unit and description attributes. The file-level git_hash attribute records the exact code revision. See Data format for the complete attribute table.

Bin specification

All --bins-* flags in measure_joint_sumstat.py accept one of three formats:

Format

Result

log:N:MIN:MAX

np.logspace(log10(MIN), log10(MAX), N+1) — N log-spaced bins

lin:N:MIN:MAX

np.linspace(MIN, MAX, N+1) — N linearly spaced bins

arange:START:STOP:STEP

np.arange(START, STOP, STEP) — uniform spacing

Examples:

--bins-rp      log:15:0.1:40        # 15 log bins, 0.1–40 Mpc
--bins-mstar   arange:10.5:12.6:0.1 # 0.1 dex steps from 10.5 to 12.5
--bins-theta   log:20:0.05:100      # 20 log bins, 0.05–100 arcmin
--bins-knn-r   log:10:1:30          # 10 log bins, 1–30 Mpc
--knn-k        1 2 3                # k = 1, 2, 3 neighbours

See also

A full reference list with annotated entries and ADS links is available at Bibliography.