Overview

sum_stat measures summary statistics from galaxy and weak-lensing catalogues: luminosity function, stellar mass function, completeness curves, clustering and galaxy-galaxy lensing. It leverages Corrfunc and dsigma for clustering and lensing summary statistics, respectively. It writes self-describing HDF5 files, and is consumed by the companion gga_model package for parametric fitting. Galaxy-sample weights (from imaging systematics) produced by the companion sys_mapping package can be passed.

Estimator maturity 

The table below shows which estimators are production-ready and which are still in active development. Only the four stable estimators should be used for scientific analysis; the others may change without notice.

Estimator	Module	Status	Notes
SMF φ(M★)	`sum_stat.lf_smf`	Stable	Validated against COSMOS, GAMA, and Uchuu mocks
w_p(r_p)	`sum_stat.twopcf`	Stable	Validated against Uchuu mocks and BGS
w(θ)	`sum_stat.twopcf`	Stable	Validated against Uchuu mocks and BGS
ΔΣ(r_p)	`sum_stat.lensing`	Stable	Validated on HSC Y3 / DES Y3 / KiDS sources
LF φ(M)	`sum_stat.lf_smf`	Development	Same estimators as SMF; not yet validated against reference datasets
ξ(r), ξ_ℓ(s)	`sum_stat.twopcf`	Development	API and output format may change
C_ℓ, P_ℓ(k)	`sum_stat.powspec`	Development	Not yet validated against literature
kNN-CDF	`sum_stat.knn`	Development	Not yet validated against simulations
n(z)	`sum_stat.nz`	Development	Utility; no validation suite yet

Summary statistics 

A summary statistic compresses a galaxy survey into a low-dimensional quantity that retains the cosmological information relevant to a given test. The page Estimators from discrete galaxy surveys describes how each estimator functions.

One-point statistics — density functions (stable: SMF)

One-point function estimators are in sum_stat.lf_smf.

The stellar mass function (SMF) estimators below are production-ready. The luminosity function (LF) estimators use the same code but are listed separately under Statistics in development because they lack a completed validation suite.

Statistic	Function	Backend · Reference	Required inputs	CPU time (200k / 1M / 3M gal)
SMF φ(M_★) — 1/Vmax	`stellar_mass_function_vmax()`	JAX; Schmidt (1968)	cat: z, m_★, w; area_sr, cosmo	< 1 s / 1 s / 3 s
SMF φ(M_★) — SWML	`stellar_mass_function_swml()`	iterative; Efstathiou et al. (1988)	cat: z, m_★, w; m_★,lim, area_sr, cosmo	< 1 s / 1 s / ~3 s
SMF Φ(>M_★) — C⁻ cumulative	`cumulative_stellar_mass_function_cminus()`	product O(N²); Lynden-Bell (1971)	cat: z, m_★, w; m_★,lim, area_sr, cosmo	40 s / ~15 min / ~2 h
M★–z independence τ	`efron_petrosian_tau()`	rank O(N²); Efron & Petrosian (1992)	cat: z, m_★; m_★,lim, cosmo	~2 min / ~50 min / ~8 h
M★–z independence T_c/T_v	`rauzy_completeness()`	rank O(N²); Rauzy (2001)	cat: z, m_★; m_★,lim, cosmo	~2 min / ~50 min / ~8 h
Completeness scan T_c(m★), T_v(m★)	`completeness_scan()`	rank O(N² × n_test); Johnston+ (2007), Teodoro+ (2010), Johnston+ (2012)	m_app, z, m_★; mag_lim, cosmo; optional bright limit & S/N target	~2 min / ~50 min / ~8 h (per scan point)

Input notation — cat: fields are attributes of GalaxyCatalogue that must be set; the remaining items are function arguments. z = redshift; M = abs_mag; m_★ = log10_mstar; w = weight; area_sr = survey solid angle [sr]; mag_lim = apparent-magnitude limit; m_★,lim = per-galaxy mass completeness limit [log₁₀(M_★/M_☉)].

Two-point statistics (stable: WPRP, WTHETA, DeltaSigma)

Pair-counting assumes N_rand = 5 × N_gal. ΔΣ timing is source-catalogue dominated (≈ 35 M for HSC Y3, ≈ 100 M for DES Y3).

Statistic	Function	Backend · Reference	Required inputs	CPU time (200k / 1M / 3M gal)
Angular w(θ)	`w_theta()`	treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004)	cat: α, δ (gal + rand)	8 s / 40 s / ~2 min
Projected w_p(r_p)	`wp()`	Corrfunc `DDrppi_mocks` + LS; Davis & Peebles (1983), Sinha & Garrison (2020)	cat: α, δ, z (gal + rand); π_max, cosmo	~3 min / ~1 h / ~10 h
ESD ΔΣ(r_p)	`delta_sigma()`	dsigma; Sheldon et al. (2004), Mandelbaum et al. (2005), Lange & Huang (2022)	cat: α, δ, z; shape cat, cosmo	hours (source-cat dominated)

Input notation — α, δ = ra, dec [deg]; z = redshift; shape cat = ShapeCatalogue (HSC Y3 or DES Y3).

Statistics in development 

Warning

The following statistics are not yet production-ready. APIs and output formats may change without notice and results have not been validated against reference data. They are disabled by default in the pipeline scripts (COMPUTE_STATS["multipoles"] = False, COMPUTE_STATS["cl"] = False). Do not use them for scientific analysis. See Estimators in development for the detailed status and roadmap.

kNN CPU times assume pyfnntw (Rust parallel kd-tree) on 16 cores; query points are drawn from the randoms catalogue.

Statistic	Function	Backend · Reference	Required inputs	CPU time (200k / 1M / 3M gal)
LF φ(M) — 1/Vmax	`luminosity_function_vmax()`	JAX; Schmidt (1968)	cat: z, M, w; area_sr, cosmo	< 1 s / 1 s / 3 s
LF φ(M) — SWML	`luminosity_function_swml()`	iterative; Efstathiou et al. (1988)	cat: z, M, w; mag_lim, area_sr, cosmo	< 1 s / 1 s / ~3 s
LF Φ(<M) — C⁻ cumulative	`cumulative_luminosity_function_cminus()`	product O(N²); Lynden-Bell (1971)	cat: z, M, w; mag_lim, area_sr, cosmo	40 s / ~15 min / ~2 h
3D ξ(r)	`xi_r()`	treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004)	cat: α, δ, z (gal + rand); cosmo	~8 s / ~40 s / ~2 min
Multipoles ξ_ℓ(s)	`xi_multipoles()`	treecorr + JAX Legendre; Hamilton (1992)	cat: α, δ, z (gal + rand); cosmo	~8 s / ~40 s / ~2 min
Angular C_ℓ	`cl_angular()`	healpy pseudo-C_ℓ; Hivon et al. (2002), Górski et al. (2005)	cat: α, δ; nside, ℓ-bins	4 s / 4 s / 6 s
3D P_ℓ(k)	`pk3d()`, `pk_multipoles()`	JAX FFT; Feldman et al. (1994)	cat: α, δ, z; cosmo	< 1 s / ~1 s / ~3 s
kNN-CDF F_k(r)	`knn_cdf()`	pyfnntw + JAX; Banerjee & Abel (2021)	cat: z, α, δ; rand; k, r_bins, cosmo	< 10 s / < 1 min / < 5 min
Cross-kNN F_k^A,B(r)	`cross_knn_cdf()`	pyfnntw + JAX; Banerjee & Abel (2021)	cat A & B: z, α, δ; rand; k, r_bins, cosmo	< 20 s / < 2 min / < 10 min
kNN volume map V_k(x)	`knn_volume_map()`	pyfnntw; Banerjee & Abel (2021)	cat: z, α, δ; query grid xyz; k, cosmo	depends on grid size

See k-Nearest-Neighbor Statistics — API for the kNN mathematical definition, HDF5 schema, and paper references.

JAX design 

Selected estimators (1/Vmax sums, Legendre decomposition, power-spectrum post-processing) are implemented as JAX-native @jax.jit kernels, which enables:

Automatic differentiation — jax.grad and jax.hessian work through JAX kernels, enabling Fisher matrices and MCMC gradients in downstream modelling.
Vectorisation — jax.vmap is available for batching over parameter grids.

The heavy pair-counting backends (Corrfunc C++, treecorr C++, dsigma Cython) and the iterative estimators (SWML, C⁻) are NumPy/SciPy based and are called outside JAX boundaries.

Unit conventions 

All outputs use physical (h-free) Mpc by default. This follows the astropy convention: comoving_distance() and comoving_volume() return Mpc and Mpc³ at face value with no implicit h factor.

Quantity	Unit	Notes
Comoving distances	Mpc	physical, not h⁻¹ Mpc
Comoving volumes (V_max)	Mpc³	physical, not h⁻³ Mpc³
Luminosity function φ(M)	Mpc⁻³ mag⁻¹	physical, not h³ Mpc⁻³ mag⁻¹
Stellar mass function φ(M_★)	Mpc⁻³ dex⁻¹	physical, not h³ Mpc⁻³ dex⁻¹
ESD ΔΣ(r_p)	M_☉ pc⁻²	h-independent
Angular 2PCF w(θ)	dimensionless	h-independent
3D 2PCF ξ(r), multipoles ξ_ℓ(s), w_p(r_p)	Mpc (separation), dimensionless (amplitude)	physical Mpc
Wavenumber k (power spectrum only)	h/Mpc	sole exception: follows Fourier simulation convention
Power spectrum P(k)	(Mpc/h)³	sole exception; matches nbody/simulation convention

The value h = H₀/100 is stored in every HDF5 output file alongside H₀ under the cosmology/h dataset so that the power-spectrum axes can be read without arithmetic.

Script organisation 

Production measurements use standalone Python scripts in scripts/. See Running the measurement scripts for commands.

measure_joint_sumstat.py is the unified entry point for all joint measurements.

Script	Surveys	Purpose
`scripts/measure_joint_sumstat.py`	bgs · mock · custom	Unified joint measurement: any combination of SMF, LF, WP, ESD_HSC, ESD_DES, ESD_KIDS, WTHETA, KNN; configurable bins; jackknife or bootstrap; crash-recovery checkpoints
`scripts/plot_joint_sumstat.py`	bgs · mock · custom	Diagnostic plots for the HDF5 output of `measure_joint_sumstat.py`

Output format 

The unified script measure_joint_sumstat.py writes all active statistics into a single HDF5 file per sample × weight variant. Bin edges are fully configurable via --bins-* (see Bin specification). Weight variants produce separate files; the variant suffix is appended to the filename (e.g. …-sys-comb.h5; uniform has no suffix).

HDF5 schema (measure_joint_sumstat.py):

{sample_id}_joint_{stats_slug}[-{weight}].h5
├── smf/{sample_id}/
│   ├── cosmology/        H0, h, Om0, Ob0, Ok0
│   ├── bin_edges         [dex]
│   ├── log10mstar_centres  [dex]
│   ├── phi               [Mpc⁻³ dex⁻¹]
│   ├── phi_err           [Mpc⁻³ dex⁻¹]
│   └── cov               [Mpc⁻⁶ dex⁻²]
├── lf/{sample_id}/
│   ├── M_centres         [mag]
│   ├── phi               [Mpc⁻³ mag⁻¹]
│   ├── phi_err           [Mpc⁻³ mag⁻¹]
│   └── cov
├── twopcf/wp_{sample_id}/
│   ├── sep_centres       [Mpc]
│   ├── xi                [Mpc]
│   └── cov
├── twopcf/wtheta_{sample_id}/
│   ├── sep_centres       [arcmin]
│   ├── xi                [dimensionless]
│   └── cov
├── esd/{sample_id}_{HSC|DES|KIDS}/
│   ├── rp_centres        [Mpc]
│   ├── delta_sigma       [M_sun/pc²]
│   └── cov
├── knn/{sample_id}/
│   ├── r_centres         [Mpc]
│   ├── k_values          [dimensionless]
│   ├── F_k               [dimensionless]  shape (n_k, n_r)
│   └── F_k_poisson       [dimensionless]  Erlang reference
├── joint_covariance/
│   ├── data_vector       concatenated statistic vector
│   ├── err               sqrt(diag(cov))
│   ├── cov               full joint matrix
│   ├── subsamples        JK or bootstrap draws  [N_iters × N_bins]
│   └── attrs: statistics, slice_*, covariance_method
└── _checkpoint/
    └── subsamples        in-progress iterations (NaN = not yet done)

Every dataset carries unit and description attributes. The file-level git_hash attribute records the exact code revision. See Data format for the complete attribute table.

Bin specification 

All --bins-* flags in measure_joint_sumstat.py accept one of three formats:

Format	Result
`log:N:MIN:MAX`	`np.logspace(log10(MIN), log10(MAX), N+1)` — N log-spaced bins
`lin:N:MIN:MAX`	`np.linspace(MIN, MAX, N+1)` — N linearly spaced bins
`arange:START:STOP:STEP`	`np.arange(START, STOP, STEP)` — uniform spacing

Examples:

--bins-rp      log:15:0.1:40        # 15 log bins, 0.1–40 Mpc
--bins-mstar   arange:10.5:12.6:0.1 # 0.1 dex steps from 10.5 to 12.5
--bins-theta   log:20:0.05:100      # 20 log bins, 0.05–100 arcmin
--bins-knn-r   log:10:1:30          # 10 log bins, 1–30 Mpc
--knn-k        1 2 3                # k = 1, 2, 3 neighbours