Overview
sum_stat measures summary statistics from galaxy and weak-lensing catalogues: luminosity function, stellar mass function, completeness curves, clustering and galaxy-galaxy lensing.
It leverages Corrfunc and
dsigma for clustering and lensing
summary statistics, respectively.
It writes self-describing HDF5 files, and is consumed by the companion gga_model package for parametric fitting.
Galaxy-sample weights (from imaging systematics) produced by the companion
sys_mapping package can be passed.
Estimator maturity
The table below shows which estimators are production-ready and which are still in active development. Only the four stable estimators should be used for scientific analysis; the others may change without notice.
Estimator |
Module |
Status |
Notes |
|---|---|---|---|
SMF φ(M★) |
|
Stable |
Validated against COSMOS, GAMA, and Uchuu mocks |
wp(rp) |
|
Stable |
Validated against Uchuu mocks and BGS |
w(θ) |
|
Stable |
Validated against Uchuu mocks and BGS |
ΔΣ(rp) |
|
Stable |
Validated on HSC Y3 / DES Y3 / KiDS sources |
LF φ(M) |
|
Development |
Same estimators as SMF; not yet validated against reference datasets |
ξ(r), ξℓ(s) |
|
Development |
API and output format may change |
Cℓ, Pℓ(k) |
|
Development |
Not yet validated against literature |
kNN-CDF |
|
Development |
Not yet validated against simulations |
n(z) |
Development |
Utility; no validation suite yet |
Summary statistics
A summary statistic compresses a galaxy survey into a low-dimensional quantity that retains the cosmological information relevant to a given test. The page Estimators from discrete galaxy surveys describes how each estimator functions.
One-point statistics — density functions (stable: SMF)
One-point function estimators are in sum_stat.lf_smf.
The stellar mass function (SMF) estimators below are production-ready. The luminosity function (LF) estimators use the same code but are listed separately under Statistics in development because they lack a completed validation suite.
Statistic |
Function |
Backend · Reference |
Required inputs |
CPU time (200k / 1M / 3M gal) |
|---|---|---|---|---|
SMF φ(M★) — 1/Vmax |
|
JAX; Schmidt (1968) |
cat: z, m★, w; areasr, cosmo |
< 1 s / 1 s / 3 s |
SMF φ(M★) — SWML |
iterative; Efstathiou et al. (1988) |
cat: z, m★, w; m★,lim, areasr, cosmo |
< 1 s / 1 s / ~3 s |
|
SMF Φ(>M★) — C− cumulative |
product O(N²); Lynden-Bell (1971) |
cat: z, m★, w; m★,lim, areasr, cosmo |
40 s / ~15 min / ~2 h |
|
M★–z independence τ |
rank O(N²); Efron & Petrosian (1992) |
cat: z, m★; m★,lim, cosmo |
~2 min / ~50 min / ~8 h |
|
M★–z independence Tc/Tv |
rank O(N²); Rauzy (2001) |
cat: z, m★; m★,lim, cosmo |
~2 min / ~50 min / ~8 h |
|
Completeness scan Tc(m★), Tv(m★) |
|
rank O(N² × n_test); Johnston+ (2007), Teodoro+ (2010), Johnston+ (2012) |
mapp, z, m★; maglim, cosmo; optional bright limit & S/N target |
~2 min / ~50 min / ~8 h (per scan point) |
Input notation — cat: fields are attributes of
GalaxyCatalogue that must be set; the remaining items are
function arguments. z = redshift; M = abs_mag; m★ =
log10_mstar; w = weight; areasr = survey solid angle [sr];
maglim = apparent-magnitude limit; m★,lim = per-galaxy
mass completeness limit [log₁₀(M★/M☉)].
Two-point statistics (stable: WPRP, WTHETA, DeltaSigma)
Pair-counting assumes N_rand = 5 × N_gal. ΔΣ timing is source-catalogue dominated (≈ 35 M for HSC Y3, ≈ 100 M for DES Y3).
Statistic |
Function |
Backend · Reference |
Required inputs |
CPU time (200k / 1M / 3M gal) |
|---|---|---|---|---|
Angular w(θ) |
|
treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004) |
cat: α, δ (gal + rand) |
8 s / 40 s / ~2 min |
Projected wp(rp) |
|
Corrfunc |
cat: α, δ, z (gal + rand); πmax, cosmo |
~3 min / ~1 h / ~10 h |
ESD ΔΣ(rp) |
|
dsigma; Sheldon et al. (2004), Mandelbaum et al. (2005), Lange & Huang (2022) |
cat: α, δ, z; shape cat, cosmo |
hours (source-cat dominated) |
Input notation — α, δ = ra, dec [deg]; z = redshift;
shape cat = ShapeCatalogue (HSC Y3 or DES Y3).
Statistics in development
Warning
The following statistics are not yet production-ready. APIs and output
formats may change without notice and results have not been validated against
reference data. They are disabled by default in the pipeline scripts
(COMPUTE_STATS["multipoles"] = False, COMPUTE_STATS["cl"] = False).
Do not use them for scientific analysis. See Estimators in development
for the detailed status and roadmap.
kNN CPU times assume pyfnntw (Rust parallel kd-tree) on 16 cores; query points are drawn from the randoms catalogue.
Statistic |
Function |
Backend · Reference |
Required inputs |
CPU time (200k / 1M / 3M gal) |
|---|---|---|---|---|
LF φ(M) — 1/Vmax |
|
JAX; Schmidt (1968) |
cat: z, M, w; areasr, cosmo |
< 1 s / 1 s / 3 s |
LF φ(M) — SWML |
iterative; Efstathiou et al. (1988) |
cat: z, M, w; maglim, areasr, cosmo |
< 1 s / 1 s / ~3 s |
|
LF Φ(<M) — C− cumulative |
product O(N²); Lynden-Bell (1971) |
cat: z, M, w; maglim, areasr, cosmo |
40 s / ~15 min / ~2 h |
|
3D ξ(r) |
|
treecorr LS; Landy & Szalay (1993), Jarvis et al. (2004) |
cat: α, δ, z (gal + rand); cosmo |
~8 s / ~40 s / ~2 min |
Multipoles ξℓ(s) |
|
treecorr + JAX Legendre; Hamilton (1992) |
cat: α, δ, z (gal + rand); cosmo |
~8 s / ~40 s / ~2 min |
Angular Cℓ |
|
healpy pseudo-Cℓ; Hivon et al. (2002), Górski et al. (2005) |
cat: α, δ; nside, ℓ-bins |
4 s / 4 s / 6 s |
3D Pℓ(k) |
|
JAX FFT; Feldman et al. (1994) |
cat: α, δ, z; cosmo |
< 1 s / ~1 s / ~3 s |
kNN-CDF Fk(r) |
pyfnntw + JAX; Banerjee & Abel (2021) |
cat: z, α, δ; rand; k, r_bins, cosmo |
< 10 s / < 1 min / < 5 min |
|
Cross-kNN FkA,B(r) |
pyfnntw + JAX; Banerjee & Abel (2021) |
cat A & B: z, α, δ; rand; k, r_bins, cosmo |
< 20 s / < 2 min / < 10 min |
|
kNN volume map Vk(x) |
pyfnntw; Banerjee & Abel (2021) |
cat: z, α, δ; query grid xyz; k, cosmo |
depends on grid size |
See k-Nearest-Neighbor Statistics — API for the kNN mathematical definition, HDF5 schema, and paper references.
JAX design
Selected estimators (1/Vmax sums, Legendre decomposition, power-spectrum
post-processing) are implemented as JAX-native @jax.jit kernels, which
enables:
Automatic differentiation —
jax.gradandjax.hessianwork through JAX kernels, enabling Fisher matrices and MCMC gradients in downstream modelling.Vectorisation —
jax.vmapis available for batching over parameter grids.
The heavy pair-counting backends (Corrfunc C++, treecorr C++, dsigma Cython) and the iterative estimators (SWML, C⁻) are NumPy/SciPy based and are called outside JAX boundaries.
Unit conventions
All outputs use physical (h-free) Mpc by default. This follows the astropy
convention: comoving_distance() and
comoving_volume() return Mpc and Mpc³ at face
value with no implicit h factor.
Quantity |
Unit |
Notes |
|---|---|---|
Comoving distances |
Mpc |
physical, not h⁻¹ Mpc |
Comoving volumes (Vmax) |
Mpc³ |
physical, not h⁻³ Mpc³ |
Luminosity function φ(M) |
Mpc⁻³ mag⁻¹ |
physical, not h³ Mpc⁻³ mag⁻¹ |
Stellar mass function φ(M★) |
Mpc⁻³ dex⁻¹ |
physical, not h³ Mpc⁻³ dex⁻¹ |
ESD ΔΣ(rp) |
M☉ pc⁻² |
h-independent |
Angular 2PCF w(θ) |
dimensionless |
h-independent |
3D 2PCF ξ(r), multipoles ξℓ(s), wp(rp) |
Mpc (separation), dimensionless (amplitude) |
physical Mpc |
Wavenumber k (power spectrum only) |
h/Mpc |
sole exception: follows Fourier simulation convention |
Power spectrum P(k) |
(Mpc/h)³ |
sole exception; matches nbody/simulation convention |
The value h = H₀/100 is stored in every HDF5 output file alongside H₀ under
the cosmology/h dataset so that the power-spectrum axes can be read
without arithmetic.
Script organisation
Production measurements use standalone Python scripts in scripts/.
See Running the measurement scripts for commands.
measure_joint_sumstat.py is the unified entry point for all joint
measurements.
Script |
Surveys |
Purpose |
|---|---|---|
|
bgs · mock · custom |
Unified joint measurement: any combination of SMF, LF, WP, ESD_HSC, ESD_DES, ESD_KIDS, WTHETA, KNN; configurable bins; jackknife or bootstrap; crash-recovery checkpoints |
|
bgs · mock · custom |
Diagnostic plots for the HDF5 output of |
Output format
The unified script measure_joint_sumstat.py writes all active statistics
into a single HDF5 file per sample × weight variant. Bin edges are fully
configurable via --bins-* (see Bin specification). Weight variants produce
separate files; the variant suffix is appended to the filename
(e.g. …-sys-comb.h5; uniform has no suffix).
HDF5 schema (measure_joint_sumstat.py):
{sample_id}_joint_{stats_slug}[-{weight}].h5
├── smf/{sample_id}/
│ ├── cosmology/ H0, h, Om0, Ob0, Ok0
│ ├── bin_edges [dex]
│ ├── log10mstar_centres [dex]
│ ├── phi [Mpc⁻³ dex⁻¹]
│ ├── phi_err [Mpc⁻³ dex⁻¹]
│ └── cov [Mpc⁻⁶ dex⁻²]
├── lf/{sample_id}/
│ ├── M_centres [mag]
│ ├── phi [Mpc⁻³ mag⁻¹]
│ ├── phi_err [Mpc⁻³ mag⁻¹]
│ └── cov
├── twopcf/wp_{sample_id}/
│ ├── sep_centres [Mpc]
│ ├── xi [Mpc]
│ └── cov
├── twopcf/wtheta_{sample_id}/
│ ├── sep_centres [arcmin]
│ ├── xi [dimensionless]
│ └── cov
├── esd/{sample_id}_{HSC|DES|KIDS}/
│ ├── rp_centres [Mpc]
│ ├── delta_sigma [M_sun/pc²]
│ └── cov
├── knn/{sample_id}/
│ ├── r_centres [Mpc]
│ ├── k_values [dimensionless]
│ ├── F_k [dimensionless] shape (n_k, n_r)
│ └── F_k_poisson [dimensionless] Erlang reference
├── joint_covariance/
│ ├── data_vector concatenated statistic vector
│ ├── err sqrt(diag(cov))
│ ├── cov full joint matrix
│ ├── subsamples JK or bootstrap draws [N_iters × N_bins]
│ └── attrs: statistics, slice_*, covariance_method
└── _checkpoint/
└── subsamples in-progress iterations (NaN = not yet done)
Every dataset carries unit and description attributes.
The file-level git_hash attribute records the exact code revision.
See Data format for the complete attribute table.
Bin specification
All --bins-* flags in measure_joint_sumstat.py accept one of three
formats:
Format |
Result |
|---|---|
|
|
|
|
|
|
Examples:
--bins-rp log:15:0.1:40 # 15 log bins, 0.1–40 Mpc
--bins-mstar arange:10.5:12.6:0.1 # 0.1 dex steps from 10.5 to 12.5
--bins-theta log:20:0.05:100 # 20 log bins, 0.05–100 arcmin
--bins-knn-r log:10:1:30 # 10 log bins, 1–30 Mpc
--knn-k 1 2 3 # k = 1, 2, 3 neighbours
See also
A full reference list with annotated entries and ADS links is available at Bibliography.