Estimators from discrete galaxy surveys

This page explains how each summary statistic is estimated from a discrete galaxy survey — moving from the continuous definition to the practical weighted sum over a finite catalogue. It is written for a reader familiar with calculus and basic probability theory but new to galaxy surveys.

The goal of a galaxy survey 

A galaxy survey measures the properties (position, brightness, stellar mass, …) of a large number of galaxies. The raw catalogue is a list of objects; to learn about cosmology we need to compress it into summary statistics — compact quantities that retain the cosmological information while being tractable to model and compare to theory.

This package computes three families of summary statistics:

One-point functions (luminosity function, stellar mass function) — how many galaxies of each luminosity/mass exist per unit volume.
Two-point functions (angular clustering, projected clustering, multipoles) — how galaxies are spatially correlated.
Lensing statistics (excess surface density ΔΣ) — how galaxies distort background shapes through weak gravitational lensing.

The selection function and survey completeness 

Every survey has a flux limit: objects fainter than some threshold \(m_\mathrm{lim}\) (in magnitudes) are not detected. This means the observed catalogue is not a fair sample of the universe — intrinsically faint objects are under-represented at large distances.

Apparent and absolute magnitudes. The apparent magnitude \(m\) is what we observe; the absolute magnitude \(M\) is the luminosity of the object expressed as a magnitude at a standard distance of 10 pc. They are related by the distance modulus:

\[m = M + \mu(z) + K(z)\]

where \(\mu(z) = 5 \log_{10}[D_L(z) / 10\,\mathrm{pc}]\) is the distance modulus and \(K(z)\) is the K-correction that converts between the rest-frame band and the observed band.

Maximum observable redshift. For a given apparent-magnitude limit \(m_\mathrm{lim}\), the maximum redshift at which object \(i\) with absolute magnitude \(M_i\) would still be included in the survey is \(z_{\max,i}\) defined by

\[M_i + \mu(z_{\max,i}) + K(z_{\max,i}) = m_\mathrm{lim}.\]

An object at \(z > z_{\max,i}\) would have \(m > m_\mathrm{lim}\) and would fall below the detection threshold.

Completeness function. In practice the survey is also incomplete at bright magnitudes (saturation) and varies across the sky due to dust extinction and imaging depth. The incompleteness function \(C(m)\) gives the probability of detecting an object of apparent magnitude \(m\). All estimators below assume \(C(m) = 1\) for \(m < m_\mathrm{lim}\) (a clean flux-limited sample), but per-galaxy completeness weights \(w_i = 1/C(m_i)\) can be passed to correct for partial incompleteness.

The 1/\(V_\mathrm{max}\) estimator (Schmidt 1968)

The 1/\(V_\mathrm{max}\) estimator is the simplest non-parametric method for measuring the luminosity function (LF) or stellar mass function (SMF).

Maximum volume. Define the maximum comoving volume accessible to galaxy \(i\) as

\[V_{\max,i} = \frac{\Omega}{4\pi} \left[ V_c\!\left(\min(z_{\max,i},\, z_{\max}^{\rm survey})\right) - V_c(z_{\min}^{\rm survey}) \right],\]

where \(\Omega\) is the survey solid angle [sr] and \(V_c(z) = \tfrac{4\pi}{3} \chi(z)^3\) is the comoving volume out to redshift \(z\) (in a flat universe).

Estimator. The LF (or SMF) in a bin of width \(\Delta M\) centred on \(M_j\) is (Schmidt 1968; Johnston 2011 Eq. 51–53):

(1)\[\hat\phi_j = \frac{1}{\Delta M} \sum_{i\,\in\,\mathrm{bin}_j} \frac{w_i}{V_{\max,i}},\]

where the sum runs over all galaxies whose observable \(M_i\) falls in bin \(j\), and \(w_i\) is a completeness weight. The associated Poisson uncertainty is (Condon 1989):

\[\hat\sigma_j = \frac{1}{\Delta M} \sqrt{\sum_{i\,\in\,\mathrm{bin}_j} \left(\frac{w_i}{V_{\max,i}}\right)^2}.\]

Why it works. Each galaxy contributes its “weight per unit volume” \(w_i/V_{\max,i}\) to the bin. Galaxies that are visible only over a small volume (intrinsically faint or at the survey edge) receive a large weight; bright nearby galaxies receive a small weight. The sum is an unbiased estimator of \(\phi\) when the assumption of no density evolution within the survey volume holds.

Limitations. The estimator assumes the universe is homogeneous at the scales probed (no large-scale clustering). Large voids or walls inside the survey volume will bias the estimate. SWML (below) is immune to this.

In sum_stat: luminosity_function_vmax() and stellar_mass_function_vmax().

The Step-Wise Maximum Likelihood estimator (SWML)

The SWML estimator (Efstathiou, Ellis & Peterson 1988; Johnston 2011 §5.5) avoids the clustering bias of \(1/V_\mathrm{max}\) by writing down the likelihood of the observed catalogue and maximising it directly.

Likelihood. Assume the LF is a step function with value \(\phi_k\) in bin \(k\). Given galaxy \(i\) with absolute magnitude \(M_i\), its probability of being in magnitude bin \(j\) (rather than some other bin) is proportional to \(\phi_j \Delta M_j \, H_{ij}\), where the accessibility flag is

\[\begin{split}H_{ij} = \begin{cases} 1 & \text{if bin } j \text{ is observable at redshift } z_i \quad (M_{\rm centre,j} < m_{\rm lim} - \mu(z_i)) \\ 0 & \text{otherwise.} \end{cases}\end{split}\]

The normalisation-free log-likelihood is (Johnston 2011 Eq. 87):

\[\ln \mathcal{L} = \sum_j n_j \ln \phi_j - \sum_i w_i \ln \!\left[ \sum_j \phi_j H_{ij} \right],\]

where \(n_j = \sum_{i \in \mathrm{bin}_j} w_i\) is the weighted count per bin.

Iterative solution. Setting \(\partial \ln \mathcal{L} / \partial \phi_k = 0\) gives the fixed-point iteration (Johnston 2011 Eq. 88):

(2)\[\phi_k^{(t+1)} = \frac{n_k}{\displaystyle \sum_i \frac{w_i H_{ik}}{\sum_j \phi_j^{(t)} H_{ij}}}.\]

This converges in ~10–20 iterations. The result is the shape of the LF; the absolute normalisation is fixed by matching the integral of \(\phi\) to the 1/\(V_\mathrm{max}\) total density.

Advantages over \(1/V_\mathrm{max}\). Because \(H_{ij}\) depends only on the redshift of galaxy \(i\) (not its position on the sky), the SWML estimator is insensitive to angular density gradients and to large-scale structure within the survey.

In sum_stat: luminosity_function_swml().

The \(C^-\) estimator (Lynden-Bell 1971)

The \(C^-\) estimator (Lynden-Bell 1971; Choloniewski 1987) constructs the cumulative LF \(\Phi(M) = \int_{-\infty}^{M} \phi(M')\,dM'\) directly from the data using a product recursion that requires no binning.

Construction. Sort galaxies by absolute magnitude \(M_{(1)} \le M_{(2)} \le \ldots\). For each galaxy \(i\), define the “associated set” \(A_i = \{j : M_j \le M_i,\; z_j \le z_{\max,j}(M_i)\}\) — all galaxies that could have been seen at magnitude \(M_i\). The cumulative LF is

\[\hat\Phi(M) = \prod_{M_{(k)} \le M} \left(1 - \frac{1}{n_k + 1}\right),\]

where \(n_k = |A_k|\) is the size of the associated set at each step. The differential LF follows from numerical differentiation: \(\hat\phi(M) = -d\hat\Phi/dM\).

Key properties. - Non-parametric, distribution-free. - Handles arbitrary truncation (e.g. X-ray flux limits, colour cuts). - No binning required; bin widths can introduce bias.

In sum_stat: cumulative_luminosity_function_cminus().

Independence tests: detecting luminosity evolution 

In a flux-limited survey the observed magnitudes and redshifts are correlated (at a given redshift, only the brightest objects are seen). Before fitting a LF one should test whether the intrinsic luminosities are independent of redshift — i.e. whether there is luminosity evolution.

Efron-Petrosian \(\tau\) statistic (Efron & Petrosian 1992). For each galaxy \(i\), define the truncated set \(T_i = \{j : j \ne i,\; z_j \le z_{\max,j}(M_i)\}\). Let \(R_i\) be the rank of \(z_i\) within \(T_i \cup \{i\}\). Then

\[\tau = \frac{\sum_i (R_i - E_i)}{\sqrt{\sum_i V_i}},\]

where \(E_i = (|T_i| + 1)/2\) and \(V_i = (|T_i|^2 - 1)/12\) are the expected value and variance of \(R_i\) under the null hypothesis of independence. Under the null, \(\tau \sim \mathcal{N}(0,1)\). Values \(|\tau| > 2\) indicate significant luminosity–redshift correlation.

In sum_stat: efron_petrosian_tau().

The Schechter luminosity/mass function 

The empirical model most commonly used to describe galaxy LFs and SMFs is the Schechter function (Schechter 1976):

(3)\[\phi(L)\,dL = \phi^* \left(\frac{L}{L^*}\right)^\alpha \exp\!\left(-\frac{L}{L^*}\right) \frac{dL}{L^*},\]

which in magnitude units (\(x = 10^{0.4(M^* - M)}\)) becomes:

\[\phi(M)\,dM = 0.4 \ln(10)\, \phi^*\, x^{\alpha+1} e^{-x}\, dM.\]

Parameters.

\(\phi^*\) [Mpc^-3 mag^-1] — overall normalisation (number density of “characteristic” galaxies).
\(M^*\) [mag] — characteristic magnitude; the function turns over exponentially above this luminosity.
\(\alpha\) — faint-end slope. \(\alpha < -1\) means an ever-increasing number of faint galaxies; \(\alpha > -1\) means the LF converges to a finite number density at faint magnitudes.

Double Schechter SMF. The SMF of the full galaxy population is better described by a sum of two Schechter components (red + blue sequences):

(4)\[\phi(\log M)\,d\log M = \ln(10)\, e^{-x} \left[\phi_1^*\, x^{\alpha_1+1} + \phi_2^*\, x^{\alpha_2+1}\right] d\log M,\]

where \(x = 10^{\log M - \log M^*}\). The total number density integrated over all masses is \(\phi_1^* \Gamma(\alpha_1 + 2) + \phi_2^* \Gamma(\alpha_2 + 2)\).

In sum_stat: schechter_mass() and double_schechter_mass().

Eddington bias 

Stellar masses estimated from broad-band photometry have a typical random uncertainty of \(\sigma_{\Delta \log M} \approx 0.1\)–\(0.3\) dex. Because the SMF rises steeply toward low masses (slope \(\alpha < -1\)), these random errors scatter more galaxies up from the low-mass side than down from the high-mass side. The net effect — called Eddington bias (Eddington 1913) — artificially boosts the observed number density at masses above \(\log M^*\).

The smeared SMF is a convolution:

\[\phi_\mathrm{obs}(\log M) = \int \phi_\mathrm{true}(\log M') \mathcal{N}\!\left(\log M' - \log M,\, \sigma_{\Delta \log M}(z)\right) d\log M',\]

where the Gaussian kernel width \(\sigma_{\Delta \log M}(z)\) increases with redshift as photometric noise grows. Correcting for Eddington bias requires either deconvolving the observed SMF or forward-modelling the convolution when fitting a model.

In sum_stat: eddington_kernel() and convolve_smf_eddington().

Angular two-point correlation function \(w(\theta)\)

The two-point correlation function \(\xi(r)\) measures the excess probability (above Poisson) of finding two galaxies separated by \(r\) (Davis & Peebles 1983):

\[dP = \bar{n}^2 [1 + \xi(r)]\, dV_1\, dV_2.\]

On the sky, without redshift information, this projected into the angular correlation function \(w(\theta)\).

Landy-Szalay estimator (Landy & Szalay 1993):

(5)\[\hat{w}(\theta) = \frac{DD - 2\,DR + RR}{RR},\]

where \(DD\), \(DR\), \(RR\) are the normalised pair counts of data–data, data–random, and random–random pairs in an angular separation bin around \(\theta\). The random catalogue must be drawn from the same survey window as the data (same mask, same radial selection) and should be at least \(10\times\) larger to suppress shot noise.

Davis-Peebles estimator (Davis & Peebles 1983):

\[\hat{w}(\theta) = \frac{DD}{DR} - 1.\]

This is slightly biased for finite surveys but converges faster.

In sum_stat: w_theta().

Projected two-point correlation function \(w_p(r_p)\)

In redshift space, peculiar velocities distort the line-of-sight separations of galaxy pairs (redshift-space distortions, RSDs). The projected correlation function \(w_p(r_p)\) integrates the 3-D correlation function along the line of sight to remove this distortion:

(6)\[w_p(r_p) = 2 \int_0^{\pi_{\max}} \xi(r_p, \pi)\, d\pi,\]

where \(r_p\) is the separation perpendicular to the line of sight and \(\pi\) is the separation parallel. A typical integration depth is \(\pi_\max = 100\) Mpc.

Relation to the projected number density. \(w_p(r_p)\) is proportional to the galaxy–galaxy lensing signal and can be related to the halo occupation distribution (HOD) model to constrain the connection between galaxies and dark matter halos.

In sum_stat: wp().

Redshift-space multipoles \(\xi_\ell(s)\)

Rather than projecting away the RSD signal, the Legendre multipole decomposition captures it:

(7)\[\xi_\ell(s) = \frac{2\ell + 1}{2} \int_{-1}^{1} \xi(s, \mu)\, P_\ell(\mu)\, d\mu,\]

where \(s = \sqrt{r_p^2 + \pi^2}\) is the redshift-space separation, \(\mu = \pi / s\) is the cosine of the angle with the line of sight, and \(P_\ell\) is a Legendre polynomial.

The monopole \(\xi_0(s)\) gives the angle-averaged correlation function. The quadrupole \(\xi_2(s)\) encodes the anisotropy induced by peculiar velocities and is sensitive to the growth rate of structure \(f\sigma_8\). The hexadecapole \(\xi_4(s)\) is a higher-order diagnostic.

In sum_stat: twopcf_multipoles().

Jackknife covariance 

Estimating the covariance matrix of a summary statistic from a single survey realization requires either analytical modelling or resampling methods. The jackknife method is the most common resampling approach.

Algorithm. Divide the survey footprint into \(N_\mathrm{jk}\) spatial regions of roughly equal area (e.g. using HEALPix). Compute the statistic \(\mathbf{x}_k\) on the sub-survey formed by removing region \(k\). The covariance is

(8)\[C_{ij} = \frac{N_\mathrm{jk} - 1}{N_\mathrm{jk}} \sum_{k=1}^{N_\mathrm{jk}} (x_{k,i} - \bar{x}_i)(x_{k,j} - \bar{x}_j),\]

where \(\bar{x}_i = N_\mathrm{jk}^{-1} \sum_k x_{k,i}\).

Typical scale. For a galaxy survey of \(\sim 1000\) deg², use \(N_\mathrm{jk} = 100\) regions of \(\sim 10\) deg² each.

Limitations. The jackknife underestimates the covariance on scales larger than the jackknife region size. For the full survey, mocks or analytical covariance matrices are more reliable at large scales.

In sum_stat: jackknife_cov_from_subsamples() and assign_jackknife_regions().

Excess surface density \(\Delta\Sigma(R)\)

Weak gravitational lensing by galaxy clusters and individual galaxies coherently distorts the shapes of background galaxies. The excess surface density (ESD) is the projected lensing signal:

(9)\[\Delta\Sigma(R) = \bar\Sigma(<R) - \Sigma(R),\]

where \(\Sigma(R)\) is the projected mass density at projected radius \(R\) and \(\bar\Sigma(<R) = 2/R^2 \int_0^R \Sigma(R') R'\, dR'\) is the mean density within \(R\).

Connection to observables. The tangential shear of background galaxies is \(\gamma_t(R) = \Delta\Sigma(R) / \Sigma_\mathrm{crit}\), where the critical surface density is

\[\Sigma_\mathrm{crit} = \frac{c^2}{4\pi G} \frac{D_s}{D_l D_{ls}},\]

with \(D_l\), \(D_s\), \(D_{ls}\) the angular diameter distances to the lens, source, and from lens to source respectively.

In sum_stat: delta_sigma() and critical_surface_density_jax().

References 

Schmidt (1968), ApJ 151, 393 — 1/V_max estimator
Condon (1989), ApJ 338, 13 — Poisson error for 1/V_max
Lynden-Bell (1971), MNRAS 155, 95 — C⁻ estimator
Efstathiou, Ellis & Peterson (1988), MNRAS 232, 431 — SWML estimator
Efron & Petrosian (1992), ApJ 399, 345 — τ independence test
Schechter (1976), ApJ 203, 297 — Schechter function
Eddington (1913), MNRAS 73, 359 — bias from scatter in stellar mass
Landy & Szalay (1993), ApJ 412, 64 — LS angular estimator
Davis & Peebles (1983), ApJ 267, 465 — projected correlation function
Johnston (2011), arXiv:1106.2039 — unified review of LF estimators
Hogg (1999), arXiv:astro-ph/9905116 — cosmological distance measures