.. _theory:

Estimators from discrete galaxy surveys
=======================================

This page explains how each summary statistic is estimated from a discrete
galaxy survey — moving from the continuous definition to the practical
weighted sum over a finite catalogue.
It is written for a reader familiar with calculus and basic probability theory
but new to galaxy surveys.

.. contents:: Contents
   :depth: 2
   :local:

----

The goal of a galaxy survey
----------------------------

A galaxy survey measures the properties (position, brightness, stellar mass, …)
of a large number of galaxies.  The raw catalogue is a list of objects; to learn
about cosmology we need to compress it into *summary statistics* — compact
quantities that retain the cosmological information while being tractable to
model and compare to theory.

This package computes three families of summary statistics:

* **One-point functions** (luminosity function, stellar mass function) — how many
  galaxies of each luminosity/mass exist per unit volume.
* **Two-point functions** (angular clustering, projected clustering, multipoles) —
  how galaxies are spatially correlated.
* **Lensing statistics** (excess surface density ΔΣ) — how galaxies distort
  background shapes through weak gravitational lensing.

----

.. _theory_selection_function:

The selection function and survey completeness
-----------------------------------------------

Every survey has a **flux limit**: objects fainter than some threshold
:math:`m_\mathrm{lim}` (in magnitudes) are not detected.  This means the
observed catalogue is not a fair sample of the universe — intrinsically faint
objects are under-represented at large distances.

**Apparent and absolute magnitudes.**
The *apparent* magnitude :math:`m` is what we observe; the *absolute* magnitude
:math:`M` is the luminosity of the object expressed as a magnitude at a standard
distance of 10 pc.  They are related by the distance modulus:

.. math::

   m = M + \mu(z) + K(z)

where :math:`\mu(z) = 5 \log_{10}[D_L(z) / 10\,\mathrm{pc}]` is the distance
modulus and :math:`K(z)` is the *K-correction* that converts between the rest-frame
band and the observed band.

**Maximum observable redshift.**
For a given apparent-magnitude limit :math:`m_\mathrm{lim}`, the maximum redshift
at which object :math:`i` with absolute magnitude :math:`M_i` would still be
included in the survey is :math:`z_{\max,i}` defined by

.. math::

   M_i + \mu(z_{\max,i}) + K(z_{\max,i}) = m_\mathrm{lim}.

An object at :math:`z > z_{\max,i}` would have :math:`m > m_\mathrm{lim}` and
would fall below the detection threshold.

**Completeness function.**
In practice the survey is also incomplete at bright magnitudes (saturation) and
varies across the sky due to dust extinction and imaging depth.  The
*incompleteness function* :math:`C(m)` gives the probability of detecting an
object of apparent magnitude :math:`m`.  All estimators below assume
:math:`C(m) = 1` for :math:`m < m_\mathrm{lim}` (a clean flux-limited sample),
but per-galaxy completeness weights :math:`w_i = 1/C(m_i)` can be passed to
correct for partial incompleteness.

----

.. _theory_vmax:

The 1/:math:`V_\mathrm{max}` estimator (Schmidt 1968)
-------------------------------------------------------

The 1/:math:`V_\mathrm{max}` estimator is the simplest non-parametric method for
measuring the luminosity function (LF) or stellar mass function (SMF).

**Maximum volume.**
Define the *maximum comoving volume* accessible to galaxy :math:`i` as

.. math::

   V_{\max,i} = \frac{\Omega}{4\pi}
   \left[ V_c\!\left(\min(z_{\max,i},\, z_{\max}^{\rm survey})\right)
          - V_c(z_{\min}^{\rm survey}) \right],

where :math:`\Omega` is the survey solid angle [sr] and
:math:`V_c(z) = \tfrac{4\pi}{3} \chi(z)^3` is the comoving volume out to
redshift :math:`z` (in a flat universe).

**Estimator.**
The LF (or SMF) in a bin of width :math:`\Delta M` centred on :math:`M_j` is
(Schmidt 1968; Johnston 2011 Eq. 51–53):

.. math::
   :label: vmax

   \hat\phi_j = \frac{1}{\Delta M} \sum_{i\,\in\,\mathrm{bin}_j}
   \frac{w_i}{V_{\max,i}},

where the sum runs over all galaxies whose observable :math:`M_i` falls in bin
:math:`j`, and :math:`w_i` is a completeness weight.  The associated
**Poisson uncertainty** is (Condon 1989):

.. math::

   \hat\sigma_j = \frac{1}{\Delta M}
   \sqrt{\sum_{i\,\in\,\mathrm{bin}_j} \left(\frac{w_i}{V_{\max,i}}\right)^2}.

**Why it works.**
Each galaxy contributes its "weight per unit volume" :math:`w_i/V_{\max,i}` to
the bin.  Galaxies that are visible only over a small volume (intrinsically faint
or at the survey edge) receive a large weight; bright nearby galaxies receive a
small weight.  The sum is an unbiased estimator of :math:`\phi` when the
assumption of *no density evolution* within the survey volume holds.

**Limitations.**
The estimator assumes the universe is homogeneous at the scales probed (no
large-scale clustering).  Large voids or walls inside the survey volume will bias
the estimate.  SWML (below) is immune to this.

In ``sum_stat``: :func:`~sum_stat.lf_smf.vmax.luminosity_function_vmax` and
:func:`~sum_stat.lf_smf.vmax.stellar_mass_function_vmax`.

----

.. _theory_swml:

The Step-Wise Maximum Likelihood estimator (SWML)
--------------------------------------------------

The SWML estimator (Efstathiou, Ellis & Peterson 1988; Johnston 2011 §5.5) avoids
the clustering bias of :math:`1/V_\mathrm{max}` by writing down the likelihood of
the observed catalogue and maximising it directly.

**Likelihood.**
Assume the LF is a step function with value :math:`\phi_k` in bin :math:`k`.
Given galaxy :math:`i` with absolute magnitude :math:`M_i`, its probability of
being in magnitude bin :math:`j` (rather than some other bin) is proportional to
:math:`\phi_j \Delta M_j \, H_{ij}`, where the *accessibility flag* is

.. math::

   H_{ij} = \begin{cases} 1 & \text{if bin } j \text{ is observable at redshift } z_i
             \quad (M_{\rm centre,j} < m_{\rm lim} - \mu(z_i)) \\
   0 & \text{otherwise.} \end{cases}

The normalisation-free log-likelihood is (Johnston 2011 Eq. 87):

.. math::

   \ln \mathcal{L} = \sum_j n_j \ln \phi_j
                   - \sum_i w_i \ln \!\left[ \sum_j \phi_j H_{ij} \right],

where :math:`n_j = \sum_{i \in \mathrm{bin}_j} w_i` is the weighted count per bin.

**Iterative solution.**
Setting :math:`\partial \ln \mathcal{L} / \partial \phi_k = 0` gives the
fixed-point iteration (Johnston 2011 Eq. 88):

.. math::
   :label: swml_iteration

   \phi_k^{(t+1)} = \frac{n_k}{\displaystyle
   \sum_i \frac{w_i H_{ik}}{\sum_j \phi_j^{(t)} H_{ij}}}.

This converges in ~10–20 iterations.  The result is the *shape* of the LF; the
absolute normalisation is fixed by matching the integral of :math:`\phi` to the
1/:math:`V_\mathrm{max}` total density.

**Advantages over** :math:`1/V_\mathrm{max}`.
Because :math:`H_{ij}` depends only on the *redshift* of galaxy :math:`i` (not its
position on the sky), the SWML estimator is insensitive to angular density
gradients and to large-scale structure within the survey.

In ``sum_stat``: :func:`~sum_stat.lf_smf.swml.luminosity_function_swml`.

----

.. _theory_cminus:

The :math:`C^-` estimator (Lynden-Bell 1971)
---------------------------------------------

The :math:`C^-` estimator (Lynden-Bell 1971; Choloniewski 1987) constructs the
*cumulative* LF :math:`\Phi(M) = \int_{-\infty}^{M} \phi(M')\,dM'` directly from
the data using a product recursion that requires no binning.

**Construction.**
Sort galaxies by absolute magnitude :math:`M_{(1)} \le M_{(2)} \le \ldots`.
For each galaxy :math:`i`, define the "associated set"
:math:`A_i = \{j : M_j \le M_i,\; z_j \le z_{\max,j}(M_i)\}` — all galaxies that
could have been seen at magnitude :math:`M_i`.  The cumulative LF is

.. math::

   \hat\Phi(M) = \prod_{M_{(k)} \le M}
   \left(1 - \frac{1}{n_k + 1}\right),

where :math:`n_k = |A_k|` is the size of the associated set at each step.
The differential LF follows from numerical differentiation:
:math:`\hat\phi(M) = -d\hat\Phi/dM`.

**Key properties.**
- Non-parametric, distribution-free.
- Handles arbitrary truncation (e.g. X-ray flux limits, colour cuts).
- No binning required; bin widths can introduce bias.

In ``sum_stat``:
:func:`~sum_stat.lf_smf.cminus.cumulative_luminosity_function_cminus`.

----

.. _theory_independence:

Independence tests: detecting luminosity evolution
---------------------------------------------------

In a flux-limited survey the observed magnitudes and redshifts are correlated
(at a given redshift, only the brightest objects are seen).  Before fitting a
LF one should test whether the intrinsic luminosities are *independent* of
redshift — i.e. whether there is luminosity evolution.

**Efron-Petrosian** :math:`\tau` **statistic** (Efron & Petrosian 1992).
For each galaxy :math:`i`, define the truncated set
:math:`T_i = \{j : j \ne i,\; z_j \le z_{\max,j}(M_i)\}`.
Let :math:`R_i` be the rank of :math:`z_i` within :math:`T_i \cup \{i\}`.
Then

.. math::

   \tau = \frac{\sum_i (R_i - E_i)}{\sqrt{\sum_i V_i}},

where :math:`E_i = (|T_i| + 1)/2` and :math:`V_i = (|T_i|^2 - 1)/12` are the
expected value and variance of :math:`R_i` under the null hypothesis of
independence.  Under the null, :math:`\tau \sim \mathcal{N}(0,1)`.
Values :math:`|\tau| > 2` indicate significant luminosity–redshift correlation.

In ``sum_stat``:
:func:`~sum_stat.lf_smf.independence.efron_petrosian_tau`.

----

.. _theory_schechter:

The Schechter luminosity/mass function
---------------------------------------

The empirical model most commonly used to describe galaxy LFs and SMFs is the
**Schechter function** (Schechter 1976):

.. math::
   :label: schechter

   \phi(L)\,dL = \phi^*
   \left(\frac{L}{L^*}\right)^\alpha
   \exp\!\left(-\frac{L}{L^*}\right)
   \frac{dL}{L^*},

which in magnitude units (:math:`x = 10^{0.4(M^* - M)}`) becomes:

.. math::

   \phi(M)\,dM = 0.4 \ln(10)\, \phi^*\, x^{\alpha+1} e^{-x}\, dM.

**Parameters.**

* :math:`\phi^*` [Mpc\ :sup:`-3` mag\ :sup:`-1`] — overall normalisation
  (number density of "characteristic" galaxies).
* :math:`M^*` [mag] — characteristic magnitude; the function turns over
  exponentially above this luminosity.
* :math:`\alpha` — faint-end slope.  :math:`\alpha < -1` means an
  ever-increasing number of faint galaxies; :math:`\alpha > -1` means the LF
  converges to a finite number density at faint magnitudes.

**Double Schechter SMF.**
The SMF of the full galaxy population is better described by a sum of two
Schechter components (red + blue sequences):

.. math::
   :label: double_schechter

   \phi(\log M)\,d\log M
   = \ln(10)\, e^{-x}
   \left[\phi_1^*\, x^{\alpha_1+1} + \phi_2^*\, x^{\alpha_2+1}\right] d\log M,

where :math:`x = 10^{\log M - \log M^*}`.  The total number density
integrated over all masses is :math:`\phi_1^* \Gamma(\alpha_1 + 2) + \phi_2^* \Gamma(\alpha_2 + 2)`.

In ``sum_stat``:
:func:`~sum_stat.lf_smf.zalesky25.schechter_mass` and
:func:`~sum_stat.lf_smf.zalesky25.double_schechter_mass`.

----

.. _theory_eddington:

Eddington bias
--------------

Stellar masses estimated from broad-band photometry have a typical random
uncertainty of :math:`\sigma_{\Delta \log M} \approx 0.1`–:math:`0.3` dex.  Because the
SMF rises steeply toward low masses (slope :math:`\alpha < -1`), these random
errors scatter more galaxies *up* from the low-mass side than *down* from the
high-mass side.  The net effect — called **Eddington bias** (Eddington 1913) —
artificially boosts the observed number density at masses above :math:`\log M^*`.

The smeared SMF is a convolution:

.. math::

   \phi_\mathrm{obs}(\log M) = \int \phi_\mathrm{true}(\log M')
   \mathcal{N}\!\left(\log M' - \log M,\, \sigma_{\Delta \log M}(z)\right)
   d\log M',

where the Gaussian kernel width :math:`\sigma_{\Delta \log M}(z)` increases with
redshift as photometric noise grows.  Correcting for Eddington bias requires
either deconvolving the observed SMF or forward-modelling the convolution when
fitting a model.

In ``sum_stat``:
:func:`~sum_stat.lf_smf.zalesky25.eddington_kernel` and
:func:`~sum_stat.lf_smf.zalesky25.convolve_smf_eddington`.

----

.. _theory_angular_twopcf:

Angular two-point correlation function :math:`w(\theta)`
---------------------------------------------------------

The **two-point correlation function** :math:`\xi(r)` measures the excess
probability (above Poisson) of finding two galaxies separated by :math:`r`
(Davis & Peebles 1983):

.. math::

   dP = \bar{n}^2 [1 + \xi(r)]\, dV_1\, dV_2.

On the sky, without redshift information, this projected into the *angular*
correlation function :math:`w(\theta)`.

**Landy-Szalay estimator** (Landy & Szalay 1993):

.. math::
   :label: landy_szalay

   \hat{w}(\theta) = \frac{DD - 2\,DR + RR}{RR},

where :math:`DD`, :math:`DR`, :math:`RR` are the normalised pair counts of
data–data, data–random, and random–random pairs in an angular separation bin
around :math:`\theta`.  The random catalogue must be drawn from the same
survey window as the data (same mask, same radial selection) and should be at
least :math:`10\times` larger to suppress shot noise.

**Davis-Peebles estimator** (Davis & Peebles 1983):

.. math::

   \hat{w}(\theta) = \frac{DD}{DR} - 1.

This is slightly biased for finite surveys but converges faster.

In ``sum_stat``:
:func:`~sum_stat.twopcf.angular.w_theta`.

----

.. _theory_projected_wp:

Projected two-point correlation function :math:`w_p(r_p)`
----------------------------------------------------------

In redshift space, peculiar velocities distort the line-of-sight separations of
galaxy pairs (redshift-space distortions, RSDs).  The **projected correlation
function** :math:`w_p(r_p)` integrates the 3-D correlation function along the
line of sight to remove this distortion:

.. math::
   :label: wp

   w_p(r_p) = 2 \int_0^{\pi_{\max}} \xi(r_p, \pi)\, d\pi,

where :math:`r_p` is the separation perpendicular to the line of sight and
:math:`\pi` is the separation parallel.  A typical integration depth is
:math:`\pi_\max = 100` Mpc.

**Relation to the projected number density.**
:math:`w_p(r_p)` is proportional to the galaxy–galaxy lensing signal and can be
related to the halo occupation distribution (HOD) model to constrain the
connection between galaxies and dark matter halos.

In ``sum_stat``:
:func:`~sum_stat.twopcf.projected.wp`.

----

.. _theory_multipoles:

Redshift-space multipoles :math:`\xi_\ell(s)`
-----------------------------------------------

Rather than projecting away the RSD signal, the **Legendre multipole**
decomposition captures it:

.. math::
   :label: multipoles

   \xi_\ell(s) = \frac{2\ell + 1}{2}
   \int_{-1}^{1} \xi(s, \mu)\, P_\ell(\mu)\, d\mu,

where :math:`s = \sqrt{r_p^2 + \pi^2}` is the redshift-space separation,
:math:`\mu = \pi / s` is the cosine of the angle with the line of sight, and
:math:`P_\ell` is a Legendre polynomial.

The **monopole** :math:`\xi_0(s)` gives the angle-averaged correlation function.
The **quadrupole** :math:`\xi_2(s)` encodes the anisotropy induced by peculiar
velocities and is sensitive to the growth rate of structure :math:`f\sigma_8`.
The **hexadecapole** :math:`\xi_4(s)` is a higher-order diagnostic.

In ``sum_stat``:
:func:`~sum_stat.twopcf.multipoles.twopcf_multipoles`.

----

.. _theory_jackknife:

Jackknife covariance
---------------------

Estimating the covariance matrix of a summary statistic from a single survey
realization requires either analytical modelling or resampling methods.  The
**jackknife** method is the most common resampling approach.

**Algorithm.**
Divide the survey footprint into :math:`N_\mathrm{jk}` spatial regions of roughly
equal area (e.g. using HEALPix).  Compute the statistic :math:`\mathbf{x}_k`
on the sub-survey formed by removing region :math:`k`.  The covariance is

.. math::
   :label: jackknife_cov

   C_{ij} = \frac{N_\mathrm{jk} - 1}{N_\mathrm{jk}}
   \sum_{k=1}^{N_\mathrm{jk}}
   (x_{k,i} - \bar{x}_i)(x_{k,j} - \bar{x}_j),

where :math:`\bar{x}_i = N_\mathrm{jk}^{-1} \sum_k x_{k,i}`.

**Typical scale.**
For a galaxy survey of :math:`\sim 1000` deg², use :math:`N_\mathrm{jk} = 100`
regions of :math:`\sim 10` deg² each.

**Limitations.**
The jackknife underestimates the covariance on scales larger than the jackknife
region size.  For the full survey, mocks or analytical covariance matrices are
more reliable at large scales.

In ``sum_stat``:
:func:`~sum_stat.covariance.jackknife.jackknife_cov_from_subsamples` and
:func:`~sum_stat.covariance.jackknife.assign_jackknife_regions`.

----

.. _theory_esd:

Excess surface density :math:`\Delta\Sigma(R)`
-----------------------------------------------

**Weak gravitational lensing** by galaxy clusters and individual galaxies
coherently distorts the shapes of background galaxies.  The **excess surface
density** (ESD) is the projected lensing signal:

.. math::
   :label: esd

   \Delta\Sigma(R) = \bar\Sigma(<R) - \Sigma(R),

where :math:`\Sigma(R)` is the projected mass density at projected radius
:math:`R` and :math:`\bar\Sigma(<R) = 2/R^2 \int_0^R \Sigma(R') R'\, dR'` is
the mean density within :math:`R`.

**Connection to observables.**
The tangential shear of background galaxies is
:math:`\gamma_t(R) = \Delta\Sigma(R) / \Sigma_\mathrm{crit}`, where the
**critical surface density** is

.. math::

   \Sigma_\mathrm{crit} = \frac{c^2}{4\pi G}
   \frac{D_s}{D_l D_{ls}},

with :math:`D_l`, :math:`D_s`, :math:`D_{ls}` the angular diameter distances
to the lens, source, and from lens to source respectively.

In ``sum_stat``:
:func:`~sum_stat.lensing.esd.delta_sigma` and
:func:`~sum_stat.cosmology.critical_surface_density_jax`.

----

References
----------

* Schmidt (1968), ApJ 151, 393 — 1/V\ :sub:`max` estimator
* Condon (1989), ApJ 338, 13 — Poisson error for 1/V\ :sub:`max`
* Lynden-Bell (1971), MNRAS 155, 95 — C\ :sup:`−` estimator
* Efstathiou, Ellis & Peterson (1988), MNRAS 232, 431 — SWML estimator
* Efron & Petrosian (1992), ApJ 399, 345 — τ independence test
* Schechter (1976), ApJ 203, 297 — Schechter function
* Eddington (1913), MNRAS 73, 359 — bias from scatter in stellar mass
* Landy & Szalay (1993), ApJ 412, 64 — LS angular estimator
* Davis & Peebles (1983), ApJ 267, 465 — projected correlation function
* Johnston (2011), arXiv:1106.2039 — unified review of LF estimators
* Hogg (1999), arXiv:astro-ph/9905116 — cosmological distance measures