Introduction for students
This page is written for a reader with a solid background in physics who is new to observational cosmology. It explains what this package does and why each measurement matters, before you dive into the technical details elsewhere in the documentation. Experts may want to skip ahead to the API reference.
What is a galaxy survey?
A galaxy survey is a systematic census of the universe — a telescope photographs the sky and/or disperses the light of each target into a spectrum, and software detects hundreds of millions of galaxies and stars. It is important to separate what is directly observed from what is physically inferred.
Directly measured quantities
Sky position — right ascension (RA) and declination (Dec), the astronomical equivalent of longitude and latitude, measured from the centroid of the detected light distribution.
Apparent flux in photometric bands — the number of photons detected per second through each broad-band filter (e.g. u, g, r, i, z, Y in the optical/near-IR). This is the fundamental output of an imaging survey; see astronomical photometry. The ratio of fluxes in two bands is called a colour and encodes information about the galaxy’s temperature and stellar population.
Spectra — the wavelength-resolved flux, obtained by dispersing the light through a spectrograph. Spectral lines — the fingerprints of specific atomic or molecular transitions — appear at well-known rest-frame wavelengths and are Doppler-shifted by the motion of the source and cosmologically redshifted due to the expansion of the universe.
Galaxy shape (ellipticity) — from high-resolution images, the projected shape of each galaxy can be measured. Shape catalogues are the raw input for weak gravitational lensing analyses.
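To make the flux-to-colour relation in the list above concrete, here is a minimal sketch. The function name and the flux values are illustrative assumptions and are not part of sum_stat. The only physics used is the standard magnitude definition, under which a colour is -2.5 times the base-10 logarithm of a flux ratio.

```python
import math


def magnitude_difference(flux_a, flux_b):
    """Colour (m_a - m_b) in magnitudes from two fluxes in the same units."""
    if flux_a <= 0 or flux_b <= 0:
        raise ValueError("fluxes must be positive")
    return -2.5 * math.log10(flux_a / flux_b)


# Hypothetical galaxy that is 2.5 times brighter in r than in g
g_flux, r_flux = 4.0, 10.0
g_minus_r = magnitude_difference(g_flux, r_flux)
print(f"g - r = {g_minus_r:.3f} mag")
```

A positive g - r means more flux in the redder band, i.e. a redder galaxy.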
Quantities inferred from the observations
Redshift \(z\): the fractional shift of spectral lines toward longer (redder) wavelengths, caused by the expansion of the universe (Wikipedia). In spectroscopic surveys \(z\) is measured precisely by identifying known lines in the dispersed spectrum (precision \(\sigma_z \sim 0.0001\)). In photometric surveys it is estimated from the shape of the broad-band spectral energy distribution (SED), a technique called photometric redshift, with typical precision \(\sigma_z/(1+z) \sim 0.02\). As a rough distance scale, expressed as light-travel distance: \(z = 0.1\) corresponds to roughly 1.3 billion light-years and \(z = 1\) to roughly 8 billion light-years; the comoving and luminosity distances at the same redshift are larger.
Absolute luminosity and distance — once \(z\) is known, a cosmological model converts it to a luminosity distance, which combined with the apparent flux yields the intrinsic luminosity (and hence absolute magnitude). This step introduces a dependence on the assumed cosmology.
Stellar mass \(M_\star\) — the total mass locked up in stars, inferred by fitting stellar population synthesis (SPS) models to the observed SED (Wikipedia). The result depends on the assumed initial mass function, star-formation history, and dust attenuation law — all of which introduce systematic uncertainties at the factor-of-two level.
Star formation rate (SFR) — derived from nebular emission-line fluxes (e.g. H\(\alpha\), [O II]) or from UV and infrared luminosities; both trace the rate at which interstellar gas is converted into new stars (Wikipedia).
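The chain from an observed line to a redshift, and from a redshift to a distance and luminosity, can be sketched as follows. This is an illustration under stated assumptions, not how sum_stat does it: the function names are hypothetical, and the cosmology is an assumed flat ΛCDM model with \(H_0 = 70\) km/s/Mpc and \(\Omega_m = 0.3\), radiation ignored. In practice a library such as astropy would normally be used instead of the hand-written integration here.

```python
import numpy as np

C_KM_S = 299792.458        # speed of light in km/s
MPC_IN_M = 3.0857e22       # metres per megaparsec
HUBBLE_TIME_FACTOR = 977.8  # 1/H0 in Gyr equals 977.8 / (H0 in km/s/Mpc)


def redshift_from_line(lambda_obs, lambda_rest):
    """z = lambda_obs / lambda_rest - 1 for one identified spectral line."""
    return lambda_obs / lambda_rest - 1.0


def _trapezoid(y, x):
    """Trapezoidal rule written out to avoid NumPy version differences."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def e_of_z(z, omega_m):
    """Dimensionless Hubble rate H(z)/H0 for the assumed flat LCDM model."""
    return np.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))


def luminosity_distance_mpc(z, h0=70.0, omega_m=0.3, n=2001):
    """d_L = (1 + z) * comoving distance, valid for a flat universe."""
    zs = np.linspace(0.0, z, n)
    d_c = C_KM_S / h0 * _trapezoid(1.0 / e_of_z(zs, omega_m), zs)
    return (1.0 + z) * d_c


def lookback_time_gyr(z, h0=70.0, omega_m=0.3, n=2001):
    """Light-travel time in Gyr between redshift z and today."""
    zs = np.linspace(0.0, z, n)
    integrand = 1.0 / ((1.0 + zs) * e_of_z(zs, omega_m))
    return HUBBLE_TIME_FACTOR / h0 * _trapezoid(integrand, zs)


def luminosity_watt(flux_w_m2, z, **cosmo):
    """Bolometric luminosity L = 4 pi d_L^2 F (no K-correction applied)."""
    d_l_m = luminosity_distance_mpc(z, **cosmo) * MPC_IN_M
    return 4.0 * np.pi * d_l_m ** 2 * flux_w_m2


# H-alpha (rest wavelength 656.28 nm) observed at 721.908 nm
z = redshift_from_line(721.908, 656.28)
print(f"z = {z:.4f}")
print(f"lookback time  = {lookback_time_gyr(z):.2f} Gyr")
print(f"d_L            = {luminosity_distance_mpc(z):.0f} Mpc")
print(f"L for 1e-17 W/m^2 = {luminosity_watt(1e-17, z):.3e} W")
```

With these assumed parameters the lookback times at \(z = 0.1\) and \(z = 1\) come out near 1.3 and 7.7 Gyr. Changing `h0` or `omega_m` changes every derived luminosity, which is the cosmology dependence noted in the list above.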
Modern surveys observe tens to hundreds of millions of galaxies, producing a 3-D map of the observable universe that can be statistically compared to theoretical predictions.
Major galaxy surveys
Surveys are broadly divided into photometric surveys — which prioritise wide sky coverage and galaxy-shape measurements — and spectroscopic surveys — which measure precise redshifts for millions of targets.
Ongoing and upcoming reference surveys
Euclid — ESA space mission launched in 2023. Its wide-field imager (VIS) and near-infrared spectrograph/photometer (NISP) will map the distribution and shapes of more than a billion galaxies out to \(z \sim 2\). Primary probes: weak lensing and galaxy clustering.
Rubin Observatory / LSST — the Legacy Survey of Space and Time; a 10-year ground-based photometric survey from Cerro Pachón, Chile. Six-band (ugrizy) imaging to unprecedented depth over half the sky, with science goals spanning dark energy, dark matter, transients, and the Solar System.
DESI — the Dark Energy Spectroscopic Instrument, a massively multiplexed fibre spectrograph at Kitt Peak National Observatory. It will deliver spectroscopic redshifts for ~40 million galaxies and quasars over \(0 < z < 3.5\).
Previous weak-lensing and photometric surveys
HSC (Hyper Suprime-Cam Subaru Strategic Program) — deep wide-field imaging on the 8.2-m Subaru Telescope, noted for its exceptional image quality and depth.
DES (Dark Energy Survey) — six-year imaging survey covering ~5000 deg² of the southern sky in five bands; released shape catalogues of ~100 million galaxies.
KiDS (Kilo-Degree Survey) — European weak-lensing survey with the OmegaCAM camera at the VLT Survey Telescope.
COSMOS — a deep pencil-beam survey covering a 2 deg² field with data from X-ray to radio, widely used as a photometric-redshift calibration field and for lensing studies.
Previous spectroscopic surveys
SDSS (Sloan Digital Sky Survey) — the landmark survey that established wide-area spectroscopy. Multiple generations (SDSS-I through SDSS-V) have provided redshifts for over 3 million objects and photometry across one-third of the sky.
2dFGRS (Two-degree Field Galaxy Redshift Survey) — obtained ~220 000 galaxy spectra in the early 2000s and established key results on large-scale structure and the galaxy luminosity function.
GAMA (Galaxy and Mass Assembly) — deep spectroscopic survey designed to study galaxy evolution and the galaxy–halo connection out to \(z \sim 0.5\) at high spectroscopic completeness.
Why do we compress data into summary statistics?
A catalogue of one billion galaxies cannot be compared directly to a theoretical model — it would take impossibly long to compute the probability of the entire catalogue under any cosmological model.
Instead, we compress the catalogue into a small number of summary statistics: compact numbers (or curves) that
retain most of the cosmological information we care about, and
can be quickly predicted from a model.
This is analogous to characterising the height distribution of a population not by listing every person, but by reporting the mean and standard deviation. The reduction in data volume is enormous (from \(\sim 10^9\) numbers to \(\sim 10^2\)), yet the compressed form still allows us to constrain the cosmological parameters that govern the large-scale structure of the universe.
sum_stat computes three families of summary statistics from galaxy catalogues and weak-lensing data.
The three families of statistics
One-point statistics: how many galaxies of each mass exist?
The stellar mass function \(\phi(M_\star)\) answers the question: how many galaxies per cubic megaparsec have a stellar mass between \(M_\star\) and \(M_\star + dM_\star\)?
Think of it as a histogram of galaxy masses, normalised by the volume of universe surveyed. Its shape — a power law at low masses that drops exponentially above a characteristic mass \(M^*\) (the Schechter function) — encodes how efficiently the universe has converted dark matter and gas into stars.
Similarly, the luminosity function \(\phi(M)\) counts galaxies by their absolute brightness rather than their mass.
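The "histogram normalised by volume" picture and the Schechter form can be sketched in a few lines. This is a simplified illustration, not the sum_stat estimator: the function names are hypothetical, the mock masses and survey volume are made up, and the histogram ignores completeness corrections (for example 1/Vmax weights) that a real analysis would need.

```python
import numpy as np


def schechter_log_mass(log_m, phi_star, log_m_star, alpha):
    """Schechter number density per dex of stellar mass.

    phi(log M) = ln(10) * phi_star * x**(alpha + 1) * exp(-x), x = M / M_star
    """
    x = 10.0 ** (np.asarray(log_m, dtype=float) - log_m_star)
    return np.log(10.0) * phi_star * x ** (alpha + 1.0) * np.exp(-x)


def binned_mass_function(log_masses, volume_mpc3, bin_edges):
    """Histogram estimate: counts per bin / (volume * bin width in dex)."""
    counts, edges = np.histogram(log_masses, bins=bin_edges)
    widths = np.diff(edges)
    phi = counts / (volume_mpc3 * widths)
    phi_err = np.sqrt(counts) / (volume_mpc3 * widths)  # Poisson errors only
    centres = 0.5 * (edges[1:] + edges[:-1])
    return centres, phi, phi_err


# Mock catalogue: made-up log10(M*/Msun) values in an assumed 1e6 Mpc^3 volume
rng = np.random.default_rng(42)
mock_log_masses = rng.normal(loc=10.3, scale=0.5, size=5000)
centres, phi, phi_err = binned_mass_function(
    mock_log_masses, volume_mpc3=1.0e6, bin_edges=np.arange(9.0, 12.01, 0.25)
)
model = schechter_log_mass(centres, phi_star=3e-3, log_m_star=10.8, alpha=-1.2)
for c, p, e, m in zip(centres, phi, phi_err, model):
    print(f"log M = {c:5.2f}  phi = {p:.2e} +/- {e:.1e}  Schechter = {m:.2e}")
```

The mock masses are not drawn from a Schechter distribution, so the two columns are not expected to agree. The point is only to show the units (number per Mpc³ per dex) and the shape of each calculation.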
Note
Megaparsec (Mpc): 1 Mpc = 3.09 × 10²² m ≈ 3.26 million light-years. A typical distance between galaxies is a few Mpc.
Two-point statistics: do galaxies cluster?
Galaxies are not scattered randomly — they cluster along cosmic filaments and around massive dark matter halos, leaving voids almost devoid of galaxies. Two-point correlation functions quantify this clustering by measuring the excess probability of finding a galaxy pair at a given separation, compared with a uniform random distribution.
Specifically:
Angular correlation function \(w(\theta)\): pairs separated by an angle \(\theta\) on the sky, regardless of distance. No redshift information is needed.
Projected correlation function \(w_p(r_p)\): pairs at a comoving transverse separation \(r_p\). The line-of-sight separation is integrated over to suppress redshift-space distortions, i.e. the errors that galaxy peculiar velocities introduce into redshift-based distances.
Multipole decomposition \(\xi_\ell(s)\): expands the redshift-space correlation function in Legendre polynomials of the angle to the line of sight. It retains the directional anisotropy of the clustering signal, which constrains the growth rate of large-scale structure.
The amplitude and shape of these functions depend on how galaxies trace the underlying dark matter distribution — this connection is what we ultimately want to model.
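The "excess probability over a random distribution" idea can be illustrated with a brute-force pair count. The sketch below uses the Landy-Szalay estimator, a widely used choice; this page does not say which estimator sum_stat implements, so treat that choice, the function names, and the flat two-dimensional toy geometry as assumptions. Brute-force counting scales as the square of the catalogue size and is only practical for small samples.

```python
import numpy as np


def _pair_counts(a, b, edges, auto):
    """Histogram of pair separations between point sets a and b."""
    diff = a[:, None, :] - b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    if auto:
        # keep each unordered pair once and drop self-pairs
        dist = dist[np.triu_indices(len(a), k=1)]
    counts, _ = np.histogram(dist.ravel(), bins=edges)
    return counts.astype(float)


def landy_szalay(data, randoms, edges):
    """xi = (DD - 2 DR + RR) / RR with pair counts normalised per pair."""
    n_d, n_r = len(data), len(randoms)
    dd = _pair_counts(data, data, edges, auto=True) / (n_d * (n_d - 1) / 2)
    rr = _pair_counts(randoms, randoms, edges, auto=True) / (n_r * (n_r - 1) / 2)
    dr = _pair_counts(data, randoms, edges, auto=False) / (n_d * n_r)
    return (dd - 2.0 * dr + rr) / rr


# Toy example: unclustered points in a unit square should give xi close to 0
rng = np.random.default_rng(0)
data = rng.uniform(size=(400, 2))
randoms = rng.uniform(size=(1000, 2))
edges = np.linspace(0.05, 0.3, 6)
xi = landy_szalay(data, randoms, edges)
for lo, hi, value in zip(edges[:-1], edges[1:], xi):
    print(f"{lo:.2f} < r < {hi:.2f}: xi = {value:+.3f}")
```

The random catalogue must follow the same footprint and selection as the data, here simply the unit square; any mismatch shows up as a spurious clustering signal.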
Galaxy-galaxy lensing: weighing dark matter halos
Weak gravitational lensing is one of the most direct ways to measure mass in the universe, including the invisible dark matter.
When light from a distant source galaxy passes near a massive lens galaxy or cluster, the gravitational field slightly deflects the light path. In the weak-lensing regime this deflection makes the source appear very slightly stretched in the tangential direction around the lens, not drawn out into a visible arc. For a single lens–source pair the effect is far too small to see, but by averaging over millions of pairs the coherent shear signal becomes measurable.
The quantity we compute is the excess surface density:
\[\Delta\Sigma(r_p) = \bar{\Sigma}(r_p) - \Sigma(r_p)\]
where \(\Sigma(r_p)\) is the projected mass density at projected radius \(r_p\) from the lens, and \(\bar{\Sigma}(r_p)\) is the mean projected density within that radius. \(\Delta\Sigma\) is directly related to the observable mean tangential ellipticity of source galaxies, so it is the primary quantity from weak lensing.
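As a numerical illustration of this "mean inside the radius minus local value" definition, the sketch below evaluates \(\Delta\Sigma\) for a toy circularly symmetric profile. The Gaussian profile, its parameters, and the function names are assumptions chosen because the answer can be checked analytically. This is a model-side calculation and not the lensing estimator that sum_stat applies to shape catalogues.

```python
import numpy as np


def excess_surface_density(sigma_func, r_p, n=4001):
    """Delta Sigma(r_p): mean of Sigma inside r_p minus Sigma at r_p."""
    r = np.linspace(0.0, r_p, n)
    integrand = sigma_func(r) * r  # Sigma(R') * R'
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r))
    sigma_mean = 2.0 * integral / r_p ** 2  # area-weighted mean inside r_p
    return float(sigma_mean - sigma_func(r_p))


def gaussian_sigma(r, sigma0=100.0, scale=0.5):
    """Toy projected density profile (arbitrary units), not a halo model."""
    return sigma0 * np.exp(-0.5 * (np.asarray(r, dtype=float) / scale) ** 2)


def gaussian_delta_sigma(r_p, sigma0=100.0, scale=0.5):
    """Closed-form Delta Sigma for the toy Gaussian, used as a cross-check."""
    u = np.exp(-0.5 * (r_p / scale) ** 2)
    return 2.0 * sigma0 * scale ** 2 * (1.0 - u) / r_p ** 2 - sigma0 * u


for radius in (0.25, 0.5, 1.0, 2.0):
    numeric = excess_surface_density(gaussian_sigma, radius)
    exact = gaussian_delta_sigma(radius)
    print(f"r_p = {radius:4.2f}  numeric = {numeric:8.4f}  analytic = {exact:8.4f}")
```

A uniform sheet of mass gives \(\Delta\Sigma = 0\), which is why this statistic is sensitive to mass concentrations and insensitive to a constant background.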
Together, \(\phi(M_\star)\), \(w_p(r_p)\), and \(\Delta\Sigma(r_p)\) form a joint probe of the galaxy–halo connection: the stellar mass function constrains how many galaxies live in each halo, the clustering constrains their spatial distribution, and the lensing directly weighs the halos.
How sum_stat fits in the pipeline
The diagram below shows where sum_stat sits in a typical analysis:
Raw survey files (FITS, HDF5)
        │
        ▼
GalaxyCatalogue / ShapeCatalogue      ← catalogue.py
        │
        ▼
┌─────────────────────────────────────┐
│         sum_stat estimators         │
│  lf_smf ── twopcf ── lensing ──...  │
└─────────────────────────────────────┘
        │
        ▼
HDF5 output file (SummaryStatWriter)  ← io/
        │
        ▼
Parametric model fitting              ← gga_model (companion package)
The core workflow is:
1. Load your galaxy catalogue and create a GalaxyCatalogue object (or ShapeCatalogue for lensing sources).
2. Call the estimator functions; each returns the statistic, its uncertainty, and optionally the full covariance matrix.
3. Write results to a self-describing HDF5 file with SummaryStatWriter.
4. Use the SummaryStatReader to read results back for plotting or to pass them to a fitting code.
See Quick start for a complete working example.
Glossary
- Comoving distance
The distance between two points in space measured in a coordinate system that expands with the universe. A pair of objects at rest relative to the cosmic expansion has constant comoving distance, even as the physical distance between them grows.
- Dark matter halo
A roughly spherical, gravitationally bound concentration of dark matter in which galaxies form and reside. Halos range in mass from roughly \(10^9\) to \(10^{15}\,M_\odot\).
- Excess surface density (ESD)
The projected mass excess \(\Delta\Sigma(r_p)\) measured via weak gravitational lensing. Units: \(M_\odot\,\mathrm{pc}^{-2}\).
- Jackknife covariance
An error-estimation technique: the survey is divided into \(N\) spatial regions; the statistic is recomputed \(N\) times, each time omitting one region. The covariance matrix is \((N-1)/N\) times the sum of the outer products of the deviations of these \(N\) realisations from their mean, so no analytical model is required.
- Luminosity function
The number density of galaxies per unit luminosity (or absolute magnitude) per unit comoving volume, \(\phi(M)\).
- Megaparsec (Mpc)
A unit of distance equal to 3.09 × 10²² m, or about 3.26 million light-years. Typical galaxy separations and survey depths are measured in Mpc.
- Redshift
The fractional shift of the wavelength of light caused by the expansion of the universe. A galaxy with redshift \(z\) emitted its light when the universe was a factor \(1/(1+z)\) of its present size.
- Schechter function
An empirical parametric model for the luminosity function and stellar mass function: a power law at low mass/luminosity that is exponentially suppressed above a characteristic value \(M^*\).
- Stellar mass function
The number density of galaxies per unit stellar mass per unit comoving volume, \(\phi(M_\star)\).
- Stellar mass \(M_\star\)
The total mass locked up in stars in a galaxy (not including dark matter or gas). Usually expressed as \(\log_{10}(M_\star / M_\odot)\), where \(M_\odot \approx 2 \times 10^{30}\,\mathrm{kg}\) is the solar mass.
- Weak gravitational lensing
The coherent distortion of background galaxy shapes caused by the gravitational deflection of light by foreground mass concentrations. “Weak” means the individual distortions are undetectable; only the statistical average over many source galaxies reveals the signal.
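The jackknife covariance entry in this glossary can be sketched as follows. The function name and the toy per-region values are assumptions for illustration; the estimators in sum_stat may organise the computation differently. The input is the set of \(N\) leave-one-out estimates of a statistic, one row per omitted region.

```python
import numpy as np


def jackknife_covariance(leave_one_out):
    """Covariance from N leave-one-out estimates, array shape (N, n_bins).

    C = (N - 1) / N * sum_i (x_i - mean)(x_i - mean)^T
    """
    x = np.asarray(leave_one_out, dtype=float)
    n = x.shape[0]
    dev = x - x.mean(axis=0)
    return (n - 1) / n * dev.T @ dev


# Toy case: the statistic is the mean of one value per region (made-up numbers)
region_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n_regions = len(region_values)
# Leave-one-out means: drop region i and average the remaining values
loo_means = (region_values.sum() - region_values) / (n_regions - 1)
cov = jackknife_covariance(loo_means[:, None])
print("jackknife variance of the mean:", cov[0, 0])
print("sample variance / N           :", region_values.var(ddof=1) / n_regions)
```

For a simple mean the two printed numbers agree, which is a useful sanity check on the \((N-1)/N\) prefactor: the leave-one-out estimates share most of their data, so their raw scatter alone would underestimate the uncertainty.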