Photometric variability feature extraction pipeline for Vera Rubin / LSST research, using ZTF data (via ALeRCE) as a public proxy.
The Vera Rubin Observatory's LSST will observe ~37 billion objects over 10 years and generate roughly 10 million alerts per night. ZTF is the best available public proxy: it observes in comparable g and r bands, has a similar cadence (3-night revisit), and overlaps part of the Rubin footprint. Pipelines built on ZTF transfer to Rubin DP1 data with a change confined to the ingest layer.
Most variability pipelines treat g and r bands independently. This pipeline introduces inter-band lag — the time delay (in days) between g and r band variability peaks, measured via cross-correlation.
Physical interpretation:
- g leads r by 1–10 days → AGN accretion disk reverberation. The disk is hotter (bluer) in the centre; a driving fluctuation propagates outward to cooler (redder) radii with a light-travel-time delay proportional to disk size.
- lag ≈ 0 → stellar flares, where g and r brighten simultaneously.
- r leads g (negative lag) → dust reverberation echoes, where the dust re-radiates at longer wavelengths with a delay.
At survey scale, this feature has been largely unexplored. The combination of inter_band_lag with color_slope (bluer-when-brighter = AGN-like) cleanly separates physical variability mechanisms.
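As a rough illustration of the lag measurement, the sketch below estimates the inter-band lag with an interpolated cross-correlation: the r-band light curve is linearly interpolated onto shifted g-band epochs, and the Pearson correlation is evaluated on a grid of trial lags. The function name `inter_band_lag`, its parameters (`max_lag`, `step`, `min_pairs`), and the convention that a positive lag means g leads r are assumptions made for this example; features.py may differ in its details.

```python
from bisect import bisect_right
from math import sqrt


def _interp(t, xs, ys):
    """Linear interpolation of (xs, ys) at t; None outside the sampled range.

    xs must be strictly increasing and contain at least two points.
    """
    if t < xs[0] or t > xs[-1]:
        return None
    i = min(bisect_right(xs, t), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    w = (t - x0) / (x1 - x0)
    return ys[i - 1] + w * (ys[i] - ys[i - 1])


def _pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def inter_band_lag(t_g, m_g, t_r, m_r, max_lag=10.0, step=0.5, min_pairs=10):
    """Return (lag_days, ccf_peak); positive lag means g leads r (assumed convention)."""
    if len(t_r) < 2:
        return None, None
    best_lag, best_r = None, None
    n_steps = int(round(2 * max_lag / step))
    for k in range(n_steps + 1):
        lag = -max_lag + k * step
        # Compare g(t) with r(t + lag): if g leads r by `lag`, these line up.
        pairs = [(m, _interp(t + lag, t_r, m_r)) for t, m in zip(t_g, m_g)]
        pairs = [(g, r) for g, r in pairs if r is not None]
        if len(pairs) < min_pairs:
            continue  # too little overlap at this trial lag
        r_val = _pearson([g for g, _ in pairs], [r for _, r in pairs])
        if best_r is None or r_val > best_r:
            best_lag, best_r = lag, r_val
    return best_lag, best_r
```

Calling `inter_band_lag(t_g, m_g, t_r, m_r)` with time-sorted epochs and magnitudes returns the trial lag with the highest correlation together with that correlation, which can serve as a confidence weight. The returned peak is a raw Pearson coefficient, so it can be negative for anti-correlated bands.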
| Feature | Physical meaning |
|---|---|
| `amplitude_g/r` | Total flux swing — AGN/SNe large, RRL moderate, noise small |
| `stetson_j` | Correlated cross-band variability — high for real astrophysical events |
| `stetson_k_g/r` | Shape of variability distribution — Gaussian vs. burst-like |
| `von_neumann_g/r` | Temporal autocorrelation — low = periodic/smooth, high = stochastic |
| `ls_period_g/r` | Best Lomb-Scargle period — useful for RRL, Cepheids, LPV |
| `ls_fap_g/r` | False alarm probability — quality flag for periodicity |
| `sf_slope_g/r` | Structure function slope — AGN ≈ 0.3, RRL steep, random walk = 1.0 |
| `color_slope` | Pearson r(color vs brightness) — negative = bluer-when-brighter (AGN) |
| `inter_band_lag` | g−r cross-correlation lag in days — accretion disk size proxy |
| `ccf_peak` | CCF peak quality (0–1) — use as confidence weight on lag |

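Three of the simpler features in the table can be sketched in a few lines of plain Python. This is a minimal illustration and is not taken from features.py: the function names are hypothetical, `color_slope` assumes the g and r measurements are already paired at quasi-simultaneous epochs, and it takes "brightness" to be the negated r magnitude so that bluer-when-brighter comes out negative, as in the table.

```python
from math import sqrt


def stetson_k(mag, magerr):
    """Stetson K: about 0.798 for Gaussian scatter, about 0.900 for a pure sinusoid."""
    n = len(mag)
    weights = [1.0 / e ** 2 for e in magerr]
    mean = sum(w * m for w, m in zip(weights, mag)) / sum(weights)
    scale = sqrt(n / (n - 1))
    delta = [scale * (m - mean) / e for m, e in zip(mag, magerr)]
    return sum(abs(d) for d in delta) / sqrt(n * sum(d * d for d in delta))


def von_neumann_eta(mag):
    """Mean squared successive difference divided by the sample variance.

    Expects time-ordered magnitudes; about 2 for uncorrelated noise,
    much smaller for smooth variability.
    """
    n = len(mag)
    mean = sum(mag) / n
    var = sum((m - mean) ** 2 for m in mag) / (n - 1)
    msd = sum((b - a) ** 2 for a, b in zip(mag, mag[1:])) / (n - 1)
    return msd / var


def color_slope(g_mag, r_mag):
    """Pearson r between colour (g - r) and brightness at paired epochs.

    Brightness is taken as -r_mag (an assumption of this sketch), so
    bluer-when-brighter behaviour gives a negative value.
    """
    color = [g - r for g, r in zip(g_mag, r_mag)]
    bright = [-r for r in r_mag]
    n = len(color)
    mean_c, mean_b = sum(color) / n, sum(bright) / n
    cov = sum((c - mean_c) * (b - mean_b) for c, b in zip(color, bright))
    var_c = sum((c - mean_c) ** 2 for c in color)
    var_b = sum((b - mean_b) ** 2 for b in bright)
    return cov / sqrt(var_c * var_b)
```

All three functions divide by a variance-like term, so a perfectly constant light curve would need to be filtered out before calling them.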
    ALeRCE API (ZTF)
        │
    ingest.py    ← fetch light curves, ~400 objects × 4 classes
        │
    features.py  ← extract 30 features per object including inter_band_lag
        │
    cluster.py   ← UMAP (2D) + HDBSCAN, generate 3 diagnostic plots
        │
    pipeline.py  ← orchestrate, save features.parquet + top_candidates.csv

| File | Description |
|---|---|
| `outputs/features.parquet` | Full feature catalog (one row per object) |
| `outputs/top_candidates.csv` | Top 5 objects per cluster with high \|lag\| and ccf_peak > 0.5 |
| `outputs/cluster_plot.png` | UMAP embedding coloured by cluster, shaped by ALeRCE class |
| `outputs/lag_histogram.png` | Inter-band lag distribution per cluster |
| `outputs/color_lag_scatter.png` | color_slope vs inter_band_lag (the key diagnostic plot) |
| `outputs/summary.txt` | Plain-text run summary |

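The candidate selection behind `top_candidates.csv` can be sketched as a filter followed by a per-cluster ranking. This is an illustrative reading of the table entry and is not taken from pipeline.py: it interprets "high |lag|" as ranking by absolute lag, and the row keys (`oid`, `cluster`, `inter_band_lag`, `ccf_peak`) and the function name are assumptions.

```python
from collections import defaultdict


def top_candidates(rows, per_cluster=5, min_ccf_peak=0.5):
    """Pick the rows with the largest |inter_band_lag| in each cluster.

    rows: iterable of dicts with (assumed) keys
          "cluster", "inter_band_lag", "ccf_peak".
    Rows without a lag, or with ccf_peak <= min_ccf_peak, are skipped.
    """
    by_cluster = defaultdict(list)
    for row in rows:
        lag = row.get("inter_band_lag")
        peak = row.get("ccf_peak")
        if lag is None or peak is None or peak <= min_ccf_peak:
            continue
        by_cluster[row["cluster"]].append(row)

    selected = []
    for cluster in sorted(by_cluster):
        ranked = sorted(
            by_cluster[cluster],
            key=lambda r: abs(r["inter_band_lag"]),
            reverse=True,
        )
        selected.extend(ranked[:per_cluster])
    return selected
```

HDBSCAN assigns the label -1 to noise points. This sketch treats -1 like any other cluster label, so those objects would need to be dropped beforehand if they should not appear among the candidates.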
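
Install the dependencies and run the pipeline:

    pip install -r requirements.txt
    python scripts/run_pipeline.py

To run on Rubin data, replace `load_dataset()` in ingest.py with:

    from lsst.daf.butler import Butler
    import pandas as pd

    # Rubin column names mapped to the names the pipeline expects.
    COLUMNS = {"midpointMjdTai": "mjd", "psfMag": "mag", "psfMagErr": "magerr"}

    def _select_band(obj, band):
        """Return the (mjd, mag, magerr) light curve of one band."""
        return obj.loc[obj["band"] == band, list(COLUMNS)].rename(columns=COLUMNS)

    def load_dataset_rubin(repo_path, collection):
        butler = Butler(repo_path, collections=collection)
        refs = butler.registry.queryDatasets("source", ..., instrument="LSSTCam")
        # Concatenate all tables first so that an object's epochs from different
        # datasets end up in one light curve instead of overwriting each other.
        frames = [butler.get(ref) for ref in refs]  # assumed to be pandas DataFrames
        if not frames:
            return {}
        src = pd.concat(frames, ignore_index=True)
        dataset = {}
        for obj_id, obj in src.groupby("objectId"):
            dataset[str(obj_id)] = {
                "g": _select_band(obj, "g"),
                "r": _select_band(obj, "r"),
                "class": "unknown",  # no class label available at ingest
            }
        return dataset

Everything downstream (features, clustering, plots) is unchanged.
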
- Förster et al. 2021 — ALeRCE: Alert broker for ZTF (arXiv:2008.03311)
- Bellm et al. 2019 — ZTF survey design (PASP 131, 018002)
- McInnes et al. 2018 — UMAP (arXiv:1802.03426)
- Campello et al. 2013 — HDBSCAN (PAKDD 2013)
- Stetson 1996 — Variability indices J, K (PASP 108, 851)
- Apply to Rubin DP1 AGN candidates (available via RSP)
- Use `top_candidates.csv` to prioritise reverberation mapping follow-up
- Extend lag measurement to all band pairs (u, g, r, i, z, y) once Rubin data is available
- Train a supervised classifier on the feature set using ALeRCE labels as ground truth