Author: Shweta Shah — Senior Product Manager, Investment & Financial Data & Regulatory Solutions Date: 2026-03-20 Framework: Business-First AI — Step 2: Deconstruct (Output) Feeds into: Investment Option Benchmark Tracker (Use Case 4) · YFYS Risk Simulator (Use Case 9)
| Field | Value |
|---|---|
| Workflow Name | APRA Performance Data Loader |
| Description | Downloads, parses, and stores APRA CPPP data for both MySuper and TDP (Trustee Directed) products — current year and all available historical files — into a structured JSON dataset ready for member-facing lookup in the Benchmark Tracker webapp |
| Process Outcome | A clean, queryable performance-data.json file containing per-product current year metrics (10yr NIR, fees, RAG status, pass/fail) and historical pass/fail records for both MySuper and TDP product types |
| Trigger | APRA publishes new annual CPPP data (June each year, available through December) — or developer initiates a manual data refresh |
| Type | Deterministic |
| Lens | Individual |
| Business Objective | Establish the complete public data foundation for the Investment Option Benchmark Tracker and YFYS Risk Simulator — covering both MySuper and TDP product cohorts under the YFYS Performance Test |
| Current Owner | Shweta Shah (developer / data owner during Maven course build, Apr–May 2026) |
| Regulatory Tie-in | YFYS Performance Test — APRA's annual assessment of MySuper and TDP products against SAA benchmarks; failure triggers mandatory member notification and eventual product closure to new members. In 2025: 563 MySuper products tested; 7 platform TDPs failed; 25 products in the danger zone |
Action: For the current year, construct both APRA page URLs, scrape each page to locate the CPPP Excel download links, and fetch both files.
Sub-steps:
- Determine current year (accept as parameter or derive from system date)
- MySuper file:
- Construct URL:
https://www.apra.gov.au/YYYY-annual-superannuation-performance-test-mysuper-products - Scrape page to find
.xlsxlink matchingYYYY CPP-MySuper - Download to local working directory
- Construct URL:
- TDP/Choice file:
- Construct URL:
https://www.apra.gov.au/YYYY-annual-superannuation-performance-test-trustee-directed-products - Scrape page to find
.xlsxlink matchingYYYY CPPP - Choice - Download to local working directory
- Construct URL:
Decision points:
- If either page returns 404 → APRA has not yet published data; abort that file with message:
"CPPP [MySuper/TDP] data not yet available for YYYY. Published June–December annually." - If multiple
.xlsxlinks found on a page → select the one matching the exact naming pattern - Files are independent — if one fails, continue with the other and flag the failure
| Data in | Current year (integer) |
| Data out | YYYY_CPP_MySuper.xlsx and YYYY_CPPP_Choice.xlsx saved locally |
Failure modes:
- APRA changes URL structure for either product type → download fails; validation check needed post-fetch
- File not yet published (before June) → clear error message returned
- Page exists but file link renamed → no match found, requires manual intervention
Action: Navigate to the APRA historical results page, identify all available files for both MySuper and TDP product types, and download any years not already stored locally.
Historical URL: https://www.apra.gov.au/previous-performance-test-results
Sub-steps:
- Navigate to the historical results page
- Scrape page to find all
.xlsxlinks and classify by naming pattern:
| Year range | Product type | File naming pattern |
|---|---|---|
| 2023+ | MySuper | YYYY-annual-superannuation-performance-test-mysuper-products |
| 2023+ | TDP | YYYY-annual-superannuation-performance-test-trustee-directed-products |
| 2021–2022 | MySuper only | YYYY Annual Superannuation Performance Test |
- Compare against already-downloaded files; skip years already stored
- Download all missing files to local working directory
Decision points:
- First-time load → download all available files across both product types and all years
- Annual refresh → skip already-downloaded files; only Step 1A needed for the new year
- 2021 and 2022 files contain MySuper data only — do NOT expect or create TDP history for those years
- If a historical file is unavailable → log warning, continue with remaining files
| Data in | List of already-downloaded historical files (for incremental refresh logic) |
| Data out | Historical .xlsx files per year per product type |
Failure modes:
- APRA restructures the historical results page → file discovery fails; URL needs manual update
- 2021–2022 combined files use inconsistent internal sheet structure → sheet navigation logic needs year-aware branching
- Older historical TDP files not available (TDP testing only started 2023) → TDP history will only span 2023+
Action: Open each downloaded .xlsx file, navigate to the relevant sheet(s), and extract the required columns. Uses openpyxl (not pandas) to enable cell colour and conditional formatting inspection.
Note on RAG extraction: APRA CPPP files are likely to use Excel conditional formatting (rules-based) rather than direct cell fill for RAG colouring. openpyxl can read the conditional formatting rules but cannot evaluate which colour applies to a given cell. The extraction logic must attempt direct fill first and fall through a defined decision tree if fills return empty. See Decision points below.
Sub-steps:
- For each file, open with
openpyxl(read-only mode) - For current year files only: Open
Colour Legendsheet first; read the label-to-colour mapping (RGB hex → Green / Amber / Red) — used to interpret direct cell fills if present. Note: this sheet maps colours to labels but does not contain the value thresholds used in conditional formatting rules. - Determine file type (MySuper or TDP/Choice) from filename
- Navigate to the correct sheet(s) and extract:
MySuper files — sheet: MySuper Products
| Column | All years | Current year only |
|---|---|---|
RSE licensee MySuper product name |
✅ | ✅ |
Pass/Fail Indicator |
✅ | ✅ |
10 year Net Investment Return (NIR) p.a. |
— | ✅ |
Administration fees and costs charged ($50,000 account balance) |
— | ✅ |
Administration fees and costs charged ($100,000 account balance) |
— | ✅ |
| RAG cell fill colour (NIR + both fee columns) | — | ✅ |
TDP/Choice files — sheets: Non-Platform TDPs and Platform TDPs (process both)
| Column | All years | Current year only |
|---|---|---|
Product Name |
✅ | ✅ |
Investment Menu Name |
✅ | ✅ |
Investment Option Name |
✅ | ✅ |
Pass/Fail Indicator |
✅ | ✅ |
10 year Net Investment Return (NIR) p.a. |
— | ✅ |
Administration fees and costs charged ($50,000 account balance) |
— | ✅ |
Administration fees and costs charged ($100,000 account balance) |
— | ✅ |
| RAG cell fill colour (NIR + both fee columns) | — | ✅ |
- Tag each record with: source year, product type (
MySuper/Platform TDP/Non-Platform TDP)
Decision points:
RAG extraction — apply this decision tree for each of the three RAG columns (NIR, fees $50k, fees $100k):
- Attempt direct cell fill read via
openpyxl(cell.fill.fgColor.rgb) - If all three RAG columns return
Noneor"00000000"(transparent/no fill) → APRA is using conditional formatting; proceed to step 3 - Parse the worksheet's conditional formatting rules via
ws.conditional_formatting; extract value thresholds and colour mappings from rule objects (e.g.,ColorScaleRule,CellIsRule) - Apply extracted thresholds to each cell's numeric value to derive the RAG label (
"Green"/"Amber"/"Red") - If CF rules cannot be parsed or use an unsupported rule type → set RAG to
"Unknown"for affected columns; record"rag_method": "fallback_unknown"in output metadata; do not abort
Other decision points:
- If expected sheet not found in a file → abort that sheet, log mismatch, continue with remaining files/sheets
- If
Colour Legendsheet not found in current year file → log warning; skip Colour Legend step; rely on CF rule parsing (step 3 above) for RAG derivation - 2021–2022 combined files: extract MySuper data from
MySuper Productssheet only — no TDP sheets expected
| Data in | .xlsx files (current year + all historical years, both product types) |
| Data out | Raw extracted records per file per sheet, each tagged with source year and product type; colour legend map from current year files |
Failure modes:
- APRA renames a column in a future file → extraction fails silently (KeyError or wrong column); add explicit column-presence validation post-extraction
- CF rule type unrecognised by openpyxl (e.g., data bar, icon set) → RAG set to
"Unknown"for affected columns; downstream webapp must handlenull/"Unknown"RAG values gracefully Colour Legendsheet structure changes → colour mapping step skipped; RAG falls through to CF rule parsing; only a problem if CF rules also fail
Action: Normalise all extracted records — across both product types — for reliable joining across years and accurate downstream output.
Sub-steps:
- MySuper records:
- Trim whitespace and standardise casing on
RSE licensee MySuper product name(join key)
- Trim whitespace and standardise casing on
- TDP records:
- Trim whitespace and standardise casing on all three identifier columns:
Product Name,Investment Menu Name,Investment Option Name - Composite join key =
Product Name+|+Investment Menu Name+|+Investment Option Name
- Trim whitespace and standardise casing on all three identifier columns:
- Normalise
Pass/Fail Indicatorto consistent values:"Pass"/"Fail"(across all product types) - Apply colour legend map (read in Step 2) to convert RGB cell fill values →
"Green"/"Amber"/"Red"/"Unknown"for each of the three metric columns independently (NIR, fees $50k, fees $100k) - Parse numeric fields (
NIR, fees) — strip percentage signs, currency symbols, formatting; convert tofloat; set tonullif value is non-numeric ("N/A","--","*") - Apply fuzzy match logic across years for each product type to resolve minor identifier variations:
- Distance ≤ 1 → auto-apply match silently; log to run log; no human action required
- Distance = 2 → apply best-guess match; set
"match_confidence": "low"on the record; append full detail todata/flagged-matches.jsonsidecar for post-pipeline human review - Distance > 2 or no match → treat as distinct product/option; do not merge; if history-only record, mark
current_metrics: null
Decision points:
- Fuzzy match distance ≤ 1 → auto-apply; log only
- Fuzzy match distance = 2 → apply best-guess; flag in
flagged-matches.json; pipeline continues without pausing - Fuzzy match distance > 2 → no merge; orphan record handled per product-type rules below
- MySuper product appears in current year only → include with empty
history: {} - TDP investment option appears in current year only → include with empty
history: {} - Product/option discontinued (in history only) → retain historical record, mark
current_metrics: null - TDP options with no history before 2023 → expected;
historyobject will only contain 2023+ entries - Human reviewer consults
flagged-matches.jsonafter pipeline completes; pipeline does NOT pause; corrections applied viamatch-overrides.jsonon the next run (see Context Item 7 for schema)
| Data in | Raw extracted records from all files |
| Data out | Cleaned, standardised records for both MySuper and TDP, ready for joining; data/flagged-matches.json sidecar (written if any distance-2 matches found) |
Context needs:
- Colour legend map and/or CF rule thresholds (from Step 2)
- Fuzzy match threshold: distance ≤ 1 (auto-apply), distance = 2 (flag), distance > 2 (no merge)
Failure modes:
- Fund renames a product or investment option between years → distance > 2; historical records orphaned; reviewer must add a manual override entry in
match-overrides.json - Numeric fields contain non-numeric characters → set to
null(handled in sub-step 5); downstream webapp must handlenullnumeric fields
Action: Join current year records with historical pass/fail records per product/option to produce unified entries — separately for MySuper (flat) and TDP (hierarchical).
Sub-steps:
- MySuper longitudinal:
- Group cleaned MySuper records by
RSE licensee MySuper product name - Set current year metrics as top-level fields
- Build
historyobject:{ "YYYY": "Pass" | "Fail" }for each available year (2021+) - Sort
historykeys by year descending
- Group cleaned MySuper records by
- TDP longitudinal:
- Group cleaned TDP records by composite key (
Product Name+Investment Menu Name+Investment Option Name) - Set current year metrics as top-level fields
- Tag with
product_type:"Platform TDP"or"Non-Platform TDP"(from source sheet) - Build
historyobject:{ "YYYY": "Pass" | "Fail" }for each available year (2023+) - Sort
historykeys by year descending
- Group cleaned TDP records by composite key (
Decision points:
- Products/options with no current year record but historical entries → include with
current_metrics: nullflag - Duplicate identifiers within a single year's file → deduplicate; log warning if values conflict
| Data in | Cleaned records from all years, both product types |
| Data out | Two unified datasets — mysuper_products list (flat) and tdp_products list (hierarchical identifier) |
Failure modes:
- Duplicate identifiers in one year's file → conflicting records; deduplication logic required
- All records fail to join across years → history empty for all products; data integrity check needed
Action: Serialise both unified datasets into a single performance-data.json file with metadata header and safety backup of the previous version.
Sub-steps:
- Archive existing
performance-data.jsonasperformance-data-YYYY-backup.json(if present) - Construct output object with metadata header and both product type arrays (see schema below)
- Serialise to JSON (UTF-8, indented for readability)
- Write to
data/performance-data.jsonin the webapp root directory - Log: total MySuper products, total TDP options, source years per type, timestamp
Decision points:
- Output directory does not exist → create it before writing
- JSON serialisation error → abort write, preserve backup, surface error
| Data in | mysuper_products list + tdp_products list |
| Data out | data/performance-data.json |
Failure modes:
- File write fails mid-way → partial/corrupt JSON; backup enables recovery
- Serialisation error at final step → pipeline fails after all processing; backup must be retained
Steps 1A and 1B run in parallel — independent downloads. Within Step 2, each file can be processed independently (parallelisable).
1A (MySuper + TDP current year) ──┐
├──► 2 (extract all files) ──► 3 (clean) ──► 4 (join longitudinal) ──► 5 (write JSON)
1B (MySuper + TDP historical) ──┘
1A + 1B (parallel) → 2 → 3 → 4 → 5
| Step | Depends On |
|---|---|
| 1A | Trigger confirmed (APRA data published for current year) |
| 1B | Trigger confirmed |
| 2 | 1A and 1B complete |
| 3 | 2 complete |
| 4 | 3 complete |
| 5 | 4 complete |
| Product Type | URL Pattern |
|---|---|
| MySuper | https://www.apra.gov.au/YYYY-annual-superannuation-performance-test-mysuper-products |
| TDP / Choice | https://www.apra.gov.au/YYYY-annual-superannuation-performance-test-trustee-directed-products |
Used by: Step 1A Status: ✅ Exists — confirmed by Shweta Shah
Value: https://www.apra.gov.au/previous-performance-test-results
Used by: Step 1B
Status: ✅ Exists — confirmed by Shweta Shah
| Year range | Product type | File naming pattern |
|---|---|---|
| Current year | MySuper | YYYY CPP-MySuper XLSX |
| Current year | TDP / Choice | YYYY CPPP - Choice XLSX |
| 2023+ historical | MySuper | YYYY-annual-superannuation-performance-test-mysuper-products |
| 2023+ historical | TDP | YYYY-annual-superannuation-performance-test-trustee-directed-products |
| 2021–2022 historical | MySuper only | YYYY Annual Superannuation Performance Test |
Used by: Steps 1A, 1B Status: ✅ Exists — confirmed by Shweta Shah Note: TDP performance testing began in 2023. No TDP historical files exist for 2021–2022.
MySuper files — sheet: MySuper Products
| Column | Purpose |
|---|---|
RSE licensee MySuper product name |
Product identifier (join key) |
Pass/Fail Indicator |
YFYS test result — all years |
10 year Net Investment Return (NIR) p.a. |
Return metric — current year only |
Administration fees and costs charged ($50,000 account balance) |
Fee metric — current year only |
Administration fees and costs charged ($100,000 account balance) |
Fee metric — current year only |
TDP/Choice files — sheets: Non-Platform TDPs and Platform TDPs
| Column | Purpose |
|---|---|
Product Name |
Top-level product identifier |
Investment Menu Name |
Mid-level identifier |
Investment Option Name |
Lowest-level identifier (join key component) |
Pass/Fail Indicator |
YFYS test result — all years |
10 year Net Investment Return (NIR) p.a. |
Return metric — current year only |
Administration fees and costs charged ($50,000 account balance) |
Fee metric — current year only |
Administration fees and costs charged ($100,000 account balance) |
Fee metric — current year only |
Used by: Step 2 Status: ✅ Exists — confirmed by Shweta Shah
Description: The APRA CPPP files contain a dedicated Colour Legend sheet in the same workbook (both MySuper and TDP/Choice files). The pipeline reads this sheet programmatically in Step 2 to build the RGB → label map — no manual hardcoding required.
Columns carrying RAG cell fill (same across all product types):
10 year Net Investment Return (NIR) p.a.Administration fees and costs charged ($50,000 account balance)Administration fees and costs charged ($100,000 account balance)
How to read: Open Colour Legend sheet with openpyxl; iterate rows; extract cell fill RGB + label text; build dict { "RRGGBB": "Green", ... }
Used by: Steps 2, 3
Status: ✅ Exists — confirmed by Shweta Shah
Risk: If APRA changes Colour Legend structure in a future year, mapping extraction logic needs updating
Used by: Step 5 Status: ✅ Defined
{
"last_updated": "2025-06",
"source_years_mysuper": ["2025", "2024", "2023", "2022", "2021"],
"source_years_tdp": ["2025", "2024", "2023"],
"total_mysuper_products": 563,
"total_tdp_options": 1250,
"mysuper_products": [
{
"product_name": "AustralianSuper — Balanced",
"pass_fail_current": "Pass",
"nir_10yr": 7.2,
"nir_rag": "Green",
"fees_50k": 450,
"fees_50k_rag": "Green",
"fees_100k": 550,
"fees_100k_rag": "Amber",
"history": {
"2024": "Pass",
"2023": "Pass",
"2022": "Fail",
"2021": "Pass"
}
}
],
"tdp_products": [
{
"product_name": "BT Panorama",
"investment_menu_name": "BT Panorama Super",
"investment_option_name": "Australian Shares",
"product_type": "Platform TDP",
"pass_fail_current": "Fail",
"nir_10yr": 5.1,
"nir_rag": "Red",
"fees_50k": 820,
"fees_50k_rag": "Red",
"fees_100k": 1100,
"fees_100k_rag": "Red",
"history": {
"2024": "Pass",
"2023": "Pass"
}
}
]
}Note:
nir_rag,fees_50k_rag, andfees_100k_ragare independent across all product types. TDPhistoryonly spans 2023+ (TDP testing began 2023). MySuperhistoryspans 2021+. The webapp filters TDP records byProduct Name,Investment Menu Name, andInvestment Option Nameindependently.
Description: Decision rules and output artifacts for handling minor identifier variations across years Used by: Step 3 Status: ✅ Defined
Match thresholds:
| Distance | Action |
|---|---|
| ≤ 1 | Auto-apply match; log to run log only |
| = 2 | Apply best-guess match; set "match_confidence": "low"; append to flagged-matches.json |
| > 2 | No merge; treat as distinct product; orphaned history-only records marked current_metrics: null |
TDP composite key matching: All three identifier columns (Product Name, Investment Menu Name, Investment Option Name) are matched independently. A match is only applied if all three components are within threshold.
data/flagged-matches.json sidecar schema:
[
{
"product_type": "MySuper",
"year_a": "2024",
"name_a": "AustralianSuper Balanced",
"year_b": "2023",
"name_b": "AustralianSuper - Balanced",
"levenshtein_distance": 2,
"action_taken": "best_guess_match_applied",
"match_confidence": "low"
}
]match-overrides.json (future capability — run 2+): Allows reviewer to confirm or reject flagged matches. On the next pipeline run, confirmed overrides are applied before fuzzy matching runs. Schema to be defined on first run after reviewing flagged-matches.json output.
Pipeline behaviour: Pipeline never pauses for human review. It always completes with best-available data. Reviewer consults flagged-matches.json post-run.