Rebuild Missing Dataset Files

This task rebuilds missing extracted files in GTFS datasets. It downloads datasets from their hosted_url, extracts all files, computes zipped and unzipped sizes, calculates hashes, uploads the files to a GCS bucket, and updates the database.


Task ID

Use task ID: rebuild_missing_dataset_files


Usage

The function accepts the following payload:

{
  "dry_run": true,                // [optional] If true, do not upload or modify the database (default: true)
  "after_date": "YYYY-MM-DD",     // [optional] Only include datasets downloaded after this ISO date
  "latest_only": true,            // [optional] If true, only process the latest version of each dataset (default: true)
  "dataset_id": id                // [optional] If provided, only process the specified dataset. It will supersede the after_date and latest_only parameters.
}

Example:

{
  "dry_run": false,
  "after_date": "2025-07-01",
  "latest_only": true
}

or

{
  "dry_run": false,
  "dataset_id": "mdb-1147-202407031702"
}
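
As a rough illustration of how the payload defaults and the dataset_id precedence rule described above could be applied, here is a minimal sketch. The `RebuildPayload` class and `parse_payload` function are hypothetical names introduced for this example and are not taken from the task's source code. Clearing `after_date` and `latest_only` when `dataset_id` is present is only one possible way to represent "supersedes".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RebuildPayload:
    """Hypothetical container for the documented payload fields."""
    dry_run: bool = True               # documented default: true
    after_date: Optional[date] = None  # optional ISO date filter
    latest_only: bool = True           # documented default: true
    dataset_id: Optional[str] = None   # optional single-dataset selector


def parse_payload(payload: dict) -> RebuildPayload:
    """Apply the documented defaults and the dataset_id precedence rule."""
    raw_date = payload.get("after_date")
    parsed = RebuildPayload(
        dry_run=bool(payload.get("dry_run", True)),
        after_date=date.fromisoformat(raw_date) if raw_date else None,
        latest_only=bool(payload.get("latest_only", True)),
        dataset_id=payload.get("dataset_id"),
    )
    if parsed.dataset_id:
        # Assumption: "supersedes" is modelled by dropping the other filters.
        parsed.after_date = None
        parsed.latest_only = False
    return parsed
```

Calling `parse_payload({})` in this sketch yields a dry run over the latest version of each dataset, which matches the documented defaults.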

What It Does

For each GTFS dataset with missing file information (missing zipped/unzipped sizes or missing extracted files), this function:

  1. Downloads the .zip file from its hosted_url

  2. Computes the zipped size in bytes

  3. Extracts all GTFS files locally

  4. Computes the unzipped size in bytes

  5. Uploads each extracted file to a GCS bucket with the structure:

    {feed-stable-id}/{dataset-stable-id}/extracted/{file_name}
    
  6. Makes each file publicly accessible and stores its GCS URL

  7. Computes SHA256 hashes for each file

  8. Stores metadata in the Gtfsfile table for later use
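
The local part of these steps (sizes, extraction, hashes, and the destination path) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the task's actual implementation: `describe_dataset_zip` and the dictionary keys are hypothetical names, the download, GCS upload, and database steps are left out, and the sketch assumes `extract_dir` is empty beforehand and ignores any subdirectories inside the archive.

```python
import hashlib
import os
import zipfile


def describe_dataset_zip(zip_path, extract_dir, feed_stable_id, dataset_stable_id):
    """Return (zipped_size, unzipped_size, files) for a local GTFS .zip file."""
    zipped_size = os.path.getsize(zip_path)  # zipped size in bytes
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)  # extract all files locally

    files = []
    unzipped_size = 0
    for file_name in sorted(os.listdir(extract_dir)):
        local_path = os.path.join(extract_dir, file_name)
        if not os.path.isfile(local_path):
            continue  # simplification: nested directories are skipped
        size = os.path.getsize(local_path)
        unzipped_size += size
        digest = hashlib.sha256()
        with open(local_path, "rb") as handle:
            # Hash in 1 MiB chunks so large files are not loaded into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        files.append({
            "file_name": file_name,
            "file_size_bytes": size,
            "sha256": digest.hexdigest(),
            # Destination path following the documented bucket structure.
            "blob_path": f"{feed_stable_id}/{dataset_stable_id}/extracted/{file_name}",
        })
    return zipped_size, unzipped_size, files
```

Each entry in `files` carries what a row of file metadata would need, apart from the public GCS URL, which only exists once the upload has happened.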

If the dataset_id parameter is provided, the process is simpler: the dataset is not downloaded from its hosted_url, because it is assumed to already be present in the bucket. The rest of the processing is the same.


GCP Environment Variables

The function requires the following environment variables:

Variable              Description
DATASETS_BUCKET_NAME  The name of the GCS bucket used to store extracted GTFS files
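
One way to read this variable and fail early when it is missing is sketched below. The `get_bucket_name` helper is a hypothetical name used for illustration. The actual function may handle a missing value differently.

```python
import os


def get_bucket_name(environ=os.environ) -> str:
    """Return the configured bucket name, or raise if it is not set."""
    name = environ.get("DATASETS_BUCKET_NAME")
    if not name:
        raise RuntimeError("DATASETS_BUCKET_NAME is not set")
    return name
```

Passing the environment as a parameter keeps the helper easy to exercise in tests without touching the real process environment.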

Additional Notes

  • This function disables SSL verification when downloading files, as the sources are trusted internally.
  • Commits to the database occur in batches of 5 datasets to improve performance and avoid large transaction blocks.
  • If dry_run is enabled, no downloads, uploads, or DB modifications are performed. Only the number of affected datasets is logged.
  • The function is safe to rerun. It will only affect datasets missing required file metadata.
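
The batching behaviour mentioned in the notes above can be sketched as follows. This is a generic illustration, not the task's code: `process_in_batches`, `process`, and `commit` are hypothetical names, and `commit` stands in for whatever the database session exposes.

```python
from typing import Callable, Iterable

BATCH_SIZE = 5  # batch size stated in the notes above


def process_in_batches(
    datasets: Iterable,
    process: Callable[[object], None],
    commit: Callable[[], None],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Process each dataset, committing after every full batch and once at the end."""
    pending = 0
    total = 0
    for dataset in datasets:
        process(dataset)
        pending += 1
        total += 1
        if pending == batch_size:
            commit()  # flush a full batch to keep transactions small
            pending = 0
    if pending:
        commit()  # flush the final, partially filled batch
    return total
```

With 12 datasets and a batch size of 5, this sketch commits three times: after the 5th and 10th datasets, and once more for the remaining two.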