1 | 1 | # Download the Dataset |
2 | 2 |
3 | | -:::warning Deprecation Notice |
| 3 | +:::warning |
| 4 | + |
| 5 | +The previous Parquet export format (v1) is deprecated; see the [note](/docs/download/#legacy-format-v1) below. Please follow the [instructions](/docs/download/#export-v2-format) for the new v2 format.
4 | 6 |
5 | | -The current parquet download format will be deprecated. A new `/v2` endpoint will be introduced with an updated format. Documentation for the new format will be added once it is live. Feel free to use the export in its current form, but be aware that it will be replaced. |
6 | 7 | ::: |
7 | 8 |
8 | | -The whole dataset is exported daily in [Parquet](https://github.com/apache/parquet-format), a modern column-based data format that is directly queryable and compressed. ([Quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)). The parquet files are hosted at Cloudflare R2. |
9 | | - |
10 | | -The latest dataset manifest is available at [export.verifieralliance.org](https://export.verifieralliance.org/) in JSON format. |
11 | | - |
12 | | -The manifest contains each table under the `files` field. Each table is partitioned in a certain number of rows, depending on the size of the table. E.g. the `verified_contracts` table is partitioned in 1 million rows since each row is small, whereas the `sources` table is partitioned in 10,000 rows. |
13 | | - |
14 | | -Following each entry in the array under the table names, one can download each partition. For example, the first partition of the `verified_contracts` table is available at [https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet](https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet). |
15 | | - |
16 | | -The manifest looks like this: |
17 | | - |
18 | | -```json |
19 | | -{ |
20 | | - "timestamp": 1742781993318, |
21 | | - "dateStr": "2025-03-24T02:06:33.318021Z", |
22 | | - "files": { |
23 | | - "code": ["code/code_0_100000.parquet", "code/code_100000_200000.parquet", "code/code_200000_300000.parquet"], |
24 | | - "contracts": ["contracts/contracts_0_1000000.parquet"], |
25 | | - "contract_deployments": ["contract_deployments/contract_deployments_0_1000000.parquet"], |
26 | | - "compiled_contracts": [ |
27 | | - "compiled_contracts/compiled_contracts_0_10000.parquet", |
28 | | - "compiled_contracts/compiled_contracts_10000_20000.parquet", |
29 | | - "compiled_contracts/compiled_contracts_20000_30000.parquet", |
30 | | - "compiled_contracts/compiled_contracts_30000_40000.parquet", |
31 | | - "compiled_contracts/compiled_contracts_40000_50000.parquet", |
32 | | - "compiled_contracts/compiled_contracts_50000_60000.parquet", |
33 | | - "compiled_contracts/compiled_contracts_60000_70000.parquet" |
34 | | - ], |
35 | | - "compiled_contracts_sources": ["compiled_contracts_sources/compiled_contracts_sources_0_1000000.parquet"], |
36 | | - "sources": [ |
37 | | - "sources/sources_0_10000.parquet", |
38 | | - "sources/sources_10000_20000.parquet", |
39 | | - "sources/sources_20000_30000.parquet", |
40 | | - "sources/sources_30000_40000.parquet", |
41 | | - "sources/sources_40000_50000.parquet", |
42 | | - "sources/sources_50000_60000.parquet", |
43 | | - "sources/sources_60000_70000.parquet", |
44 | | - "sources/sources_70000_80000.parquet", |
45 | | - "sources/sources_80000_90000.parquet", |
46 | | - "sources/sources_90000_100000.parquet", |
47 | | - "sources/sources_100000_110000.parquet" |
48 | | - ], |
49 | | - "verified_contracts": ["verified_contracts/verified_contracts_0_1000000.parquet"] |
50 | | - } |
51 | | -} |
| 9 | +The entire Verifier Alliance dataset is exported continuously as [Parquet](https://github.com/apache/parquet-format) files, a modern columnar data format. Parquet files are compressed, efficient to query, and widely supported by data tools (see this [quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)).
| 10 | + |
| 11 | +The export is hosted on Google Cloud Storage and accessible via an S3-compatible API at [export.verifieralliance.org](https://export.verifieralliance.org/). |
| 12 | + |
| 13 | +## Export v2 Format |
| 14 | + |
| 15 | +The export format has undergone a redesign to make it more efficient and easier to use. The v2 format follows these principles: |
| 16 | + |
| 17 | +- New data is uploaded **daily**. |
| 18 | +- Each database **table** is stored as a set of Parquet files. |
| 19 | +- Files are partitioned by row ranges and **ordered** by `created_at` timestamps. |
| 20 | +- **Append-only** pattern: New data is added to new files, and existing files are not modified. The only exception is the most recent file of each table, which may be updated until it is full. This design is possible because the underlying database only ever inserts verified contracts; existing rows are never updated.
| 21 | +- **File metadata** (checksums, sizes, timestamps) is provided directly by the Google Cloud Storage API. |
| 22 | +- Files use **zstd compression** built into the Parquet format. |
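
Because the export is append-only, a client only needs to fetch files it has not seen yet, plus any file whose checksum changed (normally just the newest, not-yet-full partition of each table). The following is a minimal sketch of that decision, not part of the official tooling. It assumes the remote listing has already been turned into dictionaries with `Key` and `ETag` entries and that the client keeps its own record of previously downloaded ETags; `plan_sync` and `local_etags` are hypothetical names.

```python
from typing import Dict, Iterable, List


def plan_sync(remote: Iterable[Dict[str, str]], local_etags: Dict[str, str]) -> List[str]:
    """Return the keys that must be downloaded, in listing order.

    `remote` holds one dict per listed object with at least "Key" and "ETag".
    `local_etags` maps already-downloaded keys to the ETag seen at download time
    (a hypothetical client-side record, not something the export provides).
    """
    to_fetch = []
    for entry in remote:
        key, etag = entry["Key"], entry["ETag"]
        # New file, or an existing file whose contents changed since last sync.
        if local_etags.get(key) != etag:
            to_fetch.append(key)
    return to_fetch
```

With this approach, full partitions are downloaded once and never again, so the cost of a daily sync stays proportional to the amount of new data.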
| 23 | + |
| 24 | +The dataset is available at https://export.verifieralliance.org/. All files of the v2 format are stored under the `v2/` prefix. |
| 25 | + |
| 26 | +### Downloading and Syncing the Dataset |
| 27 | + |
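| 28 | +To download the entire dataset into the current directory, you can run the following command. It relies on GNU `grep` (for the `-P` option) and reads a single listing response, so it can miss files if the listing is truncated: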
| 29 | + |
| 30 | +```bash |
| 31 | +curl -s 'https://export.verifieralliance.org/?prefix=v2/' | \ |
| 32 | + grep -oP '(?<=<Key>)[^<]+' | \ |
| 33 | + xargs -I {} curl -L -O https://export.verifieralliance.org/{} |
| 34 | +``` |
| 35 | + |
| 36 | +Alternatively, the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions) makes it easy to download the dataset and keep it in sync. The following command downloads the entire dataset on the first run, and on subsequent runs only downloads new or modified files:
| 37 | + |
| 38 | +```bash |
| 39 | +aws s3 sync s3://verifier-alliance-parquet-export/v2/ ./verifier-alliance-dataset --endpoint-url https://storage.googleapis.com --no-sign-request |
| 40 | +``` |
| 41 | + |
| 42 | +### Working with Parquet Files |
| 43 | + |
| 44 | +Once downloaded, you can query and analyze Parquet files using various tools and libraries. Here are some popular options to give you a head start: |
| 45 | + |
| 46 | +- [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html): Read data from Parquet files in Python |
| 47 | +- [DuckDB](https://duckdb.org/docs/data/parquet): SQL queries on Parquet files |
| 48 | +- [pg_parquet](https://github.com/CrunchyData/pg_parquet): PostgreSQL extension for copying Parquet data into a Postgres database |
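
As an illustration for the Pandas option, here is a hedged sketch of loading one table from a local copy of the dataset. It assumes the directory layout produced by the `aws s3 sync` command above (one sub-directory per table) and file names of the form `<table>_<start>_<end>.parquet`, as in the examples on this page. The helper names are hypothetical, and `pandas` needs a Parquet engine such as `pyarrow` installed. Partitions are sorted by their numeric start row, since a plain alphabetical sort would place `10000000` before `2000000`.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Matches the trailing "_<start>_<end>.parquet" part of a partition file name.
_ROW_RANGE = re.compile(r"_(\d+)_(\d+)\.parquet$")


def partition_start(name: str) -> int:
    """Extract the starting row number from a partition file name."""
    match = _ROW_RANGE.search(name)
    if match is None:
        raise ValueError(f"unexpected partition file name: {name}")
    return int(match.group(1))


def ordered_partitions(names: Iterable[str]) -> List[str]:
    """Sort partition file names by their numeric start row."""
    return sorted(names, key=partition_start)


def load_table(dataset_dir: str, table: str):
    """Read all partitions of one table into a single DataFrame."""
    import pandas as pd  # third-party; imported lazily so the helpers above work without it

    files = ordered_partitions(str(p) for p in Path(dataset_dir, table).glob("*.parquet"))
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```

For example, `load_table("./verifier-alliance-dataset", "verified_contracts")` would read every downloaded partition of that table. Large tables such as `sources` may not fit in memory this way, in which case querying the files with DuckDB is likely the better fit.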
| 49 | + |
| 50 | +### API |
| 51 | + |
| 52 | +For more fine-grained control, you can browse and download files directly using the S3-compatible Google Cloud Storage API: |
| 53 | + |
| 54 | +**List all v2 files:** |
| 55 | + |
| 56 | +``` |
| 57 | +https://export.verifieralliance.org/?prefix=v2/ |
| 58 | +``` |
| 59 | + |
| 60 | +**List files for a specific table:** |
| 61 | + |
| 62 | +``` |
| 63 | +https://export.verifieralliance.org/?prefix=v2/verified_contracts/ |
| 64 | +``` |
| 65 | + |
| 66 | +**Download a specific file:** |
| 67 | + |
| 68 | +``` |
| 69 | +https://export.verifieralliance.org/v2/verified_contracts/verified_contracts_0_1000000.parquet |
| 70 | +``` |
| 71 | + |
| 72 | +The API returns XML responses following the [Google Cloud Storage XML API specification](https://cloud.google.com/storage/docs/xml-api/get-bucket-list). |
| 73 | + |
| 74 | +#### Available Tables |
| 75 | + |
| 76 | +The Parquet export is available for all VerA database tables: `verified_contracts`, `sources`, `compiled_contracts_sources`, `compiled_contracts`, `contract_deployments`, `contracts`, and `code`. |
| 77 | + |
| 78 | +#### API Parameters |
| 79 | + |
| 80 | +The most important parameters of the listing API are the following: |
| 81 | + |
| 82 | +- **prefix**: Filter results to objects whose names begin with this prefix (e.g., `?prefix=v2/verified_contracts/`) |
| 83 | +- **marker**: Start listing after this object name (for pagination) |
| 84 | +- **max-keys**: Maximum number of objects to return in one response |
| 85 | + |
| 86 | +The response from the listing API might be truncated, which is indicated by the `IsTruncated` field of the result. The `marker` parameter can be used to paginate through results by setting it to the `NextMarker` of the previous response. |
| 87 | + |
| 88 | +Example with pagination: |
| 89 | + |
| 90 | +``` |
| 91 | +https://export.verifieralliance.org/?prefix=v2/verified_contracts/&max-keys=2&marker=v2/verified_contracts/verified_contracts_1000000_2000000.parquet |
| 92 | +``` |
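
The pagination loop described above could be sketched in Python as follows, using only the standard library. This is an illustrative sketch rather than an official client: `parse_listing` and `list_objects` are hypothetical helper names, and the fallback to the last listed key when no `NextMarker` is present is an assumption borrowed from common S3-style listing behavior.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://export.verifieralliance.org/"


def _local(tag):
    """Drop the XML namespace, e.g. '{http://...}Key' -> 'Key'."""
    return tag.rsplit("}", 1)[-1]


def parse_listing(xml_data):
    """Parse one listing response into (entries, next_marker).

    Each entry is a dict of the fields inside a <Contents> element.
    next_marker is None when the listing is complete.
    """
    root = ET.fromstring(xml_data)
    entries, truncated, next_marker = [], False, None
    for child in root:
        name = _local(child.tag)
        if name == "Contents":
            entries.append({_local(field.tag): field.text or "" for field in child})
        elif name == "IsTruncated":
            truncated = (child.text or "").strip().lower() == "true"
        elif name == "NextMarker":
            next_marker = child.text
    if not truncated:
        return entries, None
    if next_marker is None and entries:
        # Assumed fallback: continue after the last key of this page.
        next_marker = entries[-1]["Key"]
    return entries, next_marker


def list_objects(prefix="v2/"):
    """Yield every object under `prefix`, following pagination."""
    marker = None
    while True:
        params = {"prefix": prefix}
        if marker:
            params["marker"] = marker
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            entries, marker = parse_listing(response.read())
        yield from entries
        if marker is None:
            return
```

Keeping the parsing separate from the network call makes it possible to check the pagination logic against a saved response without contacting the server.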
| 93 | + |
| 94 | +#### Metadata |
| 95 | + |
| 96 | +The listing API provides detailed metadata for each of the Parquet files: |
| 97 | + |
| 98 | +```xml |
| 99 | +<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01"> |
| 100 | + <Name>verifier-alliance-parquet-export</Name> |
| 101 | + <Prefix>v2/</Prefix> |
| 102 | + <Marker/> |
| 103 | + <IsTruncated>false</IsTruncated> |
| 104 | + <Contents> |
| 105 | + <Key>v2/code/code_0_100000.parquet</Key> |
| 106 | + <Generation>1766065018286394</Generation> |
| 107 | + <MetaGeneration>1</MetaGeneration> |
| 108 | + <LastModified>2025-12-18T13:36:58.292Z</LastModified> |
| 109 | + <ETag>"ba687acd0afab85ed203a593479f0ce3"</ETag> |
| 110 | + <Size>101591414</Size> |
| 111 | + </Contents> |
| 112 | + <!-- More entries... --> |
| 113 | +</ListBucketResult> |
52 | 114 | ``` |
53 | 115 |
| 116 | +The most important fields are:
| 117 | + |
| 118 | +- **Key**: The file path (download at `https://export.verifieralliance.org/{Key}`) |
| 119 | +- **LastModified**: When the file was last uploaded/modified |
| 120 | +- **ETag**: MD5 hash of the file contents (use this to detect changes) |
| 121 | +- **Size**: File size in bytes |
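
As a sketch of how these fields can be used, the snippet below checks whether a local file still matches its listing entry by comparing the size first and then the MD5 digest. It assumes, as stated above, that the `ETag` is the hex MD5 of the file contents, and that the value arrives wrapped in double quotes as in the XML example. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_up_to_date(path, etag, size):
    """Return True if the local file matches the listed Size and ETag.

    `etag` and `size` are the raw strings from the listing entry.
    Assumes the ETag is the file's MD5 hash, as described above.
    """
    local = Path(path)
    if not local.is_file():
        return False
    # Cheap check first: a size mismatch means the file must be re-downloaded.
    if local.stat().st_size != int(size):
        return False
    return md5_hex(local) == etag.strip('"').lower()
```

A file for which `is_up_to_date` returns `False` is either missing, incomplete, or outdated and can simply be downloaded again from `https://export.verifieralliance.org/{Key}`.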
| 122 | + |
| 123 | +## Legacy Format (v1) |
| 124 | + |
| 125 | +:::warning Deprecation Notice |
| 126 | + |
| 127 | +The v1 Parquet export format is **no longer updated**. All new data is only available in the v2 format. Please migrate to v2 for access to current data. |
| 128 | + |
| 129 | +::: |
| 130 | + |
| 131 | +The legacy v1 format files can still be accessed via non-prefixed paths in the bucket (e.g., `https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet`). |
| 132 | + |
| 133 | +The v1 format used a JSON manifest file at [https://export.verifieralliance.org/manifest.json](https://export.verifieralliance.org/manifest.json) listing all available Parquet files. However, this format was not append-only. Each daily export regenerated all files, requiring users to download the entire dataset again after every update. The manifest also did not include checksums or modification timestamps, making it difficult to determine what changed between exports. |
| 134 | + |
| 135 | +## Export Script |
| 136 | + |
54 | 137 | The source code of the export script is available at [https://github.com/verifier-alliance/parquet-export](https://github.com/verifier-alliance/parquet-export). |