Commit cfa3ceb

Update dataset download documentation for Parquet export v2 (#5)
* Update dataset download documentation for Parquet export v2
* Add warning about deprecation of Parquet export v1 at the top
1 parent ea91a93 commit cfa3ceb

1 file changed

Lines changed: 129 additions & 46 deletions

docs/5. download.md

@@ -1,54 +1,137 @@
11
# Download the Dataset
22

3-
:::warning Deprecation Notice
3+
:::warning
4+
5+
The previous Parquet export format v1 is now deprecated. See the [note](/docs/download/#legacy-format-v1) below. Please follow the [instructions](/docs/download/#export-v2-format) for the new v2 format.
46

5-
The current parquet download format will be deprecated. A new `/v2` endpoint will be introduced with an updated format. Documentation for the new format will be added once it is live. Feel free to use the export in its current form, but be aware that it will be replaced.
67
:::
78

8-
The whole dataset is exported daily in [Parquet](https://github.com/apache/parquet-format), a modern column-based data format that is directly queryable and compressed. ([Quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)). The parquet files are hosted at Cloudflare R2.
9-
10-
The latest dataset manifest is available at [export.verifieralliance.org](https://export.verifieralliance.org/) in JSON format.
11-
12-
The manifest contains each table under the `files` field. Each table is partitioned in a certain number of rows, depending on the size of the table. E.g. the `verified_contracts` table is partitioned in 1 million rows since each row is small, whereas the `sources` table is partitioned in 10,000 rows.
13-
14-
Following each entry in the array under the table names, one can download each partition. For example, the first partition of the `verified_contracts` table is available at [https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet](https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet).
15-
16-
The manifest looks like this:
17-
18-
```json
19-
{
20-
"timestamp": 1742781993318,
21-
"dateStr": "2025-03-24T02:06:33.318021Z",
22-
"files": {
23-
"code": ["code/code_0_100000.parquet", "code/code_100000_200000.parquet", "code/code_200000_300000.parquet"],
24-
"contracts": ["contracts/contracts_0_1000000.parquet"],
25-
"contract_deployments": ["contract_deployments/contract_deployments_0_1000000.parquet"],
26-
"compiled_contracts": [
27-
"compiled_contracts/compiled_contracts_0_10000.parquet",
28-
"compiled_contracts/compiled_contracts_10000_20000.parquet",
29-
"compiled_contracts/compiled_contracts_20000_30000.parquet",
30-
"compiled_contracts/compiled_contracts_30000_40000.parquet",
31-
"compiled_contracts/compiled_contracts_40000_50000.parquet",
32-
"compiled_contracts/compiled_contracts_50000_60000.parquet",
33-
"compiled_contracts/compiled_contracts_60000_70000.parquet"
34-
],
35-
"compiled_contracts_sources": ["compiled_contracts_sources/compiled_contracts_sources_0_1000000.parquet"],
36-
"sources": [
37-
"sources/sources_0_10000.parquet",
38-
"sources/sources_10000_20000.parquet",
39-
"sources/sources_20000_30000.parquet",
40-
"sources/sources_30000_40000.parquet",
41-
"sources/sources_40000_50000.parquet",
42-
"sources/sources_50000_60000.parquet",
43-
"sources/sources_60000_70000.parquet",
44-
"sources/sources_70000_80000.parquet",
45-
"sources/sources_80000_90000.parquet",
46-
"sources/sources_90000_100000.parquet",
47-
"sources/sources_100000_110000.parquet"
48-
],
49-
"verified_contracts": ["verified_contracts/verified_contracts_0_1000000.parquet"]
50-
}
51-
}
9+
The entire Verifier Alliance dataset is exported continuously as [Parquet](https://github.com/apache/parquet-format) files. Parquet is a modern columnar data format that is compressed, efficient to query, and widely supported by data tools ([quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)).
10+
11+
The export is hosted on Google Cloud Storage and accessible via an S3-compatible API at [export.verifieralliance.org](https://export.verifieralliance.org/).
12+
13+
## Export v2 Format
14+
15+
The export format was redesigned to be more efficient and easier to use. The v2 format follows these principles:
16+
17+
- New data is uploaded **daily**.
18+
- Each database **table** is stored as a set of Parquet files.
19+
- Files are partitioned by row ranges and **ordered** by `created_at` timestamps.
20+
- **Append-only** pattern: New data is added to new files and existing files are not modified. The only exception is the most recent file of each table, which may be updated until it is full. This design is possible because rows in the underlying database are only inserted and never updated.
21+
- **File metadata** (checksums, sizes, timestamps) is provided directly by the Google Cloud Storage API.
22+
- Files use **zstd compression** built into the Parquet format.
23+
24+
The dataset is available at https://export.verifieralliance.org/. All files of the v2 format are stored under the `v2/` prefix.
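
The row-range partitioning described above can be read back out of the file names. The following is a minimal sketch, assuming the naming scheme `v2/<table>/<table>_<start>_<end>.parquet` inferred from the examples in this document; `parse_partition` and `sort_partitions` are hypothetical helper names, not part of the export tooling. Sorting numerically matters because a plain string sort would place `sources_100000_110000.parquet` before `sources_20000_30000.parquet`.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed naming scheme (inferred from the examples in this document):
#   v2/<table>/<table>_<start>_<end>.parquet
PARTITION_RE = re.compile(
    r"^(?:v2/)?(?P<table>\w+)/(?P=table)_(?P<start>\d+)_(?P<end>\d+)\.parquet$"
)


class Partition(NamedTuple):
    table: str
    start: int  # first row offset encoded in the file name
    end: int    # end row offset encoded in the file name
    key: str    # the original object key


def parse_partition(key: str) -> Partition:
    """Split an object key into table name and row range."""
    match = PARTITION_RE.match(key)
    if match is None:
        raise ValueError(f"unexpected object key: {key!r}")
    return Partition(match["table"], int(match["start"]), int(match["end"]), key)


def sort_partitions(keys: Iterable[str]) -> List[Partition]:
    """Order partitions by table, then numerically by starting row."""
    return sorted((parse_partition(k) for k in keys), key=lambda p: (p.table, p.start))
```

Reading the partitions of one table in this order should preserve the `created_at` ordering that the export promises.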
25+
26+
### Downloading and Syncing the Dataset
27+
28+
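To download the entire dataset, you can run the command below. It relies on GNU `grep` (for the `-P` option) and only covers the first page of the listing if the response is truncated: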
29+
30+
```bash
31+
curl -s 'https://export.verifieralliance.org/?prefix=v2/' | \
32+
grep -oP '(?<=<Key>)[^<]+' | \
33+
xargs -I {} curl -L -O https://export.verifieralliance.org/{}
34+
```
35+
36+
Alternatively, the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions) makes it easy to download and keep the dataset in sync. The following command downloads the entire dataset on the first run, and on subsequent runs only downloads new or modified files:
37+
38+
```bash
39+
aws s3 sync s3://verifier-alliance-parquet-export/v2/ ./verifier-alliance-dataset --endpoint-url https://storage.googleapis.com --no-sign-request
40+
```
41+
42+
### Working with Parquet Files
43+
44+
Once downloaded, you can query and analyze Parquet files using various tools and libraries. Here are some popular options to give you a head start:
45+
46+
- [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html): Read data from Parquet files in Python
47+
- [DuckDB](https://duckdb.org/docs/data/parquet): SQL queries on Parquet files
48+
- [pg_parquet](https://github.com/CrunchyData/pg_parquet): PostgreSQL extension for copying Parquet data into a Postgres database
49+
50+
### API
51+
52+
For more fine-grained control, you can browse and download files directly using the S3-compatible Google Cloud Storage API:
53+
54+
**List all v2 files:**
55+
56+
```
57+
https://export.verifieralliance.org/?prefix=v2/
58+
```
59+
60+
**List files for a specific table:**
61+
62+
```
63+
https://export.verifieralliance.org/?prefix=v2/verified_contracts/
64+
```
65+
66+
**Download a specific file:**
67+
68+
```
69+
https://export.verifieralliance.org/v2/verified_contracts/verified_contracts_0_1000000.parquet
70+
```
71+
72+
Listing requests return XML responses following the [Google Cloud Storage XML API specification](https://cloud.google.com/storage/docs/xml-api/get-bucket-list).
73+
74+
#### Available Tables
75+
76+
The Parquet export is available for all VerA database tables: `verified_contracts`, `sources`, `compiled_contracts_sources`, `compiled_contracts`, `contract_deployments`, `contracts`, and `code`.
77+
78+
#### API Parameters
79+
80+
The most important parameters of the listing API are the following:
81+
82+
- **prefix**: Filter results to objects whose names begin with this prefix (e.g., `?prefix=v2/verified_contracts/`)
83+
- **marker**: Start listing after this object name (for pagination)
84+
- **max-keys**: Maximum number of objects to return in one response
85+
86+
The response from the listing API might be truncated, which is indicated by the `IsTruncated` field of the result. The `marker` parameter can be used to paginate through results by setting it to the `NextMarker` of the previous response.
87+
88+
Example with pagination:
89+
90+
```
91+
https://export.verifieralliance.org/?prefix=v2/verified_contracts/&max-keys=2&marker=v2/verified_contracts/verified_contracts_1000000_2000000.parquet
92+
```
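
The pagination rule above can be sketched in Python using only the standard library. `listing_url` and `parse_listing` are hypothetical helper names. Falling back to the last returned key when a truncated response carries no `NextMarker` is an assumption of this sketch, not something the source states.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

BASE_URL = "https://export.verifieralliance.org/"


def listing_url(prefix: str, marker: Optional[str] = None,
                max_keys: Optional[int] = None) -> str:
    """Build a listing URL from the prefix, marker and max-keys parameters."""
    params = {"prefix": prefix}
    if marker:
        params["marker"] = marker
    if max_keys:
        params["max-keys"] = str(max_keys)
    # Keep "/" unescaped so the URL looks like the examples in this document.
    return BASE_URL + "?" + urlencode(params, safe="/")


def _local_name(tag: str) -> str:
    """Drop the XML namespace part of a tag such as '{ns}Key'."""
    return tag.rsplit("}", 1)[-1]


def parse_listing(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return (keys, next_marker); next_marker is None on the last page."""
    root = ET.fromstring(xml_text)
    keys: List[str] = []
    truncated = False
    next_marker: Optional[str] = None
    for child in root:
        name = _local_name(child.tag)
        if name == "Contents":
            for field in child:
                if _local_name(field.tag) == "Key" and field.text:
                    keys.append(field.text)
        elif name == "IsTruncated":
            truncated = (child.text or "").strip().lower() == "true"
        elif name == "NextMarker":
            next_marker = child.text
    if not truncated:
        return keys, None
    if next_marker is None and keys:
        # Assumption: without NextMarker, continue after the last listed key.
        next_marker = keys[-1]
    return keys, next_marker
```

A caller would fetch `listing_url(prefix, marker)` repeatedly, for example with `urllib.request.urlopen`, passing each returned marker into the next request until `parse_listing` returns `None` as the marker.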
93+
94+
#### Metadata
95+
96+
The listing API provides detailed metadata for each of the Parquet files:
97+
98+
```xml
99+
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
100+
<Name>verifier-alliance-parquet-export</Name>
101+
<Prefix>v2/</Prefix>
102+
<Marker/>
103+
<IsTruncated>false</IsTruncated>
104+
<Contents>
105+
<Key>v2/code/code_0_100000.parquet</Key>
106+
<Generation>1766065018286394</Generation>
107+
<MetaGeneration>1</MetaGeneration>
108+
<LastModified>2025-12-18T13:36:58.292Z</LastModified>
109+
<ETag>"ba687acd0afab85ed203a593479f0ce3"</ETag>
110+
<Size>101591414</Size>
111+
</Contents>
112+
<!-- More entries... -->
113+
</ListBucketResult>
52114
```
53115

116+
The most important fields are:
117+
118+
- **Key**: The file path (download at `https://export.verifieralliance.org/{Key}`)
119+
- **LastModified**: When the file was last uploaded/modified
120+
- **ETag**: MD5 hash of the file contents (use this to detect changes)
121+
- **Size**: File size in bytes
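
These fields are enough to decide whether a local copy of a file is still current. The following is a minimal sketch, assuming (as stated above) that the `ETag` is the MD5 hash of the file contents for every object; `file_md5` and `needs_download` are hypothetical helper names.

```python
import hashlib
import os


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hex MD5 of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: str, size: int, etag: str) -> bool:
    """Compare a local file against the Size and ETag from the listing."""
    if not os.path.exists(path):
        return True
    # Cheap check first: a size mismatch means the file changed.
    if os.path.getsize(path) != size:
        return True
    # The listing wraps the ETag in double quotes, so strip them.
    return file_md5(path) != etag.strip('"').lower()
```

Because of the append-only design, this check is expected to report a change only for the most recent file of each table.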
122+
123+
## Legacy Format (v1)
124+
125+
:::warning Deprecation Notice
126+
127+
The v1 Parquet export format is **no longer updated**. All new data is only available in the v2 format. Please migrate to v2 for access to current data.
128+
129+
:::
130+
131+
The legacy v1 format files can still be accessed via non-prefixed paths in the bucket (e.g., `https://export.verifieralliance.org/verified_contracts/verified_contracts_0_1000000.parquet`).
132+
133+
The v1 format used a JSON manifest file at [https://export.verifieralliance.org/manifest.json](https://export.verifieralliance.org/manifest.json) listing all available Parquet files. However, this format was not append-only. Each daily export regenerated all files, requiring users to download the entire dataset again after every update. The manifest also did not include checksums or modification timestamps, making it difficult to determine what changed between exports.
134+
135+
## Export Script
136+
54137
The source code of the export script is available at [https://github.com/verifier-alliance/parquet-export](https://github.com/verifier-alliance/parquet-export).
