---
title: Retrieving Data from the Software Heritage S3 Graph Dataset Using Amazon Athena
authors:
- 'AydanGasimova'
date: '2026-03-25'
category: 'Guide'
heroImage: 'https://images.unsplash.com/photo-1483736762161-1d107f3c78e1?q=80&w=1374&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'
imageAuthor: 'Marco Palumbo'
imageAuthorLink: 'https://unsplash.com/@sapporo2025'
subtitle: 'Retrieving Data from the Software Heritage S3 Graph Dataset Using Amazon Athena.'
tags:
- FAIR data
- public data archive
- Software Heritage
- AWS
- AWS S3 bucket
- Athena API
- Parquet
---

[Software Heritage](https://docs.softwareheritage.org) (SWH) is one of the most ambitious efforts to archive the world's source code. The idea is simple: collect everything, keep it long-term, and make it accessible, not just for today, but also for future generations of researchers and developers. Beyond that, it also serves as a powerful resource for large-scale code analysis.

In this guide, we explore how to use the SWH [Graph Dataset](https://docs.softwareheritage.org/devel/swh-export/graph/) on [Amazon Athena](https://docs.aws.amazon.com/athena/latest/APIReference/Welcome.html), using the retrieval of README files from GitHub repositories as an example. We cover the full traversal path, the filters we apply, how results are written to a user-owned [AWS S3 bucket](https://aws.amazon.com/s3/), and the potential cost of running this workflow.

## Mission of Software Heritage

For over a decade, SWH has been archiving publicly available source code from across the internet. Today, the archive holds billions of source files spanning millions of repositories. Rather than simple repository snapshots, SWH models software as a fully deduplicated [Merkle DAG](https://docs.ipfs.tech/concepts/merkle-dag/) in which every object is hash-addressed, which makes the archive highly reproducible. The resulting graph dataset is exported in [Apache ORC](https://orc.apache.org/) format and accessible through public S3 buckets on AWS.

## Prerequisites

Before we get started, you'll need to make sure you have access to the following:

- An AWS account with an IAM user or role
- The correct permissions attached to that IAM user or role (listed in the setup steps below)

### Set up an AWS S3 bucket

1. Create (or confirm) an S3 bucket for Athena query outputs:

```bash
aws s3 mb s3://<your-bucket-name> --region us-east-1
```

2. Verify that your IAM user or role has the required Athena, Glue, and S3 permissions for running queries and storing results:

```text
athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults, athena:StopQueryExecution
glue:GetTable, glue:GetDatabase, glue:GetPartitions
s3:PutObject, s3:GetObject, s3:ListBucket
```

### Step 1. Accessing the Software Heritage Graph

We start by creating an external Athena table that points to the latest Software Heritage snapshot (2025-10-08 at the time of writing), which lets us query the origin data.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS swh_graph_2025_10_08.origin (
    id STRING,
    url STRING
)
STORED AS ORC
LOCATION 's3://softwareheritage/graph/2025-10-08/origin/';
```

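The same `CREATE EXTERNAL TABLE` pattern is repeated for each of the other tables queried below (`origin_visit_status`, `snapshot_branch`, `revision`, `directory_entry`, `content`). A minimal sketch, assuming the database does not already exist and that the other tables follow the same S3 path layout as `origin`; the column subset and types shown are illustrative, so take the exact definitions from the SWH export schema:

```sql
-- Each statement below is run as a separate Athena query.

-- Create the database that holds the external tables (if it does not exist yet)
CREATE DATABASE IF NOT EXISTS swh_graph_2025_10_08;

-- Illustrative sketch for one more table: only the columns used later are listed,
-- and the names/types should be checked against the SWH export schema.
CREATE EXTERNAL TABLE IF NOT EXISTS swh_graph_2025_10_08.origin_visit_status (
    origin   STRING,
    date     TIMESTAMP,
    snapshot BINARY
)
STORED AS ORC
LOCATION 's3://softwareheritage/graph/2025-10-08/origin_visit_status/'; -- assumed path layout

-- Quick check that the origin table is readable
SELECT id, url FROM swh_graph_2025_10_08.origin LIMIT 10;
```
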
Once the Athena tables are set up, the goal is to retrieve repository URLs, visit dates, and SHA-1 identifiers. To do this, it helps to understand the SWH graph schema, shown below:

<figure>
<img src="/images/blog/athena.png" alt="Software Heritage relational schema" width="70%" />
<figcaption>
SWH
<a href="https://docs.softwareheritage.org/devel/swh-export/graph/schema.html" target="_blank" rel="noopener noreferrer">
relational schema
</a>
</figcaption>
</figure>

An initial attempt might be to join all six tables in a single query, as shown below:

```sql
-- Initial attempt: single-pass join (exceeded resource limits)
SELECT o.url, ovs.date AS visit_date, c.sha1 AS content_sha1
FROM swh_graph_2025_10_08.origin o
JOIN swh_graph_2025_10_08.origin_visit_status ovs ON o.url = ovs.origin
JOIN swh_graph_2025_10_08.snapshot_branch sb ON ovs.snapshot = sb.snapshot_id
JOIN swh_graph_2025_10_08.revision r ON sb.target = r.id
JOIN swh_graph_2025_10_08.directory_entry de ON r.directory = de.directory_id
JOIN swh_graph_2025_10_08.content c ON de.target = c.sha1_git
WHERE sb.target_type = 'revision' AND de.type = 'file';
```

This query runs into Athena's resource limits quickly. The SWH tables are large and unpartitioned, and without traditional indexing, joining multiple large tables at once significantly increases scan and shuffle costs. More details on the dataset structure can be found in the [SWH article](https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf).

To address this, we break the process into incremental steps, storing results in tables along the way.

### Step 2. Extracting Repository URLs and Visit Data

We start with the origin table, pulling around 400 million repository URLs, then stage intermediate results to progressively narrow the working set. From `origin_visit_status`, we extract around three billion visit records, each representing a crawl attempt and its associated snapshot.

```sql
CREATE TABLE default.url_and_date AS
SELECT
    o.url,
    ovs.date AS visit_date
FROM swh_graph_2025_10_08.origin o
JOIN swh_graph_2025_10_08.origin_visit_status ovs
    ON o.url = ovs.origin;
```

Using these dates, we then retrieve the snapshot identifiers:

```sql
CREATE TABLE default.url_date_snapshot_2a AS
SELECT
    u.url,
    u.visit_date,
    ovs.snapshot AS snapshot_id
FROM default.url_and_date u
JOIN swh_graph_2025_10_08.origin_visit_status ovs
    ON u.url = ovs.origin
    AND u.visit_date = ovs.date
WHERE ovs.snapshot IS NOT NULL;
```

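As a quick, optional check that the staged tables are in place and roughly match the scale described above (around three billion visit records), their rows can be counted. A minimal sketch; these queries scan only the intermediate tables, not the full SWH dataset, and each statement is run as its own Athena query:

```sql
-- Optional sanity checks on the staged tables from Step 2
SELECT COUNT(*) AS visit_rows FROM default.url_and_date;

SELECT COUNT(*) AS snapshot_rows FROM default.url_date_snapshot_2a;
```
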
### Step 3. Linking Snapshots to Revisions and Directories

After obtaining the snapshot IDs, a direct export of the `snapshot_branch` table hit Athena's resource limits, so we filter for the `main` and `master` branches only. Note that repositories using a different default branch name may be under-represented.

```sql
CREATE TABLE default.snapshot_branch_filtered AS
SELECT
    snapshot_id,
    target AS revision_id
FROM swh_graph_2025_10_08.snapshot_branch
WHERE target_type = 'revision'
    AND (
        name = CAST('refs/heads/main' AS VARBINARY)
        OR name = CAST('refs/heads/master' AS VARBINARY)
    );
```

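To get a feel for how much the `main`/`master` restriction leaves out, one exploratory option is to look at the most frequent branch names first. This is only a sketch, and note that it scans the very large `snapshot_branch` table, so it carries a real scan cost:

```sql
-- Exploratory: most common branch names pointing at revisions.
-- Warning: this scans the large snapshot_branch table and is billed accordingly.
SELECT
    from_utf8(name) AS branch_name,
    COUNT(*) AS branch_count
FROM swh_graph_2025_10_08.snapshot_branch
WHERE target_type = 'revision'
GROUP BY from_utf8(name)
ORDER BY branch_count DESC
LIMIT 20;
```
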
After filtering, we join the snapshot results to the filtered branches to obtain each repository's revision, and then join those revisions to the `revision` table to obtain their root directory identifiers.

```sql
CREATE TABLE default.url_date_branch_2b AS
SELECT
    u.url,
    u.visit_date,
    sf.revision_id
FROM default.url_date_snapshot_2a u
JOIN default.snapshot_branch_filtered sf
    ON u.snapshot_id = sf.snapshot_id;


CREATE TABLE default.url_date_rev_2c AS
SELECT
    b.url,
    b.visit_date,
    r.directory AS directory_id
FROM default.url_date_branch_2b b
JOIN swh_graph_2025_10_08.revision r
    ON b.revision_id = r.id;
```

### Step 4. Extracting README Entries

This is the most expensive step, as the `directory_entry` table is one of the largest in the dataset at around 24 TB. To keep it manageable, we filter for just four README filename variants by matching their hexadecimal filename encodings.

```sql
CREATE TABLE default.directory_entry_readme AS
SELECT
    directory_id,
    target AS content_sha1_git
FROM swh_graph_2025_10_08.directory_entry
WHERE type = 'file'
    AND name IN (
        X'524541444D452E6D64',
        X'726561646D652E6D64',
        X'524541444D45',
        X'524541444D452E747874'
    );
```

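For reference, the four hex literals decode to `README.md`, `readme.md`, `README`, and `README.txt`; filenames in `directory_entry` are stored as raw bytes, which is why they are matched in hexadecimal. If you want to cover additional variants (say, `README.rst`, which is not part of the original filter), the encodings can be generated directly in Athena:

```sql
-- Generate the hexadecimal encoding of a filename to use in the IN (...) list above
SELECT
    to_hex(to_utf8('README.md'))  AS readme_md,    -- 524541444D452E6D64
    to_hex(to_utf8('readme.md'))  AS readme_md_lc, -- 726561646D652E6D64
    to_hex(to_utf8('README'))     AS readme_plain, -- 524541444D45
    to_hex(to_utf8('README.txt')) AS readme_txt;   -- 524541444D452E747874
```
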
### Step 5. Resolving Git SHA-1 to Canonical SHA-1

Once we have the directory-level `sha1_git` values, we split the remaining work into three steps. First, we pull the distinct `content_sha1_git` values from the intermediate table. Then we join this smaller set against the content table to get the matching `sha1_git` and `sha1` pairs. Finally, we join everything back with the original URL and date. Breaking it up this way keeps join sizes manageable and avoids resource exhaustion errors.

```sql
CREATE TABLE default.url_date_directory_sha_3b AS
SELECT
    u.url,
    u.visit_date,
    d.content_sha1_git
FROM default.url_date_rev_2c u
JOIN default.directory_entry_readme d
    ON u.directory_id = d.directory_id;


CREATE TABLE default.filtered_directory_sha1 AS
SELECT DISTINCT content_sha1_git
FROM default.url_date_directory_sha_3b;


CREATE TABLE default.content_matched AS
SELECT c.sha1_git, c.sha1
FROM swh_graph_2025_10_08.content c
JOIN default.filtered_directory_sha1 f
    ON c.sha1_git = f.content_sha1_git;


CREATE TABLE default.url_content_final AS
SELECT d.url, d.visit_date, d.content_sha1_git, cm.sha1
FROM default.url_date_directory_sha_3b d
JOIN default.content_matched cm
    ON d.content_sha1_git = cm.sha1_git;
```

By following the steps above, you can retrieve GitHub repository records and store their URLs, visit dates, and SHA-1 identifiers in an intermediate table. In our run, this resulted in over 450 million records.

### Step 6. Deduplicating Repositories

We first restrict the results to GitHub URLs and then deduplicate them, keeping one record per repository using the most recent visit date, with `MAX_BY` ensuring the content hash matches that latest visit.

```sql
CREATE TABLE default.filtered_github_total_table AS
SELECT url, content_sha1_git, sha1, visit_date
FROM default.url_content_final
WHERE url LIKE 'https://github.com/%';

CREATE TABLE default.filtered_github_unique AS
SELECT
    url,
    MAX_BY(content_sha1_git, visit_date) AS content_sha1_git,
    MAX(visit_date) AS visit_date
FROM default.filtered_github_total_table
GROUP BY url;
```

As a result, the dataset is reduced to approximately 225 million rows. After excluding records with empty `sha1_git` values, the final dataset contains approximately 223 million rows.

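The exclusion of empty `sha1_git` values mentioned above can be checked (or applied) with a simple filter. A minimal sketch, assuming missing hashes show up as NULL or zero-length values:

```sql
-- Count the rows that survive the empty-hash exclusion described above
SELECT COUNT(*) AS non_empty_rows
FROM default.filtered_github_unique
WHERE content_sha1_git IS NOT NULL
  AND length(content_sha1_git) > 0;
```
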
## Computational Cost Breakdown Across Processing Steps

Working with a dataset of this size comes with real costs. Athena charges $5 per TB of data scanned, so even though the SWH dataset itself is hosted as a free public dataset, every query over it is billed for what it scans. In addition, any intermediate tables are stored in a user-managed S3 bucket (~$0.023 per GB per month at the time of writing). The approximate cost breakdown for each step, based on our run, is shown below:

| Step | Stage Description | Data Scanned | Cost (USD) |
|------|-------------------|--------------|------------|
| 1 | Accessing SWH via Athena | Minimal | Minimal |
| 2 | Extracting URLs and Visit Data | ~660 GB | ~$3.30 |
| 3 | Linking Snapshots to Revisions | ~1.84 TB | ~$9.20 |
| 4 | Extracting README Entries | ~26 TB | ~$130.00 |
| 5 | Resolving Git SHA-1 to Canonical SHA-1 | ~1.4 TB | ~$7.00 |
| 6 | GitHub Filtering and Deduplication | Minimal | Minimal |

Although materializing intermediate tables improves performance, operations on the largest SWH tables remain costly. In particular, Step 4 is the most expensive, driven by the size of the `directory_entry` table.

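Summed across the run, that is roughly 30 TB scanned (about 0.66 + 1.84 + 26 + 1.4 TB), or on the order of $150 in Athena query charges, before the S3 storage cost of the intermediate tables.
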
## Results

After working through the query sequence step by step, you can obtain a consolidated table of unique GitHub URLs, along with their latest visit dates and the SHA-1 identifiers of their README files. In our case, this process produced approximately 223 million rows. An example is shown below.

| Repository URL | README file SHA-1 | Visit date |
|----------------|-------------------|------------|
| https://github.com/fairdataihub/fairshare | 8964359b0597187a29028955ecc3845dfcf86173 | 2025-07-29 12:26:33 |
| https://github.com/megasanjay/fairdataihub.org | 458585c7c8b579e4547d445cb49d496b1be1ba19 | 2023-08-19 21:50:20 |
| https://github.com/fairdataihub/SODA-for-SPARC-Docs | 47acbdb4c775d1bb5bbd127fde9287211eee504c | 2025-10-06 11:55:02 |

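As mentioned in the introduction, query results live in your own S3 bucket; if you want a durable, queryable copy of the final table there (for example in Parquet), an Athena CTAS statement can write it out. A sketch, with the bucket path and export table name as placeholders:

```sql
-- Hypothetical export of the final table to your own bucket as Parquet
CREATE TABLE default.filtered_github_unique_export
WITH (
    format = 'PARQUET',
    external_location = 's3://<your-bucket-name>/swh/filtered_github_unique/'
) AS
SELECT url, content_sha1_git, visit_date
FROM default.filtered_github_unique;
```
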
## Conclusion

In this guide, we walk through a practical approach to extracting README content hashes from the Software Heritage Graph Dataset using Amazon Athena. Breaking large joins into incremental steps and materializing intermediate tables keeps the workflow manageable at scale. This process produces a dataset of GitHub URLs paired with SHA-1 hashes that can be used for downstream tasks such as DOI mining and software citation analysis, and the same approach can be adapted for other large-scale archival queries.