8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -36,10 +36,10 @@ jobs:
  - name: Build
  run: bun run build

- - name: Type check (web)
- run: bun run --cwd apps/web typecheck
+ - name: Type check (site)
+ run: bun run --cwd apps/site typecheck

- - name: Build (web)
+ - name: Build (site)
  env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
- run: bun run --cwd apps/web build
+ run: bun run --cwd apps/site build
15 changes: 7 additions & 8 deletions .github/workflows/deploy-site.yml
@@ -6,8 +6,7 @@ on:
  branches: [main]
  paths:
  - ".github/workflows/deploy-site.yml"
- - "apps/web/**"
- - "!apps/web/worker/**"
+ - "apps/site/**"

  jobs:
  deploy:
@@ -28,7 +27,7 @@ jobs:
  - name: Build site
  env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
- run: bun run --cwd apps/web build
+ run: bun run --cwd apps/site build

  - name: Upload site to R2
  env:
@@ -45,15 +44,15 @@ jobs:
  # with stale-while-revalidate for 1 hour. A Cloudflare Cache Rule on
  # the zone enables caching for these specific paths (extensionless
  # HTML isn't cached by Cloudflare default); see infrastructure notes
- # in apps/web/README.md. Static assets (favicons, og-image, logo)
+ # in apps/site/README.md. Static assets (favicons, og-image, logo)
  # keep Cloudflare's default static-asset cache behavior - no header.
  CACHE_HTML="public, max-age=300, stale-while-revalidate=3600"

  # Non-HTML assets: copy the finite Astro output directly. Do not use
  # `aws s3 sync` at the bucket root because the same bucket also stores
  # the million-plus scraper-managed /documents/ and /extracted/ objects.
- find apps/web/dist -type f ! -name "*.html" -print0 | while IFS= read -r -d '' f; do
- rel="${f#apps/web/dist/}"
+ find apps/site/dist -type f ! -name "*.html" -print0 | while IFS= read -r -d '' f; do
+ rel="${f#apps/site/dist/}"
  case "$rel" in
  sitemap.xml|robots.txt|llms.txt)
  aws s3 cp "$f" "$R2_BUCKET/$rel" \
@@ -70,8 +69,8 @@ jobs:
  # HTML files: upload to extensionless keys to match canonical URLs.
  # /classification.html -> r2://classification, /types/legal.html -> r2://types/legal.
  # index.html is the one exception; it stays as a static homepage at /.
- find apps/web/dist -name "*.html" -print0 | while IFS= read -r -d '' f; do
- rel="${f#apps/web/dist/}"
+ find apps/site/dist -name "*.html" -print0 | while IFS= read -r -d '' f; do
+ rel="${f#apps/site/dist/}"
  if [ "$rel" = "index.html" ]; then
  key="index.html"
  else
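The key mapping described in the hunk comments above (HTML files go to extensionless keys, with `index.html` as the one exception) can be sketched in isolation. This is a minimal illustration, not the workflow's code: `html_key` is a hypothetical helper, and the `${rel%.html}` suffix strip is an assumption, since the `else` branch is not shown in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): map a path relative to the
# build output directory to the R2 object key described in the comments.
html_key() {
  local rel="$1"
  if [ "$rel" = "index.html" ]; then
    # The homepage keeps its file name.
    printf '%s\n' "index.html"
  else
    # Assumption: every other page drops the trailing ".html".
    printf '%s\n' "${rel%.html}"
  fi
}

html_key "classification.html"   # prints: classification
html_key "types/legal.html"      # prints: types/legal
html_key "index.html"            # prints: index.html
```

Keeping the mapping in one function would make the rename from `apps/web/dist` to `apps/site/dist` irrelevant to the key logic, since only the prefix-stripped `rel` path is involved.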
6 changes: 3 additions & 3 deletions .github/workflows/deploy-worker.yml
@@ -5,7 +5,7 @@ on:
  push:
  branches: [main]
  paths:
- - "apps/web/worker/**"
+ - "apps/api/**"

  jobs:
  deploy:
@@ -16,14 +16,14 @@ jobs:

  - uses: oven-sh/setup-bun@v2

- - run: bun install --cwd apps/web/worker
+ - run: bun install --cwd apps/api

  - name: Deploy to Cloudflare Workers
  uses: cloudflare/wrangler-action@v3
  with:
  apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
  accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
- workingDirectory: apps/web/worker
+ workingDirectory: apps/api
  secrets: |
  DATABASE_URL
  env:
6 changes: 3 additions & 3 deletions .gitignore
@@ -17,11 +17,11 @@ dist/
  corpus/
  dev/

- # Generated web data (local only)
- apps/web/data/
+ # Generated site data (local only)
+ apps/site/data/

  # Astro
- apps/web/.astro/
+ apps/site/.astro/

  # Tests
  coverage/
10 changes: 6 additions & 4 deletions README.md
@@ -66,7 +66,8 @@ Run `corpus <command> --help` for detailed options.
  apps/
  cli/ # Unified CLI — corpus <command>
  cdx-filter/ # AWS Lambda — filters CDX indexes for .docx URLs
- web/ # Landing page (docxcorp.us) + Cloudflare Worker API
+ site/ # Static Astro site for docxcorp.us
+ api/ # Cloudflare Worker API for api.docxcorp.us
  packages/
  shared/ # DB client, storage abstraction, formatting
  scraper/ # Downloads WARC, validates .docx, deduplicates
@@ -84,7 +85,8 @@ db/
  |-------|------|---------|
  | **cli** | `corpus` command — orchestrates everything | Bun |
  | **cdx-filter** | Filter Common Crawl CDX indexes (Lambda) | Node.js |
- | **web** | docxcorp.us landing page + API worker | Static + CF Worker |
+ | **site** | docxcorp.us landing page and dataset pages | Static Astro |
+ | **api** | api.docxcorp.us `/stats`, `/documents`, `/manifest` | Cloudflare Worker |
  | **scraper** | Download, validate, deduplicate .docx files | Bun |
  | **extractor** | Extract text + detect language (Docling) | Bun + Python |
  | **embedder** | Generate embeddings (Gemini) | Bun |
@@ -244,8 +246,8 @@ docker compose up -d
  DATABASE_URL=postgres://postgres:postgres@localhost:5432/docx_corpus \
  bun run corpus status

- # Run web API locally
- cd apps/web/worker
+ # Run API locally
+ cd apps/api
  npx wrangler dev
  ```

File renamed without changes.
2 changes: 1 addition & 1 deletion apps/web/README.md → apps/site/README.md
@@ -1,4 +1,4 @@
- # docxcorp.us web
+ # docxcorp.us site

  Static site assets are deployed to Cloudflare R2 by `.github/workflows/deploy-site.yml`.

2 changes: 1 addition & 1 deletion apps/web/package.json → apps/site/package.json
@@ -1,5 +1,5 @@
  {
- "name": "@docx-corpus/web",
+ "name": "@docx-corpus/site",
  "private": true,
  "version": "0.0.0",
  "type": "module",
@@ -1,5 +1,5 @@
  /**
- * Content loader for markdown drafts in apps/web/content/.
+ * Content loader for markdown drafts in apps/site/content/.
  *
  * Each .md file has YAML frontmatter parsed by Astro's import.meta.glob.
  * The body is rendered to HTML at build time.
@@ -1,6 +1,6 @@
  ---
- // Homepage. Ported from the legacy static apps/web/public/index.html so it
- // participates in the Astro build pipeline. Inline CSS and JS are preserved
+ // Homepage. Ported from the legacy static file so it participates in the
+ // Astro build pipeline. Inline CSS and JS are preserved
  // verbatim under `is:global is:inline` and `is:inline` respectively; the
  // explorer JS still fetches live data from api.docxcorp.us at runtime, same
  // as before.
@@ -1564,4 +1564,4 @@ curl "https://api.docxcorp.us/manifest?type=legal&amp;lang=en&amp;min_confidence

  </body>

- </html>
+ </html>
6 changes: 3 additions & 3 deletions bun.lock

Some generated files are not rendered by default.

12 changes: 8 additions & 4 deletions package.json
@@ -2,7 +2,9 @@
  "name": "docx-corpus",
  "private": true,
  "workspaces": [
- "apps/*",
+ "apps/cdx-filter",
+ "apps/cli",
+ "apps/site",
  "packages/*"
  ],
  "scripts": {
@@ -16,11 +18,13 @@
  "release:cli": "bun run --cwd apps/cli release",
  "setup:extractor": "bun run --cwd packages/extractor setup",
  "prepare": "lefthook install",
- "dev:web": "bun run --cwd apps/web dev",
- "build:web": "bun run --cwd apps/web build"
+ "dev:site": "bun run --cwd apps/site dev",
+ "build:site": "bun run --cwd apps/site build",
+ "dev:web": "bun run dev:site",
+ "build:web": "bun run build:site"
  },
  "devDependencies": {
  "@biomejs/biome": "^2.4.6",
  "lefthook": "^1.11.13"
  }
- }
+ }