Commit 16222f1

fix(python): include JAR and license files in sdist for conda-forge

Objective: conda-forge packagers must bundle Maven + OpenJDK as build-time dependencies because PyPI only ships a wheel — there is no sdist with the pre-compiled JAR they can build from. Attempting to produce one silently fails: hatch drops gitignored files (JAR, LICENSE, NOTICE, THIRD_PARTY) from the sdist even when they are listed in 'include', yielding a 32KB tarball that installs but is missing the CLI it wraps.

Approach: Promote 'artifacts' to top-level [tool.hatch.build] so both wheel and sdist force-include the gitignored build outputs. Drop '--wheel' from build-python.sh so 'uv build' produces both artifacts. Add verify-python-sdist.sh as a standalone check that runs inside build-python.sh and can also be run locally — this guards against silent regressions if .gitignore or hatch config drifts in future. README.md is copied by build-python.sh before 'uv build' because hatchling validates [project.readme] during metadata parsing, which runs before build hooks.

Evidence: Built from a fresh clean checkout (no prior artifacts in the package dir) and installed the resulting sdist in a fresh venv.

| Scenario | Expected | Actual |
|----------|----------|--------|
| uv build on clean checkout | sdist + wheel | 21MB sdist + 21MB wheel |
| sdist contains JAR | yes | yes (tar -tzf confirmed) |
| sdist contains LICENSE/NOTICE/THIRD_PARTY | yes | yes |
| verify-python-sdist.sh | pass | "OK: sdist contains all required files" |
| verify-python-sdist.sh with multiple tarballs | fail with list | fails cleanly |
| pip install from sdist | no mvn call | wheel built in ~1s, mvn not invoked |
| JAR in site-packages | yes | 23MB at .../jar/opendataloader-pdf-cli.jar |
| import opendataloader_pdf | success | success |

Fixes #435

1 parent 7965cea commit 16222f1

4 files changed

Lines changed: 92 additions & 8 deletions

python/opendataloader-pdf/hatch_build.py

Lines changed: 9 additions & 3 deletions
@@ -19,7 +19,11 @@ def initialize(self, version, build_data):
 
         readme_path = root_dir / "README.md"
 
-        # Check if all required files already exist (building from sdist)
+        # sdist-install code path: when users `pip install <sdist>.tar.gz`,
+        # the extracted sdist already contains JAR/LICENSE/NOTICE/THIRD_PARTY
+        # (force-included via [tool.hatch.build] artifacts in pyproject.toml),
+        # and there is no java/ tree to rebuild from. Do not remove — sdist
+        # installs would break with a spurious "mvn package" error.
         if (
             dest_jar_path.exists()
             and license_path.exists()
@@ -52,10 +56,12 @@ def initialize(self, version, build_data):
         print(f"Copying JAR to {dest_jar_path}")
         shutil.copy(source_jar_path, dest_jar_path)
 
-        # --- Copy LICENSE, NOTICE, README ---
+        # --- Copy LICENSE, NOTICE ---
+        # README is copied by build-python.sh before this hook runs, because
+        # hatchling validates [project.readme] during metadata parsing, which
+        # happens before build hooks. Do not copy README here.
         shutil.copy(root_dir / "../../LICENSE", license_path)
         shutil.copy(root_dir / "../../NOTICE", notice_path)
-        shutil.copy(root_dir / "../../README.md", readme_path)
         third_party_src = root_dir / "../../THIRD_PARTY"
         print(f"Copying THIRD_PARTY directory to {third_party_dest}")
         if third_party_dest.exists():
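
Note on the sdist-install guard in the first hunk: it amounts to an "all bundled files already present" check that lets the hook return before looking for a Maven build. The sketch below is one way to express that check, not the code in hatch_build.py. The hunk only shows the JAR and LICENSE conditions, so treating NOTICE and THIRD_PARTY the same way is an assumption, and `sdist_payload_present` is a hypothetical helper name.

```python
from pathlib import Path


def sdist_payload_present(package_dir: Path) -> bool:
    """Return True when every bundled build output already exists.

    Assumption: the real hook tests the same four paths; only the JAR
    and LICENSE checks are visible in the diff.
    """
    required = [
        package_dir / "jar" / "opendataloader-pdf-cli.jar",
        package_dir / "LICENSE",
        package_dir / "NOTICE",
        package_dir / "THIRD_PARTY",
    ]
    # all() short-circuits on the first missing path.
    return all(path.exists() for path in required)
```

A hook built this way would skip the copy steps when the function returns True (the extracted-sdist case) and fall through to the JAR and license copying otherwise.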

python/opendataloader-pdf/pyproject.toml

Lines changed: 6 additions & 2 deletions
@@ -33,15 +33,19 @@ Homepage = "https://github.com/opendataloader-project/opendataloader-pdf"
 requires = ["hatchling"]
 build-backend = "hatchling.build"
 
-[tool.hatch.build.targets.wheel]
-packages = ["src/opendataloader_pdf"]
+# Shared by wheel and sdist: force-include gitignored build outputs
+# (JAR, LICENSE, NOTICE, THIRD_PARTY) that hatch_build.py copies at build time.
+[tool.hatch.build]
 artifacts = [
     "src/opendataloader_pdf/jar/*.jar",
     "src/opendataloader_pdf/LICENSE",
     "src/opendataloader_pdf/NOTICE",
     "src/opendataloader_pdf/THIRD_PARTY/**",
 ]
 
+[tool.hatch.build.targets.wheel]
+packages = ["src/opendataloader_pdf"]
+
 [tool.hatch.build.targets.sdist]
 include = [
     "src/opendataloader_pdf/**",

scripts/build-python.sh

Lines changed: 10 additions & 3 deletions
@@ -16,11 +16,18 @@ command -v uv >/dev/null || { echo "Error: uv not found. Install with: curl -LsS
 # Clean previous build
 rm -rf dist/
 
-# Copy README.md from root (gitignored in package dir)
+# Copy README.md from repo root *before* build. hatchling validates [project.readme]
+# during metadata parsing, which runs BEFORE hatch_build.py's build hook — so we
+# cannot rely on the hook to provide it. Clean up on exit so branch switches
+# don't leave a stale copy in the package dir.
 cp "$ROOT_DIR/README.md" "$PACKAGE_DIR/README.md"
+trap 'rm -f "$PACKAGE_DIR/README.md"' EXIT
 
-# Build wheel package
-uv build --wheel
+# Build sdist and wheel packages (hatch_build.py copies JAR/LICENSE/NOTICE/THIRD_PARTY)
+uv build
+
+# Verify sdist contains required artifacts (JAR, LICENSE, NOTICE, THIRD_PARTY)
+"$SCRIPT_DIR/verify-python-sdist.sh"
 
 # Install and run tests (include hybrid extras for full test coverage)
 uv sync --extra hybrid
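
Note on the `cp` plus `trap ... EXIT` pair: it stages README.md for the duration of the build and removes it again however the script ends. For readers more at home in Python, the same stage-then-clean-up pattern can be sketched with a context manager. This is only an illustration of the idea, not part of the commit; `staged_readme` is a hypothetical name.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_readme(root_dir: Path, package_dir: Path):
    """Copy README.md into the package dir and remove it on exit.

    Mirrors the shell script's cp + trap: the copy exists while the
    build runs and is deleted even if the build raises.
    """
    dest = package_dir / "README.md"
    shutil.copy(root_dir / "README.md", dest)
    try:
        yield dest
    finally:
        # Equivalent of `rm -f`: ignore a copy that is already gone.
        dest.unlink(missing_ok=True)
```

A caller would wrap the build step in `with staged_readme(root, package): ...`, so the copy is present when the backend reads the project metadata and gone afterwards.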

scripts/verify-python-sdist.sh

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+#!/bin/bash
+
+# Verify the Python sdist contains all required files (JAR, LICENSE, etc).
+# The files listed below are gitignored in the package dir and only exist in
+# the dist because [tool.hatch.build] artifacts force-includes them. This
+# script guards against silent regressions if that config ever drifts.
+#
+# Usage: ./scripts/verify-python-sdist.sh
+# Prerequisite: run 'uv build' (or scripts/build-python.sh) first.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT_DIR="$SCRIPT_DIR/.."
+DIST_DIR="$ROOT_DIR/python/opendataloader-pdf/dist"
+
+shopt -s nullglob
+SDIST_CANDIDATES=("$DIST_DIR"/*.tar.gz)
+shopt -u nullglob
+
+if [ ${#SDIST_CANDIDATES[@]} -eq 0 ]; then
+  echo "Error: no sdist found in $DIST_DIR. Run 'uv build' first." >&2
+  exit 1
+fi
+if [ ${#SDIST_CANDIDATES[@]} -gt 1 ]; then
+  echo "Error: multiple sdists found in $DIST_DIR. Remove stale ones first:" >&2
+  printf ' - %s\n' "${SDIST_CANDIDATES[@]}" >&2
+  exit 1
+fi
+SDIST="${SDIST_CANDIDATES[0]}"
+
+echo "Verifying sdist: $(basename "$SDIST")"
+
+REQUIRED=(
+  "jar/opendataloader-pdf-cli.jar"
+  "LICENSE"
+  "NOTICE"
+  "THIRD_PARTY/"
+)
+
+CONTENTS=$(tar -tzf "$SDIST")
+MISSING=()
+for path in "${REQUIRED[@]}"; do
+  # Directory prefixes (trailing '/') match any entry under that prefix.
+  # File paths are anchored to end-of-line so "LICENSE" does not match "LICENSE.bak".
+  # (^|/) prefix keeps the check layout-tolerant if hatchling ever emits
+  # a differently-rooted sdist (e.g., without the top-level pkgname-version/ dir).
+  if [[ "$path" == */ ]]; then
+    pattern="(^|/)src/opendataloader_pdf/${path}"
+  else
+    pattern="(^|/)src/opendataloader_pdf/${path}\$"
+  fi
+  if ! echo "$CONTENTS" | grep -qE "$pattern"; then
+    MISSING+=("$path")
+  fi
+done
+
+if [ ${#MISSING[@]} -gt 0 ]; then
+  echo "Error: sdist is missing required files:" >&2
+  printf ' - src/opendataloader_pdf/%s\n' "${MISSING[@]}" >&2
+  echo "" >&2
+  echo "Fix: ensure [tool.hatch.build] in pyproject.toml lists these under 'artifacts'." >&2
+  echo "(They are gitignored, so hatch drops them unless force-included.)" >&2
+  exit 1
+fi
+
+echo "OK: sdist contains all required files."
