Skip to content

Commit 0f93efd

Browse files
authored
Merge pull request #488 from PolicyEngine/fix/utf8-byte-enum-encoding
Fix enum encoding for UTF-8 byte strings
2 parents 2f9d7dd + 2706fe8 commit 0f93efd

10 files changed

Lines changed: 265 additions & 2 deletions

File tree

AGENTS.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
# Agent Instructions
2+
3+
These instructions apply repository-wide.
4+
5+
## Skills system
6+
7+
Canonical AI-facing engineering skills live under `docs/engineering/skills/`.
8+
Use those files as the source of truth across Codex, Claude, Copilot, and other
9+
AI tools.
10+
11+
When adding, moving, or reviewing tests, read
12+
`docs/engineering/skills/testing.md`.
13+
14+
When reviewing changes to public APIs, architecture, documentation, or generated
15+
artifacts, read `docs/engineering/skills/documentation_review.md`.
16+
17+
## GitHub PRs
18+
19+
Read `docs/engineering/skills/github-prs.md` before opening, replacing, or
20+
sharing any pull request.
21+
22+
Before creating or sharing any PR, all developers and agents must:
23+
24+
1. Confirm the target remote is the canonical repository:
25+
`gh repo view PolicyEngine/policyengine-core --json nameWithOwner`.
26+
2. Add a towncrier changelog fragment in `changelog.d/` using the format
27+
documented in `docs/engineering/skills/github-prs.md`.
28+
3. Push the branch to `PolicyEngine/policyengine-core`.
29+
4. Create the PR against `master`.
30+
5. Verify the PR head repository before reporting it.
31+
32+
If you cannot push to the canonical repository, stop and ask for access. Do not
33+
open a fork PR as a fallback unless the user explicitly asks for one.

CLAUDE.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
# Claude Instructions
2+
3+
These instructions apply repository-wide.
4+
5+
## Canonical guidance
6+
7+
Repository-wide AI-facing engineering guidance lives in `AGENTS.md`.
8+
Canonical skills live under `docs/engineering/skills/`.
9+
10+
Use those files as the source of truth. This file is a Claude adapter and should
11+
stay thin; do not duplicate detailed testing, CI, formatting, or architecture
12+
rules here.
13+
14+
## Required skill lookup
15+
16+
Before opening, replacing, or sharing a PR, read
17+
`docs/engineering/skills/github-prs.md`. Add the required towncrier changelog
18+
fragment before creating the PR.
19+
20+
When adding, moving, or reviewing tests, read
21+
`docs/engineering/skills/testing.md` before editing.
22+
23+
When reviewing changes to public APIs, architecture, documentation, or generated
24+
artifacts, read `docs/engineering/skills/documentation_review.md`.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Added provider-neutral AI engineering guidance for tests, documentation review, pull requests, and changelog fragments.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Fixed enum encoding for HDF5-style UTF-8 byte-string arrays containing non-ASCII enum member names.

docs/engineering/skills/README.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
# Engineering Skills
2+
3+
This directory is the canonical source for AI-facing engineering rules.
4+
5+
Tool-specific instruction files such as `AGENTS.md`, `CLAUDE.md`, and
6+
`.github/copilot-instructions.md` should point here instead of duplicating
7+
implementation-specific guidance. When a rule changes, update the skill here
8+
first, then keep adapters thin.
9+
10+
Current skills:
11+
12+
- `documentation_review.md`: model-neutral review checklist for public API,
13+
architecture, documentation, and generated artifact changes.
14+
- `github-prs.md`: canonical PR workflow, required changelog fragments, PR head
15+
verification, and title conventions.
16+
- `testing.md`: test placement, fixture scope, command, and environment
17+
expectations.
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
# Documentation Review
2+
3+
Use this skill when reviewing a pull request that changes public APIs,
4+
architecture, documentation, generated artifacts, or developer-facing workflows.
5+
6+
## Review goal
7+
8+
Documentation review is a harness check, not copyediting. Confirm that durable
9+
documentation still describes the code paths a maintainer or AI agent would use
10+
to understand, validate, or modify the system.
11+
12+
Do not put PR-specific confidence, impact, or reviewer notes into durable API
13+
docs. Keep durable facts in source documentation and keep review judgment in the
14+
PR description or review comment.
15+
16+
## Trigger files
17+
18+
Run documentation review when a PR changes any of these paths:
19+
20+
- `policyengine_core/`
21+
- `docs/`
22+
- `.github/`
23+
- `README.md`
24+
- `CONTRIBUTING.md`
25+
- `pyproject.toml`
26+
- generated documentation or package metadata
27+
28+
Also run it when a PR changes public import surfaces, command-line behavior,
29+
test/development workflows, changelog tooling, or release behavior even if the
30+
changed path is not listed above.
31+
32+
## Checks
33+
34+
- Public API changes are reflected in relevant docs or API reference pages.
35+
- Developer workflow changes are reflected in `README.md`, contributing docs, or
36+
AI-facing skills when needed.
37+
- Generated artifacts are not hand-edited unless the repo expects them to be
38+
checked in.
39+
- Changelog fragment guidance still matches the towncrier config in
40+
`pyproject.toml`.
41+
- Review notes distinguish facts proven by code or tests from assumptions.
42+
43+
## Review output
44+
45+
Use a concise PR-facing note. Include:
46+
47+
- Documentation changes observed, or `not required` with a reason.
48+
- Commands run, or commands not run with a reason.
49+
- Impact: `low`, `medium`, or `high`.
50+
- Confidence: `low`, `medium`, or `high`.
51+
- Known gaps that should be handled in this PR or a follow-up.
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# GitHub PRs
2+
3+
These rules apply to every developer and AI agent opening pull requests in this
4+
repository.
5+
6+
## Repository and branch
7+
8+
Open PRs against `master` in `PolicyEngine/policyengine-core`.
9+
10+
Before creating or sharing a PR:
11+
12+
1. Confirm the canonical repository is reachable:
13+
`gh repo view PolicyEngine/policyengine-core --json nameWithOwner`.
14+
2. Add the required towncrier changelog fragment under `changelog.d/`.
15+
3. Push the current branch to `PolicyEngine/policyengine-core`.
16+
4. Create the PR against `master`.
17+
5. Verify the PR head repository before reporting it:
18+
`gh pr view <PR> --repo PolicyEngine/policyengine-core --json headRepositoryOwner,headRepository`.
19+
20+
If you cannot push to the canonical repository, stop and ask for access. Do not
21+
open a fork PR as a fallback unless the user explicitly asks for one.
22+
23+
## Changelog fragment
24+
25+
A changelog entry is required before opening, replacing, or sharing a PR. When a
26+
user asks an AI agent to open a PR, the agent must check for an appropriate
27+
fragment and add one if it is missing before running `gh pr create`.
28+
29+
Use towncrier fragments in this format:
30+
31+
```text
32+
changelog.d/<short-slug>.<type>.md
33+
```
34+
35+
Allowed `<type>` values are configured in `pyproject.toml`:
36+
37+
- `breaking`
38+
- `added`
39+
- `changed`
40+
- `fixed`
41+
- `removed`
42+
43+
Examples:
44+
45+
```text
46+
changelog.d/fix-enum-utf8-bytes.fixed.md
47+
changelog.d/add-agent-pr-guidance.changed.md
48+
```
49+
50+
Write one concise Markdown sentence in the fragment. Use `fixed` for bug fixes,
51+
`added` for new user-facing capabilities, `changed` for behavior, documentation,
52+
tooling, or refactors, `removed` for removals, and `breaking` only for changes
53+
that intentionally break compatibility. Prefer updating an existing branch
54+
fragment over adding duplicate fragments for the same PR.
55+
56+
Do not run `make changelog` for a PR. The release workflow builds
57+
`CHANGELOG.md` from fragments after merge.
58+
59+
## PR title
60+
61+
Do not add `[codex]`, `[claude]`, `[copilot]`, or other agent labels to PR
62+
titles. Use a plain descriptive title.
63+
64+
## PR body
65+
66+
Keep the description concrete:
67+
68+
- Link the issue the PR fixes.
69+
- Summarize the user-visible behavior change.
70+
- List the tests or checks run.
71+
- Call out any commands that were not run and why.

docs/engineering/skills/testing.md

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
# Testing Skill
2+
3+
Use this skill whenever adding, moving, or reviewing tests.
4+
5+
## Commands
6+
7+
Use `uv run` for Python commands so the repo environment is selected
8+
consistently.
9+
10+
Common checks:
11+
12+
```bash
13+
uv run pytest tests/core/enums/test_enum.py -v
14+
uv run pytest tests/core/test_file.py::test_name -v
15+
make format
16+
make test
17+
make documentation
18+
```
19+
20+
Run the narrowest test that proves the change while working. Before handing off a
21+
broader behavioral change, run the relevant focused tests and formatting check.
22+
23+
## Placement
24+
25+
- Put core package tests under `tests/core/`.
26+
- Put smoke tests under `tests/smoke/`.
27+
- Keep fixtures under `tests/fixtures/` unless pytest fixture discovery requires
28+
a local `conftest.py`.
29+
- Do not add tests inside `policyengine_core/`.
30+
31+
## Fixture and dependency boundaries
32+
33+
- Keep root `tests/conftest.py` lightweight.
34+
- Avoid network access, cloud credentials, or country package imports in ordinary
35+
unit tests unless the test is explicitly a smoke/integration check.
36+
- Prefer small synthetic fixtures for regression tests.
37+
- When fixing a bug, add a regression test that fails without the fix and passes
38+
with it unless the change is documentation-only.

policyengine_core/enums/enum.py

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -77,8 +77,11 @@ def encode(cls, array: Union[EnumArray, np.ndarray]) -> EnumArray:
7777
indices = np.array([item.index for item in array], dtype=ENUM_ARRAY_DTYPE)
7878
return EnumArray(indices, cls)
7979

80-
# Convert byte-strings or object arrays to Unicode strings
81-
if array.dtype.kind == "S" or array.dtype == object:
80+
# Convert fixed-width byte strings, as returned by h5py for string
81+
# datasets, to Unicode before matching enum names.
82+
if array.dtype.kind == "S":
83+
array = np.char.decode(array, "utf-8")
84+
elif array.dtype == object:
8285
array = array.astype(str)
8386

8487
if isinstance(array, np.ndarray) and array.dtype.kind in {"U", "S"}:

tests/core/enums/test_enum.py

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -74,6 +74,30 @@ class Sample(Enum):
7474
assert "FOO" in error_message or "BAR" in error_message or "BAZ" in error_message
7575

7676

77+
def test_enum_encode_utf8_byte_string_array():
78+
"""
79+
Test that HDF5-style UTF-8 byte strings encode as enum names.
80+
81+
The ñ mirrors values like DOÑA_ANA_COUNTY_NM. It is a non-ASCII
82+
character, so NumPy's default byte-string conversion cannot decode it
83+
as ASCII.
84+
"""
85+
86+
class Sample(Enum):
87+
DOÑA_ANA = "Doña Ana"
88+
DWORKIN = "dworkin"
89+
90+
byte_string_array = np.array(
91+
["DOÑA_ANA".encode("utf-8"), b"DWORKIN"],
92+
dtype="S10",
93+
)
94+
95+
encoded_array = Sample.encode(byte_string_array)
96+
97+
assert isinstance(encoded_array, EnumArray)
98+
assert list(encoded_array) == [0, 1]
99+
100+
77101
def test_enum_encode_empty_string_raises_error():
78102
"""Test that encoding empty strings raises ValueError."""
79103

0 commit comments

Comments
 (0)