chore: add HUGGING_FACE_TOKEN to env example

nikhilwoodruff · nikhilwoodruff · commit ff5262859f37 · 2025-12-30T13:43:31.000Z
diff --git a/.env.example b/.env.example
@@ -42,6 +42,12 @@ DEBUG=true
 LOGFIRE_TOKEN=
 LOGFIRE_ENVIRONMENT=local
 
+# =============================================================================
+# HUGGING FACE (for dataset downloads)
+# =============================================================================
+# Get token from https://huggingface.co/settings/tokens
+HUGGING_FACE_TOKEN=hf_...
+
 # =============================================================================
 # AGENT (Claude Code)
 # =============================================================================
diff --git a/docs/AGENT_TESTING.md b/docs/AGENT_TESTING.md
@@ -146,12 +146,35 @@ Tests are in `tests/test_agent_policy_questions.py` (integration tests requiring
 | Seed script | Deduplicate parameters by name |
 | System prompt | Fixed model names (hyphen not underscore) |
 
+### Issue 8: Agent not using country filter in economy-wide analysis
+
+**Problem**: When answering economic impact questions, the agent didn't use the `tax_benefit_model_name` filter despite it being in the system prompt. As a result, the UK budgetary impact question took 18 turns, 12 of which were spent just searching for the parameter.
+
+**Root cause**: The system prompt mentioned the filter but didn't emphasize it enough, and the economic impact workflow example didn't show the filter.
+
+**Solution implemented**: Restructured the system prompt with:
+- A **CRITICAL** section at the top emphasizing the country filter
+- An explanation of why the filter is needed (mixed results waste turns)
+- The filter added to all workflow examples, including economic impact
+
+**Result**: UK budgetary impact question now completes in **6 turns** (down from 18).
+
+### Issue 9: Key US parameters missing from database
+
+**Problem**: Core CTC parameters like `gov.irs.credits.ctc.amount.base[0].amount` have `label=None` in the policyengine-us package, so they're not seeded (the seed script only includes parameters with labels).
+
+**Impact**: The agent can't find the main CTC amount parameter when asked to double it, and had to use `refundable.individual_max` as a proxy.
+
+**Solution needed**: Add labels to core parameters in the policyengine-us package (upstream fix).
+
 ## Measurements
 
 | Question type | Baseline | After improvements | Target |
 |---------------|----------|-------------------|--------|
 | Parameter lookup (UK personal allowance) | 10 turns | **3 turns** | 3-4 |
 | Household calculation (UK £50k income) | 6 turns | - | 5-6 |
+| Economy-wide (UK budgetary impact) | 18 turns | **6 turns** | 5-8 |
+| Economy-wide (US CTC impact) | 20+ turns | - | 8-10 |
 
 ## Progress log
 
@@ -163,3 +186,7 @@ Tests are in `tests/test_agent_policy_questions.py` (integration tests requiring
 - 2024-12-30: Fixed model name mismatch (policyengine-uk with hyphen, not underscore)
 - 2024-12-30: Added case-insensitive search using ILIKE
 - 2024-12-30: Tested personal allowance lookup - **3 turns** (target met!)
+- 2025-12-30: Tested UK economy-wide (budgetary impact) - 18 turns initially
+- 2025-12-30: Restructured system prompt to emphasize country filter at top
+- 2025-12-30: UK economy-wide now **6 turns** (3x improvement)
+- 2025-12-30: Discovered US CTC parameters missing labels (upstream issue in policyengine-us)
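
Review note (illustration only, not part of the commit): the country filter described under Issue 8 amounts to always adding `tax_benefit_model_name` to the query string. The sketch below assumes a hypothetical base URL and a hypothetical helper name, `parameter_search_url`. Only the `/parameters/` path, the `search` and `tax_benefit_model_name` query parameters, and the two model names are taken from the system prompt in this commit.

```python
from urllib.parse import urlencode

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://api.example.org"

# Model names as given in the system prompt (hyphen, not underscore).
MODEL_NAMES = {"uk": "policyengine-uk", "us": "policyengine-us"}


def parameter_search_url(search: str, country: str) -> str:
    """Build a /parameters/ search URL that always carries the country filter."""
    query = urlencode(
        {"search": search, "tax_benefit_model_name": MODEL_NAMES[country]}
    )
    return f"{BASE_URL}/parameters/?{query}"


print(parameter_search_url("personal allowance", "uk"))
```

Making the country a required argument means a caller cannot build an unfiltered search by accident, which is the failure mode that cost the extra turns.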
diff --git a/docs/src/app/modal/page.tsx b/docs/src/app/modal/page.tsx
@@ -0,0 +1,205 @@
+export default function ModalPage() {
+  return (
+    <div className="max-w-4xl">
+      <h1 className="text-3xl font-semibold text-[var(--color-text-primary)] mb-4">
+        Modal compute
+      </h1>
+      <p className="text-lg text-[var(--color-text-secondary)] mb-8">
+        PolicyEngine uses Modal.com for serverless compute, with two separate apps for different workloads.
+      </p>
+
+      <div className="space-y-8">
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">Why two apps?</h2>
+          <p className="text-sm text-[var(--color-text-secondary)] mb-4">
+            The API uses two separate Modal apps rather than one combined app. This separation is intentional and provides several benefits:
+          </p>
+          <div className="space-y-4">
+            <div>
+              <h3 className="font-medium text-[var(--color-text-primary)] mb-2">Image size</h3>
+              <p className="text-sm text-[var(--color-text-secondary)]">
+                The <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">policyengine</code> app has massive container images (multiple GB) with the full UK and US tax-benefit models pre-loaded. The <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">policyengine-sandbox</code> app is minimal - just the Anthropic SDK and requests library.
+              </p>
+            </div>
+            <div>
+              <h3 className="font-medium text-[var(--color-text-primary)] mb-2">Cold start optimisation</h3>
+              <p className="text-sm text-[var(--color-text-secondary)]">
+                The main app uses Modal&apos;s memory snapshot feature to pre-load PolicyEngine models at build time. When a function cold starts, it restores from the snapshot rather than re-importing the models, achieving sub-1s cold starts for functions that would otherwise take 30+ seconds to import.
+              </p>
+            </div>
+            <div>
+              <h3 className="font-medium text-[var(--color-text-primary)] mb-2">Architectural decoupling</h3>
+              <p className="text-sm text-[var(--color-text-secondary)]">
+                The sandbox/agent calls the public API endpoints, which then trigger the simulation functions. They&apos;re independent - the agent doesn&apos;t directly import PolicyEngine models, it makes HTTP calls.
+              </p>
+            </div>
+            <div>
+              <h3 className="font-medium text-[var(--color-text-primary)] mb-2">Independent scaling</h3>
+              <p className="text-sm text-[var(--color-text-secondary)]">
+                Simulation workloads scale differently from agent chat sessions. Keeping them separate lets Modal scale each independently based on demand.
+              </p>
+            </div>
+          </div>
+        </section>
+
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">policyengine app</h2>
+          <p className="text-sm text-[var(--color-text-secondary)] mb-4">
+            The main compute app for running simulations. Located at <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">src/policyengine_api/modal_app.py</code>.
+          </p>
+
+          <div className="overflow-x-auto">
+            <table className="w-full text-sm">
+              <thead>
+                <tr className="border-b border-[var(--color-border)]">
+                  <th className="text-left py-2 pr-4 font-medium text-[var(--color-text-primary)]">Function</th>
+                  <th className="text-left py-2 pr-4 font-medium text-[var(--color-text-primary)]">Image</th>
+                  <th className="text-left py-2 pr-4 font-medium text-[var(--color-text-primary)]">Memory</th>
+                  <th className="text-left py-2 font-medium text-[var(--color-text-primary)]">Purpose</th>
+                </tr>
+              </thead>
+              <tbody className="text-[var(--color-text-secondary)]">
+                <tr className="border-b border-[var(--color-border)]">
+                  <td className="py-2 pr-4 font-mono text-xs">simulate_household_uk</td>
+                  <td className="py-2 pr-4">uk_image</td>
+                  <td className="py-2 pr-4">4GB</td>
+                  <td className="py-2">Single UK household calculation</td>
+                </tr>
+                <tr className="border-b border-[var(--color-border)]">
+                  <td className="py-2 pr-4 font-mono text-xs">simulate_household_us</td>
+                  <td className="py-2 pr-4">us_image</td>
+                  <td className="py-2 pr-4">4GB</td>
+                  <td className="py-2">Single US household calculation</td>
+                </tr>
+                <tr className="border-b border-[var(--color-border)]">
+                  <td className="py-2 pr-4 font-mono text-xs">simulate_economy_uk</td>
+                  <td className="py-2 pr-4">uk_image</td>
+                  <td className="py-2 pr-4">8GB</td>
+                  <td className="py-2">UK economy simulation</td>
+                </tr>
+                <tr className="border-b border-[var(--color-border)]">
+                  <td className="py-2 pr-4 font-mono text-xs">simulate_economy_us</td>
+                  <td className="py-2 pr-4">us_image</td>
+                  <td className="py-2 pr-4">8GB</td>
+                  <td className="py-2">US economy simulation</td>
+                </tr>
+                <tr className="border-b border-[var(--color-border)]">
+                  <td className="py-2 pr-4 font-mono text-xs">economy_comparison_uk</td>
+                  <td className="py-2 pr-4">uk_image</td>
+                  <td className="py-2 pr-4">8GB</td>
+                  <td className="py-2">UK decile impacts, budget impact</td>
+                </tr>
+                <tr>
+                  <td className="py-2 pr-4 font-mono text-xs">economy_comparison_us</td>
+                  <td className="py-2 pr-4">us_image</td>
+                  <td className="py-2 pr-4">8GB</td>
+                  <td className="py-2">US decile impacts, budget impact</td>
+                </tr>
+              </tbody>
+            </table>
+          </div>
+
+          <div className="mt-4 p-3 bg-[var(--color-surface-sunken)] rounded-lg">
+            <p className="text-xs text-[var(--color-text-muted)]">
+              Deploy with: <code className="font-mono">modal deploy src/policyengine_api/modal_app.py</code>
+            </p>
+          </div>
+        </section>
+
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">policyengine-sandbox app</h2>
+          <p className="text-sm text-[var(--color-text-secondary)] mb-4">
+            Lightweight app for the AI agent. Located at <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">src/policyengine_api/agent_sandbox.py</code>.
+          </p>
+
+          <div className="overflow-x-auto">
+            <table className="w-full text-sm">
+              <thead>
+                <tr className="border-b border-[var(--color-border)]">
+                  <th className="text-left py-2 pr-4 font-medium text-[var(--color-text-primary)]">Function</th>
+                  <th className="text-left py-2 pr-4 font-medium text-[var(--color-text-primary)]">Dependencies</th>
+                  <th className="text-left py-2 font-medium text-[var(--color-text-primary)]">Purpose</th>
+                </tr>
+              </thead>
+              <tbody className="text-[var(--color-text-secondary)]">
+                <tr>
+                  <td className="py-2 pr-4 font-mono text-xs">run_agent</td>
+                  <td className="py-2 pr-4">anthropic, requests</td>
+                  <td className="py-2">Agentic loop using Claude with API tools</td>
+                </tr>
+              </tbody>
+            </table>
+          </div>
+
+          <p className="text-sm text-[var(--color-text-secondary)] mt-4">
+            The agent dynamically generates Claude tools from the OpenAPI spec, then executes an agentic loop to answer policy questions by making API calls. It doesn&apos;t import PolicyEngine directly.
+          </p>
+
+          <div className="mt-4 p-3 bg-[var(--color-surface-sunken)] rounded-lg">
+            <p className="text-xs text-[var(--color-text-muted)]">
+              Deploy with: <code className="font-mono">modal deploy src/policyengine_api/agent_sandbox.py</code>
+            </p>
+          </div>
+        </section>
+
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">Memory snapshots</h2>
+          <p className="text-sm text-[var(--color-text-secondary)] mb-4">
+            The <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">policyengine</code> app uses Modal&apos;s <code className="px-1.5 py-0.5 bg-[var(--color-surface-sunken)] rounded text-xs">run_function</code> to snapshot the Python interpreter state after importing the models:
+          </p>
+          <pre className="p-4 bg-[var(--color-surface-sunken)] rounded-lg text-xs font-mono overflow-x-auto text-[var(--color-text-secondary)]">
+{`def _import_uk():
+    from policyengine.tax_benefit_models.uk import uk_latest
+    print("UK model loaded and snapshotted")
+
+uk_image = base_image.run_commands(
+    "uv pip install --system 'policyengine-uk>=2.0.0'"
+).run_function(_import_uk)`}
+          </pre>
+          <p className="text-sm text-[var(--color-text-secondary)] mt-4">
+            When a cold start happens, Modal restores from this snapshot rather than re-running the imports. This turns a 30+ second import into sub-second startup.
+          </p>
+        </section>
+
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">Secrets</h2>
+          <p className="text-sm text-[var(--color-text-secondary)] mb-4">
+            Each app uses different Modal secrets:
+          </p>
+          <div className="space-y-3">
+            <div className="flex items-start gap-3">
+              <span className="px-2 py-1 bg-[var(--color-surface-sunken)] rounded text-xs font-mono text-[var(--color-text-secondary)]">policyengine-db</span>
+              <p className="text-sm text-[var(--color-text-secondary)]">Database credentials for the main app (DATABASE_URL, SUPABASE_URL, SUPABASE_KEY)</p>
+            </div>
+            <div className="flex items-start gap-3">
+              <span className="px-2 py-1 bg-[var(--color-surface-sunken)] rounded text-xs font-mono text-[var(--color-text-secondary)]">anthropic-api-key</span>
+              <p className="text-sm text-[var(--color-text-secondary)]">Anthropic API key for the sandbox app (ANTHROPIC_API_KEY)</p>
+            </div>
+          </div>
+        </section>
+
+        <section className="p-6 border border-[var(--color-border)] rounded-xl bg-white">
+          <h2 className="text-xl font-semibold text-[var(--color-text-primary)] mb-4">Request flow</h2>
+          <div className="space-y-3">
+            {[
+              "Client calls API endpoint (e.g. POST /household/calculate)",
+              "FastAPI validates request and creates job record in Supabase",
+              "FastAPI triggers Modal function asynchronously",
+              "API returns job ID immediately",
+              "Modal function runs calculation with pre-loaded models",
+              "Modal function writes results directly to Supabase",
+              "Client polls API until job status = completed",
+            ].map((step, index) => (
+              <div key={index} className="flex items-start gap-3">
+                <span className="flex-shrink-0 w-6 h-6 rounded-full bg-[var(--color-pe-green)] text-white text-xs font-medium flex items-center justify-center">
+                  {index + 1}
+                </span>
+                <p className="text-sm text-[var(--color-text-secondary)] pt-0.5">{step}</p>
+              </div>
+            ))}
+          </div>
+        </section>
+      </div>
+    </div>
+  );
+}
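
Review note (illustration only, not part of the commit): the last step of the request flow on this page, "Client polls API until job status = completed", could look roughly like the client-side sketch below. The helper name `poll_until_completed`, the injected `fetch_status` callable, and the intermediate status values `"pending"` and `"running"` are assumptions. Only the `status` field reaching `"completed"` and the 5-10 second wait between polls come from this commit.

```python
import time
from typing import Callable


def poll_until_completed(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reports status == "completed"."""
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") == "completed":
            return job
        # Wait between polls rather than looping tightly; skip the final wait.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job not completed after {max_attempts} attempts")


# Usage with a fake job that completes on the third poll (no real HTTP calls).
responses = iter(
    [{"status": "pending"}, {"status": "running"}, {"status": "completed", "result": 1}]
)
print(poll_until_completed(lambda: next(responses), sleep=lambda _: None))
```

Passing `fetch_status` and `sleep` in as callables keeps the loop independent of any HTTP client and lets it be exercised without real delays.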
diff --git a/src/policyengine_api/agent_sandbox.py b/src/policyengine_api/agent_sandbox.py
@@ -18,30 +18,39 @@
 
 SYSTEM_PROMPT = """You are a PolicyEngine assistant that helps users understand tax and benefit policies.
 
-You have access to the full PolicyEngine API. Key workflows:
+You have access to the full PolicyEngine API.
 
-1. **Household calculations**: POST to /household/calculate with people array, then poll GET /household/calculate/{job_id}
-2. **Parameter lookup**: GET /parameters/ with search query and tax_benefit_model_name, then GET /parameter-values/ with parameter_id
-3. **Economic impact**:
-   - GET /parameters/ to find parameter_id
+## CRITICAL: Always filter by country
+
+When searching for parameters or datasets, ALWAYS include tax_benefit_model_name:
+- "policyengine-uk" for UK questions
+- "policyengine-us" for US questions
+
+Parameters and datasets from both countries are in the same database. Without the filter, you'll get mixed results and waste turns finding the right ones.
+
+## Key workflows
+
+1. **Household calculations**:
+   - POST /household/calculate with model_name and people array
+   - Poll GET /household/calculate/{job_id} until completed
+
+2. **Parameter lookup**:
+   - GET /parameters/?search=...&tax_benefit_model_name=policyengine-uk (ALWAYS include country filter)
+   - GET /parameter-values/?parameter_id=...&current=true for the current value
+
+3. **Economic impact analysis** (budget impact, decile impacts):
+   - GET /parameters/?search=...&tax_benefit_model_name=policyengine-uk to find parameter_id
    - POST /policies/ to create reform with parameter_values
-   - GET /datasets/ to find dataset_id
-   - POST /analysis/economic-impact with policy_id and dataset_id
-   - Poll GET /analysis/economic-impact/{report_id} until completed
+   - GET /datasets/?tax_benefit_model_name=policyengine-uk to find dataset_id
+   - POST /analysis/economic-impact with tax_benefit_model_name, policy_id and dataset_id
+   - GET /analysis/economic-impact/{report_id} for results (includes decile_impacts and program_statistics)
 
-When searching for parameters, use tax_benefit_model_name to filter by country:
-- "policyengine-uk" for UK parameters
-- "policyengine-us" for US parameters
+## Guidelines
 
-When answering questions:
 1. Use the API tools to get accurate, current data
-2. Show your calculations clearly
-3. Be concise but thorough
-4. For UK, amounts are in GBP. For US, amounts are in USD.
-5. Poll async endpoints until status is "completed"
-
-IMPORTANT: When polling async endpoints, ALWAYS use the sleep tool to wait 5-10 seconds between requests.
-Do not poll in a tight loop - this wastes resources and may hit rate limits.
+2. Be concise but thorough
+3. For UK, amounts are in GBP. For US, amounts are in USD.
+4. When polling async endpoints, use the sleep tool to wait 5-10 seconds between requests
 """
 
 # Sleep tool for polling delays