
Add Human Pages integration for human fallback #27

Open
human-pages-ai wants to merge 3 commits into tinyfish-io:main from human-pages-ai:add-humanpages-integration

Conversation

@human-pages-ai

Summary

  • Adds a humanpages integration package that combines AgentQL with Human Pages for automatic human fallback when extraction fails
  • HumanFallbackAgent class: tries AgentQL first, delegates to a real human if extraction fails
  • HumanPagesClient class: full REST API wrapper with sync and async methods
  • 21 unit tests, example script, Jupyter notebook in docs/

Package structure

humanpages/
├── agentql_humanpages/     # Package source
│   ├── agent.py            # HumanFallbackAgent (sync + async)
│   ├── client.py           # HumanPagesClient REST wrapper
│   ├── const.py            # Endpoints and defaults
│   └── messages.py         # Error messages
├── tests/unit_tests/       # 21 tests (all pass, no network)
├── examples/               # Scraper example with README
├── docs/                   # Jupyter notebook
├── pyproject.toml          # Poetry config
└── Makefile                # Standard targets
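The try-AgentQL-then-delegate behavior the summary describes can be sketched as a generic pattern. Everything here is an illustrative stand-in, not the package's actual API; only the `"source"` result keys mirror the shapes shown in the walkthrough below.

```python
from typing import Callable, Optional

def extract_with_fallback(
    primary: Callable[[], Optional[dict]],
    fallback: Callable[[], dict],
) -> dict:
    """Try the primary extractor first; delegate on error or empty data."""
    try:
        data = primary()
        if data:  # non-empty result means no human is needed
            return {"source": "agentql", "data": data}
    except Exception:
        pass  # extraction failed; fall through to the human fallback
    return {"source": "humanpages", **fallback()}

# A primary extractor that returns nothing routes the request to the fallback.
result = extract_with_fallback(
    lambda: None,
    lambda: {"job_id": "job-1", "status": "completed", "messages": []},
)
```

The same shape works for the async variant by swapping in awaitable callables.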

Links

AgentQL + Human Pages: when automated extraction fails, delegate to a
real human. Includes sync/async API, unit tests, example script, and
docs notebook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@human-pages-ai has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 57 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 57 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ee0a448-cb59-4699-90d6-03c539052fa4

📥 Commits

Reviewing files that changed from the base of the PR and between cfe98e9 and a471f83.

📒 Files selected for processing (2)
  • humanpages/Makefile
  • humanpages/agentql_humanpages/agent.py
📝 Walkthrough

Walkthrough

This pull request adds the agentql-humanpages integration package. It implements HumanFallbackAgent (sync and async), which attempts AgentQL extraction and falls back to Human Pages jobs when AgentQL fails or returns empty results. It also introduces HumanPagesClient for REST interactions, configuration constants and error messages, packaging and build files, examples, documentation, unit/integration tests, and CI/Makefile targets.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HumanFallbackAgent
    participant AgentQL_API as "AgentQL API"
    participant HumanPagesClient
    participant HumanPages_API as "Human Pages API"

    Client->>HumanFallbackAgent: extract(url, query/prompt)
    HumanFallbackAgent->>HumanFallbackAgent: validate query/prompt

    HumanFallbackAgent->>AgentQL_API: POST /v1/query-data (url, query/prompt, mode)
    alt AgentQL returns data
        AgentQL_API-->>HumanFallbackAgent: 200 + data
        HumanFallbackAgent-->>Client: {"source":"agentql","data":...}
    else AgentQL error or empty data
        AgentQL_API-->>HumanFallbackAgent: error or empty
        HumanFallbackAgent->>HumanPagesClient: search_humans(skill="web task", available=True)
        HumanPagesClient->>HumanPages_API: GET /api/humans/search
        HumanPages_API-->>HumanPagesClient: humans list
        HumanPagesClient-->>HumanFallbackAgent: humans list

        HumanFallbackAgent->>HumanPagesClient: create_job(humanId, title, description, priceUsdc, deadlineHours)
        HumanPagesClient->>HumanPages_API: POST /api/jobs
        HumanPages_API-->>HumanPagesClient: job created (job_id)
        HumanPagesClient-->>HumanFallbackAgent: job details

        loop poll until terminal or max attempts
            HumanFallbackAgent->>HumanPagesClient: get_job_status(job_id)
            HumanPagesClient->>HumanPages_API: GET /api/jobs/{job_id}
            HumanPages_API-->>HumanPagesClient: job status
            HumanPagesClient-->>HumanFallbackAgent: status
        end

        HumanFallbackAgent->>HumanPagesClient: get_job_messages(job_id)
        HumanPagesClient->>HumanPages_API: GET /api/jobs/{job_id}/messages
        HumanPages_API-->>HumanPagesClient: messages
        HumanPagesClient-->>HumanFallbackAgent: messages

        HumanFallbackAgent-->>Client: {"source":"humanpages","job_id":...,"status":...,"messages":[...]}
    end
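The diagram's first step, validating query/prompt, might look like the following sketch. The function name and error messages are assumptions; one of the review comments below additionally suggests enforcing mutual exclusivity, which is included here.

```python
from typing import Optional

def validate_query_prompt(query: Optional[str], prompt: Optional[str]) -> None:
    """Reject calls that provide neither, or both, of query and prompt."""
    if not query and not prompt:
        raise ValueError("Either 'query' or 'prompt' must be provided.")
    if query and prompt:
        raise ValueError("Only one of 'query' or 'prompt' may be provided.")

# Exactly one of the two arguments passes validation.
validate_query_prompt("{ products[] { name } }", None)
```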
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The PR title clearly summarizes the main change: adding a Human Pages integration package for automatic human fallback when AgentQL extraction fails.
  • Description check: ✅ Passed. The PR description is directly related to the changeset. It clearly explains the purpose (combining AgentQL with Human Pages for fallback), the main classes introduced (HumanFallbackAgent and HumanPagesClient), the package structure, and includes links to relevant documentation.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (11)
humanpages/.gitignore (1)

1-1: Use a directory pattern for clarity.

__pycache__ works, but __pycache__/ is clearer and explicitly targets directories.

Suggested tweak
-__pycache__
+__pycache__/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/.gitignore` at line 1, Replace the bare "__pycache__" entry in
.gitignore with the directory-specific pattern "__pycache__/" so the rule
explicitly targets the cache directories; update the existing "__pycache__" line
to "__pycache__/" in the .gitignore file.
humanpages/pyproject.toml (2)

5-12: Consider populating the authors field.

The authors field is currently empty. While not required, populating this field with author information is recommended for published packages to provide proper attribution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/pyproject.toml` around lines 5 - 12, The authors field under the
[tool.poetry] section is empty; populate the authors array in pyproject.toml
(the authors entry under [tool.poetry]) with one or more author strings (e.g.,
"Name <email>") to provide proper attribution for the package release and
metadata.

1-3: Consider updating the poetry-core version constraint.

The minimum version constraint >=1.0.0 includes poetry-core versions from 2020. Modern Poetry projects typically specify more recent versions to benefit from bug fixes and improvements.

📦 Suggested update
-requires = ["poetry-core>=1.0.0"]
+requires = ["poetry-core>=1.9.0"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/pyproject.toml` around lines 1 - 3, Update the build-system
requirement for poetry-core to a more recent minimum to get modern fixes: edit
the pyproject.toml's [build-system] requires entry (the "requires" key
referencing "poetry-core") and change the version constraint from
"poetry-core>=1.0.0" to a newer minimum (for example "poetry-core>=1.4.0" or
your chosen supported minimum), keeping build-backend =
"poetry.core.masonry.api" unchanged and then run your packaging/build checks to
confirm compatibility.
humanpages/LICENSE (1)

1-21: LGTM! Consider copyright year.

The MIT License text is standard and correct. The copyright year is set to 2024, which may represent the original creation date. If this package is being published in 2026, you may optionally want to update it to reflect the current year or use a range (e.g., "2024-2026").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/LICENSE` around lines 1 - 21, Update the copyright line in the
LICENSE file: change the year "2024" to the current year or a year range (for
example "2024-2026") so the MIT header reflects the publication timeframe; edit
the top lines of the LICENSE where the copyright notice appears.
humanpages/tests/unit_tests/test_agent.py (2)

36-42: Test may leak environment values — use clear=True.

Unlike the two tests below (Lines 45, 50), this one does not clear the environment before patching, so a developer who exports AGENTQL_API_KEY/HUMANPAGES_API_KEY in their shell can still pass if patch.dict ordering surprises them. Safer to mirror the other tests:

Proposed change
-        with patch.dict("os.environ", {
-            "AGENTQL_API_KEY": "env-aql",
-            "HUMANPAGES_API_KEY": "env-hp",
-        }):
+        with patch.dict(
+            "os.environ",
+            {"AGENTQL_API_KEY": "env-aql", "HUMANPAGES_API_KEY": "env-hp"},
+            clear=True,
+        ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/tests/unit_tests/test_agent.py` around lines 36 - 42, The
test_init_from_env uses patch.dict("os.environ", {...}) without clear=True which
can allow real environment variables to leak into the test; update the patch
call in test_init_from_env to use patch.dict("os.environ", {...}, clear=True) to
match the other tests and ensure HumanFallbackAgent initialization reads only
the provided keys (reference test_init_from_env, patch.dict usage, and
HumanFallbackAgent).

78-107: Mock doesn't reflect real AgentQL failure path.

In humanpages/agentql_humanpages/agent.py the flow is response = httpx.post(...); response.raise_for_status(), so HTTP errors are surfaced from raise_for_status, not from httpx.post itself. Making httpx.post raise directly via side_effect works today because the agent's except (httpx.HTTPError, httpx.TimeoutException, ValueError) catches it, but it bypasses the real raise_for_status code path (and would silently stop validating it if someone refactors to a Client/retry wrapper). Prefer returning a 500 response and letting raise_for_status trigger the fallback:

Proposed change
-        with (
-            patch.object(
-                httpx, "post",
-                side_effect=httpx.HTTPStatusError(
-                    "500", request=agentql_error.request, response=agentql_error
-                ),
-            ),
+        with (
+            patch.object(httpx, "post", return_value=agentql_error),
             patch.object(hp_client, "search_humans", return_value=mock_humans),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/tests/unit_tests/test_agent.py` around lines 78 - 107, The test
currently makes httpx.post raise an httpx.HTTPStatusError directly, but the real
AgentQL error path is triggered by response.raise_for_status(); modify
test_fallback_on_agentql_http_error so httpx.post returns the mocked 500
response (agentql_error) instead of raising, e.g. patch.object(httpx, "post",
return_value=agentql_error), leaving the hp_client method patches and assertions
unchanged so response.raise_for_status() inside agent._call/agent.extract
triggers the fallback.
humanpages/docs/human_fallback.ipynb (1)

30-35: Nit: use os.environ.setdefault or getpass to avoid overwriting real keys in a learning notebook.

Running this cell as-is will overwrite any real AGENTQL_API_KEY/HUMANPAGES_API_KEY the user already has configured with the placeholder values, which then causes the HumanFallbackAgent() cell below to succeed instantiation but fail at request time with a confusing 401. Consider os.environ.setdefault(...) or prompting with getpass.getpass().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/docs/human_fallback.ipynb` around lines 30 - 35, The notebook cell
currently overwrites real credentials by setting os.environ["AGENTQL_API_KEY"]
and os.environ["HUMANPAGES_API_KEY"]; change this to avoid clobbering real keys
by using os.environ.setdefault("AGENTQL_API_KEY", "<placeholder>") and
os.environ.setdefault("HUMANPAGES_API_KEY", "<placeholder>") or prompt for
secrets with getpass.getpass() before instantiating HumanFallbackAgent(), so
existing environment values are preserved and users must explicitly enter
placeholders.
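A minimal demonstration of why `os.environ.setdefault` is the safer choice in the notebook; the variable names here are demo placeholders, not the real API keys.

```python
import os

# Simulate a user who already exported a real key in their shell.
os.environ["DEMO_API_KEY"] = "real-key-from-shell"

# setdefault only writes the placeholder when the variable is absent,
# so a previously exported real key survives re-running the cell.
os.environ.setdefault("DEMO_API_KEY", "<placeholder>")
preserved = os.environ["DEMO_API_KEY"]

# An unset variable does receive the placeholder.
os.environ.setdefault("DEMO_OTHER_KEY", "<placeholder>")
added = os.environ["DEMO_OTHER_KEY"]
```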
humanpages/examples/human_fallback_scraper/human_fallback_scraper.py (1)

36-47: Minor: guard against missing keys in fallback branch.

If the Human Pages job was cancelled or otherwise incomplete, result["messages"] may be an empty list (see test_fallback_cancelled_job in humanpages/tests/unit_tests/test_agent.py), and callers currently get a silent "no output" with no indication something went wrong. Consider printing the status upfront and a hint when messages is empty so the example surfaces the cancelled/timeout case instead of appearing to succeed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/examples/human_fallback_scraper/human_fallback_scraper.py` around
lines 36 - 47, The fallback branch handling the human pages result should guard
against missing or empty messages from agent.extract(url, query); update the
else branch that inspects result (the block that prints Job status and iterates
result["messages"]) to use result.get("messages", []) and to print the job
status first, then if the messages list is empty print a clear hint like "No
messages returned — job may be cancelled or timed out (status: ...)" instead of
silently doing nothing; ensure you still iterate and print each message when
present so human_fallback_scraper.py surfaces cancelled/timeout cases.
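A sketch of the suggested guard; the helper name and exact wording are hypothetical, the point is using `.get` with a default and surfacing the status when messages are empty.

```python
def summarize_human_result(result: dict) -> list:
    """Build print-friendly lines that surface cancelled/timed-out jobs."""
    lines = ["Job status: " + str(result.get("status", "unknown"))]
    messages = result.get("messages", [])
    if not messages:
        # Make the empty case explicit instead of printing nothing.
        lines.append("No messages returned; the job may have been cancelled or timed out.")
    lines.extend(str(m) for m in messages)
    return lines

cancelled = summarize_human_result({"status": "cancelled", "messages": []})
```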
humanpages/README.md (1)

33-37: Docs inconsistency: fallback return shape.

The quick-start prints result["messages"] in the humanpages branch, but per the section below (Line 67-69) and test_fallback_cancelled_job, messages can be an empty list when a job is cancelled. Consider showing result["status"] alongside messages in the quickstart so users don't think an empty output means success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/README.md` around lines 33 - 37, The quick-start prints only
result["messages"] for the human branch which can be empty on cancelled jobs;
update the README's conditional (the block checking if result["source"] ==
"agentql") so that in the else branch it prints both result["status"] and
result["messages"] (or otherwise includes status text) to make
cancellation/empty-result cases explicit — reference the result dict, the
"agentql" branch, the "messages" key and the "status" key (and note
test_fallback_cancelled_job) when making the change.
humanpages/tests/unit_tests/test_client.py (2)

78-95: Strengthen the payload assertion.

test_create_job_with_custom_params checks priceUsdc/deadlineHours but not that the request hit CREATE_JOB_ENDPOINT nor that humanId/title/description are serialized with the expected camelCase keys. Given the payload shape is the API contract, asserting the URL and required keys here would catch silent field-rename regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/tests/unit_tests/test_client.py` around lines 78 - 95, The test
test_create_job_with_custom_params should also assert that the HTTP request was
made to the expected endpoint and that all required fields are serialized with
the API's camelCase keys; update the test to check that the mocked httpx.post
was called with the CREATE_JOB_ENDPOINT (or the same URL string used by
HumanPagesClient.create_job) and assert that the JSON payload includes
"humanId", "title", and "description" (in addition to the existing checks for
"priceUsdc" and "deadlineHours") to catch silent field-rename regressions.

52-60: Same mock-vs-reality gap as in test_agent.py.

HumanPagesClient.search_humans (see humanpages/agentql_humanpages/client.py Lines 74-100) calls httpx.get(...) then response.raise_for_status() inside a try/except httpx.HTTPStatusError. Raising HTTPStatusError from httpx.get itself sidesteps raise_for_status; a return-a-401-response pattern mirrors production more faithfully:

Proposed change
-        with patch.object(httpx, "get", side_effect=httpx.HTTPStatusError(
-            "401", request=resp.request, response=resp
-        )):
+        with patch.object(httpx, "get", return_value=resp):
             with pytest.raises(ValueError, match="Invalid Human Pages API key"):
                 client.search_humans()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/tests/unit_tests/test_client.py` around lines 52 - 60, The test
currently raises httpx.HTTPStatusError from httpx.get itself which bypasses the
client code's response.raise_for_status path; change
test_search_humans_unauthorized to have httpx.get return the mocked 401 Response
(use the existing resp from _mock_response(401, ...)) instead of raising, so
that HumanPagesClient.search_humans calls response.raise_for_status and that
triggers the HTTPStatusError; update the patch to patch.object(httpx, "get",
return_value=resp) (or make resp.raise_for_status raise httpx.HTTPStatusError)
while keeping the pytest.raises(ValueError, match="Invalid Human Pages API key")
assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@humanpages/agentql_humanpages/agent.py`:
- Around line 249-250: The code currently checks that at least one of query or
prompt is provided but does not enforce mutual exclusivity; update the public
entry points (add the reciprocal check in aextract() and the other public method
that currently forwards both to AgentQL) to raise a ValueError when both query
and prompt are supplied (e.g., add an if query and prompt: raise
ValueError("Only one of 'query' or 'prompt' may be provided.")). Ensure you
apply this exact mutual-exclusion check in the functions that call AgentQL so
neither method ever sends both parameters.
- Around line 258-262: The logs currently output full user-provided URLs (in the
logger.info call with message "AgentQL returned empty data for %s, falling back
to human." and the AGENTQL_EXTRACTION_FAILED log) which may include sensitive
query strings/fragments; update both places (the extract() callsite and
aextract()) to log a redacted URL instead by introducing or using a helper like
redact_url(url) that strips query, fragment and userinfo (keeping only
scheme+host+path or masking the netloc) and replace the raw url with
redact_url(url) in the logger.info calls so no sensitive data is persisted.
- Around line 310-317: aextract() currently calls the synchronous
_delegate_to_human() which blocks the event loop; implement an async counterpart
_adelegate_to_human() that mirrors _delegate_to_human() but uses the
HumanPagesClient async methods (asearch_humans(), acreate_job(),
aget_job_status(), aget_job_messages()) and uses await asyncio.sleep(...) for
polling instead of time.sleep, then update aextract() to await
_adelegate_to_human(...) instead of calling the sync method so delegation no
longer blocks the event loop.

In `@humanpages/agentql_humanpages/client.py`:
- Around line 149-170: The job_id is being interpolated directly into endpoint
paths (JOB_STATUS_ENDPOINT, JOB_MESSAGES_ENDPOINT) which allows chars like '/',
'?', or '#' to break the URL; before formatting the endpoints in get_job_status,
get_job_messages (and the async counterparts aget_job_status, aget_job_messages)
URL-encode job_id (e.g., encoded_job_id = urllib.parse.quote(job_id, safe=""))
and use that encoded_job_id when calling .format(...); update imports to include
urllib.parse.quote if needed and ensure all four methods use the same
encoded_job_id pattern.
- Around line 66-72: When extracting an error message from e.response.json(),
guard against non-dict JSON bodies: call e.response.json() into error_json,
check if isinstance(error_json, dict) before using error_json.get(...); if it is
not a dict (e.g. list or string), set msg = str(error_json). Update the
exception handling around the try block (where msg, error_json and e are used)
so it doesn't call .get on non-dict objects and still falls back to f"HTTP {e}"
on JSON parse errors.
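The job_id encoding fix can be illustrated as follows; the endpoint template is an assumed stand-in for the real constant in const.py.

```python
from urllib.parse import quote

JOB_STATUS_ENDPOINT = "/api/jobs/{job_id}"  # assumed shape of the real constant

def job_status_path(job_id: str) -> str:
    """Percent-encode job_id so '/', '?', and '#' cannot alter the route."""
    return JOB_STATUS_ENDPOINT.format(job_id=quote(job_id, safe=""))

# Path-traversal and query characters are neutralized by the encoding.
path = job_status_path("abc/../2?x=1")
```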

In `@humanpages/Makefile`:
- Line 38: The current Makefile line using PYTHON_FILES and MYPY_CACHE lets
poetry run mypy even when PYTHON_FILES is empty because the guard isn't grouped;
update the rule that references PYTHON_FILES and MYPY_CACHE so the existence
check and mkdir are grouped (e.g., wrap the guard and mkdir together) before the
&& poetry run mypy invocation, ensuring poetry run mypy only executes when
PYTHON_FILES is non-empty.
- Line 30: The lint_diff/format_diff target is using the wrong path and base
branch; update the shell command that defines PYTHON_FILES to diff the
humanpages package and compare against the main branch instead of master by
replacing the path token "libs/partners/agentql" with "humanpages/" (or the
exact package subpath under humanpages if applicable) and changing the base
branch string "master" to "main" so the grep for '\.py$$|\.ipynb$$' still runs
on the correct changed files.

In `@humanpages/pyproject.toml`:
- Around line 56-72: Update the outdated dev dependency constraints in the
pyproject.toml groups: under [tool.poetry.group.test.dependencies] bump pytest
from ^7.4.3 to a constraint that allows v9 (e.g., ^9.0.3 or >=9.0.3,<10.0.0),
bump pytest-asyncio from ^0.23.2 to allow v1.x (e.g., ^1.3.0 or >=1.3.0,<2.0.0),
and update respx from ^0.21.1 to a newer constraint that includes 0.23.1 (e.g.,
^0.23.1); under [tool.poetry.group.typing.dependencies] update mypy from ^1.10
to a constraint that includes 1.20.1 (e.g., ^1.20.1 or >=1.20.1,<2.0.0); also
verify and, if needed, update pytest-watcher and codespell in
[tool.poetry.group.test.dependencies] and
[tool.poetry.group.codespell.dependencies] respectively so all dev deps allow
the current stable releases.
- Around line 14-15: In the [tool.mypy] section change the disallow_untyped_defs
setting from a quoted string to a TOML boolean by replacing
disallow_untyped_defs = "True" with disallow_untyped_defs = true so mypy sees a
proper boolean value; locate the setting under the [tool.mypy] header and update
the literal accordingly.
- Around line 20-24: The pydantic dependency spec is currently `^2.0` which
allows installing versions vulnerable to CVE-2024-3772; update the dependency
line for pydantic in pyproject.toml to require at least 2.4.0 (for example
change the version spec to `^2.4.0`) so the resolver will only allow patched
releases and then run your dependency update/install to refresh the lockfile.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13fe43ea-9e4a-40ce-9cfa-359bb35418f6

📥 Commits

Reviewing files that changed from the base of the PR and between 53b4898 and 9de5283.

📒 Files selected for processing (19)
  • humanpages/.gitignore
  • humanpages/LICENSE
  • humanpages/Makefile
  • humanpages/README.md
  • humanpages/agentql_humanpages/__init__.py
  • humanpages/agentql_humanpages/agent.py
  • humanpages/agentql_humanpages/client.py
  • humanpages/agentql_humanpages/const.py
  • humanpages/agentql_humanpages/messages.py
  • humanpages/agentql_humanpages/py.typed
  • humanpages/docs/human_fallback.ipynb
  • humanpages/examples/human_fallback_scraper/README.md
  • humanpages/examples/human_fallback_scraper/human_fallback_scraper.py
  • humanpages/pyproject.toml
  • humanpages/tests/__init__.py
  • humanpages/tests/integration_tests/__init__.py
  • humanpages/tests/unit_tests/__init__.py
  • humanpages/tests/unit_tests/test_agent.py
  • humanpages/tests/unit_tests/test_client.py

Comment thread humanpages/agentql_humanpages/agent.py Outdated
Comment on lines +258 to +262
            logger.info("AgentQL returned empty data for %s, falling back to human.", url)
        except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
            logger.info(
                AGENTQL_EXTRACTION_FAILED.format(url=url, detail=str(e))
            )


⚠️ Potential issue | 🟠 Major

Avoid logging full user-provided URLs.

These info logs can persist query strings/fragments that may contain tokens, emails, or other sensitive values. Log a redacted URL instead.

🛡️ Proposed fix
 import logging
 import os
 import time
 from typing import Any, Optional
+from urllib.parse import urlsplit, urlunsplit
+def _redact_url_for_log(url: str) -> str:
+    parsed = urlsplit(url)
+    return urlunsplit((parsed.scheme, parsed.netloc, parsed.path, "", ""))
+
+
 logger = logging.getLogger(__name__)
+            safe_url = _redact_url_for_log(url)
-            logger.info("AgentQL returned empty data for %s, falling back to human.", url)
+            logger.info("AgentQL returned empty data for %s, falling back to human.", safe_url)
         except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
+            safe_url = _redact_url_for_log(url)
             logger.info(
-                AGENTQL_EXTRACTION_FAILED.format(url=url, detail=str(e))
+                AGENTQL_EXTRACTION_FAILED.format(url=safe_url, detail=str(e))
             )

Apply the same redaction in aextract().

Also applies to: 297-301
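For illustration, the redaction approach from the proposed fix can be sketched standalone. The helper below follows the diff's `_redact_url_for_log`, with one extra assumption beyond it: it also strips `user:pass@` userinfo from the netloc, as the agent prompt suggests.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_for_log(url: str) -> str:
    """Keep only scheme, host(:port), and path; drop userinfo, query, fragment."""
    parsed = urlsplit(url)
    netloc = parsed.hostname or ""
    if parsed.port is not None:
        netloc = f"{netloc}:{parsed.port}"
    return urlunsplit((parsed.scheme, netloc, parsed.path, "", ""))


print(redact_url_for_log("https://user:secret@example.com:8443/items?token=abc#frag"))
# → https://example.com:8443/items
```

Anything sensitive in the query string, fragment, or credentials is gone before the URL reaches the log handler.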

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/agentql_humanpages/agent.py` around lines 258 - 262, The logs
currently output full user-provided URLs (in the logger.info call with message
"AgentQL returned empty data for %s, falling back to human." and the
AGENTQL_EXTRACTION_FAILED log) which may include sensitive query
strings/fragments; update both places (the extract() callsite and aextract()) to
log a redacted URL instead by introducing or using a helper like redact_url(url)
that strips query, fragment and userinfo (keeping only scheme+host+path or
masking the netloc) and replace the raw url with redact_url(url) in the
logger.info calls so no sensitive data is persisted.

Comment thread humanpages/pyproject.toml
Comment on lines +56 to +72
[tool.poetry.group.test.dependencies]
pytest = "^7.4.3"
pytest-asyncio = "^0.23.2"
pytest-socket = "^0.7.0"
pytest-watcher = "^0.3.4"
respx = "^0.21.1"

[tool.poetry.group.codespell.dependencies]
codespell = "^2.2.6"

[tool.poetry.group.test_integration.dependencies]

[tool.poetry.group.lint.dependencies]
ruff = "^0.5"

[tool.poetry.group.typing.dependencies]
mypy = "^1.10"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Latest versions of pytest, pytest-asyncio, pytest-socket, respx, ruff, mypy in April 2026

💡 Result:

  • pytest: 9.0.3 (released 2026-04-07)
  • pytest-asyncio: 1.3.0 (latest stable, released 2025-11-10; 1.4.0a1 prerelease on 2026-04-15)
  • pytest-socket: 0.7.0 (released 2024-01-28)
  • respx: 0.23.1 (released 2026-04-08)
  • ruff: 0.15.10 (released 2026-04-09)
  • mypy: 1.20.1 (released 2026-04-13)


Update outdated development dependencies to match current releases.

Several development dependencies are significantly outdated:

  • pytest: Constraint ^7.4.3 caps at <8.0.0; latest is 9.0.3
  • pytest-asyncio: Constraint ^0.23.2 caps at <1.0.0; latest is 1.3.0 (major version bump available)
  • respx: Constraint ^0.21.1 is behind; latest is 0.23.1
  • mypy: Constraint ^1.10 is behind; latest is 1.20.1

pytest-socket (0.7.0) is current. Verify pytest-watcher and codespell versions separately, then update constraints to allow compatible newer releases or pin to latest stable versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/pyproject.toml` around lines 56 - 72, Update the outdated dev
dependency constraints in the pyproject.toml groups: under
[tool.poetry.group.test.dependencies] bump pytest from ^7.4.3 to a constraint
that allows v9 (e.g., ^9.0.3 or >=9.0.3,<10.0.0), bump pytest-asyncio from
^0.23.2 to allow v1.x (e.g., ^1.3.0 or >=1.3.0,<2.0.0), and update respx from
^0.21.1 to a newer constraint that includes 0.23.1 (e.g., ^0.23.1); under
[tool.poetry.group.typing.dependencies] update mypy from ^1.10 to a constraint
that includes 1.20.1 (e.g., ^1.20.1 or >=1.20.1,<2.0.0); also verify and, if
needed, update pytest-watcher and codespell in
[tool.poetry.group.test.dependencies] and
[tool.poetry.group.codespell.dependencies] respectively so all dev deps allow
the current stable releases.
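As background for why `^7.4.3` caps at `<8.0.0` while `^0.23.2` caps at `<0.24.0`: Poetry's caret operator allows upgrades only up to the left-most non-zero version component. A small sketch of that rule (the `caret_range` helper is hypothetical, written just to illustrate the semantics):

```python
def caret_range(constraint: str) -> str:
    """Translate a Poetry caret constraint into its equivalent version range."""
    version = constraint.lstrip("^")
    parts = [int(p) for p in version.split(".")]
    # The caret permits changes below the left-most non-zero component.
    for i, p in enumerate(parts):
        if p != 0:
            upper = parts[:i] + [p + 1] + [0] * (len(parts) - i - 1)
            break
    else:
        upper = parts[:-1] + [parts[-1] + 1]
    return f">={version},<{'.'.join(map(str, upper))}"


print(caret_range("^7.4.3"))   # → >=7.4.3,<8.0.0
print(caret_range("^0.23.2"))  # → >=0.23.2,<0.24.0
```

This is why `^0.23.2` can never resolve to respx 0.23.1's successor minors, and `^7.4.3` can never pick up pytest 8 or 9 without a constraint bump.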

- Fix TOML boolean syntax for mypy disallow_untyped_defs
- Bump pydantic minimum to ^2.4.0 (CVE-2024-3772)
- Add isinstance check before calling .get() on error_json
- URL-encode job_id in endpoint path interpolation
- Enforce mutual exclusivity of query/prompt parameters
- Add async _adelegate_to_human to avoid blocking event loop
- Fix Makefile lint_diff path and branch, group mypy guard
- Update LICENSE year to 2025
- Add trailing slash to __pycache__ in .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
humanpages/Makefile (1)

30-30: ⚠️ Potential issue | 🟡 Minor

Make the diff base configurable and use merge-base semantics.

Line 30 fixed the old package path, but it still hard-codes main and compares against that ref directly. This can fail when only origin/main exists, and it can pick up files that changed only on the target branch after the branches diverge.

🛠️ Proposed fix
 MYPY_CACHE=.mypy_cache
+BASE_REF ?= main
 lint format: PYTHON_FILES=.
-lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative --name-only --diff-filter=d main | grep -E '\.py$$|\.ipynb$$')
+lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative --name-only --diff-filter=d $(BASE_REF)...HEAD | grep -E '\.py$$|\.ipynb$$')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/Makefile` at line 30, The PYTHON_FILES assignment currently
hard-codes "main" in the git diff call; make the diff base configurable and
compute a proper merge-base (fork-point fallback) before running git diff. Add a
DIFF_BASE variable (defaulting to origin/main), compute MERGE_BASE using git
merge-base --fork-point ${DIFF_BASE} HEAD with a fallback to git merge-base
${DIFF_BASE} HEAD, and then use that MERGE_BASE in the git diff command used by
the PYTHON_FILES assignment instead of the literal "main". Ensure the new
variables replace the old literal so CI works when only origin/main exists or
when target-branch-only changes are present.
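The fork-point-with-fallback logic from the prompt can be demonstrated in a throwaway repo. Everything below (branch names, file names, the temp repo itself) is invented for the demo; it assumes git ≥ 2.28 for `git init -b`.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main .
git config user.email demo@example.com && git config user.name demo
echo a > a.py && git add a.py && git commit -qm init
git checkout -qb feature
echo b > b.py && git add b.py && git commit -qm "add b"
DIFF_BASE=${DIFF_BASE:-main}
# Try fork-point first, fall back to a plain merge-base, per the prompt.
MERGE_BASE=$(git merge-base --fork-point "$DIFF_BASE" HEAD 2>/dev/null \
  || git merge-base "$DIFF_BASE" HEAD)
# Only files changed on this branch since the fork are selected for linting.
git diff --relative --name-only --diff-filter=d "$MERGE_BASE" | grep -E '\.py$|\.ipynb$'
```

Diffing against the merge base rather than the ref itself means commits that land on main after the branch diverges never show up in `PYTHON_FILES`.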
🧹 Nitpick comments (2)
humanpages/agentql_humanpages/agent.py (2)

309-309: Remove redundant httpx.TimeoutException from exception handlers.

httpx.TimeoutException is a subclass of httpx.HTTPError, so catching it explicitly is redundant. Apply at L309 and L348:

♻️ Proposed changes
-        except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
+        except (httpx.HTTPError, ValueError) as e:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/agentql_humanpages/agent.py` at line 309, The except clauses
currently catching (httpx.HTTPError, httpx.TimeoutException, ValueError) are
redundant because httpx.TimeoutException subclasses httpx.HTTPError; remove
httpx.TimeoutException from those tuples so they become (httpx.HTTPError,
ValueError) in both places (the except that reads "except (httpx.HTTPError,
httpx.TimeoutException, ValueError) as e:" and the other similar handler) to
avoid duplicate handling while preserving error semantics.

161-169: Consider reusing a single httpx.AsyncClient across calls.

The _agentql_extract_async method instantiates a new AsyncClient on every invocation, which prevents connection pooling and incurs connection/TLS overhead for each call. Holding a client on the instance (created lazily, closed via aclose() or async context manager on the agent) would enable connection pooling.

However, note that the synchronous _agentql_extract method has the same pattern with httpx.post(), so this is a design consistency issue rather than an async-specific concern. Also, since the agent appears designed for single-call usage per instance (based on the current API and tests), the practical benefit depends on use cases that reuse a single agent instance across multiple extractions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@humanpages/agentql_humanpages/agent.py` around lines 161 - 169, The code
creates a new httpx.AsyncClient for every call in _agentql_extract_async (and
similarly uses httpx.post in _agentql_extract), preventing connection pooling;
change the class to hold a lazily-created instance attribute (e.g.,
self._async_client) and reuse it for requests to AGENTQL_EXTRACT_DATA_ENDPOINT
with timeout self._agentql_timeout, and add an async close method (or implement
__aenter__/__aexit__ / aclose usage) that calls self._async_client.aclose() to
clean up; ensure the synchronous path is made consistent (reuse a persistent
httpx.Client or document single-use behavior) and guard client creation to be
idempotent.
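A dependency-free sketch of the lazy, reusable client the nitpick describes. `FakeAsyncClient` stands in for `httpx.AsyncClient` so the example runs standalone; the `Agent` class, method names, and URLs are illustrative, not the PR's actual API.

```python
import asyncio
from typing import Optional


class FakeAsyncClient:
    """Stand-in for httpx.AsyncClient so the sketch runs without httpx."""

    def __init__(self) -> None:
        self.closed = False
        self.requests = 0

    async def post(self, url: str) -> str:
        self.requests += 1
        return f"POST {url}"

    async def aclose(self) -> None:
        self.closed = True


class Agent:
    """Lazily creates one client and reuses it across calls."""

    def __init__(self) -> None:
        self._async_client: Optional[FakeAsyncClient] = None

    def _client(self) -> FakeAsyncClient:
        if self._async_client is None:  # idempotent lazy creation
            self._async_client = FakeAsyncClient()
        return self._async_client

    async def extract(self, url: str) -> str:
        return await self._client().post(url)

    async def aclose(self) -> None:
        if self._async_client is not None:
            await self._async_client.aclose()


async def demo() -> None:
    agent = Agent()
    await agent.extract("https://example.com/a")
    await agent.extract("https://example.com/b")
    assert agent._client().requests == 2  # both calls shared one client
    await agent.aclose()
    assert agent._client().closed


asyncio.run(demo())
```

With a real `httpx.AsyncClient` the shared instance would also pool connections and reuse TLS sessions, which is the payoff the nitpick is after; whether that matters depends on whether callers reuse one agent across extractions.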
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@humanpages/Makefile`:
- Line 16: The ptw invocation places pytest-specific args before the watched
path; move reserved ptw options (e.g., --now) before the path, put the watched
path (.) next, then add the separator -- and the pytest options (e.g., -vv,
--snapshot-update, $(TEST_FILE)) afterwards; update the command string used in
the Makefile so it follows the pattern "ptw [ptw-opts] <watched-path> --
[pytest-opts]" to ensure pytest-watcher parses options correctly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdb6c6ef-cf96-4b31-828b-bac2221c9a29

📥 Commits

Reviewing files that changed from the base of the PR and between 9de5283 and cfe98e9.

📒 Files selected for processing (6)
  • humanpages/.gitignore
  • humanpages/LICENSE
  • humanpages/Makefile
  • humanpages/agentql_humanpages/agent.py
  • humanpages/agentql_humanpages/client.py
  • humanpages/pyproject.toml
✅ Files skipped from review due to trivial changes (3)
  • humanpages/LICENSE
  • humanpages/.gitignore
  • humanpages/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • humanpages/agentql_humanpages/client.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@human-pages-ai
Author

Disclosure: I'm a maintainer of Human Pages. Happy to adjust the integration if there's anything you'd like changed to better fit the repo's conventions.
