Add Human Pages integration for human fallback #27

human-pages-ai wants to merge 3 commits into tinyfish-io:main from
Conversation
AgentQL + Human Pages: when automated extraction fails, delegate to a real human. Includes sync/async API, unit tests, example script, and docs notebook. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This pull request adds the agentql-humanpages integration package. It implements HumanFallbackAgent (sync and async), which attempts AgentQL extraction and falls back to Human Pages jobs when AgentQL fails or returns empty results. It introduces HumanPagesClient for REST interactions, configuration constants and error messages, packaging and build files, examples, documentation, unit/integration tests, and CI/Makefile targets.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HumanFallbackAgent
    participant AgentQL_API as "AgentQL API"
    participant HumanPagesClient
    participant HumanPages_API as "Human Pages API"
    Client->>HumanFallbackAgent: extract(url, query/prompt)
    HumanFallbackAgent->>HumanFallbackAgent: validate query/prompt
    HumanFallbackAgent->>AgentQL_API: POST /v1/query-data (url, query/prompt, mode)
    alt AgentQL returns data
        AgentQL_API-->>HumanFallbackAgent: 200 + data
        HumanFallbackAgent-->>Client: {"source":"agentql","data":...}
    else AgentQL error or empty data
        AgentQL_API-->>HumanFallbackAgent: error or empty
        HumanFallbackAgent->>HumanPagesClient: search_humans(skill="web task", available=True)
        HumanPagesClient->>HumanPages_API: GET /api/humans/search
        HumanPages_API-->>HumanPagesClient: humans list
        HumanPagesClient-->>HumanFallbackAgent: humans list
        HumanFallbackAgent->>HumanPagesClient: create_job(humanId, title, description, priceUsdc, deadlineHours)
        HumanPagesClient->>HumanPages_API: POST /api/jobs
        HumanPages_API-->>HumanPagesClient: job created (job_id)
        HumanPagesClient-->>HumanFallbackAgent: job details
        loop poll until terminal or max attempts
            HumanFallbackAgent->>HumanPagesClient: get_job_status(job_id)
            HumanPagesClient->>HumanPages_API: GET /api/jobs/{job_id}
            HumanPages_API-->>HumanPagesClient: job status
            HumanPagesClient-->>HumanFallbackAgent: status
        end
        HumanFallbackAgent->>HumanPagesClient: get_job_messages(job_id)
        HumanPagesClient->>HumanPages_API: GET /api/jobs/{job_id}/messages
        HumanPages_API-->>HumanPagesClient: messages
        HumanPagesClient-->>HumanFallbackAgent: messages
        HumanFallbackAgent-->>Client: {"source":"humanpages","job_id":...,"status":...,"messages":[...]}
    end
```
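The branching logic in the diagram can be sketched in a few lines. This is an illustrative stand-in, not the package's actual implementation: the function name and the stubbed callables are assumptions; only the try-AgentQL-then-delegate shape and the `{"source": ...}` return contract come from the walkthrough.

```python
# Illustrative sketch of the fallback control flow described above.
# The real HumanFallbackAgent calls the AgentQL and Human Pages REST APIs;
# here both calls are stubbed so only the branching is visible.
from typing import Callable, Optional


def extract_with_fallback(
    url: str,
    query: str,
    agentql_call: Callable[[str, str], Optional[dict]],
    human_call: Callable[[str, str], dict],
) -> dict:
    """Try automated extraction first; delegate to a human on error or empty data."""
    try:
        data = agentql_call(url, query)
        if data:  # non-empty result: automation succeeded
            return {"source": "agentql", "data": data}
    except Exception:
        pass  # extraction failed; fall through to the human path
    job = human_call(url, query)
    return {"source": "humanpages", **job}


# Stub "APIs" for demonstration only.
result = extract_with_fallback(
    "https://example.com",
    "{ prices[] }",
    agentql_call=lambda url, q: None,  # simulate empty AgentQL data
    human_call=lambda url, q: {"job_id": "job-1", "status": "completed", "messages": []},
)
```

An empty AgentQL result routes to the human path, so `result["source"]` ends up as `"humanpages"` with the job fields merged in.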
🚥 Pre-merge checks | ✅ Passed checks (4 passed)
Actionable comments posted: 10
🧹 Nitpick comments (11)
humanpages/.gitignore (1)

1-1: Use a directory pattern for clarity.

`__pycache__` works, but `__pycache__/` is clearer and explicitly targets directories.

Suggested tweak

```diff
-__pycache__
+__pycache__/
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/.gitignore` at line 1: Replace the bare "__pycache__" entry with the directory-specific pattern "__pycache__/" so the rule explicitly targets the cache directories.

humanpages/pyproject.toml (2)
5-12: Consider populating the authors field.

The `authors` field is currently empty. While not required, populating this field with author information is recommended for published packages to provide proper attribution.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/pyproject.toml` around lines 5-12: The authors field under the [tool.poetry] section is empty; populate the authors array with one or more author strings (e.g., "Name <email>") to provide proper attribution for the package release and metadata.

1-3: Consider updating the poetry-core version constraint.

The minimum version constraint `>=1.0.0` includes poetry-core versions from 2020. Modern Poetry projects typically specify more recent versions to benefit from bug fixes and improvements.

📦 Suggested update

```diff
-requires = ["poetry-core>=1.0.0"]
+requires = ["poetry-core>=1.9.0"]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/pyproject.toml` around lines 1-3: Update the build-system requirement for poetry-core to a more recent minimum (for example "poetry-core>=1.4.0" or your chosen supported minimum), keeping build-backend = "poetry.core.masonry.api" unchanged, then run your packaging/build checks to confirm compatibility.

humanpages/LICENSE (1)
1-21: LGTM! Consider copyright year.The MIT License text is standard and correct. The copyright year is set to 2024, which may represent the original creation date. If this package is being published in 2026, you may optionally want to update it to reflect the current year or use a range (e.g., "2024-2026").
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@humanpages/LICENSE` around lines 1-21: Update the copyright line; change the year "2024" to the current year or a year range (for example "2024-2026") so the MIT header reflects the publication timeframe.

humanpages/tests/unit_tests/test_agent.py (2)
36-42: Test may leak environment values — use `clear=True`.

Unlike the two tests below (lines 45, 50), this one does not clear the environment before patching, so a developer who exports `AGENTQL_API_KEY`/`HUMANPAGES_API_KEY` in their shell can still pass if `patch.dict` ordering surprises them. Safer to mirror the other tests:

Proposed change

```diff
-    with patch.dict("os.environ", {
-        "AGENTQL_API_KEY": "env-aql",
-        "HUMANPAGES_API_KEY": "env-hp",
-    }):
+    with patch.dict(
+        "os.environ",
+        {"AGENTQL_API_KEY": "env-aql", "HUMANPAGES_API_KEY": "env-hp"},
+        clear=True,
+    ):
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/tests/unit_tests/test_agent.py` around lines 36-42: test_init_from_env uses patch.dict("os.environ", {...}) without clear=True, which can allow real environment variables to leak into the test; update the patch call to use patch.dict("os.environ", {...}, clear=True) to match the other tests and ensure HumanFallbackAgent initialization reads only the provided keys.
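The behavior the comment relies on is easy to see in isolation. A minimal demonstration, with an illustrative variable name (`LEAKY_KEY` is not from the package):

```python
# Demonstrates why clear=True matters: without it, pre-existing environment
# variables remain visible inside the patched context.
import os
from unittest.mock import patch

os.environ["LEAKY_KEY"] = "from-shell"  # simulate a developer's shell export

with patch.dict("os.environ", {"AGENTQL_API_KEY": "env-aql"}):
    leaks_without_clear = "LEAKY_KEY" in os.environ  # True: old vars leak through

with patch.dict("os.environ", {"AGENTQL_API_KEY": "env-aql"}, clear=True):
    leaks_with_clear = "LEAKY_KEY" in os.environ  # False: environment starts empty

del os.environ["LEAKY_KEY"]  # clean up the simulated export
```

Both contexts restore the original environment on exit; `clear=True` only changes what is visible inside the `with` block.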
78-107: Mock doesn't reflect real AgentQL failure path.

In `humanpages/agentql_humanpages/agent.py` the flow is `response = httpx.post(...); response.raise_for_status()`, so HTTP errors are surfaced from `raise_for_status`, not from `httpx.post` itself. Making `httpx.post` raise directly via `side_effect` works today because the agent's `except (httpx.HTTPError, httpx.TimeoutException, ValueError)` catches it, but it bypasses the real `raise_for_status` code path (and would silently stop validating it if someone refactors to a `Client`/retry wrapper). Prefer returning a 500 response and letting `raise_for_status` trigger the fallback:

Proposed change

```diff
-    with (
-        patch.object(
-            httpx, "post",
-            side_effect=httpx.HTTPStatusError(
-                "500", request=agentql_error.request, response=agentql_error
-            ),
-        ),
+    with (
+        patch.object(httpx, "post", return_value=agentql_error),
         patch.object(hp_client, "search_humans", return_value=mock_humans),
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/tests/unit_tests/test_agent.py` around lines 78-107: The test currently makes httpx.post raise an httpx.HTTPStatusError directly, but the real AgentQL error path is triggered by response.raise_for_status(); modify test_fallback_on_agentql_http_error so httpx.post returns the mocked 500 response (agentql_error) instead of raising, e.g. patch.object(httpx, "post", return_value=agentql_error), leaving the hp_client method patches and assertions unchanged so response.raise_for_status() inside agent._call/agent.extract triggers the fallback.
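The difference between the two mocking styles can be shown with a tiny stand-in response class (deliberately not httpx, so it runs anywhere): returning an error response exercises the real `raise_for_status` path, whereas a raising call would skip it.

```python
# Stand-in types for illustration only; the real code uses httpx.
class FakeHTTPStatusError(Exception):
    pass


class FakeResponse:
    def __init__(self, status_code: int):
        self.status_code = status_code
        self.raise_for_status_called = False

    def raise_for_status(self) -> None:
        self.raise_for_status_called = True  # records that this path ran
        if self.status_code >= 400:
            raise FakeHTTPStatusError(str(self.status_code))


def fetch(post):
    """Mimics the agent's shape: post, then raise_for_status, then fall back."""
    try:
        response = post()
        response.raise_for_status()
        return "agentql"
    except FakeHTTPStatusError:
        return "humanpages"


error_response = FakeResponse(500)
source = fetch(lambda: error_response)             # return-value mocking style
covered = error_response.raise_for_status_called   # True: the real path was exercised
```

With the `side_effect` style, `post()` itself would raise and `raise_for_status_called` would stay `False`, which is exactly the coverage gap the comment describes.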
humanpages/docs/human_fallback.ipynb (1)

30-35: Nit: use `os.environ.setdefault` or `getpass` to avoid overwriting real keys in a learning notebook.

Running this cell as-is will overwrite any real `AGENTQL_API_KEY`/`HUMANPAGES_API_KEY` the user already has configured with the placeholder values, which then causes the `HumanFallbackAgent()` cell below to succeed at instantiation but fail at request time with a confusing 401. Consider `os.environ.setdefault(...)` or prompting with `getpass.getpass()`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/docs/human_fallback.ipynb` around lines 30-35: The notebook cell currently overwrites real credentials by setting os.environ["AGENTQL_API_KEY"] and os.environ["HUMANPAGES_API_KEY"]; avoid clobbering real keys by using os.environ.setdefault("AGENTQL_API_KEY", "<placeholder>") and os.environ.setdefault("HUMANPAGES_API_KEY", "<placeholder>"), or prompt for secrets with getpass.getpass() before instantiating HumanFallbackAgent(), so existing environment values are preserved and users must explicitly enter placeholders.

humanpages/examples/human_fallback_scraper/human_fallback_scraper.py (1)
36-47: Minor: guard against missing keys in fallback branch.

If the Human Pages job was cancelled or otherwise incomplete, `result["messages"]` may be an empty list (see `test_fallback_cancelled_job` in `humanpages/tests/unit_tests/test_agent.py`), and callers currently get a silent "no output" with no indication something went wrong. Consider printing the status upfront and a hint when `messages` is empty so the example surfaces the cancelled/timeout case instead of appearing to succeed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/examples/human_fallback_scraper/human_fallback_scraper.py` around lines 36-47: Guard against missing or empty messages from agent.extract(url, query); update the else branch that inspects result (the block that prints job status and iterates result["messages"]) to use result.get("messages", []), print the job status first, and if the messages list is empty print a clear hint like "No messages returned — job may be cancelled or timed out (status: ...)" instead of silently doing nothing; still iterate and print each message when present so the example surfaces cancelled/timeout cases.

humanpages/README.md (1)
33-37: Docs inconsistency: fallback return shape.

The quick-start prints `result["messages"]` in the humanpages branch, but per the section below (lines 67-69) and `test_fallback_cancelled_job`, `messages` can be an empty list when a job is cancelled. Consider showing `result["status"]` alongside `messages` in the quickstart so users don't think an empty output means success.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/README.md` around lines 33-37: The quick-start prints only result["messages"] for the human branch, which can be empty on cancelled jobs; update the README's conditional (the block checking if result["source"] == "agentql") so that the else branch prints both result["status"] and result["messages"] (or otherwise includes status text) to make cancellation/empty-result cases explicit (see test_fallback_cancelled_job).

humanpages/tests/unit_tests/test_client.py (2)
78-95: Strengthen the payload assertion.

`test_create_job_with_custom_params` checks `priceUsdc`/`deadlineHours` but not that the request hits `CREATE_JOB_ENDPOINT`, nor that `humanId`/`title`/`description` are serialized with the expected camelCase keys. Given the payload shape is the API contract, asserting the URL and required keys here would catch silent field-rename regressions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/tests/unit_tests/test_client.py` around lines 78-95: The test should also assert that the HTTP request was made to the expected endpoint and that all required fields are serialized with the API's camelCase keys; check that the mocked httpx.post was called with CREATE_JOB_ENDPOINT (or the same URL string used by HumanPagesClient.create_job) and assert that the JSON payload includes "humanId", "title", and "description" (in addition to the existing checks for "priceUsdc" and "deadlineHours") to catch silent field-rename regressions.
52-60: Same mock-vs-reality gap as in `test_agent.py`.

`HumanPagesClient.search_humans` (see `humanpages/agentql_humanpages/client.py`, lines 74-100) calls `httpx.get(...)` then `response.raise_for_status()` inside a `try/except httpx.HTTPStatusError`. Raising `HTTPStatusError` from `httpx.get` itself sidesteps `raise_for_status`; a return-a-401-response pattern mirrors production more faithfully:

Proposed change

```diff
-    with patch.object(httpx, "get", side_effect=httpx.HTTPStatusError(
-        "401", request=resp.request, response=resp
-    )):
+    with patch.object(httpx, "get", return_value=resp):
         with pytest.raises(ValueError, match="Invalid Human Pages API key"):
             client.search_humans()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@humanpages/tests/unit_tests/test_client.py` around lines 52-60: The test currently raises httpx.HTTPStatusError from httpx.get itself, which bypasses the client code's response.raise_for_status path; change test_search_humans_unauthorized to have httpx.get return the mocked 401 response (use the existing resp from _mock_response(401, ...)) instead of raising, so that HumanPagesClient.search_humans calls response.raise_for_status and that triggers the HTTPStatusError; update the patch to patch.object(httpx, "get", return_value=resp) (or make resp.raise_for_status raise httpx.HTTPStatusError) while keeping the pytest.raises(ValueError, match="Invalid Human Pages API key") assertion.
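The stronger endpoint-and-payload assertion suggested for `test_create_job_with_custom_params` can be sketched with a plain `MagicMock`. The endpoint URL and payload values below are illustrative assumptions, not the package's real constants; only the camelCase key names come from the review comments.

```python
# Sketch: capture a mocked request call and assert on both the URL and the
# camelCase payload keys, so a silent field rename would fail the test.
from unittest.mock import MagicMock

CREATE_JOB_ENDPOINT = "https://humanpages.example/api/jobs"  # assumed value

mock_post = MagicMock()
# Stand-in for the code under test issuing the request:
mock_post(
    CREATE_JOB_ENDPOINT,
    json={
        "humanId": "human-1",
        "title": "Scrape listing",
        "description": "Extract prices",
        "priceUsdc": 5.0,
        "deadlineHours": 24,
    },
)

called_url = mock_post.call_args.args[0]
payload = mock_post.call_args.kwargs["json"]
```

In a real test, `mock_post` would replace `httpx.post` via `patch.object` and the client method would issue the call; `call_args.args`/`call_args.kwargs` then expose exactly what was sent.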
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@humanpages/agentql_humanpages/agent.py`:
- Around line 249-250: The code currently checks that at least one of query or
prompt is provided but does not enforce mutual exclusivity; update the public
entry points (add the reciprocal check in aextract() and the other public method
that currently forwards both to AgentQL) to raise a ValueError when both query
and prompt are supplied (e.g., add an if query and prompt: raise
ValueError("Only one of 'query' or 'prompt' may be provided.")). Ensure you
apply this exact mutual-exclusion check in the functions that call AgentQL so
neither method ever sends both parameters.
- Around line 258-262: The logs currently output full user-provided URLs (in the
logger.info call with message "AgentQL returned empty data for %s, falling back
to human." and the AGENTQL_EXTRACTION_FAILED log) which may include sensitive
query strings/fragments; update both places (the extract() callsite and
aextract()) to log a redacted URL instead by introducing or using a helper like
redact_url(url) that strips query, fragment and userinfo (keeping only
scheme+host+path or masking the netloc) and replace the raw url with
redact_url(url) in the logger.info calls so no sensitive data is persisted.
- Around line 310-317: aextract() currently calls the synchronous
_delegate_to_human() which blocks the event loop; implement an async counterpart
_adelegate_to_human() that mirrors _delegate_to_human() but uses the
HumanPagesClient async methods (asearch_humans(), acreate_job(),
aget_job_status(), aget_job_messages()) and uses await asyncio.sleep(...) for
polling instead of time.sleep, then update aextract() to await
_adelegate_to_human(...) instead of calling the sync method so delegation no
longer blocks the event loop.
In `@humanpages/agentql_humanpages/client.py`:
- Around line 149-170: The job_id is being interpolated directly into endpoint
paths (JOB_STATUS_ENDPOINT, JOB_MESSAGES_ENDPOINT) which allows chars like '/',
'?', or '#' to break the URL; before formatting the endpoints in get_job_status,
get_job_messages (and the async counterparts aget_job_status, aget_job_messages)
URL-encode job_id (e.g., encoded_job_id = urllib.parse.quote(job_id, safe=""))
and use that encoded_job_id when calling .format(...); update imports to include
urllib.parse.quote if needed and ensure all four methods use the same
encoded_job_id pattern.
- Around line 66-72: When extracting an error message from e.response.json(),
guard against non-dict JSON bodies: call e.response.json() into error_json,
check if isinstance(error_json, dict) before using error_json.get(...); if it is
not a dict (e.g. list or string), set msg = str(error_json). Update the
exception handling around the try block (where msg, error_json and e are used)
so it doesn't call .get on non-dict objects and still falls back to f"HTTP {e}"
on JSON parse errors.
In `@humanpages/Makefile`:
- Line 38: The current Makefile line using PYTHON_FILES and MYPY_CACHE lets
poetry run mypy even when PYTHON_FILES is empty because the guard isn't grouped;
update the rule that references PYTHON_FILES and MYPY_CACHE so the existence
check and mkdir are grouped (e.g., wrap the guard and mkdir together) before the
&& poetry run mypy invocation, ensuring poetry run mypy only executes when
PYTHON_FILES is non-empty.
- Line 30: The lint_diff/format_diff target is using the wrong path and base
branch; update the shell command that defines PYTHON_FILES to diff the
humanpages package and compare against the main branch instead of master by
replacing the path token "libs/partners/agentql" with "humanpages/" (or the
exact package subpath under humanpages if applicable) and changing the base
branch string "master" to "main" so the grep for '\.py$$|\.ipynb$$' still runs
on the correct changed files.
In `@humanpages/pyproject.toml`:
- Around line 56-72: Update the outdated dev dependency constraints in the
pyproject.toml groups: under [tool.poetry.group.test.dependencies] bump pytest
from ^7.4.3 to a constraint that allows v9 (e.g., ^9.0.3 or >=9.0.3,<10.0.0),
bump pytest-asyncio from ^0.23.2 to allow v1.x (e.g., ^1.3.0 or >=1.3.0,<2.0.0),
and update respx from ^0.21.1 to a newer constraint that includes 0.23.1 (e.g.,
^0.23.1); under [tool.poetry.group.typing.dependencies] update mypy from ^1.10
to a constraint that includes 1.20.1 (e.g., ^1.20.1 or >=1.20.1,<2.0.0); also
verify and, if needed, update pytest-watcher and codespell in
[tool.poetry.group.test.dependencies] and
[tool.poetry.group.codespell.dependencies] respectively so all dev deps allow
the current stable releases.
- Around line 14-15: In the [tool.mypy] section change the disallow_untyped_defs
setting from a quoted string to a TOML boolean by replacing
disallow_untyped_defs = "True" with disallow_untyped_defs = true so mypy sees a
proper boolean value; locate the setting under the [tool.mypy] header and update
the literal accordingly.
- Around line 20-24: The pydantic dependency spec is currently `^2.0` which
allows installing versions vulnerable to CVE-2024-3772; update the dependency
line for pydantic in pyproject.toml to require at least 2.4.0 (for example
change the version spec to `^2.4.0`) so the resolver will only allow patched
releases and then run your dependency update/install to refresh the lockfile.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 13fe43ea-9e4a-40ce-9cfa-359bb35418f6
📒 Files selected for processing (19)
- humanpages/.gitignore
- humanpages/LICENSE
- humanpages/Makefile
- humanpages/README.md
- humanpages/agentql_humanpages/__init__.py
- humanpages/agentql_humanpages/agent.py
- humanpages/agentql_humanpages/client.py
- humanpages/agentql_humanpages/const.py
- humanpages/agentql_humanpages/messages.py
- humanpages/agentql_humanpages/py.typed
- humanpages/docs/human_fallback.ipynb
- humanpages/examples/human_fallback_scraper/README.md
- humanpages/examples/human_fallback_scraper/human_fallback_scraper.py
- humanpages/pyproject.toml
- humanpages/tests/__init__.py
- humanpages/tests/integration_tests/__init__.py
- humanpages/tests/unit_tests/__init__.py
- humanpages/tests/unit_tests/test_agent.py
- humanpages/tests/unit_tests/test_client.py
```python
            logger.info("AgentQL returned empty data for %s, falling back to human.", url)
        except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
            logger.info(
                AGENTQL_EXTRACTION_FAILED.format(url=url, detail=str(e))
            )
```
Avoid logging full user-provided URLs.
These info logs can persist query strings/fragments that may contain tokens, emails, or other sensitive values. Log a redacted URL instead.
🛡️ Proposed fix

```diff
 import logging
 import os
 import time
 from typing import Any, Optional
+from urllib.parse import urlsplit, urlunsplit
+
+
+def _redact_url_for_log(url: str) -> str:
+    parsed = urlsplit(url)
+    return urlunsplit((parsed.scheme, parsed.netloc, parsed.path, "", ""))
+
+
 logger = logging.getLogger(__name__)
```

```diff
+        safe_url = _redact_url_for_log(url)
-        logger.info("AgentQL returned empty data for %s, falling back to human.", url)
+        logger.info("AgentQL returned empty data for %s, falling back to human.", safe_url)
         except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
+            safe_url = _redact_url_for_log(url)
             logger.info(
-                AGENTQL_EXTRACTION_FAILED.format(url=url, detail=str(e))
+                AGENTQL_EXTRACTION_FAILED.format(url=safe_url, detail=str(e))
             )
```

Apply the same redaction in `aextract()`.
Also applies to: 297-301
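The redaction helper from the proposed fix is runnable as-is; here it is in stand-alone form with a sample URL. Note that, like the diff, it keeps the netloc verbatim, so userinfo embedded in a URL would still survive — the AI-agent prompt suggests masking that too.

```python
# Keep scheme, host, and path; drop query and fragment before logging.
from urllib.parse import urlsplit, urlunsplit


def _redact_url_for_log(url: str) -> str:
    parsed = urlsplit(url)
    return urlunsplit((parsed.scheme, parsed.netloc, parsed.path, "", ""))


# Sample URL with a sensitive query string and fragment (illustrative values).
redacted = _redact_url_for_log("https://shop.example/cart?session=SECRET#token=abc")
```

Everything after the path is stripped, so log lines record `https://shop.example/cart` rather than the session token.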
```toml
[tool.poetry.group.test.dependencies]
pytest = "^7.4.3"
pytest-asyncio = "^0.23.2"
pytest-socket = "^0.7.0"
pytest-watcher = "^0.3.4"
respx = "^0.21.1"

[tool.poetry.group.codespell.dependencies]
codespell = "^2.2.6"

[tool.poetry.group.test_integration.dependencies]

[tool.poetry.group.lint.dependencies]
ruff = "^0.5"

[tool.poetry.group.typing.dependencies]
mypy = "^1.10"
```
🧩 Analysis chain
🌐 Web query:
Latest versions of pytest, pytest-asyncio, pytest-socket, respx, ruff, mypy in April 2026
💡 Result:
- pytest: 9.0.3 (released 2026-04-07)
- pytest-asyncio: 1.3.0 (latest stable, released 2025-11-10; note: 1.4.0a1 prerelease on 2026-04-15)
- pytest-socket: 0.7.0 (released 2024-01-28)
- respx: 0.23.1 (released 2026-04-08)
- ruff: 0.15.10 (released 2026-04-09)
- mypy: 1.20.1 (released 2026-04-13)
Citations:
- 1: https://github.com/pytest-dev/pytest/releases
- 2: https://github.com/pytest-dev/pytest/releases/tag/9.0.3
- 3: https://docs.pytest.org/en/stable/changelog.html
- 4: https://pypi.org/project/pytest/
- 5: https://pypi.org/project/pytest-asyncio/
- 6: https://pypi.org/project/pytest-socket/
- 7: https://pypi.org/project/respx/
- 8: https://github.com/astral-sh/ruff/releases/tag/0.15.9
- 9: https://github.com/astral-sh/ruff/releases/tag/0.15.10
- 10: https://pypi.org/project/mypy/
Update outdated development dependencies to match current releases.

Several development dependencies are significantly outdated:

- pytest: constraint `^7.4.3` caps at <8.0.0; latest is 9.0.3
- pytest-asyncio: constraint `^0.23.2` caps at <1.0.0; latest is 1.3.0 (major version bump available)
- respx: constraint `^0.21.1` is behind; latest is 0.23.1
- mypy: constraint `^1.10` is behind; latest is 1.20.1

pytest-socket (0.7.0) is current. Verify pytest-watcher and codespell versions separately, then update constraints to allow compatible newer releases or pin to latest stable versions.
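Why `^7.4.3` excludes pytest 9.x: Poetry's caret operator for a version with major >= 1 means `>=7.4.3,<8.0.0`. A minimal stdlib sketch of that bound check (real resolvers apply the full PEP 440 rules, and caret behaves differently for 0.x versions such as `^0.23.2`, which caps at `<0.24.0`; this sketch only handles plain `X.Y.Z` versions with major >= 1):

```python
# Simplified caret-constraint check for X.Y.Z versions with major >= 1.
def allowed_by_caret(version: str, minimum: str) -> bool:
    v = tuple(int(p) for p in version.split("."))
    lo = tuple(int(p) for p in minimum.split("."))
    hi = (lo[0] + 1, 0, 0)  # caret caps at the next major version
    return lo <= v < hi


old_allows_latest = allowed_by_caret("9.0.3", "7.4.3")  # ^7.4.3 rejects 9.0.3
new_allows_latest = allowed_by_caret("9.0.3", "9.0.3")  # ^9.0.3 accepts 9.0.3
```

This is why bumping the minimum (rather than just refreshing the lockfile) is needed to pick up the new major versions listed above.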
- Fix TOML boolean syntax for mypy disallow_untyped_defs
- Bump pydantic minimum to ^2.4.0 (CVE-2024-3772)
- Add isinstance check before calling .get() on error_json
- URL-encode job_id in endpoint path interpolation
- Enforce mutual exclusivity of query/prompt parameters
- Add async _adelegate_to_human to avoid blocking event loop
- Fix Makefile lint_diff path and branch, group mypy guard
- Update LICENSE year to 2025
- Add trailing slash to __pycache__ in .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
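One item in the commit above, URL-encoding `job_id` before interpolating it into the endpoint path, can be sketched with the stdlib (the endpoint shape is illustrative, not the actual Human Pages API):

```python
from urllib.parse import quote

def job_endpoint(base: str, job_id: str) -> str:
    # quote with safe="" also encodes "/", so a malicious id cannot
    # escape its path segment via "../" traversal.
    return f"{base}/jobs/{quote(job_id, safe='')}"

print(job_endpoint("https://api.example.com", "abc/../../admin"))
# https://api.example.com/jobs/abc%2F..%2F..%2Fadmin
```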
Actionable comments posted: 1
♻️ Duplicate comments (1)
humanpages/Makefile (1)
30-30: ⚠️ Potential issue | 🟡 Minor

Make the diff base configurable and use merge-base semantics.

Line 30 fixed the old package path, but it still hard-codes `main` and compares against that ref directly. This can fail when only `origin/main` exists, or lint files from target-branch-only changes after branch divergence.

🛠️ Proposed fix
```diff
 MYPY_CACHE=.mypy_cache
+BASE_REF ?= main
 lint format: PYTHON_FILES=.
-lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative --name-only --diff-filter=d main | grep -E '\.py$$|\.ipynb$$')
+lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative --name-only --diff-filter=d $(BASE_REF)...HEAD | grep -E '\.py$$|\.ipynb$$')
```
Verify each finding against the current code and only fix it if needed. In `@humanpages/Makefile` at line 30, The PYTHON_FILES assignment currently hard-codes "main" in the git diff call; make the diff base configurable and compute a proper merge-base (fork-point fallback) before running git diff. Add a DIFF_BASE variable (defaulting to origin/main), compute MERGE_BASE using git merge-base --fork-point ${DIFF_BASE} HEAD with a fallback to git merge-base ${DIFF_BASE} HEAD, and then use that MERGE_BASE in the git diff command used by the PYTHON_FILES assignment instead of the literal "main". Ensure the new variables replace the old literal so CI works when only origin/main exists or when target-branch-only changes are present.
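A Makefile sketch of the variable setup the prompt describes (the `DIFF_BASE` and `MERGE_BASE` names come from the review comment, not the repo):

```make
DIFF_BASE ?= origin/main
# Prefer the fork-point merge base; fall back to the plain merge base,
# then to the ref itself if neither can be computed.
MERGE_BASE := $(shell git merge-base --fork-point $(DIFF_BASE) HEAD 2>/dev/null \
	|| git merge-base $(DIFF_BASE) HEAD 2>/dev/null \
	|| echo $(DIFF_BASE))

lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative --name-only --diff-filter=d $(MERGE_BASE) | grep -E '\.py$$|\.ipynb$$')
```

Diffing against the merge base (rather than the branch tip) avoids linting files that only changed on the target branch after divergence.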
🧹 Nitpick comments (2)
humanpages/agentql_humanpages/agent.py (2)
309-309: Remove redundant `httpx.TimeoutException` from exception handlers.

`httpx.TimeoutException` is a subclass of `httpx.HTTPError`, so catching it explicitly is redundant. Apply at L309 and L348:

♻️ Proposed changes
```diff
-        except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:
+        except (httpx.HTTPError, ValueError) as e:
```
Verify each finding against the current code and only fix it if needed. In `@humanpages/agentql_humanpages/agent.py` at line 309, The except clauses currently catching (httpx.HTTPError, httpx.TimeoutException, ValueError) are redundant because httpx.TimeoutException subclasses httpx.HTTPError; remove httpx.TimeoutException from those tuples so they become (httpx.HTTPError, ValueError) in both places (the except that reads "except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:" and the other similar handler) to avoid duplicate handling while preserving error semantics.
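The redundancy is easy to verify in any exception hierarchy; a stdlib analogue (using `TimeoutError`, which subclasses `OSError` since Python 3.3) shows why the broader class alone suffices:

```python
# TimeoutError subclasses OSError, so a handler for OSError already
# catches it -- listing both in an except tuple is redundant.
print(issubclass(TimeoutError, OSError))  # True

def risky() -> None:
    raise TimeoutError("simulated timeout")

try:
    risky()
except OSError as e:  # no need for (OSError, TimeoutError)
    caught = type(e).__name__

print(caught)  # TimeoutError
```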
161-169: Consider reusing a single `httpx.AsyncClient` across calls.

The `_agentql_extract_async` method instantiates a new `AsyncClient` on every invocation, which prevents connection pooling and incurs connection/TLS overhead for each call. Holding a client on the instance (created lazily, closed via `aclose()` or an async context manager on the agent) would enable connection pooling.

However, note that the synchronous `_agentql_extract` method has the same pattern with `httpx.post()`, so this is a design consistency issue rather than an async-specific concern. Also, since the agent appears designed for single-call usage per instance (based on the current API and tests), the practical benefit depends on use cases that reuse a single agent instance across multiple extractions.
Verify each finding against the current code and only fix it if needed. In `@humanpages/agentql_humanpages/agent.py` around lines 161 - 169, The code creates a new httpx.AsyncClient for every call in _agentql_extract_async (and similarly uses httpx.post in _agentql_extract), preventing connection pooling; change the class to hold a lazily-created instance attribute (e.g., self._async_client) and reuse it for requests to AGENTQL_EXTRACT_DATA_ENDPOINT with timeout self._agentql_timeout, and add an async close method (or implement __aenter__/__aexit__ / aclose usage) that calls self._async_client.aclose() to clean up; ensure the synchronous path is made consistent (reuse a persistent httpx.Client or document single-use behavior) and guard client creation to be idempotent.
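A minimal sketch of the lazy, idempotent client pattern the prompt describes, using a stand-in `FakeClient` instead of `httpx.AsyncClient` so the example stays self-contained (all names here are illustrative):

```python
import asyncio
from typing import Optional

class FakeClient:
    """Stand-in for httpx.AsyncClient: counts how often it is created."""
    instances = 0

    def __init__(self) -> None:
        FakeClient.instances += 1
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True

class Agent:
    def __init__(self) -> None:
        self._async_client: Optional[FakeClient] = None

    def _get_client(self) -> FakeClient:
        # Idempotent lazy creation: only the first call builds a client.
        if self._async_client is None:
            self._async_client = FakeClient()
        return self._async_client

    async def extract(self, url: str) -> str:
        client = self._get_client()  # reused across calls -> pooling
        return f"extracted:{url}"

    async def aclose(self) -> None:
        # Explicit cleanup, mirroring httpx's aclose() contract.
        if self._async_client is not None:
            await self._async_client.aclose()
            self._async_client = None

async def main() -> tuple:
    agent = Agent()
    await agent.extract("https://example.com/a")
    result = await agent.extract("https://example.com/b")
    await agent.aclose()
    return FakeClient.instances, result

instances, result = asyncio.run(main())
print(instances, result)  # 1 extracted:https://example.com/b
```

With a real `httpx.AsyncClient` the same structure keeps one connection pool alive across extractions instead of paying TLS setup per call.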
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@humanpages/Makefile`:
- Line 16: The ptw invocation places pytest-specific args before the watched
path; move reserved ptw options (e.g., --now) before the path, put the watched
path (.) next, then add the separator -- and the pytest options (e.g., -vv,
--snapshot-update, $(TEST_FILE)) afterwards; update the command string used in
the Makefile so it follows the pattern "ptw [ptw-opts] <watched-path> --
[pytest-opts]" to ensure pytest-watcher parses options correctly.
---
Duplicate comments:
In `@humanpages/Makefile`:
- Line 30: The PYTHON_FILES assignment currently hard-codes "main" in the git
diff call; make the diff base configurable and compute a proper merge-base
(fork-point fallback) before running git diff. Add a DIFF_BASE variable
(defaulting to origin/main), compute MERGE_BASE using git merge-base
--fork-point ${DIFF_BASE} HEAD with a fallback to git merge-base ${DIFF_BASE}
HEAD, and then use that MERGE_BASE in the git diff command used by the
PYTHON_FILES assignment instead of the literal "main". Ensure the new variables
replace the old literal so CI works when only origin/main exists or when
target-branch-only changes are present.
---
Nitpick comments:
In `@humanpages/agentql_humanpages/agent.py`:
- Line 309: The except clauses currently catching (httpx.HTTPError,
httpx.TimeoutException, ValueError) are redundant because httpx.TimeoutException
subclasses httpx.HTTPError; remove httpx.TimeoutException from those tuples so
they become (httpx.HTTPError, ValueError) in both places (the except that reads
"except (httpx.HTTPError, httpx.TimeoutException, ValueError) as e:" and the
other similar handler) to avoid duplicate handling while preserving error
semantics.
- Around line 161-169: The code creates a new httpx.AsyncClient for every call
in _agentql_extract_async (and similarly uses httpx.post in _agentql_extract),
preventing connection pooling; change the class to hold a lazily-created
instance attribute (e.g., self._async_client) and reuse it for requests to
AGENTQL_EXTRACT_DATA_ENDPOINT with timeout self._agentql_timeout, and add an
async close method (or implement __aenter__/__aexit__ / aclose usage) that calls
self._async_client.aclose() to clean up; ensure the synchronous path is made
consistent (reuse a persistent httpx.Client or document single-use behavior) and
guard client creation to be idempotent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bdb6c6ef-cf96-4b31-828b-bac2221c9a29
📒 Files selected for processing (6)
- humanpages/.gitignore
- humanpages/LICENSE
- humanpages/Makefile
- humanpages/agentql_humanpages/agent.py
- humanpages/agentql_humanpages/client.py
- humanpages/pyproject.toml
✅ Files skipped from review due to trivial changes (3)
- humanpages/LICENSE
- humanpages/.gitignore
- humanpages/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- humanpages/agentql_humanpages/client.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Disclosure: I'm a maintainer of Human Pages. Happy to adjust the integration if there's anything you'd like changed to better fit the repo's conventions.
Summary
- `humanpages` integration package that combines AgentQL with Human Pages for automatic human fallback when extraction fails
- `HumanFallbackAgent` class: tries AgentQL first, delegates to a real human if extraction fails
- `HumanPagesClient` class: full REST API wrapper with sync and async methods

Package structure
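The fallback flow described in the summary (automated extraction first, human on failure or empty result) can be sketched with hypothetical stand-in backends; none of these names are the package's actual API:

```python
from typing import Any, Callable, Dict, Optional

def extract_with_fallback(
    url: str,
    query: str,
    automated: Callable[[str, str], Optional[Dict[str, Any]]],
    human: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Try automated extraction first; delegate to a human on an
    exception or an empty result (the HumanFallbackAgent pattern)."""
    try:
        result = automated(url, query)
    except Exception:
        result = None
    if not result:  # error or empty result -> human fallback
        return human(url, query)
    return result

# Stubs standing in for the AgentQL and Human Pages backends.
def failing_agentql(url: str, query: str) -> None:
    return None  # simulate an empty extraction

def human_pages(url: str, query: str) -> Dict[str, Any]:
    return {"source": "human", "url": url}

print(extract_with_fallback("https://example.com", "{ price }", failing_agentql, human_pages))
# {'source': 'human', 'url': 'https://example.com'}
```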
Links