feat(analytics): filter bots and non-content endpoints from page-view tracking (#77) #82
Merged
… tracking

The analytics middleware was recording every successful request, so /robots.txt, /sitemap.xml, /feed.xml, /favicon.ico, /pygments.css, and crawler hits all inflated view counts.

Add two filters:
1. Content-Type must start with text/html. This excludes XML, JSON, CSS, plain text, and image responses without needing per-path allowlists.
2. User-Agent must not look like a bot/crawler. The pattern covers Googlebot, Bingbot, Baidu/Yandex, social card fetchers (Twitterbot, facebookexternalhit, Slackbot), and scripted clients (curl, wget, python-requests, httpx). Missing UA is treated as a bot since real browsers always send one.

Refactor the middleware to use early returns so the order is obvious: status_code -> Content-Type -> path prefix -> User-Agent -> track. Existing path-prefix exclusions (/static, /admin, /health, /auth, /webhooks) are preserved.

Tests cover is_bot_user_agent across known bots and real browsers, plus integration tests that the existing excluded paths still don't get tracked.

Closes #77

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates SquishMark’s analytics middleware to avoid inflating page-view metrics by skipping bot traffic and non-content responses, aligning tracked “page views” more closely with real human HTML page loads.
Changes:
- Added a bot User-Agent detection helper (regex-based) and used it in the analytics middleware.
- Added a Content-Type gate (text/html only) and refactored middleware logic to use early returns for clearer filter ordering.
- Added unit + integration-style tests intended to validate bot/non-HTML/non-content filtering behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/squishmark/main.py | Adds bot UA detection + updates analytics middleware to early-return on non-200, non-HTML, excluded paths, and bot UAs before tracking. |
| tests/test_analytics_filtering.py | Adds tests for bot UA detection and middleware tracking suppression for selected endpoints/headers. |
Copilot review on PR #82 flagged that test_bot_request_to_html_page_not_tracked used /health, which is filtered earlier by both Content-Type (JSON) and path prefix, so the test would pass even if the UA gate were removed.

Register a stub /_test/html route on the test app (path chosen to avoid the /{slug} catch-all in pages.py) and assert the browser-UA hit IS tracked while the bot-UA hit is NOT. The bot UA filter now has a test that genuinely exercises it.

Refs #77, #82

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
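To illustrate the review point, here is a small self-contained model. It is not the project's test code: the tracks function, its ua_gate flag, and the abbreviated pattern are hypothetical stand-ins for the middleware, and the flag only simulates deleting the User-Agent check. A bot request to /health is rejected either way, so it cannot detect that regression, while a bot request to an HTML route can.

```python
import re

# Abbreviated bot pattern, for illustration only.
BOT = re.compile(r"bot|crawler|spider|curl|wget", re.IGNORECASE)
# Prefixes the PR lists as existing exclusions (constant name is hypothetical).
EXCLUDED_PREFIXES = ("/static", "/admin", "/health", "/auth", "/webhooks")


def tracks(path, content_type, user_agent, ua_gate=True):
    """Hypothetical model of the tracking decision.

    ua_gate=False simulates a middleware with the User-Agent check removed.
    """
    if not content_type.startswith("text/html"):
        return False
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    if ua_gate and (not user_agent or BOT.search(user_agent)):
        return False
    return True


BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

# Old test shape: passes with or without the UA gate, so it says nothing about it.
assert tracks("/health", "application/json", BOT_UA, ua_gate=True) is False
assert tracks("/health", "application/json", BOT_UA, ua_gate=False) is False

# New test shape: only the UA gate separates these two outcomes.
assert tracks("/_test/html", "text/html; charset=utf-8", BROWSER_UA) is True
assert tracks("/_test/html", "text/html; charset=utf-8", BOT_UA) is False

# With the gate removed, the bot hit would be tracked, so the new test would fail.
assert tracks("/_test/html", "text/html; charset=utf-8", BOT_UA, ua_gate=False) is True
```

Pairing a positive case (browser UA is tracked) with the negative case matters: without it, a route that is never tracked for some unrelated reason would also make the bot assertion pass.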
Summary
Stops the analytics middleware from recording bot traffic and non-content endpoints. Two filters added; existing path-prefix exclusions preserved.
Filters
1. Content-Type must start with text/html. Cleanly excludes /robots.txt, /sitemap.xml, /feed.xml, /favicon.ico, /pygments.css without needing per-path allowlists.
2. User-Agent must not match bot|crawler|spider|slurp|facebookexternalhit|curl|wget|python-requests|httpx (case-insensitive). Missing UA is treated as a bot (real browsers always send one).

Middleware refactored to use early returns so the filter order is obvious: status → Content-Type → path prefix → User-Agent → track.
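A minimal sketch of the two filters and the gate order described above, not the code in src/squishmark/main.py. The pattern is the one quoted in this description and is_bot_user_agent is the helper name used in the PR. Everything else is an assumption: the should_track function, its signature, the EXCLUDED_PREFIXES constant, and the status check being exactly 200 (taken from the review summary's "non-200" wording).

```python
import re
from typing import Optional

# Pattern quoted in the PR description, matched case-insensitively.
BOT_PATTERN = re.compile(
    r"bot|crawler|spider|slurp|facebookexternalhit|curl|wget|python-requests|httpx",
    re.IGNORECASE,
)

# Prefixes the PR lists as existing exclusions (constant name is hypothetical).
EXCLUDED_PREFIXES = ("/static", "/admin", "/health", "/auth", "/webhooks")


def is_bot_user_agent(user_agent: Optional[str]) -> bool:
    """Return True for a missing/empty UA or one matching BOT_PATTERN."""
    if not user_agent:
        return True
    return BOT_PATTERN.search(user_agent) is not None


def should_track(
    status_code: int,
    content_type: Optional[str],
    path: str,
    user_agent: Optional[str],
) -> bool:
    """Hypothetical decision function mirroring the stated gate order."""
    if status_code != 200:  # assumption: only 200 responses count
        return False
    if not (content_type or "").startswith("text/html"):
        return False
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    if is_bot_user_agent(user_agent):
        return False
    return True
```

Because the pattern uses an unanchored search, the short token bot also matches Googlebot, Bingbot, Twitterbot, Slackbot, and YandexBot, and spider matches Baiduspider, which is why those names do not need their own alternatives.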
Test plan
- python scripts/run-checks.py: 4/4 (format, lint, 209 tests, pyright)
- is_bot_user_agent unit-tested against 12 known bots + 4 real browsers + missing/empty
- /robots.txt, /health, and bot UAs don't trigger track_page_view

Scope note
The issue's Implementation Notes suggest staging bot filtering as a follow-up. It is bundled here because the issue title says "bot traffic AND non-content endpoints" and its Possible Approaches section lists "Combination of the above." See #77 comment.
Closes #77
🤖 Generated with Claude Code