
feat(analytics): filter bots and non-content endpoints from page-view tracking (#77) #82

Merged · x3ek merged 2 commits into main from feat/77-exclude-bot-traffic-analytics · May 16, 2026

x3ek commented May 16, 2026

Summary

Stops the analytics middleware from recording bot traffic and non-content endpoints. Two filters added; existing path-prefix exclusions preserved.

Filters

  1. Content-Type — must start with text/html. Cleanly excludes /robots.txt, /sitemap.xml, /feed.xml, /favicon.ico, /pygments.css without needing per-path allowlists.
  2. User-Agent — matched case-insensitively against the pattern bot|crawler|spider|slurp|facebookexternalhit|curl|wget|python-requests|httpx. A missing UA is treated as a bot (real browsers always send one).
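
A minimal sketch of the User-Agent filter described in item 2, assuming a plain compiled regex. The pattern is the one quoted above and `is_bot_user_agent` is the helper named in the test plan; the constant name `_BOT_UA_PATTERN` and the exact handling of whitespace-only values are assumptions, not the merged code.

```python
import re
from typing import Optional

# Pattern copied from the PR description; the constant name is hypothetical.
_BOT_UA_PATTERN = re.compile(
    r"bot|crawler|spider|slurp|facebookexternalhit|curl|wget|python-requests|httpx",
    re.IGNORECASE,
)


def is_bot_user_agent(user_agent: Optional[str]) -> bool:
    """Return True for a missing/empty UA or one that matches the bot pattern."""
    # A missing or blank header is treated as a bot, as the PR states.
    if not user_agent or not user_agent.strip():
        return True
    # search() rather than match(): the token can appear anywhere in the string.
    return _BOT_UA_PATTERN.search(user_agent) is not None
```

Because the match is a substring search, any UA containing "bot" is filtered, which is what catches Googlebot, Bingbot, Twitterbot, and Slackbot without listing each one.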

Middleware refactored to use early returns so the filter order is obvious: status → Content-Type → path prefix → User-Agent → track.
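
The early-return ordering can be sketched as a framework-free predicate. This is an illustration under stated assumptions, not the middleware in src/squishmark/main.py: `should_track` and its parameters are hypothetical names, the status gate is assumed to be "exactly 200", and the prefix list is the one given in the commit message.

```python
import re
from typing import Optional

# Prefixes listed in the commit message as pre-existing exclusions.
EXCLUDED_PREFIXES = ("/static", "/admin", "/health", "/auth", "/webhooks")

# Pattern quoted in the PR description.
BOT_UA = re.compile(
    r"bot|crawler|spider|slurp|facebookexternalhit|curl|wget|python-requests|httpx",
    re.IGNORECASE,
)


def should_track(
    status_code: int,
    content_type: Optional[str],
    path: str,
    user_agent: Optional[str],
) -> bool:
    """Hypothetical decision function mirroring the documented filter order."""
    # 1. Status: only successful responses count (assumed to mean 200).
    if status_code != 200:
        return False
    # 2. Content-Type: must start with text/html.
    if not (content_type or "").lower().startswith("text/html"):
        return False
    # 3. Path prefix: existing exclusions are preserved.
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    # 4. User-Agent: missing UA or a bot-like UA is skipped.
    if not user_agent or BOT_UA.search(user_agent):
        return False
    # 5. Everything passed: record the page view.
    return True
```

Each gate returns as soon as it fails, so the order of the `if` statements is the filter order, and the cheap checks run before the regex.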

Test plan

  • python scripts/run-checks.py — 4/4 (format, lint, 209 tests, pyright)
  • is_bot_user_agent unit-tested against 12 known bots + 4 real browsers + missing/empty
  • Integration tests confirm /robots.txt, /health, and bot UAs don't trigger track_page_view

Scope note

The issue's Implementation Notes suggest staging bot filtering as a follow-up. It is bundled here because the issue title says "bot traffic AND non-content endpoints" and its Possible Approaches section lists "Combination of the above." See #77 comment.

Closes #77

🤖 Generated with Claude Code

… tracking

The analytics middleware was recording every successful request, so /robots.txt, /sitemap.xml, /feed.xml, /favicon.ico, /pygments.css, and crawler hits all inflated view counts. Add two filters:

1. Content-Type must start with text/html — excludes XML, JSON, CSS, plain text, and image responses without needing per-path allowlists.

2. User-Agent must not look like a bot/crawler — pattern covers Googlebot, Bingbot, Baidu/Yandex, social card fetchers (Twitterbot, facebookexternalhit, Slackbot), and scripted clients (curl, wget, python-requests, httpx). Missing UA is treated as a bot since real browsers always send one.
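
The Content-Type gate from item 1 can be sketched on its own. The helper name `is_html_response` is hypothetical, and lowercasing the header before the prefix check is an assumption (media type names are case-insensitive, but the PR does not say how the merged code normalizes them).

```python
from typing import Optional


def is_html_response(content_type: Optional[str]) -> bool:
    """Return True when the response Content-Type starts with text/html."""
    # A response without a Content-Type header is never counted as a page view.
    if not content_type:
        return False
    # Prefix check so parameters such as "; charset=utf-8" do not matter.
    return content_type.strip().lower().startswith("text/html")
```

A prefix check on the media type is what lets one rule cover robots.txt, sitemaps, feeds, icons, and stylesheets without naming any of their paths.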

Refactor the middleware to use early returns so the order is obvious: status_code -> Content-Type -> path prefix -> User-Agent -> track. Existing path-prefix exclusions (/static, /admin, /health, /auth, /webhooks) are preserved.

Tests cover is_bot_user_agent across known bots and real browsers, plus integration tests that the existing excluded paths still don't get tracked.

Closes #77

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates SquishMark’s analytics middleware to avoid inflating page-view metrics by skipping bot traffic and non-content responses, aligning tracked “page views” more closely with real human HTML page loads.

Changes:

  • Added a bot User-Agent detection helper (regex-based) and used it in the analytics middleware.
  • Added a Content-Type gate (text/html only) and refactored middleware logic to use early returns for clearer filter ordering.
  • Added unit + integration-style tests intended to validate bot/non-HTML/non-content filtering behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/squishmark/main.py | Adds bot UA detection + updates analytics middleware to early-return on non-200, non-HTML, excluded paths, and bot UAs before tracking. |
| tests/test_analytics_filtering.py | Adds tests for bot UA detection and middleware tracking suppression for selected endpoints/headers. |

Comment thread tests/test_analytics_filtering.py
Copilot review on PR #82 flagged that test_bot_request_to_html_page_not_tracked used /health, which is filtered earlier by both Content-Type (JSON) and path prefix — so the test would pass even if the UA gate were removed.

Register a stub /_test/html route on the test app (path chosen to avoid the /{slug} catch-all in pages.py) and assert the browser-UA hit IS tracked while the bot-UA hit is NOT. The bot UA filter now has a test that genuinely exercises it.
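
Why the /health test was vacuous can be shown with a small framework-free model. This is a sketch, not the project's test code: `should_track`, the `ua_gate_enabled` switch, and the simplified "bot" substring check are all hypothetical stand-ins for the real middleware and its regex.

```python
from typing import Optional

EXCLUDED_PREFIXES = ("/static", "/admin", "/health", "/auth", "/webhooks")


def should_track(
    status_code: int,
    content_type: Optional[str],
    path: str,
    user_agent: Optional[str],
    ua_gate_enabled: bool = True,  # hypothetical switch to simulate removing the UA gate
) -> bool:
    if status_code != 200:
        return False
    if not (content_type or "").lower().startswith("text/html"):
        return False
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    # Simplified stand-in for the real bot regex.
    if ua_gate_enabled and (not user_agent or "bot" in user_agent.lower()):
        return False
    return True


BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

# /health is rejected by the Content-Type and path gates, so the result is the
# same whether or not the UA gate exists: the old test could not detect its removal.
health_with_gate = should_track(200, "application/json", "/health", BOT)
health_without_gate = should_track(200, "application/json", "/health", BOT, ua_gate_enabled=False)

# On an HTML stub route the earlier gates pass, so only the UA gate decides.
stub_bot_with_gate = should_track(200, "text/html; charset=utf-8", "/_test/html", BOT)
stub_bot_without_gate = should_track(200, "text/html; charset=utf-8", "/_test/html", BOT, ua_gate_enabled=False)
stub_browser = should_track(200, "text/html; charset=utf-8", "/_test/html", BROWSER)
```

Asserting both directions on the stub route (browser tracked, bot not tracked) also guards against a gate that rejects everything.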

Refs #77, #82

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
x3ek merged commit 821ee65 into main on May 16, 2026
5 checks passed
x3ek deleted the feat/77-exclude-bot-traffic-analytics branch on May 16, 2026 14:20
Development

Successfully merging this pull request may close these issues.

Exclude bot traffic and non-content endpoints from analytics

2 participants