Skip to content

Add W024 select-distinct-suspicious (DISTINCT + JOIN band-aid heuristic)#49

Merged
Pawansingh3889 merged 6 commits into
mainfrom
feat/w024-select-distinct-suspicious
May 9, 2026
Merged

Add W024 select-distinct-suspicious (DISTINCT + JOIN band-aid heuristic)#49
Pawansingh3889 merged 6 commits into
mainfrom
feat/w024-select-distinct-suspicious

Conversation

@Pawansingh3889

Copy link
Copy Markdown
Owner

What

New warning rule W024 select-distinct-suspicious that fires when a single statement contains both top-level SELECT DISTINCT and any JOIN keyword.

Developers reach for SELECT DISTINCT to swallow row duplication caused by an over-wide JOIN cardinality, often because a join condition is missing or because the right tool was actually GROUP BY. The result reads more rows than necessary and hides the underlying join-condition bug. The pattern is common enough in real codebases that it earns its own rule rather than living inside S001.

What's flagged

SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id;

What's not flagged

-- single-table DISTINCT, no JOIN cardinality blow-up to mask
SELECT DISTINCT country FROM customers;

-- aggregate-DISTINCT is a different pattern; regex anchors on SELECT directly preceding DISTINCT
SELECT COUNT(DISTINCT c.id) FROM customers c JOIN orders o ON c.id = o.customer_id;
SELECT SUM(DISTINCT amount) FROM payments p JOIN customers c ON p.cust_id = c.id;

Implementation notes

  • multiline = True plus check_statement so DISTINCT and JOIN can sit on different lines of the same statement.
    • strip_strings_and_comments (lifted into base.py previously for E009) is reused so DISTINCT or JOIN tokens inside string literals or -- / /* */ comments cannot fire the rule.
    • The regex \bSELECT\s+DISTINCT\b only matches top-level usage. COUNT(DISTINCT ...), SUM(DISTINCT ...), etc. fail to match because the keyword preceding DISTINCT is a (, not SELECT.

Tests

11 new tests in tests/test_new_rules.py covering inner join, left join, multi-line, case-insensitive, single-table DISTINCT (passes), COUNT(DISTINCT) with JOIN (passes), SUM(DISTINCT) with JOIN (passes), JOIN without DISTINCT (passes), DISTINCT inside a string literal (passes), DISTINCT inside a -- comment (passes), and a message-content assertion. Registry counts bumped to 42 / 31 warnings. Full suite: 254 passed locally, 1 skipped.

Closes #40.

Fires when a statement contains both top-level SELECT DISTINCT and any
JOIN keyword. The pattern is a frequent band-aid for a missing join
condition or a missing GROUP BY: developers reach for DISTINCT to
swallow row duplication caused by an over-wide JOIN cardinality
rather than fixing the join.

The regex anchors on SELECT directly preceding DISTINCT so aggregate-
DISTINCT forms like COUNT(DISTINCT col) and SUM(DISTINCT amount) are
not flagged. String literals and comments are stripped first via
strip_strings_and_comments so a mention of SELECT DISTINCT inside a
literal or a comment cannot trigger the rule. multiline = True so
the DISTINCT and JOIN keywords can sit on different lines of the
same statement.

Resolves #40.
Covers: INNER JOIN, LEFT JOIN, multi-line, case-insensitive, single-table
DISTINCT (passes), COUNT(DISTINCT) with JOIN (passes), SUM(DISTINCT)
with JOIN (passes), JOIN without DISTINCT (passes), DISTINCT inside a
string literal (passes), DISTINCT inside a -- comment (passes), and a
message-content assertion.
@Pawansingh3889 Pawansingh3889 force-pushed the feat/w024-select-distinct-suspicious branch from 4507d3a to 9f158ab Compare May 9, 2026 16:30
[pre-commit.ci] auto-applied fixes from configured hooks
@Pawansingh3889 Pawansingh3889 changed the title Feat/w024 select distinct suspicious Add W024 select-distinct-suspicious (DISTINCT + JOIN band-aid heuristic) May 9, 2026
@Pawansingh3889 Pawansingh3889 merged commit bb3fbd5 into main May 9, 2026
5 of 8 checks passed
Pawansingh3889 added a commit that referenced this pull request May 9, 2026
…eck (#50)

The previous pr_title.yml workflow used amannn/action-semantic-pull-request@v5
to enforce strict Conventional Commits format on every PR title (feat:,
fix:, etc.) plus a lowercase-subject regex. That conflicted with the
project's actual title convention ("Add Wxxx rule-name (description)",
"Fix Snnn false negatives ...") and so the validate check went red on
every merged PR (e.g. #46, #47, #48, #49). Branch protection does not
require this check, so the merge button stayed enabled and the failures
shipped silently in the merged commits' status timelines.

Replacing the strict format enforcer with a small inline shell script
that catches the actual problems we have hit in practice:

1. Empty / whitespace-only titles -- noticeable in `gh pr list`.
2. Auto-generated branch-name titles (e.g. "Feat/w024-select-distinct
   -suspicious") that GitHub picks when an async PR-template render
   wipes out a freshly typed title. Matches a regex of the form
   ^[A-Z][a-z]+/[a-z0-9-] which catches the real cases without
   touching anything that looks like a sentence.
3. Titles longer than 100 characters, which truncate badly in the UI.

Project titles like "Add W024 select-distinct-suspicious (...)" pass
the sanity check; titles like "Feat/foo-bar" fail. Verified the regex
matrix locally before committing.

To restore stricter enforcement later, swap the inline script back for
amannn/action-semantic-pull-request@v5 and adjust the project title
convention to start with feat: / fix: prefixes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rule: W024 select-distinct-suspicious - flag SELECT DISTINCT that often masks a missing JOIN or GROUP BY

1 participant