Add W024 select-distinct-suspicious (DISTINCT + JOIN band-aid heuristic)#49
Merged
Merged
Conversation
Fires when a statement contains both top-level SELECT DISTINCT and any JOIN keyword. The pattern is a frequent band-aid for a missing join condition or a missing GROUP BY: developers reach for DISTINCT to swallow row duplication caused by an over-wide JOIN cardinality rather than fixing the join. The regex anchors on SELECT directly preceding DISTINCT so aggregate- DISTINCT forms like COUNT(DISTINCT col) and SUM(DISTINCT amount) are not flagged. String literals and comments are stripped first via strip_strings_and_comments so a mention of SELECT DISTINCT inside a literal or a comment cannot trigger the rule. multiline = True so the DISTINCT and JOIN keywords can sit on different lines of the same statement. Resolves #40.
Covers: INNER JOIN, LEFT JOIN, multi-line, case-insensitive, single-table DISTINCT (passes), COUNT(DISTINCT) with JOIN (passes), SUM(DISTINCT) with JOIN (passes), JOIN without DISTINCT (passes), DISTINCT inside a string literal (passes), DISTINCT inside a -- comment (passes), and a message-content assertion.
4507d3a to
9f158ab
Compare
[pre-commit.ci] auto-applied fixes from configured hooks
11 tasks
Pawansingh3889
added a commit
that referenced
this pull request
May 9, 2026
…eck (#50) The previous pr_title.yml workflow used amannn/action-semantic-pull-request@v5 to enforce strict Conventional Commits format on every PR title (feat:, fix:, etc.) plus a lowercase-subject regex. That conflicted with the project's actual title convention ("Add Wxxx rule-name (description)", "Fix Snnn false negatives ...") and so the validate check went red on every merged PR (e.g. #46, #47, #48, #49). Branch protection does not require this check, so the merge button stayed enabled and the failures shipped silently in the merged commits' status timelines. Replacing the strict format enforcer with a small inline shell script that catches the actual problems we have hit in practice: 1. Empty / whitespace-only titles -- noticeable in `gh pr list`. 2. Auto-generated branch-name titles (e.g. "Feat/w024-select-distinct -suspicious") that GitHub picks when an async PR-template render wipes out a freshly typed title. Matches a regex of the form ^[A-Z][a-z]+/[a-z0-9-] which catches the real cases without touching anything that looks like a sentence. 3. Titles longer than 100 characters, which truncate badly in the UI. Project titles like "Add W024 select-distinct-suspicious (...)" pass the sanity check; titles like "Feat/foo-bar" fail. Verified the regex matrix locally before committing. To restore stricter enforcement later, swap the inline script back for amannn/action-semantic-pull-request@v5 and adjust the project title convention to start with feat: / fix: prefixes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
New warning rule W024
select-distinct-suspiciousthat fires when a single statement contains both top-levelSELECT DISTINCTand anyJOINkeyword.Developers reach for
SELECT DISTINCTto swallow row duplication caused by an over-wide JOIN cardinality, often because a join condition is missing or because the right tool was actuallyGROUP BY. The result reads more rows than necessary and hides the underlying join-condition bug. The pattern is common enough in real codebases that it earns its own rule rather than living inside S001.What's flagged
What's not flagged
Implementation notes
multiline = Truepluscheck_statementso DISTINCT and JOIN can sit on different lines of the same statement.strip_strings_and_comments(lifted intobase.pypreviously for E009) is reused so DISTINCT or JOIN tokens inside string literals or--//* */comments cannot fire the rule.\bSELECT\s+DISTINCT\bonly matches top-level usage.COUNT(DISTINCT ...),SUM(DISTINCT ...), etc. fail to match because the keyword precedingDISTINCTis a(, notSELECT.Tests
11 new tests in
tests/test_new_rules.pycovering inner join, left join, multi-line, case-insensitive, single-table DISTINCT (passes),COUNT(DISTINCT)with JOIN (passes),SUM(DISTINCT)with JOIN (passes), JOIN without DISTINCT (passes), DISTINCT inside a string literal (passes), DISTINCT inside a--comment (passes), and a message-content assertion. Registry counts bumped to 42 / 31 warnings. Full suite: 254 passed locally, 1 skipped.Closes #40.