chore(crtsh): optim + drop unused x509/ct_log joins, group-by cert blob#1775
chore(crtsh): optim + drop unused x509/ct_log joins, group-by cert blob#1775x-stp wants to merge 4 commits into
Conversation
SQL query introed in projectdiscovery#700 mirrored the crt.sh web UI query, computing x.509 metadata, grouped on certificate blobs, joined ct_log_entry per group; none of which subfinder uses. Replaced with a DISTINCT NAME_VALUE projection. Keeps the tsquery + ILIKE pair that made projectdiscovery#700 better than projectdiscovery#666. Issue per ref: projectdiscovery#1773, reduces i/o on crt.shs side.
WalkthroughA SQL query in Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/subscraping/sources/crtsh/crtsh.go`:
- Around line 82-86: The query builds a SELECT into the variable query using
limitClause but when LIMIT is present there is no ORDER BY, making results
non-deterministic; update the SQL construction in crtsh.go so that when
limitClause (e.g., "LIMIT 10000") is appended you also append "ORDER BY
cai.CERTIFICATE_ID DESC" before the limit to preserve the most recent
certificates (keep the ORDER BY placement immediately before limitClause in the
same fmt.Sprintf call or adjust limitClause to include the ORDER BY),
referencing the query variable and the limitClause symbol to locate and modify
the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fb9e1289-5466-4a26-a1f5-7f2759105401
📒 Files selected for processing (1)
pkg/subscraping/sources/crtsh/crtsh.go
| query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE | ||
| FROM certificate_and_identities cai | ||
| WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE) | ||
| AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%') | ||
| %s;`, limitClause) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
crt.sh certwatch schema certificate_and_identities columns CERTIFICATE_ID
💡 Result:
The certificate_and_identities view in the crt.sh certwatch database schema has the following columns, including CERTIFICATE_ID: - CERTIFICATE_ID (from certificate.ID, bigserial/bigint) - CERTIFICATE (from certificate.CERTIFICATE, bytea) - NAME_TYPE (from x509_nameAttributes_raw.ATTRIBUTE_OID or 'san:' + type) - NAME_VALUE (encode(RAW_VALUE, 'escape')) - ISSUER_CA_ID (from certificate.ISSUER_CA_ID, integer)
Citations:
- 1: https://github.com/crtsh/certwatch_db/blob/master/sql/create_schema.sql
- 2: https://github.com/crtsh/certwatch_db/blob/master/fnc/web_apis.fnc
- 3: https://gist.github.com/bbhunter/0e0c5d325e1c344062fc85795e661f02
LIMIT without ORDER BY makes the truncated result set non-deterministic.
When --all is not passed, limitClause is LIMIT 10000 and the new query no longer has any ORDER BY. Postgres is then free to return any 10000 rows matching the predicate. For large targets (domains with tens of thousands of certificates on crt.sh) this means:
- Results can differ between runs even against the same data.
- The previous query ordered by the latest
ct_log_entry.ENTRY_TIMESTAMP, so the cut kept the most recent subdomains; now the 10000 returned may skew to arbitrary/older rows and miss recently issued certs — which is usually the more interesting signal for passive recon.
If the goal is to drop the ct_log_entry join for I/O reasons, add ORDER BY cai.CERTIFICATE_ID DESC to keep the 10000 most recent certificates (CERTIFICATE_ID is auto-incrementing in the certwatch schema):
♻️ Suggested tweak
- query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
- FROM certificate_and_identities cai
- WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
- AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
- %s;`, limitClause)
+ query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
+ FROM certificate_and_identities cai
+ WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
+ AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
+ ORDER BY cai.CERTIFICATE_ID DESC
+ %s;`, limitClause)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE | |
| FROM certificate_and_identities cai | |
| WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE) | |
| AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%') | |
| %s;`, limitClause) | |
| query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE | |
| FROM certificate_and_identities cai | |
| WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE) | |
| AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%') | |
| ORDER BY cai.CERTIFICATE_ID DESC | |
| %s;`, limitClause) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/subscraping/sources/crtsh/crtsh.go` around lines 82 - 86, The query
builds a SELECT into the variable query using limitClause but when LIMIT is
present there is no ORDER BY, making results non-deterministic; update the SQL
construction in crtsh.go so that when limitClause (e.g., "LIMIT 10000") is
appended you also append "ORDER BY cai.CERTIFICATE_ID DESC" before the limit to
preserve the most recent certificates (keep the ORDER BY placement immediately
before limitClause in the same fmt.Sprintf call or adjust limitClause to include
the ORDER BY), referencing the query variable and the limitClause symbol to
locate and modify the code.
There was a problem hiding this comment.
You're incorrect. Look at what the code did before.
There was a problem hiding this comment.
@x-stp, you're right — I apologize for the incorrect analysis.
Looking at the old query more carefully, the limitClause (LIMIT 10000) was placed inside the innermost subquery (sub), i.e.:
FROM (SELECT *
FROM certificate_and_identities cai
WHERE ...
%s -- ← LIMIT 10000 was here
) sub
GROUP BY sub.CERTIFICATEThe ORDER BY le.ENTRY_TIMESTAMP DESC NULLS LAST was on the outer SELECT with no LIMIT there. This means the LIMIT truncated the raw inner rows before any grouping or ordering — it was equally non-deterministic. My claim that the old query kept the "most recent" subdomains via the ct_log_entry timestamp ordering was wrong; the LIMIT and the ORDER BY operated at different levels and never interacted as a "top-N by recency" pattern.
The new query moves the LIMIT to the outer level, which is consistent behavior. I'm retracting my concern — it was based on a misreading of the old query structure.
✏️ Learnings added
Learnt from: x-stp
Repo: projectdiscovery/subfinder PR: 1775
File: pkg/subscraping/sources/crtsh/crtsh.go:82-86
Timestamp: 2026-04-17T18:13:32.865Z
Learning: In `pkg/subscraping/sources/crtsh/crtsh.go`, the old (pre-PR-#1775) SQL query placed the `LIMIT 10000` clause inside the innermost subquery `sub` (before `GROUP BY`), while the `ORDER BY le.ENTRY_TIMESTAMP DESC NULLS LAST` was on the outermost SELECT with no LIMIT. The LIMIT and ORDER BY therefore operated at different levels and did NOT guarantee "most recent N" results — the truncation was non-deterministic in the old query too. Do not flag non-determinism from missing ORDER BY as a regression introduced by PR `#1775`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Neo - PR Security ReviewCaution Neo couldn't finish analyzing this pull request during this run. Please run the review again. Comment |
SQL query introed in #700 mirrored the crt.sh web UI query, computing x.509 metadata, grouped on certificate blobs, joined ct_log_entry per group; none of which subfinder uses.
Proposed changes
Replaced with a DISTINCT NAME_VALUE projection. Keeps the tsquery + ILIKE pair that made #700 better than #666. Issue per ref: #1773, reduces i/o on crt.shs side.
Proof
Ran tests. Ran the query. Saw the speed; didn't benchmark a full blown set against crt.sh as that's resouce exhaustion; compute can be calculated though; see issue and also in the query the post-compute grouping on tz; it's never used even by any Caller.
Checklist
Summary by CodeRabbit