Skip to content

chore(crtsh): optim + drop unused x509/ct_log joins, group-by cert blob#1775

Open
x-stp wants to merge 4 commits into
projectdiscovery:devfrom
x-stp:fix/crtsh-query
Open

chore(crtsh): optim + drop unused x509/ct_log joins, group-by cert blob#1775
x-stp wants to merge 4 commits into
projectdiscovery:devfrom
x-stp:fix/crtsh-query

Conversation

@x-stp
Copy link
Copy Markdown
Contributor

@x-stp x-stp commented Apr 17, 2026

SQL query introed in #700 mirrored the crt.sh web UI query, computing x.509 metadata, grouped on certificate blobs, joined ct_log_entry per group; none of which subfinder uses.

Proposed changes

Replaced with a DISTINCT NAME_VALUE projection. Keeps the tsquery + ILIKE pair that made #700 better than #666. Issue per ref: #1773, reduces i/o on crt.shs side.

Proof

Ran tests. Ran the query. Saw the speed; didn't benchmark a full blown set against crt.sh as that's resouce exhaustion; compute can be calculated though; see issue and also in the query the post-compute grouping on tz; it's never used even by any Caller.

Checklist

  • [Yes] Pull request is created against the dev branch
  • [Yep] All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • [n/a] I have added tests that prove my fix is effective or that my feature works
  • [See Issue] I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • Refactor
    • Optimized subdomain discovery performance by streamlining the query structure for certificate data retrieval.

SQL query introed in projectdiscovery#700 mirrored the crt.sh web UI query, computing
x.509 metadata, grouped on certificate blobs, joined ct_log_entry
per group; none of which subfinder uses. Replaced with a DISTINCT NAME_VALUE projection.
Keeps the tsquery + ILIKE pair that made projectdiscovery#700 better than projectdiscovery#666.
Issue per ref: projectdiscovery#1773, reduces i/o on crt.shs side.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 17, 2026

Walkthrough

A SQL query in getSubdomainsFromSQL was simplified from a complex CTE-based aggregation with issuer CA joins and CT log ordering to a direct SELECT DISTINCT query that filters by the same conditions while removing certificate-related joins and timestamp-based ordering.

Changes

Cohort / File(s) Summary
SQL Query Simplification
pkg/subscraping/sources/crtsh/crtsh.go
Replaced complex CTE-based aggregation with direct SELECT DISTINCT cai.NAME_VALUE query; removed issuer CA joins, array_agg, and CT log timestamp ordering while maintaining existing filter conditions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hops through queries, swift and bright,
A simpler path now shines so right,
Less joins and CTEs, cleaner code,
The rabbit's work eases the load!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: SQL query optimization by removing unused x509/ct_log joins and grouping operations.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/subscraping/sources/crtsh/crtsh.go`:
- Around line 82-86: The query builds a SELECT into the variable query using
limitClause but when LIMIT is present there is no ORDER BY, making results
non-deterministic; update the SQL construction in crtsh.go so that when
limitClause (e.g., "LIMIT 10000") is appended you also append "ORDER BY
cai.CERTIFICATE_ID DESC" before the limit to preserve the most recent
certificates (keep the ORDER BY placement immediately before limitClause in the
same fmt.Sprintf call or adjust limitClause to include the ORDER BY),
referencing the query variable and the limitClause symbol to locate and modify
the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb9e1289-5466-4a26-a1f5-7f2759105401

📥 Commits

Reviewing files that changed from the base of the PR and between 85929ab and 3ae01d0.

📒 Files selected for processing (1)
  • pkg/subscraping/sources/crtsh/crtsh.go

Comment on lines +82 to +86
query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
FROM certificate_and_identities cai
WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
%s;`, limitClause)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

crt.sh certwatch schema certificate_and_identities columns CERTIFICATE_ID

💡 Result:

The certificate_and_identities view in the crt.sh certwatch database schema has the following columns, including CERTIFICATE_ID: - CERTIFICATE_ID (from certificate.ID, bigserial/bigint) - CERTIFICATE (from certificate.CERTIFICATE, bytea) - NAME_TYPE (from x509_nameAttributes_raw.ATTRIBUTE_OID or 'san:' + type) - NAME_VALUE (encode(RAW_VALUE, 'escape')) - ISSUER_CA_ID (from certificate.ISSUER_CA_ID, integer)

Citations:


LIMIT without ORDER BY makes the truncated result set non-deterministic.

When --all is not passed, limitClause is LIMIT 10000 and the new query no longer has any ORDER BY. Postgres is then free to return any 10000 rows matching the predicate. For large targets (domains with tens of thousands of certificates on crt.sh) this means:

  • Results can differ between runs even against the same data.
  • The previous query ordered by the latest ct_log_entry.ENTRY_TIMESTAMP, so the cut kept the most recent subdomains; now the 10000 returned may skew to arbitrary/older rows and miss recently issued certs — which is usually the more interesting signal for passive recon.

If the goal is to drop the ct_log_entry join for I/O reasons, add ORDER BY cai.CERTIFICATE_ID DESC to keep the 10000 most recent certificates (CERTIFICATE_ID is auto-incrementing in the certwatch schema):

♻️ Suggested tweak
-	query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
-			FROM certificate_and_identities cai
-			WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
-			  AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
-			%s;`, limitClause)
+	query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
+			FROM certificate_and_identities cai
+			WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
+			  AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
+			ORDER BY cai.CERTIFICATE_ID DESC
+			%s;`, limitClause)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
FROM certificate_and_identities cai
WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
%s;`, limitClause)
query := fmt.Sprintf(`SELECT DISTINCT cai.NAME_VALUE
FROM certificate_and_identities cai
WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
AND cai.NAME_VALUE ILIKE ('%%' || $1 || '%%')
ORDER BY cai.CERTIFICATE_ID DESC
%s;`, limitClause)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/subscraping/sources/crtsh/crtsh.go` around lines 82 - 86, The query
builds a SELECT into the variable query using limitClause but when LIMIT is
present there is no ORDER BY, making results non-deterministic; update the SQL
construction in crtsh.go so that when limitClause (e.g., "LIMIT 10000") is
appended you also append "ORDER BY cai.CERTIFICATE_ID DESC" before the limit to
preserve the most recent certificates (keep the ORDER BY placement immediately
before limitClause in the same fmt.Sprintf call or adjust limitClause to include
the ORDER BY), referencing the query variable and the limitClause symbol to
locate and modify the code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're incorrect. Look at what the code did before.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@x-stp, you're right — I apologize for the incorrect analysis.

Looking at the old query more carefully, the limitClause (LIMIT 10000) was placed inside the innermost subquery (sub), i.e.:

FROM (SELECT *
        FROM certificate_and_identities cai
        WHERE ...
            %s   -- ← LIMIT 10000 was here
      ) sub
GROUP BY sub.CERTIFICATE

The ORDER BY le.ENTRY_TIMESTAMP DESC NULLS LAST was on the outer SELECT with no LIMIT there. This means the LIMIT truncated the raw inner rows before any grouping or ordering — it was equally non-deterministic. My claim that the old query kept the "most recent" subdomains via the ct_log_entry timestamp ordering was wrong; the LIMIT and the ORDER BY operated at different levels and never interacted as a "top-N by recency" pattern.

The new query moves the LIMIT to the outer level, which is consistent behavior. I'm retracting my concern — it was based on a misreading of the old query structure.


✏️ Learnings added
Learnt from: x-stp
Repo: projectdiscovery/subfinder PR: 1775
File: pkg/subscraping/sources/crtsh/crtsh.go:82-86
Timestamp: 2026-04-17T18:13:32.865Z
Learning: In `pkg/subscraping/sources/crtsh/crtsh.go`, the old (pre-PR-#1775) SQL query placed the `LIMIT 10000` clause inside the innermost subquery `sub` (before `GROUP BY`), while the `ORDER BY le.ENTRY_TIMESTAMP DESC NULLS LAST` was on the outermost SELECT with no LIMIT. The LIMIT and ORDER BY therefore operated at different levels and did NOT guarantee "most recent N" results — the truncation was non-deterministic in the old query too. Do not flag non-determinism from missing ORDER BY as a regression introduced by PR `#1775`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@neo-by-projectdiscovery-dev
Copy link
Copy Markdown

neo-by-projectdiscovery-dev Bot commented Apr 18, 2026

Neo - PR Security Review

Caution

Neo couldn't finish analyzing this pull request during this run. Please run the review again.

Comment @pdneo help for available commands. · Open in Neo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant