Skip to content

defrag: drop unsigned underflow in RB_NFIND search key#15403

Open
ssam18 wants to merge 1 commit into
OISF:mainfrom
ssam18:defrag-rb-key-underflow-v1
Open

defrag: drop unsigned underflow in RB_NFIND search key#15403
ssam18 wants to merge 1 commit into
OISF:mainfrom
ssam18:defrag-rb-key-underflow-v1

Conversation

@ssam18
Copy link
Copy Markdown

@ssam18 ssam18 commented May 17, 2026

Make sure these boxes are checked accordingly before submitting your Pull Request. Thank you

Contribution style:

Our Contribution agreements:

Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/8232

Describe changes:

  • DefragInsertFrag seeds the RB tree search with frag_offset - 1 as the key. frag_offset is uint16_t, so the first fragment of any IP datagram (frag_offset == 0) wraps the key to 65535. RB_NFIND then returns NULL and the existing RB_MIN fallback walks the tree forward. The end result is correct but the path is reached by accident, and the underflow is what fuzzer integer-overflow sanitizers flag.
  • The -1 is functionally a no-op even on inputs where it does not underflow. Fragment offsets are always multiples of 8 (IPV4_GET_RAW_FRAGOFFSET << 3 for IPv4, the IPv6 fragment header offset field for IPv6), so no fragment can ever have offset == frag_offset - 1. RB_NFIND returns the same neighbour for keys frag_offset and frag_offset - 1 on every realistic input.
  • Patch is one character: drop the - 1. The RB_PREV / RB_MIN fallback logic that follows is intentionally left intact. In the no-strict-greater case the existing code starts at the tree minimum and walks forward, which is the only safe choice given the defrag policies do not enforce a non-overlapping invariant on offset + data_len (overlaps are handled through ltrim, not by trimming data_len). Switching to RB_MAX would miss overlaps where an earlier long fragment extends past later ones.

Verification:

  • All 35 defrag unit tests pass locally with --enable-unittests, including every Sturges-Novak overlap-evasion pattern for BSD, Linux, Windows, Solaris, First, and Last policies on IPv4 and IPv6, the BSD subsequent-overlap-of-original tests, the missing-fragment tests, and the trailing no-MF-fragment tests.

Provide values to any of the below to override the defaults.

  • To use a Suricata-Verify or Suricata-Update pull request,
    link to the pull request in the respective _BRANCH variable.
  • Leave unused overrides blank or remove.

SV_REPO=https://github.com/ssam18/suricata-verify
SV_BRANCH=bug-8232-defrag-first-frag-regress
SU_REPO=
SU_BRANCH=

@ssam18 ssam18 requested a review from victorjulien as a code owner May 17, 2026 00:09
@victorjulien
Copy link
Copy Markdown
Member

Do you have a pcap based test case? If not, can you craft one?

@github-actions
Copy link
Copy Markdown

NOTE: This PR may contain new authors.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.63%. Comparing base (367ca7f) to head (a8f49b6).
⚠️ Report is 45 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15403      +/-   ##
==========================================
- Coverage   82.66%   82.63%   -0.03%     
==========================================
  Files         996      996              
  Lines      271604   271609       +5     
==========================================
- Hits       224512   224442      -70     
- Misses      47092    47167      +75     
Flag Coverage Δ
fuzzcorpus 60.92% <100.00%> (+0.05%) ⬆️
livemode 18.35% <0.00%> (+0.01%) ⬆️
netns 22.67% <69.23%> (-0.09%) ⬇️
pcap 45.07% <100.00%> (-0.06%) ⬇️
suricata-verify 66.42% <100.00%> (-0.05%) ⬇️
unittests 58.46% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 18, 2026

Do you have a pcap based test case? If not, can you craft one?

Crafted the PCAP and added another PR OISF/suricata-verify#3098

@ssam18 ssam18 force-pushed the defrag-rb-key-underflow-v1 branch from c7dd282 to a8f49b6 Compare May 18, 2026 15:40
@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 20, 2026

The CI failures (CodeQL cpp and rust) are both failing at the Adding fingerprints to SARIF file upload step with API rate limit exceeded for installation, not on any real finding. The actual analysis completed cleanly. Could a maintainer re-run those two jobs once the rate limit window has reset? Thanks.

@catenacyber
Copy link
Copy Markdown
Contributor

I am not sure there is a bug if the test pass as is.

How did you do the fuzzing ? Which sanitizers ?

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 21, 2026

The original underflow was reported by Sergey Zhidkih in Redmine #8232. The report says it surfaced during fuzzing, but I have no visibility into the specific fuzzer or sanitizer setup that was used on their side.

You are right that on a normal release build the algorithm still produces correct output. uint16_t wraparound is defined in C, RB_NFIND with key 65535 always returns NULL on a real fragment tree (no IPv4 fragment can have offset 65535, since the 13-bit field shifted by 3 caps at 65528), and the existing RB_MIN fallback walks forward and reaches the same end state. So depending on where the line is drawn, this is either a latent issue worth fixing (sanitizer finding, dead -1, defensively wrong intent) or a no-op cleanup.

What the PR adds beyond removing the -1 is the explicit first-fragment branch plus DEBUG_VALIDATE_BUG_ON(frag_offset == 0) in the else. That assertion is what gives the SV test its teeth on a debug-validation build. With the pre-fix code structure (no special case, -1 still in the key) and the same BUG_ON in place, the SV test fails with exit -6 from SIGABRT on the first fragment of the pcap. With the fix in place the test passes. So the fail-on-old, pass-on-new property holds under --enable-debug-validation, but does not hold on a vanilla release build.

If you would rather drop the SV test and keep this as a pure cleanup commit, that works too. Just let me know which way you want it.

Comment thread src/defrag.c Outdated
@catenacyber
Copy link
Copy Markdown
Contributor

I would prefer to have the right test (with the right sanitizer if needed) that fails now and passes with this PR.

Comment thread src/defrag.c Outdated
Copy link
Copy Markdown
Contributor

@catenacyber catenacyber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it needs a better and smaller patch

@jasonish
Copy link
Copy Markdown
Member

This probably needs multiple PRs to demonstrate.

  1. A Suricata PR just adding a debug validation along with a SV PR to show this actually failing.
  2. Build on the above PRs to show the fix.

@ssam18 ssam18 force-pushed the defrag-rb-key-underflow-v1 branch from a8f49b6 to e31c44a Compare May 21, 2026 18:07
@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 21, 2026

@catenacyber thanks, your version is cleaner. Force-pushed e31c44a71 with the simpler shape. The block is now just if (frag_offset > 0) { ... RB_NFIND(...); } followed by the existing if (next == NULL) fallback. No duplicated RB_MIN code, no special-case branch, and no defensive BUG_ON since the guard already keeps the subtraction off the zero-offset path. All 35 defrag unit tests still pass.

@jasonish on the multi-PR demonstrator idea, with the simpler patch shape the SV test (OISF/suricata-verify#3098) no longer fails on the pre-fix code under --enable-debug-validation, because there is no longer a special-case branch for the BUG_ON to live in. To actually demonstrate the bug we would need a separate tiny Suricata PR that adds DEBUG_VALIDATE_BUG_ON(frag_offset == 0) just before the frag_offset - 1 line, paired with the SV test, which together would show the wrap firing on the first fragment of the pcap. That demonstrator PR is intentionally a one-off (landing it would break debug-validation CI), so it would need to merge back-to-back with this one.

If you want me to put that pair together, I can open it as a small companion PR. Otherwise this PR stands as a code-quality cleanup that silences the sanitizer finding and tightens intent.

@Rx1513
Copy link
Copy Markdown
Contributor

Rx1513 commented May 21, 2026

How did you do the fuzzing ?

Simply ran fuzz_decodepcapfile.

Which sanitizers ?

    export CLANG_SANITIZERS="                   \
        -fsanitize=undefined                    \
        -fsanitize=float-divide-by-zero         \
        -fsanitize=integer                      \
        -fsanitize=implicit-conversion          \
        -fsanitize=local-bounds                 \
        -fsanitize=nullability                  \
        -fsanitize=vptr                         \
        -fsanitize=address                      \
        -fsanitize=alignment                    \
        "

The actual sanitizer report is runtime error: implicit conversion from type 'int' of value -1 (32-bit, signed) to type 'uint16_t' (aka 'unsigned short') changed the value to 65535 (16-bit, unsigned). So the main issue I wanted to address is why frag_offset has to be decreased by one and is it really necessary?

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 21, 2026

@Rx1513 thanks for the details. You are right that the sanitizer trip is -fsanitize=implicit-conversion flagging the narrowing of -1 (the int result of the subtraction) to uint16_t, not unsigned wrap on uint16_t itself. The subtraction promotes to int first, so the value change happens during the assignment to .offset, not inside frag_offset. I will update the commit message wording to reflect that.

On whether the -1 is necessary at all, your reading matches mine. Fragment offsets are always multiples of 8 (IPV4_GET_RAW_FRAGOFFSET << 3 for IPv4, and the IPv6 fragment header offset field), so RB_NFIND with key = frag_offset - 1 and with key = frag_offset return the same node on every realistic input. The -1 is dead code that landed with the RB tree refactor in fe6e96a8c, and dropping it is genuinely the smallest correct fix.

The current branch took @catenacyber's earlier suggestion of wrapping the call with if (frag_offset > 0) and keeping the -1 inside, which avoids the sanitizer trip but preserves the dead subtraction. I can drop the -1 entirely instead, which is even smaller and matches what you originally pointed at on Redmine.

@catenacyber which shape do you prefer?

@catenacyber
Copy link
Copy Markdown
Contributor

I think the test would be to run CI with -fsanitize=implicit-conversion and I think there are other bugs to fix, but this looks like a right goal.

If just drop the -1 entirely works, that is a small patch and good for me...

The seek into the fragment tree used frag_offset - 1 as the key
for RB_NFIND. frag_offset is uint16_t, the subtraction promotes
to int, and on the first fragment of an IP datagram the int
result -1 is then narrowed to uint16_t when assigned to the key
field, wrapping to 65535. Builds with -fsanitize=implicit-conversion
flag this narrowing.

The wrap is harmless in practice because no fragment can have
offset 65535 (IPv4 limits the field to 8191 << 3 = 65528, and
IPv6 follows the same alignment), so RB_NFIND returns NULL and
the existing RB_MIN fallback walks the tree forward to reach the
same end state. The path is just reached by accident.

The -1 is also functionally a no-op even on inputs where it does
not underflow. Fragment offsets are always multiples of 8, so
RB_NFIND with key = frag_offset - 1 and with key = frag_offset
return the same node on every realistic input. Drop the -1
entirely. The 35 defrag unit tests pass unchanged.

Bug: OISF#8232.
@ssam18 ssam18 force-pushed the defrag-rb-key-underflow-v1 branch from e31c44a to 67efa92 Compare May 22, 2026 15:33
@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 22, 2026

Force-pushed 67efa924e with just the -1 dropped. Commit message now describes the sanitizer trip as -fsanitize=implicit-conversion on the narrowing assignment rather than unsigned wrap. Single-line diff.

@catenacyber
Copy link
Copy Markdown
Contributor

catenacyber commented May 22, 2026

With -fsanitize=implicit-conversion I first trip on

detect-engine-build.c:1442:24: runtime error: implicit conversion from type 'int' of value 65536 (32-bit, signed) to type 'uint16_t' (aka 'unsigned short') changed the value to 0 (16-bit, unsigned)
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior detect-engine-build.c:1442:24

Don't you ?

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 23, 2026

@catenacyber No, fuzz_decodepcapfile only runs PostConfLoadedSetup and then decodes pcaps. It never calls SigLoadSignatures so the port-interval code isn't reached, which is why I didn't see that trip. But yeah, port2 + 1 with port2 == 65535 is the same class of bug. I can take it as a follow-up PR after this one lands, plus another adding -fsanitize=implicit-conversion to a CI job so we catch the rest in one pass.

@catenacyber
Copy link
Copy Markdown
Contributor

I meant for SV tests with -fsanitize=implicit-conversion

To sum up, I think what this PR is fixing here is just some UBSAN report.
This is indeed good to fix, but I think we would want as a first phase, to pass all unit tests + all SV tests with this sanitizer -fsanitize=implicit-conversion
I am not sure this change is valuable if we cannot ensure that we will not have regressions thanks to some test.

Do you agree with this ?

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 25, 2026

I meant for SV tests with -fsanitize=implicit-conversion

To sum up, I think what this PR is fixing here is just some UBSAN report. This is indeed good to fix, but I think we would want as a first phase, to pass all unit tests + all SV tests with this sanitizer -fsanitize=implicit-conversion I am not sure this change is valuable if we cannot ensure that we will not have regressions thanks to some test.

Do you agree with this ?

Agreed, the CI gate is the right end state. This fix is correct on its own (the -1 is dead code per Rx1513's analysis), and silencing the UBSAN trip is a side benefit, but you are right that without the sanitizer enabled in CI, we can't catch regressions to it. Two ways to sequence from here...

a. Land this as the first PR in a series: fix the other -fsanitize=implicit-conversion trips you flagged (detect-engine-build.c:1442 etc.), then a CI PR enable the sanitizer on unit + SV tests
b. Hold, until the whole gate is ready and land them together

Either works for me. Let me know what you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants