
fakeip: persist metadata on every save interval, not just on Close #4140

Open: arthur109 wants to merge 93 commits into SagerNet:testing from arthur109:fakeip-metadata-async-save



@arthur109


Summary

(*CacheFile).FakeIPSaveMetadataAsync has two bugs that compound so that the on-disk fakeip allocation counter advances only when Close() runs. On mobile, where the Android BoxService is often killed before clean teardown (OOM kill, am force-stop, phone reboot), the counter on disk falls behind reality. The next start loads the stale counter, and the allocator in Store.Create() silently overwrites existing reverse-map entries because it doesn't check for collisions before storing.

The two bugs

// experimental/cachefile/fakeip.go (before)
func (c *CacheFile) FakeIPSaveMetadataAsync(metadata *adapter.FakeIPMetadata) {
    if c.saveMetadataTimer == nil {
        c.saveMetadataTimer = time.AfterFunc(C.FakeIPMetadataSaveInterval, func() {
            _ = c.FakeIPSaveMetadata(metadata)   // captures FIRST metadata
        })
    } else {
        c.saveMetadataTimer.Reset(C.FakeIPMetadataSaveInterval)   // never updates the closure
    }
}

Bug A: the timer never fires under load. Create() calls this on every allocation. Every call after the first goes through .Reset(), pushing the deadline out by another 10 s. Any active workload (continuously resolving new domains) therefore keeps postponing the deadline, so the save callback never runs.
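The starvation in Bug A can be reproduced outside sing-box with a bare time.Timer. The sketch below is not the project's code: firesUnderLoad and the scaled-down millisecond values are hypothetical, chosen only so the run finishes quickly. It arms a timer on the first call, resets it on every later call, and counts how often the callback ran while calls kept arriving faster than the interval.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// firesUnderLoad mimics the pre-patch scheduling: the first call arms a timer
// and every later call resets it. It makes `calls` calls spaced gapMs apart and
// returns how many times the callback ran while the calls were still arriving.
// (Hypothetical helper, not part of sing-box.)
func firesUnderLoad(intervalMs, gapMs, calls int) int64 {
	interval := time.Duration(intervalMs) * time.Millisecond
	gap := time.Duration(gapMs) * time.Millisecond

	var fired atomic.Int64
	var timer *time.Timer
	for i := 0; i < calls; i++ {
		if timer == nil {
			timer = time.AfterFunc(interval, func() { fired.Add(1) })
		} else {
			timer.Reset(interval) // pushes the deadline out by a full interval
		}
		time.Sleep(gap)
	}
	if timer != nil {
		timer.Stop() // do not let a late fire leak out of the experiment
	}
	return fired.Load()
}

func main() {
	// Assumed numbers: a 500 ms interval with a call every 10 ms.
	fmt.Println("fires while busy:", firesUnderLoad(500, 10, 20))
}
```

As long as the gap between calls stays below the interval, the count should stay at zero no matter how long the workload runs, which is the behaviour described above.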

Bug B: even if it fires, it persists the wrong data. The first call's metadata pointer is captured by the closure. Subsequent calls construct fresh FakeIPMetadata{...} values and pass them in, but only the timer is reset; the closure is never replaced. If the timer did fire, it would write the very first snapshot of the session.
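Bug B is independent of timing and can be shown with a plain closure. In this minimal sketch, metadata, staleSaver and savedAfter are hypothetical stand-ins (metadata plays the role of adapter.FakeIPMetadata); the callback is built once, on the first call, and therefore only ever sees that call's argument.

```go
package main

import "fmt"

// metadata is a stand-in for adapter.FakeIPMetadata (assumption for this sketch).
type metadata struct{ current int }

// staleSaver mimics the pre-patch shape: the callback is created on the first
// call only and closes over that call's argument.
type staleSaver struct {
	callback func() int
}

func (s *staleSaver) saveAsync(m *metadata) {
	if s.callback == nil {
		s.callback = func() int { return m.current } // captures the FIRST m
	}
	// Later calls would only reset the timer; the closure is left untouched.
}

// savedAfter performs the given number of allocations, each with a fresh
// metadata value, and returns what the callback would persist if it ran.
func savedAfter(allocations int) int {
	s := &staleSaver{}
	for i := 1; i <= allocations; i++ {
		s.saveAsync(&metadata{current: i})
	}
	if s.callback == nil {
		return 0 // no allocation happened, nothing was scheduled
	}
	return s.callback()
}

func main() {
	fmt.Println(savedAfter(40)) // prints 1: the first snapshot, not the 40th
}
```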

Together, the only code path that ever writes correct metadata is Store.Close() → CacheFile.FakeIPSaveMetadata (synchronous, with current values).

Why it matters

Store.Create() does not check whether the proposed next IP is already in fakeip_address:

nextAddress := s.inet4Current.Next()
// ... range / wrap check ...
s.inet4Current = nextAddress
err := s.storage.FakeIPStore(address, domain)   // overwrites silently

So when Start() restores inet4Current from stale metadata, the first ~N allocations of the new session overwrite the reverse-map entries for IPs [stale_counter+1, actual_bucket_max]. The forward map (fakeip_domain*) still points to those IPs for the old domains, so any app that cached the prior DNS answer keeps connecting to fake IPs whose reverse-map entries now resolve to a different domain. The router dials the wrong outbound, the TLS handshake fails with a certificate mismatch, and the affected hosts break, while everything else (newly allocated or unaffected) keeps working.

Reproduced this in production on Android with Instagram: the file's fakeip_metadata had Inet4Current = 198.18.0.6 while fakeip_address held entries up to 198.18.0.40. After restart, ~34 allocations clobbered existing entries before the counter caught up. Profile pictures and chats kept working (those endpoints didn't get clobbered); reels, stories, posts, profile pages failed (their endpoints got reverse-map rewritten).
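The figure of roughly 34 overwrites follows directly from the two addresses reported above. The sketch below replays allocation from a stale counter against an in-memory reverse map; countOverwrites is a hypothetical helper, the domain names are placeholders, and starting the old session at 198.18.0.2 is an assumption made only to populate the map.

```go
package main

import (
	"fmt"
	"net/netip"
)

// countOverwrites replays allocation from a stale counter against a reverse
// map that already holds entries up to bucketMax, and counts how many existing
// entries are replaced before the counter reaches unused address space.
// (Hypothetical helper; it models the behaviour described, not sing-box code.)
func countOverwrites(staleCounter, bucketMax string) int {
	stale := netip.MustParseAddr(staleCounter)
	top := netip.MustParseAddr(bucketMax)

	// Reverse map as the previous session left it on disk.
	// Assumption: the old session's first allocation was 198.18.0.2.
	reverse := map[netip.Addr]string{}
	n := 0
	for a := netip.MustParseAddr("198.18.0.2"); a.Compare(top) <= 0; a = a.Next() {
		n++
		reverse[a] = fmt.Sprintf("old-%d.example", n)
	}

	overwrites := 0
	current := stale
	for current.Compare(top) < 0 {
		current = current.Next()
		if _, exists := reverse[current]; exists {
			overwrites++ // an old domain loses its reverse mapping
		}
		reverse[current] = "new.example" // stored without a collision check
	}
	return overwrites
}

func main() {
	fmt.Println(countOverwrites("198.18.0.6", "198.18.0.40"))  // prints 34
	fmt.Println(countOverwrites("198.18.0.12", "198.18.0.12")) // prints 0
}
```

When the counter on disk equals the highest stored address, as in the post-patch rows of the verification table, the loop never runs and nothing is clobbered.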

Fix

Track the latest metadata in a mutex-protected field on CacheFile, let the timer fire on its own schedule (no Reset), and on fire snapshot the latest pointer and clear the timer so the next allocation reschedules.

// after
func (c *CacheFile) FakeIPSaveMetadataAsync(metadata *adapter.FakeIPMetadata) {
    c.saveMetadataAccess.Lock()
    c.latestFakeIPMetadata = metadata
    if c.saveMetadataTimer == nil {
        c.saveMetadataTimer = time.AfterFunc(C.FakeIPMetadataSaveInterval, func() {
            c.saveMetadataAccess.Lock()
            m := c.latestFakeIPMetadata
            c.saveMetadataTimer = nil
            c.saveMetadataAccess.Unlock()
            if m != nil {
                _ = c.FakeIPSaveMetadata(m)
            }
        })
    }
    c.saveMetadataAccess.Unlock()
}

Two new fields on CacheFile: saveMetadataAccess sync.Mutex and latestFakeIPMetadata *adapter.FakeIPMetadata. Total: 12 added lines, 3 removed.
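For readers who want to try the pattern in isolation, here is a self-contained sketch of the same idea: keep the latest value under a mutex, arm the timer only when none is pending, and let the callback snapshot and clear. The names saver, persist and run, and the millisecond values, are hypothetical; persist stands in for the synchronous FakeIPSaveMetadata.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// metadata is a stand-in for adapter.FakeIPMetadata (assumption for this sketch).
type metadata struct{ current int }

type saver struct {
	access   sync.Mutex
	latest   *metadata
	timer    *time.Timer
	interval time.Duration
	persist  func(*metadata) // stand-in for the synchronous save
}

func (s *saver) saveAsync(m *metadata) {
	s.access.Lock()
	defer s.access.Unlock()
	s.latest = m
	if s.timer != nil {
		return // a save is already scheduled and will pick up s.latest
	}
	s.timer = time.AfterFunc(s.interval, func() {
		s.access.Lock()
		snapshot := s.latest
		s.timer = nil // the next allocation schedules a fresh interval
		s.access.Unlock()
		if snapshot != nil {
			s.persist(snapshot)
		}
	})
}

// run issues allocations spaced gapMs apart and returns the last value that
// was persisted and how many saves happened.
func run(allocations, gapMs, intervalMs int) (last, saves int) {
	var mu sync.Mutex
	interval := time.Duration(intervalMs) * time.Millisecond
	s := &saver{
		interval: interval,
		persist: func(m *metadata) {
			mu.Lock()
			last, saves = m.current, saves+1
			mu.Unlock()
		},
	}
	for i := 1; i <= allocations; i++ {
		s.saveAsync(&metadata{current: i})
		time.Sleep(time.Duration(gapMs) * time.Millisecond)
	}
	time.Sleep(4 * interval) // give the final scheduled save time to run
	mu.Lock()
	defer mu.Unlock()
	return last, saves
}

func main() {
	last, saves := run(40, 5, 50)
	fmt.Println("last persisted:", last, "saves:", saves)
}
```

With these assumed timings, saves should happen repeatedly while allocations are still arriving, and the last persisted value should be the final allocation. The save itself runs outside the lock, so a slow write does not block allocations.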

Behaviour after the patch

  • All allocations within a FakeIPMetadataSaveInterval window are covered by one save, which writes the latest metadata recorded in that window. The timer triggers at most one save per interval; the first allocation after a save schedules a fresh interval.
  • Metadata on disk now lags reality by at most FakeIPMetadataSaveInterval (10 s by default) under continuous load, instead of being unbounded.
  • Close() semantics are unchanged.
  • No new goroutines, no busy-looping, no behavioural regression for users on platforms with reliable clean shutdown.

Verification

Tested end-to-end on a Samsung Android device running an embedded libbox:

| Scenario | Counter on disk | Bucket max | Result |
| --- | --- | --- | --- |
| Pre-patch, after am force-stop cycles | 198.18.0.6 | 198.18.0.40 | next start: ~34 silent overwrites → IG breaks |
| Post-patch, idle session | 198.18.0.12 | 198.18.0.12 | matches; next allocation lands at .13 |
| Post-patch, after am force-stop | 198.18.0.12 | 198.18.0.12 | survives unclean shutdown |
| Post-patch, after real phone reboot + new IG session | 198.18.0.33 | 198.18.0.33 | survives reboot; IG loads reels/stories/profiles |

Test plan

  • Compiles (go build ./experimental/cachefile/)
  • Reproduced bug pre-patch on Android (Instagram broke after phone reboot)
  • Post-patch: metadata advances within ~10 s of allocations
  • Post-patch: metadata survives am force-stop
  • Post-patch: metadata survives full phone reboot
  • Post-patch: new allocations after restart land at bucket_max + 1, no collisions

🤖 Generated with Claude Code

@nekohasekai nekohasekai force-pushed the testing branch 9 times, most recently from 2141845 to b6c416b Compare May 26, 2026 09:45
@nekohasekai nekohasekai force-pushed the testing branch 3 times, most recently from de161cf to 83b7304 Compare June 3, 2026 05:05
@nekohasekai nekohasekai force-pushed the testing branch 7 times, most recently from 1ed57eb to 8247670 Compare June 13, 2026 15:01
@nekohasekai nekohasekai force-pushed the testing branch 10 times, most recently from f7ca395 to f27d0e3 Compare June 21, 2026 04:16
