Skip to content

fix: prevent unbounded memory leak during concurrent validation under…#239

Open
KyleOps wants to merge 1 commit intohapifhir:masterfrom
KyleOps:fix/validation-service-memory-leak
Open

fix: prevent unbounded memory leak during concurrent validation under…#239
KyleOps wants to merge 1 commit intohapifhir:masterfrom
KyleOps:fix/validation-service-memory-leak

Conversation

@KyleOps
Copy link
Copy Markdown

@KyleOps KyleOps commented Feb 26, 2026

This PR fixes a memory and thread exhaustion vulnerability in ValidationServiceFactoryImpl.kt.

Under high concurrency, if the JVM's free memory dips below the engineReloadThreshold, the getValidationService() method re-initializes ValidationService. Because this check was not synchronized, a burst of incoming validation requests during a low-memory state would all evaluate the condition as true simultaneously.

This resulted in an unbounded loop where the application spawned dozens of massive ValidationEngine instances and hundreds of detached background downloader threads to reload presets. The JVM quickly spiraled into maxed CPU usage, gigabytes of leaked memory, and eventual crashes.

Impact & Context

We discovered this in while utilizing the validator API in conjunction with au-fhir-inferno. Under heavy testing load, the validator-wrapper would hit the memory threshold and trigger this un-synchronized cache reset. The true impact was three-fold:

  1. Severe API Latency: The heavy I/O of re-building the cache blocks threads for several seconds. When 5 requests initialize 5 engines simultaneously, the CPU maxes out and all incoming requests hang, causing massive latency spikes (5-20s+) and upstream application timeouts.
  2. Caching "Thrashing": Under load, if the threshold is reached frequently, the server spins up a death spiral of actively deleting and downloading large FHIR packages, completely negating the benefit of an in-memory cache.
  3. Thread, File & Network Exhaustion: Even if the environment's memory can survive the spike, the background threads consume vast network bandwidth pulling .tgz files from packages.fhir.org and leak file descriptors on the host container.

Changes

  • Applied @Volatile to validationService.
  • Added a ReentrantLock (reloadLock) to ensure only one thread can lock and initialize the cache.
  • Added a 5-minute cooldown (RELOAD_COOLDOWN_MS) so the engine isn't perpetually thrashed under sustained heavy load.
  • Concurrent threads that hit the low memory state while the lock is acquired will now safely log and skip initialization, returning the existing engine and preventing the death spiral.

… load

Introduced a ReentrantLock and rate-limiting to `getValidationService()` in ValidationServiceFactoryImpl.kt. When the JVM hits the low memory threshold during high concurrency, the application previously allowed all concurrent requests to spin up parallel initialization threads, spiraling memory usage and crashing the application.

This fix ensures only a single thread can trigger the engine reload within a 5-minute cooldown period, allowing concurrent requests to queue or utilize the existing engine rather than flooding the JVM with unbounded heavy cache initialization threads.
@dotasek
Copy link
Copy Markdown
Collaborator

dotasek commented Apr 6, 2026

/AzurePipelines run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dotasek
Copy link
Copy Markdown
Collaborator

dotasek commented Apr 6, 2026

@KyleOps thank you for this contribution.

I tested this locally and discovered that even with this in place, I get out of memory errors. I suspect that what's happening is that a new ValidationService is being instantiated before the previous one can be collected in garbage collection.

I made a change while reviewing that addresses this:

// New convenience function
 private fun createEmptyValidationServiceInstance() : ValidationService {
        val service = object : ValidationService() {};
        return service;
    }
// Logic change
     if (java.lang.Runtime.getRuntime().freeMemory() < validationServiceConfig.engineReloadThreshold &&
                        (now - lastReloadTime) > RELOAD_COOLDOWN_MS) {
                        
                        println(
                            "Free memory ${
                                java.lang.Runtime.getRuntime().freeMemory()
                            } is less than ${validationServiceConfig.engineReloadThreshold} and cooldown passed. Re-initializing validationService"
                        );
                        //First create an empty validationService so the older one can be garbage collected before the
                        //new one is constructed.
                        validationService = createEmptyValidationServiceInstance()
                        //Then 'suggest' this might be a good time to do garbage collection
                        System.gc()
                        //Then build our new service instance.
                        validationService = createValidationServiceInstance()
                        lastReloadTime = System.currentTimeMillis()
                    }

Does that make sense? Does it work with your use case as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants