Fix file saving error during parallel operations (#388)#440
Open
devdattatalele wants to merge 3 commits into
Open
Fix file saving error during parallel operations (#388)#440devdattatalele wants to merge 3 commits into
devdattatalele wants to merge 3 commits into
Conversation
Implemented FileLock synchronization to resolve "Too many open files" error that occurs during concurrent file operations in training pipelines. Changes: - Added filelock dependency to pyproject.toml - Implemented FileLock synchronization in all file I/O operations: * save_json(), save_csv(), save_pickle() for write operations * load_json(), load_pickle(), load_jsonl() for read operations * append_to_jsonl(), write_list_to_jsonl() for append operations - Standardized exception handling to use IOError consistently - Maintained 100% backward compatibility with existing APIs Technical Details: - Each file operation uses a dedicated lock file (filename + ".lock") - Context managers ensure proper cleanup of locks and file handles - Prevents file descriptor exhaustion in multi-threaded training scenarios - Eliminates race conditions in concurrent checkpoint saving and logging Testing: - Comprehensive parallel operation tests completed successfully - Simulated training scenarios with 25 steps and 8 concurrent workers - Stress tested with 300+ concurrent file operations - All existing tests pass, confirming no breaking changes Resolves: SylphAI-Inc#388
This commit addresses all identified code smells and architectural issues in the previous implementation while maintaining the core functionality of preventing "too many open files" errors. Major Architectural Improvements: - Introduced Strategy Pattern for configurable file operation strategies - Added dependency injection for testing and flexibility - Implemented reader-writer optimization for better concurrency - Created proper abstraction layer separating concerns - Added comprehensive lock file cleanup mechanisms Code Quality Improvements: - Eliminated shotgun surgery by centralizing locking logic - Removed duplicate code through proper abstraction - Fixed overly broad exception handling - Reduced unnecessary blocking for read operations - Restored single responsibility principle Performance Enhancements: - Reader optimization allows concurrent reads when safe - Configurable timeout and strategy selection - Automatic stale lock file cleanup - Minimal overhead for NoLock and ThreadLocal strategies Testing & Maintainability: - Full dependency injection for unit testing - Mockable strategy interfaces - Configurable behavior for different environments - Clean separation of concerns Backward Compatibility: - All existing file_io.py APIs unchanged - No breaking changes to function signatures - Drop-in replacement with better architecture Technical Details: - FileOperationStrategy interface with multiple implementations - ThreadSafeFileOperations with singleton pattern for global access - Automatic fallback from FileLock to ThreadLocal locks - Comprehensive error handling and logging - Lock file cleanup with age-based orphan removal The implementation now follows SOLID principles and provides a clean, testable, and maintainable solution to the concurrent file access issue.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR resolves the "Too many open files" error that occurs during parallel operations in training pipeline, specifically addressing issue #388. The solution implements a robust, configurable file operations architecture with proper abstraction layers.
Root Cause Analysis
The issue occurred because multiple parallel threads/processes attempted to perform file operations simultaneously without synchronization, leading to:
Strategy Pattern Implementation
Available Strategies:
NoLockStrategy: No synchronization (development/single-threaded)ThreadLocalLockStrategy: In-process thread synchronizationFileLockStrategy: Cross-process file locking with reader optimization** Performance Optimizations:**
Implementation
Core Classes
1. File Operations Strategies
2. Dependency Injection
3. Clean File I/O Interface
Testing & Maintainability
** Full Testability:**
Test Results
Performance Impact
Closes #388