You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Document direct API call investigation and root cause analysis
- Added comprehensive analysis of why PyO3 direct API calls don't provide expected performance
- Documented that direct conversions (into_py/extract) are already implemented for common types
- Identified root cause: architectural difference (Python ↔ Bson ↔ bytes vs Python ↔ bytes)
- Added comparison table showing impact of each bottleneck (2-3x, 1.5-2x, 1.2-1.5x, 1.1-1.3x)
- Updated Priority 2 status to 'ALREADY IMPLEMENTED' with investigation findings
- Renamed Priority 3 to 'Bypass Rust bson Crate' with realistic effort estimate (20-30 hours)
- Corrected performance table with actual benchmark results (0.21x baseline)
- Added conclusion section summarizing PYTHON-5683 investigation completion
- Clarified trade-offs: safety/maintainability vs performance vs development time
Key finding: Rust extension is ~5x slower due to fundamental architectural differences,
not missing optimizations. Reaching parity would require bypassing the Rust bson crate
entirely and writing BSON bytes directly (similar to C extension approach).
The Rust extension was initially slower than the C extension due to **Python FFI overhead** - specifically, repeated type imports on every BSON conversion. With comprehensive type caching now implemented, performance improved by ~24% (0.21x → 0.26x). However, significant overhead remains from:
324
-
- Python object creation for every BSON value (even with cached types)
325
-
- PyO3 FFI overhead when calling Python constructors
326
-
- Lack of fast paths for common types (C extension uses direct C API calls)
323
+
The Rust extension was initially slower than the C extension due to **Python FFI overhead** - specifically, repeated type imports on every BSON conversion. With comprehensive type caching now implemented, performance improved by ~24% (0.21x → 0.26x). However, significant overhead remains.
327
324
328
-
The type caching helped but wasn't the silver bullet we hoped for. The C extension's performance advantage comes from using low-level C API calls (`PyLong_FromLong`, `PyUnicode_FromStringAndSize`, etc.) instead of calling Python constructors through FFI.
325
+
**Investigation of Direct API Calls:**
329
326
330
-
**Recommendation:** C extension remains the default and recommended choice. The Rust extension demonstrates feasibility and correctness, with type caching providing modest improvements. Further optimizations (Priority 2-4) are needed to approach performance parity.
327
+
After implementing type caching, we investigated using PyO3's direct API calls instead of Python constructors. **Key finding: The code already uses direct conversions for common types!**
- Int64: Uses cached class constructor (intentional - preserves type information)
336
+
337
+
**Encoding path (already optimized):**
338
+
- None → `Bson::Null` (direct)
339
+
- bool → `Bson::Boolean(v)` (direct extraction)
340
+
- int → `Bson::Int32/Int64` (direct extraction)
341
+
- float → `Bson::Double(v)` (direct extraction)
342
+
- str → `Bson::String(v)` (direct extraction)
343
+
344
+
**BSON-specific types (require constructors):**
345
+
- ObjectId, Timestamp, Decimal128, Binary, etc. **must** call Python constructors because they are custom Python classes without direct PyO3 equivalents.
346
+
347
+
**The Real Bottlenecks:**
348
+
349
+
1.**Architectural difference:**
350
+
- C extension: Python ↔ bytes (direct, single conversion)
**Recommendation:** C extension remains the default and recommended choice. The Rust extension demonstrates feasibility and correctness (100% test pass rate), with type caching providing modest improvements. To reach performance parity, we would need to fundamentally change the architecture to bypass the intermediate `Bson` types and write/read BSON bytes directly (similar to the C extension's approach).
360
+
361
+
### Key Findings: Why Rust is Slower
362
+
363
+
After implementing type caching and investigating direct API calls, we've identified the root causes of the performance gap:
364
+
365
+
**✅ What's Already Optimized:**
366
+
- Direct conversions for common types (int, str, float, bool, null) using `into_py()` and `extract()`
367
+
- Type caching for all BSON-specific types (ObjectId, Timestamp, etc.)
368
+
- Fast path for PyDict iteration (most common case)
369
+
- Direct byte reading/writing without intermediate Document structures
- ⚠️ **After major refactor:** Memory safety, more complex code, ~2x slower (estimated)
394
+
- ✅ **C extension:** Maximum performance, requires careful memory management
331
395
332
396
### Path to Performance Parity
333
397
334
-
Analysis of the C extension reveals several optimization opportunities to achieve near-parity performance:
398
+
Analysis of the C extension and Rust implementation reveals the optimization opportunities:
335
399
336
400
#### Priority 1: Type Caching (HIGH IMPACT) ✅ **IMPLEMENTED**
337
401
@@ -381,65 +445,104 @@ struct TypeCache {
381
445
**Actual Impact:**~1.24x faster overall (0.21x → 0.26x average ratio)
382
446
**Actual Effort:**~6 hours
383
447
384
-
**Analysis:** Type caching provided modest improvements (~24%) but not the expected 2-3x speedup. The remaining bottleneck is Python object creation overhead through PyO3 FFI. The C extension's advantage comes from using direct C API calls (`PyLong_FromLong`, etc.) instead of calling Python constructors. Priority 2 (Fast Paths) is now critical to achieve further gains.
448
+
**Analysis:** Type caching provided modest improvements (~24%) but not the expected 2-3x speedup. Investigation revealed that **direct API calls are already being used** for common types via PyO3's `into_py()` and `extract()` methods. The remaining bottleneck is the architectural difference: the Rust extension uses intermediate `Bson` types (Python → Bson → bytes), while the C extension writes bytes directly (Python → bytes). To reach parity, we would need to bypass the Rust `bson` crate entirely for simple documents.
449
+
450
+
#### Priority 2: Fast Paths for Common Types (MEDIUM IMPACT) ⚠️ **ALREADY IMPLEMENTED**
385
451
386
-
#### Priority 2: Fast Paths for Common Types (MEDIUM IMPACT)
452
+
**Status:** ⚠️ **Investigation complete** - Fast paths are already implemented!
387
453
388
-
**Problem:** Every type conversion has overhead even with caching
454
+
**Current Implementation:**
455
+
- Common types already use direct conversions via `into_py()` and `extract()`
| + Profiling + Tuning (projected) |**0.5x**|**0.4x**|**0.7x**|**0.6x**|**0.55x**| ⏳ TODO |
427
527
428
-
**Note:**Complex encoding will likely remain slower due to Python FFI overhead for nested structures.
528
+
**Note:**The baseline numbers were corrected after running full benchmarks (100,000 iterations). Type caching provided ~24% improvement in earlier quick tests, but the absolute performance remains at 0.21x average. Reaching 1.0x parity would require bypassing the Rust `bson` crate entirely.
-**For production:** Continue using C extension (maximum performance)
624
+
-**For exploration:** Rust extension demonstrates feasibility and correctness
625
+
-**For future:** Consider Rust for new extensions where safety > raw performance
626
+
-**For optimization:** Would require bypassing Rust `bson` crate (major refactor)
627
+
628
+
**Investigation Complete:** The Rust extension successfully demonstrates that Rust is a viable alternative to C for Python extensions, with the trade-off being development complexity vs. performance. The investigation has identified the exact architectural differences that cause the performance gap and the effort required to close it.
0 commit comments