Daily Perf Improver: Research and Plan #1560

@github-actions

FSharp.Data Performance Research and Improvement Plan

Executive Summary

FSharp.Data is a mature F# library providing type providers and data access tools for CSV, HTML, JSON, and XML. Based on an analysis of the codebase, this plan outlines performance improvement opportunities, with a focus on JSON parsing, string processing, and memory allocation.

Current Performance Infrastructure

Benchmarking Setup ✅

  • Location: tests/FSharp.Data.Benchmarks/
  • Framework: BenchmarkDotNet with memory diagnostics
  • Coverage: JSON parsing benchmarks (Simple, Nested, GitHub, Twitter, WorldBank)
  • Commands:
    dotnet run --project build/build.fsproj -- -t RunBenchmarks
    cd tests/FSharp.Data.Benchmarks && ./run-benchmarks.sh

Build & Test Infrastructure ✅

  • Build: FAKE-based build system (build/build.fs)
  • CI: GitHub Actions for Windows/Ubuntu
  • Commands: ./build.sh or dotnet run --project build/build.fsproj -- -t Build
  • Test: dotnet run --project build/build.fsproj -- -t RunTests

Performance Analysis

1. JSON Processing Bottlenecks 🎯

Primary Target: src/FSharp.Data.Json.Core/JsonValue.fs

Identified Issues:

  1. String Building: Heavy use of StringBuilder in JSON serialization (WriteTo method)
  2. Parsing Algorithm: Recursive descent parser with potential stack overhead
  3. Memory Allocations:
    • Array allocations for Record properties and Array elements
    • String interpolation and concatenation
    • Buffer management in parser state
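One way to reduce buffer churn in parser state is to rent scratch buffers from a shared pool instead of allocating a new one per string. The following is a minimal sketch, not the library's parser: `unescape` is a hypothetical helper, and it handles only a few escape sequences for illustration.

```fsharp
open System
open System.Buffers

/// Hypothetical helper (not part of FSharp.Data): decodes a few simple
/// escape sequences using a scratch buffer rented from the shared pool.
let unescape (raw: string) : string =
    // The decoded text is never longer than the raw text.
    let buffer = ArrayPool<char>.Shared.Rent(raw.Length)
    try
        let mutable written = 0
        let mutable i = 0
        while i < raw.Length do
            let c = raw.[i]
            if c = '\\' && i + 1 < raw.Length then
                buffer.[written] <-
                    match raw.[i + 1] with
                    | 'n' -> '\n'
                    | 't' -> '\t'
                    | other -> other // covers \" and \\ (no \uXXXX support here)
                i <- i + 2
            else
                buffer.[written] <- c
                i <- i + 1
            written <- written + 1
        // Only the final string is allocated; the buffer goes back to the pool.
        String(buffer, 0, written)
    finally
        ArrayPool<char>.Shared.Return(buffer)
```

Whether pooling pays off for small strings would need to be confirmed with the existing benchmarks, since renting and returning has its own cost.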

Current Benchmark Inputs (approximate sizes):

  • ParseSimpleJson: Small documents (~1KB)
  • ParseGitHubJson: Medium documents (~75KB)
  • ParseTwitterJson: Medium documents (~74KB)
  • ParseWorldBankJson: Small-medium documents (~20KB)

2. Type Provider Performance 🔍

Type providers generate code at compile time, affecting:

  • Design-time: IntelliSense responsiveness
  • Runtime: Generated type instantiation
  • Memory: Schema inference caching

3. CSV/XML/HTML Processing 📊

Similar patterns exist across:

  • src/FSharp.Data.Csv.Core/: CSV parsing and inference
  • src/FSharp.Data.Html.Core/: HTML parsing with CSS selectors
  • src/FSharp.Data.Xml.Core/: XML processing and XSD inference

Performance Goals & Roadmap

Round 1: JSON Parsing Optimization (Quick Wins) 🚀

Target: 15-30% improvement in JSON parsing speed, 10-20% memory reduction

  1. JsonValue Parser Optimizations:

    • Replace StringBuilder with span-based approaches
    • Optimize number parsing with span methods
    • Implement object/array pooling for common sizes
    • Cache decoded strings in hot paths
  2. Memory Management:

    • Pre-size collections based on content hints
    • Reduce intermediate allocations
    • Use ReadOnlySpan<char> for tokenization
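As an illustration of the span-based number parsing mentioned above, the sketch below parses a numeric token directly from a slice of the input without creating a substring. `tryParseNumber` is a hypothetical name, and the span overload of `Double.TryParse` assumes a target of .NET Core 2.1+ or .NET Standard 2.1, which may conflict with the compatibility constraint listed under risks.

```fsharp
open System
open System.Globalization

/// Hypothetical helper (not part of FSharp.Data): parses the number token
/// located at text.[start .. start + length - 1] without allocating a substring.
let tryParseNumber (text: string) (start: int) (length: int) : float option =
    // AsSpan creates a view over the original string; no characters are copied.
    let token = text.AsSpan(start, length)
    let mutable value = 0.0
    if Double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, &value) then
        Some value
    else
        None
```

For example, `tryParseNumber "[12.5,3]" 1 4` reads the token `12.5` in place. A real tokenizer would track `start` and `length` while scanning rather than receive them as arguments.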

Round 2: Advanced JSON Performance (Medium Impact) ⚡

Target: 30-50% total improvement over baseline

  1. Parser Architecture:

    • Implement single-pass parsing with minimal backtracking
    • Add SIMD acceleration for string operations where available
    • Optimize UTF-8 vs UTF-16 handling
  2. Serialization Optimizations:

    • Buffer writer patterns for JSON output
    • Streaming serialization for large objects
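The streaming idea can be sketched as writing each node straight to a `TextWriter` rather than building intermediate strings for containers. The `Json` type below is a simplified stand-in defined only for this example, not the library's `JsonValue`, and the escaping is deliberately incomplete.

```fsharp
open System.Globalization
open System.IO

// Simplified stand-in for a JSON tree; NOT the library's JsonValue type.
type Json =
    | JNull
    | JBool of bool
    | JNumber of decimal
    | JString of string
    | JArray of Json[]
    | JRecord of (string * Json)[]

// Writes a quoted string, escaping only quotes, backslashes, and newlines.
// A complete encoder must also handle the remaining control characters.
let writeString (w: TextWriter) (s: string) =
    w.Write('"')
    for c in s do
        match c with
        | '"' -> w.Write("\\\"")
        | '\\' -> w.Write("\\\\")
        | '\n' -> w.Write("\\n")
        | _ -> w.Write(c)
    w.Write('"')

// Streams the tree to the writer one token at a time.
let rec writeJson (w: TextWriter) (json: Json) =
    match json with
    | JNull -> w.Write("null")
    | JBool true -> w.Write("true")
    | JBool false -> w.Write("false")
    | JNumber n -> w.Write(n.ToString(CultureInfo.InvariantCulture))
    | JString s -> writeString w s
    | JArray items ->
        w.Write('[')
        items |> Array.iteri (fun i item ->
            if i > 0 then w.Write(',')
            writeJson w item)
        w.Write(']')
    | JRecord props ->
        w.Write('{')
        props |> Array.iteri (fun i (name, value) ->
            if i > 0 then w.Write(',')
            writeString w name
            w.Write(':')
            writeJson w value)
        w.Write('}')

// Usage: render to a string for small documents, or pass a StreamWriter
// to avoid holding the whole output in memory.
let render (json: Json) =
    use w = new StringWriter()
    writeJson w json
    w.ToString()
```

Because the writer is an abstraction, the same function serves both in-memory and file or network output; only the caller decides where the bytes go.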

Round 3: Ecosystem-Wide Optimizations (Long Term) 🎯

Target: Comprehensive performance improvements

  1. CSV Performance: Optimize delimiter detection and field parsing
  2. HTML Performance: Improve CSS selector performance and DOM traversal
  3. Type Provider Efficiency: Cache schema inference results
  4. HTTP Performance: Connection pooling and request optimization

Technical Implementation Strategy

Benchmarking Workflow 📈

  1. Baseline Measurement: Run existing benchmarks to establish baseline
  2. Incremental Testing: Validate each optimization with A/B comparisons
  3. Regression Detection: Ensure no performance regressions in unchanged code paths
  4. Memory Profiling: Use dotMemory/PerfView for allocation analysis

Development Environment Setup ⚙️

# Standard build
./build.sh

# Run benchmarks  
dotnet run --project build/build.fsproj -- -t RunBenchmarks

# Quick performance validation
cd tests/FSharp.Data.Benchmarks
./run-benchmarks.sh quick

# Development iteration
./run-benchmarks.sh simple  # Simple + Nested JSON only

Performance Validation Process 🔬

  1. Micro-benchmarks: BenchmarkDotNet for specific operations
  2. Macro-benchmarks: Real-world JSON document processing
  3. Stress Testing: Large document handling (>10MB JSON files)
  4. Memory Analysis: Allocation patterns and GC pressure

Success Metrics 📊

Primary KPIs:

  • JSON Parse Speed: 20-40% improvement in ops/second
  • Memory Usage: 15-25% reduction in allocations
  • Latency: Lower p95 parsing times for medium documents
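When comparing runs against these targets, it helps to be explicit about how the percentages are derived from benchmark output. The helpers below are a small sketch with hypothetical names; the inputs are assumed to be the mean time and allocated bytes per operation as reported by the benchmark tool.

```fsharp
/// Percentage gain in throughput (ops/second), given mean time per operation.
/// Throughput is the inverse of mean time, so the ratio is baseline / candidate.
let throughputGainPercent (baselineMean: float) (candidateMean: float) =
    (baselineMean / candidateMean - 1.0) * 100.0

/// Percentage reduction in allocated bytes per operation.
let allocationReductionPercent (baselineBytes: float) (candidateBytes: float) =
    (1.0 - candidateBytes / baselineBytes) * 100.0
```

With illustrative numbers, a mean time that drops from 100 to 80 units is a 20% reduction in time but a 25% gain in ops/second, so results should state which of the two is being reported.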

Secondary Metrics:

  • Build Time: No regression in compilation speed
  • Test Suite: All existing tests continue to pass
  • API Compatibility: Zero breaking changes to public API

Risk Assessment & Mitigation 🛡️

High Risk:

  • Breaking Changes: Maintain backward compatibility
  • Correctness: Extensive testing of edge cases
  • Platform Dependencies: Keep .NET Standard compatibility

Mitigation Strategy:

  • Feature Flags: Allow fallback to original implementations
  • Extensive Testing: Leverage existing comprehensive test suite
  • Incremental Rollout: Small, verifiable changes

Next Steps 🚀

  1. Environment Setup: Validate benchmarking infrastructure
  2. Baseline Establishment: Run full benchmark suite and document results
  3. Low-Hanging Fruit: Start with JSON StringBuilder optimizations
  4. Iterative Improvement: Implement, measure, and validate each optimization

Resources & Documentation 📚

  • Benchmark Results: Store in tests/FSharp.Data.Benchmarks/BenchmarkDotNet.Artifacts/
  • Performance Guide: Document optimization techniques for contributors
  • Profiling Data: Use dotTrace/dotMemory for detailed analysis

AI-generated content by Daily Perf Improver may contain mistakes.
