Hardcoded double in generic <T> classes — ~4,563 instances across ~4,665 files break generic type contract #932

@ooples

Summary

The codebase declares generic type parameters <T> on classes but uses hardcoded double internally, breaking the generic contract. Users who instantiate these classes with float, decimal, or other numeric types will get silent precision loss, type conversion failures, or incorrect results.

Scale: ~4,563 instances of hardcoded double types across ~4,665 of the 5,552 generic <T> class files in src/.

The Problem

What's happening

Classes are declared as generic <T> (accepting float, double, decimal, etc.) but internally use double for:

  • Field declarations (Vector<double>, Matrix<double>, double[])
  • Local variables and intermediate computations
  • Return types from helper methods
  • Parameter types in private methods

Why it matters

  1. Silent precision loss: A user instantiating SuperLearner<decimal> for financial ML gets their high-precision data silently reduced to double precision internally (roughly 15 to 17 significant digits instead of decimal's 28 to 29)
  2. Type conversion failures: The NumOps.ToDouble() → compute in double → NumOps.FromDouble() round trip loses data on types wider than double
  3. Broken generic contract: The <T> parameter is a lie — the class doesn't actually operate in type T
  4. Inconsistent behavior: Some code paths use T correctly while others use double, causing subtle inconsistencies
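
The first two items can be reproduced in isolation. The following is a minimal sketch, not AiDotNet code: RoundTripDemo and ThroughDouble are hypothetical names, and the pair of casts only stands in for the NumOps.ToDouble() / NumOps.FromDouble() round trip, assuming those helpers behave like plain conversions for T = decimal.

```csharp
using System;

public static class RoundTripDemo
{
    // Stand-in for the ToDouble -> compute in double -> FromDouble path when T = decimal.
    // (Hypothetical helper; the real NumOps methods are assumed to convert the same way.)
    public static decimal ThroughDouble(decimal value) => (decimal)(double)value;

    public static void Main()
    {
        // 29 significant digits: representable in decimal, not in double.
        decimal original = 1234567890.1234567890123456789m;
        decimal roundTripped = ThroughDouble(original);

        Console.WriteLine(original);
        Console.WriteLine(roundTripped);
        // Prints False: the digits beyond double precision do not survive the round trip.
        Console.WriteLine(original == roundTripped);
    }
}
```

Values that double can represent exactly, such as 0.5, survive the round trip unchanged, which is part of why the loss goes unnoticed in casual testing.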

Concrete Examples

Example 1: Regression/SuperLearner.cs (38 instances)

// Class declared as generic <T>
public class SuperLearner<T> : NonLinearRegressionBase<T>
{
    // BUT internal state uses hardcoded double!
    private Vector<double>? _cvPerformance;    // Line 62 — should be Vector<T>
    private Vector<double>? _predMeans;         // Line 67 — should be Vector<T>
    private Vector<double>? _predStds;          // Line 72 — should be Vector<T>

    // Internal computations hardcoded to double
    var metaFeatures = new Matrix<double>(n, numModels);  // Line 132
    var yData = new Vector<double>(n);                     // Line 133
    double foldMse = 0;                                    // Line 165

    // Return types hardcoded
    public Vector<double> GetCVPerformance()               // Line 264 — should be Vector<T>
    public Vector<double> GetModelContributions()           // Line 273 — should be Vector<T>
}

Example 2: Regression/MixedEffectsModel.cs (40 instances)

public class MixedEffectsModel<T> : NonLinearRegressionBase<T>
{
    // Hardcoded double in conversions
    var yData = new Vector<double>(y.Length);              // Line 138
    var beta = new Vector<double>(_numFeatures + 1);      // Line 149
    
    // Return type hardcoded
    public double ComputeICC()                             // Line 330 — should be T
    public double GetLogLikelihood(...)                     // Line 352 — should be T
    
    // Private helpers all use double
    private Matrix<double> InitializeRandomEffectVariance(int dim)  // Line 421
    private Dictionary<int, Vector<double>> ComputeBLUPs(...)       // Line 481
}

Example 3: Preprocessing/FeatureSelection/ (1,512 instances in 100+ files)

This directory is the worst offender. Its statistical computations are done entirely in double:

// GenericUnivariateSelect.cs
private double[]? _scores;                                 // Line 48
private double[]? _pValues;                                // Line 49
public double[]? Scores => _scores;                        // Line 55

private (double[] Scores, double[] PValues) ComputeFClassif(  // Line 240
    Matrix<T> data, Vector<T> target, int n, int p)
{
    var scores = new double[p];                            // Line 242
    var pValues = new double[p];                           // Line 243
    double overallMean = 0;                                // Line 258
    double ssb = 0, ssw = 0;                               // Line 263
}

Top files in Preprocessing/FeatureSelection/:

| File | double count |
| --- | --- |
| Filter/Univariate/SelectPercentile.cs | 69 |
| Filter/Univariate/SelectKBest.cs | 68 |
| Helpers/StatisticalTestHelper.cs | 60 |
| SelectPercentile.cs | 54 |
| Bioinformatics/VolcanoPlotSelector.cs | 52 |
| Causal/FCI_Selector.cs | 51 |

Affected Directories (by instance count)

| Directory | Instances | Description |
| --- | --- | --- |
| Preprocessing/ | 1,865 | Feature selection, time series transforms, scalers |
| AiDotNet.Playground/ | 212 | Example service (may be acceptable here) |
| MetaLearning/ | 197 | Meta-learning algorithms |
| Finance/ | 192 | Financial forecasting (precision critical!) |
| AnomalyDetection/ | 156 | Anomaly detection algorithms |
| Regression/ | 154 | Regression models |
| FederatedLearning/ | 153 | Federated training |
| Clustering/ | 120 | Clustering algorithms |
| NeuralNetworks/ | 115 | Neural network layers |
| Data/ | 93 | Data loading |
| TextToSpeech/ | 90 | TTS models |
| Classification/ | 79 | Classification models |
| Evaluation/ | 77 | Model evaluation |
| Audio/ | 63 | Audio processing |
| ComputerVision/ | 58 | Vision models |

The Correct Pattern

AiDotNet already has the right infrastructure — INumericOperations<T> — it just isn't being used consistently:

// ❌ WRONG: Hardcoded double
private Vector<double>? _cvPerformance;
double foldMse = 0;
double diff = yData[valIdx[i]] - NumOps.ToDouble(predictions[i]);

// ✅ CORRECT: Use generic T with NumericOperations
private Vector<T>? _cvPerformance;
T foldMse = NumOps.Zero;
T diff = NumOps.Subtract(yData[valIdx[i]], predictions[i]);

For statistical computations where double is genuinely needed (e.g., p-values, F-statistics):

// ✅ ACCEPTABLE: Using double for well-defined statistical outputs
// that are always in double regardless of model precision
public double[] PValues => _pValues;  // p-values are always double

// ✅ CORRECT: Convert at boundaries, compute in T internally
T featureScore = ComputeScore(data, target);  // Internal: use T
double pValue = ComputePValue(NumOps.ToDouble(featureScore));  // Boundary: convert to double for stats
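
To make the pattern concrete, here is a self-contained sketch of a mean squared error computed entirely in T. INumOps<T>, DecimalOps, and GenericMetrics are hypothetical names invented for this example; INumOps<T> is only a minimal stand-in for INumericOperations<T>, whose real member list may differ.

```csharp
using System;

// Hypothetical, minimal stand-in for AiDotNet's INumericOperations<T>.
// Only the members this sketch needs; the real interface may differ.
public interface INumOps<T>
{
    T Zero { get; }
    T Add(T a, T b);
    T Subtract(T a, T b);
    T Multiply(T a, T b);
    T Divide(T a, T b);
    T FromInt(int value);
}

// Example implementation for decimal, so no value ever passes through double.
public sealed class DecimalOps : INumOps<decimal>
{
    public decimal Zero => 0m;
    public decimal Add(decimal a, decimal b) => a + b;
    public decimal Subtract(decimal a, decimal b) => a - b;
    public decimal Multiply(decimal a, decimal b) => a * b;
    public decimal Divide(decimal a, decimal b) => a / b;
    public decimal FromInt(int value) => value;
}

public static class GenericMetrics
{
    // Mean squared error with every intermediate value kept in T.
    public static T MeanSquaredError<T>(T[] actual, T[] predicted, INumOps<T> ops)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Inputs must be non-empty and of equal length.");

        T sum = ops.Zero;
        for (int i = 0; i < actual.Length; i++)
        {
            T diff = ops.Subtract(actual[i], predicted[i]);
            sum = ops.Add(sum, ops.Multiply(diff, diff));
        }
        return ops.Divide(sum, ops.FromInt(actual.Length));
    }
}
```

For example, GenericMetrics.MeanSquaredError(new[] { 1.0m, 2.0m }, new[] { 1.5m, 2.5m }, new DecimalOps()) returns 0.25 as a decimal. A conversion to double would then happen only at a reporting or statistics boundary, as in the acceptable cases above.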

Proposed Fix Strategy

Phase 1: Audit and Categorize (~1 day)

  1. Categorize each double usage as:
    • Must fix: Fields, return types, parameters that should be T
    • Acceptable: Statistical outputs, p-values, probability values that are inherently double
    • Boundary: Conversions at I/O boundaries (logging, display, serialization)
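
A first pass of the audit could be scripted. The sketch below is an assumption about how such a count might be produced, not the tool behind the numbers in this issue: DoubleAudit is a hypothetical name, and the regex heuristic also counts double inside comments and strings, so its output is a starting list for manual categorization rather than a final tally.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class DoubleAudit
{
    // Matches a class declaration with at least one type parameter, e.g. "class Foo<T>".
    private static readonly Regex GenericClass = new Regex(@"\bclass\s+\w+\s*<[^>]+>");

    // Matches the keyword "double" as a whole word (case-sensitive, so "ToDouble" is skipped).
    private static readonly Regex DoubleKeyword = new Regex(@"\bdouble\b");

    // Returns the number of "double" tokens in a source text that declares a generic class,
    // or 0 when no generic class declaration is found.
    public static int CountHardcodedDoubles(string source)
    {
        if (!GenericClass.IsMatch(source)) return 0;
        return DoubleKeyword.Matches(source).Count;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: DoubleAudit <src-directory>");
            return;
        }

        // Print "count<TAB>path" for every C# file with at least one hit.
        foreach (string path in Directory.EnumerateFiles(args[0], "*.cs", SearchOption.AllDirectories))
        {
            int count = CountHardcodedDoubles(File.ReadAllText(path));
            if (count > 0) Console.WriteLine($"{count}\t{path}");
        }
    }
}
```

Sorting the output by count gives a per-file worklist that can then be tagged as must fix, acceptable, or boundary.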

Phase 2: Fix by Module (incremental PRs)

Priority order based on impact and user visibility:

  1. Regression/ — Core regression models (user-facing)
  2. Classification/ — Core classification models (user-facing)
  3. Preprocessing/ — Feature selection and transforms (affects all pipelines)
  4. MetaLearning/ — Meta-learning algorithms
  5. Clustering/ — Clustering algorithms
  6. AnomalyDetection/ — Anomaly detection
  7. Remaining directories

Phase 3: Add Roslyn Analyzer

Create a custom Roslyn analyzer that flags double usage inside <T> generic classes, so that the pattern is not reintroduced after the fixes land.
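
As a rough starting point, such an analyzer might look like the sketch below. It assumes a reference to the Microsoft.CodeAnalysis.CSharp package; the class name HardcodedDoubleAnalyzer and the diagnostic ID AIDN0001 are hypothetical, and the rule is deliberately naive in that it flags every double keyword inside a class that declares type parameters.

```csharp
using System.Collections.Immutable;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public sealed class HardcodedDoubleAnalyzer : DiagnosticAnalyzer
{
    // "AIDN0001" is a hypothetical diagnostic ID chosen for this sketch.
    private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        id: "AIDN0001",
        title: "Hardcoded double in generic class",
        messageFormat: "Generic class '{0}' uses 'double' instead of its type parameter",
        category: "Design",
        defaultSeverity: DiagnosticSeverity.Warning,
        isEnabledByDefault: true);

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
        => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
        context.EnableConcurrentExecution();
        // Visit every predefined type keyword (int, double, string, ...).
        context.RegisterSyntaxNodeAction(AnalyzeType, SyntaxKind.PredefinedType);
    }

    // True when the node is the keyword 'double' and some enclosing class declares type parameters.
    public static bool IsDoubleInGenericClass(PredefinedTypeSyntax node, out string className)
    {
        className = string.Empty;
        if (!node.Keyword.IsKind(SyntaxKind.DoubleKeyword)) return false;

        foreach (SyntaxNode ancestor in node.Ancestors())
        {
            if (ancestor is ClassDeclarationSyntax cls && cls.TypeParameterList != null)
            {
                className = cls.Identifier.Text;
                return true;
            }
        }
        return false;
    }

    private static void AnalyzeType(SyntaxNodeAnalysisContext context)
    {
        var node = (PredefinedTypeSyntax)context.Node;
        if (IsDoubleInGenericClass(node, out string className))
            context.ReportDiagnostic(Diagnostic.Create(Rule, node.GetLocation(), className));
    }
}
```

Usages categorized as acceptable or boundary in Phase 1 (p-values, logging, serialization) would still trigger this rule, so a real version would need an opt-out such as #pragma warning disable or a marker attribute.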

Impact

  • Severity: Medium-High (silent data corruption for non-double types)
  • Probability: High for any user using float or decimal (100% of code paths affected)
  • Risk of fix: Medium (incremental module-by-module approach limits blast radius)
