Author: Prabhu Status: Final
- Executive Overview
- Business Context and Objectives
- Data Description
- Methodology
- Model Validation
- Limitations and Assumptions
- Model Monitoring and Governance
- Appendix
This document describes the development and validation of a CECL (Current Expected Credit Losses) credit risk model for estimating Expected Credit Losses (ECL) on a loan portfolio with a focus on agricultural-style lending segments.
| Component | Description |
|---|---|
| Model Type | Credit Risk - CECL ECL Estimation |
| Target Variable | Binary Default (0/1) |
| Primary Use | Allowance calculation, stress testing, risk management |
| Portfolio Scope | Consumer loans with agricultural segment proxy |
| Key Outputs | PD, LGD, EAD, ECL at loan and portfolio level |
- PD Model Performance: AUC > 0.65 on out-of-time test set
- Portfolio ECL: Loan-level ECL is calculated as PD x LGD x EAD and summed across the portfolio; the ECL rate is total ECL divided by total EAD
- Agricultural Segment: Higher risk profile with elevated default rates
- Stress Testing: ECL increases 25-75% under adverse scenarios
The Current Expected Credit Loss (CECL) standard (ASC 326) requires financial institutions to estimate and record lifetime expected credit losses at loan origination. This represents a shift from the previous incurred loss model to a forward-looking approach.
Key CECL Requirements:
- Estimate lifetime expected losses at origination
- Incorporate reasonable and supportable forecasts
- Consider historical loss experience
- Account for current economic conditions
- Develop PD Model: Build a robust probability of default model using machine learning techniques
- Estimate ECL Components: Calculate PD, LGD, and EAD for each loan
- Agricultural Portfolio Analysis: Create and analyze a proxy agricultural lending segment
- Stress Testing: Measure portfolio vulnerability under adverse economic scenarios
- Documentation: Produce regulatory-quality model documentation
- CECL Allowance Calculation: Primary allowance estimation
- Portfolio Risk Assessment: Segment-level risk analysis
- Stress Testing: Scenario analysis for capital planning
- Risk Appetite Monitoring: Early warning indicators
- Strategic Planning: Portfolio composition decisions
| Attribute | Description |
|---|---|
| Source | Zenodo - Lending Club Granting Model Dataset |
| Original Provider | Lending Club (P2P lending platform) |
| Time Period | 2007-2018 loan vintages |
| Sample Size | ~1.3 million loans |
| Target Variable | Binary default indicator |
| Variable | Description | Role |
|---|---|---|
| `loan_amnt` | Loan amount ($) | EAD proxy |
| `revenue` | Borrower annual income ($) | Risk driver |
| `dti_n` | Debt-to-income ratio | Risk driver |
| `fico_n` | FICO credit score | Primary risk driver |
| Variable | Description | Role |
|---|---|---|
| `purpose` | Loan purpose | Segmentation, risk driver |
| `home_ownership_n` | Home ownership status | Risk driver |
| `emp_length` | Employment length | Risk driver |
| `addr_state` | Borrower state | Geographic segmentation |
| Variable | Description | Values |
|---|---|---|
| `Default` | Default indicator | 0 = Non-default, 1 = Default |
- Invalid Loan Amounts: Removed loans with amount ≤ 0
- FICO Score Validation: Filtered to valid range (300-850)
- Income Validation: Removed negative/zero income, capped extreme values
- DTI Validation: Removed negative DTI, capped at 100
- Missing Values: Dropped rows with missing key features
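The cleaning rules above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual `data_processing.py`: the function name `clean_loans`, the list of key columns, and the 99th-percentile income cap are assumptions (the document does not state the cap used for extreme incomes). Only the column names and the FICO and DTI limits come from the text.

```python
import pandas as pd

# Assumed set of "key features"; the column names come from the variable tables.
KEY_FEATURES = ["loan_amnt", "fico_n", "revenue", "dti_n"]


def clean_loans(df: pd.DataFrame, income_cap_quantile: float = 0.99) -> pd.DataFrame:
    """Apply the documented validation rules to a raw loan table."""
    out = df.dropna(subset=KEY_FEATURES).copy()   # drop rows missing key features
    out = out[out["loan_amnt"] > 0]               # remove loan amounts <= 0
    out = out[out["fico_n"].between(300, 850)]    # keep the valid FICO range
    out = out[out["revenue"] > 0]                 # remove zero or negative income
    out = out[out["dti_n"] >= 0]                  # remove negative DTI
    # Cap extremes: the income quantile is an assumed choice, the DTI cap of 100 is documented.
    income_cap = out["revenue"].quantile(income_cap_quantile)
    out["revenue"] = out["revenue"].clip(upper=income_cap)
    out["dti_n"] = out["dti_n"].clip(upper=100)
    return out.reset_index(drop=True)
```

Filtering before capping means the income cap is computed only on rows that already passed validation, so invalid records do not distort the quantile.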
| Metric | Value |
|---|---|
| Records after cleaning | ~1.2 million |
| Missing value rate | < 1% (after cleaning) |
| Duplicate records | 0 |
Since the dataset contains consumer P2P loans rather than true agricultural loans, we constructed a proxy agricultural segment:
Selection Criteria:
- Purpose: `small_business` loans (proxy for agricultural business lending)
- Geography: Top 10 agricultural states by USDA farm output (CA, IA, NE, TX, MN, IL, KS, WI, IN, NC)
Rationale:
- Small business loans approximate productive/commercial lending
- Agricultural state filter captures geographic risk factors
- Combined criteria create a reasonable agricultural lending proxy
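A minimal sketch of the proxy flag, assuming both criteria must hold at once (the text says "combined criteria") and that the flag is stored in a column named `is_agri`. The function name and the flag column name are hypothetical; `purpose` and `addr_state` are the columns listed in the variable tables.

```python
import pandas as pd

# Top 10 agricultural states listed in the selection criteria
AGRI_STATES = {"CA", "IA", "NE", "TX", "MN", "IL", "KS", "WI", "IN", "NC"}


def add_agri_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Flag loans that meet both proxy criteria: purpose and geography."""
    out = df.copy()
    is_small_business = out["purpose"] == "small_business"
    in_agri_state = out["addr_state"].isin(AGRI_STATES)
    # `is_agri` is a hypothetical column name used for illustration
    out["is_agri"] = (is_small_business & in_agri_state).astype(int)
    return out
```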
The ECL estimation follows the standard credit risk formula:

$$ECL_i = PD_i \times LGD_i \times EAD_i$$

Where:
- $PD_i$ = Probability of Default for loan $i$
- $LGD_i$ = Loss Given Default for loan $i$
- $EAD_i$ = Exposure at Default for loan $i$
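The loan-level product of PD, LGD, and EAD, and its aggregation to a portfolio total, can be sketched in plain Python. The function names and the example inputs are illustrative assumptions, and defining the ECL rate as total ECL divided by total EAD is a common convention rather than something the document states.

```python
from typing import Iterable, Tuple


def loan_ecl(pd_i: float, lgd_i: float, ead_i: float) -> float:
    """Expected credit loss for one loan: PD x LGD x EAD."""
    return pd_i * lgd_i * ead_i


def portfolio_ecl(loans: Iterable[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Return (total ECL, ECL rate), where the rate is total ECL / total EAD."""
    loans = list(loans)
    total_ecl = sum(loan_ecl(p, l, e) for p, l, e in loans)
    total_ead = sum(e for _, _, e in loans)
    return total_ecl, (total_ecl / total_ead if total_ead else 0.0)


# Illustrative inputs only: (PD, LGD, EAD) per loan
example = [(0.10, 0.45, 10_000.0), (0.20, 0.45, 5_000.0)]
total, rate = portfolio_ecl(example)
print(f"Total ECL: {total:,.2f}  ECL rate: {rate:.2%}")
```

Because the rate is weighted by exposure, a large loan with a low PD can dominate the portfolio figure even when most loans are riskier.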
Two models were developed and compared:
| Model | Type | Key Parameters |
|---|---|---|
| Baseline | Logistic Regression | class_weight='balanced' |
| Advanced | XGBoost/RandomForest | n_estimators=200, max_depth=6 |
Derived Features:
- `loan_to_income`: Loan amount / Annual income
- `income_per_dti`: Income / DTI ratio
- `fico_bucket_num`: Numeric FICO bucket encoding
- `emp_length_num`: Numeric employment length
Categorical Encoding:
- One-hot encoding for `purpose` and `home_ownership`
- Top 10 purposes retained, others grouped as "other"
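The derived features and encoding steps might look like the sketch below. Several details are assumptions: the FICO bucket edges are not documented, so the ones here are placeholders; a DTI of zero is treated as missing to avoid division by zero; and `emp_length_num` is omitted because the raw format of `emp_length` is not described. The function name `engineer_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame, top_n_purposes: int = 10) -> pd.DataFrame:
    """Add ratio features, a numeric FICO bucket, and one-hot encoded categoricals."""
    out = df.copy()
    out["loan_to_income"] = out["loan_amnt"] / out["revenue"]
    # Assumed handling: a DTI of 0 becomes NaN instead of producing an infinite ratio
    out["income_per_dti"] = out["revenue"] / out["dti_n"].replace(0, np.nan)
    # Assumed bucket edges; the original boundaries are not documented
    edges = [300, 580, 670, 740, 800, 850]
    out["fico_bucket_num"] = pd.cut(
        out["fico_n"], bins=edges, labels=False, include_lowest=True
    )
    # Keep the most frequent purposes and group the rest as "other"
    top = out["purpose"].value_counts().nlargest(top_n_purposes).index
    out["purpose"] = out["purpose"].where(out["purpose"].isin(top), "other")
    return pd.get_dummies(out, columns=["purpose", "home_ownership_n"])
```

In an out-of-time setup, the list of retained purposes would normally be derived from the training vintages only and then reused on the test set, so the two feature matrices share the same columns.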
- Method: Time-based split (out-of-time validation)
- Training: Earlier loan vintages (~70%)
- Test: Later loan vintages (~30%)
- Rationale: Simulates real-world model deployment
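A time-based split can be sketched as below. The date column name `issue_d` is an assumption, since the document does not name the vintage field, and `time_based_split` is a hypothetical helper rather than the project's own function.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, date_col: str = "issue_d", train_frac: float = 0.70):
    """Split chronologically: the earliest ~train_frac of loans train, the rest test."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cutoff = int(round(len(ordered) * train_frac))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]
```

This version cuts on a row count, so loans sharing the boundary date can land on both sides. Cutting on a calendar date instead keeps each vintage whole, at the cost of a split that is only approximately 70/30.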
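Logistic Regression:

```python
from sklearn.linear_model import LogisticRegression

# Baseline PD model; balanced class weights offset the low default rate
LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    solver='lbfgs'
)
```

XGBoost:

```python
from xgboost import XGBClassifier

# calculated_ratio is computed from the training data before fitting
# (typically the number of non-defaults divided by the number of defaults)
XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=calculated_ratio
)
```

Fixed LGD assumption based on regulatory guidance: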
- Basel II Foundation IRB: 45% LGD for senior unsecured claims
- Industry Practice: Consumer unsecured loans typically assume 40-60% LGD
- Conservative Approach: Suitable for CECL provisioning
- Recovery Rate = 1 - LGD = 55%
- Represents expected recovery through collections/settlements
Conservative EAD estimation using loan amount:
- Term loans with no revolving component
- Conservative approach for loss estimation
- Appropriate for CECL lifetime loss perspective
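As a worked example of how the fixed LGD and the loan-amount EAD combine, consider a hypothetical loan of $10,000 with a model PD of 8% (both figures are illustrative, not taken from the dataset):

$$
EAD_i = 10{,}000, \qquad LGD_i = 0.45, \qquad PD_i = 0.08
$$

$$
ECL_i = PD_i \times LGD_i \times EAD_i = 0.08 \times 0.45 \times 10{,}000 = 360
$$

If the loan does default, the 55% recovery rate implies an expected recovery of $0.55 \times 10{,}000 = 5{,}500$ and a loss of $4{,}500$. The ECL of $360 is that loss weighted by the 8% chance of default.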
| Scenario | Income Change | DTI Change | Description |
|---|---|---|---|
| Baseline | 0% | 0% | Current conditions |
| Moderate Stress | -10% | +15% | Economic slowdown |
| Severe Stress | -20% | +30% | Recession |
| Agricultural Crisis | -25% | +35% | Sector-specific downturn |
- Apply stress adjustments to income and DTI
- Calculate PD multiplier based on stress severity
- Recompute ECL with stressed PD
- Compare against baseline
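The steps above can be sketched as follows. The document does not give the formula for the PD multiplier, so the rule used here (one plus the size of the income drop plus the DTI increase) is purely a placeholder assumption. The sketch also uses the income and DTI shocks only to derive the multiplier instead of re-scoring stressed features through the PD model. The scenario values and the 45% LGD come from the text; the function names and example loans are hypothetical.

```python
SCENARIOS = {
    # name: (income change, DTI change), taken from the scenario table
    "Baseline": (0.00, 0.00),
    "Moderate Stress": (-0.10, 0.15),
    "Severe Stress": (-0.20, 0.30),
    "Agricultural Crisis": (-0.25, 0.35),
}


def pd_multiplier(income_change: float, dti_change: float) -> float:
    """Hypothetical severity rule: 1 + |income drop| + DTI increase."""
    return 1.0 + abs(income_change) + dti_change


def stressed_ecl(loans, scenario: str, lgd: float = 0.45) -> float:
    """Recompute portfolio ECL with PDs scaled by the scenario multiplier, capped at 1."""
    mult = pd_multiplier(*SCENARIOS[scenario])
    return sum(min(p * mult, 1.0) * lgd * ead for p, ead in loans)


loans = [(0.10, 10_000.0), (0.20, 5_000.0)]  # illustrative (PD, EAD) pairs
baseline = stressed_ecl(loans, "Baseline")
for name in SCENARIOS:
    change = stressed_ecl(loans, name) / baseline - 1.0
    print(f"{name}: ECL change vs baseline = {change:+.0%}")
```

Capping the stressed PD at 1 matters for high-risk loans: without it, a large multiplier could produce a probability above 100% and overstate the loss.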
| Metric | Logistic Regression | XGBoost/RF |
|---|---|---|
| Training AUC | 0.67-0.70 | 0.72-0.78 |
| Test AUC | 0.65-0.68 | 0.68-0.73 |
| Gini Coefficient | 0.30-0.36 | 0.36-0.46 |
- Calibration curves show reasonable alignment between predicted and actual default rates
- Some deviation at extreme probability ranges (typical for imbalanced data)
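To make the discrimination metrics concrete, the sketch below computes AUC directly from its pairwise definition and derives Gini as 2 x AUC - 1. It is a teaching implementation with quadratic cost, not the project's validation code; on the full dataset a library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice. The function names are hypothetical.

```python
from typing import Sequence


def auc_score(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of (defaulter, non-defaulter) pairs in which the
    defaulter has the higher score; ties count one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one default and one non-default")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(auc: float) -> float:
    """Gini coefficient: 2 x AUC - 1."""
    return 2.0 * auc - 1.0
```

Under this relation, the reported test AUC range of 0.68-0.73 for the advanced model corresponds to the Gini range of 0.36-0.46 shown in the table.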
Top Risk Drivers:
- FICO Score (strongest negative relationship with default)
- DTI (positive relationship with default)
- Loan Purpose (especially small_business)
- Income Level
- Loan Amount
The advanced model (XGBoost/RandomForest) was selected based on:
- Higher test AUC
- Better discrimination across risk segments
- Appropriate calibration for ECL purposes
- Time-based split provides out-of-time validation
- Model performs consistently across loan vintages
- Actual vs predicted default rates align within acceptable tolerance
- Proxy Portfolio: Agricultural segment is approximated from consumer loan data
- Historical Data: Model based on 2007-2018 data; economic conditions may differ
- Single Platform: Data from one P2P lender may not generalize
- No Recovery Data: LGD based on assumption rather than actual recoveries
- LGD Assumption: Fixed 45% may not reflect actual loss severity
- EAD Assumption: Full loan amount may overstate exposure for partially repaid loans
- Stress Scenarios: Hypothetical scenarios based on historical recession patterns
- Independence: Assumes loan defaults are conditionally independent
| Risk Category | Description | Mitigation |
|---|---|---|
| Data Quality | Errors in source data | Data validation, cleaning |
| Model Specification | Incorrect functional form | Multiple model comparison |
| Parameter Uncertainty | Estimation error | Cross-validation, confidence intervals |
| Regime Change | Economic conditions differ from training | Stress testing, monitoring |
| Metric | Frequency | Threshold |
|---|---|---|
| AUC-ROC | Monthly | > 0.60 |
| Population Stability Index (PSI) | Monthly | < 0.25 |
| Actual vs Predicted Default Rate | Quarterly | Within 20% |
| FICO Distribution Shift | Monthly | PSI < 0.10 |
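The PSI values in the table can be computed with a sketch like the one below, which compares the binned distribution of a development sample with a current sample. The bin edges are supplied by the caller, values outside the edges are ignored, and the small floor for empty bins is a common convention assumed here rather than something the document specifies. The function name `psi` is hypothetical.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are [low, high); the last bin also includes its upper edge
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts) or 1
        # Floor avoids log(0) when a bin is empty (assumed convention)
        return [max(c / total, 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

For example, `psi(dev_fico, current_fico, [300, 580, 670, 740, 800, 850])` would be compared against the 0.10 threshold for the FICO distribution, where `dev_fico` and `current_fico` are placeholder names for the two samples.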
- Significant shift in feature distributions
- Degradation in discrimination metrics
- Systematic over/under prediction of defaults
- Changes in portfolio composition
| Review Type | Frequency | Scope |
|---|---|---|
| Performance Monitoring | Monthly | Metrics tracking |
| Calibration Review | Quarterly | Predicted vs actual |
| Full Model Review | Annual | Complete revalidation |
| Stress Testing | Annual | Scenario updates |
Triggers for Model Update:
- AUC drops below 0.60
- PSI exceeds 0.25
- Significant regulatory changes
- Material portfolio changes
- Economic regime shifts
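The quantitative triggers can be checked automatically, leaving the qualitative ones (regulatory, portfolio, and regime changes) to manual review. The sketch below is one possible shape for such a check. The class and function names are hypothetical, and reading "Within 20%" as a relative deviation from the predicted default rate is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringSnapshot:
    auc: float                     # AUC-ROC on the latest scored cohort
    psi: float                     # population PSI vs. the development sample
    actual_default_rate: float
    predicted_default_rate: float


def review_triggers(m: MonitoringSnapshot) -> List[str]:
    """Return the quantitative triggers breached by a monitoring snapshot."""
    breaches = []
    if m.auc < 0.60:
        breaches.append("AUC below 0.60")
    if m.psi > 0.25:
        breaches.append("PSI above 0.25")
    # Assumption: "within 20%" means relative to the predicted default rate
    if m.predicted_default_rate > 0:
        gap = abs(m.actual_default_rate - m.predicted_default_rate)
        if gap / m.predicted_default_rate > 0.20:
            breaches.append("Actual vs predicted default rate outside 20%")
    return breaches
```

An empty list means no quantitative trigger fired; any entry would start the update process with the documentation step.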
Update Process:
- Document performance degradation
- Investigate root cause
- Propose model adjustments
- Validate updated model
- Obtain governance approval
- Deploy and monitor
Software Environment:
- Python 3.9+
- pandas, numpy for data manipulation
- scikit-learn for modeling
- XGBoost for advanced model
- matplotlib, seaborn for visualization
Hardware Requirements:
- Minimum 16GB RAM recommended for full dataset
- SSD storage for faster I/O
```
Credit_Risk_Personal_Project/
|-- data_raw/ # Source data
| |-- LC_loans_granting_model_dataset.csv
|-- data_processed/ # Cleaned and transformed data
| |-- loans_cleaned.csv
| |-- loans_with_agri_flag.csv
| |-- loans_with_pd.csv
| |-- loans_with_lgd_ead.csv
| |-- loans_with_ecl.csv
|-- notebooks/ # Jupyter notebooks
| |-- 01_Data_Acquisition_EDA.ipynb
| |-- 02_Agricultural_Portfolio_Segmentation.ipynb
| |-- 03_PD_Model_Development.ipynb
| |-- 04_LGD_EAD_Estimation.ipynb
| |-- 05_ECL_Computation.ipynb
| |-- 06_Stress_Testing.ipynb
| |-- 07_SHAP_Feature_Importance.ipynb
|-- outputs/
| |-- figures/ # Visualizations (PNG plots)
| |-- models/ # Saved models and metadata
| | |-- ecl_metadata.json
| | |-- feature_scaler.joblib
| | |-- lgd_ead_metadata.json
| | |-- model_metadata.json
| | |-- pd_model_logistic.joblib
| | |-- pd_model_selected.joblib
| | |-- pd_model_xgboost.joblib
| | |-- stress_test_results.json
| |-- stress_test_summary.csv
|-- src/ # Core pipeline modules
| |-- __init__.py
| |-- data_processing.py
| |-- ecl_calculator.py
| |-- feature_engineering.py
| |-- modeling.py
| |-- stress_testing.py
| |-- visualization.py
|-- tests/ # Unit tests
| |-- __init__.py
| |-- conftest.py
| |-- test_data_processing.py
| |-- test_ecl_calculator.py
| |-- test_feature_engineering.py
| |-- test_modeling.py
| |-- test_stress_testing.py
|-- docs/ # Documentation
| |-- Executive_Summary.md
| |-- Model_Card.md
| |-- Model_Development_Document.md
|-- .gitignore
|-- pytest.ini
|-- README.md
|-- requirements.txt
|-- run_analysis.py
|-- .pytest_cache/ # Auto-generated
|-- __pycache__/ # Auto-generated
```
| Term | Definition |
|---|---|
| AUC | Area Under ROC Curve - discrimination metric |
| CECL | Current Expected Credit Losses |
| DTI | Debt-to-Income ratio |
| EAD | Exposure at Default |
| ECL | Expected Credit Loss |
| FICO | Fair Isaac Corporation credit score |
| Gini | Gini coefficient = 2 x AUC - 1 |
| LGD | Loss Given Default |
| PD | Probability of Default |
| PSI | Population Stability Index |
- FASB ASC 326 - Financial Instruments - Credit Losses
- Farm Credit Administration - Stress Testing Guidance
- Basel Committee on Banking Supervision - IRB Approach
- Lending Club Data Documentation (Zenodo)