Data Quality & Consistency Framework: Prevent Feature Drift & Regressions #59

@BPMSoftwareSolutions

🎯 Overview

This issue addresses a critical pattern of data quality and consistency issues that have emerged as we add new features. Each new feature (resume tailoring, job linking, RAG integration, etc.) introduces subtle inconsistencies that compound over time:

  • ❌ Resume files saved to wrong directories (data/resumes/resumes/ instead of data/resumes/)
  • ❌ Index files not updated with new entries
  • ❌ Bidirectional linking broken due to path issues
  • ❌ Data quality issues (skills bunched into single strings instead of arrays)
  • ❌ Timestamp format inconsistencies (ISO 8601 with/without Z suffix)
  • ❌ Missing fields in index entries
  • ❌ Inconsistent ID generation (UUIDs vs manual IDs)
  • ❌ No validation of generated resume JSON structure

Root Cause: No comprehensive validation framework or feature tracking system to catch regressions as we evolve the codebase.


📊 Problem Analysis

Pattern of Issues

Every time we add a new feature, we introduce data inconsistencies:

  1. Issue #57, Fix Resume-Job Linking & Data Consistency Issues (Resume-Job Linking) - 8 different consistency issues identified
  2. Issue #58, Deduplicate BPM Software Solutions Experience Entries (BPM Deduplication) - 4 duplicate experience entries
  3. Today's Discovery - Path handling bug causing files to be saved to wrong location
  4. Data Quality - Skills stored as single concatenated string instead of array

Why This Happens

  1. No Feature Registry - No features.json tracking what features exist and their requirements
  2. No Data Validators - No comprehensive validation of generated resume JSON
  3. No Integration Tests - Tests pass but data consistency isn't verified
  4. No Guardrails - No checks to prevent regressions when modifying core models
  5. Inconsistent Patterns - Different parts of codebase use different conventions

Impact

  • Data Integrity: Resume data becomes corrupted or inconsistent
  • Feature Reliability: Features work in isolation but break when combined
  • Maintenance Burden: Each new feature requires manual verification
  • User Trust: Generated resumes have quality issues
  • Technical Debt: Accumulating inconsistencies make future changes risky

🔍 Root Causes Identified

1. No Feature Registry (features.json)

Problem: No single source of truth for what features exist and their data requirements

Example: When we added resume-job linking, we didn't document:

  • Resume model needs job_listing_id field
  • Job listing model needs tailored_resume_ids array
  • Both need ISO 8601 timestamps with Z suffix
  • Index entries need consistent fields

Impact: New features don't know what constraints to follow

2. No Resume JSON Validator

Problem: Generated resume JSON isn't validated against a schema

Example: Skills are stored as:

{
  "technical_proficiencies": {
    "skills": ".NET, AES-256, AI, API Gateway, ..."  // ❌ Single string!
  }
}

Should be:

{
  "technical_proficiencies": {
    "skills": [".NET", "AES-256", "AI", "API Gateway"]  // ✅ Array
  }
}

Impact: Data quality issues go undetected
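Existing records that use the single-string shape would need a one-time repair. A minimal sketch of that step, assuming the legacy value is always comma-separated as in the example above; `normalize_skills` is a hypothetical helper name, not something already in the codebase:

```python
from typing import Any, List


def normalize_skills(value: Any) -> List[str]:
    """Return skills as a list of trimmed, non-empty strings."""
    if isinstance(value, str):
        # Legacy shape: one comma-separated string.
        parts = value.split(",")
    elif isinstance(value, list):
        parts = value
    else:
        raise TypeError(f"skills must be a string or list, got {type(value).__name__}")

    cleaned: List[str] = []
    for part in parts:
        if not isinstance(part, str):
            raise TypeError("every skill must be a string")
        part = part.strip()
        if part:  # drop empty fragments such as a trailing comma leaves behind
            cleaned.append(part)
    return cleaned
```

Calling it on a value that is already an array only trims whitespace and drops empty entries, so the repair can be re-run safely.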

3. No Integration Tests for Data Consistency

Problem: Tests verify functionality but not data consistency

Example: Test passes:

def test_tailor_from_job_description():
    response = api.post('/api/tailor-from-job-description', {...})
    assert response.status_code == 201  # ✅ Passes
    # But doesn't verify:
    # - Resume file exists in correct location
    # - Resume is in index
    # - Job listing is in index
    # - Bidirectional linking works
    # - Data quality is good

Impact: Regressions aren't caught until production
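The missing assertions listed in that test could be collected in one helper that inspects the data directory after the API call. This is only a sketch: it assumes `index.json` holds a JSON list of entries that each carry an `"id"` key, which this issue does not confirm, and `persistence_errors` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import List


def persistence_errors(data_dir: Path, resume_id: str) -> List[str]:
    """Check the on-disk side effects of creating one resume.

    Assumption: index.json is a JSON list of objects with an "id" key.
    """
    errors: List[str] = []
    resumes_dir = Path(data_dir) / "resumes"

    if not (resumes_dir / f"{resume_id}.json").is_file():
        errors.append("resume file missing from resumes/")
    if (resumes_dir / "resumes" / f"{resume_id}.json").exists():
        errors.append("resume file written to nested resumes/resumes/")

    index_path = resumes_dir / "index.json"
    if not index_path.is_file():
        errors.append("index.json missing")
    else:
        entries = json.loads(index_path.read_text(encoding="utf-8"))
        if not any(entry.get("id") == resume_id for entry in entries):
            errors.append("resume not listed in index.json")
    return errors
```

A test would then end with `assert persistence_errors(DATA_DIR, new_id) == []` in addition to the status code check.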

4. Inconsistent Model Instantiation

Problem: Models are instantiated inconsistently across codebase

Example: In src/api/app.py line 1222:

# ❌ WRONG - passes data_dir/resumes instead of data_dir
resume_model = Resume(DATA_DIR / "resumes")
# Creates: data/resumes/resumes/

# ✅ CORRECT - pass data_dir, let model handle subdirectory
resume_model = Resume(DATA_DIR)
# Creates: data/resumes/

Impact: Files saved to wrong locations, indexes not updated
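One way to make this mistake fail loudly instead of silently creating a nested directory is to resolve the subdirectory through a single guarded function. A sketch, assuming the model derives its storage path as `data_dir / "resumes"`; `resolve_resumes_dir` is a hypothetical helper:

```python
from pathlib import Path


def resolve_resumes_dir(data_dir: Path) -> Path:
    """Return data_dir/resumes, rejecting a data_dir that already ends in 'resumes'."""
    data_dir = Path(data_dir)
    if data_dir.name == "resumes":
        # The caller passed the subdirectory; appending again would nest it.
        raise ValueError(
            f"expected the data root, got {data_dir}; "
            f"this would create {data_dir / 'resumes'}"
        )
    return data_dir / "resumes"
```

The check is deliberately narrow: it only catches the doubled-segment case described here, not every wrong path.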

5. No Guardrails on Core Models

Problem: Changes to Resume/JobListing models aren't validated against existing code

Example: If we change Resume model's create() signature, we need to update:

  • src/api/app.py (multiple places)
  • src/tailor.py
  • src/duplicate_resume.py
  • All CRUD scripts
  • All tests

But there's no automated check to catch a call site that was missed.

Impact: Silent failures when models change
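A lightweight guard is a test that pins the parameter list of each core method, so a signature change fails in one obvious place and prompts a review of the call sites. The sketch below uses the standard `inspect.signature`; the `Resume` class and its `create()` parameters are stand-ins invented for illustration, since the real signature is not shown in this issue.

```python
import inspect
from typing import Callable, List


def signature_drift(func: Callable, expected_params: List[str]) -> List[str]:
    """Compare a callable's parameter names against a recorded list."""
    actual = list(inspect.signature(func).parameters)
    if actual == expected_params:
        return []
    return [f"{func.__qualname__}: expected {expected_params}, found {actual}"]


# Stand-in for the real model; these parameter names are hypothetical.
class Resume:
    def create(self, name, data, job_listing_id=None):
        return {"name": name, "data": data, "job_listing_id": job_listing_id}


# Recorded when the call sites were last reviewed.
EXPECTED_CREATE_PARAMS = ["self", "name", "data", "job_listing_id"]
```

This does not find the callers automatically. It only forces whoever changes the signature to update the recorded list deliberately.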


🛠️ Proposed Solution

Phase 1: Feature Registry & Documentation

Create features.json - Single source of truth for all features

{
  "features": {
    "multi_resume_support": {
      "status": "stable",
      "version": "1.0",
      "data_models": ["Resume", "JobListing"],
      "requirements": {
        "Resume": {
          "required_fields": ["id", "name", "created_at", "updated_at", "job_listing_id", "is_master", "description"],
          "index_fields": ["id", "name", "created_at", "updated_at", "job_listing_id", "is_master", "description"],
          "timestamp_format": "ISO 8601 with Z suffix",
          "id_format": "UUID"
        },
        "JobListing": {
          "required_fields": ["id", "title", "company", "description", "url", "location", "keywords", "tailored_resume_ids", "created_at", "updated_at"],
          "index_fields": ["id", "title", "company", "location", "url", "description", "created_at", "updated_at"],
          "timestamp_format": "ISO 8601 with Z suffix",
          "id_format": "UUID"
        }
      },
      "tests": ["test_multi_resume.py", "test_multi_resume_api.py"],
      "related_issues": ["#6", "#57"]
    },
    "resume_job_linking": {
      "status": "stable",
      "version": "1.0",
      "data_models": ["Resume", "JobListing"],
      "requirements": {
        "Resume": {
          "bidirectional_linking": "Resume.job_listing_id → JobListing.id"
        },
        "JobListing": {
          "bidirectional_linking": "JobListing.tailored_resume_ids[] ← Resume.id"
        }
      },
      "tests": ["test_multi_resume_api.py"],
      "related_issues": ["#57"]
    },
    "resume_tailoring_from_job_description": {
      "status": "stable",
      "version": "1.0",
      "data_models": ["Resume", "JobListing"],
      "requirements": {
        "Resume": {
          "technical_proficiencies": "Must be object whose values are arrays of strings (one skill per element)"
        },
        "JobListing": {
          "keywords": "Must be array of strings"
        }
      },
      "tests": ["test_multi_resume_api.py"],
      "related_issues": ["#57"]
    }
  },
  "data_models": {
    "Resume": {
      "file_location": "data/resumes/{id}.json",
      "index_location": "data/resumes/index.json",
      "fields": {
        "id": {"type": "string (UUID)", "required": true},
        "name": {"type": "string", "required": true, "unique": true},
        "created_at": {"type": "string (ISO 8601 with Z)", "required": true},
        "updated_at": {"type": "string (ISO 8601 with Z)", "required": true},
        "job_listing_id": {"type": "string (UUID) or null", "required": false},
        "is_master": {"type": "boolean", "required": true},
        "description": {"type": "string", "required": false}
      }
    },
    "JobListing": {
      "file_location": "data/job_listings/{id}.json",
      "index_location": "data/job_listings/index.json",
      "fields": {
        "id": {"type": "string (UUID)", "required": true},
        "title": {"type": "string", "required": true},
        "company": {"type": "string", "required": true},
        "description": {"type": "string", "required": true},
        "url": {"type": "string (URL)", "required": false},
        "location": {"type": "string", "required": false},
        "keywords": {"type": "array of strings", "required": false},
        "tailored_resume_ids": {"type": "array of UUIDs", "required": true},
        "created_at": {"type": "string (ISO 8601 with Z)", "required": true},
        "updated_at": {"type": "string (ISO 8601 with Z)", "required": true}
      }
    }
  },
  "validation_rules": {
    "timestamp_format": "All timestamps must be ISO 8601 with Z suffix (e.g., 2025-10-26T12:36:01.645244Z)",
    "id_format": "All IDs must be UUIDs (e.g., 136c188e-659d-49cf-ba0f-983c279e80e7)",
    "index_consistency": "Every resume file in data/resumes/ must have an entry in data/resumes/index.json",
    "bidirectional_linking": "If Resume.job_listing_id is set, JobListing.tailored_resume_ids must contain Resume.id",
    "unique_names": "Resume names must be unique across all resumes"
  }
}
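To show how code might consume the registry, here is a small sketch with two checks: one lists fields marked `"required": true` that a record lacks, the other lists models a feature references without a matching `data_models` entry. The function names are hypothetical, and the embedded registry is a trimmed excerpt used only to keep the example self-contained.

```python
import json
from typing import Dict, List

# Trimmed excerpt of the registry above; the real file would be read with json.load.
REGISTRY = json.loads("""
{
  "features": {
    "resume_job_linking": {"data_models": ["Resume", "JobListing"]}
  },
  "data_models": {
    "Resume": {
      "fields": {
        "id": {"type": "string (UUID)", "required": true},
        "name": {"type": "string", "required": true},
        "job_listing_id": {"type": "string (UUID) or null", "required": false}
      }
    }
  }
}
""")


def missing_required_fields(registry: Dict, model: str, record: Dict) -> List[str]:
    """List the fields marked required for `model` that `record` lacks."""
    fields = registry["data_models"][model]["fields"]
    return [name for name, spec in fields.items()
            if spec.get("required") and name not in record]


def undeclared_models(registry: Dict) -> List[str]:
    """List models that a feature references but data_models does not define."""
    declared = set(registry.get("data_models", {}))
    missing = set()
    for feature in registry.get("features", {}).values():
        missing.update(set(feature.get("data_models", [])) - declared)
    return sorted(missing)
```

Against the excerpt, the second check reports `JobListing` because the excerpt omits it; against the full registry above it would report nothing.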

Phase 2: Resume JSON Validator

Create src/validators/resume_validator.py

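from typing import Dict, List, Tuple


class ResumeValidator:
    """Validates resume JSON structure and data quality."""

    # Fields marked "required": true for Resume in features.json
    REQUIRED_FIELDS = ("id", "name", "created_at", "updated_at", "is_master")

    def validate(self, resume_data: Dict) -> Tuple[bool, List[str]]:
        """Validate resume data against schema and rules."""
        errors: List[str] = []

        # Check required fields
        for field in self.REQUIRED_FIELDS:
            if field not in resume_data:
                errors.append(f"Missing required field: {field}")

        # Check technical_proficiencies structure (each value must be an array)
        for key, value in resume_data.get("technical_proficiencies", {}).items():
            if not isinstance(value, list):
                errors.append(f"technical_proficiencies.{key} must be an array")

        # Still to implement: field types, experience bullets format,
        # timestamp formats, and other data quality checks

        return len(errors) == 0, errors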
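The timestamp and ID checks that the validator needs can be written with the standard library alone. A sketch, assuming "ISO 8601 with Z suffix" means the `YYYY-MM-DDTHH:MM:SS[.fraction]Z` shape from the example in `validation_rules`; the function names are hypothetical:

```python
import re
import uuid
from datetime import datetime, timezone

_TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def is_utc_timestamp(value: object) -> bool:
    """True for strings like 2025-10-26T12:36:01.645244Z."""
    if not isinstance(value, str) or not _TIMESTAMP_RE.match(value):
        return False
    try:
        # The regex checks the shape; strptime rejects impossible dates and times.
        datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return False
    return True


def is_uuid(value: object) -> bool:
    """True for canonical hyphenated UUID strings."""
    if not isinstance(value, str):
        return False
    try:
        # Round-tripping rejects the brace, URN, and hyphen-free spellings.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def utc_now() -> str:
    """Produce a timestamp in the format the checks above accept."""
    return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
```

Generating every timestamp through one function like `utc_now` is what removes the with/without-Z inconsistency at the source; the checks only detect it afterwards.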

Phase 3: Data Consistency Tests

Create tests/test_data_consistency.py

class TestDataConsistency:
    """Tests for data consistency across the system."""
    
    def test_resume_file_location(self):
        """Verify resume files are saved to correct location."""
        # Create resume
        # Verify file exists at data/resumes/{id}.json
        # Verify file does NOT exist at data/resumes/resumes/{id}.json
    
    def test_resume_index_updated(self):
        """Verify resume index is updated when resume is created."""
        # Create resume
        # Verify entry exists in data/resumes/index.json
    
    def test_bidirectional_linking(self):
        """Verify resume-job linking is bidirectional."""
        # Create resume with job_listing_id
        # Verify Resume.job_listing_id is set
        # Verify JobListing.tailored_resume_ids contains Resume.id
    
    def test_timestamp_consistency(self):
        """Verify all timestamps use ISO 8601 with Z suffix."""
        # Check all resume timestamps
        # Check all job listing timestamps
    
    def test_technical_proficiencies_format(self):
        """Verify technical_proficiencies are properly formatted."""
        # Check that each skills value is an array of strings,
        # NOT a single concatenated string
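The bidirectional-linking test needs a check that walks the link in both directions. A minimal sketch that works on already-loaded records, so it does not depend on how the models read files; `linking_errors` is a hypothetical name, and both arguments are assumed to map record id to record dict:

```python
from typing import Dict, List


def linking_errors(resumes: Dict[str, dict], job_listings: Dict[str, dict]) -> List[str]:
    """Check both directions of the resume/job link."""
    errors: List[str] = []

    # Forward: Resume.job_listing_id -> JobListing.tailored_resume_ids
    for resume_id, resume in resumes.items():
        job_id = resume.get("job_listing_id")
        if job_id is None:
            continue  # unlinked resumes are allowed
        job = job_listings.get(job_id)
        if job is None:
            errors.append(f"resume {resume_id} points at missing job listing {job_id}")
        elif resume_id not in job.get("tailored_resume_ids", []):
            errors.append(f"job listing {job_id} does not list resume {resume_id}")

    # Backward: JobListing.tailored_resume_ids[] -> Resume.job_listing_id
    for job_id, job in job_listings.items():
        for resume_id in job.get("tailored_resume_ids", []):
            resume = resumes.get(resume_id)
            if resume is None or resume.get("job_listing_id") != job_id:
                errors.append(
                    f"job listing {job_id} lists resume {resume_id}, which does not link back"
                )
    return errors
```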

Phase 4: Model Instantiation Guardrails

Create src/validators/model_instantiation_validator.py

from pathlib import Path


class ModelInstantiationValidator:
    """Validates correct model instantiation patterns."""

    @staticmethod
    def validate_resume_instantiation(data_dir: Path) -> bool:
        """Verify Resume model is instantiated correctly."""
        # Check that data_dir is passed, not data_dir/resumes
        if Path(data_dir).name == "resumes":
            return False
        # Still to implement: verify resumes_dir is created correctly
        # and that the index file is in the correct location
        return True

    @staticmethod
    def validate_job_listing_instantiation(data_dir: Path) -> bool:
        """Verify JobListing model is instantiated correctly."""
        # Same check for JobListing, whose subdirectory is job_listings
        return Path(data_dir).name != "job_listings"

Phase 5: Pre-Commit Hooks

Create .git/hooks/pre-commit

#!/bin/bash
# Run data consistency tests before commit
if ! python -m pytest tests/test_data_consistency.py -v; then
    echo "❌ Data consistency tests failed. Commit aborted."
    exit 1
fi

# Validate features.json
if ! python -c "import json; json.load(open('features.json'))"; then
    echo "❌ features.json is invalid JSON. Commit aborted."
    exit 1
fi
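The inline `json.load` call only proves the file parses. If the hook should also confirm the registry's top-level sections exist, the check could move into a small script along these lines. This is a sketch under assumptions: the script and function names are hypothetical, and the three section names are taken from the `features.json` example in Phase 1.

```python
import json
import sys
from pathlib import Path
from typing import List

REQUIRED_SECTIONS = ("features", "data_models", "validation_rules")


def registry_file_errors(path: Path) -> List[str]:
    """Return problems found in the registry file; an empty list means it passed."""
    try:
        registry = json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"{path}: file not found"]
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(registry, dict):
        return [f"{path}: top level must be an object"]
    return [f"{path}: missing section '{name}'"
            for name in REQUIRED_SECTIONS if name not in registry]


def main(argv: List[str]) -> int:
    """Exit status for the hook: 0 when the registry passes, 1 otherwise."""
    path = Path(argv[1]) if len(argv) > 1 else Path("features.json")
    errors = registry_file_errors(path)
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0
```

The hook would call `main(sys.argv)` from a script entry point and abort the commit on a non-zero status, in the same way it does for the pytest step.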

✅ Deliverables

Phase 1: Feature Registry

  • Create features.json with all features documented
  • Document data model requirements
  • Document validation rules
  • Create docs/FEATURES.md explaining the registry

Phase 2: Resume Validator

  • Create src/validators/resume_validator.py
  • Implement schema validation
  • Implement data quality checks
  • Add unit tests (10+ tests)
  • Integrate into API endpoints

Phase 3: Data Consistency Tests

  • Create tests/test_data_consistency.py
  • Test file locations
  • Test index updates
  • Test bidirectional linking
  • Test timestamp consistency
  • Test data quality
  • Add 20+ tests

Phase 4: Model Guardrails

  • Create src/validators/model_instantiation_validator.py
  • Add validation to Resume model
  • Add validation to JobListing model
  • Add unit tests (5+ tests)

Phase 5: Pre-Commit Hooks

  • Create .git/hooks/pre-commit
  • Add data consistency test runner
  • Add features.json validator
  • Document setup instructions

🧪 Testing Strategy

Unit Tests

  • Validator tests (15+ tests)
  • Model instantiation tests (5+ tests)

Integration Tests

  • Data consistency tests (20+ tests)
  • End-to-end feature tests (10+ tests)

Regression Tests

  • Verify all 421 existing tests still pass
  • Add new tests for each discovered issue

📋 Acceptance Criteria

  • features.json created and documents all features
  • Resume validator implemented and integrated
  • Data consistency tests added (20+ tests)
  • Model instantiation guardrails in place
  • Pre-commit hooks configured
  • All 421+ tests passing
  • No data consistency issues in new features
  • Documentation updated
  • Team trained on new validation framework

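🔗 Related Issues

  • #57 Fix Resume-Job Linking & Data Consistency Issues
  • #58 Deduplicate BPM Software Solutions Experience Entries
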
📝 Notes

  • This framework should be implemented BEFORE adding new features
  • Each new feature should update features.json with requirements
  • Data consistency tests should be added for each new feature
  • Pre-commit hooks ensure regressions are caught early
  • This prevents "feature drift" and maintains data integrity

🎯 Success Metrics

  • ✅ Zero data consistency issues in new features
  • ✅ All tests passing (421+)
  • ✅ Features documented in features.json
  • ✅ Data validators catch issues before production
  • ✅ Team confidence in data integrity
  • ✅ Reduced maintenance burden
  • ✅ Faster feature development (less debugging)
