Entity Extraction Enhancement Feature

📋 Document Summary

What This Document Covers:

  • Entity extraction integration into database edit and RAG insertion workflows
  • Implementation tracking across 6 phases (5 implementation phases plus testing)
  • New stages: MetadataOnlyEditPlanningStage and CaptionAndTagsRegenerationStage
  • Dual-plan architecture for generated vs database memecoins
  • Testing checklist and known limitations

Context Tags: #entity-extraction #rag #workflow #implementation-tracking


Overview

Add entity extraction to the database edit and RAG insertion workflows so that downstream AI stages receive an explicit list of entities, which improves planning and metadata quality.

Status: ✅ Implementation Complete - Ready for Testing

Implementation Tracking

Phase 1: Critical Fixes ✅

  • Fix RAGMemecoinEditProcessor.process() signature - Added missing action_handler parameter
  • Initialize entity_database_service when None - Auto-initializes using get_entity_database_service()
  • Verify database edit workflow no longer crashes
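
The two Phase 1 fixes can be illustrated with a minimal sketch. Only the names RAGMemecoinEditProcessor, process(), action_handler, entity_database_service, and get_entity_database_service() come from this document; the class bodies, the context argument, and the constructor signature are assumptions made for illustration and are not the project's actual code.

```python
from typing import Any, Callable, Optional


class EntityDatabaseService:
    """Hypothetical stand-in for the real entity database service."""


def get_entity_database_service() -> EntityDatabaseService:
    # Stand-in for the factory named in this document; the real one may differ.
    return EntityDatabaseService()


class RAGMemecoinEditProcessor:
    def __init__(self, entity_database_service: Optional[EntityDatabaseService] = None):
        # The service is optional; it is created lazily in process().
        self.entity_database_service = entity_database_service

    def process(self, context: dict, action_handler: Callable[[dict], Any]) -> Any:
        # Fix 1: action_handler is an explicit parameter of process().
        # Fix 2: auto-initialize the service when the caller did not supply one.
        if self.entity_database_service is None:
            self.entity_database_service = get_entity_database_service()
        return action_handler(context)
```

Initializing inside process() rather than in the constructor keeps construction cheap and still guarantees the service exists before any stage needs it.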

Phase 2: New Stages ✅

  • Create MetadataOnlyEditPlanningStage
    • Define MetadataOnlyEditPlan model (caption + tag instructions only)
    • Implement planning logic for captions/tags
    • Add comprehensive logging
    • Skip image-related planning entirely
  • Create CaptionAndTagsRegenerationStage
    • Implement targeted caption updates (4-part structure)
    • Implement tag modifications (add/remove/replace)
    • Preserve immutable fields (name, ticker, description, image)
    • Add validation to prevent immutable field changes
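
As a rough sketch of what a caption-and-tags-only plan could look like, the example below models the plan and applies add/remove/replace tag edits. MetadataOnlyEditPlan and TagEditInstructions are named in this document, but every field name here is an assumption, and stdlib dataclasses are used to keep the sketch dependency-free (the real models appear to be Pydantic models).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TagEditInstructions:
    # Field names are hypothetical; the document only says add/remove/replace.
    add: List[str] = field(default_factory=list)
    remove: List[str] = field(default_factory=list)
    replace: Dict[str, str] = field(default_factory=dict)  # old tag -> new tag


@dataclass
class MetadataOnlyEditPlan:
    # Caption and tag instructions only; no image-related fields by design.
    caption_instructions: Optional[str] = None
    tag_instructions: Optional[TagEditInstructions] = None


def apply_tag_edits(tags: List[str], edits: TagEditInstructions) -> List[str]:
    """Apply removals and replacements in place, then append additions."""
    result = []
    for tag in tags:
        if tag in edits.remove:
            continue
        result.append(edits.replace.get(tag, tag))
    result.extend(edits.add)
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(result))
```

For example, applying add=["space"], remove=["funny"], replace={"old": "retro"} to ["dog", "funny", "old"] yields ["dog", "retro", "space"].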

Phase 3: Database Edit Workflow ✅

  • Restructure RAGMemecoinEditProcessor pipeline
    • Remove FeedbackAnalysisAndRegenerationStage from pipeline
    • Add MetadataOnlyEditPlanningStage
    • Add CaptionAndTagsRegenerationStage
    • Configure EntityExtractionStage to skip reference images
    • Update stage graph to 3-stage linear pipeline
  • Deprecate FeedbackAnalysisAndRegenerationStage (added deprecation note)

Phase 4: Insertion Workflow ✅

  • Add EntityExtractionStage to RAGMemecoinInsertionProcessor
    • Position after MemecoinValidationStage
    • Configure for full entity extraction with reference images
    • Update stage graph connections (8 stages total)
    • Ensure proper context data flow
  • Update workflow documentation

Phase 5: Bug Fixes ✅

  • Fix MetadataOnlyEditPlanningStage system prompt (LLM validation errors)
    • Add explicit JSON schema examples to system prompt
    • Provide 3 concrete examples (caption-only, tag-only, combined)
    • Clarify TagEditInstructions nested structure
    • Add CRITICAL RULES and OUTPUT FORMAT sections
  • Fix integration test workflow initialization
    • Add RAGMemecoinEditAnalysisWorkflow initialization before orchestrator creation
    • Pass workflow to MemeStorageOrchestrator constructor
  • Fix inconsistent return values in accept_edit_proposal
    • Changed 3-tuple return to 2-tuple (lines 1220-1223)
    • Now matches function signature Tuple[bool, str]
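
The return-value fix amounts to making every code path return the same (success, message) pair. The sketch below shows that pattern only; the parameters, the in-memory proposals dict, and the messages are hypothetical and do not reflect the real accept_edit_proposal.

```python
from typing import Dict, Tuple


def accept_edit_proposal(proposals: Dict[str, dict], proposal_id: str) -> Tuple[bool, str]:
    """Every branch returns exactly (success, message), matching Tuple[bool, str]."""
    proposal = proposals.get(proposal_id)
    if proposal is None:
        return False, f"Proposal {proposal_id} not found"
    if proposal.get("status") == "accepted":
        return False, f"Proposal {proposal_id} was already accepted"
    proposal["status"] = "accepted"
    return True, f"Proposal {proposal_id} accepted"
```

Callers can then unpack the result as `ok, message = accept_edit_proposal(...)` on every path without risking a ValueError from a stray third element.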

Phase 6: Testing ✅

  • Test database edit with caption-only changes - PASSED ✅
    • Entity caption modification working correctly
    • Planning stage creates valid structured plan
    • LLM follows JSON schema without validation errors
  • Test database edit with tag-only changes - PASSED ✅
    • Tag additions and removals working correctly
    • Tag filtering validates against 356-tag vocabulary
    • Invalid tags correctly filtered out
  • Test database edit with combined caption + tag changes - PASSED ✅
    • Visual caption modification + tag changes working
    • Both caption instructions and tag instructions generated
    • Regeneration stage applies changes correctly
  • Verify immutable fields preserved in all cases - PASSED ✅
    • Validation stage prevents modification of name, ticker, description, image
    • Database memecoins correctly restricted to caption/tag edits only
  • Integration test results: 3/3 edit proposal tests passed
    • All 3 feedback scenarios generated valid proposals with UUIDs
    • No Pydantic validation errors encountered
    • 3-stage pipeline (entity extraction → planning → regeneration) executes correctly
  • Test insertion workflow with entity extraction
  • Performance validation
  • Edge case testing (empty feedback, conflicting instructions)
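
The tag-filtering behaviour exercised in the tag-only test can be sketched as a vocabulary check. The function name, the lowercase/strip normalization, and the tiny vocabulary are assumptions for illustration; the real vocabulary has 356 tags and its normalization rules are not stated here.

```python
from typing import Iterable, List, Set, Tuple


def filter_tags(candidates: Iterable[str], vocabulary: Set[str]) -> Tuple[List[str], List[str]]:
    """Split candidate tags into (valid, rejected) against a fixed vocabulary."""
    valid: List[str] = []
    rejected: List[str] = []
    for tag in candidates:
        normalized = tag.strip().lower()  # assumed normalization
        if normalized in vocabulary:
            if normalized not in valid:  # skip duplicates
                valid.append(normalized)
        else:
            rejected.append(tag)
    return valid, rejected
```

Returning the rejected tags alongside the valid ones makes it easy to log what the LLM proposed that fell outside the vocabulary.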

Architecture Notes

Database Edit Pipeline (3 stages)

1. EntityExtractionStage
   - Extracts entities from metadata + user feedback
   - Skips reference image loading (performance optimization)
   - Outputs: extracted_entities

2. MetadataOnlyEditPlanningStage (NEW)
   - Analyzes feedback and entities
   - Creates structured plan for caption/tag modifications ONLY
   - Outputs: metadata_edit_plan

3. CaptionAndTagsRegenerationStage (NEW)
   - Applies structured plan from planning stage
   - Updates ONLY captions and tags
   - Preserves immutable fields
   - Outputs: edited_entry
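
The data flow of the three stages above can be sketched as a linear pipeline in which each stage reads the shared context and adds its output key. The context keys extracted_entities, metadata_edit_plan, and edited_entry come from this document; the stage bodies are toy placeholders (the real stages call an LLM), and the entry and feedback keys are assumptions.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Stage = Callable[[Context], Context]


def entity_extraction_stage(ctx: Context) -> Context:
    # Toy extraction: treat capitalized words in the feedback as entities.
    feedback = str(ctx["feedback"])
    entities = [w.strip(".,") for w in feedback.split() if w[:1].isupper()]
    return {**ctx, "extracted_entities": entities}


def metadata_only_edit_planning_stage(ctx: Context) -> Context:
    # Toy plan: carry the feedback forward as the caption instruction.
    plan = {"caption_instructions": ctx["feedback"], "entities": ctx["extracted_entities"]}
    return {**ctx, "metadata_edit_plan": plan}


def caption_and_tags_regeneration_stage(ctx: Context) -> Context:
    # Copy the entry so immutable fields are carried over untouched.
    entry = dict(ctx["entry"])
    note = ctx["metadata_edit_plan"]["caption_instructions"]
    entry["caption"] = f'{entry["caption"]} ({note})'
    return {**ctx, "edited_entry": entry}


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    for stage in stages:
        ctx = stage(ctx)
    return ctx


EDIT_PIPELINE: List[Stage] = [
    entity_extraction_stage,
    metadata_only_edit_planning_stage,
    caption_and_tags_regeneration_stage,
]
```

Each stage returns a new context dict rather than mutating its input, so the output of one stage is always visible to the next and earlier keys are never lost.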

Insertion Pipeline (8 stages)

1. MemecoinValidationStage
2. EntityExtractionStage (NEW - added after validation)
3. CategoryClassificationStage
4. TagClassificationStage
5. ImageCaptioningStage
6. TagsRefinementStage
7. MemecoinEmbeddingStage
8. MemecoinInsertionActionStage
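
The stage order above is the previous seven-stage order with EntityExtractionStage spliced in after validation. A small sketch of that splice, using stage names as plain strings; the insert_after helper is hypothetical and the real stage graph is presumably configured differently.

```python
from typing import List


def insert_after(stages: List[str], anchor: str, new_stage: str) -> List[str]:
    """Return a new list with new_stage placed directly after anchor."""
    index = stages.index(anchor)  # raises ValueError if the anchor is missing
    return stages[: index + 1] + [new_stage] + stages[index + 1 :]


PREVIOUS_PIPELINE = [
    "MemecoinValidationStage",
    "CategoryClassificationStage",
    "TagClassificationStage",
    "ImageCaptioningStage",
    "TagsRefinementStage",
    "MemecoinEmbeddingStage",
    "MemecoinInsertionActionStage",
]

INSERTION_PIPELINE = insert_after(
    PREVIOUS_PIPELINE, "MemecoinValidationStage", "EntityExtractionStage"
)
```

Placing extraction directly after validation means every later stage, from category classification onward, can read the extracted entities from the context.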

Key Constraints

  • Database Memecoins: Can ONLY modify captions and tags
  • Immutable Fields: name, ticker, description, image
  • No Image Planning: MetadataOnlyEditPlanningStage has no image logic
  • Performance: Skip reference images in database workflow
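
The immutable-field constraint can be enforced with a before/after comparison. This is a minimal sketch: the four field names come from this document, while the function names and the dict-based entry shape are assumptions.

```python
from typing import Dict, List

IMMUTABLE_FIELDS = ("name", "ticker", "description", "image")


def changed_immutable_fields(original: Dict[str, object], edited: Dict[str, object]) -> List[str]:
    """List the immutable fields whose values differ between the two entries."""
    return [f for f in IMMUTABLE_FIELDS if original.get(f) != edited.get(f)]


def validate_edit(original: Dict[str, object], edited: Dict[str, object]) -> None:
    """Raise if an edit touched anything other than captions and tags."""
    changed = changed_immutable_fields(original, edited)
    if changed:
        raise ValueError(f"Immutable fields modified: {', '.join(changed)}")
```

Running a check like this after regeneration guards against an LLM rewriting a field it was never asked to touch.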

Files Changed

Modified

  • src/domain/processor/rag_memecoin_edit_processor.py - Fixed signature, added service init, restructured to the 3-stage pipeline (Phases 1 and 3)
  • src/domain/processor/rag_memecoin_insertion_processor.py - Added EntityExtractionStage after MemecoinValidationStage (Phase 4)

Created

  • src/domain/processor/stages/metadata_only_edit_planning_stage.py
  • src/domain/processor/stages/caption_and_tags_regeneration_stage.py

Testing Checklist

  • Database edit workflow runs without TypeError
  • Entity extraction provides value to planning stage
  • Plans are clear and actionable
  • Regeneration follows plan correctly
  • Immutable fields never change
  • Performance is acceptable

Notes

  • Entity extraction purpose: Provide clear entity understanding to AI stages for better planning
  • No EditDecisionStage needed since we can only modify captions/tags
  • Reference images skipped in database workflow for performance