fix: resolve critical graph extraction and audio transcription bugs#2261
Open
chrisgscott wants to merge 3 commits into
Open
fix: resolve critical graph extraction and audio transcription bugs#2261chrisgscott wants to merge 3 commits into
chrisgscott wants to merge 3 commits into
Conversation
- Fix graph extraction message formatting bug that prevented entity extraction * Handle both dict and message object formats in debug logging * Resolves 'dict object has no attribute role' error * Enables successful extraction with GPT-5, GPT-4o and other modern models - Fix audio transcription parameter filtering * Filter out text chunking parameters from audio API calls * Resolves 'Invalid chunking_strategy' error with gpt-4o-mini-transcribe * Enables successful audio transcription with new OpenAI models - Add comprehensive debug logging for troubleshooting extraction issues These are critical bug fixes that enable R2R to work properly with modern OpenAI models including GPT-5 and GPT-4o variants.
Contributor
There was a problem hiding this comment.
Caution
Changes requested ❌
Reviewed everything up to 4c5d56d in 1 minute and 43 seconds. Click for details.
- Reviewed
164lines of code in2files - Skipped
0files when reviewing. - Skipped posting
1draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. py/core/parsers/media/audio_parser.py:75
- Draft comment:
The file opened with open(temp_file_path, 'rb') is not explicitly closed. Consider using a 'with' statement to ensure the file handle is properly closed. - Reason this comment was not posted:
Comment was on unchanged code.
Workflow ID: wflow_vVg3QbWGZXgezhTm
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
| for attempt in range(retries): | ||
| try: | ||
| # DEBUG LOGGING: Log the exact prompt being sent | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") |
Contributor
There was a problem hiding this comment.
Consider using logger.debug instead of logger.info for verbose debug logging (e.g. logging the prompt details) to avoid cluttering production logs.
Suggested change
| logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") | |
| logger.debug(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") |
|
|
||
| cleaned_xml = sanitize_xml(response_str) | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML length: {len(cleaned_xml)}") | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") |
Contributor
There was a problem hiding this comment.
Verbose logging of XML content (e.g. cleaned and wrapped XML) may expose sensitive data; consider using debug level or gating these logs behind a debug flag.
Suggested change
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") | |
| logger.debug(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") |
- Untrack docker/env/r2r-full.env to prevent API key exposure - Enhanced .gitignore to block sensitive files - File remains local for development but won't be committed
✨ Features Added: • Hierarchical chunking with parent-child linking for better context • Enhanced spreadsheet processing with narrative + structured data storage • Tool-augmented orchestration with automatic Text-to-SQL queries • Citation system with deep links and confidence scoring • Web search integration with smart fallback and user controls • Supabase integration with enhanced schema and RLS policies • MCP server for standardized API access across applications 🤖 AI Model Support: • GPT-5, O3-mini, Claude-3.7-Sonnet integration • Advanced query strategies: RAG Fusion, HyDE • Multi-modal processing: text, images, audio, spreadsheets • High-quality embeddings (3072 dimensions) 🛠️ Developer Experience: • One-command setup with ./setup-new-project.sh • Comprehensive documentation and examples • Security hardened with proper .gitignore and templates • Production-ready configuration • Complete test suite 🏗️ Architecture: • Bug fixes for graph extraction and audio transcription • Enhanced metadata providers for better citations • Supabase-optimized database schema • MCP integration for frontend applications • Ellen V2 project documentation and planning This template now provides enterprise-grade RAG capabilities that surpass the original Ellen V2 specification, with complete source transparency, intelligent fallbacks, and standardized API access.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🐛 Bug Fixes
This PR resolves two critical bugs that prevent R2R from working with modern OpenAI models:
1. Graph Extraction Message Formatting Bug
2. Audio Transcription Parameter Filtering Bug
3. Enhanced Debug Logging
🧪 Testing
📊 Impact
These fixes are critical for R2R compatibility with:
Without these fixes, graph extraction and audio transcription fail completely with newer models.
🔍 Files Changed
✅ Checklist
Important
Fixes critical bugs in graph extraction and audio transcription, adding enhanced logging for compatibility with modern OpenAI models.
_extract_graph_search_results_from_chunk_group()ingraph_service.pyto handle both dict and message object formats.ingest()inaudio_parser.pyto exclude text-specific parameters.graph_service.pyandaudio_parser.pyfor troubleshooting extraction and parsing issues.This description was created by
for 4c5d56d. You can customize this summary. It will automatically update as commits are pushed.