Reduce structural noise before graph extraction for mixed-source narrative inputs #583

Open
YeJe-cpu wants to merge 1 commit into 666ghj:main from YeJe-cpu:fix/preprocess-structural-noise

Conversation

@YeJe-cpu

PR Title

Reduce structural noise before graph extraction for mixed-source narrative inputs


PR Body

Summary

This PR adds a small preprocessing step before Step 1 graph extraction to reduce structural noise in markdown-like mixed-source narrative inputs.

The goal is not to solve canonicalization or post-build deduplication. The goal is earlier: prevent obvious non-entity text from entering the graph in the first place.

Why this change

While testing MiroFish on real narrative packets based on Breaking Bad counterfactual simulations, I found that Step 1 could promote structural, non-entity text into entities.

These packets were not plain prose. They included a mix of:

  • character notes
  • plot fragments
  • episode identifiers
  • episode titles
  • naming conventions
  • graph-building instructions

One packet explored a scenario where Gus Fring survives. Another explored a scenario where Walter White never enters the drug trade.

Under this kind of mixed-source input, the structural text promoted into entities included:

  • section headings
  • episode / chapter identifiers
  • episode titles
  • graph instructions and naming conventions

I focused on preprocessing because the dominant failure mode I observed was structural text entering the graph before later-stage deduplication could help.

What this patch does

  • runs preprocess_text() before chunking
  • removes graph-instruction sections and instruction-like lines
  • strips episode/chapter identifiers such as S04E13
  • removes title-like non-entity anchors derived from the source text itself
  • keeps the filtering conservative and local to text preprocessing
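For reference, the filtering described above can be sketched roughly as follows. This is an illustrative sketch only, not the merged patch: the function name `preprocess_text()` comes from the PR, but the specific regex patterns below are my assumptions about what "instruction-like lines" and heading detection could look like.

```python
import re

# Assumed patterns -- the actual patch may use different heuristics.
EPISODE_ID = re.compile(r"\bS\d{1,2}E\d{1,2}\b", re.IGNORECASE)   # e.g. S04E13
INSTRUCTION_LINE = re.compile(
    r"^\s*(?:note|instruction|naming convention)s?\s*[:\-]", re.IGNORECASE
)
HEADING = re.compile(r"^\s{0,3}#{1,6}\s")  # markdown-style section headings

def preprocess_text(text: str) -> str:
    """Drop structural noise lines before chunking and graph extraction."""
    kept = []
    for line in text.splitlines():
        if HEADING.match(line):           # section headings are not entities
            continue
        if INSTRUCTION_LINE.match(line):  # instruction-like lines
            continue
        line = EPISODE_ID.sub("", line)   # strip episode/chapter identifiers
        kept.append(line)
    return "\n".join(kept)
```

The key design point is that filtering happens on raw text, before chunking, so downstream extraction never sees the structural lines at all.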

Validation

I validated this on two real mixed-source narrative packets based on Breaking Bad counterfactual simulations.

Case A

This packet explored a “Gus survives” scenario and included multiple documents with episode titles, episode identifiers, naming conventions, and extraction notes.

  • before: 56 nodes / 156 edges
  • after: 29 nodes / 38 edges

Examples of nodes that no longer entered the graph:

  • Face Off
  • Box Cutter
  • Half Measures
  • Full Measure
  • Crawl Space
  • End Times
  • MiroFish

Key entities still preserved:

  • Walter White
  • Heisenberg
  • Jesse Pinkman
  • Gus
  • Gus Fring
  • Mike Ehrmantraut
  • Saul Goodman

Case B

This packet explored a “Walter White never enters the drug trade” scenario using earlier-season material with lighter structural noise.

  • before: 24 nodes / 27 edges
  • after: 23 nodes / 32 edges

The effect is smaller here, but it suggests the preprocessing is not merely tuned to a single packet.
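The before/after comparisons above can be reproduced with a small helper along these lines. This helper is not part of the PR; it is a hypothetical sketch that assumes the graph is available as plain node and edge collections.

```python
def graph_delta(before_nodes, before_edges, after_nodes, after_edges):
    """Compare node/edge counts and list which nodes preprocessing kept out."""
    removed = sorted(set(before_nodes) - set(after_nodes))
    return {
        "nodes": (len(before_nodes), len(after_nodes)),
        "edges": (len(before_edges), len(after_edges)),
        "removed_nodes": removed,
    }
```

Running this against Case A would surface entries like "Face Off" and "Box Cutter" in `removed_nodes` while leaving "Walter White" and the other key entities untouched.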

What this PR does not do

This PR does not attempt to solve:

  • alias / canonicalization (Gus vs Gus Fring, Walter White vs Heisenberg)
  • ontology generation quality
  • post-build entity deduplication

It only reduces one upstream source of noisy entities.

Notes

If useful, I can also share the exact validation inputs I used for reproduction.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Apr 26, 2026
