Reduce structural noise before graph extraction for mixed-source narrative inputs #583

Open
YeJe-cpu wants to merge 1 commit into 666ghj:main from YeJe-cpu:fix/preprocess-structural-noise

Conversation

@YeJe-cpu

PR Title

Reduce structural noise before graph extraction for mixed-source narrative inputs


PR Body

Summary

This PR adds a small preprocessing step before Step 1 graph extraction to reduce structural noise in markdown-like mixed-source narrative inputs.

The goal is not to solve canonicalization or post-build deduplication. The goal is earlier: prevent obvious non-entity text from entering the graph in the first place.

Why this change

While testing MiroFish on real narrative packets based on Breaking Bad counterfactual simulations, I found that Step 1 could promote structural, non-entity text into entities.

These packets were not plain prose. They included a mix of:

  • character notes
  • plot fragments
  • episode identifiers
  • episode titles
  • naming conventions
  • graph-building instructions

One packet explored a scenario where Gus Fring survives. Another explored a scenario where Walter White never enters the drug trade.

Under this kind of mixed-source input, the structural text promoted into entities included:

  • section headings
  • episode / chapter identifiers
  • episode titles
  • graph instructions and naming conventions

I focused on preprocessing because the dominant failure mode I observed was structural text entering the graph before later-stage deduplication could help.

What this patch does

  • runs preprocess_text() before chunking
  • removes graph-instruction sections and instruction-like lines
  • strips episode/chapter identifiers such as S04E13
  • removes title-like non-entity anchors derived from the source text itself
  • keeps the filtering conservative and local to text preprocessing
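For reference, the filtering described above can be sketched roughly as follows. This is an illustrative sketch only, not the merged patch: the function name `preprocess_text()` comes from the PR, but the specific regex patterns below are my assumptions about what "instruction-like lines" and heading detection could look like.

```python
import re

# Assumed patterns -- the actual patch may use different heuristics.
EPISODE_ID = re.compile(r"\bS\d{1,2}E\d{1,2}\b", re.IGNORECASE)   # e.g. S04E13
INSTRUCTION_LINE = re.compile(
    r"^\s*(?:note|instruction|naming convention)s?\s*[:\-]", re.IGNORECASE
)
HEADING = re.compile(r"^\s{0,3}#{1,6}\s")  # markdown-style section headings

def preprocess_text(text: str) -> str:
    """Drop structural noise lines before chunking and graph extraction."""
    kept = []
    for line in text.splitlines():
        if HEADING.match(line):           # section headings are not entities
            continue
        if INSTRUCTION_LINE.match(line):  # instruction-like lines
            continue
        line = EPISODE_ID.sub("", line)   # strip episode/chapter identifiers
        kept.append(line)
    return "\n".join(kept)
```

The key design point is that filtering happens on raw text, before chunking, so downstream extraction never sees the structural lines at all.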

Validation

I validated this on two real mixed-source narrative packets based on Breaking Bad counterfactual simulations.

Case A

This packet explored a “Gus survives” scenario and included multiple documents with episode titles, episode identifiers, naming conventions, and extraction notes.

  • before: 56 nodes / 156 edges
  • after: 29 nodes / 38 edges

Examples of nodes that no longer entered the graph:

  • Face Off
  • Box Cutter
  • Half Measures
  • Full Measure
  • Crawl Space
  • End Times
  • MiroFish

Key entities still preserved:

  • Walter White
  • Heisenberg
  • Jesse Pinkman
  • Gus
  • Gus Fring
  • Mike Ehrmantraut
  • Saul Goodman

Case B

This packet explored a “Walter White never enters the drug trade” scenario using earlier-season material with lighter structural noise.

  • before: 24 nodes / 27 edges
  • after: 23 nodes / 32 edges

The effect is smaller here, but it suggests the preprocessing is not merely tuned to a single packet.
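The before/after comparisons above can be reproduced with a small helper along these lines. This helper is not part of the PR; it is a hypothetical sketch that assumes the graph is available as plain node and edge collections.

```python
def graph_delta(before_nodes, before_edges, after_nodes, after_edges):
    """Compare node/edge counts and list which nodes preprocessing kept out."""
    removed = sorted(set(before_nodes) - set(after_nodes))
    return {
        "nodes": (len(before_nodes), len(after_nodes)),
        "edges": (len(before_edges), len(after_edges)),
        "removed_nodes": removed,
    }
```

Running this against Case A would surface entries like "Face Off" and "Box Cutter" in `removed_nodes` while leaving "Walter White" and the other key entities untouched.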

What this PR does not do

This PR does not attempt to solve:

  • alias / canonicalization (Gus vs Gus Fring, Walter White vs Heisenberg)
  • ontology generation quality
  • post-build entity deduplication

It only reduces one upstream source of noisy entities.

Notes

If useful, I can also share the exact validation inputs I used for reproduction.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Apr 26, 2026
