Add options to minimize parsed html text #354

Draft
aleks-v-k wants to merge 2 commits into deanmalmgren:master from aleks-v-k:add-options-to-minimize-parsed-html

Conversation

@aleks-v-k

This PR adds options to minimize the text extracted from HTML:

  • merge consecutive space characters into a single space
  • remove table formatting.

The main reason for these changes: the parser gets OOM-killed on some large HTML files (ones that contain a lot of spaces plus a table).
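
To make the two options concrete, here is a minimal sketch of what they could look like. It is not the code from this PR's commits: the names `collapse_whitespace` and `flatten_table` are hypothetical, and the input to `flatten_table` (rows as lists of cell strings) is an assumption about how table cells would be available at that point.

```python
import re

# Runs of spaces, tabs, and non-breaking spaces within a line.
_SPACES = re.compile(r"[ \t\u00a0]+")
# Three or more consecutive newlines (i.e. two or more blank lines).
_BLANK_LINES = re.compile(r"\n{3,}")


def collapse_whitespace(text):
    """Hypothetical helper: merge runs of spaces into one and trim each line."""
    lines = (_SPACES.sub(" ", line).strip() for line in text.splitlines())
    # Keep at most one blank line between blocks of text.
    return _BLANK_LINES.sub("\n\n", "\n".join(lines)).strip()


def flatten_table(rows):
    """Hypothetical helper: emit table cells without column padding.

    `rows` is assumed to be a list of rows, each a list of cell strings.
    Empty cells are skipped, and the rest are joined with a single space.
    """
    return "\n".join(
        " ".join(collapse_whitespace(cell) for cell in row if cell.strip())
        for row in rows
    )
```

Since both helpers only shrink the text, they would reduce the size of the final string. Whether that is enough to avoid the OOM kill depends on where the memory is actually used during parsing.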

Would you be willing to accept these changes? If so, I will add tests to cover the new code.

@traverseda
Contributor

Hello, I've recently been made a maintainer of this project. I'd be interested in these changes. I'd also be interested in a selectolax-based text extractor if you were feeling adventurous.

@KyleKing KyleKing self-assigned this Feb 3, 2026
@KyleKing KyleKing self-requested a review February 3, 2026 02:39
@KyleKing KyleKing removed their assignment Feb 3, 2026
3 participants