Properly import the wikipedia spam blocklist #52

Open
GexEnterTheGecko wants to merge 6 commits into NotaInutilis:main from GexEnterTheGecko:main

Conversation

@GexEnterTheGecko
Contributor

last regex was awful, now it actually works, when the evaluator encounters \/, it doesn't spectacularly cause bad domain names

s|\\\[^.]||g as in "match any that isn't a dot" works great except when it removes backslashes!!! this fixes the thing
copy paste fail
updated it due to better regex coming from advanced algorithmic optimization (not really)
@NotaInutilis
Owner

Seeing the comments, I'll wait until tomorrow before merging that :D

@GexEnterTheGecko
Contributor Author

"it shouldn't cause issue" and then i look at the CI and there's x.comrttereschenko in there! 1d637f4#diff-63a71dc442383a5564cfaeab39744758e49dd6a12c569a85091256b21a9017bb
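
A minimal Python sketch of how a merged name like `x.comrttereschenko` can appear. It assumes the source entry was a regex fragment along the lines of `x\.com\/rttereschenko`, which is a guess and not quoted from the blocklist. The PR itself uses sed, and both function names here are hypothetical.

```python
import re


def naive_strip(entry: str) -> str:
    """Unescape dots, then delete every remaining backslash escape outright."""
    entry = entry.replace(r"\.", ".")
    # Removing `\/` entirely glues the path onto the host name.
    return re.sub(r"\\.", "", entry)


def host_only(entry: str) -> str:
    """Unescape dots and slashes, drop other escapes, keep only the host part."""
    entry = entry.replace(r"\.", ".").replace(r"\/", "/")
    # Leftover escapes such as `\b` (word boundary) carry no domain text.
    entry = re.sub(r"\\.", "", entry)
    return entry.split("/", 1)[0]


entry = r"x\.com\/rttereschenko"  # hypothetical source line
print(naive_strip(entry))  # x.comrttereschenko
print(host_only(entry))    # x.com
```

Cutting at the first slash turns a path-specific entry into a block on the whole host, so entries with a path may be better skipped or checked against an allowlist.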

wi9.biz
kazino-onlain.ru
%d1%84%d0%be%d1%80%d1%83%d0%bc-%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be.%d1%80%d1%84
%D1%84%D0%BE%D1%80%D1%83%D0%BC-%D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE.%D1%80%D1%84
Owner

wat

Contributor Author

onlinger.ru
casinoru.ru
%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be-%d1%84%d0%be%d1%80%d1%83%d0%bc.%d1%80%d1%84
%D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE-%D1%84%D0%BE%D1%80%D1%83%D0%BC.%D1%80%D1%84
Owner

wat²

Contributor Author

that was in the file, i just re-pasted the thing without running the CI, but basically this is something that leads to a cyrillic casino website thing

@NotaInutilis
Owner

Hope you don't mind me adding a couple more very useful comments. There are at least 4 weird lines like that, and I'm not sure whether they come from the sedding or from another part of the copying. It's Sunday evening, I refuse to think.

@GexEnterTheGecko
Contributor Author

no worries! most of these lines use a thing called percent-encoding: https://en.wikipedia.org/wiki/Percent-encoding
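
As a quick illustration, one of the percent-encoded lines quoted earlier in this thread can be decoded with Python's standard library. This is only a sketch for reading the entry, not part of the project's import script.

```python
from urllib.parse import unquote

# Line copied from the review thread above.
encoded = "%d1%84%d0%be%d1%80%d1%83%d0%bc-%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be.%d1%80%d1%84"

# unquote() interprets the %XX bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # форум-казино.рф
```

The uppercase variant of the same line (`%D1%84...`) decodes to the same string, since the hex digits in percent-encoding are case-insensitive.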

@NotaInutilis
Owner

Ah yeah, I don't encounter these ones that often. I think there are a couple in this list, but they also work in their original character encoding so I'm not quite sure if they should be normalized in one way or the other.
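
One possible way to settle the normalization question, sketched in Python under the assumption that each line is a bare domain name: decode the percent-encoding, lowercase, then pick either the Unicode form or the ASCII (punycode) form as canonical. The helper names are hypothetical, and the built-in `idna` codec implements the older IDNA 2003 rules.

```python
from urllib.parse import unquote


def to_unicode(domain: str) -> str:
    """Percent-decode and lowercase, giving the human-readable form."""
    return unquote(domain).lower()


def to_ascii(domain: str) -> str:
    """Convert to the punycode (xn--) form that DNS resolvers see."""
    return to_unicode(domain).encode("idna").decode("ascii")


# Lines copied from the review thread: one plain domain, plus the same
# Cyrillic domain percent-encoded with lowercase and uppercase hex digits.
lines = [
    "wi9.biz",
    "%d1%84%d0%be%d1%80%d1%83%d0%bc-%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be.%d1%80%d1%84",
    "%D1%84%D0%BE%D1%80%D1%83%D0%BC-%D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE.%D1%80%D1%84",
]

unique_unicode = sorted({to_unicode(line) for line in lines})
unique_ascii = sorted({to_ascii(line) for line in lines})
print(unique_unicode)
print(unique_ascii)
```

Either canonical form collapses the two percent-encoded spellings into a single entry. Which form to store depends on what the tools consuming the list expect.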

@NotaInutilis
Owner

ALSO GUESS WHAT THERE'S ANOTHER ISSUE #53

@GexEnterTheGecko
Contributor Author

note to self: don't drink tap water at jerry garcia's

@GexEnterTheGecko
Contributor Author

@NotaInutilis i've added it to the allowlist, please check my commit for correctness when you have the time

@NotaInutilis
Owner

The allowlist is only used in the importation process, not in the list generation. I think I coded that quite roughly so it should clear all files in the _imported folder, I'll have to check that again.
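
For readers unfamiliar with the idea, here is a rough Python sketch of filtering an imported list against an allowlist at import time. It is not the repository's actual code: the function name is hypothetical, and treating subdomains of an allowlisted domain as allowed is an assumption.

```python
def filter_imported(domains, allowlist):
    """Drop imported domains that match the allowlist (hypothetical helper)."""
    allowed = {d.strip().lower() for d in allowlist if d.strip()}
    kept = []
    for domain in domains:
        name = domain.strip().lower()
        if not name:
            continue
        # Build the domain plus every parent domain, e.g. www.x.com -> x.com -> com.
        parts = name.split(".")
        parents = {".".join(parts[i:]) for i in range(len(parts))}
        # Assumption: an allowlisted domain also covers its subdomains.
        if parents & allowed:
            continue
        kept.append(name)
    return kept


# Illustrative data only; the real allowlist contents are not shown in this thread.
imported = ["x.com", "wi9.biz", "www.x.com"]
print(filter_imported(imported, ["x.com"]))  # ['wi9.biz']
```

Because a filter like this runs only during import, a domain that reaches the generated list by another route would not be removed by it.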

@NotaInutilis changed the title from "unfuck regex" to "Properly import the wikipedia spam blacklist" on Nov 9, 2025
@GexEnterTheGecko changed the title from "Properly import the wikipedia spam blacklist" to "Properly import the wikipedia spam blocklist" on Nov 10, 2025
2 participants