Properly import the Wikipedia spam blocklist #52
s|\\\[^.]|| g as in "match any that isn't a dot" works great except when it removes backslashes!!! this fixes the thing
copy paste fail
updated it due to better regex coming from advanced algorithmic optimization (not really)
oh my god
|
Seeing the comments, I'll wait until tomorrow before merging that :D |
|
"it shouldn't cause issue" and then i look at the CI and there's x.comrttereschenko in there! 1d637f4#diff-63a71dc442383a5564cfaeab39744758e49dd6a12c569a85091256b21a9017bb |
| wi9.biz | ||
| kazino-onlain.ru | ||
| %d1%84%d0%be%d1%80%d1%83%d0%bc-%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be.%d1%80%d1%84 | ||
| %D1%84%D0%BE%D1%80%D1%83%D0%BC-%D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE.%D1%80%D1%84 |
| onlinger.ru | ||
| casinoru.ru | ||
| %d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be-%d1%84%d0%be%d1%80%d1%83%d0%bc.%d1%80%d1%84 | ||
| %D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE-%D1%84%D0%BE%D1%80%D1%83%D0%BC.%D1%80%D1%84 |
that was in the file, i just re-pasted the thing without running the CI, but basically this is something that leads to a cyrillic casino website thing
aka: this URL is https://казино-форум.рф/
|
Hope you don't mind me adding a couple more very useful comments. There are at least 4 weird lines like that, not sure they are from the sedding or another part of the copying. It's Sunday evening, I refuse to think. |
|
no worries! most of these lines use a thing called percent-encoding: https://en.wikipedia.org/wiki/Percent-encoding |
|
Ah yeah, I don't encounter these ones that often. I think there are a couple in this list, but they also work in their original character encoding so I'm not quite sure if they should be normalized in one way or the other. |
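For what it's worth, here is a rough sketch of what normalizing these entries could look like. It is in Python rather than whatever the import script actually uses, and `normalize_entry` is a made-up name, not something from this repository. It assumes the percent-encoded bytes are UTF-8, which holds for the entries quoted above.

```python
from urllib.parse import unquote


def normalize_entry(entry: str) -> str:
    """Decode percent-encoding (assumed UTF-8) and lowercase the result."""
    return unquote(entry.strip()).lower()


# Two of the entries from the diff above, plus a plain ASCII one.
entries = [
    "%d0%ba%d0%b0%d0%b7%d0%b8%d0%bd%d0%be-%d1%84%d0%be%d1%80%d1%83%d0%bc.%d1%80%d1%84",
    "%D0%BA%D0%B0%D0%B7%D0%B8%D0%BD%D0%BE-%D1%84%D0%BE%D1%80%D1%83%D0%BC.%D1%80%D1%84",
    "onlinger.ru",
]

# Lowercase and uppercase hex escapes decode to the same domain,
# so a set removes the duplicate.
unique = sorted({normalize_entry(e) for e in entries})
print(unique)
```

Whether the list should store the decoded form, the encoded form, or both is still the open question from the comment above; the sketch only shows that the two encoded variants collapse into one entry.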
|
ALSO GUESS WHAT THERE'S ANOTHER ISSUE #53 |
|
note to self: don't drink tap water at jerry garcia's |
|
@NotaInutilis i've added it to the allowlist, please check my commit for correctness when you have the time |
|
The allowlist is only used in the importation process, not in the list generation. I think I coded that quite roughly so it should clear all files in the |
last regex was awful, now it actually works: when the evaluator encounters \/, it no longer spectacularly produces bad domain names
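As an illustration only, the kind of unescaping these commits talk about could be sketched as below. This is Python, not the sed expression from the PR, and `unescape_entry` is a hypothetical name. It assumes blocklist entries escape literal punctuation with a backslash (for example `\.` and `\/`) and leaves sequences such as `\b` alone, since blindly dropping that backslash would glue a stray letter onto the domain.

```python
import re


def unescape_entry(entry: str) -> str:
    """Remove the backslash in front of escaped punctuation only.

    Backslashes followed by a letter, digit, underscore or whitespace
    (e.g. the regex word boundary \\b) are left untouched.
    """
    return re.sub(r"\\([^\w\s])", r"\1", entry)


print(unescape_entry(r"example\.com\/path"))
print(unescape_entry(r"\bexample\.com\b"))
```

Restricting the match to punctuation is a design choice for this sketch; the actual import script may handle regex metasequences differently.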