Use etld plus one matching for 3p#172
Conversation
| 'www.example.org' | ||
| ) | ||
| assert.equal(queryResult.matches, true) | ||
| }) |
There was a problem hiding this comment.
i suggest adding some tests that use non-trivial etld+1's like, example.co.uk and example.githubusercontent.com
There was a problem hiding this comment.
There are many, but there at the C++ level. They're in this file https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc
There was a problem hiding this comment.
|
general question about this approach: is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs |
We could, but then we'd loose the ability to run in node (which has been very valuable for crawling / measurement, getting other folks to use the code, debugging, etc) |
| @@ -0,0 +1,51 @@ | |||
| /* Copyright (c) 2018 The Brave Software Team. Distributed under the MPL2 | |||
There was a problem hiding this comment.
nit: Should be 2019. This applies to all of the new files added in this PR.
There was a problem hiding this comment.
fixed with 81cc4f4 :)
There was a problem hiding this comment.
Maybe I'm confused about how GitHub is displaying the latest version of your PR, but it seems like you fixed all files, except for etld/domain.cc?
There was a problem hiding this comment.
No, i goofed, apologies for the butter fingers. Fixed now
|
I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo? |
You're right no need for this. I removed it from the .gitignore previously, now also removed it from the set of tracked files. Should be good now |
|
@bbondy my code expects the public suffix list to be in a known location, and lazily parses the list on first use (i.e. at |
be4c62e to
33ff5d9
Compare
| .vscode | ||
|
|
||
| # These files are either fetched at build time, or generated from the build | ||
| etld/data/public_suffix_list.h |
There was a problem hiding this comment.
nit: I usually try to make these absolute paths with respect to the location of the .gitignore file (i.e. /etld/data/public_suffix.h) in order to avoid unexpected matches somewhere else in the codebase. Not very likely in this case, so feel free to ignore this suggestion, but a good habit IMHO.
|
|
||
| build: | ||
| ./node_modules/.bin/node-gyp configure && ./node_modules/.bin/node-gyp build | ||
| curl -s https://publicsuffix.org/list/public_suffix_list.dat -o etld/data/public_suffix_list.dat |
There was a problem hiding this comment.
Kudos for not using -k here :)
nit: If you want to make that curl line even stricter, you could also throw in --tlsv1.2 to enforce a minimum level of TLS.
| constructorArgs.push(isException ? "true" : "false"); | ||
|
|
||
| const wrappedLabels = labels.map(JSON.stringify); | ||
| constructorArgs.push("{" + wrappedLabels.join(", ") + "}"); |
There was a problem hiding this comment.
Should the labels be surrounded by " in case they're not all bare words? Also, is it possible they include " characters that should be escaped?
| } | ||
|
|
||
| if (previous != 0) { | ||
| labels_.push_back(string.substr(previous, current - previous)); |
There was a problem hiding this comment.
I believe this call will fail if you get invalid input like: abcd.efgh. (note the trailing dot). Might be worth adding a test for it if there isn't one already.
2919690 to
8d72272
Compare
|
@bbondy this is now ready for review again. The ways to enable the eTLD+1 checking (by parsing a public suffix list) are:
|
d1ada9e to
86fb8a9
Compare
bbondy
left a comment
There was a problem hiding this comment.
I added a commit to fix npm run perf, this is unfortunately currently regressing perf by 4-5x overall though. We'll need to optimize that so it's trivially the same in speed for matching.
…/ re-alloc-ing objects
0fcdfc4 to
76bdb66
Compare
| while (current != std::string::npos) { | ||
| current_label = label_text.substr(previous, current - previous); | ||
| if (current_label.length() == 0) { | ||
| throw PublicSuffixRuleInputException( |
There was a problem hiding this comment.
Chromium is built with exceptions disabled, and this would be the first exception.
I'd recommend instead passing in a pointer to a vector and then filling it. And make the return value the result, and propagate failures.
| // If don't include any trailing whitespace, if there is any. | ||
| current_label = label_text.substr(previous, current - previous); | ||
| if (current_label == "") { | ||
| throw PublicSuffixRuleInputException( |
| PublicSuffixRule::PublicSuffixRule(const std::string& rule_text) { | ||
| std::string trimmed_rule_text(trim_to_whitespace(rule_text)); | ||
| if (trimmed_rule_text.length() == 0) { | ||
| throw PublicSuffixRuleInputException( |
| break; | ||
|
|
||
| case '/': | ||
| throw PublicSuffixRuleInputException( |
| before(function () { | ||
| this.client = new AdBlockClient() | ||
| this.client.parse('||bannersnack.com^$third-party') | ||
| const etldRules = fs.readFileSync('./test/data/public_suffix_list.dat', 'utf8') |
There was a problem hiding this comment.
Can we do some similar tests for when we don't call parsePublicSuffixRules and it falls back to the warning with FQDN?
There was a problem hiding this comment.
You can probably just add a for loop which loops twice using different clients around all the it calls.
| this.client.parsePublicSuffixRules(etldRules) | ||
| }) | ||
| it('consider eTLD+1 domains as 1p', function () { | ||
| const altSubDomainQuery = this.client.findMatchingFilters( |
There was a problem hiding this comment.
could we add tests for matches too?
d25f24d to
ad0dab0
Compare
fixes #171