Use etld plus one matching for 3p by pes10k · Pull Request #172 · brave-experiments/ad-block

pes10k · 2019-02-04T16:58:54Z

fixes #171

diracdeltas · 2019-02-06T22:49:25Z

+        'www.example.org'
+      )
+      assert.equal(queryResult.matches, true)
+    })


I suggest adding some tests that use non-trivial eTLD+1s, like example.co.uk and example.githubusercontent.com

There are many, but they're at the C++ level. They're in this file: https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc

ex: https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc#L488

i see thanks
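For context on why these cases are non-trivial: the eTLD+1 is the public suffix plus one more label, so taking the last two labels gives the wrong answer for example.co.uk or example.githubusercontent.com. The following is a minimal sketch, not the code in this PR. `EtldPlusOne`, `IsThirdParty`, and the small hard-coded suffix set used in the usage note are hypothetical, and the sketch ignores the wildcard and exception rules that a real public suffix list parser has to handle.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper (not from the PR): returns the eTLD+1 of `host`, given a
// set of plain public suffixes. Wildcard ("*.ck") and exception ("!www.ck")
// rules are deliberately not handled in this sketch.
std::string EtldPlusOne(const std::string& host,
                        const std::set<std::string>& suffixes) {
  std::size_t prev_start = 0;  // start of the label just left of the candidate
  std::size_t start = 0;       // start of the current candidate suffix
  while (true) {
    // Candidates are tried longest first, so the first hit is the longest
    // matching suffix.
    const std::string candidate = host.substr(start);
    if (suffixes.count(candidate) > 0) {
      // The host is itself a public suffix, so it has no eTLD+1.
      if (start == 0) return "";
      return host.substr(prev_start);
    }
    const std::size_t dot = host.find('.', start);
    if (dot == std::string::npos) break;
    prev_start = start;
    start = dot + 1;
  }
  // No rule matched: treat the last label as the suffix. Single-label hosts
  // are returned unchanged.
  return host.substr(prev_start);
}

// Two hosts are third party to each other when their eTLD+1s differ.
bool IsThirdParty(const std::string& a, const std::string& b,
                  const std::set<std::string>& suffixes) {
  return EtldPlusOne(a, suffixes) != EtldPlusOne(b, suffixes);
}
```

With an assumed suffix set of {"com", "org", "uk", "co.uk", "githubusercontent.com"}, `EtldPlusOne("www.example.co.uk", ...)` yields "example.co.uk", and a.example.co.uk and b.example.co.uk are treated as first party to each other.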

diracdeltas · 2019-02-06T22:54:32Z

general question about this approach:

is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

pes10k · 2019-02-06T22:56:31Z

> general question about this approach:
>
> is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

We could, but then we'd lose the ability to run in node (which has been very valuable for crawling / measurement, getting other folks to use the code, debugging, etc.)

fmarier · 2019-02-09T01:28:26Z

@@ -0,0 +1,51 @@
+/* Copyright (c) 2018 The Brave Software Team. Distributed under the MPL2


nit: Should be 2019. This applies to all of the new files added in this PR.

fixed with 81cc4f4 :)

Maybe I'm confused about how GitHub is displaying the latest version of your PR, but it seems like you fixed all files, except for etld/domain.cc?

No, i goofed, apologies for the butter fingers. Fixed now

fmarier · 2019-02-09T02:05:30Z

I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

pes10k · 2019-02-09T02:10:28Z

> I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

You're right, no need for this. I removed it from the .gitignore previously, and have now also removed it from the set of tracked files. Should be good now.

pes10k · 2019-02-09T02:36:46Z

@bbondy my code expects the public suffix list to be in a known location, and lazily parses the list on first use (i.e. at etld/data/<list>.dat). I have no idea if this will work when rolled into the larger browser. I'm just not familiar enough with the build process. Can you double check that aspect?

fmarier · 2019-02-11T18:53:05Z

+.vscode
+
+# These files are either fetched at build time, or generated from the build
+etld/data/public_suffix_list.h


nit: I usually try to make these absolute paths with respect to the location of the .gitignore file (i.e. /etld/data/public_suffix.h) in order to avoid unexpected matches somewhere else in the codebase. Not very likely in this case, so feel free to ignore this suggestion, but a good habit IMHO.

fmarier · 2019-02-11T18:54:29Z


 build:
-	 ./node_modules/.bin/node-gyp configure && ./node_modules/.bin/node-gyp build
+	curl -s https://publicsuffix.org/list/public_suffix_list.dat -o etld/data/public_suffix_list.dat


Kudos for not using -k here :)

nit: If you want to make that curl line even stricter, you could also throw in --tlsv1.2 to enforce a minimum level of TLS.

fmarier · 2019-02-11T19:05:12Z

+  constructorArgs.push(isException ? "true" : "false");
+
+  const wrappedLabels = labels.map(JSON.stringify);
+  constructorArgs.push("{" + wrappedLabels.join(", ") + "}");


Should the labels be surrounded by " in case they're not all bare words? Also, is it possible they include " characters that should be escaped?

fmarier · 2019-02-11T19:15:20Z

+  }
+
+  if (previous != 0) {
+    labels_.push_back(string.substr(previous, current - previous));


I believe this call will fail if you get invalid input like: abcd.efgh. (note the trailing dot). Might be worth adding a test for it if there isn't one already.
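A minimal sketch of label splitting that rejects a trailing dot, assuming empty labels should be treated as invalid input. `SplitLabels` is a hypothetical helper, not the PR's code. As background, `std::string::substr(pos, len)` only throws `std::out_of_range` when `pos > size()`; with `pos == size()` it returns an empty string, so an input like `abcd.efgh.` can quietly produce an empty label unless that case is checked explicitly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits `domain` on '.' into `labels`. Returns false and
// leaves `labels` empty if any label is empty, which covers empty input, a
// leading dot, consecutive dots, and a trailing dot.
bool SplitLabels(const std::string& domain, std::vector<std::string>* labels) {
  assert(labels != nullptr);
  labels->clear();
  std::size_t previous = 0;
  while (true) {
    const std::size_t current = domain.find('.', previous);
    const std::size_t end =
        (current == std::string::npos) ? domain.size() : current;
    if (end == previous) {  // zero-length label
      labels->clear();
      return false;
    }
    labels->push_back(domain.substr(previous, end - previous));
    if (current == std::string::npos) return true;
    previous = current + 1;
  }
}
```

A test for the trailing-dot case would then check that `SplitLabels("abcd.efgh.", &labels)` returns false, while `abcd.efgh` yields the two labels `abcd` and `efgh`.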

pes10k · 2019-02-22T06:10:02Z

@bbondy this is now ready for review again. The ways to enable the eTLD+1 checking (by parsing a public suffix list) are:

1. When using the check.js script, use the new -P, --public-suffix-rules-path option, and point it to a text file containing public suffix rules.
2. Use the JS AdBlockClient.parsePublicSuffixRules method and give it a string containing public suffix rules.
3. Use the C++ AdBlockClient::parsePublicSuffixRules method with a char* / std::string of rules.
4. Use either the C++ or JS deserialize methods with a dat file that includes public suffix rules data (serializing after doing 1, 2 or 3 will include the public suffix rule data in the .dat).

bbondy

I added a commit to fix npm run perf. Unfortunately, this is currently regressing perf by 4-5x overall. We'll need to optimize that so matching speed is essentially the same as before.

bbondy · 2019-02-27T02:58:26Z

+  while (current != std::string::npos) {
+    current_label = label_text.substr(previous, current - previous);
+    if (current_label.length() == 0) {
+      throw PublicSuffixRuleInputException(


Chromium is built with exceptions disabled, and this would be the first exception.
I'd recommend instead passing in a pointer to a vector and then filling it. And make the return value the result, and propagate failures.
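A minimal sketch of that shape, assuming a result enum in place of `PublicSuffixRuleInputException`. `ParseResult`, `ParsedRule`, and `ParseRule` are hypothetical names, not the PR's API. The caller passes a pointer to fill, the return value carries success or the failure reason, and the output is only written on success.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result codes standing in for the thrown exception.
enum class ParseResult { kOk, kEmptyRule, kEmptyLabel, kInvalidCharacter };

struct ParsedRule {
  bool is_exception = false;        // rule started with '!'
  std::vector<std::string> labels;  // e.g. {"co", "uk"} for "co.uk"
};

// Parses one public suffix rule into `out` without throwing. `out` is left
// untouched unless the result is kOk, so callers can propagate failures.
ParseResult ParseRule(const std::string& rule_text, ParsedRule* out) {
  assert(out != nullptr);
  // Keep only the text before the first whitespace character.
  const std::string rule =
      rule_text.substr(0, rule_text.find_first_of(" \t\r\n"));
  if (rule.empty()) return ParseResult::kEmptyRule;
  // A '/' never appears in a rule; it usually means a "//" comment line.
  if (rule.find('/') != std::string::npos) {
    return ParseResult::kInvalidCharacter;
  }

  ParsedRule parsed;
  std::size_t previous = 0;
  if (rule[0] == '!') {
    parsed.is_exception = true;
    previous = 1;
  }
  while (true) {
    const std::size_t current = rule.find('.', previous);
    const std::size_t end =
        (current == std::string::npos) ? rule.size() : current;
    if (end == previous) return ParseResult::kEmptyLabel;
    parsed.labels.push_back(rule.substr(previous, end - previous));
    if (current == std::string::npos) break;
    previous = current + 1;
  }
  *out = parsed;
  return ParseResult::kOk;
}
```

A list parser built on this could skip lines that return anything other than `kOk`, or stop and return the first failure to its own caller, depending on how strict it needs to be.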

bbondy · 2019-02-27T02:58:45Z

+  // Don't include any trailing whitespace, if there is any.
+  current_label = label_text.substr(previous, current - previous);
+  if (current_label == "") {
+    throw PublicSuffixRuleInputException(


ditto exceptions

bbondy · 2019-02-27T02:58:49Z

+PublicSuffixRule::PublicSuffixRule(const std::string& rule_text) {
+  std::string trimmed_rule_text(trim_to_whitespace(rule_text));
+  if (trimmed_rule_text.length() == 0) {
+    throw PublicSuffixRuleInputException(


ditto exceptions

bbondy · 2019-02-27T02:58:57Z

+      break;
+
+    case '/':
+      throw PublicSuffixRuleInputException(


ditto exceptions

bbondy · 2019-02-27T03:05:57Z

+    before(function () {
+      this.client = new AdBlockClient()
+      this.client.parse('||bannersnack.com^$third-party')
+      const etldRules = fs.readFileSync('./test/data/public_suffix_list.dat', 'utf8')


Can we do some similar tests for when we don't call parsePublicSuffixRules and it falls back to the warning with FQDN?

You can probably just add a for loop which loops twice using different clients around all the it calls.

bbondy · 2019-02-27T03:11:51Z

+      this.client.parsePublicSuffixRules(etldRules)
+    })
+    it('consider eTLD+1 domains as 1p', function () {
+      const altSubDomainQuery = this.client.findMatchingFilters(


could we add tests for matches too?

bbondy

Please squash these 3 commits together by using git rebase -i

linter fixes / cleanup
further linter fixes / cleanup
even more lint cleanup

pes10k requested a review from bbondy February 4, 2019 16:58

diracdeltas reviewed Feb 6, 2019

fmarier reviewed Feb 9, 2019

pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from be4c62e to 33ff5d9 Compare February 9, 2019 06:03

fmarier reviewed Feb 11, 2019

bbondy added the work-in-progress label Feb 12, 2019

pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from 2919690 to 8d72272 Compare February 22, 2019 06:04

pes10k removed the work-in-progress label Feb 22, 2019

pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from d1ada9e to 86fb8a9 Compare February 22, 2019 06:37

bbondy suggested changes Feb 23, 2019

pes10k and others added 7 commits February 24, 2019 17:53

init

d2e2fba

Complete initial implementation of using public suffix list for eTLD+1

3481216

linter fixes

3d90511

Use psl in perf application

adb1daf

refactor to avoid unneeded allocations

863d1be

improve matching algorithm, be _way_ more careful about when copying / re-alloc-ing objects

454b911

remove final malloc / new in matching process

d4a9635

pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from 0fcdfc4 to 76bdb66 Compare February 25, 2019 01:55

pes10k requested a review from bbondy February 26, 2019 01:26

fmarier mentioned this pull request Feb 26, 2019

Tweetdeck does not load with brave shields enabled. brave/brave-browser#3495

Closed

bsclifton assigned pes10k Feb 26, 2019

bbondy suggested changes Feb 27, 2019

bbondy reviewed Feb 27, 2019

pes10k added 5 commits February 28, 2019 13:47

linter fixes / cleanup

42ae4f4

fix memory leak during serialization

cd45c83

Avoid some more possible stack allocations, const a bunch of more things

1ef0ded

remove exceptions

4ce828b

remove adding eTLD+1 serilization code into dat, add matching tests, FQDN tests

ad0dab0

pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from d25f24d to ad0dab0 Compare February 28, 2019 22:03

return non PSL behavior of isThirdPartyHost to is previous behavior

395d70d
