Commit 82bb73a

Improve robustness of image map creation
One flaw I noticed in nearly all competitors' solutions was that they rely on Google's image lazy-load script never changing in any way. A more robust solution than mine would also account for the _setImagesSrc function name changing, probably by relying only on the data:image structure as the initial clue. That would make scanning the first script more computationally expensive, but the detected variable names could then be used to speed up processing of subsequent scripts. Hopefully, Google never decides to combine all of its lazy-loading scripts into one. I'm not sure how that could be detected efficiently, but I'm sure I could find a way, given enough time.
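A minimal sketch of the more robust idea described in the commit message, not part of this commit: start from the data:image assignment, then look for any function call that receives the array variable and the data variable, whatever the function is named. The names `JS_IDENT`, `DATA_VAR_REGEX`, `ARRAY_VAR_REGEX`, and `detect_image_setter` are hypothetical, and the sketch assumes the script still declares both values with `var` and single-quoted strings. It does not decode JS escapes the way the class's `unescape_js` does.

```ruby
# Hypothetical sketch: none of these names exist in the commit.
JS_IDENT = /[A-Za-z_$][\w$]*/
DATA_VAR_REGEX = /var\s+(?<source>#{JS_IDENT})\s*=\s*'(?<data>data:image\/[^']+)'\s*;/
ARRAY_VAR_REGEX = /var\s+(?<id>#{JS_IDENT})\s*=\s*\[(?<ids>[^\]]*)\]\s*;/
ID_REGEX = /'([^']+)'/

# Returns the detected setter name, ids, and raw data URI, or nil when the
# script body does not contain the expected shape.
def detect_image_setter(body)
  source = body.match(DATA_VAR_REGEX)
  return nil if source.nil?

  body.scan(ARRAY_VAR_REGEX) do
    array = Regexp.last_match
    # Look for any call of the form fn(<array var>, <data var>, ...).
    call = /(?<fn>#{JS_IDENT})\(\s*#{Regexp.escape(array[:id])}\s*,\s*#{Regexp.escape(source[:source])}\s*[,)]/
    found = body.match(call)
    next if found.nil?

    return {
      function: found[:fn],
      ids: array[:ids].scan(ID_REGEX).flatten,
      data: source[:data]
    }
  end
  nil
end
```

The detected `:function` name could then serve as a cheap `include?` pre-filter for the remaining scripts, which is the speed-up the message alludes to.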
1 parent 81a2bb2 commit 82bb73a

2 files changed

Lines changed: 22 additions & 16 deletions

lib/extractor/thumbnail_index.rb

Lines changed: 16 additions & 16 deletions
@@ -10,12 +10,9 @@ module Extractor
   class ThumbnailIndex
     # Greedy on the data URI body, anchored on the trailing `_setImagesSrc(ii,s,r)`
     # to avoid mis-pairing s/ii from adjacent script blocks.
-    BLOCK_REGEX = /
-      var\s+s\s*=\s*'(?<data>data:image\/[^']+)'\s*;\s*
-      var\s+ii\s*=\s*\[(?<ids>[^\]]*)\]\s*;\s*
-      var\s+r\s*=\s*'[^']*'\s*;\s*
-      _setImagesSrc\(ii,\s*s,\s*r\)
-    /xm.freeze
+
+
+    IMAGE_SETTER_REGEX = /_setImagesSrc\((?<id>[a-z]*),\s*(?<source>[a-z]*)/.freeze
 
     ID_REGEX = /'([^']+)'/.freeze
 
@@ -32,16 +29,19 @@ def build
       mapping = {}
       @document.css("script").each do |script|
         body = script.content
-        # Cheap pre-filter avoids regex scanning unrelated scripts.
-        next unless body.include?("_setImagesSrc")
-
-        body.scan(BLOCK_REGEX) do
-          match = Regexp.last_match
-          # Decode JS escapes so the resulting data URI matches browser output.
-          data_uri = unescape_js(match[:data])
-          # One data URI can map to multiple image ids.
-          match[:ids].scan(ID_REGEX) { |(id)| mapping[id] = data_uri }
-        end
+        desired_variables = body.match(IMAGE_SETTER_REGEX)
+        next if desired_variables.nil?
+
+        source_regex = /var\s+#{desired_variables[:source]}\s*=\s*'(?<data>data:image\/[^']+)'\s*;/.freeze
+        ids_regex = /var\s+#{desired_variables[:id]}\s*=\s*\[(?<ids>[^\]]*)\]\s*;/.freeze
+
+        source_result = body.match(source_regex)
+        ids_result = body.match(ids_regex)
+
+        # Decode JS escapes so the resulting data URI matches browser output.
+        data_uri = unescape_js(source_result[:data])
+        # One data URI can map to multiple image ids.
+        ids_result[:ids].scan(ID_REGEX) { |(id)| mapping[id] = data_uri }
       end
       mapping
     end
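One thing the new code path appears to leave open: if a script calls `_setImagesSrc` but one of the two `var` declarations does not match, `source_result` or `ids_result` is nil and the `[:data]` or `[:ids]` lookup would raise NoMethodError. A minimal sketch of a guarded version of the per-script step follows. `pairs_for` is a hypothetical helper that does not exist in the commit, the sketch skips `unescape_js`, and it uses `[a-z]+` instead of the commit's `[a-z]*` so that an empty variable name is never accepted.

```ruby
# Hypothetical sketch: `pairs_for` is not part of the commit.
IMAGE_SETTER_REGEX = /_setImagesSrc\((?<id>[a-z]+),\s*(?<source>[a-z]+)/
ID_REGEX = /'([^']+)'/

# Returns { id => data_uri } for one script body, or {} when any piece
# of the expected structure is missing.
def pairs_for(body)
  vars = body.match(IMAGE_SETTER_REGEX)
  return {} if vars.nil?

  # The captured names are limited to [a-z]+, so interpolating them is safe.
  source_result = body.match(/var\s+#{vars[:source]}\s*=\s*'(?<data>data:image\/[^']+)'\s*;/)
  ids_result = body.match(/var\s+#{vars[:id]}\s*=\s*\[(?<ids>[^\]]*)\]\s*;/)
  return {} if source_result.nil? || ids_result.nil?

  # One data URI can map to multiple image ids.
  ids_result[:ids].scan(ID_REGEX).flatten.to_h { |id| [id, source_result[:data]] }
end
```

Inside `build`, the returned hash could be folded in with `mapping.merge!(pairs_for(body))`, so a malformed script is skipped rather than aborting the whole scan.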

spec/thumbnail_index_spec.rb

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,12 @@ def doc_for(script_body)
     expect(described_class.build(doc_for(body))).to eq("img_1" => "data:image/jpeg;base64,ABC")
   end
 
+  it "can handle different variable names with different script structure" do
+    body = "(function(){var ids=['img_1'];var source='data:image/jpeg;base64,ABC';" \
+           "_setImagesSrc(ids,source);})();"
+    expect(described_class.build(doc_for(body))).to eq("img_1" => "data:image/jpeg;base64,ABC")
+  end
+
   it "maps multiple ids in the same call to the same URI" do
     body = "(function(){var s='data:image/jpeg;base64,XYZ';" \
            "var ii=['a','b','c'];var r='';_setImagesSrc(ii,s,r);})();"
