lib/extractor.rb
Lines changed: 1 addition & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -4,6 +4,7 @@
4
4
require_relative "extractor/carousel"
5
5
require_relative "extractor/item"
6
6
7
+
# scrapeMemo pseudocode: letting the Item class detect whether the Carousel class has found a relevant scrapeMemo record would probably require changing Extractor from a module into a class
7
8
module Extractor
8
9
# Public facade. Returns an Array<Hash> of carousel items.
lib/extractor/carousel.rb
Lines changed: 27 additions & 49 deletions
Original file line number
Diff line number
Diff line change
@@ -21,11 +21,19 @@ def self.tiles(document)
21
21
# it without threading it through every method call.
22
22
def initialize(document)
23
23
@document = document
24
-
@root_selector = nil
25
24
end
26
25
27
26
def tiles
28
-
groups = candidate_groups
27
+
# scrapeMemo pseudocode: create an empty scrapeMemo hash to serve as an index for future parsing of the same search result structure (data-attrid, tile grid container class, tile root class, tile count, name_attribute, image_script_variable_names)
# scrapeMemo pseudocode: if div['data-attrid'] can't be found, record its absence in the scrapeMemo hash
32
+
# scrapeMemo pseudocode: check the database for any records containing the same ['data-attrid'] value
33
+
# scrapeMemo pseudocode: if one or more records exist, scan for the tile grid container class, prioritizing the most recently created record
34
+
# scrapeMemo pseudocode: if the tile grid container exists and has the expected number of children with the expected tile root class, set them as the tile roots and skip the rest of this method
35
+
36
+
groups = candidate_groups(target_section)
29
37
return [] if groups.empty?
30
38
31
39
# Multiple stick-link groups can exist on one page. We prefer the group
# scrapeMemo pseudocode: as with the div['data-attrid'] value earlier, we can now compare the tile grid container class, tile root class, tile count, and div['data-attrid'] value against recorded indexes to detect search result structure drift
39
48
end
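Review note: the scrapeMemo comments in `tiles` describe a lookup keyed on `data-attrid` followed by a drift check. A minimal sketch of that idea follows, assuming an in-memory store in place of the database the comments mention. `ScrapeMemo`, `MemoStore`, and `drift` are hypothetical names, not part of this commit; the field names are taken from the comment listing what the memo should index.

```ruby
# Hypothetical sketch: none of these names exist in the commit.
# Fields mirror the list in the scrapeMemo comment, plus a creation marker.
ScrapeMemo = Struct.new(
  :data_attrid, :container_class, :tile_root_class, :tile_count,
  :name_attribute, :image_script_variable_names, :created_at,
  keyword_init: true
)

# In-memory stand-in for the database the comments refer to (assumption).
class MemoStore
  def initialize
    @records = []
  end

  def save(memo)
    @records << memo
    memo
  end

  # All memos sharing a data-attrid value, most recently created first.
  def lookup(data_attrid)
    @records
      .select { |memo| memo.data_attrid == data_attrid }
      .sort_by(&:created_at)
      .reverse
  end
end

# Names of the structural fields whose recorded value differs from what was
# just observed on the page. An empty result means no drift was detected.
def drift(memo, observed)
  %i[container_class tile_root_class tile_count].reject do |field|
    memo[field] == observed[field]
  end
end
```

With something like this, `tiles` could try each memo returned by `lookup` in order and fall back to `candidate_groups` only when `drift` reports a difference for all of them.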
40
49
41
50
private
42
51
43
-
# Build candidate groups by:
44
-
# 1. Finding every `/search?…&stick=…` anchor.
45
-
# 2. Walking each anchor up to its *tile root* — the highest ancestor
46
-
# that still contains exactly one stick anchor.
47
-
# 3. Grouping tile roots by their common parent. A group with
48
-
# MIN_TILES+ siblings is a carousel candidate.
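Review note: the three steps in the removed comment can be sketched without an HTML parser. The block below is an illustration only, assuming a tiny hand-rolled `Node` class in place of real parsed elements and an assumed `MIN_TILES` value of 3 (the actual constant is not shown in this diff).

```ruby
# Hypothetical stand-in for a parsed DOM element (not the project's real type).
class Node
  attr_reader :name, :href, :children
  attr_accessor :parent

  def initialize(name, href: nil, children: [])
    @name = name
    @href = href
    @children = children
    children.each { |child| child.parent = self }
  end

  # Step 1: an anchor pointing at /search?...stick=...
  def stick_anchor?
    name == "a" && href.to_s.start_with?("/search?") && href.include?("stick=")
  end

  # Number of stick anchors in this subtree, including this node.
  def stick_count
    (stick_anchor? ? 1 : 0) + children.sum(&:stick_count)
  end

  # Depth-first traversal of this subtree.
  def each_node(&block)
    yield self
    children.each { |child| child.each_node(&block) }
  end
end

MIN_TILES = 3 # assumed value for illustration

# Step 2: climb while the parent still contains exactly one stick anchor.
def tile_root(anchor)
  node = anchor
  node = node.parent while node.parent && node.parent.stick_count == 1
  node
end

# Step 3: group tile roots by common parent and keep groups that are large enough.
def candidate_groups(root)
  anchors = []
  root.each_node { |node| anchors << node if node.stick_anchor? }
  anchors.map { |anchor| tile_root(anchor) }
         .group_by(&:parent)
         .values
         .select { |group| group.size >= MIN_TILES }
end
```

Recomputing `stick_count` on every climb is quadratic in the worst case; that is acceptable for a sketch but would likely want memoizing on a large document.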
49
-
def candidate_groups
50
-
# Structural fingerprint that avoids volatile CSS class names.
# scrapeMemo pseudocode: if the Carousel class detected a relevant scrapeMemo index, the recorded function name, variable names, and (maybe?) variable order could be used to construct a single regex in place of the three separate regexes I'm using now
# Instead of skipping when image_setter_regex fails to match because the image_setter function has been renamed, we could modify the source & ids regexes to look for properly formatted variable assignments, then look for a function taking those variables as parameters near the very end of the script
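Review note: a rough sketch of the single-regex idea from the comments above. The script shape it targets (a quoted source assignment, an ids array assignment, then a setter call taking the ids and source variables in that order) is an assumption made for illustration; the real inline scripts and the existing three regexes are not shown in this diff. `image_script_regex` and `GENERIC_IMAGE_SCRIPT` are hypothetical names.

```ruby
# Builds one regex from recorded names (hypothetical helper, assumed script shape):
#   var <source_var>='...';var <ids_var>=[...];<setter>(<ids_var>,<source_var>)
def image_script_regex(setter:, source_var:, ids_var:)
  s, i, f = [source_var, ids_var, setter].map { |name| Regexp.escape(name) }
  /var\s+#{s}\s*=\s*'(?<source>[^']*)'\s*;\s*var\s+#{i}\s*=\s*\[(?<ids>[^\]]*)\]\s*;\s*#{f}\(\s*#{i}\s*,\s*#{s}\s*\)/
end

# Fallback for a renamed setter: accept any variable names, then require a call
# whose arguments are those same two variables (checked via backreferences).
GENERIC_IMAGE_SCRIPT = /
  var\s+(?<source_var>\w+)\s*=\s*'(?<source>[^']*)'\s*;\s*
  var\s+(?<ids_var>\w+)\s*=\s*\[(?<ids>[^\]]*)\]\s*;\s*
  (?<setter>[\w$.]+)\(\s*\k<ids_var>\s*,\s*\k<source_var>\s*\)
/x

# Tries the memo-derived regex first, then the generic fallback.
# Returns a MatchData or nil.
def match_image_script(script, memo_names = nil)
  match = memo_names && image_script_regex(**memo_names).match(script)
  match || GENERIC_IMAGE_SCRIPT.match(script)
end
```

When the fallback matches, its `setter`, `source_var`, and `ids_var` captures are exactly the values a new scrapeMemo record would need, so a rename could update the memo instead of causing a skip.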