Commit 3cdae85

Author: Kennedy Omondi
feat(extractor): implement Ruby parser

Parse the bundled SERP HTML into a SerpApi-shaped array of `{name, extensions, link, image}` without making HTTP requests.

- Detect carousel tiles by structural signal (`/search?...&stick=...` sibling groups), not volatile Google CSS classes, so the parser works across Van Gogh and variant fixtures.
- Resolve thumbnails by parsing `_setImagesSrc(ii, s, r)` blocks into an `id -> image` map, including unescaping `\x3d` and `\/` values emitted in inline JS.
- Extract `extensions` from leaf text nodes under each anchor to avoid container-text noise (for example, concatenated `name+year`).
- Resolve `image` from values already present in the page file: inline JS mapping, inline non-placeholder data URIs, and in-file `data-src`/`src` URLs.
- Add comprehensive RSpec coverage for golden output, cross-layout fixtures, item parsing, thumbnail indexing, and carousel selection behavior.

1 parent 49645f7 commit 3cdae85
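
The thumbnail step described in the commit message can be illustrated with a small stdlib-only sketch. It assumes each inline script has the shape `var s='...';var ii=['id', ...];var r='';_setImagesSrc(ii,s,r);`, which this diff does not confirm. `ThumbnailSketch`, `unescape_js`, and `index` are hypothetical names for illustration, not the contents of `lib/extractor/thumbnail_index.rb`.

```ruby
# Hypothetical sketch: build an id -> image map from inline JS blocks.
# The script shape matched by BLOCK is an assumption, not taken from the commit.
module ThumbnailSketch
  # Captures the image source string and the list of element ids it applies to.
  BLOCK = /var\s+s\s*=\s*'(?<src>[^']*)'\s*;\s*var\s+ii\s*=\s*\[(?<ids>[^\]]*)\]/

  # Undo the two escapes named in the commit message: \xNN and \/.
  def self.unescape_js(value)
    value
      .gsub(/\\x([0-9a-fA-F]{2})/) { Regexp.last_match(1).hex.chr }
      .gsub("\\/", "/")
  end

  # Scan the raw HTML text and map every listed id to its unescaped source.
  def self.index(html)
    html.scan(BLOCK).each_with_object({}) do |(src, ids), map|
      ids.scan(/'([^']+)'/).flatten.each { |id| map[id] = unescape_js(src) }
    end
  end
end
```

A tile's `<img id="...">` could then be resolved with a plain hash lookup, falling back to `data-src`/`src` when the id is absent from the map.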

20 files changed

Lines changed: 1093 additions & 17 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -49,3 +49,4 @@ build-iPhoneSimulator/
 # unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
 .rvmrc
 .DS_Store
+.vscode/

.rspec

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+--require spec_helper
+--format documentation
+--color

Gemfile

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+source "https://rubygems.org"
+
+gem "nokogiri", "~> 1.16"
+
+group :development, :test do
+  gem "rspec", "~> 3.13"
+  gem "debug"
+end

Gemfile.lock

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    date (3.5.1)
+    debug (1.11.1)
+      irb (~> 1.10)
+      reline (>= 0.3.8)
+    diff-lcs (1.6.2)
+    erb (6.0.4)
+    io-console (0.8.2)
+    irb (1.18.0)
+      pp (>= 0.6.0)
+      prism (>= 1.3.0)
+      rdoc (>= 4.0.0)
+      reline (>= 0.4.2)
+    nokogiri (1.19.3-aarch64-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.19.3-aarch64-linux-musl)
+      racc (~> 1.4)
+    nokogiri (1.19.3-arm-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.19.3-arm-linux-musl)
+      racc (~> 1.4)
+    nokogiri (1.19.3-arm64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.19.3-x86_64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.19.3-x86_64-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.19.3-x86_64-linux-musl)
+      racc (~> 1.4)
+    pp (0.6.3)
+      prettyprint
+    prettyprint (0.2.0)
+    prism (1.9.0)
+    psych (5.3.1)
+      date
+      stringio
+    racc (1.8.1)
+    rdoc (7.2.0)
+      erb
+      psych (>= 4.0.0)
+      tsort
+    reline (0.6.3)
+      io-console (~> 0.5)
+    rspec (3.13.2)
+      rspec-core (~> 3.13.0)
+      rspec-expectations (~> 3.13.0)
+      rspec-mocks (~> 3.13.0)
+    rspec-core (3.13.6)
+      rspec-support (~> 3.13.0)
+    rspec-expectations (3.13.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-mocks (3.13.8)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-support (3.13.7)
+    stringio (3.2.0)
+    tsort (0.2.0)
+
+PLATFORMS
+  aarch64-linux-gnu
+  aarch64-linux-musl
+  arm-linux-gnu
+  arm-linux-musl
+  arm64-darwin
+  x86_64-darwin
+  x86_64-linux-gnu
+  x86_64-linux-musl
+
+DEPENDENCIES
+  debug
+  nokogiri (~> 1.16)
+  rspec (~> 3.13)
+
+CHECKSUMS
+  bundler (4.0.12) sha256=7f8b757d28dfb636e7b24fba2344ac6dd13b5b24f4b46d62573d483f211825ac
+  date (3.5.1) sha256=750d06384d7b9c15d562c76291407d89e368dda4d4fff957eb94962d325a0dc0
+  debug (1.11.1) sha256=2e0b0ac6119f2207a6f8ac7d4a73ca8eb4e440f64da0a3136c30343146e952b6
+  diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
+  erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
+  io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
+  irb (1.18.0) sha256=de9454a0703a54704b9811a5ef31a60c86949fbf4013fcf244fabc7c775248e3
+  nokogiri (1.19.3-aarch64-linux-gnu) sha256=46b89e5d7b9e844c2ee360794240c6ea2a4e6fa0c5892a4ed487db621224b639
+  nokogiri (1.19.3-aarch64-linux-musl) sha256=8392dfdcd21be7a94dbbe9ccc138dea01b97b24cb2dc02a114ca98bfb1d9a0b7
+  nokogiri (1.19.3-arm-linux-gnu) sha256=3919d5ffc334ad778a4a9eb88fda7dcb8b1fb58c8a52ac640c6dcd2f038e774f
+  nokogiri (1.19.3-arm-linux-musl) sha256=9ce1cb6346bb9c67b1550eb537aa183ead91e4b6eadb2f36ade02d8dd2a79fb6
+  nokogiri (1.19.3-arm64-darwin) sha256=71b9bd424b1b7abc18b05052a1a3cfd3627abdca62be280854cc411791357e42
+  nokogiri (1.19.3-x86_64-darwin) sha256=77f3fba57d46c53ab31e62fc6c28f705109d1bf6264356c76f132b2be5728d4d
+  nokogiri (1.19.3-x86_64-linux-gnu) sha256=2f5078620fe12e83669b5b17311b32532a8153d02eee7ad06948b926d6080976
+  nokogiri (1.19.3-x86_64-linux-musl) sha256=248c906d2166eca5efb56d52fdee5f9a1f51d69a72e2b64fdac647b4ce39ea3f
+  pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
+  prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
+  prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
+  psych (5.3.1) sha256=eb7a57cef10c9d70173ff74e739d843ac3b2c019a003de48447b2963d81b1974
+  racc (1.8.1) sha256=4a7f6929691dbec8b5209a0b373bc2614882b55fc5d2e447a21aaa691303d62f
+  rdoc (7.2.0) sha256=8650f76cd4009c3b54955eb5d7e3a075c60a57276766ebf36f9085e8c9f23192
+  reline (0.6.3) sha256=1198b04973565b36ec0f11542ab3f5cfeeec34823f4e54cebde90968092b1835
+  rspec (3.13.2) sha256=206284a08ad798e61f86d7ca3e376718d52c0bc944626b2349266f239f820587
+  rspec-core (3.13.6) sha256=a8823c6411667b60a8bca135364351dda34cd55e44ff94c4be4633b37d828b2d
+  rspec-expectations (3.13.5) sha256=33a4d3a1d95060aea4c94e9f237030a8f9eae5615e9bd85718fe3a09e4b58836
+  rspec-mocks (3.13.8) sha256=086ad3d3d17533f4237643de0b5c42f04b66348c28bf6b9c2d3f4a3b01af1d47
+  rspec-support (3.13.7) sha256=0640e5570872aafefd79867901deeeeb40b0c9875a36b983d85f54fb7381c47c
+  stringio (3.2.0) sha256=c37cb2e58b4ffbd33fe5cd948c05934af997b36e0b6ca6fdf43afa234cf222e1
+  tsort (0.2.0) sha256=9650a793f6859a43b6641671278f79cfead60ac714148aabe4e3f0060480089f
+
+BUNDLED WITH
+   4.0.12

README.md

Lines changed: 127 additions & 17 deletions
@@ -1,28 +1,138 @@
-# Extract Van Gogh Paintings Code Challenge

-Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
+# Van Gogh Carousel Extractor

-![Van Gogh paintings](https://github.com/serpapi/code-challenge/blob/master/files/van-gogh-paintings.png?raw=true "Van Gogh paintings")
+Ruby solution for extracting Google knowledge-graph carousel items from a local
+SERP HTML snapshot.

-## Instructions
+## Challenge Requirements (Assignment)

-This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
-Try to come up with your own solution and your own test.
-Extract the painting `name`, `extensions` array (date), and Google `link` in an array.
+These are the original task requirements:

-Fork this repository and make a PR when ready.
+1. Parse the provided Google results HTML directly (no extra HTTP requests).
+2. Extract carousel items with:
+   - `name`
+   - `extensions` array (for example year)
+   - `link`
+3. Add thumbnails present in the page file.
+4. Test against 2 other similar result pages with the same carousel pattern.

-Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.
+Reference assets from the challenge:
+- HTML fixture: `files/van-gogh-paintings.html`
+- Expected output: `files/expected-array.json`
+- Challenge screenshot: `files/van-gogh-paintings.png`

-Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.
+## Solution in This Repo

-[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
-[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
-[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
-[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
+This repository implements those requirements using Ruby + Nokogiri + RSpec.

-Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
+## What it does

-Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
+Given an HTML file like `files/van-gogh-paintings.html`, it returns an array of:

-The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
+- `name`
+- `extensions` (for example year)
+- `link` (absolute Google search URL)
+- `image` (inline `data:` image or in-file URL thumbnail when present)
+
+The extractor does not make HTTP requests.
+
+## How it works
+
+Pipeline:
+
+1. Parse HTML once with Nokogiri.
+2. Build thumbnail index from inline `_setImagesSrc(...)` script blocks.
+3. Detect the best carousel tile group using structural signals (`stick=` anchors).
+4. Parse each tile into the output schema.
+
+Core files:
+
+- `lib/extractor.rb` - public entrypoint.
+- `lib/extractor/carousel.rb` - carousel detection.
+- `lib/extractor/item.rb` - field extraction per tile.
+- `lib/extractor/thumbnail_index.rb` - inline JS image resolution.
+
+## Setup
+
+Prerequisites:
+
+- Ruby
+- Bundler
+
+Install dependencies:
+
+```bash
+bundle install
+```
+
+## Usage
+
+Run extractor on the provided fixture:
+
+```bash
+bin/extract files/van-gogh-paintings.html
+```
+
+Save output:
+
+```bash
+bin/extract files/van-gogh-paintings.html > /tmp/artworks.json
+```
+
+Programmatic use:
+
+```ruby
+require_relative "lib/extractor"
+
+result = Extractor.call("files/van-gogh-paintings.html")
+puts result.first
+```
+
+## Testing
+
+Run all specs:
+
+```bash
+bundle exec rspec
+```
+
+Run one spec file:
+
+```bash
+bundle exec rspec spec/extractor_spec.rb
+```
+
+## Linting
+
+Run syntax lint checks:
+
+```bash
+bundle exec bin/lint
+```
+
+`bin/lint` runs `ruby -wc` across `lib/`, `spec/`, and `bin/`.
+
+## CI (GitHub Actions)
+
+Run automated checks on every PR.
+The included workflow runs:
+
+1. `bundle exec bin/lint`
+2. `bundle exec rspec`
+
+See `.github/workflows/ci.yml`.
+
+## Requirement Alignment
+
+This implementation is aligned to the provided expected fixture output:
+
+- 47 total items
+- 8 inline `data:` images
+- 39 in-file URL thumbnails (`https://encrypted-tbn...`)
+- 0 image `nil` values on the Van Gogh expected fixture
+
+## Original challenge references
+
+- Relevant test: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
+- HTML fixture: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
+- Expected array: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
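
Step 3 of the README pipeline (structural `stick=` detection) and the absolute `link` field can be sketched with the standard library alone. This is a minimal illustration, not the code in `lib/extractor/carousel.rb`: the names `LinkSketch`, `carousel_href?`, `best_group`, and `absolute` are hypothetical, the `https://www.google.com` base is an assumption, and picking the group with the most matching anchors is one plausible scoring rule rather than the commit's confirmed one.

```ruby
require "uri"

# Hypothetical sketch of the structural signal: a tile anchor points at
# /search and carries a stick= query parameter.
module LinkSketch
  GOOGLE_BASE = "https://www.google.com" # assumed base for relative hrefs

  # True when the href targets /search and has a stick parameter.
  def self.carousel_href?(href)
    uri = URI.parse(href)
    return false unless uri.path == "/search" && uri.query

    URI.decode_www_form(uri.query).any? { |key, _value| key == "stick" }
  rescue URI::InvalidURIError, ArgumentError
    false
  end

  # Given sibling groups (arrays of hrefs), pick the one with the most
  # carousel-style anchors. The scoring rule is an assumption.
  def self.best_group(groups)
    groups.max_by { |hrefs| hrefs.count { |href| carousel_href?(href) } }
  end

  # Turn a relative Google href into an absolute URL.
  def self.absolute(href)
    URI.join(GOOGLE_BASE, href).to_s
  end
end
```

In a Nokogiri-based version, each group would come from the `href` attributes of anchors sharing a parent node. The predicate itself has no dependency on CSS class names, which matches the motivation stated in the commit message.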

RUN.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# Run Guide
+
+This file contains setup, run, lint, and test commands for this script.
+
+## Prerequisites
+
+- Ruby
+- Bundler
+
+## Setup
+
+Install dependencies:
+
+```bash
+bundle install
+```
+
+## Run Extractor
+
+Run extractor on the provided fixture:
+
+```bash
+bin/extract files/van-gogh-paintings.html
+```
+
+Save output:
+
+```bash
+bin/extract files/van-gogh-paintings.html > /tmp/artworks.json
+```
+
+Programmatic use:
+
+```ruby
+require_relative "lib/extractor"
+
+result = Extractor.call("files/van-gogh-paintings.html")
+puts result.first
+```
+
+## Run Tests
+
+Run all specs:
+
+```bash
+bundle exec rspec
+```
+
+Run one spec file:
+
+```bash
+bundle exec rspec spec/extractor_spec.rb
+```
+
+## Lint
+
+Run syntax lint checks:
+
+```bash
+bundle exec bin/lint
+```
+
+`bin/lint` runs `ruby -wc` across `lib/`, `spec/`, and `bin/`.
+
+## Local Quality Checks
+
+1. `bundle exec bin/lint`
+2. `bundle exec rspec`

bin/extract

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+#!/usr/bin/env ruby
+$LOAD_PATH.unshift File.expand_path("../lib", __dir__)
+require "extractor"
+require "json"
+
+if ARGV.empty?
+  warn "Usage: bin/extract <path-to-html>"
+  exit 64
+end
+
+puts JSON.pretty_generate("artworks" => Extractor.call(ARGV[0]))
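
Because `bin/extract` wraps the array under an `artworks` key, the counts listed in the README's "Requirement Alignment" section could be spot-checked by tallying image kinds in the saved JSON. The sketch below is illustrative only; `TallySketch`, `image_kind`, and `tally` are hypothetical names and are not part of this commit. It needs Ruby 2.7 or newer for `Enumerable#tally`.

```ruby
require "json"

# Hypothetical sketch: classify each item's image and count the kinds.
module TallySketch
  # :missing for nil, :inline for data: URIs, :url for everything else.
  def self.image_kind(image)
    return :missing if image.nil?

    image.start_with?("data:") ? :inline : :url
  end

  # Parse the JSON emitted by bin/extract and count items per image kind.
  def self.tally(json_text)
    items = JSON.parse(json_text).fetch("artworks")
    items.map { |item| image_kind(item["image"]) }.tally
  end
end
```

For example, `TallySketch.tally(File.read("/tmp/artworks.json"))` after running the "Save output" command. If the README's figures hold for the Van Gogh fixture, the result should show 8 `:inline` and 39 `:url` entries with no `:missing` key.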

bin/lint

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+require "rbconfig"
+
+files = Dir["lib/**/*.rb", "spec/**/*.rb", "bin/*"]
+files.select! { |path| File.file?(path) }
+files.sort!
+
+failed = []
+
+files.each do |path|
+  ok = system(RbConfig.ruby, "-wc", path)
+  failed << path unless ok
+end
+
+if failed.any?
+  warn "\nLint failed for #{failed.size} file(s):"
+  failed.each { |path| warn "- #{path}" }
+  exit 1
+end
+
+puts "\nLint passed for #{files.size} file(s)."