|
1 | | -# Extract Van Gogh Paintings Code Challenge |
2 | 1 |
|
3 | | -Goal is to extract a list of Van Gogh paintings from the attached Google search results page. |
| 2 | +# Van Gogh Carousel Extractor |
4 | 3 |
|
5 | | - |
| 4 | +Ruby solution for extracting Google knowledge-graph carousel items from a local |
| 5 | +SERP HTML snapshot. |
6 | 6 |
|
7 | | -## Instructions |
| 7 | +## Challenge Requirements (Assignment) |
8 | 8 |
|
9 | | -This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].) |
10 | | -Try to come up with your own solution and your own test. |
11 | | -Extract the painting `name`, `extensions` array (date), and Google `link` in an array. |
| 9 | +These are the original task requirements: |
12 | 10 |
|
13 | | -Fork this repository and make a PR when ready. |
| 11 | +1. Parse the provided Google results HTML directly (no extra HTTP requests). |
| 12 | +2. Extract carousel items with: |
| 13 | + - `name` |
| 14 | + - `extensions` array (for example year) |
| 15 | + - `link` |
| 16 | +3. Add thumbnails present in the page file. |
| 17 | +4. Test against 2 other similar result pages with the same carousel pattern. |
14 | 18 |
|
15 | | -Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like. |
| 19 | +Reference assets from the challenge: |
| 20 | +- HTML fixture: `files/van-gogh-paintings.html` |
| 21 | +- Expected output: `files/expected-array.json` |
| 22 | +- Challenge screenshot: `files/van-gogh-paintings.png` |
16 | 23 |
|
17 | | -Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything. |
| 24 | +## Solution in This Repo |
18 | 25 |
|
19 | | -[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb |
20 | | -[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json |
21 | | -[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html |
22 | | -[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json |
| 26 | +This repository implements those requirements using Ruby + Nokogiri + RSpec. |
23 | 27 |
|
24 | | -Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed). |
| 28 | +## What it does |
25 | 29 |
|
26 | | -Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.) |
| 30 | +Given an HTML file like `files/van-gogh-paintings.html`, it returns an array of: |
27 | 31 |
|
28 | | -The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want. |
| 32 | +- `name` |
| 33 | +- `extensions` (for example year) |
| 34 | +- `link` (absolute Google search URL) |
| 35 | +- `image` (inline `data:` image or in-file URL thumbnail when present) |
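
One way the `link` field could be made absolute is sketched below. This is a minimal illustration using only Ruby's standard library, not the code in `lib/extractor/item.rb`; the helper name `absolute_link` and the base origin `https://www.google.com` are assumptions made for the example.

```ruby
require "uri"

# Assumed origin for resolving relative hrefs (not confirmed by the repo).
GOOGLE_BASE = "https://www.google.com"

# Hypothetical helper: resolves a tile's href against the Google origin.
# Relative hrefs become absolute; already-absolute hrefs pass through unchanged.
def absolute_link(href)
  return nil if href.nil? || href.empty?

  URI.join(GOOGLE_BASE, href).to_s
end

puts absolute_link("/search?q=The+Starry+Night&stick=H4sI")
```

`URI.join` raises `URI::InvalidURIError` on hrefs it cannot parse, so a real implementation would probably need to decide whether to rescue that or let it surface.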
| 36 | + |
| 37 | +The extractor does not make HTTP requests. |
| 38 | + |
| 39 | +## How it works |
| 40 | + |
| 41 | +Pipeline: |
| 42 | + |
| 43 | +1. Parse HTML once with Nokogiri. |
| 44 | +2. Build thumbnail index from inline `_setImagesSrc(...)` script blocks. |
| 45 | +3. Detect the best carousel tile group using structural signals (`stick=` anchors). |
| 46 | +4. Parse each tile into the output schema. |
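
Step 2 of the pipeline above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in `lib/extractor/thumbnail_index.rb`. It assumes each inline script has the shape `var s='<src>';var ii=['<id>', ...];_setImagesSrc(ii,s);` and that the source string may contain JavaScript `\xNN` escapes; both the script shape and the names `SET_IMAGES_PATTERN` and `build_thumbnail_index` are assumptions.

```ruby
# Assumed shape of the inline script: var s='...';var ii=['id'];_setImagesSrc(ii,s);
SET_IMAGES_PATTERN =
  /var s\s*=\s*'([^']*)';\s*var ii\s*=\s*\[([^\]]*)\];\s*_setImagesSrc\(ii,\s*s\)/

# Hypothetical helper: maps each image element id to its inline image source.
def build_thumbnail_index(html)
  index = {}
  html.scan(SET_IMAGES_PATTERN) do |src, ids|
    # Undo JavaScript \xNN escapes (for example \x3d for "=").
    decoded = src.gsub(/\\x([0-9a-fA-F]{2})/) do
      Regexp.last_match(1).hex.chr(Encoding::UTF_8)
    end
    # One script may assign the same source to several element ids.
    ids.scan(/'([^']+)'/).flatten.each { |id| index[id] = decoded }
  end
  index
end

# Made-up sample input; the ids and payload are illustrative only.
sample = <<~'HTML'
  <script>(function(){var s='data:image/jpeg;base64,AAAA\x3d\x3d';var ii=['img_1','img_2'];_setImagesSrc(ii,s);})();</script>
HTML

index = build_thumbnail_index(sample)
puts index["img_1"]
```

With an index like this, a tile's `<img id="...">` can be looked up without any network access, which is consistent with the no-HTTP requirement.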
| 47 | + |
| 48 | +Core files: |
| 49 | + |
| 50 | +- `lib/extractor.rb` - public entrypoint. |
| 51 | +- `lib/extractor/carousel.rb` - carousel detection. |
| 52 | +- `lib/extractor/item.rb` - field extraction per tile. |
| 53 | +- `lib/extractor/thumbnail_index.rb` - inline JS image resolution. |
| 54 | + |
| 55 | +## Setup |
| 56 | + |
| 57 | +Prerequisites: |
| 58 | + |
| 59 | +- Ruby |
| 60 | +- Bundler |
| 61 | + |
| 62 | +Install dependencies: |
| 63 | + |
| 64 | +```bash |
| 65 | +bundle install |
| 66 | +``` |
| 67 | + |
| 68 | +## Usage |
| 69 | + |
| 70 | +Run extractor on the provided fixture: |
| 71 | + |
| 72 | +```bash |
| 73 | +bin/extract files/van-gogh-paintings.html |
| 74 | +``` |
| 75 | + |
| 76 | +Save output: |
| 77 | + |
| 78 | +```bash |
| 79 | +bin/extract files/van-gogh-paintings.html > /tmp/artworks.json |
| 80 | +``` |
| 81 | + |
| 82 | +Programmatic use: |
| 83 | + |
| 84 | +```ruby |
| 85 | +require_relative "lib/extractor" |
| 86 | + |
| 87 | +result = Extractor.call("files/van-gogh-paintings.html") |
| 88 | +puts result.first |
| 89 | +``` |
| 90 | + |
| 91 | +## Testing |
| 92 | + |
| 93 | +Run all specs: |
| 94 | + |
| 95 | +```bash |
| 96 | +bundle exec rspec |
| 97 | +``` |
| 98 | + |
| 99 | +Run one spec file: |
| 100 | + |
| 101 | +```bash |
| 102 | +bundle exec rspec spec/extractor_spec.rb |
| 103 | +``` |
| 104 | + |
| 105 | +## Linting |
| 106 | + |
| 107 | +Run syntax lint checks: |
| 108 | + |
| 109 | +```bash |
| 110 | +bundle exec bin/lint |
| 111 | +``` |
| 112 | + |
| 113 | +`bin/lint` runs `ruby -wc` across `lib/`, `spec/`, and `bin/`. |
| 114 | + |
| 115 | +## CI (GitHub Actions) |
| 116 | + |
| 117 | +Automated checks run on every PR. |
| 118 | +The included workflow runs: |
| 119 | + |
| 120 | +1. `bundle exec bin/lint` |
| 121 | +2. `bundle exec rspec` |
| 122 | + |
| 123 | +See `.github/workflows/ci.yml`. |
| 124 | + |
| 125 | +## Requirement Alignment |
| 126 | + |
| 127 | +For the Van Gogh fixture, the output matches the provided expected array: |
| 128 | + |
| 129 | +- 47 total items |
| 130 | +- 8 inline `data:` images |
| 131 | +- 39 in-file URL thumbnails (`https://encrypted-tbn...`) |
| 132 | +- 0 items with a `nil` image |
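
Counts like the ones above could be reproduced with a small tally over the extractor output. The sketch below uses a made-up three-item sample rather than the real fixture, and the helper `image_kind` is hypothetical; only the field names come from the schema described in this README.

```ruby
require "json"

# Made-up sample standing in for extractor output (not real fixture data).
items = JSON.parse(<<~JSON)
  [
    {"name": "A", "extensions": ["1889"], "link": "https://www.google.com/search?q=A", "image": "data:image/jpeg;base64,AAAA"},
    {"name": "B", "extensions": ["1888"], "link": "https://www.google.com/search?q=B", "image": "https://example.com/thumb-b.jpg"},
    {"name": "C", "link": "https://www.google.com/search?q=C", "image": null}
  ]
JSON

# Hypothetical helper: classifies an image value as inline, URL, or missing.
def image_kind(image)
  return :missing if image.nil?
  return :inline if image.start_with?("data:")

  :url
end

counts = Hash.new(0)
items.each { |item| counts[image_kind(item["image"])] += 1 }

puts "total: #{items.size}"
puts counts
```

Run against the real output, a tally of this kind would be expected to report the totals listed above, although that depends on the output having the same shape as the sample.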
| 133 | + |
| 134 | +## Original challenge references |
| 135 | + |
| 136 | +- Relevant test: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb |
| 137 | +- HTML fixture: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html |
| 138 | +- Expected array: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json |