Skip to content

Code Challenge submission #327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
source "https://rubygems.org"

gem 'rspec'
gem 'nokolexbor'
gem 'selenium-webdriver'
42 changes: 42 additions & 0 deletions Gemfile.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
GEM
remote: https://rubygems.org/
specs:
base64 (0.2.0)
diff-lcs (1.6.1)
logger (1.7.0)
nokolexbor (0.6.0)
nokolexbor (0.6.0-arm64-darwin)
rexml (3.4.1)
rspec (3.13.0)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.3)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.3)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.2)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.2)
rubyzip (2.4.1)
selenium-webdriver (4.31.0)
base64 (~> 0.2)
logger (~> 1.4)
rexml (~> 3.2, >= 3.2.5)
rubyzip (>= 1.2.2, < 3.0)
websocket (~> 1.0)
websocket (1.2.11)

PLATFORMS
arm64-darwin-22
ruby

DEPENDENCIES
nokolexbor
rspec
selenium-webdriver

BUNDLED WITH
2.6.3
8 changes: 8 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,11 @@ Add also to your array the painting thumbnails present in the result page file (
Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.

## Commentary

The exercise for me had two parts. I started with the assumption that using a "pure" HTML parser, like Nokogiri would be enough. For the task, I chose Nokolexbor: David Sojevic talked about it during the first interview, and I decided to use it instead of Nokogiri. I built using TDD, so I first wrote a test based on the files already in the repo, and started working from there. I reviewed the structure of the HTML file, which is the same one you get using Google Search as of today (hopefully that doesn't change very often) and found the classes in which the relevant information is stored. That worked for three of the four attributes: name, extensions and link. It didn't worked for image. Taking another look at the page, it became clear that there was some Javascript that modified the HTML, replacing the images once the page loads.

Here started the second part of the exercise for me. How to deal with the modifications on the executed with JS? I thought of two ways. The first, the more direct one and the one I finally chose, was to open the page with a "real" browser, and then fetch the modified HTML from it. This is what is commonly done with integration tests. The second one would be to check what the JS was doing and try to go around it without executing the JS itself. This second option would be more performant (browsers are comparatively slow), but way more complicated and easier to break. Therefore, I decided to go with the first option.

To implement it, I first used Watir (something like Capybara), but later I realized that, given that I didn't need to manipulate the HTML at all, I could just use Selenium to load the HTML. At this point, I did consider remove the Nokolexbor dependency and just fetch all the elements using Selenium, but the difference would be small, and the possibility that Nokolexbor was more performant intrigued me. Moreover, it's your exercise, so it's nice to use your library. :)
33 changes: 33 additions & 0 deletions files/miro-artworks.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions files/miro.json

Large diffs are not rendered by default.

54 changes: 54 additions & 0 deletions files/rodin-artworks.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions files/rodin.json

Large diffs are not rendered by default.

56 changes: 56 additions & 0 deletions lib/google_search_artworks_parser.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
require 'nokolexbor'
require 'open-uri'
require 'selenium-webdriver'

class GoogleSearchArtworksParser
def self.parse(location)
driver.navigate.to process_location(location)
doc = Nokolexbor::HTML(driver.page_source)
objects = doc.css(".iELo6")

{"artworks" => process_objects(objects)}
end

private

def self.driver
@driver ||= Selenium::WebDriver.for :chrome
end

def self.process_location(location)
if url?(location)
location
else
"File://" + Dir.pwd + "/" + location
end
end

def self.process_objects(objects)
objects.map { |o| process_object(o) }
end

def self.process_object(object)
link_object = object.css('a').first

{
"link" => "https://www.google.com" + link_object['href'],
"name" => link_object.css('.pgNMRc').first.content,
}.tap do |h|
h["image"] = get_image(link_object)

extensions = object.css('.cxzHyb').first.content
h['extensions'] = [extensions] if extensions && extensions != ''
end
end

def self.get_image(link_object)
link_object.css('img').first['data-src'] || link_object.css('img').first['src']
end

def self.url?(string)
uri = URI.parse(string)
uri.scheme && uri.host
rescue URI::InvalidURIError
false
end
end
77 changes: 77 additions & 0 deletions spec/google_search_artworks_parser_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
require 'google_search_artworks_parser'
require 'json'

RSpec.describe GoogleSearchArtworksParser do
ARTWORK_PROPERTIES = %w{extensions image link name}

describe '.parse' do
describe 'Van Gogh example' do
before do
@expected_result = JSON.parse(File.read('files/expected-array.json'))
@result = GoogleSearchArtworksParser.parse('files/van-gogh-paintings.html')
end

it 'returns the right amount of artworks' do
expect(@result['artworks'].count).to eq(@expected_result['artworks'].count)
end

it 'returns elements with the right properties' do
expect(@result['artworks'].first.keys.sort).to eq(ARTWORK_PROPERTIES)
end

ARTWORK_PROPERTIES.each do |artwork_property|
it "returns elements with the right #{artwork_property}" do
expected_attributes = @expected_result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }

expect(@result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }).to eq(expected_attributes)
end
end
end

describe 'Rodin example' do
before do
@expected_result = JSON.parse(File.read('files/rodin.json'))
@result = GoogleSearchArtworksParser.parse('files/rodin-artworks.html')
end

it 'returns the right amount of artworks' do
expect(@result['artworks'].count).to eq(@expected_result['artworks'].count)
end

it 'returns elements with the right properties' do
expect(@result['artworks'].first.keys.sort).to eq(ARTWORK_PROPERTIES)
end

ARTWORK_PROPERTIES.each do |artwork_property|
it "returns elements with the right #{artwork_property}" do
expected_attributes = @expected_result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }

expect(@result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }).to eq(expected_attributes)
end
end
end

describe 'Miro example' do
before do
@expected_result = JSON.parse(File.read('files/miro.json'))
@result = GoogleSearchArtworksParser.parse('files/miro-artworks.html')
end

it 'returns the right amount of artworks' do
expect(@result['artworks'].count).to eq(@expected_result['artworks'].count)
end

it 'returns elements with the right properties' do
expect(@result['artworks'].first.keys.sort).to eq(ARTWORK_PROPERTIES)
end

ARTWORK_PROPERTIES.each do |artwork_property|
it "returns elements with the right #{artwork_property}" do
expected_attributes = @expected_result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }

expect(@result['artworks'].map { |artwork| artwork.fetch(artwork_property, nil) }).to eq(expected_attributes)
end
end
end
end
end