Enhanced Painting Extractor with Selenium WebDriver #325

Mujadded · 2025-04-25T20:27:16Z

Summary

This PR reworks the extractor so that the page is first rendered with Selenium WebDriver and then parsed with Nokogiri. This lets it extract painting data from Google search results whose carousel content is built by JavaScript.

Key Changes

Added Selenium WebDriver Integration
- Added Selenium WebDriver to handle JavaScript rendering before HTML parsing
- Configured headless Chrome with options suited to batch processing
- Implemented a wait after page load so that JavaScript execution can complete
- Continued to use Nokogiri for HTML parsing after rendering

Enhanced Image Extraction
- Implemented placeholder image detection to handle lazy-loaded images
- Added logic to read the data-src attribute when a placeholder image is detected
- Improved image selection to pick the highest-quality image available

Improved Link Handling
- Added functionality to extract the actual links from anchor tags
- Implemented a helper for converting relative links to absolute Google URLs
- Added fallback search link generation for items missing explicit links

Refined Output Formatting
- Structured the JSON output to match the expected format
- Ensured the extensions field is always wrapped in an array
- Maintained consistent field ordering in the output

Comprehensive Testing
- Created test fixtures with various selector patterns
- Added specs for placeholder image detection, link normalization, and data structure formatting
- Implemented mocks for Selenium so the specs run without launching a browser
- Added two further test cases with different layouts, as required
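
The link handling listed under Key Changes could look roughly like the sketch below. It is an illustration rather than the code in lib/painting_extractor.rb: the helper names (absolute_google_url, fallback_search_link, painting_link) and the https://www.google.com base are assumptions, and only Ruby's standard uri library is used.

```ruby
require "uri"

# Assumed base for resolving relative hrefs; not taken from the PR's code.
GOOGLE_BASE = "https://www.google.com"

# Hypothetical helper: resolve an href against the Google base URL.
# Returns nil for missing, blank, or unparseable hrefs.
def absolute_google_url(href)
  return nil if href.nil? || href.strip.empty?

  URI.join(GOOGLE_BASE, href.strip).to_s
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: build a search link from the painting name
# for items that have no usable anchor.
def fallback_search_link(name)
  "#{GOOGLE_BASE}/search?q=#{URI.encode_www_form_component(name)}"
end

# Prefer the link found in the markup, otherwise fall back to a search link.
def painting_link(href, name)
  absolute_google_url(href) || fallback_search_link(name)
end
```

With these helpers, painting_link("/search?q=The+Starry+Night", "The Starry Night") and painting_link(nil, "The Starry Night") both resolve to https://www.google.com/search?q=The+Starry+Night. An href that is already absolute is returned unchanged, because URI.join ignores the base in that case.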

Project Structure

.
├── extract.rb                        # Main executable script
├── fixtures/                         # Test fixtures
│   ├── sample.html                   # Main test fixture for unit tests
│   ├── van-gogh-paintings.html       # Original Van Gogh paintings test case
│   ├── pablo-picasso-paintings.html  # Picasso paintings test case
│   └── leonardo-da-vinci-paintings.html # Da Vinci paintings test case
├── lib/
│   ├── cli.rb                        # Command-line interface
│   └── painting_extractor.rb         # Core extractor implementation
├── outputs/                          # Output files from test runs
│   ├── output-van-gogh.json          # Results from Van Gogh test case
│   ├── output-pablo.json             # Results from Picasso test case
│   └── output-leonardo.json          # Results from Da Vinci test case
├── spec/
│   ├── spec_helper.rb                # RSpec configuration
│   └── painting_extractor_spec.rb    # Tests for the extractor
└── tmp/                              # Temporary files used during tests

How to Run the Extractor

The extractor can be run from the command line with the following syntax:

ruby extract.rb <input_html_file> <output_json_file>
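
A two-argument interface like this is usually guarded by a small check before any work starts. The sketch below shows one way to do that; parse_cli_args is a hypothetical name and the error messages are illustrative, so it should not be read as the contents of lib/cli.rb.

```ruby
# Hypothetical argument check for a CLI that takes exactly two paths:
# an input HTML file and an output JSON file.
def parse_cli_args(argv)
  unless argv.length == 2
    raise ArgumentError, "usage: ruby extract.rb <input_html_file> <output_json_file>"
  end

  input_path, output_path = argv
  # Fail early if the input does not exist, before any browser is started.
  raise ArgumentError, "input file not found: #{input_path}" unless File.file?(input_path)

  [input_path, output_path]
end
```

Raising ArgumentError keeps the check easy to test; an executable script would typically rescue it, print the message, and exit with a non-zero status.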

Examples

Extract from the original Van Gogh paintings test case:

ruby extract.rb fixtures/van-gogh-paintings.html outputs/output-van-gogh.json

Extract from additional test cases:

# Extract from the Pablo Picasso paintings test case
ruby extract.rb fixtures/pablo-picasso-paintings.html outputs/output-pablo.json

# Extract from the Leonardo da Vinci paintings test case
ruby extract.rb fixtures/leonardo-da-vinci-paintings.html outputs/output-leonardo.json

Run tests:

# Run all specs
bundle exec rspec

# Run specific spec file
bundle exec rspec spec/painting_extractor_spec.rb

Why Selenium Instead of Just Nokogiri?

Nokogiri alone was insufficient because:

- The Google search results page contains JavaScript-rendered content, with the image carousel built dynamically
- Initial attempts with just Nokogiri resulted in capturing only placeholder images (data URI or empty src)
- The real image URLs are loaded after JavaScript execution, often stored in data-src attributes
- Selenium allows the page to fully render before extraction, ensuring all dynamic content is processed
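
The render-then-parse flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: build_headless_driver and rendered_html are hypothetical names, the Chrome flags and the two-second wait are illustrative values, and the driver is passed in so that a test double can replace the browser.

```ruby
# Hypothetical factory for a headless Chrome driver.
# Needs the selenium-webdriver gem and a local Chrome install.
def build_headless_driver
  require "selenium-webdriver" # loaded lazily so the helpers below work without it

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument("--headless=new") # assumed flag; older Chrome uses --headless
  options.add_argument("--disable-gpu")
  options.add_argument("--no-sandbox")
  Selenium::WebDriver.for(:chrome, options: options)
end

# Opens a local HTML file in the given driver, waits a fixed time for
# scripts to run, and returns the rendered markup as a string.
# The driver is always closed, even if loading fails.
def rendered_html(path, driver:, wait_seconds: 2)
  driver.get("file://#{File.expand_path(path)}")
  sleep(wait_seconds) # fixed wait; an explicit wait on a selector is an alternative
  driver.page_source
ensure
  driver.quit
end

# Usage (needs the selenium-webdriver and nokogiri gems):
#   html = rendered_html("fixtures/van-gogh-paintings.html",
#                        driver: build_headless_driver)
#   doc  = Nokogiri::HTML(html)
```

Injecting the driver is a design choice that fits the mocked Selenium specs mentioned under Key Changes: any object that responds to get, page_source, and quit can stand in for the browser.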

Why Placeholder Image Detection?

Google search uses a lazy-loading strategy for images:

- Initial page load includes placeholder images (base64-encoded tiny images or empty src)
- JavaScript then replaces these with actual images when they enter the viewport
- Without proper detection and handling, we would extract placeholder images instead of the actual artwork
- The placeholder detection identifies these patterns and retrieves the real image URL from data-src attributes
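
The detection rule described above can be sketched in a few lines. The names placeholder_image? and resolve_image_url are hypothetical, and the rule itself (a missing, blank, or data: URI src counts as a placeholder) is an assumption based on the patterns listed here rather than a copy of the extractor's logic.

```ruby
# Assumption: a src that is missing, blank, or an inline data: URI
# is a lazy-loading placeholder.
def placeholder_image?(src)
  src.nil? || src.strip.empty? || src.strip.start_with?("data:")
end

# `img` can be anything that supports img["attr"], such as a
# Nokogiri::XML::Node or a plain Hash with string keys.
def resolve_image_url(img)
  src = img["src"]
  return src unless placeholder_image?(src)

  # Prefer data-src when the src is a placeholder; if data-src is also
  # missing, keep whatever src holds (possibly nil).
  data_src = img["data-src"]
  (data_src.nil? || data_src.strip.empty?) ? src : data_src
end
```

Because both helpers only use the [] accessor, the specs can exercise them with hashes such as { "src" => "", "data-src" => "https://example.com/real.jpg" } without building a DOM.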

Additional Test Cases

As requested in the instructions, I've tested the extractor against two additional carousel layouts:

Pablo Picasso Paintings Test Case
- Located in fixtures/pablo-picasso-paintings.html (1.8MB)
- Output saved to outputs/output-pablo.json (160KB, 353 lines)
- Successfully extracts a comprehensive set of Picasso's artwork

Leonardo da Vinci Paintings Test Case
- Located in fixtures/leonardo-da-vinci-paintings.html (1.5MB)
- Output saved to outputs/output-leonardo.json (98KB, 341 lines)
- Demonstrates the extractor's adaptability to different HTML structures

Both test cases confirm that the extractor works with different layouts while keeping the output format consistent. Together with the original Van Gogh page, that makes three result pages from which data is extracted successfully.

Requirements Fulfilled

This PR addresses all requirements from the original challenge:

- Extract painting name, extensions array (date), and Google link in an array
- Parse directly from the HTML result page with no extra HTTP requests
- Include painting thumbnails present in the result page file
- Test against 2 other similar result pages with different layouts
- Implemented in Ruby with RSpec tests
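
To make the output requirement concrete, here is a minimal sketch of how one extracted item could be assembled and serialized. The field names name, extensions, link, and image follow the fields mentioned in this description; the top-level "artworks" key, the field order, and the helper names build_artwork and artworks_json are assumptions for illustration.

```ruby
require "json"

# Hypothetical builder: fixed key order, with extensions always an array.
# Array("1889") gives ["1889"], and Array(nil) gives [].
def build_artwork(name:, extensions:, link:, image:)
  {
    "name" => name,
    "extensions" => Array(extensions),
    "link" => link,
    "image" => image
  }
end

# Serializes the list under an assumed top-level "artworks" key.
def artworks_json(artworks)
  JSON.pretty_generate("artworks" => artworks)
end
```

Ruby hashes preserve insertion order, so building every item through one function is enough to keep the field order consistent across the whole output file.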

Mujadded added 2 commits April 25, 2025 21:17

- first working version (9df5aba)
- added outputs (43c8f27)
