Enhanced Painting Extractor with Selenium WebDriver #325

Open: 2 commits to merge into base: master

@Mujadded Mujadded commented Apr 25, 2025

Summary

This PR changes the extraction approach: the page is first rendered with Selenium WebDriver and then parsed with Nokogiri, so that painting data is extracted correctly from Google search results.

Key Changes

  1. Added Selenium WebDriver Integration

    • Added Selenium WebDriver to handle JavaScript rendering before HTML parsing
    • Configured headless Chrome with appropriate options for efficient processing
    • Implemented wait time to ensure JavaScript execution completes
    • Continued to use Nokogiri for robust HTML parsing after rendering
  2. Enhanced Image Extraction

    • Implemented placeholder image detection to handle lazy-loaded images
    • Added logic to capture data-src attributes when placeholder images are detected
    • Improved image selection to get the highest quality available
  3. Improved Link Handling

    • Added functionality to extract actual links from anchor tags
    • Implemented helper for converting relative links to absolute Google URLs
    • Added fallback search link generation for items missing explicit links
  4. Refined Output Formatting

    • Structured JSON output to match expected format
    • Ensured extensions field is properly wrapped in an array
    • Maintained consistent field ordering in the output
  5. Comprehensive Testing

    • Created test fixtures with various selector patterns
    • Added specs for placeholder image detection, link normalization, and data structure formatting
    • Implemented mocks for Selenium so the specs can run without launching a browser
    • Added two additional test cases with different layouts as required
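
The link handling and output formatting described in items 3 and 4 could look roughly like the sketch below. It is not the code from this PR: the method names (absolute_google_link, fallback_search_link, painting_record), the GOOGLE_BASE constant, and the key names and their order are assumptions made for illustration.

```ruby
require "json"
require "uri"

# Assumed base used to resolve relative hrefs (hypothetical constant).
GOOGLE_BASE = "https://www.google.com"

# Turn a relative href such as "/search?q=..." into an absolute Google URL.
# Absolute URLs pass through unchanged; nil or blank input yields nil.
def absolute_google_link(href)
  return nil if href.nil? || href.strip.empty?

  URI.join(GOOGLE_BASE, href.strip).to_s
end

# Fallback for items without an anchor: a plain search link for the name.
def fallback_search_link(name)
  "#{GOOGLE_BASE}/search?#{URI.encode_www_form(q: name)}"
end

# Build one record with a fixed key order. Array() wraps a single date in
# an array and leaves an existing array untouched.
def painting_record(name:, date: nil, href: nil, image: nil)
  record = { "name" => name }
  record["extensions"] = Array(date) unless date.nil?
  record["link"] = absolute_google_link(href) || fallback_search_link(name)
  record["image"] = image
  record
end

puts JSON.pretty_generate(painting_record(name: "The Starry Night", date: "1889"))
```

Ruby hashes keep insertion order, so building the record in one place is enough to keep the field order stable in the JSON output.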

Project Structure

.
├── extract.rb                        # Main executable script
├── fixtures/                         # Test fixtures
│   ├── sample.html                   # Main test fixture for unit tests
│   ├── van-gogh-paintings.html       # Original Van Gogh paintings test case
│   ├── pablo-picasso-paintings.html  # Picasso paintings test case
│   └── leonardo-da-vinci-paintings.html # Da Vinci paintings test case
├── lib/
│   ├── cli.rb                        # Command-line interface
│   └── painting_extractor.rb         # Core extractor implementation
├── outputs/                          # Output files from test runs
│   ├── output-van-gogh.json          # Results from Van Gogh test case
│   ├── output-pablo.json             # Results from Picasso test case
│   └── output-leonardo.json          # Results from Da Vinci test case
├── spec/
│   ├── spec_helper.rb                # RSpec configuration
│   └── painting_extractor_spec.rb    # Tests for the extractor
└── tmp/                              # Temporary files used during tests

How to Run the Extractor

The extractor can be run from the command line with the following syntax:

ruby extract.rb <input_html_file> <output_json_file>
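
For readers who want to see how such a two-argument entry point can be wired up, here is a minimal sketch. It is not the contents of lib/cli.rb: run_cli and its extractor: parameter are hypothetical names, and the extractor is passed in as a callable so the sketch does not depend on the real PaintingExtractor.

```ruby
require "json"

# Validate ARGV-style arguments and write the extractor's result as JSON.
# `extractor` is any callable that takes the input path and returns data
# that JSON can serialize. Returns a process exit status (0 on success).
def run_cli(argv, extractor:, err: $stderr)
  unless argv.length == 2
    err.puts "Usage: ruby extract.rb <input_html_file> <output_json_file>"
    return 1
  end

  input_path, output_path = argv
  unless File.file?(input_path)
    err.puts "Input file not found: #{input_path}"
    return 1
  end

  File.write(output_path, JSON.pretty_generate(extractor.call(input_path)))
  0
end
```

A script could then end with exit(run_cli(ARGV, extractor: some_callable)), where some_callable stands in for the actual extraction step.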

Examples

  1. Extract from the original Van Gogh paintings test case:
ruby extract.rb fixtures/van-gogh-paintings.html outputs/output-van-gogh.json
  2. Extract from additional test cases:
# Extract from the Pablo Picasso paintings test case
ruby extract.rb fixtures/pablo-picasso-paintings.html outputs/output-pablo.json

# Extract from the Leonardo da Vinci paintings test case
ruby extract.rb fixtures/leonardo-da-vinci-paintings.html outputs/output-leonardo.json
  3. Run tests:
# Run all specs
bundle exec rspec

# Run specific spec file
bundle exec rspec spec/painting_extractor_spec.rb

Why Selenium Instead of Just Nokogiri?

Nokogiri alone was insufficient because:

  1. The Google search results page contains JavaScript-rendered content, with the image carousel built dynamically
  2. Initial attempts with just Nokogiri resulted in capturing only placeholder images (data URI or empty src)
  3. The real image URLs are loaded after JavaScript execution, often stored in data-src attributes
  4. Selenium allows the page to fully render before extraction, ensuring all dynamic content is processed
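
The render-then-parse flow can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: build_headless_chrome and rendered_html are hypothetical names, the Chrome flags and the two-second pause are guesses, and the driver is passed in so that a test double can replace the real browser.

```ruby
# Build a headless Chrome driver. The require sits inside the method so
# this file still loads where the selenium-webdriver gem is not installed.
def build_headless_chrome
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument("--headless=new") # assumed flag set
  options.add_argument("--disable-gpu")
  Selenium::WebDriver.for(:chrome, options: options)
end

# Load a saved results page from disk, give its scripts time to run, and
# return the rendered HTML. `driver` only needs #get, #page_source, #quit.
def rendered_html(path, driver:, wait_seconds: 2)
  driver.get("file://#{File.expand_path(path)}")
  sleep(wait_seconds) # fixed pause; waiting on a specific element is stricter
  driver.page_source
ensure
  driver.quit # always release the browser, even if loading fails
end
```

With the gems installed, rendered_html("fixtures/van-gogh-paintings.html", driver: build_headless_chrome) would return a string that can be handed to Nokogiri::HTML for parsing.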

Why Placeholder Image Detection?

Google search uses a lazy-loading strategy for images:

  1. Initial page load includes placeholder images (base64-encoded tiny images or empty src)
  2. JavaScript then replaces these with actual images when they enter the viewport
  3. Without proper detection and handling, we would extract placeholder images instead of the actual artwork
  4. The placeholder detection identifies these patterns and retrieves the real image URL from data-src attributes
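
A minimal sketch of this detection, assuming that an empty src or any data: URI counts as a placeholder (the patterns named above). The method names placeholder_image? and image_url are hypothetical. The node argument only needs to respond to [] for attribute lookup, which a Nokogiri element does and a plain Hash does too.

```ruby
# True when src is missing, blank, or an inline data: URI.
def placeholder_image?(src)
  value = src.to_s.strip
  value.empty? || value.start_with?("data:")
end

# Pick the image URL for an <img>-like node: use src when it is a real URL,
# otherwise fall back to data-src, otherwise return nil.
def image_url(node)
  src = node["src"]
  return src unless placeholder_image?(src)

  data_src = node["data-src"]
  placeholder_image?(data_src) ? nil : data_src
end

# Usage with a Hash standing in for a parsed <img> element:
img = { "src" => "data:image/gif;base64,AAAA", "data-src" => "https://example.com/real.jpg" }
puts image_url(img)
```

Returning nil when neither attribute holds a usable URL keeps the decision about a missing thumbnail with the caller.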

Additional Test Cases

As requested in the instructions, I've tested the extractor against two additional carousel layouts:

  1. Pablo Picasso Paintings Test Case

    • Located in fixtures/pablo-picasso-paintings.html (1.8MB)
    • Output saved to outputs/output-pablo.json (160KB, 353 lines)
    • Successfully extracts a comprehensive set of Picasso's artwork
  2. Leonardo da Vinci Paintings Test Case

    • Located in fixtures/leonardo-da-vinci-paintings.html (1.5MB)
    • Output saved to outputs/output-leonardo.json (98KB, 341 lines)
    • Demonstrates the extractor's adaptability to different HTML structures

In both cases the extractor handles the different layout and produces output in the same format as the Van Gogh case. Together, the three test cases (Van Gogh, Picasso, and Da Vinci) show that the implementation is not tied to a single page structure.

Requirements Fulfilled

This PR addresses all requirements from the original challenge:

  • Extract painting name, extensions array (date), and Google link in an array
  • Parse directly from the HTML result page with no extra HTTP requests
  • Include painting thumbnails present in the result page file
  • Test against 2 other similar result pages with different layouts
  • Implemented in Ruby with RSpec tests
