method to extract sitemaps #5

@MothOnMars

Description

It would be very useful to have a method to extract the sitemaps listed in a robots.txt file, per the sitemaps specification: https://www.sitemaps.org/protocol.html#submit_robots

Example usage, given that http://www.nytimes.com/robots.txt contains:

Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz

> robotex = Robotex.new
> robotex.sitemaps('http://www.nytimes.com')
=> ["http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz",
 "http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz",
 "http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz",
 "http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz",
 "http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz"]
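
Parsing the directives themselves is straightforward. A minimal sketch of the extraction step, assuming the robots.txt body has already been fetched (the `extract_sitemaps` helper below is hypothetical, not part of Robotex's current API; a real `#sitemaps` method would fetch and cache robots.txt internally the way Robotex does for `#allowed?`):

```ruby
# Return the sitemap URLs listed in a robots.txt body, in order of
# appearance. Per the sitemaps.org protocol, the "Sitemap:" field name
# is case-insensitive and its value is the full URL of the sitemap.
def extract_sitemaps(robots_txt)
  robots_txt.each_line.filter_map do |line|
    match = line.match(/\Asitemap:\s*(\S+)/i)
    match && match[1]
  end
end
```

With that in place, `robotex.sitemaps('http://www.nytimes.com')` would amount to fetching `/robots.txt` for the given host and passing the body through this extraction.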
