RSS ingestor #1276

Open
eilmiv wants to merge 14 commits into ElixirTeSS:master from pan-training:rss_ingestor
Collaborator

eilmiv commented Apr 7, 2026

Summary of changes

  • Added RSS and Atom feed support using the rss gem
    • Separate ingestors for events and materials (though RSS for events is less useful)
  • RSS feeds are optionally discovered from HTML pages via a link element with type application/rss+xml or application/atom+xml
  • Support for metadata extensions (not every extension is supported for every RSS/Atom version)
    • RDF metadata (Bioschemas)
    • Dublin Core
    • iTunes
    • Yahoo Media (used on YouTube, for example)

Motivation and context

Closes #722

Screenshots
image

Checklist

  • I have read and followed the CONTRIBUTING guide.
  • I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree to license it to the TeSS codebase under the BSD license.

Contributor

Copilot AI left a comment

Pull request overview

Adds RSS/Atom feed ingestion to TeSS by introducing dedicated ingestors for events and materials, including optional HTML feed discovery and support for several common metadata extensions (Dublin Core, RDF/Bioschemas, iTunes, Yahoo Media).

Changes:

  • Introduce shared RSS/Atom ingestion helpers (RSSIngestion) plus reusable Dublin Core parsing/building (DublinCoreIngestion).
  • Add new ingestors for event and material RSS/Atom feeds, including RDF/Bioschemas merge behavior and HTML alternate-feed discovery.
  • Add RSS Media namespace support for Atom parsing and comprehensive unit tests for RSS/Atom ingestion and extensions.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/unit/rss_media_atom_test.rb | Tests Media namespace installation idempotency for Atom. |
| test/unit/ingestors/material_rss_ingestor_test.rb | Material RSS/Atom ingestion tests (DC, RSS versions, RDF/Bioschemas, HTML discovery, media/iTunes extensions). |
| test/unit/ingestors/event_rss_ingestor_test.rb | Event RSS/Atom ingestion tests (DC, relative links, RDF/Bioschemas, HTML discovery). |
| lib/rss/media.rb | Defines Yahoo Media RSS extension wiring + loads Atom-specific patch. |
| lib/rss/media/atom.rb | Patches Atom classes to support media:group parsing and makes namespace installation idempotent. |
| lib/ingestors/rss_ingestion.rb | Shared feed fetching/parsing + HTML discovery + extraction/merge helpers. |
| lib/ingestors/dublin_core_ingestion.rb | Centralized DC-to-OpenStruct builders and normalization helpers. |
| lib/ingestors/material_rss_ingestor.rb | New material RSS/Atom ingestor (RSS/RDF/Atom + Bioschemas LearningResource extraction). |
| lib/ingestors/event_rss_ingestor.rb | New event RSS/Atom ingestor (RSS/RDF/Atom + Bioschemas Event/Course extraction). |
| lib/ingestors/oai_pmh_ingestor.rb | Refactors OAI-PMH DC parsing to reuse DublinCoreIngestion. |
| lib/ingestors/ingestor_factory.rb | Registers the new RSS ingestors. |
| config/initializers/inflections.rb | Adds RSS acronym for correct Zeitwerk/inflector naming. |

Collaborator Author

eilmiv commented Apr 9, 2026

Two additional notes:

  • Depending on the RSS feed, only the most recent entries are imported/updated (e.g. the most recent 15 for YouTube)
  • I'm not sure whether any RSS feeds for events exist. I have no problem removing the RSS event ingestor if it is not seen as useful

eilmiv requested a review from fbacall April 9, 2026 10:29
eilmiv marked this pull request as ready for review April 9, 2026 10:30
Member

fbacall left a comment

Looks good - very flexible and nice tests. I think some parts can be simplified, and it might be good to split the YouTube functionality into a simple subclass for the sake of clarity.

Comment on lines +52 to +53
rescue StandardError
  nil
Member

Can this rescue be more specific?
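
A minimal sketch of what a narrower rescue could look like. The method body under review is not shown in this thread, so this assumes the guarded call is URI parsing; `parse_uri_or_nil` is a hypothetical name, not something from the PR.

```ruby
require 'uri'

# Hypothetical helper: returns a parsed URI, or nil when the string is not a valid URI.
def parse_uri_or_nil(candidate)
  URI.parse(candidate)
rescue URI::InvalidURIError
  # Only the expected parse failure is swallowed; any other error still surfaces.
  nil
end
```

With `rescue StandardError`, an unrelated bug inside the method (a typo raising NoMethodError, say) would also silently turn into nil.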

@@ -0,0 +1,111 @@
require 'rss'
require_relative '../rss/media'
Member

Is this needed? Should be able to autoload

Comment on lines +39 to +42
else
  @messages << "Parsing UNKNOWN feed: #{feed_title(feed)}"
  @messages << 'unsupported feed format'
end
Member

What other types of feed are there? Perhaps log what the feed was.
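
One way to log what the feed was, sketched under assumptions: `describe_unknown_feed` and `FeedStub` are hypothetical names, and `messages` stands in for the ingestor's `@messages`. Including the class name makes it possible to tell from the log which parser result reached the fallback branch.

```ruby
# Hypothetical stand-in for whatever object the feed parser returned.
FeedStub = Struct.new(:title)

# Hypothetical helper: builds a log message that names the unexpected feed class.
def describe_unknown_feed(feed)
  "Unsupported feed format: #{feed.class.name}"
end

messages = []
messages << describe_unknown_feed(FeedStub.new('Example feed'))
```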

Comment on lines +70 to +71
rescue StandardError
  nil
Member

Necessary?

return nil if candidate.blank?

URI.parse(candidate)
return candidate if URI::DEFAULT_PARSER.make_regexp(%w[http https]).match?(candidate)
Member

Do you need to check the regexp if it's already parsed above? Can just check .scheme on the URI object.
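
A sketch of the `.scheme` check, assuming the method's purpose is to return the candidate only when it is an http(s) URL. `http_url_or_nil` is a hypothetical name, and plain Ruby replaces `blank?` so the snippet runs without ActiveSupport.

```ruby
require 'uri'

# Hypothetical helper: returns the candidate string only if it parses as an http(s) URL.
def http_url_or_nil(candidate)
  return nil if candidate.nil? || candidate.strip.empty?

  uri = URI.parse(candidate)
  # The parsed object already knows its scheme, so no regexp is needed.
  %w[http https].include?(uri.scheme) ? candidate : nil
rescue URI::InvalidURIError
  nil
end
```

A relative path has a nil scheme and is rejected by the same check.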

links.map { |link| text_value(link.href) }.find(&:present?)
end

def resolve_feed_url(candidate_url, feed_url)
Member

I'm not entirely sure what this method is for, but if you're trying to get an absolute URL from the candidate_url which may be relative or absolute, you can just use:

Addressable::URI.join(feed_url, candidate_url)
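
For illustration, the same idea sketched with the standard library's `URI.join` so that it runs without the addressable gem. This is a substitution for the sake of a self-contained example, not the call suggested above, and the example URLs are made up.

```ruby
require 'uri'

# Resolves candidate_url against feed_url; an absolute candidate is returned unchanged.
def resolve_feed_url(candidate_url, feed_url)
  URI.join(feed_url, candidate_url).to_s
end

root_relative = resolve_feed_url('/feeds/news.xml', 'https://example.org/blog/index.html')
# root_relative is "https://example.org/feeds/news.xml"
path_relative = resolve_feed_url('atom.xml', 'https://example.org/blog/index.html')
# path_relative is "https://example.org/blog/atom.xml"
absolute = resolve_feed_url('https://other.example/rss', 'https://example.org/blog/')
# absolute is "https://other.example/rss"
```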

@@ -0,0 +1,54 @@
# Patches RSS::Atom::Feed and RSS::Atom::Entry with Media namespace support (see ../media.rb).
# Kept as RSS::Media::Atom so Zeitwerk can autoload it from lib/rss/media/atom.rb.
Member

Is anywhere autoloading it?

I think this should go in config/initializers, where we already have other patches. It will also save you having to worry about loading twice.

Comment on lines +52 to +55
if (url = discover_feed_url_from_youtube_playlist_url(base_url))
  @messages << "Found Atom feed link from YouTube playlist URL, following: #{url}"
  return url
end
Member

I think it might be nicer if we have a subclassed Ingestor for YouTube ingestion. Then we don't have to explain to users that they need to pick the RSS ingestor for YouTube content.

def discover_feed_url_from_youtube_playlist_url(base_url)
  uri = URI.parse(base_url)
  host = uri.host.to_s.downcase
  return nil unless host == 'youtube.com' || host.end_with?('.youtube.com')
Member

Could perhaps re-use the is_youtube_url? method from lib/renderers/youtube.rb (make it not private)
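
The existing `is_youtube_url?` implementation is not shown in this thread, so the following only extracts the host check from the diff above into a standalone predicate, as a sketch of what a shared helper might look like. `youtube_host?` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical predicate: true when the URL's host is youtube.com or one of its subdomains.
def youtube_host?(url)
  host = URI.parse(url).host.to_s.downcase
  host == 'youtube.com' || host.end_with?('.youtube.com')
rescue URI::InvalidURIError
  false
end
```

Matching on the `.youtube.com` suffix rather than a plain substring keeps look-alike hosts such as notyoutube.com from passing.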

Comment on lines +68 to +70
Array(values).map { |v| dublin_core_text(v) }
             .map(&:to_s)
             .map(&:strip)
Member

Don't need to map 3 times, just do dublin_core_text(v).to_s.strip in the first block
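
A sketch of the single-pass version. The `dublin_core_text` defined here is a hypothetical stand-in (the PR's real helper is not shown in this thread), and `normalize_values` is an invented name for the enclosing method.

```ruby
# Hypothetical stand-in for the PR's dublin_core_text helper:
# unwraps objects that expose #content, passes everything else through.
def dublin_core_text(value)
  value.respond_to?(:content) ? value.content : value
end

# One map call does the extraction, string conversion and stripping together.
def normalize_values(values)
  Array(values).map { |v| dublin_core_text(v).to_s.strip }
end
```

The result is the same as the three chained maps, without building two intermediate arrays.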

3 participants