[HEP Training ingestors] added a custom event ingestor (Gray Scott events)#1236
[HEP Training ingestors] added a custom event ingestor (Gray Scott events)#1236kennethrioja wants to merge 3 commits intoElixirTeSS:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new HEPTraining ingestor to import Gray Scott School 2026 webinar events from an ICS feed, including fixtures/tests, and factors out a shared HTML-fetch helper for ingestors.
Changes:
- Introduce
Ingestors::Heptraining::GrayScottIngestorto ingest events from the Gray Scott ICS and scrape additional details from linked pages. - Register the new ingestor in
IngestorFactory. - Add unit test + fixtures for the ICS and associated HTML pages; refactor GitHub ingestor to use a shared HTML fetcher.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
lib/ingestors/heptraining/gray_scott_ingestor.rb |
New ingestor implementation for Gray Scott events (ICS parsing + HTML scraping). |
lib/ingestors/ingestor_factory.rb |
Registers the new Heptraining ingestor in the factory list. |
lib/ingestors/ingestor.rb |
Adds shared get_html_from_url helper for HTTParty+Nokogiri HTML retrieval. |
lib/ingestors/github_ingestor.rb |
Switches from a local HTML helper to the new base-class helper. |
test/unit/ingestors/heptraining/gray_scott_ingestor_test.rb |
Adds unit test coverage for the new ingestor behavior. |
test/fixtures/files/ingestion/heptraining/grayscott/grayscott-event.ics |
ICS fixture used by the new ingestor test. |
test/fixtures/files/ingestion/heptraining/grayscott/grayscott-redirect.html |
Redirect-page fixture used by the new ingestor test. |
test/fixtures/files/ingestion/heptraining/grayscott/grayscott-page.html |
Event detail page fixture used by the new ingestor test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def get_redirected_url(url) | ||
| uri = URI.parse(url) | ||
| label = CGI.parse(uri.query)['label']&.first |
There was a problem hiding this comment.
This ingestor defines get_redirected_url(url) which overrides Ingestor#get_redirected_url(url, limit = 5) with a different signature/behavior. This can be confusing and makes it easy to accidentally bypass the shared redirect logic; consider renaming this method to something Gray-Scott-specific (or accept *args and call super where appropriate).
| script_content = get_html_from_url(url).css('script').find { |s| s.content.include?('var dictReference') }&.content | ||
| dict_match = script_content&.match(/var\s+dictReference\s*=\s*({[^}]+})/) | ||
| return unless dict_match | ||
|
|
||
| dict = JSON.parse(dict_match[1]) | ||
| matched_value = dict[label] | ||
|
|
||
| "#{uri.scheme}://#{uri.host}#{uri.path.sub(%r{/[^/]+$}, '')}/#{matched_value}" | ||
| end |
There was a problem hiding this comment.
get_redirected_url can return nil (no matching script/dictionary) or build an invalid URL when matched_value is nil, but the caller immediately passes the result into get_html_from_url(...). Add a guard/fallback (e.g., return the original URL, or raise a descriptive error) before attempting to fetch/parse HTML.
| end | ||
|
|
||
| def get_html_from_url(url) | ||
| response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] }) |
There was a problem hiding this comment.
get_html_from_url sets the User-Agent header to config[:user_agent] without a default; ingestors like GrayScottIngestor don't set :user_agent, which can result in a nil/empty User-Agent being sent and cause requests to be blocked. Consider defaulting to something like 'TeSS Bot' (consistent with get_redirected_url).
| response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] }) | |
| response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] || 'TeSS Bot' }) |
| # puts "calevent: #{calevent.inspect}" | ||
| gs_url = calevent.custom_properties.find { |key, _| key.include?('http') }&.last&.first&.strip&.gsub(%r{^[/\s]+|[/\s]+$}, '')&.prepend('https://') | ||
| html = get_html_from_url(get_redirected_url(gs_url)) |
There was a problem hiding this comment.
gs_url is derived from calevent.custom_properties, but the provided .ics fixture embeds the redirect URL inside the DESCRIPTION (folded line), not as a custom property. This will likely produce nil for gs_url and then raise when calling URI.parse/get_redirected_url. Extract the first http(s) URL from calevent.description (or calevent.url when present) and validate it before proceeding.
| event = OpenStruct.new | ||
| event.title = calevent.summary.to_s | ||
| event.url = gs_url | ||
| event.description = html.css('.paragraphStyle').text.strip || calevent.description.to_s |
There was a problem hiding this comment.
html.css('.paragraphStyle').text.strip || calevent.description.to_s will never fall back to the iCal description when the HTML selector is missing, because text.strip returns "" (truthy) rather than nil. Use a blank/presence check so the iCal description is used when the scraped HTML is empty.
| event.description = html.css('.paragraphStyle').text.strip || calevent.description.to_s | |
| html_description = html.css('.paragraphStyle').text.to_s.strip | |
| event.description = html_description.empty? ? calevent.description.to_s : html_description |
Summary of changes
Motivation and context
Asked by David Chamont (in2p3)
Screenshots
N/A
Checklist
to license it to the TeSS codebase under the
BSD license.