7 changes: 1 addition & 6 deletions lib/ingestors/github_ingestor.rb
@@ -104,7 +104,7 @@ def to_material(repo_data) # rubocop:disable Metrics/AbcSize
github_io_homepage = github_io_homepage? repo_data['homepage']
url = github_io_homepage ? repo_data['homepage'] : repo_data['html_url']
redirected_url = get_redirected_url(url)
- html = get_html(redirected_url)
+ html = get_html_from_url(redirected_url)

material = OpenStruct.new
material.title = repo_data['name'].titleize
@@ -131,11 +131,6 @@ def github_io_homepage?(homepage)
url.host&.downcase&.end_with?('.github.io')
end

- def get_html(url)
-   response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
-   Nokogiri::HTML(response.body)
- end

# DEFINITION – Opens the GitHub homepage, fetches the 3 first >50 char <p> tags' text
# and joins them with a 'Read more...' link at the end of the description
# Some of the first <p> tags were not descriptive, thus skipping them
75 changes: 75 additions & 0 deletions lib/ingestors/heptraining/gray_scott_ingestor.rb
@@ -0,0 +1,75 @@
require 'icalendar'
require 'nokogiri'
require 'open-uri'
require 'tzinfo'

module Ingestors
  module Heptraining
    class GrayScottIngestor < Ingestor
      def self.config
        {
          key: 'gray_scott_event',
          title: 'Gray Scott Events API',
          category: :events
        }
      end

      def read(url)
        @verbose = false
        process_gray_scott(url)
      end

      private

      def process_gray_scott(url)
        events = Icalendar::Event.parse(open_url(url, raise: true).set_encoding('utf-8'))
        raise 'Not found' if events.nil? || events.empty?

        events.each do |e|
          process_calevent(e, url)
        end
      end

      def process_calevent(calevent, url)
        # puts "calevent: #{calevent.inspect}"
        gs_url = calevent.custom_properties.find { |key, _| key.include?('http') }&.last&.first&.strip&.gsub(%r{^[/\s]+|[/\s]+$}, '')&.prepend('https://')
        html = get_html_from_url(get_redirected_url(gs_url))
Comment on lines +34 to +36
Copilot AI Feb 20, 2026

gs_url is derived from calevent.custom_properties, but the provided .ics fixture embeds the redirect URL inside the DESCRIPTION (folded line), not as a custom property. This will likely produce nil for gs_url and then raise when calling URI.parse/get_redirected_url. Extract the first http(s) URL from calevent.description (or calevent.url when present) and validate it before proceeding.
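One way to follow this suggestion could look like the sketch below. It assumes the URL appears as plain text in the event description; `first_http_url` is a hypothetical helper name, not something defined in this PR, and the character class is a deliberately simple heuristic that also stops at a literal backslash so an escaped `\n` after the URL is not swallowed.

```ruby
require 'uri'

# Hypothetical helper (not part of the PR): returns the first http(s) URL
# found in a block of text, or nil when none is present or it does not parse.
def first_http_url(text)
  # Stop at whitespace, angle brackets, quotes and backslashes.
  candidate = text.to_s[%r{https?://[^\s<>"\\]+}]
  return nil if candidate.nil?

  uri = URI.parse(candidate)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  uri.is_a?(URI::HTTP) && uri.host ? candidate : nil
rescue URI::InvalidURIError
  nil
end
```

The caller could then skip the event (or fall back to `calevent.url`) when the helper returns nil, instead of passing nil on to `URI.parse`.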

Copilot uses AI. Check for mistakes.

        event = OpenStruct.new
        event.title = calevent.summary.to_s
        event.url = gs_url
        event.description = html.css('.paragraphStyle').text.strip || calevent.description.to_s
Copilot AI Feb 20, 2026

html.css('.paragraphStyle').text.strip || calevent.description.to_s will never fall back to the iCal description when the HTML selector is missing, because text.strip returns "" (truthy) rather than nil. Use a blank/presence check so the iCal description is used when the scraped HTML is empty.

Suggested change
- event.description = html.css('.paragraphStyle').text.strip || calevent.description.to_s
+ html_description = html.css('.paragraphStyle').text.to_s.strip
+ event.description = html_description.empty? ? calevent.description.to_s : html_description
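The underlying Ruby behaviour can be checked in isolation. In this small sketch both method names are hypothetical and exist only to contrast the two forms: an empty string is truthy, so `||` never reaches its right-hand side.

```ruby
# Mirrors the original expression: '' is truthy, so the fallback is never used.
def naive_description(scraped, fallback)
  scraped.strip || fallback
end

# Mirrors the suggested change: fall back explicitly when the text is blank.
def guarded_description(scraped, fallback)
  text = scraped.to_s.strip
  text.empty? ? fallback : text
end
```

With a whitespace-only scrape, `naive_description` returns an empty string while `guarded_description` returns the fallback.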

        event.end = calevent.dtend&.to_time
        unless calevent.dtstart.nil?
          dtstart = calevent.dtstart
          event.start = dtstart&.to_time
          tzid = dtstart.ical_params['tzid']
          event.timezone = tzid.first.to_s if !tzid.nil? && tzid.size.positive?
        end
        event.venue = clean_html(calevent.location.to_s)
        event.organizer = html.css('h3:contains("Speakers") + ul li a')&.map(&:text)&.map(&:strip)&.join(', ') # comma-separated if multiple speakers

        @events << event
      end

      def get_redirected_url(url)
        uri = URI.parse(url)
        label = CGI.parse(uri.query)['label']&.first
Comment on lines +56 to +58
Copilot AI Feb 20, 2026

This ingestor defines get_redirected_url(url) which overrides Ingestor#get_redirected_url(url, limit = 5) with a different signature/behavior. This can be confusing and makes it easy to accidentally bypass the shared redirect logic; consider renaming this method to something Gray-Scott-specific (or accept *args and call super where appropriate).
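The renaming option might look roughly like this. The two classes are toy stand-ins for `Ingestor` and `GrayScottIngestor`, and `resolve_label_url` is a hypothetical name; the method bodies are placeholders that only show which method gets called.

```ruby
# Toy stand-in for the shared base class and its redirect logic.
class BaseIngestorSketch
  def get_redirected_url(url, limit = 5)
    "#{url}#followed-up-to-#{limit}"
  end
end

# Toy stand-in for the Gray Scott ingestor.
class GrayScottSketch < BaseIngestorSketch
  # The site-specific lookup lives under its own name, so the inherited
  # get_redirected_url(url, limit = 5) keeps its signature and behaviour.
  def resolve_label_url(url)
    "#{url}#label-resolved"
  end
end
```

With a distinct name, callers that want the shared 30X/meta-refresh handling still get it, and the dictionary lookup is opt-in.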

        script_content = get_html_from_url(url).css('script').find { |s| s.content.include?('var dictReference') }&.content
        dict_match = script_content&.match(/var\s+dictReference\s*=\s*({[^}]+})/)
        return unless dict_match

        dict = JSON.parse(dict_match[1])
        matched_value = dict[label]

        "#{uri.scheme}://#{uri.host}#{uri.path.sub(%r{/[^/]+$}, '')}/#{matched_value}"
      end
Comment on lines +60 to +68
Copilot AI Feb 20, 2026

get_redirected_url can return nil (no matching script/dictionary) or build an invalid URL when matched_value is nil, but the caller immediately passes the result into get_html_from_url(...). Add a guard/fallback (e.g., return the original URL, or raise a descriptive error) before attempting to fetch/parse HTML.
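A guarded version of the lookup could be sketched as follows. To keep it runnable offline, this hypothetical `resolve_label_url` takes the script text as an argument rather than fetching it, and it uses `URI.decode_www_form` where the PR uses `CGI.parse`; falling back to the original URL is one of the two options mentioned above, chosen here as an assumption.

```ruby
require 'json'
require 'uri'

# Hypothetical variant of the PR's lookup: returns the original URL whenever
# the label, the dictionary, or the matching entry is missing.
def resolve_label_url(url, script_content)
  uri = URI.parse(url)
  label = URI.decode_www_form(uri.query.to_s).to_h['label']
  dict_match = script_content.to_s.match(/var\s+dictReference\s*=\s*(\{[^}]+\})/)
  return url unless label && dict_match

  target = JSON.parse(dict_match[1])[label]
  return url if target.nil?

  # Replace the last path segment (redirect.html) with the mapped page.
  "#{uri.scheme}://#{uri.host}#{uri.path.sub(%r{/[^/]+$}, '')}/#{target}"
rescue JSON::ParserError, URI::InvalidURIError
  url
end
```

Returning the input URL keeps the subsequent `get_html_from_url` call from receiving nil; raising a descriptive error instead would be equally reasonable if silent fallbacks are unwanted.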

      def clean_html(html)
        Nokogiri::HTML::DocumentFragment.parse(html).text.strip
      end
    end
  end
end
5 changes: 5 additions & 0 deletions lib/ingestors/ingestor.rb
@@ -84,6 +84,11 @@ def open_url(url, raise: false, token: nil)
end
end

def get_html_from_url(url)
response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
Copilot AI Feb 20, 2026

get_html_from_url sets the User-Agent header to config[:user_agent] without a default; ingestors like GrayScottIngestor don't set :user_agent, which can result in a nil/empty User-Agent being sent and cause requests to be blocked. Consider defaulting to something like 'TeSS Bot' (consistent with get_redirected_url).

Suggested change
- response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
+ response = HTTParty.get(url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] || 'TeSS Bot' })
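If an empty string should also trigger the default (which `||` alone would not catch), the header construction could be pulled into a small helper. This is only a sketch: `request_headers` and `DEFAULT_USER_AGENT` are hypothetical names, and the 'TeSS Bot' value is taken from the comment above.

```ruby
# Hypothetical constant; the value comes from the review comment.
DEFAULT_USER_AGENT = 'TeSS Bot'

# Hypothetical helper: builds request headers, using the default whenever
# the configured user agent is nil or blank.
def request_headers(config)
  agent = config[:user_agent].to_s.strip
  { 'User-Agent' => agent.empty? ? DEFAULT_USER_AGENT : agent }
end
```

The resulting hash could be passed as the `headers:` option of the existing `HTTParty.get` call.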

Nokogiri::HTML(response.body)
end

# Some URLs automatically redirect the user to another webpage
# This method gets a URL and returns the last redirected URL (as shown by a 30X response or a `meta[http-equiv="Refresh"]` tag)
def get_redirected_url(url, limit = 5) # rubocop:disable Metrics/AbcSize
8 changes: 7 additions & 1 deletion lib/ingestors/ingestor_factory.rb
@@ -12,7 +12,7 @@ def self.ingestors
Ingestors::TessEventIngestor,
Ingestors::ZenodoIngestor,
Ingestors::GithubIngestor,
- ] + taxila_ingestors + llm_ingestors
+ ] + taxila_ingestors + llm_ingestors + heptraining_ingestors
end

def self.taxila_ingestors
@@ -49,6 +49,12 @@ def self.llm_ingestors
]
end

def self.heptraining_ingestors
[
Ingestors::Heptraining::GrayScottIngestor
]
end

def self.ingestor_config
@ingestor_config ||= ingestors.map do |i|
[i.config[:key], i.config.merge(ingestor: i)]
@@ -0,0 +1,23 @@
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//PhoenixTex2Html//gray_scott_2026_webinars/
BEGIN:VEVENT
CLASS:PUBLIC
DTSTAMP:20260212T103600
UID:TH8WMR_PNR0012_20260212T103600
DTSTART;TZID=Europe/Paris:20260226T100000
DTEND;TZID=Europe/Paris:20260226T113000
SUMMARY:Memory allocation, why and how to profile applications

LOCATION:Registration : <a id="0" href="https://teratec.webex.com/blabla">https://teratec.webex.com/blabla</a>

DESCRIPTION:Memory allocation, why and how to profile applications
\n
https://cta-lapp.pages.in2p3.fr/cours/gray_scott_revolutions/grayscott2026/redirect.html?label=sec_gray_scott_webinar_memory_allocation_memory_profiling\n
BEGIN:VALARM
TRIGGER:-PT10M
ACTION:DISPLAY
DESCRIPTION:Reminder
END:VALARM
END:VEVENT
END:VCALENDAR
@@ -0,0 +1,46 @@

<!DOCTYPE html>
<html class="js sidebar-visible navy" lang="fr">
<head>
<meta charset="UTF-8">
<title>Memory allocation, why and how to profile applications
</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta name="description" content="Memory allocation, why and how to profile applications
">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="rgba(0, 0, 0, 0)">
<link rel="stylesheet" href="variables.css">
<link rel="stylesheet" href="dark_style.css" />
<link rel="stylesheet" href="general.css">
<link rel="stylesheet" href="chrome.css">
<link rel="stylesheet" href="highlight.css" disabled="">
<link rel="stylesheet" href="tomorrow-night.css">
<link rel="stylesheet" href="ayu-highlight.css" disabled="">
<!-- Fonts -->
<link rel="stylesheet" href="font-awesome.css">
<link rel="stylesheet" href="fonts.css">
<!-- <script src="" async></script> -->
<!-- <script src=""></script> -->
</head>
<body>

<a id="450" href="invitation/gray_scott_webinar_memory_allocation_memory_profiling.ics"><div class="rendezvousStyle"></div></a><b>Date</b> : 26/02/2026<br />
<b>Location</b> : Registration : <a id="458" href="https://teratec.webex.com/webappng/sites/teratec/webinar/webinarSeries/register/0465b64b919540de9910a5b84077b878">https://teratec.webex.com/webappng/sites/teratec/webinar/webinarSeries/register/0465b64b919540de9910a5b84077b878</a>
<br />
<b>Start at</b> : 10:00<br />
<b>Stop at</b> : 11:30 <h3 id="466">Speakers</h3>
<ul>
<li><a href="2-3-5-4513.html">Someone
</a></li>
<li><a href="2-3-5-4513.html">SomeoneElse
</a></li>
</ul>
<h3 id="471">Description</h3>
<p id="472" class="paragraphStyle">
Sometimes memory has become a major problem in applications, with its bandwidth but also by the incresing size needed by more and more complex and dynamic applications. So, how to track these errors and point problematic patterns ? How to find where the memory is consumed when the application reaches the hardware limit ? After my PhD on memory management in HPC context (NUMA, parallel, etc) I had the opportunity to develop two profilers (malloc and numa) now open-sources for C/C++/Fortran and Rust. I will briefly present these tools with some examples and expected observations.
</p>

</body>
</html>

@@ -0,0 +1,28 @@

<!DOCTYPE html>
<html lang="fr">
<head>
<meta charset="utf-8" />
<title>Page redirection</title>
<link rel="stylesheet" href="dark_style.css" />
<script type="text/javascript">
function redirectionWithLabelReference(){
var parameters = location.search.substring(1).split("?");
var tmp = parameters[0].split("=");
referenceName = unescape(tmp[1]);
var dictReference = {
"sec_gray_scott_webinar_memory_allocation_memory_profiling": "1-1-5-1-449.html"
};
if(referenceName in dictReference){
document.location.href=dictReference[referenceName];
}else{
document.location.href="index.html";
}
}
</script>
</head>
<body onLoad="setTimeout('redirectionWithLabelReference()', 1000)">
<div>Dans 2 secondes vous allez être redirigé vers la page que vous avez demandée... normalement</div>
</body>
</html>

40 changes: 40 additions & 0 deletions test/unit/ingestors/heptraining/gray_scott_ingestor_test.rb
@@ -0,0 +1,40 @@
require 'test_helper'

class GrayScottIngestorTest < ActiveSupport::TestCase
  setup do
    @ingestor = Ingestors::Heptraining::GrayScottIngestor.new
    @user = users(:regular_user)
    @content_provider = content_providers(:another_portal_provider)

    webmock('https://cta-lapp.pages.in2p3.fr/COURS/GRAY_SCOTT_REVOLUTIONS/GrayScott2026/invitation/gray_scott_2026_webinars.ics', 'heptraining/grayscott/grayscott-event.ics')
    webmock('https://cta-lapp.pages.in2p3.fr/cours/gray_scott_revolutions/grayscott2026/redirect.html?label=sec_gray_scott_webinar_memory_allocation_memory_profiling', 'heptraining/grayscott/grayscott-redirect.html')
    webmock('https://cta-lapp.pages.in2p3.fr/cours/gray_scott_revolutions/grayscott2026/1-1-5-1-449.html', 'heptraining/grayscott/grayscott-page.html')
  end

  teardown do
    reset_timezone
  end

  test 'should read Gray Scott ics' do
    @ingestor.read('https://cta-lapp.pages.in2p3.fr/COURS/GRAY_SCOTT_REVOLUTIONS/GrayScott2026/invitation/gray_scott_2026_webinars.ics')
    @ingestor.write(@user, @content_provider)

    sample = @ingestor.events.detect { |e| e.title == 'Memory allocation, why and how to profile applications' }
    assert sample.persisted?

    assert_equal sample.url, 'https://cta-lapp.pages.in2p3.fr/cours/gray_scott_revolutions/grayscott2026/redirect.html?label=sec_gray_scott_webinar_memory_allocation_memory_profiling'
    assert_includes sample.description, 'Sometimes memory has become a major problem in applications'
    assert_equal sample.end, '2026-02-26 10:30:00 +0000'
    assert_equal sample.start, '2026-02-26 09:00:00 +0000'
    assert_equal sample.timezone, 'Paris'
    assert_includes sample.venue, 'teratec.webex.com'
    assert_equal sample.organizer, 'Someone, SomeoneElse'
  end

  private

  def webmock(url, filename)
    file = Rails.root.join('test', 'fixtures', 'files', 'ingestion', filename)
    WebMock.stub_request(:get, url).to_return(status: 200, headers: {}, body: file.read)
  end
end