-
Notifications
You must be signed in to change notification settings - Fork 56
Refactor DOI Handling — Replace doi-query-tool with Modular Seek::Doi::Parser Framework
#2385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: seek-1.17
Are you sure you want to change the base?
Conversation
- Introduced BaseParser, inherited by CrossrefParser and DataCiteParser - Allows easy extension for future DOI registration agencies (e.g., mEDRA)
…nt). Handles subtitle if present. Cleans the abstract by removing all <jats:*> tags and other XML markers
…ttps://github.com/seek4science/DOI-query-tool.git add/update tests expect related tests fail
extract publication `type`
- Use abbreviated journal title when available - Add support for articles identified by article number (no volume, issue, or page).
- HTML decoding - clean up code - add more tests
(HTTP requests, JSON parsing, decoding, author/editor extraction)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request refactors the DOI metadata retrieval system by replacing the legacy XML-based Crossref implementation with modern JSON APIs for both Crossref and DataCite. The implementation introduces a modular parser architecture with comprehensive error handling and extensive test coverage using VCR cassettes.
Key changes:
- Replaces XML parsing with JSON-based APIs for both Crossref and DataCite
- Implements a Registration Agency (RA) detection mechanism to route DOIs to the appropriate parser
- Introduces a structured exception hierarchy for DOI processing errors
- Adds comprehensive VCR test cassettes for various DOI types and error scenarios
Reviewed Changes
Copilot reviewed 39 out of 40 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
lib/seek/doi/parser.rb |
Main entry point implementing RA detection and parser routing |
lib/seek/doi/parsers/base_parser.rb |
Abstract base class with shared parsing logic and utilities |
lib/seek/doi/parsers/crossref_parser.rb |
Crossref-specific parser with citation formatting |
lib/seek/doi/parsers/datacite_parser.rb |
DataCite-specific parser implementation |
lib/seek/doi/base_exception.rb |
Custom exception hierarchy for DOI operations |
lib/seek/doi/author.rb |
Author model object for structured metadata |
test/unit/doi/* |
New test files with VCR cassettes for various DOI types |
test/unit/publication_test.rb |
Updated tests to use new parser API |
test/vcr_cassettes/doi/* |
VCR cassettes capturing real API responses |
test/vcr_cassettes/publications/fairdom_by_doi.yml |
Deleted obsolete XML-based test cassette |
Comments suppressed due to low confidence (1)
lib/seek/doi/parsers/crossref_parser.rb:1
- The regex pattern for stripping JATS tags uses a simple approach that may not handle nested tags or edge cases correctly. Consider using a proper HTML/XML parser like Nokogiri for more robust tag removal, or at minimum document why the simple regex is sufficient for JATS content.
module Seek
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/seek/doi/base_exception.rb
Outdated
| def message | ||
| cause ? "#{super} (#{cause.class.name}: #{cause.message})" : super | ||
| end |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message method could potentially cause infinite recursion if the cause chain is circular or if cause.message itself triggers an error. Consider adding safeguards or limiting the depth of cause message extraction.
app/models/publication.rb
Outdated
| @error = e.message | ||
| rescue Seek::Doi::NotFoundException => e | ||
| @error = e.message | ||
| rescue Seek::Doi::RecordNotSupported => e |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assignment to e is useless, since its value is never read.
| rescue Seek::Doi::RecordNotSupported => e | |
| rescue Seek::Doi::RecordNotSupported |
test/unit/publication_test.rb
Outdated
| assert_nil publication.doi | ||
| assert_equal Publication::REGISTRATION_BY_DOI, publication.registered_mode | ||
| test 'create publication from metadata via Seek::Doi::Parser' do | ||
| doi = '10.1234/fake-doi-for-test' |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assignment to doi is useless, since its value is never read.
| doi = '10.1234/fake-doi-for-test' |
Removed external dependency on the
doi-query-toolgemDOI fetching and parsing are now handled entirely within SEEK.
Introduced modular parser architecture:
Seek::Doi::Parser– central entry point that detects the DOI registration agency (RA).Seek::Doi::Parsers::BaseParser– shared base class with standard request, error, and metadata-handling logic.Seek::Doi::Parsers::CrossrefParser– rewritten Crossref parser (replaces the gem-based version).Seek::Doi::Parsers::DataciteParser– newly added to support DOIs registered under DataCite.Extensible design for future RA support
Adding new DOI sources (e.g. mEDRA, JaLC) will only require inheriting from BaseParser and implementing a single metadata extraction method.
Unified exception handling
More informative, user-facing error messages in the UI.
Updated Publication model
Now fetches and parses DOI metadata via the new parser system.
Displays detailed error feedback for unsupported or malformed DOIs.
Improved test coverage