Integration of ORCiD data into names vocabulary #158

cc-a · 2025-02-26T17:38:15Z

This issue covers some exploration and experimentation regarding how we could use data from ORCiD with the names vocabulary.

The overall goal is to provide a good UX when adding creatibutor data to records that makes it easy to add researchers whilst providing comprehensive metadata. To this end we'd like the names vocabulary to be populated with data such that it:

covers relevant Imperial staff and students (fed directly from the Imperial directory).
covers the research community more broadly (likely fed from ORCiD data).
includes ORCiDs and affiliation metadata for as many entries as possible.

InvenioRDM provides some basic support for importing data from the annual data dump that ORCiD provides (see names vocab docs).

The challenge in integrating our internal data feed with the data from ORCiD, however, is that the two overlap, as many researchers at Imperial already have an ORCiD.

Possible approaches

Some thoughts (including some assumptions that could be checked).

Naively Combining Both Data Sources

This will likely lead to duplicate entries for any Imperial researcher who has an ORCiD. Depending on the visibility of data in their ORCiD profile, the vocabulary entry from ORCiD may be missing affiliation data, whilst the Imperial entry will be missing the ORCiD. So whichever of the two entries is picked, some metadata will be missing.

Combining Both Data Sources with Duplicates Removed

Essentially the same as above, except we try to remove duplicate entries between the two. As far as I'm aware, the only way to unambiguously cross-link entries would be based on email address. The challenge here is that email addresses associated with ORCiDs may not be up to date and may not have been made publicly visible. In general I think only a minority of ORCiDs have a publicly visible email associated, so I would expect the number of entries we could cross-link to be relatively small, and most of the duplicates would remain. We could improve our ability to cross-link records if we allow/encourage/require users to link their ORCiD accounts.
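As a rough illustration of the email-based cross-linking described above, here is a minimal sketch. The entry shape (plain dicts with "name", "email", "orcid" and "affiliation" keys), the function names, the people and the ORCiD values are all made up for the example; this is not the InvenioRDM names vocabulary format or any existing loader.

```python
def normalise_email(email):
    """Return a lower-cased, stripped email address, or None if absent."""
    return email.strip().lower() if email else None


def merge_by_email(imperial_entries, orcid_entries):
    """Combine both sources, merging entries that share an email address.

    Hypothetical helper: entries are plain dicts, not real vocabulary records.
    """
    merged = [dict(entry) for entry in imperial_entries]  # copy, don't mutate input
    by_email = {}
    for entry in merged:
        key = normalise_email(entry.get("email"))
        if key:
            by_email[key] = entry

    for orcid_entry in orcid_entries:
        key = normalise_email(orcid_entry.get("email"))
        match = by_email.get(key) if key else None
        if match is None:
            # No public email, or no Imperial counterpart: keep as a separate entry.
            merged.append(dict(orcid_entry))
        elif not match.get("orcid"):
            # Same person: keep the Imperial entry and attach the ORCiD to it.
            match["orcid"] = orcid_entry.get("orcid")
    return merged


# Made-up sample data (the ORCiD values are placeholders, not real identifiers).
imperial = [
    {"name": "Ada Example", "email": "a.example@imperial.ac.uk",
     "affiliation": "Imperial College London"},
    {"name": "Grace Sample", "email": "g.sample@imperial.ac.uk",
     "affiliation": "Imperial College London"},
]
orcid = [
    {"name": "Ada Example", "email": "A.Example@imperial.ac.uk",
     "orcid": "0000-0000-0000-0001"},
    {"name": "Grace Sample", "email": None, "orcid": "0000-0000-0000-0002"},
    {"name": "Alan Outside", "email": None, "orcid": "0000-0000-0000-0003"},
]

combined = merge_by_email(imperial, orcid)
for entry in combined:
    print(entry["name"], entry.get("orcid"), entry.get("affiliation"))
```

In this sample only the entry with a public email is merged; the researcher whose ORCiD email is hidden still ends up with two entries, which is the limitation noted above.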

Use the Imperial Data Feed Enriched with Data from ORCiD

We'd only add name entries based on our internal data feed, but whenever we add one we try to look for an associated ORCiD and include it. This is probably subject to the same set of caveats as cross-linking entries above, in that it will probably only be possible for a minority of entries. This approach should prevent duplicate entries but will obviously limit the ease with which researchers outside Imperial can be added to deposits, and will likely lead to lower quality metadata for those researchers.
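A minimal sketch of this enrichment approach, again assuming email is the matching key. The dict keys, the `enrich_with_orcid` helper, the sample people and the ORCiD values are hypothetical and only meant to show the shape of the result: one entry per person in the Imperial feed, with an ORCiD attached where a match exists.

```python
def enrich_with_orcid(imperial_entries, orcid_records):
    """Return copies of the Imperial entries, adding an ORCiD where emails match.

    Hypothetical helper: ORCiD records without a public email are ignored,
    and no entries are created for people outside the Imperial feed.
    """
    orcid_by_email = {
        record["email"].strip().lower(): record["orcid"]
        for record in orcid_records
        if record.get("email")
    }
    enriched = []
    for entry in imperial_entries:
        new_entry = dict(entry)
        found = orcid_by_email.get(entry["email"].strip().lower())
        if found:
            new_entry["orcid"] = found
        enriched.append(new_entry)
    return enriched


# Made-up sample data (placeholder ORCiD values).
staff = [
    {"name": "Ada Example", "email": "a.example@imperial.ac.uk"},
    {"name": "Grace Sample", "email": "g.sample@imperial.ac.uk"},
]
orcid_dump = [
    {"name": "Ada Example", "email": "a.example@imperial.ac.uk",
     "orcid": "0000-0000-0000-0001"},
    {"name": "Grace Sample", "email": None, "orcid": "0000-0000-0000-0002"},
    {"name": "Alan Outside", "email": None, "orcid": "0000-0000-0000-0003"},
]

names = enrich_with_orcid(staff, orcid_dump)
with_orcid = [entry["name"] for entry in names if "orcid" in entry]
print(f"{len(with_orcid)} of {len(names)} entries enriched: {with_orcid}")
```

The output never contains more entries than the Imperial feed, so duplicates cannot arise, but the external researcher in the sample is not available at all.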

Things to Try

Use the built-in functionality to load the full ORCiD data dump. Does affiliation data get populated? Sample some Imperial researchers and see if Imperial is present as an affiliation.
Load both the full ORCiD dump and the Imperial one at the same time. Do we get duplicates?
Does cross-linking entries by email address work OK? Is it possible using the ORCiD data dump and/or the ORCiD API? We can use my ORCiD as an example of one with a public Imperial email (made public recently, so it may not appear in the data dump).
ORCiD has a concept of "verified email domains". Can we use this in some way if the actual email address for an ORCiD isn't available?
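The first two checks could be scripted along these lines once entries have been loaded. This is only a sketch: the entry shape (given_name, family_name, identifiers with a scheme, affiliations with a name) is an assumption about how names entries look and should be checked against the names vocab docs, the helper names are hypothetical, and the sample data is made up. Matching on name alone will produce false positives, so it only suggests candidates to inspect by hand.

```python
from collections import defaultdict


def orcid_of(entry):
    """Return the ORCiD identifier of an entry, or None if it has none."""
    for identifier in entry.get("identifiers", []):
        if identifier.get("scheme") == "orcid":
            return identifier.get("identifier")
    return None


def has_affiliation(entry, needle="imperial college"):
    """True if any affiliation name contains the given text (case-insensitive)."""
    return any(
        needle in (aff.get("name") or "").lower()
        for aff in entry.get("affiliations", [])
    )


def sample_report(entries, sample_orcids):
    """For each sampled ORCiD, report whether it loaded with an Imperial affiliation."""
    by_orcid = {orcid_of(e): e for e in entries if orcid_of(e)}
    report = {}
    for orcid in sample_orcids:
        entry = by_orcid.get(orcid)
        if entry is None:
            report[orcid] = "not loaded"
        elif has_affiliation(entry):
            report[orcid] = "affiliated"
        else:
            report[orcid] = "no Imperial affiliation"
    return report


def duplicate_name_candidates(entries):
    """Group entries by (family_name, given_name); return groups with more than one."""
    groups = defaultdict(list)
    for entry in entries:
        key = (
            entry.get("family_name", "").strip().lower(),
            entry.get("given_name", "").strip().lower(),
        )
        groups[key].append(entry)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Made-up entries standing in for a combined ORCiD + Imperial load.
loaded = [
    {"given_name": "Ada", "family_name": "Example",
     "identifiers": [{"scheme": "orcid", "identifier": "0000-0000-0000-0001"}],
     "affiliations": [{"name": "Imperial College London"}]},
    {"given_name": "Grace", "family_name": "Sample",
     "identifiers": [{"scheme": "orcid", "identifier": "0000-0000-0000-0002"}],
     "affiliations": []},
    {"given_name": "Grace", "family_name": "Sample",
     "identifiers": [],
     "affiliations": [{"name": "Imperial College London"}]},
]

report = sample_report(
    loaded,
    ["0000-0000-0000-0001", "0000-0000-0000-0002", "0000-0000-0000-0009"],
)
duplicates = duplicate_name_candidates(loaded)
print(report)
print(sorted(duplicates))
```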

richard-jones · 2025-03-04T14:26:18Z

@npapantonis @cc-a - to take this back to the working group for a view on importance of ORCID vs Imperial record

github-project-automation bot added this to Fair Data Repository Feb 26, 2025

richard-jones assigned cc-a, npapantonis and alexdewar Mar 4, 2025
