Description
This is broken into sub-tasks, see implementation order below:
1. Foundation — Database Models
- [DwC export]: Create `exportdataset` table to replace ExportFeed app resource #7745
- [DwC export]: Create `extensions` join table #7746
- [DwC export]: Create `schemamapping` table (or extend `spquery`) #7747
- [DwC export]: Extend `spqueryfield` for DwC term mapping #7748
2. Schema Terms JSON
3. Core Schema Mapping UI
- [DwC export]: Add "Term" mapping column to Schema Mapping interface #7715
- [DwC export]: Add term info icon with description tooltip/dialog #7716
- [DwC export]: Implement Schema Mapping interface toolbar #7717
- [DwC export]: Implement auto-mapping of Specify fields to DwC terms #7718
- [DwC export]: Allow entering a custom IRI in the Term column #7719
- [DwC export]: Create default schema mapping queries (Occurrence + Extensions) #7711
- [DwC export]: Allow cloning of schema mapping queries #7712
4. Locked Row + Validation
- [DwC export]: Add locked `occurrenceID` row (Collection Object → GUID) #7722
- [DwC export]: Validate `occurrenceID` uniqueness on Run/Save (Core mappings) #7721
- [DwC export]: Validate that each DwC term appears only once per mapping #7731
5. Entry Points into the Mapping Interface
- [DwC export]: Add vocabulary/term list selection dialog #7713
- [DwC export]: Add "New Mapping" type selection dialog #7710
- [DwC export]: Add Schema Mapper tool to User Tools #7709
6. Export Package
- [DwC export]: Add Export Package creation/edit form #7725
- [DwC export]: Allow cloning an Export Package #7724
- [DwC export]: Add 'Export Packages' tool to User Tools #7723
7. Permissions
8. DwCA Generation
- [DwC export]: Implement Darwin Core Archive (DwCA) file generation #7733
- [DwC export]: Add Download Archive button in Export Package form #7726
9. EML + GBIF Validator
- [DwC export]: Add EML creation/import mechanism for Export Packages #7734
- [DwC export]: Provide link to GBIF data validator in export workflow #7732
10. Attachment URLs
11. Cache Tables
- [DwC export]: Implement export cache table creation mechanism #7737
- [DwC export]: Build flattened cache tables for core and extension mappings #7739
- [DwC export]: Add mechanism to clean/remove orphaned cache tables #7738
- [DwC export]: Remove cache tables when export mapping or extension is deleted #7742
- [DwC export]: Report cache build progress to UI and update "last exported" timestamp #7740
- [DwC export]: Allow authorized users to build export cache via API #7741
12. RSS + Scheduling
- [DwC export]: Add "Copy RSS Feed URL" button to Update RSS Feed notification #7735
- [DwC export]: Add Download action to RSS Feed update notification (V2) #7736
- [DwC export]: Replace "Update RSS Feed" tool in User Tools #7728
- [DwC export]: Implement automatic RSS publishing without an external cron job #7744
- [DwC export]: Support administrator-configured scheduled export pipeline (cron/script) #7743
13. Schema Config Page Addition
14. V2
Requirements from @grantfitzsimmons:
Important
While not described in these requirements, the existing publishing mechanism using the app resources system and export feeds must be retained and backwards compatibility should be preserved for the near future.
Eventually, we need to discuss providing a utility for converting the legacy publishing pipeline to the new one or offering services to members for this conversion. For now, the legacy system and the modern system should remain distinct yet both functional.
These requirements describe enhancements to the existing Specify 7 data publishing system for ease of use and more efficient publishing. This includes Darwin Core publishing to aggregators and (potentially) publishing to web portals, which is currently not supported in Specify 7. Where possible, existing UI mechanisms should be used to ensure continuity for the user and enhance the user's intuitive experience.
These requirements were developed in conjunction with @acbentley and @tlammer.
Goals
- Enhance the current Specify 7 data publishing system.
- Improve user experience and efficiency in publishing data.
- Support DwC publishing to aggregators.
- Potentially enable future publishing to web portals.
- Leverage existing UI mechanisms for consistency and intuitiveness.
Aggregators we must support are below. Other aggregators that accept DwCA formatted data are also compatible:
Workflows
Workflow: From No Mapping to a Complete Mapping
- An Institution Admin opens User Tools and selects Schema Mapper.
- A dialog lists all existing mappings, separated into Core and Extension sections, with New and Close actions.
- When New is selected, the user chooses a Mapping Type: Core or Extension.
- The user chooses one of two paths:
- Use an Existing Mapping
- They can either select a provided default mapping (e.g., DwC Occurrence for Core; Audiovisual Core, Identification History, etc. for Extensions) or a custom mapping that was added by a user in the database previously.
- The mapping opens in the schema mapping interface for review and optional edits.
- Create From Scratch
- Choose a schema vocabulary to start from (e.g., Darwin Core, Audiovisual Core).
- The schema mapping interface opens with terms restricted to the selected schema.
- In the schema mapping interface:
- Users add fields the same way as Query Builder.
- Fields with matching
mappingPathin the schema terms JSON auto-map; others remain unmapped. If a default was selected, the query will be pre-populated with mapped fields. occurrenceIDis pre-added, locked, and always mapped toCollection Object → GUID.- Users may enter a verbatim term or custom IRI to map to a custom concept or future Web Portal caption.
- Before saving, the user can click "Preview" (equivalent to the "Query" button) and see the results. Next to it in the bottom right is a new button named "Save Mapping", which both runs the mapping query and saves the mapping changes. The "Save Query" and "Save As" buttons at the top no longer exist.
- Core mappings must validate that `occurrenceID` is unique in the results after running the query.
- Extension mappings may reuse `occurrenceID` but must include it.
Tip
If the user is unable to save due to duplicate occurrenceIDs, we need to provide a user-friendly message to point them in the right direction. From Specify 6, for example: "The mapping contains duplicate occurrence IDs and cannot be exported. Please modify the mapping query to ensure it returns unique records. (Hint: You may need to add a condition for current determination or use an aggregator for preparations or other one-to-many relationships.)"
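The check behind this message can be as simple as counting `occurrenceID` values in the query results before allowing a save. A minimal sketch, assuming the results arrive as a list of dicts (function and field names here are illustrative, not Specify's actual API):

```python
from collections import Counter

# User-facing message modeled on the Specify 6 wording quoted above.
DUPLICATE_MESSAGE = (
    "The mapping contains duplicate occurrence IDs and cannot be exported. "
    "Please modify the mapping query to ensure it returns unique records."
)

def find_duplicate_occurrence_ids(rows, key="occurrenceID"):
    """Return the set of occurrenceID values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {value for value, count in counts.items() if count > 1}

def validate_core_mapping(rows):
    """Raise a user-friendly error if the Core mapping yields duplicate IDs."""
    duplicates = find_duplicate_occurrence_ids(rows)
    if duplicates:
        raise ValueError(f"{DUPLICATE_MESSAGE} Duplicates: {sorted(duplicates)}")
```

Extension mappings would run the same counting logic against the base table record IDs instead (see FR-22).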
- The user saves the mapping.
- The user returns to User Tools → Export Packages, where they can create or edit an export package. If they are creating one for the first time, they need to fill in each of the following fields:
Export Package Form Fields
- Export Name (text, unique)
- Core Mapping (query combo box, core mappings only)
- File Name (text, must end with `.zip`, unique)
- Metadata (query combo box, EML app resource)
- Extensions (subview, one-to-many, extension mappings only)
- Collection (query combo box, any collection)
- Include in RSS feed? (checkbox)
- Frequency (integer, number of days between RSS updates)
In this dialog, there should also be a button to download the archive immediately.
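The form fields above can be sketched as a simple model carrying the validation rules the form would enforce. This is a hedged sketch; the field names and types are illustrative assumptions, not the actual `exportdataset` schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExportPackage:
    """Illustrative model of the Export Package form fields."""
    export_name: str          # text, unique per collection
    core_mapping: str         # name of a Core schema mapping
    file_name: str            # must end with .zip, unique per collection
    metadata: str             # name of the EML app resource
    extensions: list = field(default_factory=list)  # extension mapping names
    collection: str = ""      # any collection
    include_in_rss: bool = False
    frequency_days: int = 0   # days between RSS updates

    def validate(self):
        """Return a list of user-facing validation errors (empty if valid)."""
        errors = []
        if not self.export_name:
            errors.append("Export Name is required.")
        if not self.file_name.endswith(".zip"):
            errors.append("File Name must end with .zip.")
        if self.include_in_rss and self.frequency_days <= 0:
            errors.append("Frequency must be a positive number of days.")
        return errors
```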
Workflow: Exporting a single data set
An admin user can go to User Tools → Export Packages, select any export package (from any collection), and can trigger a download of the data set from the dialog.
Workflow: Exporting all collection data sets
This process will happen automatically without input from the user if any data set is set to be included in the RSS feed and has a frequency set.
- The admin user can open User Tools and select Update RSS Feed
- A dialog will appear confirming the user's intention (Update all RSS export feed items now?), and they can proceed.
- This will update the export package for all `exportdataset` records that have the 'Include in RSS feed?' checkbox checked.
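The frequency check in this workflow can be sketched as a filter over export packages, assuming each record carries its RSS flag, frequency in days, and a last-exported date (field names are illustrative, not the actual schema):

```python
from datetime import date, timedelta

def packages_due_for_rss(packages, today=None):
    """Select export packages whose RSS checkbox is on and whose
    frequency (in days) has elapsed since the last export."""
    today = today or date.today()
    due = []
    for pkg in packages:
        if not pkg.get("include_in_rss"):
            continue
        last = pkg.get("last_exported")
        # Never-exported packages are always due.
        if last is None or today - last >= timedelta(days=pkg["frequency_days"]):
            due.append(pkg)
    return due
```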
Non-Functional Requirements
- NFR-01: Where possible, existing UI mechanisms should be used to ensure continuity for the user.
- NFR-02: The new system should enhance the user's intuitive experience.
- NFR-03: The user should never edit code to set up, map, or export their data.
- NFR-04: The user should never have to copy information from another site to map or export their data.
- NFR-05: More Darwin Core concepts and extensions may be added in the future, and users need to be able to map to those concepts.
- Users will need to add these to their existing exports, and updates should not be required on our side to use new terms.
- NFR-06: Specify’s export should require little effort on the user’s part.
Functional Requirements
Schema Configuration
- FR-01: In the Schema Config interface, below the Specify field description, a new section titled 'Darwin Core' will appear. When the user clicks on this section, it will expand below, displaying the DwC term, description, IRI, and vocabulary. If multiple terms are mapped to that field in the JSON file, a list of names and descriptions will be shown to the user.
- FR-02: [DwC export]: Add missing DwC fields to Specify #7602
Query Builder
- FR-04: Add support for a new "schema mapping" interface built atop the query builder, including a new column for mapping terms.
- FR-05: DwC default queries must be able to be copied so users can create their own modified version.
- FR-08: Implement auto-mapping for fields matching common DwC concepts.
- FR-09: Allow manual mapping of additional fields for concepts not automatically mapped and for fields used to limit results or ensure uniqueness. This should be an IRI (URL). Verbatim values will be needed for the Web Portal in v2.
- FR-10: Support mapping of fields and aggregated table formats.
- FR-11: Regular query filters should work with the mapping.
- FR-12: All exports should use the `YYYY-MM-DD` International (ISO) Standard format, regardless of the date format configured for the database.
- FR-13: New schema mappings must start from the Collection Object table. There is no option to select a different base table when creating an export mapping.
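FR-12's date normalization can be sketched as a small helper that always emits ISO 8601 dates regardless of the database's configured display format. This is a hedged sketch; the function name is illustrative, and real code would also need to handle Specify's partial-date precision:

```python
from datetime import date, datetime

def to_iso_8601(value):
    """Format a date or datetime as YYYY-MM-DD for export,
    independent of any user-configured display format."""
    if isinstance(value, (date, datetime)):
        return value.strftime("%Y-%m-%d")
    raise TypeError(f"Unsupported date value: {value!r}")
```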
Darwin Core Mapping
This should be done in a schema mapping interface (looks like the query builder) with an additional editable pick list where you can choose from a list of schema concepts.
- FR-14: Users must be able to modify the schema mapping they are using for mapping after having started the Darwin Core Mapping process.
- FR-15: Add the ability to select Darwin Core concepts in the UI to match specific query fields to concepts, both for the occurrence file and extension files.
- Users should not have to search another site to link a Specify field to a concept.
- The user should be able to click on an info icon which appears next to a term. This should show them a description of the term with a link to the quick reference guide if applicable.
- FR-15a: `occurrenceID` is always added to the mapping, cannot be removed, and cannot be duplicated. It is always mapped to `Collection Object → GUID` for both core and extension mappings.
- FR-15b: The term pick list must be restricted to the selected schema vocabulary. Users may enter a custom IRI to map to a concept not present in the list.
- FR-16: Must be able to add static text that will map to DwC concepts without requiring a field mapping. This text value is stored on the query field definition (possibly in V2, need clarification)
- FR-17: Attachment URLs should be automatically constructed from the configured web asset server URL and collection if attachments (e.g. aggregated `CollectionObjectAttachments`) are included in an export, without additional configuration.
- FR-18: Automatically map fields to DwC concepts based on the schema terms JSON identifiers (e.g. term name or IRI).
- FR-19: Support multiple table formats and aggregations #6435
- FR-20: A term may be mapped only once per mapping. This is enforced as part of validation on save; if a term has been mapped several times, a warning message is displayed.
- FR-22: Uniqueness validation is context-dependent:
  - For Core mappings (e.g. CollectionObject), `occurrenceID` must be unique.
  - For Extension mappings (e.g. Determination), the unique key is the base table ID (e.g. `DeterminationID`), but the `occurrenceID` field must be present to link back to the Core. Multiple extension rows may share the same `occurrenceID`.
I have already created a mapping of DwC concepts to Specify fields here: DWC Terms to Specify
We need to add all of the current accepted Darwin Core terms into the static schema terms JSON file with the mapping described in this spreadsheet.
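The auto-mapping described in FR-18 can be sketched as inverting the schema terms JSON into a `mappingPath → IRI` lookup, so that query fields whose mapping path matches are mapped automatically and the rest stay unmapped. A sketch assuming the structure of the schema terms JSON shown under "Terms" below; function names are illustrative:

```python
def build_auto_map(schema_terms: dict) -> dict:
    """Invert the schema terms JSON into a {mappingPath: term IRI} lookup."""
    lookup = {}
    for vocabulary in schema_terms.values():
        for entry in vocabulary.get("terms", []):
            for iri, term in entry.items():
                # First match wins; later vocabularies do not overwrite it.
                lookup.setdefault(term["mappingPath"], iri)
    return lookup

def auto_map_field(mapping_path: str, lookup: dict):
    """Return the matching term IRI, or None if the field stays unmapped."""
    return lookup.get(mapping_path)
```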
Validation
- FR-23: Provide the ability to validate results before exporting.
- FR-24: Include validation for duplicate records in the Core and Extension files.
Validation Steps:
- Verify that each `occurrenceID` only appears once (for extensions, verify that the base table record IDs only appear once).
- Verify that the export mapping and EML are valid.
- Provide a link to the GBIF data validator so the user can verify it externally.
Data Output
- FR-25: File output must be a Darwin Core Archive (DwCA).
- FR-26: If all steps are followed correctly, the export produced must match current standards and be validated without errors by the GBIF data validator.
DwCA Ecological Metadata
- FR-27: There should be a straightforward mechanism for creating or adding Ecological Metadata Language (EML) associated with a published data set. We recommend using the EML generator built and maintained by GBIF Norway. When creating a new `exportdataset` record, users should easily select and import an EML file to automatically create the app resource, minimizing any friction from the form.
  - As in FR-26, the EML created must match current standards and be validated without errors by the GBIF data validator.
RSS Publishing
- FR-28: Automatic RSS publishing needs to work without an external cron job ([DwC export]: Make RSS Feed DwCA automatic export process internal #1166)
- FR-28a: When a user triggers “Update RSS Feed” from User Tools, the resulting notification must include a button to copy the RSS feed endpoint link in addition to the download action.
Permissions
- FR-29: Institution Administrators are the only users who can use the data publishing tools.
User Tool
- FR-30: Create a User Tool item where access to all the files for Data Publishing are located.
- FR-30a: Add a Schema Mapper tool that opens a dialog listing all existing `schemamapping` records, segmented into Core and Extension sections, with New and Close actions.
- FR-30b: Add an Export Packages tool that opens a dialog listing all `exportdataset` records, allows creating new records, and allows editing existing records.
- FR-30c: The Export Package form must support triggering a data set download directly and provide a button to copy the data set URL.
- FR-30d: Export packages must be clone-able so a mapping can be reused across collections while still requiring unique `ExportName` and `FileName` per collection.
Darwin Core Updates & Versioning
- FR-31: The schema terms JSON file serves as the single source of truth for the Darwin Core version currently supported by the installation along with any accompanying extensions.
- FR-32: System-provided terms in the schema terms JSON file are read-only and cannot be edited by users.
- FR-33: Specify software updates will handle standard changes (e.g., new terms, deprecated terms) by updating the schema terms JSON file.
- FR-35: Existing mappings must remain stable during updates; mappings store the selected term values, so changes to a term's metadata or the addition of new terms must not break existing export configurations.
Data Caching & Performance
Important
The approach used here can be modified based on implementation preference and best practices. This model follows the approach used in Specify 6; how the cache is populated (streaming, bulk-loading, etc.) can be altered if necessary, but fundamentally a table (cache) needs to be created when requested by the user.
This functionality is necessary so that institutions can perform post-processing on the data as part of a standard data export pipeline.
It is not explicitly necessary to target only changed rows if the system is performant enough to rebuild from scratch in a short amount of time.
Tip
We could have the worker handle the job of creating and dropping tables if needed. Some concern for the atomicity of this in implementation as CREATE TABLE implicitly commits. We don't want to leave orphaned tables in the database without some cleaning mechanism.
- FR-36: When an export is run, the current export package needs to be built into a flattened cache table in the same database with the mapping’s name (sanitized for MariaDB) (e.g. `{MappingName}`, like `dwc_fish`).
  - Any associated extensions should be included in the accompanying flattened cache tables (e.g., `{MappingName}_ext1_{ExtensionName}`, like `dwc_fish_ext1_audiovisual`), separated by a meaningful extension identifier such as `ext1`, `ext2`, and so on.
- FR-37: The process must create or replace the cache table (adding a `<table>Id` primary key), choosing column types from the mapping’s fields. Bulk-load mechanisms should be used for speed.
- FR-38: Data returned from the source query must be inserted into the cache table. Cache rows whose source records no longer exist must be deleted.
- FR-39: Progress should be reported to the UI during the cache building process. When finished, the system must update the mapping’s “last exported” timestamp so future runs only pick up changes.
- FR-40: If an export mapping or extension is deleted, the cache tables should be removed as well.
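FR-36 and FR-37 can be sketched as two small helpers: one that sanitizes the mapping name into a MariaDB-safe identifier, and one that builds the `CREATE OR REPLACE TABLE` DDL with the `<table>Id` primary key. This is a sketch under assumed inputs; real code would derive the column types from the mapping's fields and use bulk loading for the insert step:

```python
import re

def sanitize_table_name(mapping_name: str) -> str:
    """Reduce a mapping name to a MariaDB-safe identifier, e.g.
    'DwC Fish' becomes 'dwc_fish'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", mapping_name).strip("_").lower()
    # Identifiers should not start with a digit.
    return name if not name[:1].isdigit() else f"_{name}"

def cache_table_ddl(mapping_name: str, columns: dict, extension: str = "") -> str:
    """Build CREATE OR REPLACE TABLE DDL for a flattened cache table.

    columns: {exported header name: SQL type} chosen from the mapping's
    fields. An extension suffix like 'ext1_audiovisual' follows FR-36.
    """
    table = sanitize_table_name(mapping_name)
    if extension:
        table = f"{table}_{extension}"
    cols = ", ".join(f"`{name}` {sqltype}" for name, sqltype in columns.items())
    return (f"CREATE OR REPLACE TABLE `{table}` "
            f"(`{table}Id` INT PRIMARY KEY, {cols})")
```

`CREATE OR REPLACE TABLE` is MariaDB-specific and, as the tip above notes, implicitly commits, so the worker still needs a cleanup mechanism for orphaned tables (FR-40).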
API Integration
- FR-41: If they have the appropriate permission, users must be able to build the export cache using the API directly (without the use of the UI).
- FR-42: An administrator needs to be able to set up a cron job (or script) that serves as an export pipeline: they choose a schema mapping to serve as the basis for an export and set the build frequency (cron or a scheduled task is fine).
Additional Deliverables
This work requires the implementation of technical components before beginning. These components will be packaged with the release as deliverables accessible directly to the user. Default mappings must be easily selected and used without requiring the user to build a query first.
These may be reviewed by the SCC member community and/or the board.
- Develop at least one default mapping from Collection Object to Darwin Core Occurrence, based on the aforementioned mapping
- Develop default extension mappings for the following extensions:
- Identification History
- Audiovisual Core
- GGBN Material Sample
- EOL References
- Resource Relationship
Proposed Model
Below is a detailed outline of the model within Specify. This model distinguishes between standard queries and "Schema Mappings" to prevent user confusion and ensure terms are mapped at the field level.
Export Package Table exportdataset
An export package groups together the critical components for publishing your data, used to create a data set on platforms like GBIF.
This is a replacement for the current ExportFeed app resource.
| Field | Type | Description | Example |
|---|---|---|---|
| ExportName | Text | The name of the export. | KUBI Ichthyology Voucher |
| FileName | Text | The name of the export file once packaged, always ending with `.zip`. | kui-dwca.zip |
| RSS | Checkbox | Indicates if this should be made available via the RSS feed when updated. | Yes |
| Frequency | Integer | If published, the number of days between automatic RSS feed updates. | |
| Metadata | Link to `spappresource` | A link to the app resource containing the Ecological Metadata Language (EML) created for the data set being published. | EML data sourced from GBIF or created using the GBIF EML generator |
| CoreMapping | Link to `schemamapping` | Links to the primary Core schema mapping for the export (e.g. Occurrence). | Voucher (schema mapping name) |
| Extensions | One-to-many to `extensions` | A one-to-many relationship where many schema mappings can be linked to a single export mapping. | GBIF Identification, CO Audubon Core (schema mapping names) |
Extensions extensions
A join table that bridges exportdataset with schemamapping to capture the one-to-many nature of extensions.
| Field | Type | Description | Example |
|---|---|---|---|
| Mapping | Link to `schemamapping` | Links to the extension’s schema mapping. The system does not require the same number of rows as the Core, but the extension query must include the `occurrenceID` (inherited from the Core query) to facilitate the join. | CO Audubon Core |
| ExportDataSet | Link to `exportdataset` | Links to the export package the extension is connected to. | |
Schema Mapping Table schemamapping
A schema mapping is a strict wrapper around the standard query system (spquery). It segregates "Mapping Queries" from standard "User Queries" in the UI.
This is the replacement for the `spexportschemamapping` system in Specify 6.
The distinction between the spquery and the schemamapping record should be invisible for the user for all intents and purposes. When the user creates a schemamapping, it should ask them if this is a "Core" or "Extension" mapping, and they can provide a description. On the user side of things, the title of the query can be used to identify the mapping.
Implementation is up to the development team as to whether this table is needed or whether extensions to the spquery table are sufficient.
This table defines whether the underlying query is a Core (e.g., Occurrence) or an Extension (e.g., Audubon Core).
| Field | Type | Description | Example |
|---|---|---|---|
| Query | Link to `spquery` | One-to-one link to `spquery`. The underlying query engine handles the logic. (Required) | |
| MappingType | Enum | Defines the role: Core (Occurrence) or Extension. (Required) | Core |
| Name | Text | User-facing description of what this mapping achieves. | Maps Collection Object to DwC Occurrence |
Query Field Extensions spqueryfield modification
The existing spqueryfield table is extended to support mapping specific columns to terms and supporting static values. This allows the term to be associated with the specific output column, rather than the query as a whole.
| Field | Description | Example |
|---|---|---|
| Term | Optional term value (IRI or term name) selected from the schema terms JSON or entered as a custom value. If set, this column is exported with the Term Name as the header. | http://rs.tdwg.org/dwc/terms/catalogNumber |
These may be in v2:
| Field | Description | Example |
|---|---|---|
| IsStatic | Boolean. If true, the `StringId` is ignored. | True |
| StaticValue | The actual static text to export if `IsStatic` is true. | "PreservedSpecimen" |
Terms
The schema terms JSON file acts as the controlled vocabulary for Darwin Core and extension terms. This file represents the version currently supported by Specify.
- System Terms: Read-only terms provided by Specify updates.
- Custom Terms: Users can add new terms at the mapping level without editing the JSON file.
Example:
```json
{
  "dwc": {
    "desc": "Darwin Core",
    "abbreviation": "dwc",
    "vocabularyURI": "http://rs.tdwg.org/dwc/terms/",
    "lastUpdated": "2023-09-18",
    "terms": [
      {"http://rs.tdwg.org/dwc/terms/eventDate": {"name": "Event Date", "mappingPath": "table:collectionObject → field:startDateOrEndDate", "description": "The date-time or interval during which a dwc:Event occurred. For occurrences, this is the date-time when the dwc:Event was recorded. Not suitable for a time in a geological context.", "termName": "eventDate"}},
      {"http://rs.tdwg.org/dwc/terms/basisOfRecord": {"name": "Basis Of Record", "mappingPath": "table:collectionObject → field:basisOfRecord", "description": "The specific nature of the data record.", "termName": "basisOfRecord"}},
      {"http://rs.tdwg.org/dwc/terms/scientificName": {"name": "Scientific Name", "mappingPath": "table:collectionObject → field:fullName", "description": "The full scientific name, with authorship and date information if known. When forming part of a dwc:Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the dwc:identificationQualifier term.", "termName": "scientificName"}},
      {"http://rs.tdwg.org/dwc/terms/occurrenceID": {"name": "Occurrence ID", "mappingPath": "table:collectionObject → field:guid", "description": "An identifier for the dwc:Occurrence (as opposed to a particular digital record of the dwc:Occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the dwc:occurrenceID globally unique.", "termName": "occurrenceId"}}
    ]
  }
}
```

See my working example, including dwc (Darwin Core) and ac (Audiovisual Core):
specify_dwc.json
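This JSON structure can also drive the Schema Config 'Darwin Core' section described in FR-01: given a Specify field's mapping path, collect every term mapped to it along with the metadata the UI needs (name, description, IRI, vocabulary). A minimal sketch, assuming the structure shown above; the function name is illustrative:

```python
def terms_for_field(schema_terms: dict, mapping_path: str) -> list:
    """Collect every term mapped to a given Specify field, with the
    name, description, IRI, and vocabulary needed by Schema Config."""
    matches = []
    for vocab in schema_terms.values():
        for entry in vocab.get("terms", []):
            for iri, term in entry.items():
                if term.get("mappingPath") == mapping_path:
                    matches.append({
                        "iri": iri,
                        "name": term.get("name"),
                        "description": term.get("description"),
                        "vocabulary": vocab.get("desc"),
                    })
    return matches
```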
Original Issue
A user interface for mapping a Specify query to Darwin Core terms. This could be used multiple times throughout the interface: to calculate MIDS levels for each record (#4604), to share data easily with GBIF and other data aggregators, and much more.