This repo contains prototyping work for creating an OPTIMADE API for searching and accessing structures from the Cambridge Structural Database (CSD).
The structures are accessed via the CSD Python API and cast to the OPTIMADE format; optimade-maker and optimade-python-tools are then used to launch a local OPTIMADE API.
After cloning this repository and creating a virtual environment (uv is the current recommendation), this package can be installed with
git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
uv sync --extra-index-url https://pip.ccdc.cam.ac.uk
or
git clone git@github.com:datalab-industries/csd-optimade
cd csd-optimade
pip install . --extra-index-url https://pip.ccdc.cam.ac.uk
Note that the extra index URL is required to install the csd-python-api package.
Important
Any attempts to use CSD data will additionally require a CSD license and appropriate configuration.
The CSD can be ingested into the OPTIMADE format using the csd-ingest entrypoint:
csd-ingest
This will use multiple processes (controlled by --num-processes) to ingest the
local copy of the CSD database in chunks of size --chunk-size until the target
--num-structures has been reached (defaults to the entire CSD).
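To make the chunking arithmetic concrete, here is a minimal Python sketch of how a target number of structures splits into fixed-size chunks. It illustrates the idea only and is not the package's implementation; the function name `chunk_ranges` and the numbers used are hypothetical.

```python
from typing import Iterator


def chunk_ranges(num_structures: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, stop) index ranges covering num_structures entries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_structures, chunk_size):
        # The final chunk is shortened so the target is never exceeded.
        yield start, min(start + chunk_size, num_structures)


# Hypothetical numbers for illustration: 250,000 structures in chunks of 100,000.
ranges = list(chunk_ranges(250_000, 100_000))
print(ranges)
```

Each range could then be handed to a separate worker process, which is roughly what a process pool of size --num-processes would do.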
Each batch will be written to its own OPTIMADE JSONLines file; on completion, the batches are
combined into a single JSONLines file (~ 5.5 GB for the entire CSD, or 2 GB compressed) named
<--run-name>-optimade.jsonl.
Depending on parallelisation, this process should take a few minutes to ingest the entire CSD on consumer hardware (around 10 minutes with 8 processes on an AMD Ryzen 7 PRO 7840U mobile processor, requiring around 3 GB of RAM per process with the default chunk size of 100k).
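Because the combined file is large, it is usually better to stream it line by line than to load it whole. The following sketch counts records by their "type" key. It assumes only that every non-empty line is a JSON object; the helper name `count_entry_types` and the sample lines are invented for illustration and are not real CSD data.

```python
import json
from collections import Counter
from typing import Iterable


def count_entry_types(lines: Iterable[str]) -> Counter:
    """Count JSON Lines records by their "type" key, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without a "type" key (for example a header line) are grouped together.
        if isinstance(record, dict):
            entry_type = record.get("type", "<untyped>")
        else:
            entry_type = "<untyped>"
        counts[entry_type] += 1
    return counts


# Made-up sample lines standing in for a real file; not actual CSD data.
sample_lines = [
    '{"x-optimade": {"meta": {"api_version": "1.1.0"}}}',
    '{"type": "structures", "id": "EXAMPLE-1", "attributes": {}}',
    '{"type": "structures", "id": "EXAMPLE-2", "attributes": {}}',
    "",
]
sample_counts = count_entry_types(sample_lines)
print(dict(sample_counts))
```

For a real file, pass an open file handle instead of a list, so that only one line is held in memory at a time.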
The csd-serve entrypoint provides a thin wrapper around the
optimade-maker tool,
and bundles the simple configuration required to launch a local OPTIMADE API
with a simple in-memory database (if --mongo-uri is provided, a real MongoDB
backend will be used).
Just provide the path to your combined OPTIMADE JSONLines file:
csd-serve <path-to-optimade-jsonl>
You should now be able to try out some queries locally, either in the browser or
with a tool like curl:
curl -G --data-urlencode 'filter=elements HAS "C"' http://localhost:5000/structures
For ease of deployment, a containerised version of the ingestion pipeline is available.
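The local query shown with curl above can also be issued from Python. The filter contains spaces and double quotes, so it has to be URL-encoded first. This is a minimal standard-library sketch; the helper names `build_structures_url` and `fetch_json` are hypothetical, and the base URL assumes the API is listening on port 5000 as in the curl example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_structures_url(base_url: str, optimade_filter: str) -> str:
    """Return the /structures URL with the filter percent-encoded."""
    query = urlencode({"filter": optimade_filter}, quote_via=quote)
    return f"{base_url.rstrip('/')}/structures?{query}"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON response (requires a running server)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_structures_url("http://localhost:5000", 'elements HAS "C"')
print(url)
```

With csd-serve running, `fetch_json(url)` should return an OPTIMADE response whose matching entries are listed under the "data" key.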
Important
You should verify that your license agreement allows for any kind of deployment outside of your private network; it likely does not.
To build the container from scratch, you need both a time-limited CSD installer
download link (CSD_INSTALLER_URL), and your activation key
(CSD_ACTIVATION_KEY).
Note
As of January 2025, you can request your time-limited CSD installer link at https://www.ccdc.cam.ac.uk/support-and-resources/download-the-csd/. Once you receive the email, the CSD_INSTALLER_URL should be the one listed as "CSD Portfolio Linux Online Installer (recommended, small download)".
These should be stored in a .env file that is available both at build time and runtime.
Note, managing these secrets requires a recent Docker version that includes
Buildx.
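Before starting a long build, it can help to check that the .env file defines both secrets. The sketch below assumes the file uses plain KEY=VALUE lines; the function name `missing_env_keys` and the placeholder values are hypothetical.

```python
REQUIRED_KEYS = ("CSD_INSTALLER_URL", "CSD_ACTIVATION_KEY")


def missing_env_keys(env_text: str, required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in KEY=VALUE text."""
    defined = set()
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.strip():
            defined.add(key.strip())
    return [key for key in required if key not in defined]


# Placeholder content only; real values come from the CCDC email and your license.
example_env = """
# secrets for building csd-optimade-server
CSD_INSTALLER_URL=https://example.invalid/installer
CSD_ACTIVATION_KEY=
"""
print(missing_env_keys(example_env))
```

In practice the text would be read from the real file, for example with `pathlib.Path(".env").read_text()`.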
Once configured, you can build the container with
docker build --secret id=env,src=.env --target csd-optimade-server -t csd-optimade-server .
This will install the CSD inside the container, run the ingestion pipeline and
prepare an encrypted version of the CSD in the OPTIMADE JSONLines format.
The file can be decrypted with your CSD_ACTIVATION_KEY.
To launch the container (which will decrypt the file and start the OPTIMADE API locally):
docker run --env-file .env -p 5000:5000 csd-optimade-server
For development and deployment, you may prefer to use the bake definitions in
docker-bake.hcl to build and tag the relevant build stages:
docker buildx bake csd-optimade-server
docker run --env-file .env -p 5000:5000 ghcr.io/datalab-industries/csd-optimade-server
As noted above, the CSD_ACTIVATION_KEY used to build the container must be provided at runtime.
The API container can also be configured with all the OPTIMAKE_ prefixed environment variables.
The most important ones are listed here:
- OPTIMAKE_MONGO_URI: to use a persistent MongoDB backend, you can provide a MONGO_URI via:
  OPTIMAKE_DATABASE_BACKEND=mongodb OPTIMAKE_MONGO_URI=mongodb://mongodb_server:27017/optimade
- OPTIMAKE_BASE_URL: to set the base URL of the API (used to generate pagination links), you can provide a BASE_URL via:
  OPTIMAKE_BASE_URL=https://my-csd-deployment.com
Finally, if using a persistent database, future runs of the API can be controlled with the CSD_OPTIMADE_INSERT environment variable.
If true (default), the configured database will be wiped and rebuilt from the JSONL file directly, and a separate process will run the API.
If false, only the API will be started, with no database rebuild.
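The environment variables described above can also be passed individually on the command line using Docker's `-e` flag. The following sketch assembles such a `docker run` invocation as an argument list without executing it. The helper name `docker_run_args` is hypothetical, and the MongoDB URI is the example value from the list above rather than a real server.

```python
import shlex
from typing import Optional


def docker_run_args(
    image: str,
    env_file: str = ".env",
    port: int = 5000,
    extra_env: Optional[dict] = None,
) -> list[str]:
    """Build the argument list for `docker run` with optional -e overrides."""
    args = ["docker", "run", "--env-file", env_file, "-p", f"{port}:{port}"]
    for key, value in (extra_env or {}).items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


# A later run against an already populated database, skipping the rebuild.
args = docker_run_args(
    "csd-optimade-server",
    extra_env={
        "OPTIMAKE_DATABASE_BACKEND": "mongodb",
        "OPTIMAKE_MONGO_URI": "mongodb://mongodb_server:27017/optimade",
        "CSD_OPTIMADE_INSERT": "false",
    },
)
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` once the image has been built.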
Note
When used in production with the full CSD database, performance will be
significantly improved by creating the appropriate indexes for queryable
fields in MongoDB. This may be partially handled by optimade-maker and
optimade-python-tools, but you may wish to also tune index performance
for your particular use case.
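As a starting point for index tuning, the sketch below creates one single-field index per queryable field. It relies only on the `create_index` method that pymongo collections provide, and uses a small stand-in class so that it runs without a database. The field names are standard OPTIMADE structure properties, but whether they are stored as top-level keys in your MongoDB documents is an assumption to verify against your own deployment.

```python
from typing import Iterable


def ensure_indexes(collection, fields: Iterable[str]) -> list[str]:
    """Create a single-field index for each field and return the index names."""
    return [collection.create_index(field) for field in fields]


class RecordingCollection:
    """Stand-in for a pymongo collection that only records requested indexes."""

    def __init__(self) -> None:
        self.requested: list[str] = []

    def create_index(self, keys: str) -> str:
        self.requested.append(keys)
        # Mimics the default name pymongo gives an ascending single-field index.
        return f"{keys}_1"


# Assumed field names; adjust to the queries your users actually run.
QUERYABLE_FIELDS = ["elements", "nelements", "chemical_formula_reduced"]

fake = RecordingCollection()
index_names = ensure_indexes(fake, QUERYABLE_FIELDS)
print(index_names)
```

With pymongo installed, the same function could be pointed at a real collection such as `MongoClient(uri)["optimade"]["structures"]`, where the database and collection names are assumptions that depend on your configuration.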
All development of this package (bug reports, suggestions, feedback and pull requests) occurs in the csd-optimade GitHub repository. Contribution guidelines and tips for getting help can be found in the contributing notes.
This project was developed by datalab industries ltd., on behalf of the UK's Physical Sciences Data Infrastructure (PSDI), supported by the Cambridge Crystallographic Data Centre (CCDC).