- Service AKA API
- Configuration
- APPINSIGHTS_CRAWLER
- APPINSIGHTS_INSTRUMENTATIONKEY
- APPINSIGHTS_SERVICE
- ATTACHMENT_STORE_PROVIDER
- AUTH_CURATION_TEAM
- AUTH_GITHUB_CLIENT
- AUTH_HARVEST_TEAM
- CACHING_PROVIDER
- CACHING_REDIS_SERVICE
- CRAWLER_API_AUTH_TOKEN
- CRAWLER_API_URL
- CURATION_GITHUB_REPO
- CURATION_GITHUB_OWNER
- CURATION_GITHUB_BRANCH
- CURATION_GITHUB_TOKEN
- CURATION_STORE_PROVIDER
- CURATION_MONGO_COLLECTION_NAME
- CURATION_MONGO_CONNECTION_STRING
- CURATION_PROVIDER
- CURATION_QUEUE_PROVIDER
- DEFINITION_STORE_PROVIDER
- DEFINITION_MONGO_TRIMMED_COLLECTION_NAME
- DEFINITION_MONGO_COLLECTION_NAME
- DEFINITION_MONGO_CONNECTION_STRING
- DOCKER
- DOCKER_ENABLE_CI
- HARVEST_AZBLOB_CONTAINER_NAME
- HARVEST_AZBLOB_CONNECTION_STRING
- HARVEST_STORE_PROVIDER
- HARVEST_QUEUE_PROVIDER
- HARVEST_QUEUE_PREFIX
- HARVESTER_PROVIDER
- MULTIVERSION_CURATION_FF
- NODE_ENV
- RATE_LIMIT_MAX
- RATE_LIMIT_WINDOW
- SEARCH_PROVIDER
- SEARCH_AZURE_SERVICE
- SEARCH_AZURE_API_KEY
- SERVICE_ENDPOINT
- TEMPDIR
- WEBHOOK_CRAWLER_SECRET
- WEBHOOK_GITHUB_SECRET
- WEBSITE_ENDPOINT
- WEBSITE_HTTPLOGGING_RETENTION_DAYS
- WEBSITES_ENABLE_APP_SERVICE_STORAGE
- Configuration
The environmental variables for the clearlydefined-api-dev App Service include:
- APPINSIGHTS_CRAWLER_APIKEY
- APPINSIGHTS_CRAWLER_APPLICATIONID
- APPINSIGHTS_INSTRUMENTATIONKEY
- APPINSIGHTS_SERVICE_APIKEY
- APPINSIGHTS_SERVICE_APPLICATIONID
- ATTACHMENT_STORE_PROVIDER
- AUTH_CURATION_TEAM
- AUTH_GITHUB_CLIENT_ID
- AUTH_GITHUB_CLIENT_SECRET
- AUTH_HARVEST_TEAM
- CACHING_PROVIDER
- CACHING_REDIS_API_KEY
- CACHING_REDIS_SERVICE
- CRAWLER_API_AUTH_TOKEN
- CRAWLER_API_URL
- CURATION_GITHUB_BRANCH
- CURATION_GITHUB_REPO
- CURATION_MONGO_COLLECTION_NAME
- CURATION_MONGO_CONNECTION_STRING
- CURATION_PROVIDER
- CURATION_QUEUE_PROVIDER
- CURATION_STORE_PROVIDER
- DEFINITION_MONGO_COLLECTION_NAME
- DEFINITION_MONGO_TRIMMED_COLLECTION_NAME
- DEFINITION_MONGO_CONNECTION_STRING
- DEFINITION_STORE_PROVIDER
- DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE
- DEFINITION_UPGRADE_PROVIDER
- DEFINITION_UPGRADE_QUEUE_PROVIDER
- DEFINITION_UPGRADE_QUEUE_CONNECTION_STRING
- DEFINITION_UPGRADE_QUEUE_NAME
- DOCKER_CUSTOM_IMAGE_NAME
- DOCKER_ENABLE_CI
- DOCKER_REGISTRY_SERVER_PASSWORD
- DOCKER_REGISTRY_SERVER_URL
- DOCKER_REGISTRY_SERVER_USERNAME
- HARVEST_AZBLOB_CONNECTION_STRING
- HARVEST_AZBLOB_CONTAINER_NAME
- HARVEST_QUEUE_PREFIX
- HARVEST_QUEUE_PROVIDER
- HARVESTER_PROVIDER
- NODE_ENV
- RATE_LIMIT_MAX
- RATE_LIMIT_WINDOW
- SEARCH_AZURE_API_KEY
- SEARCH_AZURE_SERVICE
- SEARCH_PROVIDER
- SERVICE_ENDPOINT
- WEBHOOK_CRAWLER_SECRET
- WEBHOOK_GITHUB_SECRET
- WEBSITE_ENDPOINT
- WEBSITE_HTTPLOGGING_RETENTION_DAYS
- WEBSITES_ENABLE_APP_SERVICE_STORAGE
That is a lot! Let's break it down.
These are used to get information from the Crawler's App Insights setup when the Service's /status API call is used (the website uses it to display a status page). The App Insights instance is called cdcrawler-dev.
This is used by a dependency called Winston. Winston is a Node.js logging library. We use an additional dependency, winston-azure-application-insights, to broadcast the logs to Azure Application Insights. This requires an instrumentation key for our Azure Application Insights setup.
These environmental variables are used by the Service to query the API's Azure Application Insights instance. It is queried when the Service's /status API call is used.
The value for this variable is "azure". This indicates that attachment data is stored in Azure, in this case in the "develop" blob container in the environment's Azure Storage account.
Harvested data is the output of the various scanning tools (scancode, licensee, ClearlyDefined). Attachments are the "interesting files" we find and want to archive (either for durability or for quick access, for example when producing a notices file).
This is a team in the ClearlyDefined GitHub Organization. This is the team that has permission to merge curations in the curations GitHub repo (which is defined in the CURATION_GITHUB_REPO environment variable) and sync them with the ClearlyDefined service.
Dev environment team: https://github.com/orgs/clearlydefined/teams/curation-dev
Production environment team: https://github.com/orgs/clearlydefined/teams/curation-prod
ClearlyDefined uses a GitHub OAuth App to authenticate users to the Service.
These define the client id and client secret for the OAuth App.
Although this does correlate to a team in the ClearlyDefined GitHub organization, it is not clear what it is used for in the Service.
Dev environment team: https://github.com/orgs/clearlydefined/teams/harvest-dev
Production environment team: https://github.com/orgs/clearlydefined/teams/harvest-prod
The Service caches definitions in a Redis cache. The cached definitions are replaced whenever a definition is updated.
This cache is an Azure Cache for Redis instance.
This is the key for the Azure Cache for Redis.
The URL for the Azure Cache for Redis.
This is a token used to authenticate the Service to send requests to the ClearlyDefined Crawler. It is the same as the CRAWLER_SERVICE_AUTH_TOKEN in the App Service configuration.
This is the URL for the App Service.
This is the GitHub repo we use to store curations:
Dev: https://github.com/clearlydefined/curated-data-dev
Production: https://github.com/clearlydefined/curated-data
This is the owner of the GitHub repo used to store curations, in this case the clearlydefined GitHub org.
This is the branch ClearlyDefined pulls curation information from, in this case the branch called master.
When it comes to curations, the ClearlyDefined service makes extensive use of the GitHub API. This is an API token that allows it to do this.
This is what service we use to store information about curations, in this case mongo.
While we store the curations in GitHub, we store information about the curations in a MongoDB collection called curations-20190227.
The curations-20190227 collection lives in the clearlydefined Mongo Database, which lives in the environment's Azure Cosmos DB account.
Azure Cosmos DB is a managed NoSQL database service in Azure.
This is the string we use to connect to the clearlydefined Mongo Database in the environment's Azure Cosmos DB account.
This is the provider we use to store curations. In this case, it is github.
The Curation Queue is where we queue up curations for ClearlyDefined to process. In this case, we use an Azure Storage Queue called curations, which is kept in the same Azure Storage Account.
We use multiple services to store definition information.
If you look at the value of this environmental variable, you will see that it is "dispatch+azure+mongoTrimmed"
dispatch indicates that we use multiple memory stores - we need to dispatch requests to both of them.
azure indicates that we store definitions in Azure Blob Storage in the environment's Azure Storage Account.
mongoTrimmed indicates that we store definitions in a Mongo collection as well, in this case in the definitions-trimmed collection in the clearlydefined database in the environment's Azure Cosmos DB account.
Mongo store is mainly used for search. Azure blob storage is our primary store for definitions.
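To make the structure of the "dispatch+azure+mongoTrimmed" value concrete, here is a minimal sketch of how such a string could be split into its parts. The helper name `parseStoreProvider` and the `StoreProviderSpec` shape are hypothetical and are not taken from the Service's code; the sketch only assumes that parts are separated by `+` and that a leading `dispatch` means requests go to every store listed after it.

```typescript
// Hypothetical illustration only; not the Service's actual implementation.
interface StoreProviderSpec {
  dispatch: boolean; // true when requests are dispatched to several stores
  stores: string[];  // store names in the order they appear in the value
}

function parseStoreProvider(value: string): StoreProviderSpec {
  // Split on "+" and drop empty fragments such as a trailing "+".
  const parts = value
    .split("+")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  if (parts.length === 0) {
    throw new Error("store provider value is empty");
  }
  if (parts[0] === "dispatch") {
    const stores = parts.slice(1);
    if (stores.length === 0) {
      throw new Error("dispatch needs at least one store after it");
    }
    return { dispatch: true, stores };
  }
  return { dispatch: false, stores: parts };
}

const devDefinitionStores = parseStoreProvider("dispatch+azure+mongoTrimmed");
// devDefinitionStores.dispatch is true
// devDefinitionStores.stores is ["azure", "mongoTrimmed"]
```

With the value described above, the first store listed (azure) is the primary blob store and the second (mongoTrimmed) is the collection that is mainly used for search.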
This is the Mongo collection which stores definitions without file information, in this case the definitions-trimmed collection in the clearlydefined database in the environment's Azure Cosmos DB account.
This was the Mongo collection which stores the entire definition information in paged format, in this case the definitions-paged collection in the clearlydefined database in the Azure Cosmos DB account. To store definition information in its entirety, use mongo in DEFINITION_STORE_PROVIDER (e.g. "dispatch+azure+mongo") and use this variable to specify the collection for storage.
This is the string we use to connect to the clearlydefined Mongo DB in the environment's Azure Cosmos DB account.
This is a string value that specifies how the service handles the definition when its schema version becomes stale.
Valid values: versionCheck, upgradeQueue
Default: versionCheck
- versionCheck: If this option is selected then the service will check the schema version and recompute the definition on-the-fly if it becomes stale.
- upgradeQueue: If this option is selected then the service will return the existing definition, and if the schema has changed, the service will queue a recompute operation. The updated definition will be returned in subsequent requests once the recomputation is completed.
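The difference between the two options can be sketched as follows. This is an illustration under stated assumptions, not the Service's code: `resolveDefinition`, `UpgradeDeps`, and the `Definition` shape are hypothetical names, and "stale" is modeled as a simple mismatch between the stored schema version and the current one.

```typescript
// Hypothetical illustration only; not the Service's actual implementation.
type UpgradeProvider = "versionCheck" | "upgradeQueue";

interface Definition {
  coordinates: string;
  schemaVersion: string;
}

// Stand-ins for the Service's recompute logic and upgrade queue.
interface UpgradeDeps {
  currentSchema: string;
  recompute: (coordinates: string) => Definition;
  enqueue: (coordinates: string) => void;
}

function resolveDefinition(
  existing: Definition,
  provider: UpgradeProvider,
  deps: UpgradeDeps
): Definition {
  // Assumption: a definition is stale when its schema version differs.
  if (existing.schemaVersion === deps.currentSchema) {
    return existing;
  }
  if (provider === "versionCheck") {
    // Recompute on the fly and return the fresh definition.
    return deps.recompute(existing.coordinates);
  }
  // upgradeQueue: schedule the recompute and return the existing definition.
  deps.enqueue(existing.coordinates);
  return existing;
}
```

The trade-off is latency versus freshness: versionCheck makes the caller wait for the recompute, while upgradeQueue answers immediately with the older definition and serves the updated one on a later request.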
This string value determines which queuing implementation will be used to queue upgrades (recomputes).
Valid values: memory, azure
Default: memory
This is a field for the connection string to the Azure Storage Queue. If no value is provided, the connection information from HARVEST_AZBLOB_CONNECTION_STRING will be used.
This string value specifies the name of the upgrade (recompute) queue. Default: definitions-upgrade
This string value defines the number of messages that will be dequeued at once from the upgrade (recompute) queue. Default: 16
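The defaults and the connection-string fallback described above can be summarized in a short sketch. The function `loadUpgradeQueueConfig` and its return shape are hypothetical; the variable names are the DEFINITION_UPGRADE_* entries from the list at the top of this page, and matching each one to its description is an assumption based on their order. The environment is passed in as a plain object so the sketch stays self-contained.

```typescript
// Hypothetical illustration only; not the Service's actual implementation.
type Env = Record<string, string | undefined>;

interface UpgradeQueueConfig {
  provider: string;                     // "memory" or "azure"
  connectionString: string | undefined; // falls back to the harvest connection string
  queueName: string;
  dequeueBatchSize: number;
}

function loadUpgradeQueueConfig(env: Env): UpgradeQueueConfig {
  const batchSize = parseInt(env["DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE"] || "", 10);
  return {
    provider: env["DEFINITION_UPGRADE_QUEUE_PROVIDER"] || "memory",
    // If no dedicated connection string is set, reuse the harvest one.
    connectionString:
      env["DEFINITION_UPGRADE_QUEUE_CONNECTION_STRING"] ||
      env["HARVEST_AZBLOB_CONNECTION_STRING"],
    queueName: env["DEFINITION_UPGRADE_QUEUE_NAME"] || "definitions-upgrade",
    dequeueBatchSize: isNaN(batchSize) ? 16 : batchSize,
  };
}
```

Calling it with an empty object yields the documented defaults: the memory provider, the definitions-upgrade queue name, and a batch size of 16.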
The Docker environmental variables define what container image is used for the Crawler, as well as what registry that image is kept in, and authentication info for the registry.
This environmental variable is used by the App Service. When this is set to "true", any time a new version of the Docker image is pushed to the registry, the App Service will automatically re-deploy.
When this setting is enabled, the App Service adds a Container registry webhook to your Azure resource group. In the case of the ClearlyDefined website, this is the webappclearlydefineddev container registry webhook. When a new version of the clearlydefined/service Docker image is pushed to the clearlydefineddev2 Azure Container registry, the webappclearlydefinedapidev webhook will POST to a /docker/hook on the clearlydefined-api-dev App Service, which will trigger a re-deploy of the service.
More information about enabling Docker CI in an Azure App Service
This is the blob container where we store information that we harvest about components.
Development: the develop blob container in the clearlydefineddev Azure Storage Account.
Production: the production blob container in the clearlydefinedprod Azure Storage Account.
This is the string we use to connect to the Azure Storage Account.
This indicates where we store our Harvest data, which in this environment is in Azure.
Development: the develop blob container in the clearlydefineddev Azure Storage Account.
Production: the production blob container in the clearlydefinedprod Azure Storage Account.
This indicates what we use to queue up components to be harvested. In dev, this is an Azure Storage Queue in the Azure Storage Account.
This is the prefix we use for queues that we use for harvesting.
For example, in the dev API we use cdcrawlerdev as the prefix for the queues we use for harvesting. This means that the queues we use for harvesting in the dev environment are:
- cdcrawlerdev-later
- cdcrawlerdev-normal
- cdcrawlerdev-soon
It is important to ensure that any other instances of production crawlers that use the same storage account use a different prefix for their queues.
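As a small illustration of the naming scheme, the sketch below derives the three queue names from a prefix and flags prefixes that are used more than once. Both helpers (`harvestQueueNames` and `findSharedPrefixes`) are hypothetical and assume only the `<prefix>-later`, `<prefix>-normal`, `<prefix>-soon` pattern shown in the list above.

```typescript
// Hypothetical illustration only; not the Service's actual implementation.
const HARVEST_QUEUE_SUFFIXES: readonly string[] = ["later", "normal", "soon"];

// Builds the queue names used for harvesting from a queue prefix.
function harvestQueueNames(prefix: string): string[] {
  return HARVEST_QUEUE_SUFFIXES.map((suffix) => `${prefix}-${suffix}`);
}

// Returns each prefix that appears more than once, listed a single time.
// Instances sharing a storage account should not share a prefix.
function findSharedPrefixes(prefixes: string[]): string[] {
  return prefixes
    .filter((prefix, index) => prefixes.indexOf(prefix) !== index)
    .filter((prefix, index, all) => all.indexOf(prefix) === index);
}

const devHarvestQueues = harvestQueueNames("cdcrawlerdev");
// devHarvestQueues is ["cdcrawlerdev-later", "cdcrawlerdev-normal", "cdcrawlerdev-soon"]
```

A check like `findSharedPrefixes` could be run over the prefixes of every crawler instance that points at one storage account; an empty result means no two instances would read from the same queues.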
This indicates what type of service we use for harvesting. In this case it is crawlerQueue, which corresponds to the crawlerQueue harvest provider.
This is a feature flag that indicates whether the Multi-version curation feature is active.
This environmental variable is used by the Express framework to indicate what environment the Express application is running in.
The ClearlyDefined Service uses the Express Rate Limit npm library to limit repeated requests to its public API.
In this case, we limit requests to the API from one IP to 500 per RATE_LIMIT_WINDOW.
This is the time window we apply the RATE_LIMIT_MAX to. It is set to 300, which is a value in seconds.
When we use this value in the code, we multiply it by 1000 to convert it to milliseconds, making it 300,000 milliseconds (or 300 seconds).
So, one IP address can only call the ClearlyDefined API 500 times every 300 seconds.
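The arithmetic can be checked with a short sketch. The function `buildRateLimitOptions` is hypothetical and is not the Service's code; it only assumes that RATE_LIMIT_WINDOW is multiplied by 1000 as described above. The result uses the `windowMs` and `max` option names that the Express Rate Limit library accepts, but the library itself is not imported here.

```typescript
// Hypothetical illustration only; not the Service's actual implementation.
type Env = Record<string, string | undefined>;

interface RateLimitOptions {
  windowMs: number; // length of the window in milliseconds
  max: number;      // requests allowed per IP within one window
}

function buildRateLimitOptions(env: Env): RateLimitOptions {
  const windowValue = parseInt(env["RATE_LIMIT_WINDOW"] || "", 10);
  const max = parseInt(env["RATE_LIMIT_MAX"] || "", 10);
  if (isNaN(windowValue) || isNaN(max)) {
    throw new Error("RATE_LIMIT_WINDOW and RATE_LIMIT_MAX must be integers");
  }
  // The configured window is multiplied by 1000 to get milliseconds.
  return { windowMs: windowValue * 1000, max };
}

const devRateLimit = buildRateLimitOptions({
  RATE_LIMIT_WINDOW: "300",
  RATE_LIMIT_MAX: "500",
});
// devRateLimit.windowMs is 300000 (300 seconds) and devRateLimit.max is 500
```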
We use Azure Cognitive Search to power ClearlyDefined's Search functionality, in this case this is indicated with the string "azure".
The name of this environment's Azure Cognitive Search service.
This is the API key we use to connect to the Azure Cognitive Search.
This is the URL used to access the ClearlyDefined Service.
Dev: https://dev-api.clearlydefined.io
Production: https://api.clearlydefined.io
The DNS for dev-api.clearlydefined.io lives in our Cloudflare account.
This is the location where temporary files are stored in the crawler. In deployment, it is the crawlerdev-file-share in the Azure Storage Account. The mount path is configured in the cdcrawler-dev App Service.
This is what the Crawler uses to authenticate to the ClearlyDefined Service API.
This is the token the webhook routes use to authenticate to the GitHub API.
This is the URL for the front-end UI of ClearlyDefined, also known as the ClearlyDefined website:
Dev: https://dev.clearlydefined.io
Production: https://clearlydefined.io
WEBSITE_HTTPLOGGING_RETENTION_DAYS does not appear to be used anywhere in the Service. It may be possible to remove it.
WEBSITES_ENABLE_APP_SERVICE_STORAGE does not appear to be used anywhere in the Service. It may be possible to remove it.