Service AKA API

Configuration

The environmental variables for the clearlydefined-api-dev App Service include:

  • APPINSIGHTS_CRAWLER_APIKEY
  • APPINSIGHTS_CRAWLER_APPLICATIONID
  • APPINSIGHTS_INSTRUMENTATIONKEY
  • APPINSIGHTS_SERVICE_APIKEY
  • APPINSIGHTS_SERVICE_APPLICATIONID
  • ATTACHMENT_STORE_PROVIDER
  • AUTH_CURATION_TEAM
  • AUTH_GITHUB_CLIENT_ID
  • AUTH_GITHUB_CLIENT_SECRET
  • AUTH_HARVEST_TEAM
  • CACHING_PROVIDER
  • CACHING_REDIS_API_KEY
  • CACHING_REDIS_SERVICE
  • CRAWLER_API_AUTH_TOKEN
  • CRAWLER_API_URL
  • CURATION_GITHUB_BRANCH
  • CURATION_GITHUB_REPO
  • CURATION_MONGO_COLLECTION_NAME
  • CURATION_MONGO_CONNECTION_STRING
  • CURATION_PROVIDER
  • CURATION_QUEUE_PROVIDER
  • CURATION_STORE_PROVIDER
  • DEFINITION_MONGO_COLLECTION_NAME
  • DEFINITION_MONGO_TRIMMED_COLLECTION_NAME
  • DEFINITION_MONGO_CONNECTION_STRING
  • DEFINITION_STORE_PROVIDER
  • DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE
  • DEFINITION_UPGRADE_PROVIDER
  • DEFINITION_UPGRADE_QUEUE_PROVIDER
  • DEFINITION_UPGRADE_QUEUE_CONNECTION_STRING
  • DEFINITION_UPGRADE_QUEUE_NAME
  • DOCKER_CUSTOM_IMAGE_NAME
  • DOCKER_ENABLE_CI
  • DOCKER_REGISTRY_SERVER_PASSWORD
  • DOCKER_REGISTRY_SERVER_URL
  • DOCKER_REGISTRY_SERVER_USERNAME
  • HARVEST_AZBLOB_CONNECTION_STRING
  • HARVEST_AZBLOB_CONTAINER_NAME
  • HARVEST_QUEUE_PREFIX
  • HARVEST_QUEUE_PROVIDER
  • HARVESTER_PROVIDER
  • NODE_ENV
  • RATE_LIMIT_MAX
  • RATE_LIMIT_WINDOW
  • SEARCH_AZURE_API_KEY
  • SEARCH_AZURE_SERVICE
  • SEARCH_PROVIDER
  • SERVICE_ENDPOINT
  • WEBHOOK_CRAWLER_SECRET
  • WEBHOOK_GITHUB_SECRET
  • WEBSITE_ENDPOINT
  • WEBSITE_HTTPLOGGING_RETENTION_DAYS
  • WEBSITES_ENABLE_APP_SERVICE_STORAGE

That is a lot! Let's break it down.

APPINSIGHTS_CRAWLER

These are used to get information from the Crawler's App Insights setup when the Service's /status API call is used (it's used by the website to display a status page). The App Insights instance is called cdcrawler-dev.

APPINSIGHTS_INSTRUMENTATIONKEY

This is used by a dependency called Winston, a Node.js logging library. We use an additional dependency, winston-azure-application-insights, to broadcast the logs to Azure Application Insights. This requires an instrumentation key for our Azure Application Insights setup.

APPINSIGHTS_SERVICE

These environmental variables are used by the Service to query the API's Azure Application Insights instance. This is queried when the Service's /status API call is used.

ATTACHMENT_STORE_PROVIDER

The value for this variable is "azure". This indicates that attachment data is stored in Azure, in this case in the "develop" blob container in the environment's Azure Storage account.

Harvested data is the data output from our various scanning tools (scancode, licensee, ClearlyDefined). Attachments are the "interesting files" we find and want to archive (either for durability or for quick access, for example when producing a notices file).

AUTH_CURATION_TEAM

This is a team in the ClearlyDefined GitHub Organization. This is the team that has permission to merge curations in the curations GitHub repo (which is defined in the CURATION_GITHUB_REPO environment variable) and sync them with the ClearlyDefined service.

  • Dev environment team: https://github.com/orgs/clearlydefined/teams/curation-dev
  • Production environment team: https://github.com/orgs/clearlydefined/teams/curation-prod

AUTH_GITHUB_CLIENT

ClearlyDefined uses a GitHub OAuth App to authenticate users to the Service.

These define the client id and client secret for the OAuth App.

AUTH_HARVEST_TEAM

Although this does correlate to a team in the ClearlyDefined GitHub organization, it is not clear what it is used for in the Service.

  • Dev environment team: https://github.com/orgs/clearlydefined/teams/harvest-dev
  • Production environment team: https://github.com/orgs/clearlydefined/teams/harvest-prod

CACHING_PROVIDER

The Service caches definitions in a Redis cache. The cached definitions are replaced whenever a definition is updated.

This cache is an Azure Cache for Redis.
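The "replace on update" behaviour described above can be sketched as follows. This is only an illustration: a plain dictionary stands in for Redis, and the class and method names (`DefinitionCache`, `on_definition_updated`) are hypothetical rather than taken from the Service's code.

```python
class DefinitionCache:
    """Toy cache keyed by component coordinates (a dict stands in for Redis)."""

    def __init__(self):
        self._entries = {}

    def get(self, coordinates, load):
        # On a miss, fall back to the slower store via the supplied loader.
        if coordinates not in self._entries:
            self._entries[coordinates] = load(coordinates)
        return self._entries[coordinates]

    def on_definition_updated(self, coordinates, definition):
        # Replace the cached copy so later reads see the new definition.
        self._entries[coordinates] = definition


cache = DefinitionCache()
loads = []


def load_from_store(coordinates):
    # Hypothetical loader; records each call so cache hits are visible.
    loads.append(coordinates)
    return {"coordinates": coordinates, "licensed": "MIT"}


cache.get("npm/npmjs/-/example/1.0.0", load_from_store)
cache.get("npm/npmjs/-/example/1.0.0", load_from_store)  # served from cache
```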

CACHING_REDIS_API_KEY

This is the key for the Azure Cache for Redis.

CACHING_REDIS_SERVICE

The URL for the Azure Cache for Redis.

CRAWLER_API_AUTH_TOKEN

This is a token the Service uses to authenticate when it sends requests to the ClearlyDefined Crawler. It is the same as the CRAWLER_SERVICE_AUTH_TOKEN in the Crawler's App Service configuration.

CRAWLER_API_URL

This is the URL for the Crawler's App Service.

CURATION_GITHUB_REPO

This is the GitHub repo we use to store curations:

  • Dev: https://github.com/clearlydefined/curated-data-dev
  • Production: https://github.com/clearlydefined/curated-data

CURATION_GITHUB_OWNER

This is the owner of the GitHub repo used to store curations, in this case the clearlydefined GitHub org.

CURATION_GITHUB_BRANCH

This is the branch ClearlyDefined pulls curation information from, in this case the branch called master.

CURATION_GITHUB_TOKEN

When it comes to curations, the ClearlyDefined service makes extensive use of the GitHub API. This is an API token that allows it to do this.

CURATION_STORE_PROVIDER

This is what service we use to store information about curations, in this case mongo.

CURATION_MONGO_COLLECTION_NAME

While we store the curations in GitHub, we store information about the Curations in a MongoDB collection, called curations-20190227.

The curations-20190227 collection lives in the clearlydefined Mongo database, which lives in the environment's Azure Cosmos DB account.

Azure Cosmos DB is a managed NoSQL database service in Azure.

CURATION_MONGO_CONNECTION_STRING

This is the string we use to connect to the clearlydefined Mongo database in the environment's Azure Cosmos DB account.

CURATION_PROVIDER

This is the provider we use to store curations. In this case, it is github.

CURATION_QUEUE_PROVIDER

The Curation Queue is where we queue up curations for ClearlyDefined to process. In this case, we use an Azure Storage Queue called curations, which is kept in the same Azure Storage Account.

DEFINITION_STORE_PROVIDER

We use multiple services to store definition information.

If you look at the value of this environmental variable, you will see that it is "dispatch+azure+mongoTrimmed".

dispatch indicates that we use multiple stores, so requests need to be dispatched to each of them.

azure indicates that we store definitions in Azure Blob Storage in the environment's Azure Storage Account.

mongoTrimmed indicates that we store definitions in a Mongo collection as well, in this case in the definitions-trimmed collection in the clearlydefined database in the environment's Azure Cosmos DB account.

The Mongo store is mainly used for search. Azure Blob Storage is our primary store for definitions.
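As a rough illustration of how a value such as "dispatch+azure+mongoTrimmed" can be read, the sketch below splits it on `+` and treats a leading `dispatch` token as "fan out to every store that follows". The helper name `parse_store_provider` is hypothetical, and the Service's own (Node.js) parsing may differ.

```python
def parse_store_provider(value):
    """Split a provider string such as "dispatch+azure+mongoTrimmed".

    Hypothetical helper: assumes a leading "dispatch" token means the
    remaining tokens are the stores that requests are dispatched to.
    """
    parts = [part for part in value.split("+") if part]
    if parts and parts[0] == "dispatch":
        return {"dispatch": True, "stores": parts[1:]}
    # A single provider (for example "azure") needs no dispatcher.
    return {"dispatch": False, "stores": parts}


config = parse_store_provider("dispatch+azure+mongoTrimmed")
print(config["stores"])
```

Under these assumptions the example yields the two stores described above, `azure` and `mongoTrimmed`.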

DEFINITION_MONGO_TRIMMED_COLLECTION_NAME

This is the Mongo collection that stores definitions without file information, in this case the definitions-trimmed collection in the clearlydefined database in the environment's Azure Cosmos DB account.

DEFINITION_MONGO_COLLECTION_NAME

This was the Mongo collection that stored the entire definition information in paged format, in this case the definitions-paged collection in the clearlydefined database in the Azure Cosmos DB account. To store definition information in its entirety, use mongo in DEFINITION_STORE_PROVIDER (e.g. "dispatch+azure+mongo") and use this variable to specify the collection for storage.

DEFINITION_MONGO_CONNECTION_STRING

This is the string we use to connect to the clearlydefined Mongo DB in the environment's Azure Cosmos DB account.

DEFINITION_UPGRADE_PROVIDER

This is a string value that specifies how the service handles the definition when its schema version becomes stale.

Valid values: versionCheck, upgradeQueue
Default: versionCheck

  • versionCheck: If this option is selected, the service checks the schema version and recomputes the definition on the fly if it has become stale.
  • upgradeQueue: If this option is selected, the service returns the existing definition and, if the schema has changed, queues a recompute operation. The updated definition is returned in subsequent requests once the recomputation is completed.
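The difference between the two options can be sketched as below. This is a simplified model, not the Service's implementation: the field names `schemaVersion` and `coordinates`, the `CURRENT_SCHEMA` value, and the `recompute` helper are all assumptions made for illustration, and a plain list stands in for the upgrade queue.

```python
CURRENT_SCHEMA = "2.0.0"  # hypothetical current schema version


def recompute(stored):
    # Stand-in for the real recompute step: only bumps the schema version.
    return {**stored, "schemaVersion": CURRENT_SCHEMA}


def get_definition(stored, provider, upgrade_queue):
    """Return a definition according to the chosen upgrade provider."""
    if stored["schemaVersion"] == CURRENT_SCHEMA:
        return stored  # not stale, nothing to do
    if provider == "versionCheck":
        # Recompute on the fly and return the fresh definition.
        return recompute(stored)
    if provider == "upgradeQueue":
        # Return the existing definition now and recompute later.
        upgrade_queue.append(stored["coordinates"])
        return stored
    raise ValueError(f"unknown upgrade provider: {provider}")


stale = {"coordinates": "npm/npmjs/-/example/1.0.0", "schemaVersion": "1.0.0"}
queue = []
fresh = get_definition(stale, "versionCheck", queue)
```

With `versionCheck` the caller pays the recompute cost on the request; with `upgradeQueue` the request stays fast and a later request sees the upgraded definition.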

DEFINITION_UPGRADE_QUEUE_PROVIDER

This string value determines which queuing implementation will be used to queue upgrades (recomputes).

Valid values: memory, azure
Default: memory

DEFINITION_UPGRADE_QUEUE_CONNECTION_STRING

This is a field for the connection string to the Azure Storage Queue. If no value is provided, the connection information from HARVEST_AZBLOB_CONNECTION_STRING will be used.

DEFINITION_UPGRADE_QUEUE_NAME

This string value specifies the name of the upgrade (recompute) queue. Default: definitions-upgrade

DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE

This string value defines the number of messages that will be dequeued at once from the upgrade (recompute) queue. Default: 16
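The defaults and the connection string fallback for the DEFINITION_UPGRADE_* variables can be summarised in a short sketch. The function name `load_upgrade_config` and the keys of the returned dictionary are hypothetical; only the variable names and default values come from the descriptions above.

```python
import os


def load_upgrade_config(env=None):
    """Read the DEFINITION_UPGRADE_* settings, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "provider": env.get("DEFINITION_UPGRADE_PROVIDER") or "versionCheck",
        "queue_provider": env.get("DEFINITION_UPGRADE_QUEUE_PROVIDER") or "memory",
        # Falls back to the harvest storage connection string when unset.
        "connection_string": env.get("DEFINITION_UPGRADE_QUEUE_CONNECTION_STRING")
        or env.get("HARVEST_AZBLOB_CONNECTION_STRING"),
        "queue_name": env.get("DEFINITION_UPGRADE_QUEUE_NAME") or "definitions-upgrade",
        "dequeue_batch_size": int(env.get("DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE") or 16),
    }


# Example with only the harvest connection string set (placeholder value).
example = load_upgrade_config({"HARVEST_AZBLOB_CONNECTION_STRING": "placeholder-connection-string"})
```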

DOCKER

The Docker environmental variables define what container image is used for the Service, as well as what registry that image is kept in, and authentication info for the registry.

DOCKER_ENABLE_CI

This environmental variable is used by the App Service. When this is set to "true", any time a new version of the Docker image is pushed to the registry, the App Service will automatically re-deploy.

When this setting is enabled, the App Service adds a Container registry webhook to your Azure resource group. In the case of the ClearlyDefined website, this is the webappclearlydefineddev container registry webhook. When a new version of the clearlydefined/service Docker image is pushed to the clearlydefineddev2 Azure Container registry, the webappclearlydefinedapidev webhook will POST to the /docker/hook endpoint on the clearlydefined-api-dev App Service, which will trigger a re-deploy of the service.

More information about enabling Docker CI in an Azure App Service

HARVEST_AZBLOB_CONTAINER_NAME

This is the blob container where we store information that we harvest about components.

  • development: the develop blob container in the clearlydefineddev Azure Storage Account.
  • production: the production blob container in the clearlydefinedprod Azure Storage Account.

HARVEST_AZBLOB_CONNECTION_STRING

This is the string we use to connect to the Azure Storage Account.

HARVEST_STORE_PROVIDER

This indicates where we store our Harvest data, which in this environment is in Azure.

  • development: the develop blob container in the clearlydefineddev Azure Storage Account.
  • production: the production blob container in the clearlydefinedprod Azure Storage Account.

HARVEST_QUEUE_PROVIDER

This indicates what we use to queue up components to be harvested. In dev, this is an Azure Storage Queue in the Azure Storage Account.

HARVEST_QUEUE_PREFIX

This is the prefix we use for queues that we use for harvesting.

For example, in the dev API we use cdcrawlerdev as the prefix for the queues we use for harvesting. This means that the queues we use for harvesting in the dev environment are:

  • cdcrawlerdev-later
  • cdcrawlerdev-normal
  • cdcrawlerdev-soon

It is important to ensure that any other instances of production crawlers that use the same storage account use a different prefix for their queues.
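The relationship between the prefix and the queue names listed above can be written as a small sketch. The function name `harvest_queue_names` is hypothetical; the three suffixes are taken from the dev queue names in this section.

```python
QUEUE_SUFFIXES = ("later", "normal", "soon")  # suffixes seen in the dev queue names


def harvest_queue_names(prefix):
    """Build the harvest queue names for a given HARVEST_QUEUE_PREFIX value."""
    return [f"{prefix}-{suffix}" for suffix in QUEUE_SUFFIXES]


dev_queues = harvest_queue_names("cdcrawlerdev")
print(dev_queues)
```

Because the names are derived only from the prefix, two crawler instances sharing a storage account and a prefix would read from the same queues, which is why distinct prefixes matter.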

HARVESTER_PROVIDER

This indicates what type of service we use for harvesting. In this case it's crawlerQueue, which corresponds with the crawlerQueue harvest provider.

MULTIVERSION_CURATION_FF

This is a feature flag that indicates whether the Multi-version curation feature is active.

NODE_ENV

This environmental variable is used by the Express framework to indicate what environment the Express application is running in.

RATE_LIMIT_MAX

The ClearlyDefined Service uses the Express Rate Limit npm library to limit repeated requests to its public API.

In this case, we limit requests to the API from one IP to 500 per RATE_LIMIT_WINDOW.

RATE_LIMIT_WINDOW

This is the time window we apply the RATE_LIMIT_MAX to. This is set to 300.

When we use this value in the code, we multiply it by 1000, so the configured value is effectively in seconds: 300 becomes 300,000 milliseconds (or 300 seconds).

So, one IP address can only call the ClearlyDefined API 500 times every 300 seconds.
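The arithmetic above, and the effect of the limit, can be illustrated with a simplified fixed-window limiter. This is a sketch only: it is not how the Express Rate Limit library is implemented, the class name `FixedWindowLimiter` is hypothetical, and the IP address is a documentation placeholder.

```python
RATE_LIMIT_MAX = 500
RATE_LIMIT_WINDOW = 300  # configured value

window_ms = RATE_LIMIT_WINDOW * 1000  # 300 * 1000 = 300,000 ms (300 seconds)


class FixedWindowLimiter:
    """Allow at most max_requests per IP within each window."""

    def __init__(self, max_requests, window_ms):
        self.max_requests = max_requests
        self.window_ms = window_ms
        self._windows = {}  # ip -> (window start in ms, request count)

    def allow(self, ip, now_ms):
        start, count = self._windows.get(ip, (now_ms, 0))
        if now_ms - start >= self.window_ms:
            start, count = now_ms, 0  # window expired, start a new one
        if count >= self.max_requests:
            self._windows[ip] = (start, count)
            return False
        self._windows[ip] = (start, count + 1)
        return True


limiter = FixedWindowLimiter(RATE_LIMIT_MAX, window_ms)
results = [limiter.allow("203.0.113.7", now_ms=0) for _ in range(501)]
```

In this model the first 500 calls inside a window are allowed, the 501st is rejected, and calls are allowed again once the window has elapsed.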

SEARCH_PROVIDER

We use Azure Cognitive Search to power ClearlyDefined's search functionality. In this case this is indicated with the string "azure".

SEARCH_AZURE_SERVICE

The name of this environment's Azure Cognitive Search service.

SEARCH_AZURE_API_KEY

This is the API key we use to connect to Azure Cognitive Search.

SERVICE_ENDPOINT

This is the URL used to access the ClearlyDefined Service.

  • dev: https://dev-api.clearlydefined.io
  • prod: https://api.clearlydefined.io

The DNS for dev-api.clearlydefined.io lives in our Cloudflare account.

TEMPDIR

This is the location where temporary files are stored in the crawler. In deployment, it is the crawlerdev-file-share in the Azure Storage Account. The mount path is configured in the cdcrawler-dev App Service.

WEBHOOK_CRAWLER_SECRET

This is what the Crawler uses to authenticate to the ClearlyDefined Service API.

WEBHOOK_GITHUB_SECRET

This is the token the webhook routes use to authenticate to the GitHub API.

WEBSITE_ENDPOINT

This is the URL for the front end UI of ClearlyDefined, also known as the ClearlyDefined website.

  • dev: https://dev.clearlydefined.io
  • production: https://clearlydefined.io

WEBSITE_HTTPLOGGING_RETENTION_DAYS

This does not appear to be used anywhere in the Service. It may be able to be removed.

WEBSITES_ENABLE_APP_SERVICE_STORAGE

This does not appear to be used anywhere in the Service. It may be able to be removed.