
Commit 3a5f7c2

copy in information that came up in the failure scenarios (#1262)
2 parents b905b1b + c8ef97f commit 3a5f7c2


3 files changed, +41 -9 lines changed

docs/configuration/environment-variables.md

+1 -1
@@ -61,7 +61,7 @@ A list of strings representing the host/domain names that this Django site can s
 
 !!! warning "Deployment configuration"
 
-You may change this setting when deploying the app to a non-localhost domain
+Do not enable this in production
 
 !!! tldr "Django docs"

docs/deployment/infrastructure.md

+2
@@ -78,6 +78,8 @@ The following things in Azure are managed by the California Department of Techno
 - [Resource Groups](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal)
 - Networking
 - Front Door
+- Web Application Firewall (WAF)
+- Distributed denial-of-service (DDoS) protection
 - IAM
 - Service connections

docs/deployment/troubleshooting.md

+38 -8
@@ -1,14 +1,16 @@
 # Troubleshooting
 
-## Monitoring
+## Tools
+
+### Monitoring
 
 We have [ping tests](https://docs.microsoft.com/en-us/azure/azure-monitor/app/monitor-web-app-availability) set up to notify about availability of each [environment](../infrastructure/#environments). Alerts go to [#benefits-notify](https://cal-itp.slack.com/archives/C022HHSEE3F).
 
-## Logs
+### Logs
 
 Logs can be found in a couple of places:
 
-### Azure App Service Logs
+#### Azure App Service Logs
 
 [Open the `Logs` for the environment you are interested in.](https://docs.google.com/document/d/11EPDIROBvg7cRtU2V42c6VBxcW_o8HhcyORALNtL_XY/edit#heading=h.6pxjhslhxwvj) The following tables are likely of interest:

@@ -18,7 +20,7 @@ Logs can be found in a couple of places:
 
 For some pre-defined queries, click `Queries`, then `Group by: Query type`, and look under `Query pack queries`.
 
-### [Azure Monitor Logs](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/data-platform-logs)
+#### [Azure Monitor Logs](https://docs.microsoft.com/en-us/azure/azure-monitor/logs/data-platform-logs)
 
 [Open the `Logs` for the environment you are interested in.](https://docs.google.com/document/d/11EPDIROBvg7cRtU2V42c6VBxcW_o8HhcyORALNtL_XY/edit#heading=h.n0oq4r1jo7zs)

@@ -31,19 +33,23 @@ In the latter two, you should see recent log output. Note [there is some latency
 
 See [`Failures`](https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-exceptions#diagnose-failures-using-the-azure-portal) in the sidebar (or `exceptions` under `Logs`) for application errors/exceptions.
 
-### Live tail
+#### Live tail
 
 After [setting up the Azure CLI](#making-changes), you can use the following command to [stream live logs](https://docs.microsoft.com/en-us/azure/app-service/troubleshoot-diagnostic-logs#in-local-terminal):
 
 ```sh
 az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck
 ```
 
-### SCM
+#### SCM
 
 <https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker>
 
-## Terraform lock
+## Specific issues
+
+This section serves as the [runbook](https://www.pagerduty.com/resources/learn/what-is-a-runbook/) for Benefits.
+
+### Terraform lock
 
 [General info](https://developer.hashicorp.com/terraform/language/state/locking)

@@ -54,7 +60,29 @@ If Terraform commands fail (locally or in the Pipeline) due to an `Error acquiri
 1. **Do any engineers have a Terraform command running locally?** You'll need to ask them. For example: They may have started an `apply` and it's sitting waiting for them to [approve](https://developer.hashicorp.com/terraform/cli/commands/apply#automatic-plan-mode) it. They will need to (gracefully) exit for the lock to be released.
 1. **If none of the steps above identified the source of the lock**, and especially if the `Created` time is more than ten minutes ago, that probably means the last Terraform command didn't release the lock. You'll need to grab the `ID` from the `Lock Info` output and [force unlock](https://developer.hashicorp.com/terraform/language/state/locking#force-unlock).
 
-## Eligibility Server
+### App fails to start
+
+If the container fails to start, you should see a [downtime alert](#monitoring). Assuming this app version was working in another [environment](../infrastructure/#environments), the issue is likely due to misconfiguration. Some things you can do:
+
+- Check the [logs](#logs)
+- Ensure the [environment variables](../../configuration/environment-variables/) and [configuration data](../../configuration/data/) are set properly.
+- [Turn on debugging](../../configuration/environment-variables/#django_debug)
+- Force-push/revert the [environment](../infrastructure/#environments) branch back to the old version to roll back
+
+### Littlepay API issue
+
+Littlepay API issues may show up as:
+
+- The [monitor](https://github.com/cal-itp/benefits/actions/workflows/check-api.yml) failing
+- The `Connect your card` button doesn't work
+
+A common problem that causes Littlepay API failures is that the certificate expired. To resolve:
+
+1. Reach out to <[email protected]>
+1. Receive a new certificate
+1. Put that certificate into the [configuration data](../../configuration/data/) and/or the [GitHub Actions secrets](https://github.com/cal-itp/benefits/settings/secrets/actions)
+
+### Eligibility Server
 
 If the Benefits application gets a 403 error when trying to make API calls to the [Eligibility Server](https://docs.calitp.org/eligibility-server/), it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.
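The last Terraform lock step in the hunk above (grab the `ID` from `Lock Info`, then force unlock) could be sketched roughly as follows. This is a minimal sketch, assuming the failed command's output was saved to a file and that it contains Terraform's usual `ID:  <id>` line. The helper name `lock_id_from_output` and the file name `terraform-output.txt` are hypothetical.

```sh
# Hypothetical helper: print the lock ID found in saved Terraform error output.
# Assumes a "Lock Info:" block containing a line such as "  ID:        <id>";
# scanning every field also tolerates a leading box-drawing prefix on the line.
lock_id_from_output() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }' "$1"
}

# Usage (not run here): first confirm nobody else holds the lock, then
#   terraform force-unlock "$(lock_id_from_output terraform-output.txt)"
```

`terraform force-unlock` asks for confirmation before releasing the lock, which gives one more chance to check that the `ID` matches the `Lock Info` in the error.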

@@ -64,3 +92,5 @@ If the Benefits application gets a 403 error when trying to make API calls to th
 1. Click `Edit`
 1. Click `Variables`
 1. Update the relevant variable with the new list of CIDRs
+
+Note there is nightly downtime as the Eligibility Server restarts and loads new data.
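
For the Eligibility Server firewall update above, the app's current outbound IP addresses can be read with the Azure CLI and turned into a CIDR list. This is a minimal sketch, assuming the addresses come back comma-separated and that the pipeline variable accepts one `/32` CIDR per address. The helper name `ips_to_cidrs` and that variable format are assumptions, not something the runbook specifies. The resource names are copied from the live tail command and may differ per environment.

```sh
# Hypothetical helper: turn a comma-separated list of IPv4 addresses
# into a comma-separated list of single-host (/32) CIDRs.
ips_to_cidrs() {
  printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d; s|$|/32|' | paste -sd, -
}

# Usage (not run here; resource names are assumed to match the target app):
#   ips=$(az webapp show --resource-group RG-CDT-PUB-VIP-CALITP-P-001 \
#     --name AS-CDT-PUB-VIP-CALITP-P-001 --query outboundIpAddresses --output tsv)
#   ips_to_cidrs "$ips"
```

Compare the result with the variable's current value before editing it, since the 403 may have another cause.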
