Add runbook for Sectigo root CA rotation #3944
Great work on this! I left some comments, mostly on the Linux script. tbh PowerShell is totally unfamiliar to me, so we might want someone else to look at it.
|
|
||
| The scripts require outbound HTTPS connectivity to: | ||
|
|
||
| - `raw.githubusercontent.com` (to download the certificate) |
Is this something we would want to have available on our actual domain, like our Linux install script, in case of firewall restrictions? I'm not sure how we would go about doing that, but the build team probably knows.
| elif command -v wget &>/dev/null; then | ||
| DOWNLOADER="wget" | ||
| else | ||
| install_curl |
a couple thoughts on this:
- if curl or wget don't exist, how did they download the install script to the host?
- I don't feel good about installing curl or wget if the customer has explicitly removed it (not sure why they would), imo this should be part of the prerequisites to running this so the customer can proceed how they want to
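For illustration, a minimal sketch of the prerequisite approach: fail with a clear message instead of calling `install_curl`. `DOWNLOADER` comes from the snippet under review; the body of `error_exit` is an assumption, re-declared here only so the example is self-contained.

```bash
#!/usr/bin/env bash
# Assumed shape of the script's error helper; the real one may differ.
error_exit() {
    echo "$1" >&2
    exit 1
}

# Pick a downloader that is already present; never install one on the customer's behalf.
check_downloader() {
    if command -v curl >/dev/null 2>&1; then
        DOWNLOADER="curl"
    elif command -v wget >/dev/null 2>&1; then
        DOWNLOADER="wget"
    else
        error_exit "Error: curl or wget is required. Install one of them and re-run this script."
    fi
}
```

With this shape the README would list curl or wget under prerequisites, and the customer decides how to satisfy it.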
| } | ||
|
|
||
| ensure_target_directory() { | ||
| if ! sudo test -d "$TARGET_DIR"; then |
the command is being run with sudo, I'm not sure we need to prefix these commands as such
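One option, sketched under assumptions: check for root once at the top of the script and drop the per-command `sudo`. `is_root` and `require_root` are hypothetical helper names, not part of the script under review.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the given UID is 0.
# The UID defaults to the current user's and can be passed in to make the check testable.
is_root() {
    local uid="${1:-$(id -u)}"
    [ "$uid" -eq 0 ]
}

# Hypothetical guard to call once before any privileged work.
require_root() {
    if ! is_root "$@"; then
        echo "Error: this script must be run as root (for example with sudo)." >&2
        return 1
    fi
}
```

Calling `require_root || exit 1` at the start of `main` would then make the inner `sudo` prefixes redundant.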
| ensure_target_directory() { | ||
| if ! sudo test -d "$TARGET_DIR"; then | ||
| echo "Directory $TARGET_DIR does not exist. Creating it..." | ||
| sudo mkdir -p "$TARGET_DIR" || error_exit "Error: Failed to create $TARGET_DIR." |
do we want /opt/datadog-agent/agent to be owned by root or the dd-agent user? I'm not familiar with agent 5 installations but agent 7 installations make this owned by dd-agent
possibly this should be run as sudo -u dd-agent mkdir -p "$TARGET_DIR"?
jhanna@fedora:~$ ls -al /opt
total 0
drwxr-xr-x. 1 root root 58 Jul 16 12:09 .
dr-xr-xr-x. 1 root root 180 Sep 29 12:01 ..
drwxr-xr-x. 1 dd-agent dd-agent 472 Sep 29 12:01 datadog-agent
I'm not sure about this behavior for Agent 5. I tried another way; let me know what you think of it.
| main() { | ||
| check_downloader | ||
| ensure_target_directory | ||
| download_certificate |
should we test that the downloaded certificate works here via wget --ca-certificate or curl --cacert before restarting the agent? a failure here could indicate a problem earlier in the update process
good point, added something
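One possible shape for such a check, as a sketch only: the `verify_certificate` name and the default URL are assumptions, and `DOWNLOADER` is taken from the snippet under review.

```bash
#!/usr/bin/env bash
# Sketch: confirm a request succeeds with the downloaded bundle before restarting.
# The default URL is an assumption; use whichever endpoint the Agent actually talks to.
verify_certificate() {
    local cert_file="$1"
    local url="${2:-https://app.datadoghq.com}"

    # An empty or missing bundle can never validate anything.
    [ -s "$cert_file" ] || return 1

    # --cacert (curl) and --ca-certificate (wget) both point the client at a CA bundle.
    # A non-zero exit means the request failed (TLS verification or otherwise).
    case "${DOWNLOADER:-}" in
        curl) curl --silent --show-error --output /dev/null --cacert "$cert_file" "$url" ;;
        wget) wget --quiet --output-document=/dev/null --ca-certificate="$cert_file" "$url" ;;
        *)    return 2 ;;  # no supported downloader selected
    esac
}
```

Called between `download_certificate` and the restart, a non-zero status would let the script stop before touching the running Agent.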
| while IFS= read -r line; do | ||
| log_file="${line%%:*}" | ||
| pre_size="${line##*:}" | ||
| if sudo test -f "$log_file"; then | ||
| if [ "$pre_size" -ge 0 ]; then | ||
| sudo tail -c "+$((pre_size+1))" "$log_file" 2>/dev/null | grep -qiE "$pattern" \ | ||
| && error_exit "Detected SSL/cert verification failure in $(basename "$log_file") since restart." | ||
| else | ||
| sudo tail -n 500 "$log_file" 2>/dev/null | grep -qiE "$pattern" \ | ||
| && error_exit "Detected SSL/cert verification failure in new $(basename "$log_file") since restart." | ||
| fi | ||
| fi | ||
| done < "$OFFSETS_FILE" |
why are we saving and reading log offsets to a file?
it was mainly to only look at the logs written since the restart, to avoid false positives from older errors
| if [ "$pre_size" -ge 0 ]; then | ||
| sudo tail -c "+$((pre_size+1))" "$log_file" 2>/dev/null | grep -qiE "$pattern" \ | ||
| && error_exit "Detected SSL/cert verification failure in $(basename "$log_file") since restart." | ||
| else | ||
| sudo tail -n 500 "$log_file" 2>/dev/null | grep -qiE "$pattern" \ | ||
| && error_exit "Detected SSL/cert verification failure in new $(basename "$log_file") since restart." | ||
| fi |
imo we should just rotate these log files instead of trying to manage an offset, if something bad happens it'd be nice to just have a "here is everything since the attempted restart" log for customers to give us / troubleshoot with
if we can't rotate the logs, another option might be to open these in a tail -f stream for 30s or so on the restart instead of trying to find our point in time, but that is more difficult to troubleshoot
really good idea, I applied it, let me know if this fits your idea
Personally, I do not mind the offset mechanism since it is fairly well scoped and not too complicated in this use case. The problem with log rotation is cleanup and bloat over time. Customers really do not like dirty hosts on uninstall.
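For reference, the offset approach discussed in this thread can be sketched in isolation as below. `record_offsets` and `new_errors_since` are hypothetical names, and the sketch omits `sudo` and `error_exit` so that it stays self-contained; it is not the script's actual implementation.

```bash
#!/usr/bin/env bash
# Before the restart: write "path:size" for each log, or "path:-1" if it does not exist yet.
record_offsets() {
    local offsets_file="$1"
    shift
    : > "$offsets_file"
    local log_file
    for log_file in "$@"; do
        if [ -f "$log_file" ]; then
            printf '%s:%s\n' "$log_file" "$(wc -c < "$log_file" | tr -d ' ')" >> "$offsets_file"
        else
            printf '%s:-1\n' "$log_file" >> "$offsets_file"
        fi
    done
}

# After the restart: return 0 if any log contains the pattern past its recorded size.
new_errors_since() {
    local offsets_file="$1" pattern="$2" line log_file pre_size
    while IFS= read -r line; do
        log_file="${line%:*}"    # everything before the last colon
        pre_size="${line##*:}"   # everything after the last colon
        [ -f "$log_file" ] || continue
        if [ "$pre_size" -ge 0 ]; then
            # Only look at bytes appended since the offset was recorded.
            tail -c "+$((pre_size + 1))" "$log_file" | grep -qiE "$pattern" && return 0
        else
            # The log did not exist before the restart, so all of it is new.
            grep -qiE "$pattern" "$log_file" && return 0
        fi
    done < "$offsets_file"
    return 1
}
```

One caveat of this design: if a log is rotated or truncated between the two calls, the recorded offset can point past the end of the new file and fresh errors would be missed.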
| } | ||
| $filtered += $line | ||
| } | ||
| $filtered += 'use_curl_http_client: true' |
does the .conf file care about indentation levels (is it yaml)?
yep this is yaml
| } | ||
|
|
||
| function Enable-Tls12 { | ||
| try { [Net.ServicePointManager]::SecurityProtocol = [Net.ServicePointManager]::SecurityProtocol -bor [Net.SecurityProtocolType]::Tls12 } catch { } |
It is probably better to at least write a warning on failure.
| $filtered += 'use_curl_http_client: true' | ||
| $updated = (($filtered -join "`r`n") + "`r`n") | ||
|
|
||
| try { $updated | Set-Content -LiteralPath $ConfFile -Encoding ASCII } |
Probably UTF8 is better.
| Write-Warning "Failed to restart service '$name': $($_.Exception.Message)" | ||
| } | ||
| } | ||
| if (-not $restarted) { Error-Exit "Error: Failed to restart the Datadog Agent (service not found or restart failed)." } |
If the intention is to Error-Exit if any service fails to restart, this condition will not work.
In the foreach loop, the first service that restarts successfully permanently sets $restarted = $true, so any subsequent failures are ignored.
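To make the intent concrete, here is the logic sketched in shell rather than PowerShell (a PowerShell fix could follow the same shape): count failures separately instead of relying on a single success flag. `restart_service` and `restart_all` are hypothetical names, and this assumes the intention really is to fail when any service fails to restart.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real restart call; replace with the actual command.
restart_service() {
    systemctl restart "$1"
}

# Restart every named service. Succeed only if at least one restarted and none failed.
restart_all() {
    local restarted=0 failed=0 name
    for name in "$@"; do
        if restart_service "$name"; then
            restarted=$((restarted + 1))
        else
            echo "Warning: failed to restart service '$name'" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$restarted" -gt 0 ] && [ "$failed" -eq 0 ]
}
```

A later failure can no longer be hidden by an earlier success, because the two outcomes are tracked in separate counters.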
Hi there, thanks for looping the docs team into this PR. I have left some comments/suggestions. Let me know if you have any questions!
|
|
||
| ## Why This Matters | ||
|
|
||
| Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the agent to lose connectivity with Datadog. |
| Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the agent to lose connectivity with Datadog. | |
| Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the Agent to lose connectivity with Datadog. |
| This runbook provides automated scripts for both Linux and Windows that will: | ||
|
|
||
| 1. Download and install an updated certificate bundle | ||
| 2. Configure the agent to use your operating system's certificate store as a fallback |
| 2. Configure the agent to use your operating system's certificate store as a fallback | |
| 2. Configure the Agent to use your operating system's certificate store as a fallback |
| If certificate errors persist after running the script: | ||
|
|
||
| 1. Verify your operating system is receiving security updates | ||
| 2. Check the agent logs for detailed error messages: |
| 2. Check the agent logs for detailed error messages: | |
| 2. Check the Agent logs for detailed error messages: |
|
|
||
| ## Verification | ||
|
|
||
| After running the script, verify your agent is reporting metrics: |
| After running the script, verify your agent is reporting metrics: | |
| After running the script, verify your Agent is reporting metrics: |
| 2. Check your host in the Datadog Infrastructure List | ||
| 3. Verify the "Last Seen" timestamp is recent | ||
|
|
||
| You can also manually check agent status: |
| You can also manually check agent status: | |
| You can also manually check the Agent status: |
|
|
||
| ## Long-Term Recommendation | ||
|
|
||
| While this runbook provides a working solution, **we strongly recommend upgrading to Datadog Agent v6 or v7**: |
| While this runbook provides a working solution, **we strongly recommend upgrading to Datadog Agent v6 or v7**: | |
| While this runbook provides a working solution, **Datadog strongly recommends upgrading to Datadog Agent v6 or v7** to benefit from: |
| - Improved performance and new features | ||
| - Long-term support | ||
|
|
||
| Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/). |
| Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/). | |
| Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/guide/upgrade_agent_fleet_automation). |
This new link directs users to steps for upgrading their Agent. See if there is a guide here that you deem more appropriate for this section.
| 1. Your Agent version | ||
| 2. Operating system and version | ||
| 3. Complete output from the script | ||
| 4. Recent agent log excerpts showing any errors |
| 4. Recent agent log excerpts showing any errors | |
| 4. Recent Agent log excerpts showing any errors |
|
lgtm
Description
This pull request adds a new runbook under `runbooks/sectigo-root-ca-rotation-2025/` to assist users of Datadog Agent v5 in resolving connectivity issues caused by upcoming SSL certificate authority changes (Sectigo root CA rotation).
The runbook provides step-by-step automated scripts for both Linux and Windows environments that:
- `use_curl_http_client: true`
This ensures continued connectivity for Agent v5 instances affected by the root CA update.
Motivation
Older versions of Datadog Agent v5 use a bundled certificate store that does not include the new Sectigo root CA.
Without intervention, these agents may lose the ability to send data to Datadog intake endpoints after the rotation.
This runbook provides an automated remediation path for affected users.
Testing Guidelines
TODO
Additional Notes