@louis-cqrl

Description

This pull request adds a new runbook under runbooks/sectigo-root-ca-rotation-2025/ to assist users of Datadog Agent v5 in resolving connectivity issues caused by upcoming SSL certificate authority changes (Sectigo root CA rotation).

The runbook provides step-by-step automated scripts for both Linux and Windows environments that:

  • Download and install the updated Datadog certificate bundle
  • Update the agent configuration to enable use_curl_http_client: true
  • Restart the Datadog Agent and verify connectivity
  • Detect and report SSL verification errors since restart

This ensures continued connectivity for Agent v5 instances affected by the root CA update.
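
As a rough illustration of the configuration step on Linux, a minimal sketch follows. It assumes the Agent v5 config file is at `/etc/dd-agent/datadog.conf` (overridable through the first argument); the function name `enable_curl_http_client` and the backup naming scheme are hypothetical and not taken from the PR's script.

```shell
#!/usr/bin/env bash
# Sketch only: back up an Agent v5 config file, then make sure it contains
# exactly one "use_curl_http_client: true" line.
# Assumption: the config lives at /etc/dd-agent/datadog.conf unless overridden.

enable_curl_http_client() {
    local conf="${1:-/etc/dd-agent/datadog.conf}"
    local backup tmp
    backup="${conf}.bak.$(date +%Y%m%d%H%M%S)"

    [ -f "$conf" ] || { echo "Error: $conf not found." >&2; return 1; }
    cp -p "$conf" "$backup" || { echo "Error: backup of $conf failed." >&2; return 1; }

    tmp="$(mktemp)" || return 1
    # Drop any existing (possibly commented-out) setting, then append the new one.
    grep -vE '^[[:space:]]*#?[[:space:]]*use_curl_http_client[[:space:]]*:' "$conf" > "$tmp"
    echo 'use_curl_http_client: true' >> "$tmp"
    # Write through the existing file so its owner and permissions are kept.
    cat "$tmp" > "$conf" && rm -f "$tmp"
    echo "Updated $conf (backup: $backup)"
}
```

The sketch appends the setting at the end of the file, as the PowerShell snippet in this PR does; whether that is the right position depends on how the file is laid out on a given host.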

Motivation

Older versions of Datadog Agent v5 use a bundled certificate store that does not include the new Sectigo root CA.
Without intervention, these agents may lose the ability to send data to Datadog intake endpoints after the rotation.
This runbook provides an automated remediation path for affected users.

Testing Guidelines

TODO

Additional Notes

  • This runbook is intended as a temporary mitigation for legacy Agent v5 users.
  • Datadog strongly recommends upgrading to Agent v6 or v7 for long-term compatibility and security updates.
  • The runbook creates backups of configuration files and prints detailed error messages if issues occur.

@louis-cqrl requested a review from a team as a code owner on October 15, 2025, 13:55

@jeremy-hanna left a comment

Great work on this! I left some comments, mostly on the Linux script. tbh PowerShell is totally unfamiliar to me, so we might want someone else to look at that part


The scripts require outbound HTTPS connectivity to:

- `raw.githubusercontent.com` (to download the certificate)

is this something we would want to have available on our actual domain like our linux install script in case of firewall restrictions? I'm not sure how we would go about doing that but the build team probably knows

    elif command -v wget &>/dev/null; then
        DOWNLOADER="wget"
    else
        install_curl

a couple thoughts on this:

  • if curl or wget don't exist, how did they download the install script to the host?
  • I don't feel good about installing curl or wget if the customer has explicitly removed it (not sure why they would), imo this should be part of the prerequisites to running this so the customer can proceed how they want to
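
One way to treat the downloader as a prerequisite rather than installing it is sketched below. The function name `pick_downloader` is hypothetical; it only reports which tool is available and fails with a clear message otherwise, leaving the decision to install anything to the customer.

```shell
#!/usr/bin/env bash
# Sketch only: report an available downloader, never install one.

pick_downloader() {
    if command -v curl >/dev/null 2>&1; then
        echo "curl"
    elif command -v wget >/dev/null 2>&1; then
        echo "wget"
    else
        echo "Error: neither curl nor wget is installed." >&2
        echo "Install one of them and re-run this script." >&2
        return 1
    fi
}

# Possible usage inside the script:
#   DOWNLOADER="$(pick_downloader)" || exit 1
```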

}

ensure_target_directory() {
    if ! sudo test -d "$TARGET_DIR"; then

the command is being run with sudo, I'm not sure we need to prefix these commands as such
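
If the script is always expected to run as root, a single check at the top could replace the per-command `sudo` prefixes. A minimal sketch, where `require_root` is a hypothetical helper and the optional UID argument exists only to make the check easy to exercise:

```shell
#!/usr/bin/env bash
# Sketch only: verify root once instead of prefixing each command with sudo.

require_root() {
    # Defaults to the effective UID of the current process.
    local uid="${1:-$(id -u)}"
    if [ "$uid" -ne 0 ]; then
        echo "Error: this script must be run as root (for example with sudo)." >&2
        return 1
    fi
}

# Possible usage at the top of main():
#   require_root || exit 1
# After that, plain "test -d" and "mkdir -p" calls need no sudo prefix.
```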

ensure_target_directory() {
    if ! sudo test -d "$TARGET_DIR"; then
        echo "Directory $TARGET_DIR does not exist. Creating it..."
        sudo mkdir -p "$TARGET_DIR" || error_exit "Error: Failed to create $TARGET_DIR."

do we want /opt/datadog-agent/agent to be owned by root or the dd-agent user? I'm not familiar with agent 5 installations but agent 7 installations make this owned by dd-agent

possibly this should be run as sudo -u dd-agent mkdir -p "$TARGET_DIR"?

jhanna@fedora:~$ ls -al /opt
total 0
drwxr-xr-x. 1 root     root      58 Jul 16 12:09 .
dr-xr-xr-x. 1 root     root     180 Sep 29 12:01 ..
drwxr-xr-x. 1 dd-agent dd-agent 472 Sep 29 12:01 datadog-agent
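
An alternative to hard-coding either root or dd-agent is to give a newly created directory the same owner as its nearest existing ancestor. The sketch below is an illustration under assumptions, not what the PR does: `nearest_owner` and `ensure_dir_like_parent` are hypothetical names, and `stat -c` is GNU coreutils syntax.

```shell
#!/usr/bin/env bash
# Sketch only: create a directory owned like its nearest existing ancestor.

# Print "uid:gid" of the nearest existing ancestor of a path (GNU stat).
nearest_owner() {
    local path="$1"
    while [ ! -e "$path" ] && [ "$path" != "/" ]; do
        path="$(dirname "$path")"
    done
    stat -c '%u:%g' "$path"
}

ensure_dir_like_parent() {
    local dir="$1" owner
    [ -d "$dir" ] && return 0
    owner="$(nearest_owner "$dir")" || return 1
    mkdir -p "$dir" || return 1
    # Changing the owner requires root unless the owner is already correct.
    # Only the leaf directory is adjusted here.
    chown "$owner" "$dir"
}
```

With the listing above, a missing directory under `/opt/datadog-agent` would end up owned by whatever account owns `/opt/datadog-agent` on that host.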

Author

I'm not sure about this behavior for Agent 5. I tried another approach; let me know what you think of it

main() {
    check_downloader
    ensure_target_directory
    download_certificate

should we test that the downloaded certificate works here via wget --ca-certificate or curl --cacert before restarting the agent? a failure at this point would surface a problem earlier in the update process
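
A pre-restart check along these lines could look like the sketch below. It is an assumption-laden illustration: `verify_bundle` is a hypothetical name, and `https://app.datadoghq.com` is only a placeholder for whichever Datadog endpoint the script should validate against.

```shell
#!/usr/bin/env bash
# Sketch only: confirm a downloaded CA bundle can verify a TLS endpoint
# before the Agent is restarted.

verify_bundle() {
    local bundle="$1"
    local url="${2:-https://app.datadoghq.com}"   # assumed endpoint, adjust as needed

    # Cheap sanity check first: the file must exist and look like PEM.
    if [ ! -s "$bundle" ] || ! grep -q -- '-----BEGIN CERTIFICATE-----' "$bundle"; then
        echo "Error: $bundle is missing or is not a PEM bundle." >&2
        return 1
    fi

    # --cacert tells curl to verify the peer against this bundle instead of
    # its default CA file. Any curl failure, including network errors, is
    # reported as a failed check.
    if ! curl --silent --show-error --max-time 15 --output /dev/null \
            --cacert "$bundle" "$url"; then
        echo "Error: request to $url failed using $bundle." >&2
        return 1
    fi
}

# Possible usage before restarting the Agent:
#   verify_bundle "$TARGET_DIR/datadog-cert.pem" || error_exit "Certificate check failed."
```

The file name in the usage comment is illustrative; the wget equivalent of the flag is `--ca-certificate`.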

Author

good point, added something

Comment on lines 122 to 134
while IFS= read -r line; do
    log_file="${line%%:*}"
    pre_size="${line##*:}"
    if sudo test -f "$log_file"; then
        if [ "$pre_size" -ge 0 ]; then
            sudo tail -c "+$((pre_size+1))" "$log_file" 2>/dev/null | grep -qiE "$pattern" \
                && error_exit "Detected SSL/cert verification failure in $(basename "$log_file") since restart."
        else
            sudo tail -n 500 "$log_file" 2>/dev/null | grep -qiE "$pattern" \
                && error_exit "Detected SSL/cert verification failure in new $(basename "$log_file") since restart."
        fi
    fi
done < "$OFFSETS_FILE"

why are we saving and reading log offsets to a file?

Author

it was mainly to look only at the log output written since the restart, to avoid false positives from older errors
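
For readers following the offset idea, here is a self-contained sketch of the mechanism: record each log's size before the restart, then read only the bytes written afterwards. The helper names `record_offset` and `output_since` are hypothetical, and the sketch drops the `sudo` calls and pattern matching from the snippet under review.

```shell
#!/usr/bin/env bash
# Sketch only: remember a log file's size, then print what was appended later.

# Print "path:size" for a log file, or "path:-1" if it does not exist yet.
record_offset() {
    local log_file="$1" size=-1
    [ -f "$log_file" ] && size=$(( $(wc -c < "$log_file") ))
    echo "${log_file}:${size}"
}

# Given a "path:size" entry, print only the content written after that size.
output_since() {
    local log_file="${1%:*}" pre_size="${1##*:}"
    [ -f "$log_file" ] || return 0
    if [ "$pre_size" -ge 0 ]; then
        tail -c "+$((pre_size + 1))" "$log_file"
    else
        # The file did not exist before the restart, so all of it is new.
        cat "$log_file"
    fi
}
```

Splitting on the last colon (`%:*`) rather than the first keeps paths that contain a colon intact. If a log is truncated between the two calls, `tail -c` simply prints nothing.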

Comment on lines 126 to 132
if [ "$pre_size" -ge 0 ]; then
    sudo tail -c "+$((pre_size+1))" "$log_file" 2>/dev/null | grep -qiE "$pattern" \
        && error_exit "Detected SSL/cert verification failure in $(basename "$log_file") since restart."
else
    sudo tail -n 500 "$log_file" 2>/dev/null | grep -qiE "$pattern" \
        && error_exit "Detected SSL/cert verification failure in new $(basename "$log_file") since restart."
fi

imo we should just rotate these log files instead of trying to manage an offset. if something bad happens it'd be nice to just have a "here is everything since the attempted restart" log for customers to give us / troubleshoot with

if we can't rotate the logs, another option might be to open these in a tail -f stream for 30s or so after the restart instead of trying to find our point in time, but that is more difficult to troubleshoot
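
A rotation step of this kind might be sketched as follows. It assumes the Agent recreates its log file when it is restarted, which should be confirmed for Agent v5; `rotate_log` and the `.pre-restart.<timestamp>` suffix are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: move a log aside before the restart so that the fresh file
# contains only post-restart output.

rotate_log() {
    local log_file="$1" stamp rotated
    # Nothing to rotate if the log does not exist yet.
    [ -f "$log_file" ] || return 0
    stamp="$(date +%Y%m%d-%H%M%S)"
    rotated="${log_file}.pre-restart.${stamp}"
    mv "$log_file" "$rotated" || return 1
    # Print the new location so the caller can report or clean it up later.
    echo "$rotated"
}
```

Printing the rotated path gives the script a handle for telling customers where the old log went, or for removing it once the run succeeds.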

Author

really good idea, I applied it, let me know if this fits your idea

Personally, I do not mind the offset mechanism since it is fairly well scoped and not too complicated in this use case. The problem with log rotation is cleanup and bloat over time. Customers really do not like dirty hosts on uninstall.

}
$filtered += $line
}
$filtered += 'use_curl_http_client: true'

does the .conf file care about indentation levels (is it yaml)?

Author

yep this is yaml

}

function Enable-Tls12 {
    try { [Net.ServicePointManager]::SecurityProtocol = [Net.ServicePointManager]::SecurityProtocol -bor [Net.SecurityProtocolType]::Tls12 } catch { }

It is probably better to at least write a warning on failure.

$filtered += 'use_curl_http_client: true'
$updated = (($filtered -join "`r`n") + "`r`n")

try { $updated | Set-Content -LiteralPath $ConfFile -Encoding ASCII }

Probably UTF8 is better.

Write-Warning "Failed to restart service '$name': $($_.Exception.Message)"
}
}
if (-not $restarted) { Error-Exit "Error: Failed to restart the Datadog Agent (service not found or restart failed)." }

If the intention is to Error-Exit when any service fails to restart, this condition will not work.
In the foreach loop, the first service that restarts successfully sets $restarted = $true permanently, so any subsequent failures are ignored.
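
The difference between "at least one restart succeeded" and "every restart succeeded" is language-independent. The sketch below shows the stricter variant in shell rather than in the PowerShell under review; `restart_all` and the injected restart command are hypothetical. If the loop is instead meant to try several candidate service names and accept the first that works, the existing flag may be the intended behavior.

```shell
#!/usr/bin/env bash
# Sketch only: count failures instead of remembering the first success.

# Usage: restart_all <restart-command> <service-name>...
restart_all() {
    local restart_cmd="$1"; shift
    local failed=0 name
    for name in "$@"; do
        if "$restart_cmd" "$name"; then
            echo "Restarted $name"
        else
            echo "Warning: failed to restart $name" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeeds only when no restart failed.
    [ "$failed" -eq 0 ]
}
```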


@iadjivon left a comment

Hi there, thanks for looping the docs team into this PR. I have left some comments/suggestions. Let me know if you have any questions!


## Why This Matters

Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the agent to lose connectivity with Datadog.

Suggested change
- Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the agent to lose connectivity with Datadog.
+ Agent v5 uses an embedded certificate bundle for SSL/TLS verification. When Datadog's SSL certificates are updated to use newer certificate authorities, older Agent v5 installations may not recognize these certificates, causing the Agent to lose connectivity with Datadog.

This runbook provides automated scripts for both Linux and Windows that will:

1. Download and install an updated certificate bundle
2. Configure the agent to use your operating system's certificate store as a fallback

Suggested change
- 2. Configure the agent to use your operating system's certificate store as a fallback
+ 2. Configure the Agent to use your operating system's certificate store as a fallback

If certificate errors persist after running the script:

1. Verify your operating system is receiving security updates
2. Check the agent logs for detailed error messages:

Suggested change
- 2. Check the agent logs for detailed error messages:
+ 2. Check the Agent logs for detailed error messages:


## Verification

After running the script, verify your agent is reporting metrics:

Suggested change
- After running the script, verify your agent is reporting metrics:
+ After running the script, verify your Agent is reporting metrics:

2. Check your host in the Datadog Infrastructure List
3. Verify the "Last Seen" timestamp is recent

You can also manually check agent status:

Suggested change
- You can also manually check agent status:
+ You can also manually check the Agent status:


## Long-Term Recommendation

While this runbook provides a working solution, **we strongly recommend upgrading to Datadog Agent v6 or v7**:

Suggested change
- While this runbook provides a working solution, **we strongly recommend upgrading to Datadog Agent v6 or v7**:
+ While this runbook provides a working solution, **Datadog strongly recommends upgrading to Datadog Agent v6 or v7** to benefit from:

- Improved performance and new features
- Long-term support

Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/).

Suggested change
- Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/).
+ Agent v5 reached end-of-life and no longer receives updates. For migration guidance, visit the [Datadog documentation](https://docs.datadoghq.com/agent/guide/upgrade_agent_fleet_automation).

This new link directs users to steps for upgrading their Agent. See if there is a guide here that you deem more appropriate for this section.

1. Your Agent version
2. Operating system and version
3. Complete output from the script
4. Recent agent log excerpts showing any errors

Suggested change
- 4. Recent agent log excerpts showing any errors
+ 4. Recent Agent log excerpts showing any errors

@renantelo

lgtm
