Provide network partition configuration enablement via terraform-aws-consul-ecs automation#313
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for network partition resilience configuration to the consul-ecs automation, enabling outlier detection (passive health checks) via Envoy through service defaults. This allows services to automatically eject unhealthy upstream instances from the load balancing pool based on consecutive failures.
Changes:
- Added JSON schema and Go types for configuring network partition resilience with outlier detection parameters
- Implemented automatic registration of Consul service defaults with passive health check configuration during mesh initialization
- Enhanced Makefile to allow ARCH variable override for cross-platform builds
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| config/schema.json | Added JSON schema for networkResilienceConfig with outlierDetection settings (interval, maxFailures, enforcingConsecutive5xx, maxEjectionPercent) |
| config/types.go | Added OutlierDetectionConfig and NetworkPartitionResilienceConfig types with constants for default values and constructor functions |
| subcommand/mesh-init/command.go | Implemented registerServiceDefaults function to merge existing service defaults and add passive health check configuration; integrated into realRun workflow |
| subcommand/mesh-init/checks.go | Minor formatting change removing blank line in imports |
| Makefile | Changed ARCH to use conditional assignment (?=) allowing override for cross-platform builds |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hey Team, @anandmukul93 As per my understanding Consul doesn't support the all the types of PassiveHealth checks provided by Envoy. Relying purely on this generic 5xx errors ( For this Consul might need to implement the configuration of: enforcing_consecutive_gateway_failure
I'm wary of the false positives we might hit without this distinction. If we can get this config in, I think we're in a much safer spot for production. Let me know your thoughts. |
|
@v-rosa
As a requirement i see that 5xx is more generic . |
7ce03c4 to
4dd6d7b
Compare
Thanks @anandmukul93 for taking time to check my comment! Indeed, by being more generic, it is what makes this approach prone to false positives and not really usable to improve the resilience of the network (original root cause which triggered this change). |
|
Thanks @v-rosa for the valuable concern. In addition to this, consul-ui also needs to be updated with critical status for container checks upon network partition between ecs and consul cluster. for that we need to add checkTTL type creation for ecs containers in consul server as well which currently is not being done in the catalog register flow. |
4dd6d7b to
5171a96
Compare
Changes proposed in this PR:
How I've tested this PR:
How I've tested this PR:
Service defaults are created -
Envoy config outlier present for upstream of product-api-sidecar-proxy for product-db clusters
How I expect reviewers to test this PR:
Checklist:
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.
Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.