Provide network partition configuration enablement via terraform-aws-consul-ecs automation #313

Merged
anandmukul93 merged 2 commits into main from mukul/ecs-network-partition-resilience-configure
Feb 24, 2026

@anandmukul93
Contributor

Changes proposed in this PR:

  • Makefile changes for builds
  • Schema changes for the consul-ecs config JSON
  • Outlier detection in Envoy via the service-defaults PassiveHealthCheck (existing service defaults are merged so that PassiveHealthCheck is added if it does not already exist)
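
The "merge if it doesn't exist" behaviour described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ServiceDefaults`, `PassiveHealthCheck`, and `mergePassiveHealthCheck` are hypothetical local stand-ins for the Consul config-entry shapes, and `Protocol` is included only to show that unrelated existing settings survive the merge.

```go
package main

import "fmt"

// PassiveHealthCheck is a hypothetical local type holding the four
// settings that appear in the service-defaults output of this PR.
type PassiveHealthCheck struct {
	IntervalNanos           int64
	MaxFailures             uint32
	EnforcingConsecutive5xx uint32
	MaxEjectionPercent      uint32
}

// ServiceDefaults is a hypothetical, heavily reduced stand-in for a
// service-defaults config entry.
type ServiceDefaults struct {
	Name               string
	Protocol           string              // illustrative pre-existing setting
	PassiveHealthCheck *PassiveHealthCheck // nil when not configured
}

// mergePassiveHealthCheck starts from a shallow copy of existing (or from a
// fresh entry when existing is nil) and sets phc only if the entry has no
// passive health check yet. An operator-provided check is left untouched.
func mergePassiveHealthCheck(name string, existing *ServiceDefaults, phc PassiveHealthCheck) ServiceDefaults {
	merged := ServiceDefaults{Name: name}
	if existing != nil {
		merged = *existing
	}
	if merged.PassiveHealthCheck == nil {
		c := phc
		merged.PassiveHealthCheck = &c
	}
	return merged
}

func main() {
	phc := PassiveHealthCheck{IntervalNanos: 10_000_000_000, MaxFailures: 5, EnforcingConsecutive5xx: 100, MaxEjectionPercent: 50}

	// No entry yet: a new one is created with the passive health check.
	created := mergePassiveHealthCheck("product-db", nil, phc)
	fmt.Println(created.Name, created.PassiveHealthCheck.MaxFailures)

	// Entry exists with its own check: existing values win.
	existing := &ServiceDefaults{Name: "product-api", Protocol: "http", PassiveHealthCheck: &PassiveHealthCheck{MaxFailures: 3}}
	kept := mergePassiveHealthCheck("product-api", existing, phc)
	fmt.Println(kept.Protocol, kept.PassiveHealthCheck.MaxFailures)
}
```

Keeping an already-configured check is one possible reading of "if it doesn't exist"; the real implementation may merge at a finer granularity.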

How I've tested this PR:

Used a HashiCups setup with product-api -> product-db (upstream).

Service defaults are created:

/ $ consul config read -kind service-defaults -name product-api

{
    "Kind": "service-defaults",
    "Name": "product-api",
    "TransparentProxy": {},
    "MeshGateway": {},
    "Expose": {},
    "UpstreamConfig": {
        "Defaults": {
            "PassiveHealthCheck": {
                "Interval": 10000000000,
                "MaxFailures": 5,
                "EnforcingConsecutive5xx": 100,
                "MaxEjectionPercent": 50
            },
            "MeshGateway": {}
        }
    },
    "CreateIndex": 87,
    "ModifyIndex": 87
}
/ $ 
/ $ consul config read -kind service-defaults -name product-db
{
    "Kind": "service-defaults",
    "Name": "product-db",
    "TransparentProxy": {},
    "MeshGateway": {},
    "Expose": {},
    "UpstreamConfig": {
        "Defaults": {
            "PassiveHealthCheck": {
                "Interval": 10000000000,
                "MaxFailures": 5,
                "EnforcingConsecutive5xx": 100,
                "MaxEjectionPercent": 50
            },
            "MeshGateway": {}
        }
    },
    "CreateIndex": 95,
    "ModifyIndex": 95
}
/ $ consul config read -kind service-defaults -name payments
{
    "Kind": "service-defaults",
    "Name": "payments",
    "TransparentProxy": {},
    "MeshGateway": {},
    "Expose": {},
    "UpstreamConfig": {
        "Defaults": {
            "PassiveHealthCheck": {
                "Interval": 10000000000,
                "MaxFailures": 5,
                "EnforcingConsecutive5xx": 100,
                "MaxEjectionPercent": 50
            },
            "MeshGateway": {}
        }
    },
    "CreateIndex": 112,
    "ModifyIndex": 112
}
/ $ 
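
The `Interval` value in the output above is a duration in nanoseconds, so 10000000000 corresponds to 10s. A small sketch of reading it back, assuming only the fields visible in the output; `passiveHealthCheck` and `parsePassiveHealthCheck` are hypothetical local names, not the Consul API types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// passiveHealthCheck models only the fields shown in the CLI output.
// It is a local illustration type, not the Consul API struct.
type passiveHealthCheck struct {
	Interval                time.Duration // JSON number of nanoseconds
	MaxFailures             uint32
	EnforcingConsecutive5xx uint32
	MaxEjectionPercent      uint32
}

// parsePassiveHealthCheck extracts UpstreamConfig.Defaults.PassiveHealthCheck
// from a service-defaults JSON document; all other fields are ignored.
func parsePassiveHealthCheck(raw []byte) (passiveHealthCheck, error) {
	var entry struct {
		UpstreamConfig struct {
			Defaults struct {
				PassiveHealthCheck passiveHealthCheck
			}
		}
	}
	err := json.Unmarshal(raw, &entry)
	return entry.UpstreamConfig.Defaults.PassiveHealthCheck, err
}

func main() {
	raw := []byte(`{
	  "Kind": "service-defaults",
	  "Name": "product-db",
	  "UpstreamConfig": {"Defaults": {"PassiveHealthCheck": {
	    "Interval": 10000000000, "MaxFailures": 5,
	    "EnforcingConsecutive5xx": 100, "MaxEjectionPercent": 50}}}
	}`)
	phc, err := parsePassiveHealthCheck(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(phc.Interval, phc.MaxFailures) // prints "10s 5"
}
```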

The Envoy config of product-api-sidecar-proxy shows outlier detection on its product-db upstream cluster:



{
  "version_info": "56b560c2f883e490a7d0c9f59ada6e2359dbed5be5e3e81c26a764b19ef4981c",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "product-db.default.dc1.internal.2ba4048e-a093-2f33-4ba6-112ab577986f.consul",
    "type": "EDS",
    "eds_cluster_config": {
      "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
      }
    },
    "connect_timeout": "5s",
    "circuit_breakers": {},
    "outlier_detection": {
      "consecutive_5xx": 5,
      "interval": "10s",
      "max_ejection_percent": 50,
      "enforcing_consecutive_5xx": 100
    },
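
Comparing the service-defaults output with this cluster dump, the four PassiveHealthCheck fields line up with the Envoy `outlier_detection` fields one to one. The sketch below only restates that observed correspondence; the types and `toOutlierDetection` are hypothetical and are not how Consul generates its xDS config.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// passiveHealthCheck holds the values seen in the service-defaults output.
type passiveHealthCheck struct {
	Interval                time.Duration
	MaxFailures             uint32
	EnforcingConsecutive5xx uint32
	MaxEjectionPercent      uint32
}

// outlierDetection mirrors the four fields visible in the Envoy cluster dump.
type outlierDetection struct {
	Consecutive5xx          uint32
	Interval                string // rendered as seconds, e.g. "10s"
	MaxEjectionPercent      uint32
	EnforcingConsecutive5xx uint32
}

// toOutlierDetection applies the field correspondence observed in this PR:
// MaxFailures -> consecutive_5xx, Interval -> interval, and the two
// percentage fields map to their same-named Envoy counterparts.
func toOutlierDetection(p passiveHealthCheck) outlierDetection {
	return outlierDetection{
		Consecutive5xx:          p.MaxFailures,
		Interval:                strconv.FormatFloat(p.Interval.Seconds(), 'f', -1, 64) + "s",
		MaxEjectionPercent:      p.MaxEjectionPercent,
		EnforcingConsecutive5xx: p.EnforcingConsecutive5xx,
	}
}

func main() {
	od := toOutlierDetection(passiveHealthCheck{
		Interval:                10 * time.Second,
		MaxFailures:             5,
		EnforcingConsecutive5xx: 100,
		MaxEjectionPercent:      50,
	})
	fmt.Printf("%+v\n", od)
}
```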

How I expect reviewers to test this PR:

Checklist:

  • Tests added
  • CHANGELOG entry added

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

@anandmukul93 anandmukul93 requested a review from a team as a code owner February 16, 2026 22:35
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 16, 2026 22:43 — with GitHub Actions Inactive
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 16, 2026 22:53 — with GitHub Actions Inactive
@kswap kswap requested a review from Copilot February 17, 2026 07:17

Copilot AI left a comment

Pull request overview

This PR adds support for network partition resilience configuration to the consul-ecs automation, enabling outlier detection (passive health checks) via Envoy through service defaults. This allows services to automatically eject unhealthy upstream instances from the load balancing pool based on consecutive failures.

Changes:

  • Added JSON schema and Go types for configuring network partition resilience with outlier detection parameters
  • Implemented automatic registration of Consul service defaults with passive health check configuration during mesh initialization
  • Enhanced Makefile to allow ARCH variable override for cross-platform builds

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| config/schema.json | Added JSON schema for networkResilienceConfig with outlierDetection settings (interval, maxFailures, enforcingConsecutive5xx, maxEjectionPercent) |
| config/types.go | Added OutlierDetectionConfig and NetworkPartitionResilienceConfig types with constants for default values and constructor functions |
| subcommand/mesh-init/command.go | Implemented registerServiceDefaults function to merge existing service defaults and add passive health check configuration; integrated into realRun workflow |
| subcommand/mesh-init/checks.go | Minor formatting change removing blank line in imports |
| Makefile | Changed ARCH to use conditional assignment (?=) allowing override for cross-platform builds |
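
As a rough sketch of how a config type with default constants and a constructor can behave, the example below pre-populates defaults and lets the JSON override only the keys that are present. The type names and JSON keys come from the summary above; everything else is an assumption made for illustration: the field types, the interval being expressed in seconds, the default values (borrowed from the test output in the PR description), and the helper `parseResilienceConfig`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed defaults, taken from the values seen in the PR's test output.
const (
	defaultIntervalSeconds         = 10
	defaultMaxFailures             = 5
	defaultEnforcingConsecutive5xx = 100
	defaultMaxEjectionPercent      = 50
)

// OutlierDetectionConfig uses the key names listed for config/schema.json.
// Field types and the interval unit (seconds) are assumptions.
type OutlierDetectionConfig struct {
	Interval                int `json:"interval"`
	MaxFailures             int `json:"maxFailures"`
	EnforcingConsecutive5xx int `json:"enforcingConsecutive5xx"`
	MaxEjectionPercent      int `json:"maxEjectionPercent"`
}

type NetworkPartitionResilienceConfig struct {
	OutlierDetection OutlierDetectionConfig `json:"outlierDetection"`
}

// NewOutlierDetectionConfig returns a config filled with the assumed defaults.
func NewOutlierDetectionConfig() OutlierDetectionConfig {
	return OutlierDetectionConfig{
		Interval:                defaultIntervalSeconds,
		MaxFailures:             defaultMaxFailures,
		EnforcingConsecutive5xx: defaultEnforcingConsecutive5xx,
		MaxEjectionPercent:      defaultMaxEjectionPercent,
	}
}

// parseResilienceConfig (hypothetical helper) decodes over a pre-populated
// struct, so keys absent from the JSON keep their default values.
func parseResilienceConfig(raw []byte) (NetworkPartitionResilienceConfig, error) {
	cfg := NetworkPartitionResilienceConfig{OutlierDetection: NewOutlierDetectionConfig()}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	od := cfg.OutlierDetection
	if od.EnforcingConsecutive5xx < 0 || od.EnforcingConsecutive5xx > 100 ||
		od.MaxEjectionPercent < 0 || od.MaxEjectionPercent > 100 {
		return cfg, fmt.Errorf("percentage fields must be between 0 and 100")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseResilienceConfig([]byte(`{"outlierDetection": {"maxFailures": 3}}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg.OutlierDetection)
}
```

Decoding over defaults keeps an explicit `0` distinguishable from a missing key, which matters for a field like enforcingConsecutive5xx where 0 is a meaningful value.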

@v-rosa

v-rosa commented Feb 17, 2026

Hey Team, @anandmukul93

As per my understanding, Consul doesn't support all the types of passive health checks provided by Envoy.

Relying purely on generic 5xx errors (enforcing_consecutive_5xx) might lead to false positives. I believe the scope of this change is to detect gateway/routing issues, not server errors.

For this, Consul might need to support configuring:

enforcing_consecutive_gateway_failure

The % chance that a host will be actually ejected when an outlier status is detected through consecutive gateway failures. This setting can be used to disable ejection or to ramp it up slowly. Defaults to 0.

http://envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/outlier_detection.proto#envoy-v3-api-field-config-cluster-v3-outlierdetection-failure-percentage-threshold

I'm wary of the false positives we might hit without this distinction. If we can get this config in, I think we're in a much safer spot for production.

Let me know your thoughts.

@anandmukul93
Contributor Author

@v-rosa
As far as I checked, Envoy defines gateway failures as below:

consecutive_gateway_failure
(UInt32Value) The number of consecutive gateway failures (502, 503, 504 status codes) before a consecutive gateway failure ejection occurs. Defaults to 5.

enforcing_consecutive_gateway_failure
(UInt32Value) The % chance that a host will be actually ejected when an outlier status is detected through consecutive gateway failures. This setting can be used to disable ejection or to ramp it up slowly. Defaults to 0.

As a requirement, I see that 5xx is more generic.
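
To make the difference between the two detectors concrete, here is a simplified sketch. It only counts runs of HTTP status codes for a single host; Envoy's real outlier detector is more involved, and the helper names are hypothetical. The gateway-failure codes (502, 503, 504) are the ones quoted above.

```go
package main

import "fmt"

// is5xx reports whether code counts toward a generic consecutive-5xx run.
func is5xx(code int) bool { return code >= 500 && code <= 599 }

// isGatewayFailure reports whether code is one of the gateway failures
// listed in the Envoy docs quoted above.
func isGatewayFailure(code int) bool { return code == 502 || code == 503 || code == 504 }

// longestRun returns the length of the longest consecutive run of codes
// for which match returns true.
func longestRun(codes []int, match func(int) bool) int {
	longest, current := 0, 0
	for _, c := range codes {
		if match(c) {
			current++
			if current > longest {
				longest = current
			}
		} else {
			current = 0
		}
	}
	return longest
}

func main() {
	// Five application errors in a row, e.g. from a bug in the service.
	appErrors := []int{500, 500, 500, 500, 500}
	fmt.Println(longestRun(appErrors, is5xx))            // 5: reaches a threshold of 5
	fmt.Println(longestRun(appErrors, isGatewayFailure)) // 0: not a gateway failure
}
```

With a threshold of 5, the generic 5xx counter would mark this host as an outlier even though the network path is healthy, while a gateway-failure counter would not.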

@anandmukul93 anandmukul93 force-pushed the mukul/ecs-network-partition-resilience-configure branch from 7ce03c4 to 4dd6d7b Compare February 19, 2026 15:00
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 19, 2026 15:18 — with GitHub Actions Inactive
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 19, 2026 15:18 — with GitHub Actions Inactive
@v-rosa

v-rosa commented Feb 19, 2026

As a requirement, I see that 5xx is more generic.

Thanks @anandmukul93 for taking the time to check my comment! Indeed, being more generic is exactly what makes this approach prone to false positives and not really usable for improving the resilience of the network (the original root cause that triggered this change).

@anandmukul93
Contributor Author

anandmukul93 commented Feb 23, 2026

Thanks @v-rosa for the valuable concern.
I understand that generic 5xx is a concern: it is broader and more prone to false positives. As a hotfix requirement, however, it is what Consul currently supports out of the box.
Future work would build on this by adding gateway-failure support to Consul and consul-api, and then to consul-ecs, all as part of PassiveHealthCheck itself. Given that partitions are transient in nature, this hotfix works for most situations, but I now understand what a better approach would be and will move towards it.

In addition, consul-ui also needs to show a critical status for container checks when there is a network partition between ECS and the Consul cluster. For that, we also need to create checkTTL-type checks for ECS containers in the Consul server, which the catalog register flow currently does not do.

Contributor

@kswap kswap left a comment

LGTM.

- makefile
- schema
- outlier detection in envoy via service defaults
- add test cases
- fix pr comments for go lint
- fix error logging
- schema json defaults
- service defaults error types handled separately
@anandmukul93 anandmukul93 force-pushed the mukul/ecs-network-partition-resilience-configure branch from 4dd6d7b to 5171a96 Compare February 24, 2026 01:08
@anandmukul93 anandmukul93 requested a review from a team as a code owner February 24, 2026 01:08
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 24, 2026 01:25 — with GitHub Actions Inactive
@anandmukul93 anandmukul93 temporarily deployed to dockerhub/hashicorpdev February 24, 2026 01:25 — with GitHub Actions Inactive
@anandmukul93 anandmukul93 merged commit 96a2f5f into main Feb 24, 2026
53 checks passed
@anandmukul93 anandmukul93 deleted the mukul/ecs-network-partition-resilience-configure branch February 24, 2026 01:34
4 participants