Add Raft cluster peer management (GetPeers, AddPeer, RemovePeer) #661

Closed
wants to merge 49 commits into from

Conversation

sinadarbouy
Collaborator

Ticket(s)

Close #642

Description

This PR implements Raft peer management APIs to enable adding, removing, and querying peers in the Raft cluster. Key changes include:

  • Added new API endpoints for Raft peer management:
    • GetPeers: Returns information about all peers in the Raft cluster
    • AddPeer: Adds a new peer to the Raft cluster
    • RemovePeer: Removes a peer from the Raft cluster
  • Added corresponding protobuf definitions and messages
  • Added comprehensive test coverage for the new APIs
  • Updated API documentation
  • Added raft/node* to .gitignore
  • Updated Dockerfile to use version ranges for dependencies
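
To make the shape of the three operations concrete, here is a minimal in-memory sketch. It is not the PR's implementation: `Peer`, `PeerRegistry`, and the addresses are hypothetical, and a real node would route these calls through Raft rather than a local map.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Peer is a hypothetical record for one cluster member.
type Peer struct {
	ID          string
	RaftAddress string
	GRPCAddress string
}

// PeerRegistry is an illustrative in-memory stand-in for cluster membership.
type PeerRegistry struct {
	mu    sync.RWMutex
	peers map[string]Peer
}

// NewPeerRegistry returns an empty registry.
func NewPeerRegistry() *PeerRegistry {
	return &PeerRegistry{peers: make(map[string]Peer)}
}

// AddPeer registers a peer and rejects duplicates or incomplete records.
func (r *PeerRegistry) AddPeer(p Peer) error {
	if p.ID == "" || p.RaftAddress == "" {
		return errors.New("peer ID and Raft address are required")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.peers[p.ID]; ok {
		return fmt.Errorf("peer %q already exists", p.ID)
	}
	r.peers[p.ID] = p
	return nil
}

// RemovePeer deletes a peer and reports an error if it is unknown.
func (r *PeerRegistry) RemovePeer(id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.peers[id]; !ok {
		return fmt.Errorf("peer %q not found", id)
	}
	delete(r.peers, id)
	return nil
}

// GetPeers returns a copy of all peers sorted by ID.
func (r *PeerRegistry) GetPeers() []Peer {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]Peer, 0, len(r.peers))
	for _, p := range r.peers {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	reg := NewPeerRegistry()
	_ = reg.AddPeer(Peer{ID: "node2", RaftAddress: "127.0.0.1:2223", GRPCAddress: "127.0.0.1:50052"})
	_ = reg.AddPeer(Peer{ID: "node1", RaftAddress: "127.0.0.1:2222", GRPCAddress: "127.0.0.1:50051"})
	_ = reg.RemovePeer("node2")
	fmt.Println(reg.GetPeers())
}
```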

Related PRs

Development Checklist

  • I have added a descriptive title to this PR.
  • [] I have squashed related commits together.
  • [] I have rebased my branch on top of the latest main branch.
  • I have performed a self-review of my own code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added docstring(s) to my code.
  • I have made corresponding changes to the documentation (docs).
  • [] I have updated docs using make gen-docs command.
  • I have added tests for my changes.
  • I have signed all the commits.

Legal Checklist

Adds support for automatic peer discovery and cluster joining for non-bootstrap nodes.
Key changes:

- Add AddPeer RPC endpoint to allow nodes to join an existing cluster
- Implement TryConnectToCluster() to handle automatic cluster joining
- Forward AddPeer requests to leader if received by follower
- Add protobuf definitions for AddPeer request/response
- Update .gitignore to exclude raft node data files

This change allows new nodes to automatically discover and join an existing cluster
by attempting to connect to configured peers until successful. Non-leader nodes
will forward join requests to the current leader.
Add unit tests to verify the AddPeer behavior in both leader and follower nodes:
- Test successful peer addition when node is leader

⚠️ This PR contains unsigned commits. To get your PR merged, please sign those commits (git rebase --exec 'git commit -S --amend --no-edit -n' @{upstream}) and force push them to this branch (git push --force-with-lease).

If you're new to commit signing, there are different ways to set it up:

Sign commits with gpg

Follow the steps below to set up commit signing with gpg:

  1. Generate a GPG key
  2. Add the GPG key to your GitHub account
  3. Configure git to use your GPG key for commit signing
Sign commits with ssh-agent

Follow the steps below to set up commit signing with ssh-agent:

  1. Generate an SSH key and add it to ssh-agent
  2. Add the SSH key to your GitHub account
  3. Configure git to use your SSH key for commit signing
Sign commits with 1Password

You can also sign commits using 1Password, which lets you sign commits with biometrics without the signing key leaving the local 1Password process.

Learn how to use 1Password to sign your commits.

Watch the demo


github-actions bot commented Feb 16, 2025

Overview

| Image reference | ghcr.io/gatewayd-io/gatewayd:4bbd0c2 | gatewaydio/gatewayd:latest |
|---|---|---|
| digest | 0f7619257329 | 383013efa302 |
| tag | 4bbd0c2 | latest |
| provenance | | b6df86a |
| vulnerabilities | critical: 0, high: 0, medium: 0, low: 0 | critical: 1, high: 3, medium: 6, low: 0 |
| platform | linux/amd64 | linux/amd64 |
| size | 20 MB | 18 MB (-2.7 MB) |
| packages | 145 | 140 (-5) |
| base image | alpine:3 (also known as 3.21, 3.21.3, latest) | alpine:3.20 (also known as 3, latest) |
| base image vulnerabilities | critical: 0, high: 0, medium: 0, low: 0 | critical: 0, high: 1, medium: 3, low: 0 |
Packages and Vulnerabilities (55 package changes and 4 vulnerability changes)
  • ➕ 2 packages added
  • ➖ 6 packages removed
  • ♾️ 47 packages changed
  • 87 packages unchanged
  • ❗ 4 vulnerabilities added
Changes for packages of type apk (19 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:4bbd0c2) | Version (gatewaydio/gatewayd:latest)
alpine-base 3.21.3-r0
♾️ alpine-baselayout 3.6.8-r1 3.6.5-r0
♾️ alpine-baselayout-data 3.6.8-r1 3.6.5-r0
♾️ alpine-keys 2.5-r0 2.4-r1
alpine-release 3.21.3-r0
♾️ apk-tools 2.14.6-r3 2.14.4-r0
♾️ busybox 1.37.0-r12 1.36.1-r29
♾️ busybox-binsh 1.37.0-r12 1.36.1-r29
ca-certificates 20241121-r1
♾️ ca-certificates-bundle 20241121-r1 20240705-r0
♾️ libcrypto3 3.3.3-r0 3.3.2-r0
♾️ libssl3 3.3.3-r0 3.3.2-r0
♾️ musl 1.2.5-r9 1.2.5-r0
♾️ musl-utils 1.2.5-r9 1.2.5-r0
critical: 0 high: 1 medium: 0 low: 0
Added vulnerabilities (1):
  • high: CVE-2025-26519
openssl 3.3.3-r0
pax-utils 1.3.8-r1
♾️ scanelf 1.3.8-r1 1.3.7-r2
♾️ ssl_client 1.37.0-r12 1.36.1-r29
♾️ zlib 1.3.1-r2 1.3.1-r1
Changes for packages of type golang (36 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:4bbd0c2) | Version (gatewaydio/gatewayd:latest)
♾️ github.com/envoyproxy/protoc-gen-validate 1.2.1 1.1.0
♾️ github.com/gatewayd-io/gatewayd (devel) 0.0.0-20241214123014-b6df86a6fe94
♾️ github.com/gatewayd-io/gatewayd-plugin-sdk 0.4.0 0.3.5
♾️ github.com/getsentry/sentry-go 0.31.1 0.30.0
♾️ github.com/go-git/go-billy/v5 5.6.0 5.5.0
♾️ github.com/go-git/go-git/v5 5.13.0 5.12.0
critical: 1 high: 1 medium: 0 low: 0
Added vulnerabilities (2):
  • critical: CVE-2025-21613
  • high: CVE-2025-21614
♾️ github.com/google/go-github/v53 68.0.0 53.2.0
♾️ github.com/grpc-ecosystem/grpc-gateway/v2 2.26.0 2.24.0
github.com/hashicorp/go-metrics 0.5.4
♾️ github.com/hashicorp/go-plugin 1.6.3 1.6.2
♾️ github.com/hashicorp/raft 1.7.2 1.7.1
♾️ github.com/hashicorp/raft-boltdb 0.0.0-20250113192317-e8660f88bcc9 0.0.0-20241202213821-f9dd2ba30efd
♾️ github.com/invopop/jsonschema 0.13.0 0.12.0
♾️ github.com/jackc/pgx/v5 5.7.2 5.7.1
♾️ github.com/mattn/go-colorable 0.1.14 0.1.13
♾️ github.com/pganalyze/pg_query_go/v5 6.0.0 5.1.0
♾️ github.com/prometheus/common 0.62.0 0.61.0
♾️ github.com/protonmail/go-crypto 1.1.3 1.0.0
♾️ github.com/spf13/cast 1.7.1 1.7.0
♾️ github.com/wasilibs/go-pgquery 0.0.0-20241226024732-8bfaa0ac5969 0.0.0-20241011013927-817756c5aae4
♾️ github.com/wasilibs/wazero-helpers 0.0.0-20250123031827-cd30c44769bb 0.0.0-20240620070341-3dff1577cd52
♾️ go.opentelemetry.io/otel 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/metric 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/sdk 1.34.0 1.33.0
♾️ go.opentelemetry.io/otel/trace 1.34.0 1.33.0
♾️ go.opentelemetry.io/proto/otlp 1.5.0 1.4.0
♾️ golang.org/x/crypto 0.32.0 0.31.0
golang.org/x/exp 0.0.0-20241210194714-1829a127f884
♾️ golang.org/x/net 0.34.0 0.32.0
critical: 0 high: 1 medium: 0 low: 0
Added vulnerabilities (1):
  • high: CVE-2024-45338
golang.org/x/oauth2 0.24.0
♾️ golang.org/x/sys 0.29.0 0.28.0
♾️ google.golang.org/genproto/googleapis/rpc 0.0.0-20250124145028-65684f501c47 0.0.0-20241209162323-e6fa225c2576
♾️ google.golang.org/grpc 1.70.0 1.69.0
♾️ google.golang.org/protobuf 1.36.4 1.35.2
♾️ stdlib go1.23.6 1.23.4

sinadarbouy force-pushed the feature/dynamic-adding-raft-642 branch from e264d49 to c01cd47 on February 17, 2025 21:22
hamedsalim1999 force-pushed the feature/dynamic-adding-raft-642 branch 2 times, most recently from 0cc8570 to 0e49f89, on February 18, 2025 19:46
sinadarbouy and others added 23 commits February 18, 2025 21:24
…ower nodes

- Updated TestAddPeer to include checks for adding peers when the node is a leader and a follower.
- Introduced temporary directories for each node to ensure isolated testing environments.
- Added assertions to confirm that both new peers are successfully integrated into the cluster.
- Improved test reliability by implementing a loop to wait for both nodes to join the cluster before completing the test.
…ower nodes

- Updated TestAddPeer to include checks for adding peers when the node is a leader and a follower.
- Introduced temporary directories for each node to ensure isolated testing environments.
- Added assertions to confirm that both new peers are successfully integrated into the cluster.
- Improved test reliability by implementing a loop to wait for both nodes to join the cluster before completing the test.
- Added RemovePeer RPC endpoint to the Raft service, allowing nodes to remove peers from the cluster.
- Introduced RemovePeerRequest and RemovePeerResponse message types in the protobuf definitions.
- Updated RaftNode to handle peer removal, including forwarding requests to the leader if the node is not the leader.
- Enhanced the README documentation to include details about the new RemovePeerRequest and RemovePeerResponse.
- Implemented unit tests for the RemovePeer functionality, ensuring correct behavior when removing both leader and follower nodes.
- Updated gRPC and HTTP handlers to support the new RemovePeer functionality.

This change enhances the Raft protocol's capability to manage cluster membership dynamically.
Implement Raft cluster management API endpoints to retrieve, add, and remove peers:
- Add GetPeers method to retrieve current Raft cluster peers
- Implement AddPeer and RemovePeer RPC endpoints for dynamic cluster membership
- Update API service definition to include new Raft peer management methods
- Add corresponding gRPC and HTTP handlers for peer management
- Enhance protobuf definitions with new message types for peer operations

These changes provide a comprehensive API for managing Raft cluster membership, allowing dynamic peer addition and removal.
Implement peer discovery and graceful shutdown in raft.go
Configure Tempo tracing service with explicit endpoint binding and add health checks to Docker Compose. This ensures proper tracing integration and service readiness in the observability stack.
…and openssl

Modify Dockerfile to use more flexible version constraints for alpine packages, allowing minor version updates while maintaining compatibility.
Enhance Raft node configuration to support optional TLS encryption:
- Add IsSecure, CertFile, and KeyFile fields to Raft configuration
- Implement conditional TLS server credentials based on secure mode
- Update default configuration to disable secure mode
- Modify gRPC server startup to handle secure and insecure modes
- Improve logging for gRPC server initialization

This change provides flexibility in configuring Raft node communication security while maintaining backward compatibility.
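
As a rough illustration of the conditional TLS setup described above, the sketch below uses only the standard library. The field names `IsSecure`, `CertFile`, and `KeyFile` come from the commit message; the struct name `RaftSecurity`, the function `serverTLSConfig`, and the TLS 1.2 minimum are assumptions. The PR itself wires credentials into a gRPC server, which is not shown here.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

// RaftSecurity mirrors the IsSecure, CertFile, and KeyFile fields named in
// the commit message; the struct name itself is hypothetical.
type RaftSecurity struct {
	IsSecure bool
	CertFile string
	KeyFile  string
}

// serverTLSConfig returns nil when secure mode is off and a TLS
// configuration loaded from the configured files otherwise.
func serverTLSConfig(cfg RaftSecurity) (*tls.Config, error) {
	if !cfg.IsSecure {
		return nil, nil // insecure mode: the caller starts a plaintext server
	}
	if cfg.CertFile == "" || cfg.KeyFile == "" {
		return nil, errors.New("secure mode requires both CertFile and KeyFile")
	}
	cert, err := tls.LoadX509KeyPair(cfg.CertFile, cfg.KeyFile)
	if err != nil {
		return nil, fmt.Errorf("load key pair: %w", err)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12, // assumed floor, not taken from the PR
	}, nil
}

func main() {
	tlsCfg, err := serverTLSConfig(RaftSecurity{IsSecure: false})
	fmt.Println(tlsCfg == nil, err)
}
```

Returning a nil config for the insecure case keeps the default (secure mode disabled) on the same code path as before, which matches the backward-compatibility goal stated in the commit.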
Modify the HealthChecker to always return NOT_SERVING status by commenting out Raft-specific health checks.
Introduce getLeaderClient method to centralize leader client retrieval logic in AddPeer and RemovePeer methods. This reduces code duplication and improves maintainability by extracting the common pattern of finding the leader's gRPC address and obtaining a client.
Implement TestSecureGRPCConfiguration to validate secure Raft node configuration:
- Add test cases for valid and invalid secure configuration scenarios
- Introduce helper function to generate self-signed certificates for testing
- Verify TLS credential handling and error conditions
- Ensure proper configuration of secure and non-secure gRPC nodes
…c allocation

Modify TestSecureGRPCConfiguration to use port 0 for dynamic port allocation, improving test reliability and preventing potential port conflicts during parallel test execution.
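
Port 0 allocation works by letting the operating system choose a free port at bind time. A minimal standard-library sketch (the helper name `listenOnFreePort` is hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort binds to port 0 so the kernel picks an unused port,
// then reports which port was assigned.
func listenOnFreePort() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	port := ln.Addr().(*net.TCPAddr).Port
	return ln, port, nil
}

func main() {
	ln, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("listening on port", port)
}
```

Because each listener holds its own port while open, parallel tests that use this pattern do not compete for a fixed port number.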
…oring

Integrate HashiCorp's logger adapter using zerolog. This simplifies the Raft node initialization by leveraging built-in logging mechanisms and removing redundant leadership tracking logic.
…sertions

Enhance HTTP server test by:
- Adding error handling for gRPC server startup
- Using require assertions for clearer test failures
- Implementing panic recovery for gRPC server
- Improving server startup error detection
Remove the `monitorLeadership()` method from the Raft node initialization, which was previously commented out. This simplifies the node startup process and removes unnecessary leadership tracking logic that was likely superseded by more efficient Raft cluster management mechanisms.
Implement thorough test suite for RemovePeer API method, covering:
- Successful peer removal
- Error handling for uninitialized Raft node
- Handling of non-existent peer removal
- Proper gRPC error code validation
Implement thorough test suite for GetPeers API method, covering:
- Successful peer retrieval with a leader node
- Error handling for uninitialized Raft node
- Validation of returned peers map structure
- Proper gRPC error code validation
Enhance peer management functionality by introducing gRPC address tracking:
- Update AddPeer method to include gRPC address parameter
- Modify AddPeerRequest and related protobuf definitions
- Extend peer addition logic to store gRPC address in local peers list
- Update API and RPC methods to handle new gRPC address field
- Add comprehensive test cases for AddPeer with gRPC address validation
…error handling

Enhance peer management methods by:
- Adding context with timeout for AddPeer and RemovePeer operations
- Improving error messages with more context
- Using getter methods for request fields
- Updating test cases to reflect new method signatures
- Adding more robust error handling and logging
Implement TestFSMPeerOperations to validate Raft cluster peer management:
- Create a multi-node Raft cluster with bootstrap and follower nodes
- Verify peer synchronization across nodes
- Ensure consistent peer information in FSM state
- Validate leader election and consistency
- Add robust assertions for peer addition and state tracking
Improve peer management by:
- Adding CommandAddPeer and CommandRemovePeer to FSM
- Implementing peer synchronization across Raft cluster
- Adding waitForLeader method with retry mechanism
- Enhancing error handling and logging for peer operations
- Updating leader client retrieval with more reliable mechanism
Update Raft node and RPC methods to accept context parameter:
- Modify Apply method to include context for better request tracing and timeout control
- Update forwardToLeader and applyInternal methods to use context
- Adjust RPC server methods to pass context through
- Refactor test cases to provide context when calling Apply
- Improve error handling and request forwarding with context support
sinadarbouy and others added 24 commits February 18, 2025 21:26
…ation

Implement a new GetPeerInfo RPC method to support peer synchronization across Raft cluster:
- Add GetPeerInfoRequest and GetPeerInfoResponse protobuf definitions
- Create RPC method to query peer information from other nodes
- Implement peer synchronization mechanism with periodic checks
- Add method to query and update peer information across cluster
- Enhance peer management with cross-node information retrieval
Update testcontainers-go dependency to the latest version, which includes potential bug fixes and improvements.
Remove the placeholder DiscoverPeers method that was not implemented, keeping the codebase clean and focused on existing peer management functionality.
Improve peer synchronization and RPC method implementation:
- Remove error handling from syncPeers method
- Simplify StartPeerSynchronizer goroutine
- Update GetPeerInfo RPC method to use getter method
- Remove unnecessary logging and error checks
…guration management

Break down Raft node creation into smaller, focused functions:
- Extract node configuration initialization
- Create separate methods for FSM, stores, and transport setup
- Improve error handling and logging during node creation
- Add context cancellation for peer synchronization
- Enhance cluster configuration and bootstrapping logic
* Refactor global variables into a global struct
* Update tests
* Use param instead of global variable
* Update tests
* Fix formatting
* Add a hack to fix the tests
* Ignore raft files
* Fix tests
* Remove global variables
* Use a global app variable for test only and assign it only in tests
* Add stopGracefully as a method to the GatewayDInstance struct
* Move signal handler to the top to avoid half-states on exit
* Add a new function for creating an instance of GatewayDInstance using parsed flags
* Remove duplicate internal functions
* Update tests
* Use exported functions instead of internal variables
* Remove global variables
* Update tests
* Fix linter error
* Ignore dupl linter
* Fix tests with the actual log output
* Clean up after test
* Remove global variables
* Skip path slip verification
* Fix tests
* Remove backup
* Another try to fix the test
* Fix missing/unknown behavior
* Add a pull-only test
* Declare variable before assignment
* Fix plugin install behavior
* Fix plugin install with no overwrite
* Revert changes
* Rename variable
* Use filepath to join paths
* Remove unnecessary file
* Ignore plugins file if exists
* Remove duplicate code
* Reset the pull-only flag
* Check if the server is properly closed before erroring out
* Refactor run command into a separate file
* Fix bug in handling early exit
* Move left-over functions
* Add comments
* Fix linter errors
* Fix missing log message and span
* Handle errors when stopping the listener for gRPC server
* Graceful shutdown of gRPC server
* Revert changes to path
* Use local variable
* Replace all context.TODO with context.Background
* Split exit codes into a separate file
* Remove unused constant and renumber others
* Use exported function instead of internal variables
* Ignore linter errors
* Rename variable and comment to match the behavior
* Refactor health check scheduler
* Do not start metrics merger if no address is registered
* Refactor metrics merger function
* Refactor connection health check function
* Remove compatibility policy
* Regenerate stubs
* Remove compatibility policy
* Check fast for plugin version existence and simplify syntax
* Update SDK
* Update direct deps
* Add install-deps target
* Regenerate stubs
* Update alpine and its packages
* Fix linter errors
* Increase timeout
Modify package version constraints to use more flexible version matching for git, make, and openssl packages, allowing minor version updates while maintaining compatibility.
Update the protoc-gen-go version in generated protobuf files for both API and Raft services, ensuring compatibility and using the latest minor version.
Remove hardcoded ARM64 architecture setting in docker-compose-raft.yaml, allowing for more flexible deployment configurations.
Update load balancer strategies to accept a context parameter, enabling timeout and cancellation support for proxy selection. This change introduces context handling in:
- ConsistentHash
- Random
- RoundRobin
- WeightedRoundRobin

Also add a FindProxyTimeout constant in the server to provide a default timeout for proxy selection.
Add graceful handling for raft cluster bootstrap when the cluster is already initialized, preventing unnecessary errors and improving startup robustness. Log an informative message when skipping bootstrap due to existing cluster configuration.
Add comments to clarify the purpose of AddPeer and GetPeerInfo gRPC request handlers, improving code readability and documentation for Raft RPC server methods.
Improve input validation and error handling for Raft RPC methods:
- Add null and empty field checks for AddPeer and RemovePeer requests
- Provide more descriptive error messages
- Refactor GetPeerInfo to handle non-existent peer cases
- Ensure consistent error response formatting
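
The nil and empty-field checks could be sketched as below. `AddPeerRequest` here is a hypothetical plain-Go stand-in for the generated protobuf message, and the field names and error texts are assumptions rather than the PR's exact wording.

```go
package main

import (
	"errors"
	"fmt"
)

// AddPeerRequest is a hypothetical plain-Go stand-in for the generated
// protobuf message; the field names are assumptions.
type AddPeerRequest struct {
	PeerID      string
	PeerAddress string
	GRPCAddress string
}

// validateAddPeerRequest rejects nil requests and empty required fields
// before any cluster state is touched.
func validateAddPeerRequest(req *AddPeerRequest) error {
	switch {
	case req == nil:
		return errors.New("AddPeer request is nil")
	case req.PeerID == "":
		return errors.New("peer ID is required")
	case req.PeerAddress == "":
		return errors.New("peer address is required")
	case req.GRPCAddress == "":
		return errors.New("gRPC address is required")
	}
	return nil
}

func main() {
	fmt.Println(validateAddPeerRequest(nil))
	fmt.Println(validateAddPeerRequest(&AddPeerRequest{PeerID: "node2"}))
	fmt.Println(validateAddPeerRequest(&AddPeerRequest{
		PeerID: "node2", PeerAddress: "127.0.0.1:2223", GRPCAddress: "127.0.0.1:50052",
	}))
}
```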
…or creation

Simplify error creation in RPC methods by using errors.New instead of fmt.Errorf, improving code consistency and removing unnecessary formatting overhead.
Enhance Raft cluster operations with:
- Robust LeaveCluster method with timeout and logging
- Comprehensive peer validation in FSM
- Metrics tracking for peer additions, updates, and removals
- Improved error handling and state checks
- Added validation for peer payload addresses
…ions

Improve API documentation for Raft peer-related methods and messages:
- Add detailed descriptions and examples for GetPeers, AddPeer, and RemovePeer RPC methods
- Include comprehensive field descriptions for PeersResponse, PeerInfo, AddPeerRequest, and related response messages
- Update Swagger/OpenAPI specifications with more informative operation and schema descriptions
- Improve README.md documentation for peer-related message fields
Implement thorough test cases for LeaveCluster method covering:
- Single node cluster
- Follower leaving multi-node cluster
- Leader leaving multi-node cluster
- Handling nil node scenarios
- Verifying cluster state after node departure

Enhance test coverage for Raft cluster management and node removal logic.
Implement thorough test suites for Raft Node methods:
- GetPeers: Test peer retrieval in various cluster configurations
- GetLeaderClient: Verify leader client retrieval in single and multi-node clusters
- Shutdown: Validate node shutdown behavior with different scenarios

Enhance test coverage for Raft node management, improving reliability and robustness of cluster operations.
Implement thorough test suites for Raft RPC server methods:
- AddPeer: Test peer addition with various input scenarios
- RemovePeer: Validate peer removal in different conditions
- GetPeerInfo: Verify peer information retrieval

Enhance test coverage for Raft RPC server operations, improving reliability and robustness of cluster management methods.
sinadarbouy force-pushed the feature/dynamic-adding-raft-642 branch from 4bbd0c2 to b66e2bf on February 18, 2025 20:28
Development

Successfully merging this pull request may close these issues.

Enhance Raft Cluster Management with Health Checks, Dynamic Peer Management, and Security
3 participants