-
Notifications
You must be signed in to change notification settings - Fork 10
Fix segfault on unexpected node shutdown #1201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: antalya-25.8
Are you sure you want to change the base?
Fix segfault on unexpected node shutdown #1201
Conversation
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| read_context.packet = read_context.executor.getConnections().receivePacketUnlocked(async_callback); | ||
| read_context.has_read_packet_part = PacketPart::Body; | ||
| if (read_context.packet.type == Protocol::Server::Data) | ||
| read_context.has_data_packets = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip_unavailable_shards no longer masks early socket EOF
When a shard connection drops before any data is received (e.g., node crash) and skip_unavailable_shards is enabled, receivePacketUnlocked now throws ATTEMPT_TO_READ_AFTER_EOF and the exception bubbles out because Task::run no longer catches it. The prior logic swallowed this specific EOF when no data packets had arrived so the shard could be skipped; after this change the query fails instead of continuing on remaining shards, breaking the advertised skip behavior.
Useful? React with 👍 / 👎.
027c0c4 to
3be7ff6
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (e.code() == ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF | ||
| && !read_context.has_data_packets.load() | ||
| && read_context.executor.skipUnavailableShards()) | ||
| { | ||
| read_context.packet.type = Protocol::Server::ConnectionLost; | ||
| read_context.packet.exception = std::make_unique<Exception>(getCurrentExceptionMessageAndPattern(true), getCurrentExceptionCode()); | ||
| read_context.has_read_packet_part = PacketPart::None; | ||
| suspend_callback(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConnectionLost packet never marked as readable
When handling an early EOF from a shard (ATTEMPT_TO_READ_AFTER_EOF with skipUnavailableShards()), the new branch builds a ConnectionLost packet but leaves has_read_packet_part at PacketPart::None. RemoteQueryExecutor::readAsync() later unconditionally calls getPacket() for the completed read and chasserts hasReadPacket() (see RemoteQueryExecutor.cpp around lines 590-595 and RemoteQueryExecutorReadContext.cpp lines 205-208). In this shutdown-before-data scenario, the code will now hit that assert instead of cleanly rescheduling/ignoring the shard. The packet should be marked as a body (or handled without getPacket()) before suspending, mirroring other packet paths.
Useful? React with 👍 / 👎.
3be7ff6 to
be61867
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (e.code() == ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF | ||
| && !read_context.has_data_packets.load() | ||
| && read_context.executor.skipUnavailableShards()) | ||
| { | ||
| read_context.packet.type = Protocol::Server::ConnectionLost; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve skip_unavailable_shards behavior on early EOF
With skip_unavailable_shards enabled, an ATTEMPT_TO_READ_AFTER_EOF before any data now produces a ConnectionLost packet that is rethrown in RemoteQueryExecutor (case Protocol::Server::ConnectionLost around lines 732–742 of RemoteQueryExecutor.cpp). The previous logic swallowed this exception when no data had been read so other shards could continue. This regression causes queries that should skip an unreachable shard to fail whenever the shard drops the socket before sending data.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception is rethrown only when tasks were not rescheduled on other replicas. Without rescheduling tasks can be lost forever, and response contain only part of data.
|
failed tests looks unrelated |
|
LGTM, |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Solved #1200
Fix segfault on unexpected node shutdown
Documentation entry for user-facing changes
Setting
allow_retries_in_cluster_requestsactually useless, and trying to process caused segfault.Logic behind
allow_retries_in_cluster_requestsis based on the assumption that swarm nodes send first data packet only when already processed some portion of data. But in real world nodes send first packet immediately without any data, but only with header (column list) - https://github.com/Altinity/ClickHouse/blob/antalya-25.8/src/Server/TCPHandler.cpp#L1296This makes this setting useless.
Segfault is a result of incorrect continue of reading after exception in
RemoteQueryExecutorReadContext::Task::run.Removing
allow_retries_in_cluster_requestsalso removes this try-catch block, which fixes the problem.EOF catched without continue reading, possible if node shutdowned in short period of time between getting subquery and answering "end of work" in case when swarm mode turned off.
CI/CD Options
Exclude tests:
Regression jobs to run: