Fix segfault on unexpected node shutdown #1201

ianton-ru · 2025-12-09T12:50:25Z

Changelog category (leave one):

Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solved #1200
Fix segfault on unexpected node shutdown

Documentation entry for user-facing changes

Setting allow_retries_in_cluster_requests actually useless, and trying to process caused segfault.
Logic behind allow_retries_in_cluster_requests is based on the assumption that swarm nodes send first data packet only when already processed some portion of data. But in real world nodes send first packet immediately without any data, but only with header (column list) - https://github.com/Altinity/ClickHouse/blob/antalya-25.8/src/Server/TCPHandler.cpp#L1296
This makes this setting useless.
Segfault is a result of incorrect continue of reading after exception in RemoteQueryExecutorReadContext::Task::run.
Removing allow_retries_in_cluster_requests also removes this try-catch block, which fixes the problem.
EOF catched without continue reading, possible if node shutdowned in short period of time between getting subquery and answering "end of work" in case when swarm mode turned off.

CI/CD Options

Exclude tests:

Regression jobs to run:

github-actions · 2025-12-09T12:51:18Z

Workflow [PR], commit [be61867]

ianton-ru · 2025-12-09T12:51:18Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-12-09T12:59:13Z

src/QueryPipeline/RemoteQueryExecutorReadContext.cpp

+        read_context.packet = read_context.executor.getConnections().receivePacketUnlocked(async_callback);
        read_context.has_read_packet_part = PacketPart::Body;
+        if (read_context.packet.type == Protocol::Server::Data)
+            read_context.has_data_packets = true;


skip_unavailable_shards no longer masks early socket EOF

When a shard connection drops before any data is received (e.g., node crash) and skip_unavailable_shards is enabled, receivePacketUnlocked now throws ATTEMPT_TO_READ_AFTER_EOF and the exception bubbles out because Task::run no longer catches it. The prior logic swallowed this specific EOF when no data packets had arrived so the shard could be skipped; after this change the query fails instead of continuing on remaining shards, breaking the advertised skip behavior.

Useful? React with 👍 / 👎.

ianton-ru · 2025-12-09T14:41:18Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-12-09T14:50:10Z

src/QueryPipeline/RemoteQueryExecutorReadContext.cpp

+        if (e.code() == ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF
+            && !read_context.has_data_packets.load()
+            && read_context.executor.skipUnavailableShards())
+        {
+            read_context.packet.type = Protocol::Server::ConnectionLost;
+            read_context.packet.exception = std::make_unique<Exception>(getCurrentExceptionMessageAndPattern(true), getCurrentExceptionCode());
+            read_context.has_read_packet_part = PacketPart::None;
+            suspend_callback();


ConnectionLost packet never marked as readable

When handling an early EOF from a shard (ATTEMPT_TO_READ_AFTER_EOF with skipUnavailableShards()), the new branch builds a ConnectionLost packet but leaves has_read_packet_part at PacketPart::None. RemoteQueryExecutor::readAsync() later unconditionally calls getPacket() for the completed read and chasserts hasReadPacket() (see RemoteQueryExecutor.cpp around lines 590-595 and RemoteQueryExecutorReadContext.cpp lines 205-208). In this shutdown-before-data scenario, the code will now hit that assert instead of cleanly rescheduling/ignoring the shard. The packet should be marked as a body (or handled without getPacket()) before suspending, mirroring other packet paths.

Useful? React with 👍 / 👎.

ianton-ru · 2025-12-09T15:26:50Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-12-09T15:39:16Z

src/QueryPipeline/RemoteQueryExecutorReadContext.cpp

+        if (e.code() == ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF
+            && !read_context.has_data_packets.load()
+            && read_context.executor.skipUnavailableShards())
+        {
+            read_context.packet.type = Protocol::Server::ConnectionLost;


Preserve skip_unavailable_shards behavior on early EOF

With skip_unavailable_shards enabled, an ATTEMPT_TO_READ_AFTER_EOF before any data now produces a ConnectionLost packet that is rethrown in RemoteQueryExecutor (case Protocol::Server::ConnectionLost around lines 732–742 of RemoteQueryExecutor.cpp). The previous logic swallowed this exception when no data had been read so other shards could continue. This regression causes queries that should skip an unreachable shard to fail whenever the shard drops the socket before sending data.

Useful? React with 👍 / 👎.

Exception is rethrown only when tasks were not rescheduled on other replicas. Without rescheduling tasks can be lost forever, and response contain only part of data.

ianton-ru · 2025-12-10T08:02:02Z

failed tests looks unrelated

ilejn · 2025-12-10T09:40:28Z

LGTM,
~~although I am not confident enough to approve.~~

Remove useless allow_retries_in_cluster_requests setting

628904a

chatgpt-codex-connector bot reviewed Dec 9, 2025

View reviewed changes

ianton-ru force-pushed the bugfix/antalya-25.8/segfault_on_unexpected_shutdown branch from 027c0c4 to 3be7ff6 Compare December 9, 2025 14:38

chatgpt-codex-connector bot reviewed Dec 9, 2025

View reviewed changes

Return try-catch for unexpected EOF

be61867

ianton-ru force-pushed the bugfix/antalya-25.8/segfault_on_unexpected_shutdown branch from 3be7ff6 to be61867 Compare December 9, 2025 15:16

chatgpt-codex-connector bot reviewed Dec 9, 2025

View reviewed changes

svb-alt added antalya antalya-25.8 labels Dec 9, 2025

ilejn approved these changes Dec 11, 2025

View reviewed changes

mkmkme approved these changes Dec 15, 2025

View reviewed changes

Fix segfault on unexpected node shutdown #1201

Are you sure you want to change the base?

Fix segfault on unexpected node shutdown #1201

Conversation

ianton-ru commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

CI/CD Options

Exclude tests:

Regression jobs to run:

Uh oh!

github-actions bot commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ianton-ru commented Dec 9, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ianton-ru commented Dec 9, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ianton-ru commented Dec 9, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ianton-ru Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ianton-ru commented Dec 10, 2025

Uh oh!

ilejn commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ianton-ru commented Dec 9, 2025 •

edited

Loading

github-actions bot commented Dec 9, 2025 •

edited

Loading

ilejn commented Dec 10, 2025 •

edited

Loading