@bowenlan-amzn
Member

@bowenlan-amzn bowenlan-amzn commented Nov 10, 2025

Description

This PR fixes a bootstrap failure that occurs when streaming transport is used with remote cluster state. Without this fix, nodes fail to start with the error:

can't overwrite as repositories are already present
	at org.opensearch.repositories.RepositoriesService.updateRepositoriesMap(RepositoriesService.java:885)
	at org.opensearch.node.remotestore.RemoteStoreNodeService.createAndVerifyRepositories(RemoteStoreNodeService.java:163)
	at org.opensearch.node.Node$LocalNodeFactory.apply(Node.java:2355)
	at org.opensearch.node.Node$LocalNodeFactory.apply(Node.java:2323)
	at org.opensearch.transport.TransportService.doStart(TransportService.java:402)
	at org.opensearch.common.lifecycle.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:77)
	at org.opensearch.node.Node.start(Node.java:1850)

Root Cause

When streaming transport is enabled, the node bootstrap process creates two separate LocalNodeFactory instances:

  1. One for the regular TransportService
  2. Another for the StreamTransportService

During node startup, both transport services are started sequentially:

if (streamTransportService != null) {
    streamTransportService.start();  // First call to LocalNodeFactory.apply()
}
transportService.start();  // Second call to LocalNodeFactory.apply()

Both services inherit TransportService.doStart(), which calls:

localNode = localNodeFactory.apply(transport.boundAddress());

Each call to LocalNodeFactory.apply() triggers:

  1. DiscoveryNode creation
  2. Remote store repository creation and verification via remoteStoreNodeService.createAndVerifyRepositories()

Since both factories attempt to register the same repositories (configured via node attributes), the second call fails with "can't overwrite as repositories are already present".
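The failure mode above can be reproduced in isolation. The following is a minimal sketch, not the OpenSearch code: RepositoryRegistry and RegisteringNodeFactory are hypothetical stand-ins for RepositoriesService and LocalNodeFactory, nodes are modelled as plain strings, and IllegalStateException is an assumed exception type chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DuplicateRegistrationSketch {

    /** Hypothetical stand-in for a repository registry that refuses to overwrite entries. */
    static final class RepositoryRegistry {
        private final Map<String, String> repositories = new HashMap<>();

        void register(String name, String type) {
            if (repositories.containsKey(name)) {
                // Assumed exception type; only the message mirrors the reported error.
                throw new IllegalStateException("can't overwrite as repositories are already present");
            }
            repositories.put(name, type);
        }
    }

    /** Hypothetical stand-in for LocalNodeFactory: builds a node and registers repositories as a side effect. */
    static final class RegisteringNodeFactory implements Function<String, String> {
        private final RepositoryRegistry registry;

        RegisteringNodeFactory(RepositoryRegistry registry) {
            this.registry = registry;
        }

        @Override
        public String apply(String boundAddress) {
            String node = "node@" + boundAddress;
            // Same repository name on every call, because it comes from the same node attributes.
            registry.register("remote-state-repo", "fs");
            return node;
        }
    }

    /** Starts two "transports" that share one registry; returns the failure message, or null if none. */
    static String startBothTransports() {
        RepositoryRegistry registry = new RepositoryRegistry();
        Function<String, String> streamFactory = new RegisteringNodeFactory(registry);
        Function<String, String> regularFactory = new RegisteringNodeFactory(registry);
        try {
            streamFactory.apply("127.0.0.1:9400");  // first call registers the repository
            regularFactory.apply("127.0.0.1:9300"); // second call hits the overwrite guard
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(startBothTransports());
    }
}
```

The two factories are separate objects, but they share one registry, so the side effect in apply() runs twice against the same state.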

Solution

Skip remote store repository creation and verification when the LocalNodeFactory is created for the stream transport, so that only the regular TransportService registers the repositories.
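A rough sketch of the flag-based approach, under stated assumptions: the useStreamTransport name is taken from the constructor parameter in this PR's diff, while RepositoryVerifier and NodeFactory are hypothetical stand-ins for RemoteStoreNodeService and LocalNodeFactory, and the exception type is assumed. The actual change in Node.java may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class FlaggedFactorySketch {

    /** Hypothetical stand-in for RemoteStoreNodeService; rejects a second registration. */
    static final class RepositoryVerifier {
        final List<String> verifiedFor = new ArrayList<>();

        void createAndVerifyRepositories(String node) {
            if (!verifiedFor.isEmpty()) {
                // Assumed exception type; only the message mirrors the reported error.
                throw new IllegalStateException("can't overwrite as repositories are already present");
            }
            verifiedFor.add(node);
        }
    }

    /** Hypothetical stand-in for LocalNodeFactory with the extra flag. */
    static final class NodeFactory implements Function<String, String> {
        private final RepositoryVerifier verifier;
        private final boolean useStreamTransport;

        NodeFactory(RepositoryVerifier verifier, boolean useStreamTransport) {
            this.verifier = verifier;
            this.useStreamTransport = useStreamTransport;
        }

        @Override
        public String apply(String boundAddress) {
            String node = "node@" + boundAddress;
            // Only the factory for the regular transport touches the repositories.
            if (!useStreamTransport) {
                verifier.createAndVerifyRepositories(node);
            }
            return node;
        }
    }

    /** Starts the stream transport first, then the regular one; returns the nodes that were verified. */
    static List<String> startBothTransports() {
        RepositoryVerifier verifier = new RepositoryVerifier();
        new NodeFactory(verifier, true).apply("127.0.0.1:9400");
        new NodeFactory(verifier, false).apply("127.0.0.1:9300");
        return verifier.verifiedFor;
    }

    public static void main(String[] args) {
        System.out.println(startBothTransports());
    }
}
```

With the flag in place, both factories can run in either order and the repositories are registered exactly once.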

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for 566d7f6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@bowenlan-amzn bowenlan-amzn force-pushed the streaming-transport-bootstrap branch from 566d7f6 to b64bdc7 Compare November 11, 2025 02:13
@bowenlan-amzn bowenlan-amzn force-pushed the streaming-transport-bootstrap branch from b64bdc7 to 68fb9d8 Compare November 11, 2025 02:33
@github-actions
Contributor

❌ Gradle check result for 68fb9d8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: bowenlan-amzn <[email protected]>
@bowenlan-amzn bowenlan-amzn changed the title from "Reuse local node factory when streaming transport bootstrap" to "Fix node bootstrap error when enable stream transport and remote cluster state" Nov 11, 2025
Signed-off-by: bowenlan-amzn <[email protected]>
@github-actions
Contributor

✅ Gradle check result for 3688824: SUCCESS

@codecov

codecov bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (4dc8704) to head (3688824).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19948      +/-   ##
============================================
+ Coverage     73.25%   73.26%   +0.01%     
- Complexity    71555    71592      +37     
============================================
  Files          5785     5785              
  Lines        326828   326831       +3     
  Branches      47295    47295              
============================================
+ Hits         239429   239464      +35     
+ Misses        68163    68133      -30     
+ Partials      19236    19234       -2     

Comment on lines +2337 to +2338
RemoteStoreNodeService remoteStoreNodeService,
boolean useStreamTransport
Member

The weird thing here is that "useStreamTransport" doesn't make much sense as a parameter, because nothing in LocalNodeFactory is about stream transport. I think you can use composition to make this clearer, if a bit more verbose. You can refactor LocalNodeFactory into two separate functions. First, remove RemoteStoreNodeService from this implementation, then create one that looks like:

// Wraps the plain factory and adds remote store repository verification on top.
private static class RemoteStoreVerifyingLocalNodeFactory extends LocalNodeFactory {
    private final RemoteStoreNodeService remoteStoreNodeService;

    private RemoteStoreVerifyingLocalNodeFactory(Settings settings, String persistentNodeId, RemoteStoreNodeService remoteStoreNodeService) {
        super(settings, persistentNodeId);
        this.remoteStoreNodeService = remoteStoreNodeService;
    }

    @Override
    public DiscoveryNode apply(BoundTransportAddress boundTransportAddress) {
        // The base class only builds the DiscoveryNode.
        final DiscoveryNode discoveryNode = super.apply(boundTransportAddress);
        // Assumes LocalNodeFactory makes its settings field visible to this subclass.
        if (isRemoteStoreAttributePresent(settings)) {
            remoteStoreNodeService.createAndVerifyRepositories(discoveryNode);
        }
        return discoveryNode;
    }
}

In the normal case you construct a RemoteStoreVerifyingLocalNodeFactory instance, and in the stream transport case you just construct a plain LocalNodeFactory instance.
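To illustrate how the two factories might be selected at wiring time, here is a self-contained sketch. It is not the Node.java code: PlainNodeFactory, VerifyingNodeFactory and factoryFor are hypothetical names, nodes are modelled as plain strings, and a Consumer stands in for RemoteStoreNodeService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class FactorySelectionSketch {

    /** Hypothetical stand-in for a LocalNodeFactory that only builds the node. */
    static class PlainNodeFactory implements Function<String, String> {
        private final String nodeId;

        PlainNodeFactory(String nodeId) {
            this.nodeId = nodeId;
        }

        @Override
        public String apply(String boundAddress) {
            return nodeId + "@" + boundAddress;
        }
    }

    /** Hypothetical stand-in for RemoteStoreVerifyingLocalNodeFactory. */
    static final class VerifyingNodeFactory extends PlainNodeFactory {
        private final Consumer<String> verifier;

        VerifyingNodeFactory(String nodeId, Consumer<String> verifier) {
            super(nodeId);
            this.verifier = verifier;
        }

        @Override
        public String apply(String boundAddress) {
            String node = super.apply(boundAddress);
            verifier.accept(node); // repository creation and verification would happen here
            return node;
        }
    }

    /** Picks the factory per transport, so neither factory needs to know about stream transport. */
    static Function<String, String> factoryFor(boolean streamTransport, String nodeId, Consumer<String> verifier) {
        return streamTransport ? new PlainNodeFactory(nodeId) : new VerifyingNodeFactory(nodeId, verifier);
    }

    public static void main(String[] args) {
        List<String> verified = new ArrayList<>();
        factoryFor(true, "n1", verified::add).apply("127.0.0.1:9400");
        factoryFor(false, "n1", verified::add).apply("127.0.0.1:9300");
        System.out.println(verified);
    }
}
```

The decision about stream transport lives at the call site that builds the factories, which keeps the flag out of both factory classes.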

2 participants