Skip to content

Conversation

@unsuman
Copy link
Contributor

@unsuman unsuman commented Nov 14, 2025

Fixes #4343

[Edit]: Also fixes #4348

Copy link
Contributor

@norio-nomura norio-nomura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not resolve the real problem (See #4343 (comment)).
This change should not be accepted.

@unsuman unsuman force-pushed the fix/external-vz-startup branch from b4b0dc4 to dbb1bf6 Compare November 18, 2025 08:29
@unsuman unsuman requested a review from norio-nomura November 18, 2025 08:36
@unsuman
Copy link
Contributor Author

unsuman commented Nov 18, 2025

Now it waits for Start() to complete before running AdditionalSetupForSSH(), can you please check?

@norio-nomura
Copy link
Contributor

norio-nomura commented Nov 18, 2025

With following patch:

@@ -0,0 +1,21 @@
diff --git forkSrcPrefix/pkg/driver/vz/vz_driver_darwin.go forkDstPrefix/pkg/driver/vz/vz_driver_darwin.go
index c87995a62c8b53b7724ba4020981c3f5851cd86f..83b074f95bd39c5b7b371cc7861d5b7c3b298037 100644
--- forkSrcPrefix/pkg/driver/vz/vz_driver_darwin.go
+++ forkDstPrefix/pkg/driver/vz/vz_driver_darwin.go
@@ -299,6 +299,7 @@ func (l *LimaVzDriver) CreateDisk(ctx context.Context) error {
 
 func (l *LimaVzDriver) Start(ctx context.Context) (chan error, error) {
 	logrus.Infof("Starting VZ (hint: to watch the boot progress, see %q)", filepath.Join(l.Instance.Dir, "serial*.log"))
+	defer logrus.Infof("Started VZ")
 	vm, waitSSHLocalPortAccessible, errCh, err := startVM(ctx, l.Instance, l.SSHLocalPort)
 	if err != nil {
 		if errors.Is(err, vz.ErrUnsupportedOSVersion) {
@@ -441,6 +442,8 @@ func (l *LimaVzDriver) ForwardGuestAgent() bool {
 }
 
 func (l *LimaVzDriver) AdditionalSetupForSSH(ctx context.Context) error {
+	logrus.Infof("AdditionalSetupForSSH: waiting for SSH port %d to be accessible", l.SSHLocalPort)
+	defer logrus.Infof("AdditionalSetupForSSH: SSH port %d is accessible", l.SSHLocalPort)
 	ticker := time.NewTicker(100 * time.Millisecond)
 	defer ticker.Stop()
 

following logs are recorded in driver.stderr.log:

time="2025-11-18T18:42:58+09:00" level=debug msg="Received AdditionalSetupForSSH request"
time="2025-11-18T18:42:58+09:00" level=info msg="AdditionalSetupForSSH: waiting for SSH port 50725 to be accessible"
time="2025-11-18T18:42:58+09:00" level=debug msg="Received Start request"
time="2025-11-18T18:42:58+09:00" level=info msg="Starting VZ (hint: to watch the boot progress, see \"/Users/norio/.lima/default/serial*.log\")"
time="2025-11-18T18:42:58+09:00" level=info msg="Started VZ"
time="2025-11-18T18:42:58+09:00" level=debug msg="Stream context done, stopping Start"
time="2025-11-18T18:42:58+09:00" level=info msg="[VZ] - vm state change: running"
time="2025-11-18T18:43:05+09:00" level=info msg="Started vsock forwarder: 127.0.0.1:50725 -> vsock:22 on VM"
time="2025-11-18T18:43:05+09:00" level=info msg="Detected SSH server is listening on the vsock port; changed 127.0.0.1:50725 to proxy for the vsock port"
time="2025-11-18T18:43:06+09:00" level=info msg="AdditionalSetupForSSH: SSH port 50725 is accessible"
time="2025-11-18T18:43:06+09:00" level=debug msg="AdditionalSetupForSSH succeeded"

It indicates that LimaVzDriver.AdditionalSetupForSSH() is called before calling LimaVzDriver.Start().

I think:

  • DriverServer.Start() should send initial response that contains Error or Success that express the result returned by the driver's Start().
  • DriverClient.Start() should wait initial response, then returns them.
  • Server side stream will be closed when DriverServer.Start() returns, so it should not returns until errChan is closed by driver or stream is closed by client.
  • Client side stream will not be closed when DriverClient.Start() returns, and can be closed by stream.CloseSend(). it should be called when ctx.Done().
  • The driver's Start() should receive stream.Context() to handle Server shutdown.

additionally:

  • The closing driver's errChan event can't be expressed in current API design. But current drivers behavior does not defined when errChan should be closed. So, it won't be problems.

Thoughts about driver's implementation that not related to this PR:

  • server.Serve() is implemented to handling SIGINT and SIGINT, but server.Stop() sends SIGKILL.

@norio-nomura
Copy link
Contributor

Added link to gRPC API document on above comment.

@unsuman unsuman force-pushed the fix/external-vz-startup branch from dbb1bf6 to 4d6a8cc Compare November 18, 2025 13:38
@unsuman
Copy link
Contributor Author

unsuman commented Nov 18, 2025

Thanks for the detailed explanation!

@unsuman unsuman force-pushed the fix/external-vz-startup branch from 4d6a8cc to 0003cec Compare November 18, 2025 13:39
Copy link
Contributor Author

@unsuman unsuman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check now?

Comment on lines 444 to 462
ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()

timeout := time.After(60 * time.Second)

for {
if l.waitSSHLocalPortAccessible != nil {
<-l.waitSSHLocalPortAccessible
return nil
}

select {
case <-ctx.Done():
return ctx.Err()
case <-timeout:
return errors.New("timeout waiting for Start() to initialize")
case <-ticker.C:
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting these changes also works but I think it can remain so to avoid listening on an empty or uninitialised channel? @norio-nomura

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is "we don't know why it's happening, but we know what's going on", I think it's an effective code to deal with, but now that we know why it's happening, it's an unnecessary code.

If there might be something else to deal with, try not to create a situation where the channel has not been created.
When I decided to continue using waitSSHLocalPortAccessible in #4251, I didn't expect AdditionalSetupForSSH() to be called before Start() was called, so I left it created in Start().
However, if I had created a channel at the same time as creating the LimaVzDriver instance, the problem with AdditionalSetupForSSH() would not have occurred this time, and it might not have caught my attention, and the behavior of DriverClient.Start() might not have been known.
And it was causing another problem in another place (already occurring in wsl2?) It may be, so I think it's good to be as it is now.

If it had been specified as a document that "DriverClient.Start() does not wait for the actual Driver.Start()", it might have been different.


I think the current Driver API lacks documentation (doc comment).

Previously, in #4248, I tried to use SSHAddress() as the current AdditionalSetupForSSH().
I stopped for the time being because the explanation was given, but to be honest, I didn't understand it well even after reading the explanation.

I want it to be clearly stated in the document.

  • When is it used: Before the instance starts? After activation?
  • What is it used for?

There may be people who want to judge from the usage situation.
But for example:

vSockPort := limaDriver.Info().VsockPort
virtioPort := limaDriver.Info().VirtioPort
noCloudInit := limaDriver.Info().Features.NoCloudInit
rosettaEnabled := limaDriver.Info().Features.RosettaEnabled
rosettaBinFmt := limaDriver.Info().Features.RosettaBinFmt

Questions such as come to mind:

  • Why does this call the same API 5 times in a row?
  • Does this assume that something will change on the driver side by calling it continuously?
  • Is this method of use correct?

To be honest, I'm not very interested in external drivers, so it doesn't make sense to answer the question here.
However, new contributors who may implement external drivers have no choice but to see how they are used in the current situation where there is no documentation, and I think they will have similar questions.
I don't know if there is anyone who can answer when someone asks that question in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it might not have caught my attention, and the behavior of DriverClient.Start() might not have been known.

Many thanks!🙏🏻

To be honest, I'm not very interested in external drivers, so it doesn't make sense to answer the question here.
However, new contributors who may implement external drivers have no choice but to see how they are used in the current situation where there is no documentation, and I think they will have similar questions.
I don't know if there is anyone who can answer when someone asks that question in the future.

Thanks for pointing this out. I implemented all these things as a sprint behaviour during GSOC and also I was new to the codebase. As you’ve pointed out, many of them are inefficient. Also when implementing registry package and driver-server I assumed that one limactl process may invoke multiple external drivers but I was wrong and have to work on this too!

Since the driver API hasn’t been finalised yet, a descriptive doc is yet to be created.

@unsuman unsuman force-pushed the fix/external-vz-startup branch from 0003cec to 2fe01b9 Compare November 18, 2025 13:45
@unsuman unsuman changed the title vmdriver(vz): fix race condition for AdditionalSetupForSSH() vmdriver(vz): wait for Start() to complete on the server side Nov 18, 2025
@unsuman unsuman changed the title vmdriver(vz): wait for Start() to complete on the server side driver(external): wait for Start() to complete on the server side Nov 18, 2025

// Blocking to receive an initial response to ensure Start() is initiated
// at the server-side.
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to loop here.
initialResp.Error should be returned as the return value of Start().
The subsequent StartResponse.Error should be sent to errCh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If initialResp.Success is false, there is no need to return errCh.


errCh := make(chan error, 1)
go func() {
defer func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be defer func(), but go func() outside and wait for ctx.Done() to call stream.CloseSend().
In the current situation, the only way to get out of the go routine is to shut down on the server side.
When stream.CloseSend() is called, stream.Recv() should return an error and exit the go routine.

}()

for {
errorStream, err := stream.Recv()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name errStream is strange. The content is *pb.StartResponse.

for {
errorStream, err := stream.Recv()
if err != nil {
d.logger.Errorf("Error receiving response from driver: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the stream is closed from the server side, it is better to leave it in the log with Infof instead of Errorf().

if !errorStream.Success {
errCh <- errors.New(errorStream.Error)
} else {
errCh <- nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the driver closes errChan on the server side, it is decided to send StartResponse{Success:true} to the client, and it may be good to close errCh on the client side here.
Even if the driver closes or does not close errChan at any time, even if it is not defined.

func (s *DriverServer) Start(_ *emptypb.Empty, stream pb.Driver_StartServer) error {
s.logger.Debug("Received Start request")
errChan, err := s.driver.Start(stream.Context())
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The err here should be sent to the client as initialResponse.
If initialResponse.Success is true, the subsequent error received by errChan is sent to the client.
When errChan is closed, it would be a good idea to send pb.StartResponse{Success: true} to express channel closure.

@unsuman unsuman force-pushed the fix/external-vz-startup branch 3 times, most recently from d40d24a to 826b868 Compare November 19, 2025 20:05
Copy link
Contributor

@norio-nomura norio-nomura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it has become the desired operation. 👍🏻

After that, since it's a good opportunity, let's add the action decided in this PR as a doc comment to the definition below.

  • pkg/driver/external/driver.proto: StartResponse
  • pkg/driver/external/client/methods.go: Start()
  • pkg/driver/external/server/methods.go: Start()

Actually, it would be better to test the behavior with a mock driver, but I think it's okay to do another PR.

return nil, err
}
if !initialResp.Success {
return nil, errors.New(initialResp.Error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to stream.CloseSend() here?
The server side should also be closed, so there may be no problem, but I'm not familiar enough with gRPC to determine whether it's a problem or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server side should also be closed, so there may be no problem, but I'm not familiar enough with gRPC to determine whether it's a problem or not.

Yes, no need to close here!

// at the server-side.
initialResp, err := stream.Recv()
if err != nil {
d.logger.Errorf("Error receiving initial response from driver start: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have previously been asked to use logger.WithError(err) when recording Error with logger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case then all the error logging of pkg/driver/external/*/methods.go needs to be changed which can be a new issue. I am only changing for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #4380

@unsuman unsuman force-pushed the fix/external-vz-startup branch from 826b868 to d784575 Compare November 20, 2025 14:00
@unsuman unsuman force-pushed the fix/external-vz-startup branch from d784575 to 12bf4bc Compare November 20, 2025 14:08
@unsuman
Copy link
Contributor Author

unsuman commented Nov 20, 2025

Actually, it would be better to test the behavior with a mock driver, but I think it's okay to do another PR.

Do you mean to add a test in the CI which builds the current drivers as external and tests the gRPC transport?

@norio-nomura
Copy link
Contributor

Actually, it would be better to test the behavior with a mock driver, but I think it's okay to do another PR.

Do you mean to add a test in the CI which builds the current drivers as external and tests the gRPC transport?

No. Isn't it possible to test the behavior of Start() with dummy driver in the Unit test? I thought about it, but I don't know if it's easy to do, and it seems too much.

However, there is no doubt that #4343 could have been found earlier if it had been tested in CI.
Personally, I think it would be better to test external drivers in a job scheduled to be performed once a day, rather than a test that is required to pass in PR.

Copy link
Contributor

@norio-nomura norio-nomura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍🏻
I don't have permission to merge PR, so I will wait for it to be checked by @lima-vm/committers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DriverClient.Start() should wait execution of actual driver's Start() VZ gets stuck waiting for AdditionalSetupForSSH()

2 participants