Skip to content

Add an option for determine read/write timeout for namenodes #245

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kimtree
Copy link
Contributor

@kimtree kimtree commented Sep 14, 2020

In most cases for the datanodes, the user can set read/write timeout by using SetDeadline to the Readers. But for namenodes, we can only use a single DialContext when initializing the namenode connection.

The issue that I faced was, I was able to connect to the namenode without any timeout issues, also works fine for write function. But, we had an issue that namenode can't respond to the request. At this point, the client hangs.
ref: https://github.com/colinmarc/hdfs/blob/master/internal/rpc/namenode.go#L209

I thought It would be great if we add options for determining read/write timeout value for the namenodes to prevent hanging when the namenode isn't available for responding.

Please let me know if I misunderstood the concept :)

@colinmarc
Copy link
Owner

colinmarc commented Sep 14, 2020

Hi @kimtree, thanks for the suggestion.

I think this may be the simplest way to solve the problem, but I've also been considering adding context support for some methods, eg RenameContext(ctx context.Context, oldpath, newpath string), with support for cancellation. Would that work in your case?

@kimtree
Copy link
Contributor Author

kimtree commented Sep 14, 2020

Hi @colinmarc, It's good to hear that you've been considering adding context support! I also made context support by wrapping the library function in our project like below.

func open(ctx context.Context, client hdfs.Client, path string) (*FileReader, error) {
reader, err := client.Open(path)
if err != nil {
	return nil, err
}

if deadline, ok := ctx.Deadline(); ok {
	if err := reader.SetDeadline(deadline); err != nil {
		return nil, fmt.Errorf("failed to set deadline: %w", err)
	}
}

return reader, nil
}

This way of using context is not complete since client.Open function is out of control. Thus, our problem occurs when executing c.namenode.Execute from client.Open function. Are you going to add context support when calling c.namenode.Execute function? If so, It would solve our problem!

@colinmarc
Copy link
Owner

Exactly. I would change the signature of Execute to take a context.

@kimtree
Copy link
Contributor Author

kimtree commented Sep 14, 2020

Yes, That's a good idea. So when do you plan to add this feature?

@colinmarc
Copy link
Owner

Currently my thinking would be to add RenameContext etc for a point release, and then change the old Rename etc for a 3.0 release. I'm not working on context support right now, and therefore can't give you an estimate on when it may land. If you'd like to take a stab at it, that would be more than welcome!

@colinmarc
Copy link
Owner

This way of using context is not complete since client.Open function is out of control. Thus, our problem occurs when executing c.namenode.Execute from client.Open function. Are you going to add context support when calling c.namenode.Execute function? If so, It would solve our problem!

That said, this example illustrates where it would be tricky. The call to Execute (for getBlocks) actually happens lazily, in Read. We could move that up, but that might introduce subtle changes in behaviour.

@kimtree
Copy link
Contributor Author

kimtree commented Sep 14, 2020

Yeah, that's true. The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward. To the users, It might be confused since the timeout occurs both namenode and datanode and couldn't understand the concept of how timeout works at a glance. (Timeout includes actions both namenode and datanode or the only actions for each namenode and datanode)

I'm not sure my approach is still acceptable in the meantime waiting for v3, Is it possible to accept this PR and release a minor version for v2?

@colinmarc
Copy link
Owner

The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward.

Hm, maybe Open is actually one of the functions that doesn't need a context parameter. We could just use the read deadline for the getBlocks. Or is that too confusing?

@kimtree
Copy link
Contributor Author

kimtree commented Sep 14, 2020

My problem wasn't from the FileReader since we can set deadline when reading the data later. In my case, our namenode failures when executing info, err := c.getFileInfo(name) from Open function and that causes hang while reading the response from the namenode. So, It'll also block the later operations like creating FileReader. That's why I add an option for read/write timeout for namenode specifically.

@lnn1988
Copy link

lnn1988 commented Feb 25, 2022

Exactly. I would change the signature of Execute to take a context.

really excited about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants