feat(services/azdls): list start from #5242

Open
wants to merge 18 commits into
base: main

alexwilcoxson-rel

Which issue does this PR close?

Closes #.

Rationale for this change

This change proposes an implementation of the start_after workaround used by Hadoop to improve listing in ADLS from a particular point.

This is important for table formats like Delta Lake that make use of list_with_offset in object store to get the latest version of a table.

In my testing, listing directories with more than 10k records, similar to what Delta Lake would do, improves from seconds to milliseconds (~3 seconds to 60 ms).
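
For readers unfamiliar with the access pattern, here is a minimal sketch of what listing from an offset means for a Delta-style log. It assumes the listing is sorted lexicographically and that commit files use zero-padded 20-digit version names; `list_start_after` and `commit_name` are hypothetical helpers for illustration, not OpenDAL or object_store APIs.

```rust
/// Hypothetical helper: commit files named with a zero-padded 20-digit
/// version sort lexicographically in version order.
fn commit_name(version: u64) -> String {
    format!("{version:020}.json")
}

/// Return the entries that sort strictly after `start_after`.
/// `entries` is assumed to be sorted lexicographically already.
fn list_start_after<'a>(entries: &[&'a str], start_after: &str) -> Vec<&'a str> {
    // partition_point finds the first index whose entry is > start_after.
    let first = entries.partition_point(|e| *e <= start_after);
    entries[first..].to_vec()
}
```

With server-side support the service skips ahead itself, so the client does not have to page through older entries; the sketch only shows which entries come back.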

This also adds SAS token support, which could be split into its own PR, but I needed it for testing against my company's ADLS accounts.

What changes are included in this PR?

Changes to the azdls module to include list with start_after functionality. This includes a crc64 module, which computes the CRC used in the continuation token generated for start_after.
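
A rough sketch of how such a token could be assembled, for discussion only. The polynomial, the `<name>#$0` checksum input, the signed formatting, and the `<crc> 0 <name>` layout are assumptions modeled loosely on the Hadoop workaround and are not confirmed anywhere in this thread. `crc64` and `raw_continuation_token` are illustrative names, and a real request would presumably still encode the result (for example as base64) before sending it.

```rust
/// Reflected CRC-64 polynomial. Assumption: the thread does not say
/// which CRC-64 variant the crc64 module uses.
const POLY: u64 = 0x9a6c_9329_ac4b_c9b5;

/// Bitwise CRC-64 with an all-ones initial value and a final inversion.
fn crc64(data: &[u8]) -> u64 {
    let mut crc = u64::MAX;
    for &byte in data {
        crc ^= u64::from(byte);
        for _ in 0..8 {
            // Shift out one bit; fold in the polynomial when that bit was set.
            crc = if crc & 1 == 1 { (crc >> 1) ^ POLY } else { crc >> 1 };
        }
    }
    !crc
}

/// Hypothetical token layout: "<crc> 0 <name>", where the checksum
/// covers "<name>#$0" and is printed as a signed 64-bit integer.
fn raw_continuation_token(start_after: &str) -> String {
    let checksum = crc64(format!("{start_after}#$0").as_bytes()) as i64;
    format!("{checksum} 0 {start_after}")
}
```

Because the token embeds the entry name, the service can resume the listing at that name without the client having fetched any earlier page.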

Are there any user-facing changes?

Yes, start_after will function for azdls, and SAS token support is added.


@Xuanwo I'm opening this as a draft to get initial feedback. This is not directly supported by Microsoft, or I should say not documented. In pursuit of them adding this capability to the azblob endpoints, they were the ones that directed me to the Hadoop example, so I have confidence in it, but I could see if it's something they would officially support/document.

One caveat as well: I believe (per the Hadoop code) the change will only work if hierarchical namespace (xns) is enabled, which isn't something I believe we can tell when the list request is being made. It could be that the user has to opt in and tell us whether xns is enabled. The Hadoop code does another request to figure that out dynamically, which I'd personally want to avoid.

@github-actions github-actions bot added the releases-note/feat label Oct 24, 2024
Member

@Xuanwo Xuanwo left a comment

Hi, thank you @alexwilcoxson-rel for working on this. I've been considering porting logic from Hadoop to OpenDAL for a while but haven't taken action yet.

This feature is undocumented, so it's best not to enable it by default. How about adding a new flag called enable_list_with_start_after, where we set the capability list_with_start_after to true?
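
One way the opt-in could look, as a minimal sketch. `AzdlsConfig`, `Capability`, `capability_for`, and `resolve_start_after` are simplified stand-ins written for this example, not OpenDAL's actual types or functions; only the flag name and the capability name come from the suggestion above.

```rust
/// Hypothetical stand-in for a service capability set.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Capability {
    list: bool,
    list_with_start_after: bool,
}

/// Hypothetical stand-in for the service config.
#[derive(Debug, Default, Clone, Copy)]
struct AzdlsConfig {
    /// Opt-in switch for the undocumented start_after behavior.
    enable_list_with_start_after: bool,
}

fn capability_for(config: &AzdlsConfig) -> Capability {
    Capability {
        list: true,
        // Only advertised when the user explicitly opts in.
        list_with_start_after: config.enable_list_with_start_after,
    }
}

/// One possible handling: reject start_after when the capability is off.
fn resolve_start_after<'a>(
    cap: Capability,
    requested: Option<&'a str>,
) -> Result<Option<&'a str>, String> {
    match requested {
        Some(_) if !cap.list_with_start_after => {
            Err("list_with_start_after is not enabled for this service".to_string())
        }
        other => Ok(other),
    }
}
```

Keeping the default at `false` means existing users see no behavior change unless they set the flag.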

@alexwilcoxson-rel
Author

alexwilcoxson-rel commented Oct 28, 2024

hey @Xuanwo as I was refactoring the code I wrote a test against one of our actual storage accounts. I then executed the list call with recursive = true as the object_store integration does. However, when I do that, the list result from Azure contains everything.

I'm trying to figure out how recursive could be impacting the azdls list but I don't see it used by azdls service.

EDIT: ah I see the CompleteLayer, looking into this.

@alexwilcoxson-rel
Author

Ok so I found the FlatLister does not retain the OpList args.

Modifying FlatLister to take in the original args and pass them to the inner lister seems to work for this case, but I'm not sure how recursive listing and start_after should interact for all edge cases.

The underlying Azure REST API does have a recursive parameter, but when actually using that:

  1. the recursive listing performance in general is not great on azdls
  2. start_after continuation token causes 500 when recursive = true
    so I am hesitant to enable it

@Xuanwo
Member

Xuanwo commented Oct 29, 2024

Thank you @alexwilcoxson-rel for catching this.

Modifying FlatLister to take in the original args and pass them to the inner lister seems to work for this case, but I'm not sure how recursive listing and start_after should interact for all edge cases.

It should be fine as long as we handle the capability correctly.

The underlying Azure REST API does have a recursive parameter, but when actually using that:

  1. the recursive listing performance in general is not great on azdls
  2. start_after continuation token causes 500 when recursive = true
    so I am hesitant to enable it

That seems to be a problem. Would you like to raise a separate issue for this? We can track it and switch to azdls's own recursive when available.

@alexwilcoxson-rel
Author

alexwilcoxson-rel commented Oct 30, 2024

The underlying Azure REST API does have a recursive parameter, but when actually using that:

  1. the recursive listing performance in general is not great on azdls
  2. start_after continuation token causes 500 when recursive = true
    so I am hesitant to enable it

That seems to be a problem. Would you like to raise a separate issue for this? We can track it and switch to azdls's own recursive when available.

@alexwilcoxson-rel
Author

alexwilcoxson-rel commented Oct 30, 2024

This feature is undocumented, so it's best not to enable it by default. How about adding a new flag called enable_list_with_start_after, where we set the capability list_with_start_after to true?

  • gate behind feature

@alexwilcoxson-rel
Author

Hey @Xuanwo just letting you know I probably won't get back to this until later in the week. I will share though that I am using this in our fork of version ~0.48, I think. We're using it alongside the base object_store azure store (since those have all the put support for delta), and it has gone well!

Looking forward to getting this in along with the put if-not-match changes that are in progress as well!

@Xuanwo
Member

Xuanwo commented Apr 7, 2025

Hi, @alexwilcoxson-rel, do we have a status change here? I'm open to getting this feature merged under a new config called enable_list_with_start_after.

@alexwilcoxson-rel
Author

Hey @Xuanwo sorry for the delay on this, I have some cycles to try to get it across the line

@alexwilcoxson-rel
Author

This feature is undocumented, so it's best not to enable it by default. How about adding a new flag called enable_list_with_start_after, where we set the capability list_with_start_after to true?

done

@alexwilcoxson-rel alexwilcoxson-rel marked this pull request as ready for review May 5, 2025 18:14
@dosubot dosubot bot added the size:L label May 5, 2025
@alexwilcoxson-rel
Author

In testing this with deltalake table loads, i see the object_store integration is doing a stat per list entry

is there a way around this if the list response contains the needed metadata? i notice the metadata key stuff was removed (for the better 😄 )

@Xuanwo
Member

Xuanwo commented May 21, 2025

sas_token support has been added in #6205

@alexwilcoxson-rel
Author

In testing this with deltalake table loads, i see the object_store integration is doing a stat per list entry

is there a way around this if the list response contains the needed metadata? i notice the metadata key stuff was removed (for the better 😄 )

@Xuanwo can you provide input here please?

@Xuanwo
Member

Xuanwo commented May 28, 2025

In testing this with deltalake table loads, i see the object_store integration is doing a stat per list entry

is there a way around this if the list response contains the needed metadata? i notice the metadata key stuff was removed (for the better 😄 )

I think it's fine for us to simply remove the stat from the list, since object_store doesn't do that anyway. We always generate the same metadata as object_store does.
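
To make the idea concrete, here is a minimal sketch of building per-object metadata from the list response and only falling back to a stat call when fields are missing. `ListEntry`, `ObjectMeta`, and `meta_from_entry` are hypothetical shapes written for this example, not the actual OpenDAL or object_store types, and the fallback closure stands in for whatever stat request the integration issues today.

```rust
/// Hypothetical list entry; metadata fields may or may not be present.
#[derive(Debug, Clone, PartialEq)]
struct ListEntry {
    path: String,
    size: Option<u64>,
    last_modified_unix: Option<i64>,
}

/// Hypothetical metadata shape expected by the caller.
#[derive(Debug, Clone, PartialEq)]
struct ObjectMeta {
    location: String,
    size: u64,
    last_modified_unix: i64,
}

/// Use the metadata carried by the list entry when it is complete;
/// call `stat` (one extra round trip) only when something is missing.
fn meta_from_entry(
    entry: &ListEntry,
    stat: &mut dyn FnMut(&str) -> (u64, i64),
) -> ObjectMeta {
    let (size, last_modified_unix) = match (entry.size, entry.last_modified_unix) {
        (Some(size), Some(ts)) => (size, ts),
        _ => stat(&entry.path),
    };
    ObjectMeta {
        location: entry.path.clone(),
        size,
        last_modified_unix,
    }
}
```

If the list response always carries the needed fields, the fallback is never hit and could be dropped entirely, which matches the suggestion to remove the stat from the list path.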

Member

@Xuanwo Xuanwo left a comment

Thank you for working on this!

@@ -162,11 +162,11 @@ impl<A: Access> CompleteAccessor<A> {
             (true, false) => {
                 // Forward path that ends with /
                 if path.ends_with('/') {
-                    let p = FlatLister::new(self.inner.clone(), path);
+                    let p = FlatLister::new(self.inner.clone(), path, args);
Member

We don't need to touch here since start_after is handled by azdls directly.

Author

I thought this was related to this comment: #5242 (comment)

Member

@Xuanwo Xuanwo May 28, 2025

I thought this was related to this comment: #5242 (comment)

Oh, I see. Sorry for overlooking that.

How about we split them into two PRs to handle them separately? I can also find some time to benchmark azdls’s recursive list.

We can add start_after support in this PR and consider removing recursive support for azdls if we confirm that its performance is really poor.

I hope we can merge this PR quickly so you can use the mainline opendal ASAP 💌

Author

so should i remove all changes in flat_list and complete and then there will be a test with recursive and azdls?

Member

so should i remove all changes in flat_list and complete and

Yes, let's remove all changes in flat_list and complete.

then there will be a test with recursive and azdls?

This has been covered in our behavior tests, so as long as our CI passes, everything should be fine.

@alexwilcoxson-rel alexwilcoxson-rel changed the title from "feat(services/azdls): list start from and sas token support" to "feat(services/azdls): list start from" May 28, 2025
@alexwilcoxson-rel
Author

@Xuanwo if you're okay with the latest changes, I can create follow-up issues for

  • recursive investigation on azdls
  • object_store stat removal during listing

@Xuanwo
Member

Xuanwo commented May 29, 2025

if you're okay with the latest changes, I can create follow-up issues for

Perfect! Most changes look good to me. We only need to remove the FlatLister changes.

Labels
releases-note/feat, size:L

2 participants