refactor: ABFS implementation #11419

Closed

Conversation

majetideepak
Collaborator

@majetideepak majetideepak commented Nov 4, 2024

Combine AbfsAccount and AbfsConfig in a separate file.
Clean up API naming and clarify semantics.
Add a new constructor for AbfsWriteFile to specify a client. This is used for testing.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 4, 2024

netlify bot commented Nov 4, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5475558
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673625f6696151000840c72a

@majetideepak majetideepak force-pushed the abfs-gcs-multifs branch 3 times, most recently from 58c4444 to be9b010 Compare November 4, 2024 16:12
@FelixYBW

FelixYBW commented Nov 4, 2024

@zhli1142015

* To facilitate unit testing of file write scenarios, we define the
* AzureDatalakeFileClient here, which can be mocked during testing.
*/
class AdlsFileClient {
Collaborator

Please use AzureDatalakeFileClient; AdlsFileClient refers to something different in our context.
Thanks.

Collaborator Author

@majetideepak majetideepak Nov 5, 2024

The Azure client name for writing is DataLakeFileClient. I renamed the implementations accordingly and tried to keep the name short.

* https://github.com/Azure/Azurite/wiki/ADLS-Gen2-Implementation-Guidance
*
* To facilitate unit testing of file write scenarios, we define the
* IBlobStorageFileClient here, which can be mocked during testing.
Collaborator

Please update the name here also.


namespace facebook::velox::filesystems {

static std::string kAzureBlobEndpoint{"fs.azure.blob-endpoint"};
Collaborator

What does this property do? Is it for testing only? If so, please add a comment.
Thanks.

@majetideepak majetideepak marked this pull request as ready for review November 5, 2024 20:59
@majetideepak
Collaborator Author

@zhli1142015 thanks for your review! I addressed your comments. Can you take another look?

@majetideepak
Collaborator Author

@zhli1142015 I noticed that the usage of DataLakeFileClient::Flush is not optimal. We flush only when the file is closed, which might consume a lot of memory if the written file is large. We should flush periodically, similar to S3 (10 MiB chunks). What do you think?

Collaborator

@czentgr czentgr left a comment

Nice refactor. Looks much cleaner now.

std::string_view file;
bool isHttps = true;
if (path.find(kAbfssScheme) == 0) {
file = path.substr(8);
Collaborator

Let's use

file = path.substr(kAbfssScheme.length());

if (path.find(kAbfssScheme) == 0) {
file = path.substr(8);
} else if (path.find(kAbfsScheme) == 0) {
file = path.substr(7);
Collaborator

Let's use

file = path.substr(kAbfsScheme.length());
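
The suggestion above replaces the literals 7 and 8 with the lengths of the scheme constants. A minimal sketch of that idea follows. The constant values, the `ParsedAbfsPath` struct, and the `stripAbfsScheme` helper are assumptions for illustration and are not taken from the PR.

```cpp
#include <string_view>

// Assumed values for the constants named in the snippet above.
constexpr std::string_view kAbfsScheme{"abfs://"};
constexpr std::string_view kAbfssScheme{"abfss://"};

// Hypothetical result type for this sketch.
struct ParsedAbfsPath {
  std::string_view file; // Path with the scheme prefix removed.
  bool isHttps;          // True for abfss://, false otherwise.
  bool valid;            // False when neither scheme matches.
};

// Strips the scheme using the constant's length instead of a magic number.
inline ParsedAbfsPath stripAbfsScheme(std::string_view path) {
  // rfind(prefix, 0) can only match at position 0, so it acts as a
  // starts-with check without scanning the rest of the path.
  if (path.rfind(kAbfssScheme, 0) == 0) {
    return {path.substr(kAbfssScheme.length()), true, true};
  }
  if (path.rfind(kAbfsScheme, 0) == 0) {
    return {path.substr(kAbfsScheme.length()), false, true};
  }
  return {path, false, false};
}
```

Tying the offset to the constant keeps the two in sync if a scheme string ever changes, which is the point of the review comment.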

const config::ConfigBase& config) {
auto abfsAccount = AbfsConfig(path, config);
std::shared_ptr<AzureDataLakeFileClient> client =
std::make_shared<DataLakeFileClientWrapper>(
Collaborator

Just curious: why is the DataLakeFileClientWrapper held in a shared_ptr rather than a unique_ptr? Would anything else access it if it is part of the client?

Collaborator Author

shared_ptr made it easier to write tests, but I changed this to a unique_ptr, since I agree that is what it should be.
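
To illustrate the ownership question discussed here, the following is a minimal sketch of a write file that takes exclusive ownership of an injected client, with a mock standing in for the real one. `FileClient`, `MockFileClient`, and `WriteFile` are hypothetical names for this sketch and do not reproduce the classes in the PR.

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical client interface; the PR's version wraps the Azure SDK client.
class FileClient {
 public:
  virtual ~FileClient() = default;
  virtual void append(std::string_view data) = 0;
  virtual void flush() = 0;
};

// Test double that commits flushed data into storage owned by the test.
class MockFileClient : public FileClient {
 public:
  explicit MockFileClient(std::string& storage) : storage_(storage) {}

  void append(std::string_view data) override {
    pending_.append(data);
  }

  void flush() override {
    storage_ += pending_;
    pending_.clear();
  }

 private:
  std::string& storage_; // Must outlive the mock.
  std::string pending_;
};

// The write file is the sole owner of its client, so unique_ptr suffices.
class WriteFile {
 public:
  explicit WriteFile(std::unique_ptr<FileClient> client)
      : client_(std::move(client)) {}

  void append(std::string_view data) {
    client_->append(data);
  }

  void close() {
    client_->flush();
  }

 private:
  std::unique_ptr<FileClient> client_;
};
```

With a unique_ptr the test can no longer keep its own handle to the mock, so in this sketch the mock writes into storage the test owns. That is one way to keep tests simple without shared ownership.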

abfssConfig.connectionString(),
"DefaultEndpointsProtocol=https;AccountName=foobar;AccountKey=456;EndpointSuffix=core.windows.net;");

// test with special characters
Collaborator

Nit: the comment is not a complete sentence.
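
For readers unfamiliar with the string checked in the test above, here is a rough sketch of how such a connection string could be assembled from an account-key config entry. The parsing rules and the `buildConnectionString` helper are assumptions for illustration. The PR does not show how AbfsConfig derives these fields.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Assumed key layout: fs.azure.account.key.<account>.dfs.<endpoint suffix>
constexpr std::string_view kAccountKeyPrefix{"fs.azure.account.key."};
constexpr std::string_view kDfsInfix{".dfs."};

// Hypothetical helper: returns std::nullopt when the key does not match
// the assumed layout.
inline std::optional<std::string> buildConnectionString(
    std::string_view configKey,
    std::string_view accountKey,
    bool isHttps) {
  if (configKey.rfind(kAccountKeyPrefix, 0) != 0) {
    return std::nullopt;
  }
  const std::string_view rest = configKey.substr(kAccountKeyPrefix.length());
  const auto dfsPos = rest.find(kDfsInfix);
  if (dfsPos == std::string_view::npos || dfsPos == 0) {
    return std::nullopt;
  }
  const std::string_view account = rest.substr(0, dfsPos);
  const std::string_view suffix = rest.substr(dfsPos + kDfsInfix.length());
  if (suffix.empty()) {
    return std::nullopt;
  }

  std::string result;
  result.append("DefaultEndpointsProtocol=").append(isHttps ? "https" : "http");
  result.append(";AccountName=").append(account);
  result.append(";AccountKey=").append(accountKey);
  result.append(";EndpointSuffix=").append(suffix);
  result.append(";");
  return result;
}
```

Called with the key `fs.azure.account.key.foobar.dfs.core.windows.net`, the value `456`, and `isHttps` set to true, this sketch yields the string the test above expects.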

{{"fs.azure.account.key.test.dfs.core.windows.net", key_},
{kAzureBlobEndpoint, endpoint}});

// Update the default config map with the supplied configOverride map
Collaborator

Nit: missing a period. Or do we even need this comment?

Collaborator Author

I updated the function documentation instead.


virtual ~AzuriteServer();

private:
int64_t port_;
std::string account_{"test"};
Collaborator

Maybe all four of these new members could be const?

@zhli1142015
Collaborator

> [...] DataLakeFileClient::Flush is not optimal. We flush when the file is closed and this might consume a lot of memory if the file written is big. We need to flush similarly to S3 periodically (10 MiB chunks). What do you think?

I thought we don't cache data and instead send it chunk by chunk: the append API sends data to the remote directly.
The S3 behavior you mentioned may be better; we can take a look later.
Thanks.
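
The periodic-flush idea raised in this exchange can be sketched as a writer that uploads a chunk whenever its buffer reaches a threshold, so memory use stays bounded for large files. This is an illustration of the S3-style approach only. `RecordingSink` and `ChunkedWriter` are hypothetical and do not use the Azure SDK or reflect the current ABFS write path.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sink standing in for a remote client. It records chunk sizes.
class RecordingSink {
 public:
  void upload(std::string_view chunk) {
    chunkSizes_.push_back(chunk.size());
  }

  const std::vector<std::size_t>& chunkSizes() const {
    return chunkSizes_;
  }

 private:
  std::vector<std::size_t> chunkSizes_;
};

// Buffers appended data and uploads it in fixed-size chunks.
class ChunkedWriter {
 public:
  ChunkedWriter(RecordingSink& sink, std::size_t flushThreshold)
      : sink_(sink), flushThreshold_(flushThreshold) {
    assert(flushThreshold_ > 0);
  }

  void append(std::string_view data) {
    buffer_.append(data);
    // Upload every full chunk now, so at most one partial chunk stays buffered.
    while (buffer_.size() >= flushThreshold_) {
      sink_.upload(std::string_view(buffer_).substr(0, flushThreshold_));
      buffer_.erase(0, flushThreshold_);
    }
  }

  // Uploads whatever is left when the file is closed.
  void close() {
    if (!buffer_.empty()) {
      sink_.upload(buffer_);
      buffer_.clear();
    }
  }

  std::size_t buffered() const {
    return buffer_.size();
  }

 private:
  RecordingSink& sink_;
  const std::size_t flushThreshold_;
  std::string buffer_;
};
```

A threshold of `10 * 1024 * 1024` would correspond to the 10 MiB chunks mentioned above. Whether a flush per chunk is appropriate for the Azure API is a separate question that this sketch does not answer.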

@majetideepak
Collaborator Author

@czentgr thanks for the review. I addressed your comments.

@majetideepak
Collaborator Author

Filed #11456 for the write improvements.

Collaborator

@czentgr czentgr left a comment

Looks good. Just one nit.

filePath_ = tempFile->getPath();
}

MockDataLakeFileClient(std::string_view filePath) : filePath_(filePath) {}

std::string_view path() {
Collaborator

Nit, this could be std::string_view path() const.

Collaborator Author

done!

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Nov 8, 2024
@kevinwilfong
Contributor

This needs a maintainer to approve, cc: @Yuhta @xiaoxmeng

https://velox-lib.io/docs/community/components-and-maintainers/

@majetideepak
Collaborator Author

@kevinwilfong I am the maintainer for the storage_adapters and I approve this :).

@kevinwilfong
Contributor

I just double checked with the PLC and it sounds like that's not sufficient.

@majetideepak
Collaborator Author

@zhli1142015 Can you please take a look and approve this PR?

Contributor

@Yuhta Yuhta left a comment

Stamp since @zhli1142015 has the most knowledge about this and he is good with the change.

@facebook-github-bot
Contributor

@kevinwilfong has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kevinwilfong
Contributor

It looks like merging this will break Presto. Could you make a VELOX_ENABLE_FORWARD_COMPATIBILITY change in Presto to handle the namespace change?

https://github.com/prestodb/presto/blob/ecfda83fe2987a191b1cd8f722e9fe2e9d4c8b0e/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L1304
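
As a rough illustration of the kind of change being requested, downstream code can select a namespace at compile time behind the macro. The namespaces `old_ns` and `new_ns` and the function names below are hypothetical stand-ins, both defined here only so the sketch compiles on its own. The assumption that the macro is defined only by builds that already contain the new API is also mine and is not stated in this thread.

```cpp
#include <string>

// Hypothetical stand-ins: in a real build only one of these would exist,
// depending on which library revision is being compiled against.
namespace old_ns {
inline std::string registerFileSystem() {
  return "old_ns";
}
} // namespace old_ns

namespace new_ns {
inline std::string registerFileSystem() {
  return "new_ns";
}
} // namespace new_ns

// Downstream code refers to a single alias and lets the macro pick the target.
#ifdef VELOX_ENABLE_FORWARD_COMPATIBILITY
namespace fs_compat = new_ns;
#else
namespace fs_compat = old_ns;
#endif

inline std::string registerAll() {
  return fs_compat::registerFileSystem();
}
```

Once every build uses the new revision, the alias and the `#ifdef` can be removed and the call site can name the new namespace directly.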

@majetideepak
Collaborator Author

@kevinwilfong Added Presto change here prestodb/presto#24063

@kevinwilfong
Contributor

Thanks! I'll try merging this again.

@facebook-github-bot
Contributor

@kevinwilfong has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@kevinwilfong merged this pull request in 7d0b84e.

Conbench analyzed the 1 benchmark run on commit 7d0b84e8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

athmaja-n pushed a commit to athmaja-n/velox that referenced this pull request Jan 10, 2025
Summary:
Combine AbfsAccount and AbfsConfig in a separate file.
Clean up API naming and clarify semantics.
Add a new constructor for AbfsWriteFile to specify a client. This is used for testing.

Pull Request resolved: facebookincubator#11419

Reviewed By: Yuhta

Differential Revision: D66015694

Pulled By: kevinwilfong

fbshipit-source-id: 7224aaa1e3cda99c1596546e8050676c635396a5
@majetideepak majetideepak deleted the abfs-gcs-multifs branch June 19, 2025 20:18