Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: ABFS implementation #11419

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

majetideepak
Copy link
Collaborator

@majetideepak majetideepak commented Nov 4, 2024

Combine AbfsAccount and AbfsConfig in a separate file.
Clean up API naming and clarify semantics.
Add a new constructor for AbfsWriteFile to specify a client. This is used for testing.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 4, 2024
Copy link

netlify bot commented Nov 4, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5475558
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673625f6696151000840c72a

@majetideepak majetideepak force-pushed the abfs-gcs-multifs branch 3 times, most recently from 58c4444 to be9b010 Compare November 4, 2024 16:12
@FelixYBW
Copy link
Contributor

FelixYBW commented Nov 4, 2024

@zhli1142015

* To facilitate unit testing of file write scenarios, we define the
* AzureDatalakeFileClient here, which can be mocked during testing.
*/
class AdlsFileClient {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use AzureDatalakeFileClient, AdlsFileClient indicates a different thing in our context.
Thanks.

Copy link
Collaborator Author

@majetideepak majetideepak Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Azure client name for writing is DataLakeFileClient. I renamed the implementations accordingly and tried to keep the name short.

* https://github.com/Azure/Azurite/wiki/ADLS-Gen2-Implementation-Guidance
*
* To facilitate unit testing of file write scenarios, we define the
* IBlobStorageFileClient here, which can be mocked during testing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the name here also.


namespace facebook::velox::filesystems {

static std::string kAzureBlobEndpoint{"fs.azure.blob-endpoint"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this property do? Is this for testing only, if yes please add comments.
Thanks.

@majetideepak majetideepak marked this pull request as ready for review November 5, 2024 20:59
@majetideepak
Copy link
Collaborator Author

@zhli1142015 thanks for your review! I addressed your comments. Can you take another look?

@majetideepak
Copy link
Collaborator Author

@zhli1142015 I noticed that the usage of DataLakeFileClient::Flush is not optimal. We flush when the file is closed
and this might consume a lot of memory if the file written is big. We need to flush similarly to S3 periodically (10Mib chunks). What do you think?

Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice refactor. Looks much more clean now.

std::string_view file;
bool isHttps = true;
if (path.find(kAbfssScheme) == 0) {
file = path.substr(8);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use

file = path.substr(kAbfssScheme.length());

if (path.find(kAbfssScheme) == 0) {
file = path.substr(8);
} else if (path.find(kAbfsScheme) == 0) {
file = path.substr(7);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use

file = path.substr(kAbfsScheme.length());


namespace facebook::velox::filesystems::abfs {
namespace facebook::velox::config {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of your changes but this header doesn't have #pragma once. I think we always should have that?

const config::ConfigBase& config) {
auto abfsAccount = AbfsConfig(path, config);
std::shared_ptr<AzureDataLakeFileClient> client =
std::make_shared<DataLakeFileClientWrapper>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why the DataLakeFileClientWrapper is shared_ptr and not unique_ptr? Would something else access this if it is part of the client?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared_ptr made it easier to write tests. But I changed this to a unique_ptr as I think it should be as well.

abfssConfig.connectionString(),
"DefaultEndpointsProtocol=https;AccountName=foobar;AccountKey=456;EndpointSuffix=core.windows.net;");

// test with special characters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Commit not complete sentence.

{{"fs.azure.account.key.test.dfs.core.windows.net", key_},
{kAzureBlobEndpoint, endpoint}});

// Update the default config map with the supplied configOverride map
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Missing .. Or do we even need this comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the function documentation instead.


virtual ~AzuriteServer();

private:
int64_t port_;
std::string account_{"test"};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe all of these 4 new members are const?

@zhli1142015
Copy link
Contributor

ient::Flush is not optimal. We flush when the file is closed
and this might consume a lot of memory if the file written is big. We need to flush similarly to S3 periodically (10Mib chunks). What do you think?

I thought we don't cache data and send by chunk. The append API sends data to remote directly.
The behavior you mentioned in S3, may be better, we can take a look later.
Thanks.

@majetideepak
Copy link
Collaborator Author

@czentgr thanks for the review. I addressed your comments.

@majetideepak
Copy link
Collaborator Author

Filed #11456 for the write improvements.

Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one nit.

filePath_ = tempFile->getPath();
}

MockDataLakeFileClient(std::string_view filePath) : filePath_(filePath) {}

std::string_view path() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, this could be std::string_view path() const.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Nov 8, 2024
@kevinwilfong
Copy link
Contributor

This needs a maintainer to approve, cc: @Yuhta @xiaoxmeng

https://velox-lib.io/docs/community/components-and-maintainers/

@majetideepak
Copy link
Collaborator Author

@kevinwilfong I am the maintainer for the storage_adapters and I approve this :).

@kevinwilfong
Copy link
Contributor

I just double checked with the PLC and it sounds like that's not sufficient.

@majetideepak
Copy link
Collaborator Author

@zhli1142015 Can you please take a look and approve this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants