Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pyTigerGraphLoading.py to add support on direct data loading #260

Merged
merged 12 commits into from
Oct 30, 2024

Conversation

chengbiao-jin
Copy link
Collaborator

Add function to support data loading from a string directly instead of a file.

Add function to support data loading from a string directly instead of a file.
Copy link
Contributor

@parkererickson-tg parkererickson-tg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we auto generate the docs from docstrings so we need to have full docstring on for runLoadingJobWithData. Is the data parameter a true string or a bytestring (which is what we read the filepath as with runLoadingJobWithData)

FILENAME definition will be updated to point to the data received.

NOTE: The argument `USING HEADER="true"` in the GSQL loading job may not be enough to
load the file correctly. Remove the header from the data file before using this function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is confusing, same for runLoadingJobWithFile

So header should be removed before calling these two functions. If loading job still has Using header=true, will the first line be ignored?

Copy link
Collaborator Author

@chengbiao-jin chengbiao-jin Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original function does not support it hence I did not make any change on it yet.

Actually I'd prefer to support HEADER=true in these 2 functions hence user can provide the parameters according to the loading job. @parkererickson-tg do you have any background information on why it might not loading correctly with HEADER specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a long-time bug in the ddl system

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

header=False will be required in df.to_csv() in this case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, make it explicit that USING HEADER=false in loading job definition.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HEADER=false is actually the default behavior

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I mean user should not set USING HEADER=true in loading job in this case? otherwise they will lose 1 row?

Copy link
Contributor

@parkererickson-tg parkererickson-tg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some unit tests on this too?

Copy link
Collaborator

@qe-tigergraph qe-tigergraph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit Test: FAILURE, Jenkins_job:http://192.168.99.101:30080/job/mlwb_build/1232/

Copy link
Collaborator

@qe-tigergraph qe-tigergraph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit Test: SUCCESS, e2e Test: SKIPPED, Jenkins_job:http://192.168.99.101:30080/job/mlwb_build/1234/

Copy link
Collaborator

@qe-tigergraph qe-tigergraph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QE Approved

@chengbiao-jin
Copy link
Collaborator Author

Can we get some unit tests on this too?

@parkererickson-tg where do we put the unit test? Is there an example?

@parkererickson-tg
Copy link
Contributor

Can we get some unit tests on this too?

@parkererickson-tg where do we put the unit test? Is there an example?

Looks like we actually are missing tests on our entire loading job execution functionality... here is a test file for our vertex functions: https://github.com/tigergraph/pyTigerGraph/blob/master/tests/test_pyTigerGraphVertex.py. We put test fixtures like GSQL files here: https://github.com/tigergraph/pyTigerGraph/blob/master/tests/fixtures/create_query_simple.gsql.

@parkererickson-tg
Copy link
Contributor

@chengbiao-jin It would be nice if you could add the support for async functionality as well, as I just merged that PR today, which was a pretty large refactor. If you don't have bandwidth, I can probably pick it up this week.

@chengbiao-jin
Copy link
Collaborator Author

chengbiao-jin commented Oct 28, 2024 via email

@parkererickson-tg parkererickson-tg merged commit dfc312d into master Oct 30, 2024
@parkererickson-tg parkererickson-tg deleted the cjin_add_data_loading branch October 30, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants