cannot read avro files without data #27
If helpful, here are the bytes of a problematic avro file (I believe it is compressed with the snappy codec):
... and base64 encoded:
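To rebuild the file locally for testing, something like this should work (the placeholder string stands in for the base64 dump above):

```python
import base64

# Placeholder; paste the actual base64 string from above here.
payload = "..."

# Decode and write the bytes back out as an avro file.
with open("empty-part.avro", "wb") as f:
    f.write(base64.b64decode(payload))
```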
I just started using cyavro today, and it's wonderful so far. It precisely fills a need to parse a directory of avro files -- quickly -- into a pandas DataFrame.
However, I'm running into a problem with directories that contain avro files without any rows.
The avro files I'm attempting to read by path are generated by Spark. Whether the total number of rows written to avro is 100, 1k, or 100k, Spark splits them across a handful of files. I won't pretend to know why or how exactly, but I fairly commonly see 4 avro files in a given directory.
The PySpark code that writes these avro files looks somewhat like this:
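(A simplified sketch -- the source, the paths, and the exact avro output format option are placeholders rather than the real job.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-avro").getOrCreate()

# Placeholder source; the real job builds the DataFrame elsewhere.
df = spark.read.json("/input/events")

# Spark writes one part-r-NNNNN file per partition, so a handful of
# partitions yields a handful of avro files -- some of which can end
# up containing a schema/header but no rows at all.
(df.repartition(4)
   .write
   .format("com.databricks.spark.avro")
   .save("/output/events_avro"))
```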
The result is a structure like this:
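(Illustrative listing -- the directory name and the small-file details are assumptions; what I can vouch for is the ~164K part-r-00000 and the roughly four part files per directory.)

```
events_avro/
    _SUCCESS
    part-r-00000    # ~164K, holds most (if not all) of the rows
    part-r-00001    # tiny, schema/header only, no rows
    part-r-00002    # tiny, schema/header only, no rows
    part-r-00003    # tiny, schema/header only, no rows
```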
As you can see, one of the avro files, part-r-00000, is 164K and contains the majority (if not all) of the rows. It is loaded quickly and without issue. But attempting to parse the entire directory with .read_avro_path_as_dataframe fails with an error.
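Roughly what I'm running (the paths are made up, and I'm assuming read_avro_file_as_dataframe is the right call for a single file):

```python
import cyavro

# Reading the whole Spark output directory fails because of the empty part files.
df_all = cyavro.read_avro_path_as_dataframe("/output/events_avro")

# Reading the large part file on its own works fine; reading one of the
# empty part files raises the same error as the directory read.
df_big = cyavro.read_avro_file_as_dataframe("/output/events_avro/part-r-00000")
df_empty = cyavro.read_avro_file_as_dataframe("/output/events_avro/part-r-00001")
```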
Suspecting it was these "empty" avro files, I confirmed that attempting to read part-r-00001 or part-r-00002 individually results in the same error. This makes sense if .read_avro_path_as_dataframe is really just opening each file individually and then concatenating the results.

For what it's worth, I have parsed these "empty" avro files successfully with the python avro library, where iterating over the reader just results in nothing.
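For example, something along these lines (the file name is illustrative) opens the empty file cleanly and simply yields zero records:

```python
from avro.datafile import DataFileReader
from avro.io import DatumReader

# The reference python avro library opens the "empty" part file without
# complaint; iterating over the reader just produces no records.
with open("part-r-00001", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    records = list(reader)   # -> []
    reader.close()

print(len(records))          # 0
```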
As mentioned, cyavro looks like a really great solution to our need to quickly parse a path of avro files into a DataFrame, but I'm afraid we can't avoid having these "empty" avro files present as well. Any thoughts would be much appreciated.
OS: Mac OS (will eventually build in Ubuntu 16.04)
Build: conda build, then local install to conda environment