cannot read avro files without data #27

Open
ghukill opened this issue Sep 13, 2017 · 1 comment


ghukill commented Sep 13, 2017

I just started using cyavro today, and it's wonderful so far. It precisely fills a need to parse a directory of avro files, quickly, into a pandas DataFrame.

However, I'm running into a problem with directories that contain avro files without any rows.

The avro files I'm attempting to read by path are generated by Spark. Whether the total number of rows written is 100, 1k, or 100k, Spark splits them across a handful of files. I won't pretend to know exactly why or how, but I fairly commonly see 4 avro files in a given directory.

The python spark code that writes these avro files looks somewhat like this:

.write.format("com.databricks.spark.avro").save('/path/to/avros')

The result is a structure like this:

drwxr-xr-x  12  408B Sep 13 15:22 .
drwxr-xr-x   3  102B Sep 13 15:21 ..
-rw-r--r--   1    8B Sep 13 15:22 ._SUCCESS.crc
-rw-r--r--   1  1.3K Sep 13 15:22 .part-r-00000-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   20B Sep 13 15:22 .part-r-00001-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   20B Sep 13 15:22 .part-r-00002-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1   28B Sep 13 15:22 .part-r-00003-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro.crc
-rw-r--r--   1    0B Sep 13 15:22 _SUCCESS
-rw-r--r--   1  164K Sep 13 15:22 part-r-00000-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  1.3K Sep 13 15:22 part-r-00001-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  1.3K Sep 13 15:22 part-r-00002-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro
-rw-r--r--   1  2.2K Sep 13 15:22 part-r-00003-a705a70a-5107-4508-a5f4-a4dd309c0c03.avro

As you can see, one of the avro files, part-r-00000, is 164K and contains the majority (if not all) of the rows. This file loads quickly and without issue. But attempting to parse the entire directory with .read_avro_path_as_dataframe fails with the error:

Exception: Can't read file : Cannot read 1 bytes from file

Suspecting these "empty" avro files, I confirmed that attempting to read part-r-00001 or part-r-00002 individually results in the same error. This makes sense if .read_avro_path_as_dataframe is really just opening the files one at a time and then concatenating the results.

For what it's worth, I have parsed these "empty" avro files successfully with the python avro library, where iterating over the reader just results in nothing.

As mentioned, cyavro looks like a really great solution to our need to quickly parse a path of avro files into a dataframe, but I'm afraid we can't avoid having these "empty" avro files present as well. Any thoughts would be much appreciated.
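
One possible stopgap, sketched here under assumptions rather than offered as a fix: read the part files one at a time and skip any that the reader rejects, then concatenate whatever loaded. The helper name `read_nonempty_parts` and its `read_file` parameter are hypothetical. The reader is passed in as a callable so the sketch does not depend on a particular cyavro function signature; the intent is that cyavro's single-file reader would be supplied there. Because the error shown above is raised as a plain `Exception`, the sketch catches broadly and records every skipped file with its message so that real failures are not silently hidden.

```python
import os


def read_nonempty_parts(directory, read_file, suffix=".avro"):
    """Call read_file(path) on each part file in a directory.

    Returns (results, skipped): results holds whatever read_file returned
    for files that loaded; skipped holds (path, error message) pairs for
    files the reader rejected.
    """
    results = []
    skipped = []
    for name in sorted(os.listdir(directory)):
        # Ignore Spark bookkeeping files such as _SUCCESS and the .crc files.
        if name.startswith((".", "_")) or not name.endswith(suffix):
            continue
        path = os.path.join(directory, name)
        try:
            results.append(read_file(path))
        except Exception as exc:  # broad on purpose: the reader raises plain Exception
            skipped.append((path, str(exc)))
    return results, skipped
```

With DataFrames as the per-file results, the list could then be combined with `pandas.concat(results, ignore_index=True)`, after checking `skipped` to confirm that only the header-only files were left out.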

OS: Mac OS (will eventually build in Ubuntu 16.04)
Build: conda build, then local install to conda environment


ghukill commented Sep 18, 2017

If helpful, here are the bytes of a problematic avro file (I believe it is compressed with the snappy codec):

Obj\x01\x04\x16avro.schema\xd2\x14{"type":"record","name":"topLevelRecord","fields":[{"name":"set","type":[{"type":"record","name":"set","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setSource","type":[{"type":"record","name":"setSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"record","type":[{"type":"record","name":"record","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setIds","type":[{"type":"array","items":["string","null"]},"null"]},{"name":"recordSource","type":[{"type":"record","name":"recordSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"error","type":[{"type":"record","name":"error","fields":[{"name":"message","type":["string","null"]},{"name":"errorSource","type":[{"type":"record","name":"errorSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]}]}\x14avro.codec\x0csnappy\x00%\xae\xecs\xfb\xbc`\xf4F\xc7\xf5\x9cL\xf5\x92\xb0

... and base64 encoded:

T2JqAQQWYXZyby5zY2hlbWHSFHsidHlwZSI6InJlY29yZCIsIm5hbWUiOiJ0b3BMZXZlbFJlY29yZCIsImZpZWxkcyI6W3sibmFtZSI6InNldCIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoic2V0IiwiZmllbGRzIjpbeyJuYW1lIjoiaWQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZG9jdW1lbnQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoic2V0U291cmNlIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJzZXRTb3VyY2UiLCJmaWVsZHMiOlt7Im5hbWUiOiJxdWVyeVBhcmFtcyIsInR5cGUiOlt7InR5cGUiOiJtYXAiLCJ2YWx1ZXMiOlsic3RyaW5nIiwibnVsbCJdfSwibnVsbCJdfSx7Im5hbWUiOiJ1cmwiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoidGV4dCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfV19LCJudWxsIl19XX0sIm51bGwiXX0seyJuYW1lIjoicmVjb3JkIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJyZWNvcmQiLCJmaWVsZHMiOlt7Im5hbWUiOiJpZCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJkb2N1bWVudCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJzZXRJZHMiLCJ0eXBlIjpbeyJ0eXBlIjoiYXJyYXkiLCJpdGVtcyI6WyJzdHJpbmciLCJudWxsIl19LCJudWxsIl19LHsibmFtZSI6InJlY29yZFNvdXJjZSIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoicmVjb3JkU291cmNlIiwiZmllbGRzIjpbeyJuYW1lIjoicXVlcnlQYXJhbXMiLCJ0eXBlIjpbeyJ0eXBlIjoibWFwIiwidmFsdWVzIjpbInN0cmluZyIsIm51bGwiXX0sIm51bGwiXX0seyJuYW1lIjoidXJsIiwidHlwZSI6WyJzdHJpbmciLCJudWxsIl19LHsibmFtZSI6InRleHQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX1dfSwibnVsbCJdfV19LCJudWxsIl19LHsibmFtZSI6ImVycm9yIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJlcnJvciIsImZpZWxkcyI6W3sibmFtZSI6Im1lc3NhZ2UiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZXJyb3JTb3VyY2UiLCJ0eXBlIjpbeyJ0eXBlIjoicmVjb3JkIiwibmFtZSI6ImVycm9yU291cmNlIiwiZmllbGRzIjpbeyJuYW1lIjoicXVlcnlQYXJhbXMiLCJ0eXBlIjpbeyJ0eXBlIjoibWFwIiwidmFsdWVzIjpbInN0cmluZyIsIm51bGwiXX0sIm51bGwiXX0seyJuYW1lIjoidXJsIiwidHlwZSI6WyJzdHJpbmciLCJudWxsIl19LHsibmFtZSI6InRleHQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX1dfSwibnVsbCJdfV19LCJudWxsIl19XX0UYXZyby5jb2RlYwxzbmFwcHkAJa7sc/u8YPRGx/WcTPWSsA==
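
To check what such a file actually contains, the container header can be decoded with the standard library alone. The following is a minimal sketch based on the Avro object container file layout (4-byte magic, a metadata map, a 16-byte sync marker); the function names are hypothetical and this is not how cyavro itself reads files. It reports how many bytes remain after the header, which is where data blocks would start.

```python
import io

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def read_long(buf):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift = 0
    acc = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading a long")
        acc |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def read_bytes(buf):
    """Decode one length-prefixed Avro bytes/string value."""
    size = read_long(buf)
    data = buf.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data while reading bytes")
    return data


def parse_container_header(data):
    """Return (metadata dict, sync marker, bytes remaining after the header)."""
    buf = io.BytesIO(data)
    if buf.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # a negative count is followed by the block's byte size
            read_long(buf)
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    sync = buf.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise EOFError("file ends before the sync marker")
    return meta, sync, len(data) - buf.tell()
```

Passing `base64.b64decode(...)` of the string above to `parse_container_header` should show whether anything follows the sync marker. By my reading of the raw dump, the file ends right after the 16-byte sync marker, so it holds a schema, the snappy codec entry, and zero data blocks. That would be consistent with a reader failing when it tries to read the first block.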
