Unicode/Bytes Handling

I'm the author of [Hadoopy](http://hadoopy.co) and I have a question about your Unicode/Bytes mappings. I've tried to keep my typedbytes implementation byte-compatible with yours and to produce the same Python-side semantics. However, I recently noticed that the current mapping of Python strings (1) can cause problems with binary data and (2) behaves counter-intuitively with unicode strings. I am tempted to change this, and wanted to see whether I am missing something or whether there is a clean solution that maintains compatibility.

The main issue is that unicode is mapped to a string (type code 7), but when it is parsed it comes back as a byte string, when it would make sense to UTF-8 decode it and return a unicode object. Decoding would match the spec, since code 7 is defined to be UTF-8 bytes (http://hadoop.apache.org/mapreduce/docs/r0.22.0/api/index.html?org/apache/hadoop/typedbytes/package-summary.html). However, because plain strings are also mapped to type code 7, some code 7 values may contain arbitrary (non-UTF-8) bytes, which is presumably why you don't do the decoding (https://github.com/klbostee/typedbytes/blob/master/typedbytes.py#L145 isn't used). You have a Bytes class to differentiate the two cases, but I don't think it is necessary.
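
To make the round-trip problem concrete, here is a minimal sketch of the behavior described above. It is a model, not the actual typedbytes code: `write_current` and `read_current` are hypothetical names, the only wire format assumed is a one-byte type code followed by a 32-bit big-endian length and the payload, and it is written for Python 3, where `str` stands in for Python 2 `unicode` and `bytes` for Python 2 `str`.

```python
import struct

TYPE_STRING = 7  # typedbytes "string": 32-bit length followed by UTF-8 bytes


def write_current(value):
    """Model of the current mapping: text and byte strings both use code 7."""
    payload = value.encode("utf-8") if isinstance(value, str) else value
    return struct.pack(">Bi", TYPE_STRING, len(payload)) + payload


def read_current(record):
    """Model of the current reader: the code 7 payload is returned undecoded."""
    code, length = struct.unpack(">Bi", record[:5])
    if code != TYPE_STRING:
        raise ValueError("this sketch only handles type code 7")
    return record[5:5 + length]


text = "caf\u00e9"          # text (Python 2 unicode)
binary = b"\xff\xfe\x00"    # arbitrary binary data, not valid UTF-8

text_back = read_current(write_current(text))
binary_back = read_current(write_current(binary))

# The text comes back as bytes, so the caller no longer sees what it wrote.
print(type(text), "->", type(text_back))
# The binary data survives, but only because the reader never decodes.
print(binary_back == binary)
```

Both values end up under the same type code, so the reader has no way to know which of them should be decoded.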

Current problems
1. Unicode and strings cannot be distinguished after a round trip. Input/output shouldn't change what the user sees.
2. The Bytes class will be unnecessary in Python 3 and will cause more confusion there: the string/bytes distinction will be obviously wrong, whereas now unicode is just silently converted to strings.

My proposed solution is
1. Map Python strings to type code 0 (raw bytes), since they are not necessarily UTF-8 (which is the source of the problem).
2. Map unicode to type code 7, so that it can be decoded properly.
3. Provide a conversion utility that converts old data from type code 7 to 0. If a UTF-8 decoding error occurs, the error message could say that it may be due to this change and give steps for fixing it. In this conversion it'd be possible for some unicode to be decoded as strings; however, in the worst case this simply preserves the current semantics.
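
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `write_proposed`, `read_proposed`, and `convert_old_record` are hypothetical names, each function handles a single record (one-byte type code, 32-bit big-endian length, payload), and the code targets Python 3, where `bytes` plays the role of the Python 2 string and `str` the role of `unicode`.

```python
import struct

TYPE_BYTES = 0   # raw bytes: 32-bit length followed by the payload
TYPE_STRING = 7  # string: 32-bit length followed by UTF-8 bytes


def write_proposed(value):
    """Steps 1 and 2: byte strings use code 0, text uses code 7."""
    if isinstance(value, str):
        code, payload = TYPE_STRING, value.encode("utf-8")
    elif isinstance(value, bytes):
        code, payload = TYPE_BYTES, value
    else:
        raise TypeError("this sketch only handles str and bytes")
    return struct.pack(">Bi", code, len(payload)) + payload


def read_proposed(record):
    """Decode code 7 as UTF-8; return code 0 payloads untouched."""
    code, length = struct.unpack(">Bi", record[:5])
    payload = record[5:5 + length]
    if code == TYPE_BYTES:
        return payload
    if code == TYPE_STRING:
        try:
            return payload.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Step 3: point the user at the mapping change.
            raise ValueError(
                "type code 7 payload is not valid UTF-8; the data may "
                "predate the mapping change and need conversion to "
                "type code 0"
            ) from exc
    raise ValueError("this sketch only handles type codes 0 and 7")


def convert_old_record(record):
    """Step 3: rewrite an old code 7 record as code 0, payload unchanged."""
    code, length = struct.unpack(">Bi", record[:5])
    if code != TYPE_STRING:
        return record
    return struct.pack(">Bi", TYPE_BYTES, length) + record[5:5 + length]
```

With this mapping, `read_proposed(write_proposed(x))` returns a value of the same type as `x` for both text and byte strings. A converted old record always reads back as bytes, even if it originally held unicode, which matches what the current reader returns.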

Questions
1. Will this proposed solution work on the Java side? Does Java perform UTF-8 decoding of type code 7 values? (I haven't had a chance to look.)

I have been meaning to fix this, but I am hesitant to do it on our side and break compatibility. Also, since you've surely run into this, you probably have an opinion on it.

