Unicode/Bytes Handling #4

@bwhite

I'm the author of Hadoopy and I have a question about your Unicode/Bytes mappings. I've tried to keep my typedbytes implementation byte-compatible with yours and to produce the same Python-side semantics. However, I recently noticed that the current mapping of Python strings (1) can cause problems with binary data and (2) behaves counter-intuitively with unicode strings. I am tempted to change this, and wanted to see whether I am missing something or whether there is a clean solution that maintains compatibility.

The main issue is that unicode is mapped to a string (type code 7), but when it is parsed it comes back as a string rather than a unicode object. It would make more sense to UTF-8 decode it and return a unicode object, because code 7 is defined to be UTF-8 bytes (http://hadoop.apache.org/mapreduce/docs/r0.22.0/api/index.html?org/apache/hadoop/typedbytes/package-summary.html). However, since strings are also mapped to type code 7, some code 7 values may contain arbitrary (non-UTF-8) bytes, which is presumably why you don't do the decoding (https://github.com/klbostee/typedbytes/blob/master/typedbytes.py#L145 isn't used). You have a Bytes class to differentiate, but I don't think it is necessary.
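
To make the round trip concrete, here is a minimal sketch of the behavior described above. It is written for Python 3, so `str` stands in for Python 2 `unicode` and `bytes` for Python 2 `str`. The helpers `write_current` and `read_current` are hypothetical names that only model the mapping; they are not the actual typedbytes code. The wire layout assumed here (one type-code byte, a 4-byte big-endian length, then the payload) follows the typed bytes package summary linked above.

```python
import struct

TYPE_STRING = 7  # typed bytes code 7: 4-byte length followed by UTF-8 bytes


def write_current(value):
    """Model of the current mapping: text and byte strings both become code 7."""
    payload = value.encode("utf-8") if isinstance(value, str) else value
    return struct.pack(">Bi", TYPE_STRING, len(payload)) + payload


def read_current(record):
    """Model of the current read path: the code 7 payload is returned undecoded."""
    code, length = struct.unpack(">Bi", record[:5])
    if code != TYPE_STRING:
        raise ValueError("this sketch only handles type code 7")
    return record[5:5 + length]


text = "caf\u00e9"           # plays the role of a Python 2 unicode object
raw = b"\xff\xfe\x00binary"  # plays the role of a Python 2 str with non-UTF-8 data

# Both values come back as byte strings, so the reader cannot tell
# which one started out as text.
text_back = read_current(write_current(text))
raw_back = read_current(write_current(raw))
print(type(text_back), type(raw_back))
```

The text value goes in as one type and comes out as another, while the binary value survives only because nothing tries to decode it.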

Current problems

  1. Unicode and strings cannot be distinguished after a round trip; input/output shouldn't change what the user sees.
  2. The Bytes class will be unnecessary in Python 3 and will cause more confusion there, as the string/bytes distinction will be obviously wrong, whereas now unicode is just silently converted to strings.

My proposed solution is

  1. Make Python strings map to type code 0, as they are not necessarily UTF-8 (which is the source of the problem).
  2. Make unicode map to type code 7, which means that it can be decoded properly.
  3. Make a conversion utility to convert old data from type code 7 to 0. If there is a UTF-8 decoding error, the message could say that it may be due to this change and provide steps for fixing it. After this conversion, some values that were originally unicode would come back as strings; however, in the worst case this simply preserves the current semantics.
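
The three steps above could look roughly like the following sketch. As before, it is Python 3, with `bytes` standing in for Python 2 `str` and `str` for Python 2 `unicode`. The names `write_proposed`, `read_proposed`, and `convert_old_record` are hypothetical, and the wording of the error hint is only an illustration of the idea in step 3.

```python
import struct

TYPE_BYTES = 0   # 4-byte length followed by raw bytes
TYPE_STRING = 7  # 4-byte length followed by UTF-8 bytes


def write_proposed(value):
    """Steps 1 and 2: byte strings use code 0, text uses code 7."""
    if isinstance(value, str):
        code, payload = TYPE_STRING, value.encode("utf-8")
    elif isinstance(value, bytes):
        code, payload = TYPE_BYTES, value
    else:
        raise TypeError("this sketch only handles str and bytes")
    return struct.pack(">Bi", code, len(payload)) + payload


def read_proposed(record):
    """Code 0 comes back as bytes; code 7 is UTF-8 decoded to text."""
    code, length = struct.unpack(">Bi", record[:5])
    payload = record[5:5 + length]
    if code == TYPE_BYTES:
        return payload
    if code == TYPE_STRING:
        try:
            return payload.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Hint for data written under the old mapping (step 3).
            raise ValueError(
                "type code 7 payload is not valid UTF-8; it may have been "
                "written under the old mapping and need converting to code 0"
            ) from exc
    raise ValueError("unsupported type code: %d" % code)


def convert_old_record(record):
    """Step 3: rewrite a code 7 record as code 0, leaving the payload untouched."""
    code, length = struct.unpack(">Bi", record[:5])
    if code != TYPE_STRING:
        return record
    return struct.pack(">Bi", TYPE_BYTES, length) + record[5:5 + length]


# An old-style record: binary data stored under code 7.
old = struct.pack(">Bi", TYPE_STRING, 3) + b"\xff\xfe\x00"
print(read_proposed(convert_old_record(old)))  # raw bytes, as under the current semantics
```

Converting every old code 7 record to code 0 means text written before the change is read back as bytes, which matches the worst case described in step 3.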

Questions

  1. Will this proposed solution work on the Java side? Does Java perform UTF-8 decoding of type 7 values (I haven't had a chance to look)?

I have been meaning to fix this but have been hesitant to do it on our side and break compatibility. Also, since you've surely run into this, you probably have an opinion on it.
