Ability to preserve missingValues in dump #177

@cschloer

Passing this on to you @roll as I think it will be a pretty big change across multiple repos that will require your codebase knowledge. See previous discussion here: BCODMO/frictionless-usecases#32 as well as previous PRs here: frictionlessdata/tableschema-py#260, #175, datahq/dataflows#119.

Sometimes missingValues carry meaning beyond "no measurement taken", and a single field can have several different kinds of missingValues (no measurement taken, measurement below threshold, etc.). It's very important that we be able to preserve that information. Up until now we have usually kept such a field as a string type even though it is clearly a number. It would be great if there were an option on dump_to_path, or a different kind of missingValues parameter, that would write the original missingValues back out after the dump.

Solution thoughts:

  • Update all processors to check whether a value is in the missingValues list before doing any kind of processing. If it is in the list, skip it. [probably way too much work]
  • Create some kind of data type that all of the processors interpret as None, but that dump_to_path can inspect to extract the original value. During the load step, instead of setting fields that contain missingValues to None, set them to this object (which stores the original data).
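
To make the second idea concrete, here is a rough sketch in plain Python. Nothing in it is existing dataflows or tableschema API: `MissingValue`, `is_missing`, `load_number`, `double`, and `dump_cell` are all hypothetical names, and the missing-value tokens `"nd"` and `"bdl"` are made up for illustration.

```python
class MissingValue:
    """Hypothetical placeholder that remembers the original missing-value token."""

    def __init__(self, original):
        self.original = original

    def __bool__(self):
        # Falsy, like None, so `if value:` checks skip it.
        return False

    def __repr__(self):
        return f"MissingValue({self.original!r})"


def is_missing(value):
    """Hypothetical helper processors would call instead of `value is None`."""
    return value is None or isinstance(value, MissingValue)


def load_number(raw, missing_values):
    """Load step: wrap missing-value tokens instead of replacing them with None."""
    if raw in missing_values:
        return MissingValue(raw)
    return float(raw)


def double(value):
    """Stand-in for a processor: leaves missing cells untouched."""
    return value if is_missing(value) else value * 2


def dump_cell(value):
    """Dump step: write the original token back out."""
    if isinstance(value, MissingValue):
        return value.original
    return "" if value is None else str(value)


missing_values = ["", "nd", "bdl"]  # made-up tokens
raw_column = ["1.5", "nd", "bdl", "4.0"]

loaded = [load_number(raw, missing_values) for raw in raw_column]
processed = [double(value) for value in loaded]
dumped = [dump_cell(value) for value in processed]
print(dumped)  # ['3.0', 'nd', 'bdl', '8.0']
```

One caveat with this approach: no Python object can make `value is None` evaluate to true, so any processor that tests for None with `is` would still need to switch to something like `is_missing`. That may bring back part of the "update all processors" cost from the first idea.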

(this is highest priority for us right now I believe)
