Passing this on to you @roll as I think it will be a pretty big change across multiple repos that will require your codebase knowledge. See previous discussion here: BCODMO/frictionless-usecases#32 as well as previous PRs here: frictionlessdata/tableschema-py#260, #175, datahq/dataflows#119.
Sometimes missingValues carry meaning beyond "no measurement taken", and a single field can have several different kinds of missingValues (no measurement taken, measurement below threshold, etc.). It's very important that we be able to preserve that information. Up until now we have usually just kept the field as a string type even though it is clearly a number. It would be great if there were an option on dump_to_path, or a different kind of missingValue parameter, so that the original missingValues appear correctly in the output after the dump.
Solution thoughts:
- Update all processors to check whether a value is in the missingValues parameter before doing any kind of processing. If it is in the missingValues list, skip it and continue. [probably way too much work]
- Create some kind of data type that all of the processors interpret as None, but that dump_to_path can recognize and extract the original value from. During the load step, instead of setting fields that contain missingValues to None, set them to this object (and store the original data in it).
(this is highest priority for us right now I believe)
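To make the second idea a bit more concrete, here is a minimal sketch in plain Python. It does not use any dataflows or tableschema APIs. `MissingValue`, `load_number` and `dump_value` are hypothetical names invented for illustration, and the "nd" / "<0.1" markers are just example missingValues.

```python
from decimal import Decimal


class MissingValue:
    """Hypothetical sentinel: compares equal to None and is falsy,
    but remembers the raw text that was found in the source."""

    __slots__ = ("original",)

    def __init__(self, original):
        self.original = original

    def __bool__(self):
        # Falsy, so `if not value:` checks treat it like an empty value.
        return False

    def __eq__(self, other):
        if other is None:
            return True
        if isinstance(other, MissingValue):
            return self.original == other.original
        return NotImplemented

    def __hash__(self):
        # Objects that compare equal to None must hash like None.
        return hash(None)

    def __repr__(self):
        return f"MissingValue({self.original!r})"


def load_number(raw, missing_values):
    """Load step (assumed): wrap missing markers instead of returning None."""
    if raw in missing_values:
        return MissingValue(raw)
    return Decimal(raw)


def dump_value(value):
    """Dump step (assumed): write the original marker back out."""
    if isinstance(value, MissingValue):
        return value.original
    if value is None:
        return ""
    return str(value)


missing_values = ["nd", "<0.1"]
row = [load_number(v, missing_values) for v in ["1.5", "nd", "<0.1"]]
dumped = [dump_value(v) for v in row]
```

One caveat with this approach: a custom object can compare equal to None (`value == None`) but can never satisfy `value is None`, so any processor that uses an identity check would still need to be touched.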