Pandas serialization/deserialization logic with msgpack #5077

@sydney-runkle

Description

Right now, users can use pandas series/dataframes in state and serialize them with a JsonPlusSerializer that has pickle_fallback enabled.

It'd be great to have these as first-class citizens that can be serialized via msgpack, as numpy arrays already are.

This is a bit of a tricky task, as dataframes have lots of nuanced features like:

  • multiindexes (both row and column)
  • dtypes by column
  • arbitrary Python objects in table cells

We want to preserve df structure during serialization and deserialization.
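
As a starting point for tests, here is a minimal sketch of a fixture that exercises the features listed above, plus a round-trip check. The helper names `make_tricky_frame` and `assert_round_trips` are hypothetical and not part of any existing API. Plain `pickle` is used only as a stand-in for whichever `dumps`/`loads` pair a PR ends up implementing.

```python
import pickle

import pandas as pd


def make_tricky_frame() -> pd.DataFrame:
    """Build a small frame with row/column MultiIndexes, mixed dtypes, and object cells."""
    rows = pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["group", "item"]
    )
    cols = pd.MultiIndex.from_product([["x", "y"], ["lo", "hi"]])
    return pd.DataFrame(
        [
            [1, 2.5, "s", {"k": 1}],
            [3, 4.5, "t", [1, 2]],
            [5, 6.5, "u", None],
        ],
        index=rows,
        columns=cols,
    )


def assert_round_trips(df: pd.DataFrame, dumps, loads) -> None:
    """Serialize and deserialize df, then check index, columns, dtypes, and values."""
    restored = loads(dumps(df))
    pd.testing.assert_frame_equal(restored, df)


# Baseline: pickle is expected to preserve the structure of this frame.
assert_round_trips(make_tricky_frame(), pickle.dumps, pickle.loads)
```

Swapping `pickle.dumps` and `pickle.loads` for a candidate serializer gives a quick check of whether it preserves structure for this particular frame.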

There are a few options here, assuming we continue to use msgpack:

  • Dump pickled content inside msgpack (this is a bit redundant, since both are serialization protocols). One benefit here is that pickle already handles all of the pandas nuances above
  • Dump bytes directly, though custom logic will have to be written for the above pandas features
  • Use Arrow: this is the most storage-efficient option, though there are some type inconsistencies (like with object dtype) that will need to be considered.
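
To make the first option concrete, here is a rough sketch of wrapping pickled pandas objects in a msgpack extension type. It assumes the `msgpack` Python package rather than whatever packer JsonPlusSerializer uses internally. The extension code `42` and the helper names are made up for illustration. Unpickling is only safe for trusted data.

```python
import pickle

import msgpack
import pandas as pd

# Hypothetical extension type code; msgpack allows application codes 0..127.
EXT_PANDAS_PICKLE = 42


def _default(obj):
    """Called by msgpack for objects it cannot pack natively."""
    if isinstance(obj, (pd.DataFrame, pd.Series)):
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        return msgpack.ExtType(EXT_PANDAS_PICKLE, payload)
    raise TypeError(f"Cannot serialize object of type {type(obj)!r}")


def _ext_hook(code, data):
    """Called by msgpack for every extension value it encounters."""
    if code == EXT_PANDAS_PICKLE:
        # Only unpickle data from a trusted source.
        return pickle.loads(data)
    return msgpack.ExtType(code, data)


def dumps(obj) -> bytes:
    return msgpack.packb(obj, default=_default, use_bin_type=True)


def loads(data: bytes):
    return msgpack.unpackb(data, ext_hook=_ext_hook, raw=False)


state = {"step": 3, "table": pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})}
restored = loads(dumps(state))
pd.testing.assert_frame_equal(restored["table"], state["table"])
```

The Arrow option would follow the same extension-type shape with a different payload encoding, but it would need extra handling for the object-dtype cases mentioned above.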

A PR addressing this should have thorough testing, perhaps mimicking many of the conditions tested for in #5057.

See #5035 for an example of how to add logic for new types to JsonPlusSerializer.
