Strongly typed datasets in a weakly typed world

Strongly typed Parquet Datasets / Hive Tables are often used to exchange and preserve data in a Pandas-driven environment, where types are rather unstable. This mismatch causes several issues; the talk presents these issues and potential solutions, together with an RFC directed at the community.

Tags: Algorithms, Big Data, Data Science, Parallel Programming

Scheduled on Friday 11:20 in room lecture


Marco Neumann (@crepererum)

Studied computer science at KIT (Karlsruhe, Germany), worked as a Tech Student at CERN, now a Data Scientist at Blue Yonder (Hamburg, Germany). Loves to travel and to exchange all kinds of ideas.


We at Blue Yonder use Pandas quite a lot during our daily data science and engineering work. This choice, together with Python as the underlying programming language, gives us flexibility, a feature-rich interface, and access to a large community and ecosystem. When it comes to preserving data and exchanging it with different software stacks, we rely on Parquet Datasets / Hive Tables. During the write process, there is a shift from a rather weakly typed world to a strongly typed one. For example, Pandas may convert integers to floats for many operations without asking, but Parquet files and the schema information stored alongside them dictate very precise types. The type situation may get even more "colorful" when datasets are written by multiple code versions or different software solutions over time. This then raises important questions about type compatibility.
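As a minimal illustration of this silent widening (a sketch assuming a standard Pandas install, not tied to any Blue Yonder code), an integer column turns into a float column as soon as a missing value appears:

```python
import pandas as pd

# Start with a plain 64-bit integer column.
df = pd.DataFrame({"x": [1, 2, 3]})
assert str(df["x"].dtype) == "int64"

# Reindexing introduces a missing row; Pandas has no NaN
# representation for integers, so the whole column is
# silently widened to float64.
widened = df.reindex([0, 1, 2, 3])
assert str(widened["x"].dtype) == "float64"
```

A Parquet schema fixed at `int64` would now no longer match the widened column without an explicit cast.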

This talk will first present an overview of types at the different layers (like NumPy, Pandas, Arrow, and Parquet) and the transitions between these layers. The second part of the talk will present examples of type-compatibility issues we have seen and explain why and how we think they should be handled. At the end there will be a Q+A, which can be seen as the start of a potentially longer RFC process to align different software stacks (like Hive and Dask) so that they handle types in a similar way.
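To give a flavour of these layer transitions, note that Pandas already blurs distinctions that Arrow and Parquet keep separate: string columns, for example, surface only as a generic `object` dtype (a small sketch assuming a standard Pandas install):

```python
import pandas as pd

# At the Pandas layer, strings share the generic "object" dtype.
s = pd.Series(["a", "b", "c"])
assert s.dtype == object

# Mixed content looks identical at this layer...
mixed = pd.Series(["a", b"raw", 3])
assert mixed.dtype == object

# ...so a writer targeting a strongly typed format such as
# Arrow/Parquet must inspect the actual values to pick a
# concrete schema type (string, binary, int, ...).
```

This is one reason the transition from the Pandas layer down to Parquet needs explicit rules rather than a one-to-one type mapping.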