Pandas IO Tools: Reading and Writing DataFrames as Files and Databases

Learn to use the right tools to read and write your DataFrames as pickle, CSV, JSON, Msgpack, HTML, Excel, HDF5, Parquet, and a PostgreSQL database.

Tags: Data Science, Jupyter, Science

Scheduled on Wednesday 11:30 in room openhub


Miroslav Šedivý (@eumiro)

Using Python to make the sun shine and the wind blow. hjkl juggler and languages enthusiast. Living in the Europe/Berlin timezone, saving text files as UTF-8, typing on a standard US keyboard layout with a Compose key.


Reading a CSV file in Pandas can be as easy as dfr = pd.read_csv('filename.csv'). That call may run without errors every time and still return unexpected results. You may have to deal with ambiguous timestamps, broken timezone handling, obscure NaN notations, non-standard number representations, language-specific formats, or heterogeneous value types. The worst case is when Pandas resolves these problems automatically in some way and returns wrong data without warning.
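As a rough illustration of these pitfalls, the sketch below reads a small, made-up CSV sample with every assumption stated explicitly instead of left to the defaults. The sample data, the column names (date, station, value) and the missing-value markers are hypothetical and chosen only for this example; the parameters passed to pd.read_csv and pd.to_datetime are regular Pandas options.

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-separated, decimal comma, dot as thousands
# separator, day-first dates, and "n/a" / "-" used as missing-value markers.
raw = io.StringIO(
    "date;station;value\n"
    "01.02.2019;A;1.234,5\n"
    "02.02.2019;B;n/a\n"
    "03.02.2019;C;-\n"
)

dfr = pd.read_csv(
    raw,
    sep=";",
    decimal=",",
    thousands=".",
    na_values=["n/a", "-"],
    # Keep these columns as text so the parser does not guess a numeric type.
    dtype={"date": str, "station": str},
)

# Parse the dates with an explicit format: 01.02.2019 is 1 February here,
# which a month-first guess would silently read as 2 January.
dfr["date"] = pd.to_datetime(dfr["date"], format="%d.%m.%Y")
```

Reading the date column as text first and converting it with an explicit format is one possible way to keep the parser from guessing; it is a design choice for this sketch, not the only option.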

Working through a series of examples, we'll import real-world cases, discuss the problems, and find a stable way to handle them. After CSV we'll look at most of the other formats supported by Pandas IO tools, such as pickle, JSON, Msgpack, HTML, Excel, HDF5, Parquet, and a PostgreSQL database.
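One way to see why the choice of format matters is to write the same DataFrame out and read it back. The sketch below is only an illustration under assumed data: the column names (when, value) and file names are hypothetical. It compares a CSV round trip, which by default returns the timestamps as plain text, with a pickle round trip, which keeps the column types.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with a datetime column and a missing value.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2019-02-01 12:00", "2019-02-02 13:30"]),
        "value": [1.5, None],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # CSV stores text only; without parse_dates the timestamps stay strings.
    df.to_csv(tmp / "data.csv", index=False)
    from_csv = pd.read_csv(tmp / "data.csv")

    # Pickle stores the Python objects, so the dtypes survive the round trip.
    df.to_pickle(tmp / "data.pkl")
    from_pickle = pd.read_pickle(tmp / "data.pkl")

csv_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["when"])
pickle_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_pickle["when"])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.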

Participants will need Python 3.x and a recent Pandas installation. Jupyter Notebook may be useful but is not required. A repository with all code examples and test data will be published before the conference.