Storing and processing data efficiently is an integral part of successful data-driven applications. An efficient and scalable way to store big data is to use the object stores of public cloud providers like ABS, S3 or GCS. However, these stores come with downsides that make managing tabular data distributed over many objects a non-trivial task.
Kartothek is a recently open-sourced Python library we develop and use at Blue Yonder – JDA Software to manage tabular data in cloud object stores. It is built on Apache Arrow and Apache Parquet and is powered by Dask. Its specification is compatible with the de-facto standard storage layouts used by other big data processing tools like Apache Spark and Hive, but it offers a native, seamless integration into the Python ecosystem.
What Kartothek offers includes:
- A consistent dataset state at all times.
- Atomic addition and removal of files.
- Reads without any locking mechanism.
- Strongly typed and enforced table schemas using Apache Arrow.
- O(1) remote storage calls to plan job dispatching.
- Inverted indices for fast and efficient querying.
- Integrated predicate pushdown at row group level using Apache Parquet.
- Seamless integration with pandas, Dask and the Python ecosystem.
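To give a feel for how the first three properties can fit together, here is a minimal, hypothetical sketch (not Kartothek's actual implementation) of the general pattern behind them: data files are immutable, the visible dataset state lives in a single metadata object, and writers swap that object atomically while readers never take a lock. The `TinyDataset` class and its file layout are illustrative assumptions only.

```python
import json
import os
import tempfile


class TinyDataset:
    """Toy illustration of a commit protocol over immutable files.

    Data files are written first and never modified; the dataset state is a
    single metadata file listing the currently visible files. Readers only
    ever load one metadata file, so they observe either the old or the new
    state, never a partially written one -- no locks required.
    """

    def __init__(self, root):
        self.root = root
        self.meta_path = os.path.join(root, "metadata.json")

    def _load(self):
        if not os.path.exists(self.meta_path):
            return {"files": []}
        with open(self.meta_path) as f:
            return json.load(f)

    def list_files(self):
        # Lock-free read: a single call fetches the complete state.
        return self._load()["files"]

    def commit(self, add=(), remove=()):
        # Atomic add/remove: write the new metadata to a temporary file,
        # then atomically replace the old one. os.replace is atomic on
        # POSIX and Windows; object stores offer a comparable atomic
        # single-object PUT.
        meta = self._load()
        files = [f for f in meta["files"] if f not in set(remove)]
        files.extend(add)
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "w") as f:
            json.dump({"files": files}, f)
        os.replace(tmp, self.meta_path)
```

A writer that crashes after uploading data files but before the metadata swap leaves the dataset in its previous, still-consistent state; the orphaned files are simply invisible until committed.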
At the end of this talk, I want you to leave knowing what the challenges of modern big data storage are and how to deal with them.
Affiliation: Blue Yonder - JDA Software
Florian Jetter started his career at Blue Yonder – JDA Software by building and running machine learning models. Eventually, he grew frustrated with slow and inefficient data pipelines and built libraries and tools to support his fellow Data Scientists and Data Engineers. His focus quickly shifted from machine learning to data engineering, and he leverages the knowledge of both worlds to build a reliable data platform.