Big Data Systems Performance: The Little Shop of Horrors
The confusion around terms such as NoSQL, Big Data, Data Science, SQL, Spark, and Data Lakes often creates more fog than clarity. In my presentation, I will show that at least three design dimensions are often conflated in discussions of data management, and how understanding these dimensions helps you make your applications several orders of magnitude faster.
Tags: Algorithms, Big Data, Data Science, Infrastructure, Parallel Programming, Programming, Python, Science
Scheduled on Thursday at 10:30 in room cubus
Jens Dittrich is a Full Professor of Computer Science in the area of Databases, Data Management, and Big Data Analytics at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a BMBF VIP Grant in 2011, a best paper award at VLDB 2014, three CS teaching awards in 2011, 2013, and 2018, as well as several presentation awards, including a qualification for the interdisciplinary German science slam finals in 2012 and three presentation awards at CIDR (Conference on Innovative Data Systems Research) in 2011, 2013, and 2015. He has been a PC member and area chair/group leader of prestigious international database conferences and journals such as PVLDB/VLDB, SIGMOD, ICDE, and VLDB Journal. He is a member of the scientific advisory board of Software AG. He was a keynote speaker at VLDB 2017: “Deep Learning (m)eats Databases” (http://bit.ly/DL_meats_DB) and will also be speaking at DEEM@SIGMOD (Data Management for End-to-End Machine Learning, http://deem-workshop.org/). At Saarland University he co-organizes the Data Science Summer School.
His research focuses on big data analytics including in particular: data analytics on large datasets, scalability, main-memory databases, database indexing, reproducibility, and scalable data science. He enjoys coding data science problems in Python, in particular using the keras and tensorflow libraries for Deep Learning. Since 2017 he has been working on a start-up at the intersection of data science and databases (http://daimond.ai). He teaches some of his classes as flipped classrooms (https://www.youtube.com/user/jensdit) and tweets at https://twitter.com/jensdittrich.
The confusion around terms such as NoSQL, Big Data, Data Science, Spark, SQL, and Data Lakes often creates more fog than clarity. However, clarity about the underlying technologies is crucial to designing the best technical solution in any field relying on huge amounts of data, including data science and machine learning, but also more traditional analytical systems such as data integration, data warehousing, reporting, and OLAP.
In my presentation, I will show that at least three dimensions are often conflated in discussions of data management: first, buzzwords (labels & terms such as "big data", "AI", and "data lake"); second, data design patterns (principles & best practices such as selection push-down, materialization, and indexing); and third, software platforms (concrete implementations & frameworks such as Python, DBMSs, Spark, and NoSQL systems).
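To make the second dimension concrete, here is a minimal sketch (not taken from the talk; the data and numbers are made up) of the selection push-down pattern in plain Python: filtering before a join instead of after it produces the same result while shrinking the intermediate data.

```python
# Illustrative sketch of the "selection push-down" design pattern.
# Toy data: orders referencing customers; all values are synthetic.
orders = [{"id": i, "customer": i % 100, "amount": i % 50} for i in range(10_000)]
customers = [{"customer": c, "country": "DE" if c % 2 else "US"} for c in range(100)]

def join_then_filter():
    # Naive plan: join first, filter afterwards -> large intermediate result.
    joined = [
        {**o, **c}
        for o in orders
        for c in customers
        if o["customer"] == c["customer"]
    ]
    return [r for r in joined if r["country"] == "DE" and r["amount"] > 40]

def filter_then_join():
    # Push the selections below the join: both inputs shrink before joining,
    # and the dict acts as a hash index on the join key.
    de_customers = {c["customer"]: c for c in customers if c["country"] == "DE"}
    big_orders = [o for o in orders if o["amount"] > 40]
    return [
        {**o, **de_customers[o["customer"]]}
        for o in big_orders
        if o["customer"] in de_customers
    ]
```

Note that the same snippet also touches the other two patterns named above: the dict lookup is a simple form of indexing, and `de_customers` could be kept around as a materialized intermediate if it is reused.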
Only by keeping these three dimensions apart is it possible to create technically sound architectures in the field of big data analytics.
I will show concrete examples that, through a simple redesign and a wise choice of tools and technologies, run up to 1000 times faster. This in turn yields tremendous savings in development time, hardware costs, and maintenance effort.