Talks: Speed is Not All You Need for Data Processing

Saturday - May 18th, 2024 1:45 p.m.-2:15 p.m. in Hall C

Presented by:

Description

Pandas was the dominant local data processing framework for the majority of the last decade. Now there are many other options available, such as Polars and DuckDB. Is it worth switching to them? One of the main reasons developers switch is the supposed speed advantage. TPC-H benchmarks show Polars and DuckDB to be an order of magnitude faster than Pandas (and Dask) because of their Rust-based (Polars) and C++ (DuckDB) implementations.

For large-scale data, we are often told to use pure native Spark whenever possible. Pandas UDFs are often discouraged because they are deemed a bottleneck: the optimizer works best when it can see the entire query plan, but a Pandas UDF is a black box to it.

But as practitioners, we have to ask two related questions: 1. Are these assumptions true? Is it universally true that Pandas and Pandas UDFs are slower? 2. Even if they are slower, is it worth the development overhead to avoid using Pandas?

In this talk, we'll present benchmarks across data of various sizes to show that these common assumptions are not always true. In fact, we'll see that Pandas UDFs can actually be faster than native Spark in some cases. With this result in mind, data practitioners should focus on the tools that serve them best rather than adjusting their work to fit the tools.
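
As a rough illustration of what "benchmarks across data of various sizes" can look like, here is a minimal sketch of a timing harness using only the Python standard library. It is not the benchmark code from the talk: the two implementations (`mean_with_loop`, `mean_with_builtin`) and the `benchmark` helper are hypothetical stand-ins for whichever frameworks you want to compare, and the sizes are arbitrary.

```python
import random
import timeit


def mean_with_loop(values):
    """Hypothetical stand-in for one implementation: an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_with_builtin(values):
    """Hypothetical stand-in for another implementation: the built-in sum."""
    return sum(values) / len(values)


def benchmark(implementations, sizes, repeats=5):
    """Time every implementation at every data size.

    Returns a dict mapping (name, size) to the best-of-`repeats` wall time
    in seconds for a single call.
    """
    rng = random.Random(0)  # fixed seed so every run sees the same data
    results = {}
    for n in sizes:
        data = [rng.random() for _ in range(n)]
        for name, func in implementations.items():
            timings = timeit.repeat(lambda: func(data), number=1, repeat=repeats)
            results[(name, n)] = min(timings)
    return results


if __name__ == "__main__":
    impls = {"loop": mean_with_loop, "builtin": mean_with_builtin}
    for (name, n), seconds in sorted(benchmark(impls, [1_000, 100_000]).items()):
        print(f"{name:>8} n={n:>7}: {seconds:.6f}s")
```

Running the same comparison at several sizes, rather than at one, is the point: which option is faster at one scale does not have to carry over to another.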