
Metadata

Highlights

  • While Pandas is the most popular DataFrame library, it is terribly slow:
    • It only uses a single CPU core.
    • It has bulky DataFrames.
    • It eagerly executes code, which prevents any possible optimization.

    FireDucks is a highly optimized, drop-in replacement for Pandas with the same API. You just need to change one line of code → import fireducks.pandas as pd
  • As you can tell, FireDucks is even faster than cuDF in this case. That said, the query in the above experiment loads all columns of the two parquet files. When I optimized it manually by loading only the required columns, the run-times changed to:
    • Pandas: 14 seconds (from 48 seconds)
    • FireDucks: 0.8 seconds (from 0.8 seconds, i.e. unchanged)
    • cuDF: 0.9 seconds (from 2.6 seconds)

    This shows that FireDucks' compiler applies the same optimization automatically, whereas in cuDF and Pandas one has to do it explicitly.
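
The manual optimization described in the second highlight (loading only the required columns) can be sketched in plain Pandas as follows. This is a minimal illustration, not the query from the original experiment: the tables, column names, and values are hypothetical, and in-memory CSV text stands in for the two parquet files so the snippet runs without a parquet engine. With real parquet files, the analogous Pandas call is `pd.read_parquet(path, columns=[...])`.

```python
import io

# Plain Pandas is used here. According to the highlight above, FireDucks is
# swapped in by changing this line to: import fireducks.pandas as pd
import pandas as pd

# Hypothetical data standing in for the two parquet files of the experiment.
orders_csv = (
    "order_id,customer_id,amount,note\n"
    "1,10,25.0,a\n"
    "2,11,40.0,b\n"
    "3,10,15.0,c\n"
)
customers_csv = (
    "customer_id,name,address\n"
    "10,Ann,X\n"
    "11,Bob,Y\n"
)

# Manual column pruning: read only the columns the query actually needs,
# instead of loading every column and discarding most of them later.
orders = pd.read_csv(io.StringIO(orders_csv), usecols=["customer_id", "amount"])
customers = pd.read_csv(io.StringIO(customers_csv), usecols=["customer_id", "name"])

# A small join-and-aggregate query over the pruned tables.
totals = (
    orders.merge(customers, on="customer_id")
    .groupby("name", as_index=False)["amount"]
    .sum()
)
print(totals)
```

In eager Pandas this pruning has to be written by hand at load time; per the highlight, FireDucks' compiler arrives at the same effect without the code change.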