While Pandas is the most popular DataFrame library, it is terribly slow.
• It only uses a single CPU core.
• It has bulky DataFrames.
• It eagerly executes code, which prevents any possible optimization.
FireDucks is a highly optimized, drop-in replacement for Pandas with the same API.
You just need to change one line of code → 𝐢𝐦𝐩𝐨𝐫𝐭 𝗳𝗶𝗿𝗲𝗱𝘂𝗰𝗸𝘀.𝐩𝐚𝐧𝐝𝐚𝐬 𝐚𝐬 𝐩𝐝 (View Highlight)
As you can tell, FireDucks is even faster than cuDF in this case.
That said, the query in the above experiment loads all columns of the two parquet files.
When I optimized it manually by only loading the required columns, the run-time dropped to:
• Pandas: 14 seconds (from 48 seconds)
• FireDucks: 0.8 seconds (from 0.8 seconds) [same as before]
• cuDF: 0.9 seconds (from 2.6 seconds)
This shows that the FireDucks’ compiler does the same optimization automatically, which one has to explicitly do in cuDF and Pandas. (View Highlight)