Tuesday, March 19, 2024

Optimizing Pandas Code: The Influence of Operation Sequence | by Marcin Kozak | Mar, 2024



PYTHON PROGRAMMING

How to reorder your code to achieve significant speed improvements.

Towards Data Science
Photograph by Nick Fewings on Unsplash

Pandas offers a fantastic framework to operate on dataframes. In data science, we work with small, big, and sometimes very big dataframes. While analyzing small ones can be blazingly fast, even a single operation on a big dataframe can take noticeable time.

In this article I'll show that you can often shorten this time with something that costs practically nothing: the order of operations on a dataframe.

Consider the following dataframe:

import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

With 1,000,000 rows and 25 columns, it's big. Many operations on such a dataframe will take noticeable time on current personal computers.

Imagine we want to filter the rows, keeping those that satisfy the condition a < 50_000 and b > 3000, and select five columns: take_cols=['a', 'b', 'g', 'n', 'x']. We can do this in the following way:
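To put a number on "big", here is a minimal sketch that measures the shape and memory footprint of a dataframe built the same way. It assumes a scaled-down row count (100,000 instead of 1,000,000) so that it runs quickly; the names small_n, small_df and total_bytes are illustrative and not part of the original code.

```python
import pandas as pd

# Assumption: a scaled-down row count so the sketch runs quickly;
# the article's dataframe uses n = 1_000_000.
small_n = 100_000
small_df = pd.DataFrame({
    letter: list(range(small_n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

# Bytes held by the column data alone (the index is excluded).
total_bytes = int(small_df.memory_usage(index=False).sum())

print(small_df.shape)
print(f"{total_bytes / 1e6:.0f} MB")
```

Each column stores 8-byte integers, so the footprint grows linearly with the number of rows: at 1,000,000 rows the same 25 columns take roughly 200 MB.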

subdf = df[take_cols]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]

In this code, we take the required columns first, and then we filter the rows. We can achieve the same result with a different order of operations, first filtering the rows and then selecting the columns:

subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[take_cols]

We can achieve the very same result by chaining Pandas operations. The corresponding pipelines are as follows:

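# the condition as a query string (assumed definition, mirroring the condition stated above)
query = "a < 50_000 and b > 3000"

# first take columns then filter rows
df.filter(take_cols).query(query)

# first filter rows then take columns
df.query(query).filter(take_cols)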
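Before comparing speed, it is worth confirming that the four versions really return the same dataframe. The following is a minimal sketch of such a check, assuming a smaller dataframe (100,000 rows) to keep it quick; the names v1 to v4 and all_equal are illustrative, and the query string is an assumed definition that mirrors the condition stated in the text.

```python
import pandas as pd

# Assumption: fewer rows than in the article, so the check runs quickly.
n = 100_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

take_cols = ["a", "b", "g", "n", "x"]
query = "a < 50_000 and b > 3000"  # assumed definition of the condition

# Version 1: brackets, columns first, then rows
v1 = df[take_cols]
v1 = v1[v1["a"] < 50_000]
v1 = v1[v1["b"] > 3000]

# Version 2: brackets, rows first, then columns
v2 = df[df["a"] < 50_000]
v2 = v2[v2["b"] > 3000]
v2 = v2[take_cols]

# Versions 3 and 4: chained filter() and query() in both orders
v3 = df.filter(take_cols).query(query)
v4 = df.query(query).filter(take_cols)

# DataFrame.equals compares values, index and dtypes
all_equal = v1.equals(v2) and v1.equals(v3) and v1.equals(v4)
print(all_equal, v1.shape)  # True (46999, 5)
```

Every column holds the values 0 to n - 1, so the condition keeps rows 3001 through 49999, which is 46,999 rows in each version.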

Since df is big, the four versions will probably differ in performance. Which will be the fastest and which will be the slowest?

Let's benchmark these operations. We'll use the timeit module:
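What follows is a minimal sketch of how such a benchmark could be set up with timeit.repeat. It is not the author's benchmark code: the function names, the versions dictionary, the query string and the number and repeat values are all assumptions chosen for illustration.

```python
import timeit

import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

take_cols = ["a", "b", "g", "n", "x"]
query = "a < 50_000 and b > 3000"  # assumed definition of the condition


def brackets_cols_then_rows(df):
    subdf = df[take_cols]
    subdf = subdf[subdf["a"] < 50_000]
    return subdf[subdf["b"] > 3000]


def brackets_rows_then_cols(df):
    subdf = df[df["a"] < 50_000]
    subdf = subdf[subdf["b"] > 3000]
    return subdf[take_cols]


def chain_cols_then_rows(df):
    return df.filter(take_cols).query(query)


def chain_rows_then_cols(df):
    return df.query(query).filter(take_cols)


versions = {
    "brackets: columns, then rows": brackets_cols_then_rows,
    "brackets: rows, then columns": brackets_rows_then_cols,
    "chain: columns, then rows": chain_cols_then_rows,
    "chain: rows, then columns": chain_rows_then_cols,
}

number = 3  # calls per measurement (assumed value)
repeat = 3  # measurements per version (assumed value)

results = {}
for name, func in versions.items():
    times = timeit.repeat(lambda: func(df), number=number, repeat=repeat)
    # Keep the best measurement, expressed as seconds per call.
    results[name] = min(times) / number

# Print the versions from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:<30} {seconds * 1000:8.2f} ms")
```

Taking the minimum over several repeats reduces the influence of other processes running on the machine. The absolute timings depend on hardware and on the Pandas version, so only the relative ordering of the four versions is meaningful.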


