Monday, September 9, 2024

Deep Dive into Pandas Copy-on-Write Mode - Part III | by Patrick Hoefler | Sep, 2023



Explaining the migration path for Copy-on-Write

Towards Data Science
Photograph by Zoe Nicolaou on Unsplash

Introduction

The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We will investigate how we can adapt our code to avoid errors when CoW is enabled by default. This is currently planned for the pandas 3.0 release, which is scheduled for April 2024. The first post in this series explained the behavior of Copy-on-Write, while the second post dove into performance optimizations that are related to Copy-on-Write.

We are planning to add a warning mode that will warn for all operations that will change behavior with CoW. The warning will be very noisy for users and thus has to be treated with some care. This post explains common cases and how you can adapt your code to avoid changes in behavior.

Chained assignment

Chained assignment is a technique where one object is updated through two subsequent operations.

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

df["x"][df["x"] > 1] = 100

The first operation selects the column "x" while the second operation restricts the number of rows. There are many different combinations of these operations (e.g. combined with loc or iloc). None of these combinations will work under CoW. Instead, they will raise a ChainedAssignmentError warning, so that these patterns get removed instead of silently doing nothing.

Generally, you can use loc instead:

df.loc[df["x"] > 1, "x"] = 100

The first dimension of loc always corresponds to the row-indexer. This means that you are able to select a subset of rows. The second dimension corresponds to the column-indexer, which enables you to select a subset of columns.

It is generally faster to use loc when you want to set values into a subset of rows, so this will clean up your code and provide a performance improvement.
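As a minimal sketch of the difference, the snippet below runs both forms side by side. The names df_chained and df_loc are ours, and the warnings are silenced only to keep the output clean. What the chained form does depends on the pandas version and on whether CoW is active, so its result is printed without any claim about the value.

```python
import warnings

import pandas as pd

df_chained = pd.DataFrame({"x": [1, 2, 3]})
df_loc = pd.DataFrame({"x": [1, 2, 3]})

# Chained assignment: two indexing steps, the second one writes into a
# temporary object. pandas warns about this, so warnings are silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_chained["x"][df_chained["x"] > 1] = 100

# Single-step assignment: row indexer first, column indexer second.
df_loc.loc[df_loc["x"] > 1, "x"] = 100

chained_result = df_chained["x"].tolist()  # version and mode dependent
loc_result = df_loc["x"].tolist()
print(chained_result, loc_result)
```

The loc form gives [1, 100, 100] regardless of the mode, which is what makes it the safe replacement.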

This is the obvious case where CoW will have an impact. It will also impact chained inplace operations:

df["x"].replace(1, 100, inplace=True)

The pattern is the same as above. The column selection is the first operation. The replace method tries to operate on the temporary object, which will fail to update the initial object. You can also remove these patterns quite easily by specifying the columns you want to operate on.

df = df.replace({"x": 1}, {"x": 100})

Patterns to avoid
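A short sketch of this column-wise form follows. The second column "y" is an assumption added for illustration, to show that columns not named in the dictionaries are left alone.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [1, 1, 1]})

# Replace 1 with 100 in column "x" only; column "y" is not listed in the
# dictionaries, so its values stay as they are.
df = df.replace({"x": 1}, {"x": 100})

x_values = df["x"].tolist()
y_values = df["y"].tolist()
print(x_values, y_values)
```

Because the result is assigned back to df, no temporary object is involved and nothing relies on inplace semantics.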

My previous post explains how the CoW mechanism works and how DataFrames share the underlying data. A defensive copy will be performed if two objects share the same data while you are modifying one object inplace.

df2 = df.reset_index()
df2.iloc[0, 0] = 100

The reset_index operation will create a view of the underlying data. The result is assigned to a new variable df2, which means that two objects share the same data. This holds true until df is garbage collected. The setitem operation will thus trigger a copy. This is completely unnecessary if you don't need the initial object df anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.

df = df.reset_index()
df.iloc[0, 0] = 100

Summarizing, creating multiple references in the same method keeps unnecessary references alive.

Temporary references that are created when chaining different methods together are fine.

df = df.reset_index().drop(...)

This will only keep one reference alive.
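The defensive copy described at the start of this section can be made visible with np.shares_memory. This is only a sketch: whether the two frames share data before the write depends on the pandas version and on whether CoW is active, so that value is reported rather than assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df2 = df.reset_index()  # second object that may share the "x" data with df

# Version and mode dependent, so this value is only reported.
shared_before = np.shares_memory(df["x"].to_numpy(), df2["x"].to_numpy())

# Write into column "x" of df2 (position 1, after the new "index" column).
df2.iloc[0, 1] = 100

shared_after = np.shares_memory(df["x"].to_numpy(), df2["x"].to_numpy())
print(shared_before, shared_after)
print(df["x"].tolist(), df2["x"].tolist())
```

After the write, the two frames no longer share the "x" data and df still holds its original values. If df is not needed anymore, reassigning to the same name avoids paying for that copy.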

Accessing the underlying NumPy array

pandas currently gives us access to the underlying NumPy array through to_numpy or .values. The returned array is a copy if your DataFrame consists of different dtypes, e.g.:

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()

[[1. 1.5]
[2. 2.5]]

The DataFrame is backed by two arrays which have to be combined into one. This triggers the copy.

The other case is a DataFrame that is only backed by a single NumPy array, e.g.:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()

[[1 3]
[2 4]]

We can directly access the array and get a view instead of a copy. This is much faster than copying all data. We can now operate on the NumPy array and potentially modify it inplace, which will also update the DataFrame and potentially all other DataFrames that share data. This becomes much more complicated with Copy-on-Write, since we removed many defensive copies. Many more DataFrames will now share memory with each other.

to_numpy and .values will return a read-only array because of this. This means that the resulting array is not writeable.
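One way to check the copy versus view distinction is to call to_numpy twice and ask NumPy whether the two results share memory. The variable names below are ours, and the sketch assumes, as described above, that the single-dtype frame is backed by one array.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
single = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Mixed dtypes: every call combines the columns into a new float64 array,
# so two calls produce two independent buffers.
mixed_dtype = mixed.to_numpy().dtype
mixed_shares = np.shares_memory(mixed.to_numpy(), mixed.to_numpy())

# Single dtype: both calls are expected to expose the same underlying buffer.
single_dtype = single.to_numpy().dtype
single_shares = np.shares_memory(single.to_numpy(), single.to_numpy())

print(mixed_dtype, mixed_shares)
print(single_dtype, single_shares)
```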

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()

arr[0, 0] = 1

This will trigger a ValueError:

ValueError: assignment destination is read-only

You can avoid this in two different ways:

  • Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
  • Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.
arr.flags.writeable = True
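The first option, the manual copy, could look like the following sketch. It uses the plain NumPy copy method on the returned array, which is one of several ways to get an independent buffer.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Copy the array before writing: the copy owns its data and is writeable,
# and changes to it cannot reach df or any frame sharing data with df.
arr = df.to_numpy().copy()
arr[0, 0] = 100

is_writeable = arr.flags.writeable
print(is_writeable, arr[0, 0], df.iloc[0, 0])
```

The price is one full copy of the data, which is the trade-off against flipping the writeable flag.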

There are cases where this is not possible. One common occurrence is when you are accessing a single column that is backed by PyArrow:

ser = pd.Series([1, 2], dtype="int64[pyarrow]")
arr = ser.to_numpy()
arr.flags.writeable = True

This raises a ValueError:

ValueError: cannot set WRITEABLE flag to True of this array

Arrow arrays are immutable, hence it is not possible to make the NumPy array writeable. The conversion from Arrow to NumPy is zero-copy in this case.

Conclusion

We have looked at the most invasive Copy-on-Write related changes. These changes will become the default behavior in pandas 3.0. We have also investigated how we can adapt our code to avoid breaking it when Copy-on-Write is enabled. The upgrade process should be pretty smooth if you can avoid these patterns.


