
Modeling Variable Seasonal Features with the Fourier Transform | by Florin Andrei | Oct, 2023

Improve time series forecast performance with a technique from signal processing

Towards Data Science

Modeling time series data and forecasting it are complex topics. There are many techniques that could be used to improve model performance for a forecasting task. We will discuss here a technique that may improve the way forecasting models absorb and utilize time features, and generalize from them. The main focus will be the creation of the seasonal features that feed the time series forecasting model in training: there are easy gains to be made here if you include the Fourier transform in the feature creation process.

This article assumes you are familiar with the basic aspects of time series forecasting. We will not discuss this topic in general, only a refinement of one aspect of it. For an introduction to time series forecasting, see the Kaggle course on this topic; the technique discussed here builds on top of their lesson on seasonality.

Consider the Quarterly Australian Portland Cement production dataset, showing the total quarterly production of Portland cement in Australia (in millions of tonnes) from 1956:Q1 to 2014:Q1.

import pandas as pd

df = pd.read_csv('Quarterly_Australian_Portland_Cement_production_776_10.csv', usecols=['time', 'value'])
# convert time from year float to a proper datetime format
df['time'] = df['time'].apply(lambda x: str(int(x)) + '-' + str(int(1 + 12 * (x % 1))).rjust(2, '0'))
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time').to_period()
df.rename(columns={'value': 'production'}, inplace=True)
        production
time
1956Q1       0.465
1956Q2       0.532
1956Q3       0.561
1956Q4       0.570
1957Q1       0.529
...            ...
2013Q1       2.049
2013Q2       2.528
2013Q3       2.637
2013Q4       2.565
2014Q1       2.229

[233 rows x 1 columns]

cement production

This is time series data with a trend, seasonal components, and other attributes. The observations are made quarterly, spanning several decades. Let's take a look at the seasonal components first, using the periodogram plot (all code is included in the companion notebook, linked at the end).
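
The notebook has the exact plotting code; here is a minimal periodogram sketch with scipy.signal.periodogram (the detrend and scaling choices below are assumptions, not necessarily what the notebook uses):

import matplotlib.pyplot as plt
from scipy.signal import periodogram

# fs = 4: quarterly data, so frequencies come out in cycles per year
freqs, spectrum = periodogram(df['production'], fs=4, detrend='linear', scaling='spectrum')
plt.semilogy(freqs[1:], spectrum[1:])  # skip the zero-frequency bin
plt.xlabel('frequency (cycles per year)')
plt.ylabel('power')
plt.show()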

periodogram

The periodogram shows the power density of spectral components (seasonal components). The strongest component is the one with a period equal to 4 quarters, or 1 year. This confirms the visual impression that the strongest up-and-down variations in the graph happen about 10 times per decade. There is also a component with a period of 2 quarters: that is the same seasonal phenomenon, and it simply means the 4-quarter periodicity is not a simple sine wave, but has a more complex shape. We will ignore components with periods greater than 4, for simplicity.
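
Continuing the sketch above, you can rank components by power to confirm this reading (freqs and spectrum are the arrays returned by the hypothetical periodogram call earlier):

import numpy as np

# strongest spectral components, with frequency converted to a period in quarters
for i in np.argsort(spectrum)[::-1][:3]:
    if freqs[i] > 0:
        print(f'period: {4 / freqs[i]:.1f} quarters, power: {spectrum[i]:.3g}')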

If you try to fit a model to this data, perhaps in order to forecast the cement production for times beyond the end of the dataset, it would be a good idea to capture this yearly seasonality in a feature or set of features, and provide those to the model in training. Let's see an example.

Seasonal components can be modeled in a number of different ways. Often, they are represented as time dummies, or as sine-cosine pairs. These are synthetic features with a period equal to the seasonality you are trying to model, or a submultiple of it.

Yearly time dummies:

from statsmodels.tsa.deterministic import DeterministicProcess

seasonal_year = DeterministicProcess(index=df.index, constant=False, seasonal=True).in_sample()
print(seasonal_year)
        s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1     1.0     0.0     0.0     0.0
1956Q2     0.0     1.0     0.0     0.0
1956Q3     0.0     0.0     1.0     0.0
1956Q4     0.0     0.0     0.0     1.0
1957Q1     1.0     0.0     0.0     0.0
...        ...     ...     ...     ...
2013Q1     1.0     0.0     0.0     0.0
2013Q2     0.0     1.0     0.0     0.0
2013Q3     0.0     0.0     1.0     0.0
2013Q4     0.0     0.0     0.0     1.0
2014Q1     1.0     0.0     0.0     0.0

[233 rows x 4 columns]

Yearly sine-cosine pairs:

from statsmodels.tsa.deterministic import CalendarFourier

cfr = CalendarFourier(freq='Y', order=2)
seasonal_year_trig = DeterministicProcess(index=df.index, seasonal=False, additional_terms=[cfr]).in_sample()
with pd.option_context('display.max_columns', None, 'display.expand_frame_repr', False):
    print(seasonal_year_trig)
        sin(1,freq=A-DEC)  cos(1,freq=A-DEC)  sin(2,freq=A-DEC)  cos(2,freq=A-DEC)
time
1956Q1           0.000000           1.000000           0.000000           1.000000
1956Q2           0.999963           0.008583           0.017166          -0.999853
1956Q3           0.017166          -0.999853          -0.034328           0.999411
1956Q4          -0.999963          -0.008583           0.017166          -0.999853
1957Q1           0.000000           1.000000           0.000000           1.000000
...                   ...                ...                ...                ...
2013Q1           0.000000           1.000000           0.000000           1.000000
2013Q2           0.999769           0.021516           0.043022          -0.999074
2013Q3           0.025818          -0.999667          -0.051620           0.998667
2013Q4          -0.999917          -0.012910           0.025818          -0.999667
2014Q1           0.000000           1.000000           0.000000           1.000000

[233 rows x 4 columns]

Time dummies can capture very complex waveforms of the seasonal phenomenon. On the flip side, if the period is N, then you need N time dummies; so, if the period is very long, you will need a lot of dummy columns, which may not be desirable. For non-penalized models, often just N-1 dummies are used, with one dropped to avoid collinearity issues (we will ignore that here).
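
As a side note, if you build the dummies by hand rather than with DeterministicProcess, pandas can drop one level for you; a hypothetical sketch, not what the article uses:

# hand-rolled quarterly dummies, with the first level dropped to avoid collinearity
quarter_dummies = pd.get_dummies(df.index.quarter, prefix='q', drop_first=True)
quarter_dummies.index = df.index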

Sine/cosine pairs can model arbitrarily long time periods. Each pair models a pure sine waveform with some initial phase. As the waveform of the seasonal feature becomes more complex, you will need to add more pairs (increase the order of the output from CalendarFourier).

(Side note: we use a sine/cosine pair for each period we want to model. What we really want is just one column of A*sin(2*pi*t/T + phi), where A is the weight assigned by the model to the column, t is the time index of the series, and T is the component period. But the model cannot adjust the initial phase phi while fitting the data, since the sine values are pre-computed. Luckily, the formula above is equivalent to A1*sin(2*pi*t/T) + A2*cos(2*pi*t/T), and the model only needs to find the weights A1 and A2.)
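
A quick numeric check of that identity, as a sketch: with A1 = A*cos(phi) and A2 = A*sin(phi), the two expressions agree for any choice of A, phi and T (the values below are arbitrary):

import numpy as np

# sin(x + phi) = sin(x)*cos(phi) + cos(x)*sin(phi), scaled by A
A, phi, T = 1.7, 0.6, 4.0
t = np.arange(20)
lhs = A * np.sin(2 * np.pi * t / T + phi)
rhs = A * np.cos(phi) * np.sin(2 * np.pi * t / T) + A * np.sin(phi) * np.cos(2 * np.pi * t / T)
assert np.allclose(lhs, rhs)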

TLDR: What these two techniques have in common is that they represent seasonality via a set of repetitive features, with values that cycle as often as once per the time period being modeled (time dummies, and the base sine/cosine pair), or several times per period (higher-order sine/cosine). And each one of these features has values varying within a fixed interval: from 0 to 1, or from -1 to 1. These are all the features we have to model seasonality here.

Let's see what happens when we fit a linear model on such seasonal features. But first, we need to add trend features to the feature collection used to train the model.

Let's create trend features and then join them with the seasonal_year time dummies generated above:

trend_order = 2
trend_year = DeterministicProcess(index=df.index, constant=True, order=trend_order).in_sample()
X = trend_year.copy()
X = X.join(seasonal_year)
print(X)
        const  trend  trend_squared  s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1    1.0    1.0            1.0     1.0     0.0     0.0     0.0
1956Q2    1.0    2.0            4.0     0.0     1.0     0.0     0.0
1956Q3    1.0    3.0            9.0     0.0     0.0     1.0     0.0
1956Q4    1.0    4.0           16.0     0.0     0.0     0.0     1.0
1957Q1    1.0    5.0           25.0     1.0     0.0     0.0     0.0
...       ...    ...            ...     ...     ...     ...     ...
2013Q1    1.0  229.0        52441.0     1.0     0.0     0.0     0.0
2013Q2    1.0  230.0        52900.0     0.0     1.0     0.0     0.0
2013Q3    1.0  231.0        53361.0     0.0     0.0     1.0     0.0
2013Q4    1.0  232.0        53824.0     0.0     0.0     0.0     1.0
2014Q1    1.0  233.0        54289.0     1.0     0.0     0.0     0.0

[233 rows x 7 columns]

This is the X dataframe (features) that will be used to train/validate the model. We are modeling the data with quadratic trend features, plus the four time dummies needed for the yearly seasonality. The y dataframe (target) will be just the cement production numbers.

Let's carve a validation set out of the data, containing the observations from 2010 onward. We will fit a linear model on the train portion of the X features shown above and the y target represented by cement production, and then we will evaluate model performance in validation. We will also do all of the above with a trend-only model that fits just the trend features and ignores seasonality.
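
The exact split is defined in the companion notebook; a minimal sketch under the assumption that everything before 2010 is train and 2010 onward is validation:

# hypothetical split point, matching the description above
index_train = df.index[df.index.year < 2010]
index_test = df.index[df.index.year >= 2010]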

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

def do_forecast(X, index_train, index_test, trend_order):
    X_train = X.loc[index_train]
    X_test = X.loc[index_test]

    y_train = df['production'].loc[index_train]
    y_test = df['production'].loc[index_test]

    # full model: trend + seasonal features
    model = LinearRegression(fit_intercept=False)
    _ = model.fit(X_train, y_train)
    y_fore = pd.Series(model.predict(X_test), index=index_test)
    y_past = pd.Series(model.predict(X_train), index=index_train)

    # trend-only model, for comparison
    trend_columns = X_train.columns.to_list()[0 : trend_order + 1]
    model_trend = LinearRegression(fit_intercept=False)
    _ = model_trend.fit(X_train[trend_columns], y_train)
    y_trend_fore = pd.Series(model_trend.predict(X_test[trend_columns]), index=index_test)
    y_trend_past = pd.Series(model_trend.predict(X_train[trend_columns]), index=index_train)

    RMSLE = mean_squared_log_error(y_test, y_fore, squared=False)
    print(f'RMSLE: {RMSLE}')

    # plot_params is a styling dict defined in the companion notebook
    ax = df.plot(**plot_params, title='AUS Cement Production - Forecast')
    ax = y_past.plot(color='C0', label='Backcast')
    ax = y_fore.plot(color='C3', label='Forecast')
    ax = y_trend_past.plot(ax=ax, color='C0', linewidth=3, alpha=0.333, label='Trend Past')
    ax = y_trend_fore.plot(ax=ax, color='C3', linewidth=3, alpha=0.333, label='Trend Future')
    _ = ax.legend()

do_forecast(X, index_train, index_test, trend_order)

RMSLE: 0.03846449744356434
model validation

Blue is train, red is validation. The full model is the sharp, thin line. The trend-only model is the wide, fuzzy line.

This is not bad, but there is one glaring issue: the model has learned a constant-amplitude yearly seasonality. Even though the yearly variation actually increases in time, the model can only stick with fixed-amplitude variations. Obviously, this is because we gave the model only fixed-amplitude seasonal features, and there is nothing else in the features (target lags, etc.) to help it overcome this issue.

Let's dig deeper into the signal (the data) to see if there is anything there that could help with the fixed-amplitude issue.

The periodogram will highlight all spectral components in the signal (all seasonal components in the data), and will provide an overview of their overall strength, but it is an aggregate, a sum over the whole time interval. It says nothing about how the amplitude of each seasonal component may vary in time across the dataset.

To capture that variation, you can use the Fourier spectrogram instead. It is like the periodogram, but performed repeatedly over many time windows across the entire data set. Both periodogram and spectrogram are available as functions in the scipy library.
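
If you want to experiment with the raw scipy version before reaching for the notebook helpers, here is a minimal sketch; the window length and overlap (nperseg, noverlap) are arbitrary assumptions:

import matplotlib.pyplot as plt
from scipy.signal import spectrogram

# fs = 4: quarterly data, so frequencies are in cycles per year
freqs, times, Sxx = spectrogram(df['production'].values, fs=4, nperseg=16, noverlap=12, detrend='linear')
plt.pcolormesh(times, freqs, Sxx, shading='gouraud')
plt.xlabel('time (years)')
plt.ylabel('frequency (cycles per year)')
plt.show()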

Let's plot the spectrogram for the seasonal components with periods of 2 and 4 quarters, mentioned above. As always, the full code is in the companion notebook linked at the end.

spectrum = compute_spectrum(df['production'], 4, 0.1)
plot_spectrogram(spectrum, figsize_x=10)
spectrogram

What this diagram shows is the amplitude, the "strength" of the 2-quarter and 4-quarter components over time. They are quite weak initially, but become very strong around 2010, which matches the variations you can see in the data set plot at the top of this article.

What if, instead of feeding the model fixed-amplitude seasonal features, we adjust the amplitude of these features in time, matching what the spectrogram tells us? Would the final model perform better in validation?

Let's pick the 4-quarter seasonal component. We could fit a linear model (let's call it the envelope model) on the order=2 trend of this component, on the train data (model.fit()), and extend that trend into validation / testing (model.predict()):

envelope_features = DeterministicProcess(index=X.index, constant=True, order=2).in_sample()

spec4_train = compute_spectrum(df['production'].loc[index_train], max_period=4)
spec4_train

spec4_model = LinearRegression()
spec4_model.fit(envelope_features.loc[spec4_train.index], spec4_train['4.0'])
spec4_regress = pd.Series(spec4_model.predict(envelope_features), index=X.index)

ax = spec4_train['4.0'].plot(label='component envelope', color='grey')
spec4_regress.loc[spec4_train.index].plot(ax=ax, color='C0', label='envelope regression: past')
spec4_regress.loc[index_test].plot(ax=ax, color='C3', label='envelope regression: future')
_ = ax.legend()

envelope fit

The blue line is the strength of the variation of the 4-quarter component in the train data, fitted as a quadratic trend (order=2). The red line is the same thing, extended (predicted) over the validation data.

We have modeled the variation in time of the 4-quarter seasonal component. We can take the output from the envelope model, and multiply it by the time dummies corresponding to the 4-quarter seasonal component:

# normalize the envelope so its average value is 1
spec4_regress = spec4_regress / spec4_regress.mean()

season_columns = ['s(1,4)', 's(2,4)', 's(3,4)', 's(4,4)']
for c in season_columns:
    X[c] = X[c] * spec4_regress
print(X)

        const  trend  trend_squared    s(1,4)    s(2,4)    s(3,4)    s(4,4)
time
1956Q1    1.0    1.0            1.0  0.179989  0.000000  0.000000  0.000000
1956Q2    1.0    2.0            4.0  0.000000  0.181109  0.000000  0.000000
1956Q3    1.0    3.0            9.0  0.000000  0.000000  0.182306  0.000000
1956Q4    1.0    4.0           16.0  0.000000  0.000000  0.000000  0.183581
1957Q1    1.0    5.0           25.0  0.184932  0.000000  0.000000  0.000000
...       ...    ...            ...       ...       ...       ...       ...
2013Q1    1.0  229.0        52441.0  2.434701  0.000000  0.000000  0.000000
2013Q2    1.0  230.0        52900.0  0.000000  2.453436  0.000000  0.000000
2013Q3    1.0  231.0        53361.0  0.000000  0.000000  2.472249  0.000000
2013Q4    1.0  232.0        53824.0  0.000000  0.000000  0.000000  2.491139
2014Q1    1.0  233.0        54289.0  2.510106  0.000000  0.000000  0.000000

[233 rows x 7 columns]

The four time dummies are not either 0 or 1 anymore. They have been multiplied by the component envelope, and now their amplitude varies in time, just like the envelope.

Let's train the main model again, now using the modified time dummies.

do_forecast(X, index_train, index_test, trend_order)
RMSLE: 0.02546321729737165
model validation, adjusted time dummies

We are not aiming for a perfect fit here, since this is just a toy model, but the model clearly performs better: it follows the 4-quarter variations in the target more closely. The performance metric has improved as well, with the RMSLE dropping from 0.0385 to 0.0255 (a 34% reduction), which is not bad at all.

Improving model performance is a vast topic. The behavior of any model does not depend on a single feature, or on a single technique. However, if you are looking to extract all you can out of a given model, you should probably feed it meaningful features. Time dummies, or sine/cosine Fourier pairs, are more meaningful when they reflect the variation in time of the seasonality they are modeling.

Adjusting the seasonal component's envelope via the Fourier transform is especially effective for linear models. Boosted trees do not benefit as much, but you can still see improvements almost no matter what model you use. If you are using a hybrid model, where a linear model handles deterministic features (calendar, etc.) while a boosted trees model handles more complex features, it is important that the linear model "gets it right", therefore leaving less work for the other model.

It is also important that the envelope models you use to adjust seasonal features are trained only on the train data, and do not see any testing data while in training, just like any other model. Once you adjust the envelope, the time dummies contain information from the target; they are not purely deterministic features anymore that can be computed ahead of time over any arbitrary forecast horizon. So at this point, information could leak from validation/testing back into the training data if you are not careful.
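
To make the rule concrete, here is a sketch restating what the envelope code above already does (compute_spectrum is the author's helper from the companion notebook):

# The spectrogram-derived envelope depends on the target, so compute it
# from the train slice only...
spec4_train_only = compute_spectrum(df['production'].loc[index_train], max_period=4)
envelope_model = LinearRegression().fit(envelope_features.loc[spec4_train_only.index], spec4_train_only['4.0'])
# ...then extend it over the full index using only deterministic features
envelope_full = pd.Series(envelope_model.predict(envelope_features), index=X.index)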

The data set used in this article is available here under the Public Domain (CC0) license:

The code used in this article can be found here:

A notebook submitted to the Store Sales - Time Series Forecasting competition on Kaggle, using ideas described in this article:

A GitHub repo containing the development version of the notebook submitted to Kaggle is here:

All images and code used in this article are created by the author.

