Exponential Smoothing using Scikit-Learn wrapper & statsmodels

As we discussed in previous post, in this series we will be working on different type of regression problem and will try to parse them as sklearn model objects. Here are we will be working with ExponentialSmoothing from statsmodels library. This is one of the most common time series model used in general. statsmodels is heavily influence by R and its convention. If you take a deep dive into it’s functionality of or formula based approach in case of GLM, it is quite evident including the summary function.

Here in this post our objective will be to take this ExponentialSmoothing from statsmodels and build a model class which looks like sklearn model and the method signatures are same in nature so there is less learning curve involved when there is a context switch or in this case a user is trying to build different models.

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np
import pandas as pd
import inspect

class HWTimeSeriesSkLearnWrapper(BaseEstimator, RegressorMixin):
    def __init__(
        self,
        trend=None,
        damped_trend=False,
        seasonal=None,
        seasonal_periods=None,
        use_boxcox=False,
        optimized=True,
        remove_bias=False,
        initialization_method = None,
        model=ExponentialSmoothing,
    ):
        args, _, _, values = inspect.getargvalues(inspect.currentframe())
        values.pop("self")
        for arg, val in values.items():
            setattr(self, arg, val)

    def fit(self, X, y = None):
        if not X.empty:
            self.X = X
        self.model_ = self.model(
            self.X,
            trend=self.trend,
            seasonal=self.seasonal,
            seasonal_periods=self.seasonal_periods,
            damped_trend=self.damped_trend,
            use_boxcox=self.use_boxcox,
        )
        self.result_ = self.model_.fit(
            optimized=self.optimized, remove_bias=self.remove_bias
        )
        self.level = self.result_.level
        self.trend = self.result_.trend
        self.season = self.result_.season
        self.fitted_values = self.result_.fittedvalues
        return self

    def predict(self, X, y = None):
        check_is_fitted(self, "result_")
        self.df = self.result_.forecast(X.index.size).to_frame()
        self.df.columns = [f"predicted_{name}" for name in X.columns]
        self.df.index.name = X.index.name
        return self.df

    def score(self, X, y):
        y_true = X.to_numpy().flatten()
        y_hat = self.predict(X).to_numpy().flatten()
        return mean_absolute_percentage_error(y_true= y_true, y_pred=y_hat)

As you can see in the above block, I have come up with this structure which can take all the args of __init__ and fit of ExponentialSmoothing while initializing the wrapper class HWTimeSeriesSkLearnWrapper. Note, these are not all the exhaustive set of parameter. You can take a look into the source code of the model class from statsmodels and figure out what are the other parameters which can be helpful for your context.

args, _, _, values = inspect.getargvalues(inspect.currentframe())
values.pop("self")
for arg, val in values.items():
    setattr(self, arg, val)

The above chunk of the code will help to initialize arbitrary number of parameters for your model and in subsequent methods such as initialization or fit you can pass the parameter values from self. In the next block, I am downloading the data and creating a train test split. Note, in case of time series we need to remember two things,

  • Train test split can not be random
  • There is no such concept of train and test data. Index is kind of your core feature along with meta features which can be derived from target.

The catch here is we need to create train and test in a way it can look like sklearn model but under the hood it should be able to use the same datasets for building times series model using statsmodels or any other library you may use.

def get_and_split_airline_data(test_size = 0.2, date_col = "month", date_format = "%Y-%m", date_freq = "MS"):
    df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv")
    df.columns = df.columns.str.lower()
    if date_col in df.columns:
        df[date_col] = pd.to_datetime(df[date_col], format = date_format)
        df = df.sort_values(date_col).set_index(date_col)
        df.index.freq = date_freq
    train_size = int(df.shape[0] * (1 - test_size))
    X_train = df.iloc[:train_size]
    X_test = df.iloc[train_size:]
    y_train = X_train.copy()
    y_test = X_test.copy()
    return X_train,X_test,y_train, y_test

Finally, once we have the data we can use fit, predict and score to calculate the required values.

X_train,X_test,y_train, y_test = get_and_split_airline_data()
model = HWTimeSeriesSkLearnWrapper(trend = "add", damped_trend = True, seasonal = "add", seasonal_periods = 12)
model.fit(X_train, y_train)
model.predict(X_test)
model.score(X_test, y_test)
Aritra Biswas
Aritra Biswas
Senior ML Engineer

My research interests include computational statistics, causal inference, simulation and mathematical optimization.

Related