LASSO using Scikit-Learn wrapper & CVXPY
Anyone who has worked in ML for some time knows that many Python libraries support different algorithms and models. When you are trying to build the best model, it is often not possible to stick to a single library, or even to a single language or compute infrastructure. The libraries also do not share the same methods, signatures, inputs, and outputs. So if you are building a library that collects models from several different libraries, a good starting point is to standardize your classes, methods, signatures, inputs, and outputs. This avoids confusion and keeps things consistent, and as a result new users need less time to learn things.
At a high level, creating a common API structure for all the models may sound easy, but it is not. Many models take input in the form of np.ndarray, pd.Series, pd.DataFrame, or xarray objects, and some libraries implement their own data objects. To handle this, my current go-to approach is to wrap the inner workings of a model in a sklearn wrapper with common methods, signatures, inputs, and outputs. In this example, we will see how to do this using cvxpy as the optimizer and a sklearn wrapper to produce a sklearn model object. The same thing can be done for statsmodels, pytorch, tensorflow, or any other custom logic.
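As a small illustration of how a sklearn-style wrapper can absorb different input types, the sketch below passes a nested list, a pd.DataFrame, and a pd.Series through sklearn's validation helpers. The variable names and the toy values are made up for this example; only check_array and check_X_y come from sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_X_y, check_array

# Illustrative toy inputs in three different container types.
X_list = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X_df = pd.DataFrame(X_list, columns=["a", "b"])
y_series = pd.Series([1.0, 2.0, 3.0])

# check_array validates the input and converts it to a 2-D NumPy array.
X_from_list = check_array(X_list)
X_from_df = check_array(X_df)

# check_X_y validates X and y together (consistent lengths, 1-D y).
X_checked, y_checked = check_X_y(X_df, y_series)

print(type(X_from_df), X_checked.shape, y_checked.shape)
```

Whatever container the caller used, the model code after this point only ever sees plain NumPy arrays, which is what makes a common interface practical.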
To start, we can create a conda virtual environment (assuming conda is already installed; otherwise, install conda with Python 3.8 first and then come back to this). Run the following block to create a virtual environment in conda and install the required libraries:
conda create -n sklearn_wrapper python=3.8 -y
conda activate sklearn_wrapper
pip install scikit-learn cvxpy jupyter notebook pandas
Once this is done, we are good to start. Open a Jupyter notebook and execute the following lines in a cell:
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
import cvxpy as cp
import numpy as np
import pandas as pd
Here we are importing BaseEstimator and RegressorMixin, which our custom model class will inherit from. BaseEstimator provides the get_params and set_params methods, which are useful later for changing the value of a model parameter during execution, and RegressorMixin provides a default score method (R-squared), which can be overridden. We also import check_X_y, check_array, and check_is_fitted from sklearn. check_X_y checks that the X and y we pass follow the sklearn convention, check_array checks that the variable we pass is a valid array, and check_is_fitted prevents running predict before fit. Any model needs its coefficients or weights to make a prediction, and these values are not generated until we run the fit method.
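To make the roles of these pieces concrete, here is a minimal sketch with a hypothetical toy model, MeanRegressor, which is not part of the original example. It only illustrates where get_params, set_params, score, and the not-fitted check come from.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy model: always predicts shrink * mean(y_train)."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.mean_ = self.shrink * y.mean()  # learned attribute, trailing "_"
        return self

    def predict(self, X):
        check_is_fitted(self)  # raises NotFittedError if fit was never called
        X = check_array(X)
        return np.full(X.shape[0], self.mean_)


model = MeanRegressor()
params_before = model.get_params()   # provided by BaseEstimator
model.set_params(shrink=0.5)         # provided by BaseEstimator
params_after = model.get_params()

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)

try:
    model.predict(X)                 # predict before fit
    raised = False
except NotFittedError:
    raised = True

model.set_params(shrink=1.0).fit(X, y)
r2 = model.score(X, y)               # provided by RegressorMixin (R-squared)
```

Because the toy model always predicts the training mean, its R-squared on the training data is zero, which is a handy sanity check that score really is the default R-squared.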
# Mixins are listed before BaseEstimator, the order scikit-learn recommends.
class CVXSkLearnWrapper(RegressorMixin, BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def _loss_fn(self, X, y, beta):
        # Squared L2 norm of the residuals
        return cp.norm2(X @ beta - y)**2

    def _regularizer(self, beta):
        # L1 penalty on the coefficients
        return cp.norm1(beta)

    def _obj_fn(self, X, y, beta, lambd):
        return self._loss_fn(X, y, beta) + lambd * self._regularizer(beta)

    def _mse(self, X, Y, beta):
        return (1.0 / X.shape[0]) * self._loss_fn(X, Y, beta).value

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        n = X.shape[1]
        beta = cp.Variable(n)
        lambd = cp.Parameter(nonneg=True)
        problem = cp.Problem(cp.Minimize(self._obj_fn(X, y, beta, lambd)))
        lambd.value = self.alpha
        problem.solve()
        self.coeff_ = beta.value
        self.intercept_ = 0.0  # no intercept term is estimated
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X @ self.coeff_
Here is the implementation of LASSO using CVXPY as a sklearn model. In the __init__ method we take alpha as user input. Note that every argument of the __init__ method should have a default value. When storing the arguments on self, the attribute name must be the same as the argument name. For example, if we use __init__(self, alpha=1.0), then it must be stored as self.alpha = alpha. This is a parameter that can be read and changed during execution using the get_params and set_params methods. To know more about this optimization, check this link. Inside the fit method we have X, y = check_X_y(X, y), which ensures that the X and y passed to fit are correct with respect to data type and shape. Also, the fit method of a sklearn model always returns self. By sklearn convention, any attribute estimated from the data has a trailing underscore (_); here, too, we save the coefficients in self.coeff_. In predict, we use check_is_fitted(self) to check whether the model has been fitted, and check_array(X) to ensure that the argument X is a valid array.
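Read off the _loss_fn, _regularizer, and _obj_fn methods, the problem that fit hands to CVXPY can be written as follows, where m is the number of rows of X, n the number of columns, and alpha the penalty weight. The symbols are chosen here for illustration.

$$
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{n}} \; \lVert X\beta - y \rVert_2^2 + \alpha \, \lVert \beta \rVert_1
$$

The helper _mse reuses the same loss term and rescales it by the number of rows:

$$
\mathrm{MSE}(\beta) = \frac{1}{m} \, \lVert X\beta - y \rVert_2^2
$$

Note that the objective has no intercept term, which matches intercept_ being fixed at 0.0 in fit.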
In the following block we generate some synthetic data to fit the above model. Here beta_star is the true parameter vector. X and Y are the data on which we will fit the model, and we will try to estimate the unknown beta_star with the derived beta_hat.
def generate_data(m=100, n=20, sigma=5, density=0.2):
    """Generates data matrix X and observations Y."""
    np.random.seed(1)
    beta_star = np.random.randn(n)
    # Zero out a (1 - density) fraction of the true coefficients
    idxs = np.random.choice(range(n), int((1 - density) * n), replace=False)
    for idx in idxs:
        beta_star[idx] = 0
    X = np.random.randn(m, n)
    Y = X.dot(beta_star) + np.random.normal(0, sigma, size=m)
    return X, Y, beta_star
In the following block we generate the synthetic dataset and initialize the lasso model with the CVXSkLearnWrapper class. After that we call fit, which estimates the coefficients beta_hat; predict generates the predictions y_hat, and score calculates the R-squared between y and y_hat.
X, y, _ = generate_data()
lasso = CVXSkLearnWrapper(alpha = 1.1)
model = lasso.fit(X, y)
model.predict(X)
model.score(X, y)
Notes:
- Here we are using RegressorMixin, but depending on the model type it can be any of ClassifierMixin, RegressorMixin, ClusterMixin, or TransformerMixin.
- If you are implementing more than one model to build a library, all the methods and their signatures should be the same across models.
- If a method in the model class has a _ prefix, it is treated by convention as an internal method: it is not part of the public API and is meant to be used within the class.
- If we inherit from BaseEstimator and RegressorMixin, there will be a default score function, but it can be overridden.
- To change the score function in hyper-parameter tuning, you can use make_scorer and its greater_is_better argument.
- Dynamic parsing of args is possible in the __init__ method using the inspect module:
import inspect

# Schematic signature: arg1 ... argN stand in for the real hyper-parameters.
def __init__(self, arg1, arg2, arg3, ..., argN):
    args, _, _, values = inspect.getargvalues(inspect.currentframe())
    values.pop("self")
    for arg, val in values.items():
        setattr(self, arg, val)
- In this example, we override the score function using mean_absolute_percentage_error. This can be useful if, for some reason, we want to use MAPE instead of the default scoring method. Keep in mind that a lower MAPE is better, whereas sklearn tools that call score directly treat a higher value as better. Otherwise, make_scorer can be used for any other custom score function or for other sklearn score functions.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
import cvxpy as cp
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

# Mixins are listed before BaseEstimator, the order scikit-learn recommends.
class CVXSkLearnWrapper(RegressorMixin, BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def _loss_fn(self, X, y, beta):
        return cp.norm2(X @ beta - y)**2

    def _regularizer(self, beta):
        return cp.norm1(beta)

    def _obj_fn(self, X, y, beta, lambd):
        return self._loss_fn(X, y, beta) + lambd * self._regularizer(beta)

    def _mse(self, X, Y, beta):
        return (1.0 / X.shape[0]) * self._loss_fn(X, Y, beta).value

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        n = X.shape[1]
        beta = cp.Variable(n)
        lambd = cp.Parameter(nonneg=True)
        problem = cp.Problem(cp.Minimize(self._obj_fn(X, y, beta, lambd)))
        lambd.value = self.alpha
        problem.solve()
        self.coeff_ = beta.value
        self.intercept_ = 0.0
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X @ self.coeff_

    def score(self, X, y):
        # Overrides the default R-squared score with MAPE (lower is better)
        y_hat = self.predict(X)
        return mean_absolute_percentage_error(y, y_hat)
- All the hyper-parameters (not derived from data) have to be initialized in the __init__ method. Any model parameters (derived from data) must be initialized in fit. Attribute names in __init__ should always be the same as the argument names, and attributes set in fit should always have the suffix _.
- fit and predict are the mandatory methods for a regressor class built on BaseEstimator.
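The make_scorer note above can be sketched as follows. To keep the block self-contained, sklearn's built-in Lasso stands in for the custom wrapper, and the data, the grid of alpha values, and the variable names are all made up for the example. Setting greater_is_better=False tells sklearn that the metric is a loss, so the search negates it internally and still picks the candidate with the lowest MAPE.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV

# Illustrative data; the offset of 10 keeps y away from zero so MAPE is stable.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 10.0 + X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Wrap MAPE as a scorer; greater_is_better=False marks it as a loss.
mape_scorer = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

search = GridSearchCV(
    Lasso(),                                   # stand-in for the custom wrapper
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    scoring=mape_scorer,
    cv=3,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mape = -search.best_score_                # undo the sign flip to get MAPE
print(best_alpha, best_mape)
```

With this approach the estimator's own score method can stay as the default R-squared, and the metric used for tuning is chosen at the call site instead.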