First try of auto-sklearn
Since we are big users of scikit-learn and XGBoost, we wanted to try a package that would automate the process of building a machine learning model with these tools. Here is the introduction to auto-sklearn from its github.io website:
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction.
Here is the Wikipedia definition of AutoML:
Automated machine learning (AutoML) is the process of automating end-to-end the process of applying machine learning to real-world problems. In a typical machine learning application, practitioners have a dataset consisting of input data points to train on. The raw data itself may not be in a form that all algorithms may be applicable to it out of the box. An expert may have to apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning.
The theory behind the auto-sklearn package is presented in a paper published at NIPS 2015 (to be honest, we did not read the article, because we just wanted to give auto-sklearn a very quick try with some very basic examples).
Here are the main features of the API. You can:
- set time and memory limits
- restrict the search space by selecting or excluding some preprocessing methods or some estimators
- specify some resampling strategies (e.g. 5-fold cv)
- perform parallel computation with the SMAC algorithm (Sequential Model-based Algorithm Configuration), with the processes sharing data through the file system
- save your model as you would with any scikit-learn estimator (using pickle)
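Several of these options can be combined directly in the classifier's constructor. Here is a minimal sketch based on the keyword arguments documented for the 0.5.x API (the values are arbitrary examples, not tuned recommendations):

import pickle

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,   # overall time budget, in seconds
    per_run_time_limit=60,         # time limit for a single model fit
    ml_memory_limit=3072,          # memory limit (MB) for each algorithm run
    include_estimators=["random_forest", "libsvm_svc"],  # restrict the search space...
    include_preprocessors=["no_preprocessing", "pca"],   # ...and the preprocessing
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},          # 5-fold CV
)

# After automl.fit(X_train, y_train) (plus automl.refit(X_train, y_train)
# when the 'cv' resampling strategy is used), the model can be saved like
# any scikit-learn estimator:
with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f)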
We are going to try it on two toy classification datasets:
- hand-written digits
- breast cancer
Installation
At first we got a segmentation fault when running the model. We found some useful information on GitHub, and here are the steps we performed to install auto-sklearn (0.5.2) on Ubuntu 18.04, in a Python 3.7 conda environment:
$ conda create -n autosklearn python=3.7
$ source activate autosklearn
(autosklearn) $ conda install -c conda-forge 'swig<4'
(autosklearn) $ conda install gxx_linux-64 gcc_linux-64
(autosklearn) $ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
(autosklearn) $ pip install 'pyrfr>=0.8'
(autosklearn) $ pip install auto-sklearn
Note that the scikit-learn version pinned by auto-sklearn is 0.19.2, while the latest release is 0.21.3.
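A quick sanity check that the environment ended up with the expected versions:

import autosklearn
import sklearn

print(sklearn.__version__)      # expected: 0.19.2
print(autosklearn.__version__)  # expected: 0.5.2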
First data set
This is a description of the UCI ML hand-written digits dataset: each datapoint is an 8x8 image of a digit. However, scikit-learn only includes a subset (a copy of the UCI test set), so we can say that it is rather small (table from scikit-learn’s documentation):
Characteristic | Value |
---|---|
Classes | 10 |
Samples per class | ~180 |
Samples total | 1797 |
Dimensionality | 64 |
Features | integers 0-16 |
Basic algorithm
Let’s run a very simple algorithm (taken from this scikit-learn example) in order to compare elapsed time and accuracy with auto-sklearn:
%%time
import sklearn.datasets
import sklearn.model_selection
from sklearn import svm
from sklearn.metrics import accuracy_score

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)
y_hat = classifier.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, y_hat): 6.3f}")
Accuracy score: 0.991
CPU times: user 262 ms, sys: 0 ns, total: 262 ms
Wall time: 316 ms
Auto-sklearn
This is the first Python script given in the auto-sklearn documentation:
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
from sklearn.metrics import accuracy_score
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, y_hat): 6.3f}")
We made just one small change: since we want to make sure several cores are used, we instantiate the auto-sklearn model with this argument:
automl = autosklearn.classification.AutoSklearnClassifier(n_jobs=4)
And let’s run it…
Accuracy score: 0.993
14020,06s user 3161,23s system 477% cpu 1:00:01,78 total
The first remark is that it takes a very long time to run: about an hour. However, we did not set any constraint on the algorithms to try, nor any time limit.
The accuracy score is not bad at all: 0.993! Note that the auto-sklearn algorithm did not see the test set at all, so there is no data leakage.
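Out of curiosity, the fitted object can also tell us what it built; sprint_statistics and show_models are part of the AutoSklearnClassifier API and both return printable summaries:

# summary of the search: number of runs, best validation score, etc.
print(automl.sprint_statistics())

# the final weighted ensemble of pipelines
print(automl.show_models())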
Algorithm | Elapsed time (s) | Accuracy score |
---|---|---|
Basic | 0.32 | 0.991 |
auto-sklearn | 3601.78 | 0.993 |
Second data set
This is a description of the Breast Cancer Wisconsin (Diagnostic) Data Set: each datapoint is a collection of geometric measurements of a breast mass, computed from a digitized image. Here is the description table from scikit-learn’s documentation:
Characteristic | Value |
---|---|
Classes | 2 |
Samples per class | 212 (M), 357 (B) |
Samples total | 569 |
Dimensionality | 30 |
Features | real, positive |
(M: malignant, B: benign)
Again, this is a very small dataset.
Basic algorithm
We run a logistic regression classifier with default settings, and without any preprocessing except scaling:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

RS = 124  # random seed

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# flip the labels so that malignant, the minority class, is the positive one
df.target = 1 - df.target
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=RS)

# scaling followed by a default logistic regression
rsc = RobustScaler()
lrc = LogisticRegression(random_state=RS, n_jobs=-1)
pipeline = make_pipeline(rsc, lrc)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, predictions): 6.3f}")
print(f"Balanced accuracy score: {roc_auc_score(y_test, predictions): 6.3f}")
Accuracy score: 0.971
Balanced accuracy score: 0.964
python breast_cancer_01.py 0,74s user 0,27s system 167% cpu 0,603 total
We compute the balanced accuracy score since the classes are a little imbalanced. We could not find sklearn.metrics.balanced_accuracy_score in the 0.19.2 release (it was only added in scikit-learn 0.20), so we used roc_auc_score instead, which is equivalent for binary classification as long as hard 0/1 predictions, rather than probabilities, are passed to it.
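To convince ourselves of that equivalence: with hard predictions the ROC curve has a single intermediate point (FPR, TPR), and the area under it reduces to (TPR + TNR) / 2, which is exactly the balanced accuracy. A tiny check on made-up labels:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# balanced accuracy = mean of sensitivity (TPR) and specificity (TNR)
balanced_acc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print(balanced_acc, roc_auc_score(y_true, y_pred))  # both ~0.733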
Auto-sklearn
The auto-sklearn code is very similar to the previous one:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
import autosklearn.classification

RS = 124  # random seed

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# flip the labels so that malignant, the minority class, is the positive one
df.target = 1 - df.target
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=RS)

# same data preparation as before, but the whole pipeline search is
# delegated to auto-sklearn
automl = autosklearn.classification.AutoSklearnClassifier(n_jobs=4)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
print(f"Accuracy score : {accuracy_score(y_test, predictions): 6.3f}")
print(f"Balanced accuracy score : {roc_auc_score(y_test, predictions): 6.3f}")
python autosklearn_02.py 15038,88s user 923,99s system 443% cpu 1:00:01,35 total
Accuracy score : 0.947
Balanced accuracy score : 0.936
Algorithm | Elapsed time (s) | Balanced accuracy score |
---|---|---|
Basic | 0.60 | 0.964 |
auto-sklearn | 3601.35 | 0.936 |
Final thoughts
It is intriguing that the elapsed time for each dataset is almost exactly one hour with auto-sklearn. It turns out that this is the default time budget: the time_left_for_this_task argument defaults to 3600 seconds, so auto-sklearn simply used its whole budget on both datasets.
Anyway, in both cases auto-sklearn gives decent accuracy results, but at a very large computational cost relative to the size of the problem.
We should try some other examples, and maybe reduce the search space and give some hints to the algorithm, as in the sketch below.
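For instance, a cheaper follow-up run could reuse the constructor options sketched at the beginning of this post: same API, but a five-minute budget and a search restricted to a couple of fast linear models (the estimator names below come from auto-sklearn's component list; the values are arbitrary examples):

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes instead of the default hour
    per_run_time_limit=30,        # at most 30 s for a single model fit
    include_estimators=["liblinear_svc", "sgd"],
    n_jobs=4,
)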
Although we often use automatic hyperparameter tuning tools, it seems that automating the whole end-to-end process is not very efficient, energy-wise :) However, we are still interested in trying auto-sklearn on some other test cases, and some other AutoML packages as well.