Some cool open-source Python packages for Machine Learning Ep 1

moebius

The Python ecosystem of ML-related libraries is very rich. Here is a list of some “active”, open-source packages that may be useful for day-to-day ML work. Of course, this list is far from exhaustive and should evolve as fast as the Python ecosystem does. We also exclude the following from the current list:

main ML algorithm frameworks (scikit-learn, LightGBM, PyTorch, …),
famous user-friendly libraries built on top of deep-learning libraries (fastai, Keras, …),
specific application-oriented libraries (spaCy, scikit-image, StellarGraph, …),
packages dealing with the general data/analytics environment (JupyterLab, Pandas, Dask, Conda, …) that are also used in many other domains, even if some of the following tools are more on the data-engineering side than on the ML one.

We hope you will find this list informative!

Data cleaning

Pyjanitor - a clean API for cleaning data.

Auto-ML

Featuretools - a library for automated feature engineering.
TPOT - an automated tool that optimizes ML pipelines using genetic programming.
Scikit-Optimize - a simple and efficient library to minimize expensive and noisy black-box functions.
Randopt - a package for ML experiment management, hyper-parameter optimization, and results visualization.
Optuna - an automatic hyper-parameter optimization software framework, particularly designed for ML.
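
To make the idea of hyper-parameter optimization as black-box minimization concrete, here is a minimal sketch of plain random search using only the standard library. It does not use the API of any package listed above: the objective function, the parameter names (learning_rate, depth) and the search ranges are all hypothetical stand-ins for a real validation loss, and the tools above typically use smarter strategies than random sampling.

```python
import random


def noisy_objective(learning_rate: float, depth: int, rng: random.Random) -> float:
    """Hypothetical stand-in for a validation loss (lowest near lr=0.1, depth=6)."""
    noise = rng.gauss(0.0, 0.01)  # simulates run-to-run variance
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2 + noise


def random_search(n_trials: int, seed: int = 0):
    """Sample parameters at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [1e-3, 1]
            "depth": rng.randint(2, 12),                # inclusive integer range
        }
        score = noisy_objective(rng=rng, **params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=200)
print(best_params, best_score)
```

Fixing the seed makes the search reproducible, which helps when comparing it against a more elaborate optimizer.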

Dimension reduction and visualization

UMAP - Uniform Manifold Approximation and Projection is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction.

Model analysis

ELI5 - a library for visualizing and debugging various ML models through a unified API.
Yellowbrick - a suite of visual diagnostic tools called “Visualizers” that extend the scikit-learn API to allow human steering of the model selection process.
SHAP - SHapley Additive exPlanations is a unified approach to explain the output of any ML model.
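
As an illustration of the Shapley values that SHAP builds on, the sketch below computes them exactly for a tiny model by enumerating every coalition of features. It is not the SHAP library's API or its algorithm (which relies on much faster approximations): toy_model, the input x and the all-zero baseline are hypothetical, and "removing" a feature is assumed to mean replacing it with its baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def toy_model(features):
    # Hypothetical model: two linear terms plus one interaction term.
    return 2.0 * features[0] + 1.0 * features[1] + features[0] * features[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at x and the prediction at the baseline, and the interaction term is split evenly between the two features involved.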

Experimentation frameworks and tools

Guild AI - a toolkit that automates and optimizes ML experiments.
ModelChimp - an experiment tracker for Deep Learning and ML experiments.
Sacred - a tool to help you configure, organize, log and reproduce experiments.
SKLL - SciKit-Learn Laboratory provides command-line utilities to make it easier to run ML experiments with scikit-learn.
DVC - Data Version Control is a tool for versioning the data and models of data science and ML projects.
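
The core idea behind these trackers, storing each run's configuration next to its results so the run can be reproduced later, can be sketched with the standard library alone. This is not how any of the tools above are implemented: run_experiment, tracked_run, the run_NNNN.json file naming and the fake "accuracy" metric are all hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def run_experiment(config: dict) -> dict:
    """Hypothetical experiment: the 'metric' is just a seeded random draw."""
    rng = random.Random(config["seed"])
    return {"accuracy": round(0.8 + 0.1 * rng.random(), 4)}


def tracked_run(config: dict, log_dir: Path) -> Path:
    """Run the experiment and store its config and metrics side by side."""
    metrics = run_experiment(config)
    record = {"config": config, "metrics": metrics}
    run_id = len(list(log_dir.glob("run_*.json")))  # next free run number
    path = log_dir / f"run_{run_id:04d}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


log_dir = Path(tempfile.mkdtemp())
first = tracked_run({"seed": 42, "learning_rate": 0.01}, log_dir)
second = tracked_run({"seed": 42, "learning_rate": 0.01}, log_dir)

# Same config (including the seed) should give the same metrics.
same = json.loads(first.read_text())["metrics"] == json.loads(second.read_text())["metrics"]
print(first.name, second.name, same)
```

Because the seed is part of the logged configuration, rerunning a stored config gives back the same metrics, which is the reproducibility property these tools aim for at a much larger scale.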

Model export

ONNXMLTools - a tool to convert models from different ML toolkits into the ONNX (Open Neural Network Exchange) format.

Workflows

MLflow - a platform to streamline ML development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.
Kubeflow - a cloud-native platform for ML based on Google’s internal ML pipelines.

Data pipelines

Kedro - a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned.
Dagster - a system for building modern data applications.
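
Both tools describe a pipeline as a graph of functions connected by named inputs and outputs. The sketch below shows that idea with the standard library's graphlib module (Python 3.9+). It is not the Kedro or Dagster API: the NODES layout, the node names and the three toy functions are hypothetical.

```python
from graphlib import TopologicalSorter


def load():
    # Hypothetical raw data with one missing value.
    return [3, 1, 2, None, 5]


def clean(raw):
    return [value for value in raw if value is not None]


def summarize(cleaned):
    return {"count": len(cleaned), "mean": sum(cleaned) / len(cleaned)}


# Each output name maps to (function, names of the inputs it depends on).
NODES = {
    "summary": (summarize, ["cleaned"]),
    "cleaned": (clean, ["raw"]),
    "raw": (load, []),
}


def run_pipeline(nodes):
    """Run every node once, after all of its dependencies."""
    graph = {name: deps for name, (_, deps) in nodes.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = nodes[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_pipeline(NODES)
print(results["summary"])
```

Declaring dependencies by name means the execution order is derived from the graph rather than from the order in which nodes are written, which is why NODES can list "summary" first.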