Some cool open-source Python packages for Machine Learning Ep 2

Aug 8, 2019 • François Pacull

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful in day-to-day ML work.

This post follows an earlier one:

Some cool open-source Python packages for Machine Learning EP 1 (2019/07/11)

(☞ﾟヮﾟ)☞

Database connectivity

Turbodbc - a module to access relational databases via the Open Database Connectivity (ODBC) interface.
ibis - a toolbox to bridge the gap between local Python environments, remote storage, execution systems like Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases.

Data description

Pandas Profiling - generates profile reports from a pandas DataFrame.
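
To give a rough idea of what a "profile" contains, here is a minimal sketch in plain Python that summarizes each column of a small table. It does not use Pandas Profiling, whose reports are far richer; the function name `profile_column` and the sample data are made up for illustration.

```python
import statistics

def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, basic stats."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Numeric statistics only make sense when every present value is a number
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["mean"] = statistics.mean(present)
        summary["min"] = min(present)
        summary["max"] = max(present)
    return summary

# Hypothetical table stored as a dict of columns, with None marking missing cells
table = {"age": [31, 45, None, 31], "city": ["Lyon", "Paris", "Lyon", None]}
report = {name: profile_column(col) for name, col in table.items()}

for name, summary in report.items():
    print(name, summary)
```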

Data preparation

Snorkel - a system for quickly generating training data with weak supervision.
imbalanced-learn - a toolbox to tackle the curse of imbalanced datasets in machine learning.
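
As an illustration of the imbalanced-data problem, here is a minimal sketch of random over-sampling, one of the simplest resampling strategies, written in plain Python. It is not imbalanced-learn's API; the function name `random_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the rows that belong to this class
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw, with replacement, as many rows as needed to reach the majority size
        for i in rng.choices(idx, k=target - count):
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

# Toy dataset: four samples of class 0, one sample of class 1
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Resampling like this should only be applied to the training split, otherwise duplicated rows leak into the evaluation data.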

Feature engineering

dirty_cat - helps with machine learning on non-curated categories by providing encoders that are robust to morphological variants, such as typos, in the category strings.
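
To show why such encoders tolerate typos, here is a minimal sketch of a character n-gram similarity in plain Python: each category string is encoded by its similarity to a few reference categories instead of a strict one-hot match. This is only the general idea, not dirty_cat's API or its exact formula; `ngrams`, `ngram_similarity` and `similarity_encode` are hypothetical names.

```python
def ngrams(s, n=3):
    """Return the set of character n-grams of a string, lowercased and padded with spaces."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (1.0 means identical sets)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]

prototypes = ["London", "Paris"]
encoded = similarity_encode(["London", "Londn", "paris"], prototypes)
for row in encoded:
    print(row)
```

The misspelled "Londn" still ends up closer to "London" than to "Paris", whereas a one-hot encoder would treat it as an unrelated category.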

Dimension reduction

ivis - a machine learning algorithm for reducing dimensionality of very large datasets.

Auto-ML

auto-sklearn - an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
Auto-Keras - an open-source software library for automated machine learning.
Keras Tuner - a hyperparameter tuner for Keras.

Model analysis

Skater - a unified framework to enable model interpretation for all forms of models.

Workflow management

prefect - a workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine.
papermill - a tool for parameterizing, executing, and analyzing Jupyter Notebooks.

Model management

Studio - a model management framework written in Python to help simplify and expedite your model building experience.

Data visualization

kepler.gl - a powerful open-source geospatial analysis tool for large-scale data sets, with a Jupyter widget to render large-scale interactive maps in Jupyter Notebook.
glue - a library to explore relationships within and among related datasets.
KeplerMapper - an implementation of the TDA Mapper algorithm for visualization of high-dimensional data.

Models

pytorch-transformers - a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
spacy-pytorch-transformers - provides spaCy model pipelines that wrap Hugging Face’s pytorch-transformers package, so you can use them in spaCy.

Time series

STUMPY - a powerful and scalable library that can be used for a variety of time series data mining tasks.
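
STUMPY is built around the matrix profile: for each subsequence of a time series, the distance to its most similar other subsequence. Here is a naive, quadratic-time sketch of that idea in plain Python. It does not use STUMPY, which relies on much faster algorithms; the function names, the exclusion-zone width and the handling of constant windows are assumptions made for this example.

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window is mapped to all zeros (a simplification)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]

def naive_matrix_profile(series, m):
    """For each length-m subsequence, return the distance to its nearest non-trivial neighbor."""
    n_sub = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    excl = math.ceil(m / 4)  # assumed exclusion zone, to skip overlapping trivial matches
    profile = []
    for i in range(n_sub):
        best = math.inf
        for j in range(n_sub):
            if abs(i - j) <= excl:
                continue
            # Euclidean distance between the two z-normalized subsequences
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            best = min(best, d)
        profile.append(best)
    return profile

# Toy series in which the pattern [0, 1, 3, 1, 0] appears twice (at positions 0 and 9)
series = [0, 1, 3, 1, 0, 5, 2, 7, 4, 0, 1, 3, 1, 0, 6, 6, 2]
profile = naive_matrix_profile(series, m=5)
motif_index = min(range(len(profile)), key=profile.__getitem__)
print(motif_index, profile[motif_index])
```

Low values in the profile point to repeated patterns (motifs), while high values point to unusual subsequences (discords).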