A curated list of awesome resources for practicing data science using Python. This list includes not only packages, but links to other resources such as tutorials, code snippets and talks.
- Core
- Pandas and Jupyter
- Extraction
- Big Data
- Exploration and Cleaning
- Feature Engineering
- Feature Selection
- Dimensionality Reduction
- Visualization
- Geographical Tools
- Recommender Systems
- Decision Trees
- Natural Language Processing (NLP) / Text Processing
- Automated Machine Learning
- Evolutionary Algorithms & Optimization
- Image processing
- Neural Networks
- Regression
- Classification
- Clustering
- Interpretable Classifiers and Regressors
- Multi-label classification
- Time Series
- Financial Data
- Survival Analysis
- Outlier Detection & Anomaly Detection
- Ranking
- Bayes
- Stacking Models
- Model Evaluation
- Model Explanation and Feature Importance
- Hyperparameter Tuning
- Reinforcement Learning
- Frameworks
- Lifecycle Management
- Other
- General Python Programming
- Other Lists
- Things I google a lot
pandas - Data structures built on top of numpy.
scikit-learn - Core ML library.
matplotlib - Plotting library.
seaborn - Python data visualization library based on matplotlib.
pandas_summary - Basic statistics using DataFrameSummary(df).summary().
pandas_profiling - Descriptive statistics using ProfileReport.
sklearn_pandas - Helpful DataFrameMapper class.
janitor - Clean messy column names.
missingno - Missing data visualization.
General tricks: link
cookiecutter-data-science - Project template for data science projects.
nteract - Open Jupyter Notebooks with a double-click.
modin - Parallelization library for faster pandas DataFrame.
xarray - Extends pandas to n-dimensional arrays.
blackcellmagic - Code formatting for jupyter notebooks.
pivottablejs - Drag n drop Pivot Tables and Charts for jupyter notebooks.
qgrid - Pandas DataFrame sorting.
nbdime - Diff two notebook files, Alternative GitHub App: ReviewNB.
textract - Extract text from any document.
Awesome List: AI on Kubernetes
spark - DataFrame for big data.
spark cheatsheet
dask - Pandas DataFrame for big data, talk.
dask-ml - Scalable machine learning.
turicreate - Helpful SFrame class for out-of-memory dataframes.
h2o - Helpful H2OFrame class for out-of-memory dataframes.
ray - Flexible, high-performance distributed execution framework.
sparkit-learn - PySpark + Scikit-learn.
mars - Tensor-based unified framework for large-scale data computation.
ni - Command line tool for big data.
xsv - Command line tool for indexing, slicing, analyzing, splitting and joining CSV files.
csvkit - Another command line tool for CSV files.
csvsort - Sort large csv files.
fancyimpute - Matrix completion and imputation algorithms.
imbalanced-learn - Resampling for imbalanced datasets.
tspreprocess - Time series preprocessing: Denoising, Compression, Resampling.
sklearn - Pipeline, examples.
skoot - Pipeline helper functions.
categorical-encoding - Categorical encoding of variables.
patsy - R-like syntax for statistical models.
mlxtend - LDA.
featuretools - Automated feature engineering, example.
tsfresh - Time series feature engineering.
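As a rough illustration of the sklearn Pipeline entry above, the sketch below chains a scaler and a classifier so that preprocessing and model are fitted together. The synthetic data and the step names `scale` and `clf` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 200 rows, 3 features,
# label depends linearly on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Each step is a (name, estimator) pair; the last step is the model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() runs fit_transform on the scaler, then fits the classifier.
pipe.fit(X, y)
train_accuracy = pipe.score(X, y)
print(f"training accuracy: {train_accuracy:.2f}")
```

Because the scaler lives inside the pipeline, passing `pipe` to cross-validation refits the scaler on each training fold, which avoids leaking statistics from the validation fold.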
Tutorial, Talk
scikit-feature - Feature selection algorithms.
stability-selection - Stability selection.
scikit-rebate - Relief-based feature selection algorithms.
boruta_py - Feature selection, explanation, example.
linselect - Feature selection package.
prince - Dimensionality reduction, factor analysis (PCA, MCA, CA, FAMD).
sklearn - Multidimensional scaling.
sklearn - t-distributed Stochastic Neighbor Embedding. Faster implementations: lvdmaaten, MulticoreTSNE.
sklearn - Truncated SVD (aka LSA).
mdr - Dimensionality reduction, multifactor dimensionality reduction (MDR).
umap - Uniform Manifold Approximation and Projection.
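For the "Truncated SVD (aka LSA)" entry above, here is a minimal sketch that projects TF-IDF vectors of a few toy documents onto two latent components. The documents are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption for illustration).
docs = [
    "pandas dataframe groupby aggregation",
    "pandas dataframe merge join",
    "neural network gradient descent",
    "neural network backpropagation gradient",
]

# TF-IDF yields a sparse matrix; TruncatedSVD works on it directly,
# without densifying or centering (unlike PCA).
tfidf = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
embedding = svd.fit_transform(tfidf)

print(embedding.shape)
```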
All charts, Austrian monuments.
cufflinks - Dynamic visualization library, wrapper for plotly, medium, example.
physt - Better histograms, talk.
joypy - Draw stacked density plots.
yellowbrick - Wrapper for matplotlib for diagnostic ML plots.
bokeh - Interactive visualization library, Examples, Examples.
altair - Declarative statistical visualization library.
holoviews - Visualization library.
dtreeviz - Decision tree visualization and model interpretation.
chartify - Generate charts.
panel - Dashboarding solution.
dash - Dashboarding solution.
VivaGraphJS - Graph visualization (JS package).
pm - Navigable 3D graph visualization (JS package), example.
visdom - Dashboarding library.
python-ternary - Triangle plots.
folium - Plot geographical maps using the Leaflet.js library.
stadiamaps - Plot geographical maps.
datashader - Draw millions of points on a map.
sklearn - BallTree, Example.
pynndescent - Nearest neighbor descent for approximate nearest neighbors.
geocoder - Geocoding of addresses, IP addresses.
Conversion of different geo formats: talk, repo
geopandas - Tools for geographic data
Low Level Geospatial Tools (GEOS, GDAL/OGR, PROJ.4)
Vector Data (Shapely, Fiona, Pyproj)
Raster Data (Rasterio)
Plotting (Descartes, Cartopy)
Predict economic indicators from Open Street Map ipynb.
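The BallTree entry above is often used for nearest-neighbour lookups on coordinates. The sketch below is one possible way to do that with the haversine metric; the city list, the query point, and the Earth radius constant are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

# (latitude, longitude) in degrees; sample cities are illustrative.
cities = {
    "Vienna": (48.2082, 16.3738),
    "Berlin": (52.5200, 13.4050),
    "Paris": (48.8566, 2.3522),
    "Rome": (41.9028, 12.4964),
}
names = list(cities)

# The haversine metric expects [lat, lon] pairs in radians.
coords_rad = np.radians(np.array([cities[name] for name in names]))
tree = BallTree(coords_rad, metric="haversine")

# Query point: Prague, also converted to radians.
query = np.radians([[50.0755, 14.4378]])
dist_rad, idx = tree.query(query, k=2)

# Distances come back as angles on the unit sphere; scale by Earth's radius.
EARTH_RADIUS_KM = 6371.0
dist_km = dist_rad[0] * EARTH_RADIUS_KM
nearest = [names[i] for i in idx[0]]
print(nearest, dist_km.round(0))
```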
List
Microsoft Repo
Examples: 1, 2, 2-ipynb, 3.
surprise - Recommender, talk.
turicreate - Recommender.
implicit - Fast Python Collaborative Filtering for Implicit Feedback Datasets.
spotlight - Deep recommender models using PyTorch.
lightfm - Recommendation algorithms for both implicit and explicit feedback.
lightgbm - Gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, doc.
xgboost - Gradient boosting (GBDT, GBRT or GBM) library, doc, Methods for CIs: link1, link2.
catboost - Gradient boosting.
h2o - Gradient boosting.
forestci - Confidence intervals for random forests.
scikit-garden - Quantile Regression.
grf - Generalized random forest.
dtreeviz - Decision tree visualization and model interpretation.
rfpimp - Feature Importance for RandomForests using Permutation Importance.
Why the default feature importance for random forests is wrong: link
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
bartpy - Bayesian Additive Regression Trees.
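To make the permutation importance idea mentioned above concrete, the following sketch uses scikit-learn's own `permutation_importance` helper (a swapped-in alternative, not the rfpimp API) on synthetic data where only the first feature carries signal. The data and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean.round(3))
```

Measuring the drop on held-out data is the main difference from the impurity-based `feature_importances_` attribute, which is computed from the training data during tree construction.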
Awesome Sentence Embedding List.
talk-nb, nb2, talk.
Text classification Intro, Preprocessing blog post.
gensim - NLP, doc2vec, word2vec, text processing, topic modelling (LSA, LDA), Example, Coherence Model for evaluation.
Embeddings - GloVe ([1], [2]), StarSpace, wikipedia2vec.
pyldavis - Visualization for topic modelling.
spaCy - NLP.
NLTK - NLP, helpful KMeansClusterer with cosine_distance.
pytext - NLP from Facebook.
fastText - Efficient text classification and representation learning.
annoy - Approximate nearest neighbor search.
faiss - Approximate nearest neighbor search.
pysparnn - Approximate nearest neighbor search.
infomap - Cluster (word-)vectors to find topics, example.
textract - Extract text from any document.
datasketch - Probabilistic data structures for large data (MinHash, HyperLogLog).
flair - NLP Framework by Zalando.
stanfordnlp - NLP Library.
AdaNet - Automated machine learning based on tensorflow.
tpot - Automated machine learning tool, optimizes machine learning pipelines.
auto_ml - Automated machine learning for analytics & production.
autokeras - AutoML for deep learning.
deap - Evolutionary computation framework (Genetic Algorithm, Evolution strategies).
evol - DSL for composable evolutionary algorithms, talk.
platypus - Multiobjective optimization.
nevergrad - Derivative-free optimization.
gplearn - Sklearn-like interface for genetic programming.
blackbox - Optimization of expensive black-box functions.
Optometrist algorithm - paper.
cv2 - OpenCV, classical algorithms: Gaussian Filter, Morphological Transformations.
scikit-image - Image processing.
mahotas - Image processing (Bioinformatics), example.
Convolutional Neural Networks for Visual Recognition
Awesome Deep Learning List
Awesome Semantic Segmentation List
keras preprocessing - Preprocess images.
imgaug - More sophisticated image preprocessing.
tcav - Interpretability method.
keras - Neural Networks on top of tensorflow.
hyperas - Keras + Hyperopt: Convenient hyperparameter optimization wrapper.
elephas - Distributed Deep learning with Keras & Spark.
tflearn - Neural Networks on top of tensorflow.
tensorlayer - Neural Networks on top of tensorflow, tricks.
tensorforce - Tensorflow for applied reinforcement learning.
fastai - Neural Networks in pytorch.
Detectron - Object Detection by Facebook.
autokeras - AutoML for deep learning.
simpledet - Object Detection and Instance Recognition.
PlotNeuralNet - Plot neural networks.
pyearth - Multivariate Adaptive Regression Splines (MARS), tutorial.
pygam - Generalized Additive Models (GAMs), Explanation.
pyclustering - All sorts of clustering algorithms.
somoclu - Self-organizing map.
hdbscan - Clustering algorithm.
nmslib - Similarity search library and toolkit for evaluation of k-NN methods.
sklearn-expertsys - Interpretable classifiers, producing easily understood decision rules instead of black box models.
sklearn-interpretable-tree - Simplified tree-based classifier and regressor for interpretable machine learning.
skope-rules - Interpretable classifier, IF-THEN rules.
scikit-multilearn - Multi-label classification, talk.
Awesome Time Series List
Awesome Time Series Anomaly Detection List
Signal Processing Book
Filter Design: Article, Interactive Tool, Filter examples
statsmodels - Time series analysis, seasonal decompose example, SARIMA, granger causality.
pyramid, pmdarima - Wrapper for (Auto-) ARIMA.
pyflux - Time series prediction algorithms (ARIMA, GARCH, GAS, Bayesian).
prophet - Time series prediction library.
htsprophet - Hierarchical Time Series Forecasting using Prophet.
tensorflow - LSTM and others, examples: link, link, link, Explain LSTM
tspreprocess - Preprocessing: Denoising, Compression, Resampling.
tsfresh - Time series feature engineering.
thunder - Data structures and algorithms for loading, processing, and analyzing time series data.
gatspy - General tools for Astronomical Time Series, talk.
gendis - shapelets, example.
tslearn - Time series clustering and classification, TimeSeriesKMeans, TimeSeriesKMeans.
pastas - Simulation of time series.
fastdtw - Dynamic Time Warp Distance.
fable - Time Series Forecasting (R package).
CausalImpact - Causal Impact Analysis (R package).
PyAF - Automatic Time Series Forecasting.
luminol - Anomaly Detection and Correlation library from Linkedin.
matrixprofile-ts - Detecting patterns and anomalies, website, ppt.
obspy - Seismology package. Useful classic_sta_lta function.
pyfolio - Portfolio and risk analytics.
zipline - Algorithmic trading.
alphalens - Performance analysis of predictive stock factors.
Time-dependent Cox Model in R.
lifelines - Survival analysis, Cox PH Regression, talk, talk2.
scikit-survival - Survival analysis.
survivalstan - Survival analysis, intro.
convoys - Analyze time lagged conversions.
RandomSurvivalForests (R packages: randomForestSRC, ggRandomForests).
List
sklearn - Isolation Forest and others.
pyod - Outlier Detection / Anomaly Detection.
eif - Extended Isolation Forest.
AnomalyDetection - Anomaly detection (R package).
luminol - Anomaly Detection and Correlation library from Linkedin.
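A minimal sketch of the sklearn Isolation Forest entry above, assuming a synthetic 2D cloud with two planted outliers. The `contamination` value is an assumption that sets roughly how many points get flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): a Gaussian cloud plus two far-away points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, outliers])

# contamination is the expected share of outliers in the data.
detector = IsolationForest(
    n_estimators=100, contamination=0.01, random_state=0
)

# fit_predict returns 1 for inliers and -1 for outliers.
labels = detector.fit_predict(X)
flagged = np.where(labels == -1)[0]
print(flagged)
```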
lightning - Large-scale linear classification, regression and ranking.
Intro, Guide
PyMC3 - Bayesian modelling, intro.
pomegranate - Probabilistic modelling, talk.
pmlearn - Probabilistic machine learning.
arviz - Exploratory analysis of Bayesian models.
mlxtend - EnsembleVoteClassifier, StackingRegressor, StackingCVRegressor for model stacking.
vecstack - Stacking ML models.
StackNet - Stacking ML models.
pycm - Multi-class confusion matrix.
pandas_ml - Confusion matrix.
Plotting learning curve: link.
yellowbrick - Learning curve.
List: Awesome Machine Learning Interpretability, Book, Examples
shap - Explain predictions of machine learning models, talk.
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
lime - Explaining the predictions of any machine learning classifier, talk, Warning (Myth 7).
lime_xgboost - Create LIMEs for XGBoost.
eli5 - Inspecting machine learning classifiers and explaining their predictions.
lofo-importance - Leave One Feature Out Importance, talk.
pybreakdown - Generate feature contribution plots.
FairML - Model explanation, feature importance.
pycebox - Individual Conditional Expectation Plot Toolbox.
pdpbox - Partial dependence plot toolbox, example.
partial_dependence - Visualize and cluster partial dependence.
skater - Unified framework to enable model interpretation.
anchor - High-Precision Model-Agnostic Explanations for classifiers.
l2x - Instancewise feature selection as methodology for model interpretation.
contrastive_explanation - Contrastive explanations.
sklearn - GridSearchCV, RandomizedSearchCV.
hyperopt - Hyperparameter optimization.
hyperopt-sklearn - Hyperopt + sklearn.
skopt - BayesSearchCV for Hyperparameter search.
tune - Hyperparameter search with a focus on deep learning and deep reinforcement learning.
optuna - Hyperparameter optimization.
hypergraph - Global optimization methods and hyperparameter optimization.
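The two sklearn search classes listed above differ mainly in how they walk the parameter space. The sketch below, using the iris dataset and a small hand-picked grid (both assumptions for illustration), runs an exhaustive search and a random search over the same candidates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative search space: 4 x 3 = 12 combinations.
param_grid = {
    "max_depth": [1, 2, 3, 4],
    "min_samples_leaf": [1, 5, 10],
}

# GridSearchCV evaluates every combination with 5-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

# RandomizedSearchCV samples only n_iter combinations from the same space.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=5,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
print(random_search.best_params_, round(random_search.best_score_, 3))
```

Random search trades completeness for a fixed budget, which tends to matter once the grid grows beyond a few dozen combinations.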
YouTube, YouTube
Intro to Monte Carlo Tree Search (MCTS) - 1, 2, 3
AlphaZero methodology - 1, 2, 3, Cheat Sheet
RLLib - Library for reinforcement learning.
Horizon - Facebook RL framework.
h2o - Scalable machine learning.
turicreate - Apple Machine Learning Toolkit.
astroml - ML for astronomical data.
mlflow - Manage the machine learning lifecycle, including experimentation, reproducibility and deployment.
modelchimp - Experiment Tracking.
skll - Command-line utilities to make it easier to run machine learning experiments.
dvc - Versioning for ML projects.
daft - Render probabilistic graphical models using matplotlib.
unyt - Working with units.
scrapy - Web scraping library.
VowpalWabbit - ML Toolkit from Microsoft.
funcy - Fancy and practical functional tools.
more_itertools - Extension of itertools.
dill - Serialization, alternative to pickle.
attrs - Python classes without boilerplate.
dateparser - A better date parser.
PocketCluster - Blog.
Awesome AI bookmarks
Awesome Python Data Science
Awesome Machine Learning
Awesome Python
Frequency codes for time series
Date parsing codes
Feature Calculators tsfresh
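As a quick reminder of what the frequency and date parsing codes above look like in practice, here is a small sketch. It sticks to the `D` and `W` frequency aliases and a few common `strptime` directives; the dates are arbitrary examples.

```python
from datetime import datetime

import pandas as pd

# Frequency codes: "D" is calendar day, "W" is weekly (anchored on Sunday).
daily = pd.date_range("2019-01-01", periods=3, freq="D")
weekly = pd.date_range("2019-01-01", periods=3, freq="W")

# Date parsing codes: %Y year, %m month, %d day, %H hour (24h), %M minute.
parsed = datetime.strptime("2019-03-15 14:30", "%Y-%m-%d %H:%M")

# The same codes work in reverse for formatting.
formatted = parsed.strftime("%d.%m.%Y")

print(list(daily.strftime("%Y-%m-%d")))
print(list(weekly.strftime("%Y-%m-%d")))
print(formatted)
```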
Do you know a package that should be on this list? Did you spot a package that is no longer maintained and should be removed from this list? Then feel free to read the contribution guidelines and submit your pull request or create a new issue.
MIT