2023-02-12

Libraries for Automated Exploratory Data Analysis (EDA)

EDA Made Easy - Discover Top-10 Python Libraries That Will Take Your Data Analysis to the Next Level! Learn the Secrets of Automated EDA!

Exploratory Data Analysis (EDA) is an important step in the data analysis process. It allows us to explore and understand the dataset, identify patterns, and make informed decisions about data cleaning, feature engineering, and modeling. In recent years, several Python libraries have been developed to automate and streamline the EDA process. Here are 10 popular Python libraries for automated EDA:

Top-10 Tools for Automated EDA

1. Pandas Profiling

Pandas Profiling (published as ydata-profiling since version 4.0) generates a report with descriptive statistics and visualizations for each variable in a Pandas DataFrame. The report covers data types, missing values, and correlations.

pandas-profiling
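
A minimal sketch of producing such a report, assuming either the `ydata_profiling` package or the older `pandas_profiling` package is installed and that `df` is any Pandas DataFrame. The helper name `build_profile` and the default file name are illustrative, not part of the library.

```python
def build_profile(df, output_path="profile_report.html", minimal=False):
    """Write an HTML profile of a Pandas DataFrame and return the output path."""
    try:
        # Current package name (assumption: a recent release is installed).
        from ydata_profiling import ProfileReport
    except ImportError:
        # Older installs expose the original package name instead.
        from pandas_profiling import ProfileReport

    # minimal=True switches off the more expensive computations.
    report = ProfileReport(df, title="Profiling report", minimal=minimal)
    report.to_file(output_path)
    return output_path
```

Calling `build_profile(df)` writes a standalone HTML file that can be opened in any browser; `minimal=True` is worth trying on wide or large datasets.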

2. DataPrep

DataPrep provides a set of functions for data cleaning and preprocessing, including automatic column type detection, outlier detection, and missing value imputation.

dataprep

pip install dataprep

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

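from dataprep.datasets import load_dataset
from dataprep.eda import create_report
# Load the Titanic sample dataset that ships with DataPrep
df = load_dataset("titanic")
# Build the profile report and open it in the default web browser
create_report(df).show_browser()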

3. Sweetviz

Sweetviz generates a report with detailed visualizations and statistical analysis for each variable in a Pandas DataFrame. The report includes comparisons between different subgroups and correlation matrices.

Sweetviz
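
The sketch below shows one way to get both an overall report and a subgroup comparison, assuming `sweetviz` is installed. The helper name `build_sweetviz_reports` and the output file names are made up for illustration; `condition` is expected to be a boolean Series that splits `df` into two groups.

```python
def build_sweetviz_reports(df, condition, names, output_prefix="sweetviz"):
    """Write an overall report and a two-subgroup comparison; return both paths."""
    import sweetviz as sv  # third-party: pip install sweetviz

    overall_path = f"{output_prefix}_overall.html"
    compare_path = f"{output_prefix}_compare.html"

    # Report on the full DataFrame.
    sv.analyze(df).show_html(filepath=overall_path, open_browser=False)

    # Compare the rows where `condition` is True against the remaining rows.
    sv.compare_intra(df, condition, names).show_html(
        filepath=compare_path, open_browser=False
    )
    return overall_path, compare_path
```

For a Titanic-style DataFrame with a `Sex` column (an assumption about the data), a call could look like `build_sweetviz_reports(df, df["Sex"] == "male", ["Male", "Female"])`.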

4. Lux

Lux is a library for interactive data visualization that provides a powerful and intuitive interface for exploring and visualizing data. It includes a recommendation system that suggests relevant visualizations based on the current selection.

Lux
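
Lux is meant to be used inside a Jupyter notebook, so the following is only a sketch of the setup step, assuming `lux` and `pandas` are installed. The helper name `load_with_intent` is hypothetical, and the column names passed in depend on your data.

```python
def load_with_intent(csv_path, columns_of_interest):
    """Load a CSV with Lux enabled and tell it which columns matter most."""
    import lux  # noqa: F401  (importing lux attaches its features to pandas)
    import pandas as pd

    df = pd.read_csv(csv_path)
    # Lux uses the intent to steer which visualizations it recommends.
    df.intent = list(columns_of_interest)
    return df
```

In a notebook, displaying the returned DataFrame as the last expression of a cell should show the Lux widget alongside the usual table view.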

5. dabl

dabl is a library that provides a set of functions for automated data analysis and machine learning. It includes tools for data cleaning, feature engineering, and modeling, and provides an easy-to-use interface for non-experts.

dabl
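
A short sketch of the exploratory part of dabl, assuming the package is installed and `df` contains a column to predict. The helper name `quick_look` is illustrative.

```python
def quick_look(df, target_col):
    """Clean a DataFrame with dabl and plot features against a target column."""
    import dabl  # third-party: pip install dabl

    # Detects column types and applies dabl's basic cleaning.
    cleaned = dabl.clean(df)
    # Draws the plots dabl selects for this target column.
    dabl.plot(cleaned, target_col=target_col)
    return cleaned
```

With a Titanic-style DataFrame, `quick_look(df, "Survived")` would be a typical call, assuming a `Survived` column exists.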

6. AutoViz

AutoViz is a library that automatically generates visualizations for each variable in a Pandas DataFrame. It produces several chart types, such as scatterplots, histograms, and bar charts, and it can be used for both regression and classification tasks.

AutoViz
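
As a rough sketch, AutoViz can be pointed at a DataFrame that is already in memory, assuming the `autoviz` package is installed. The wrapper name `autoviz_charts` is made up for this example.

```python
def autoviz_charts(df, target_col=""):
    """Run AutoViz on a DataFrame; an empty target_col means no target variable."""
    from autoviz.AutoViz_Class import AutoViz_Class

    viz = AutoViz_Class()
    # filename stays empty because the data is passed in through dfte.
    return viz.AutoViz(
        filename="",
        dfte=df,
        depVar=target_col,
        verbose=1,
        chart_format="svg",
    )
```

Passing a target column such as `"Survived"` (an assumption about the data) lets AutoViz relate the other variables to it.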

7. Klib

Klib is a library that provides a set of functions for data cleaning and preprocessing, including feature selection, missing value imputation, and correlation analysis. It includes useful visualizations and statistical analysis for each variable.

Klib
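
The following sketch combines cleaning with two of the plots, assuming `klib` is installed. The helper name `klib_overview` is illustrative only.

```python
def klib_overview(df):
    """Clean a DataFrame with klib, then plot missing values and correlations."""
    import klib  # third-party: pip install klib

    # Applies klib's default cleaning steps and returns a new DataFrame.
    cleaned = klib.data_cleaning(df)
    # Missing values are plotted on the raw data, before any cleaning.
    klib.missingval_plot(df)
    # Correlations are plotted on the cleaned data.
    klib.corr_plot(cleaned)
    return cleaned
```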

8. ExplainerDashboard

ExplainerDashboard is a library that provides a dashboard for exploring and visualizing the results of machine learning models. It includes visualizations for feature importance, confusion matrices, and partial dependence plots.

ExplainerDashboard
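
A minimal sketch of wiring a model into the dashboard, assuming `explainerdashboard` is installed, `model` is an already fitted scikit-learn-style classifier, and `X_test` and `y_test` are held-out features and labels. The helper name `build_dashboard` is hypothetical.

```python
def build_dashboard(model, X_test, y_test, title="Model explainer"):
    """Wrap a fitted classifier and its test data in an ExplainerDashboard."""
    from explainerdashboard import ClassifierExplainer, ExplainerDashboard

    # The explainer computes the metrics and explanations the dashboard shows.
    explainer = ClassifierExplainer(model, X_test, y_test)
    return ExplainerDashboard(explainer, title=title)
```

Calling `.run()` on the returned object starts a local web server, so `build_dashboard(model, X_test, y_test).run()` blocks until the server is stopped.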

9. PyCaret

PyCaret is a library for automated machine learning that includes tools for data preprocessing, feature selection, and model training. It includes a user-friendly interface that allows non-experts to build and deploy machine learning models.

PyCaret
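
For a classification problem, the usual PyCaret flow can be sketched as below, assuming `pycaret` is installed and `df` holds both the features and the target column. The helper name `best_classifier` and the default seed are illustrative.

```python
def best_classifier(df, target_col, seed=42):
    """Set up a PyCaret classification experiment and return the top model."""
    from pycaret.classification import compare_models, setup

    # setup() infers column types and prepares the preprocessing pipeline.
    setup(data=df, target=target_col, session_id=seed)
    # compare_models() cross-validates the available estimators.
    return compare_models()
```

Fixing `session_id` keeps runs reproducible, which helps when comparing results across notebook sessions.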

10. Missingno

Missingno is a library that provides a set of tools for visualizing and understanding missing data in a dataset. It includes tools for matrix visualization, bar charts, and heatmaps.

Missingno
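
The three views mentioned above can be drawn in a few lines, assuming `missingno` is installed and `df` is a Pandas DataFrame. The helper name `plot_missingness` is made up for this sketch.

```python
def plot_missingness(df):
    """Draw the matrix, bar, and heatmap views; return the resulting plot objects."""
    import missingno as msno  # third-party: pip install missingno

    return {
        "matrix": msno.matrix(df),    # where the gaps sit, row by row
        "bar": msno.bar(df),          # how complete each column is
        "heatmap": msno.heatmap(df),  # how missingness co-occurs across columns
    }
```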

Honorable mentions

There are two other tools that might be useful during data exploration.

Featuretools

Featuretools is a library for automated feature engineering that allows you to automatically generate features from multiple tables. It includes tools for handling time-based data and can generate a set of feature definitions in just a few lines of code.

Featuretools
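
A small sketch of deep feature synthesis on the demo data bundled with the library, assuming Featuretools 1.0 or newer is installed (earlier releases used different parameter names). The helper name `customer_features` is illustrative.

```python
def customer_features(max_depth=2):
    """Generate per-customer features from the bundled multi-table demo data."""
    import featuretools as ft  # third-party: pip install featuretools

    # Demo EntitySet with related tables, including customers.
    es = ft.demo.load_mock_customer(return_entityset=True)
    # Build features for each customer by aggregating across related tables.
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        max_depth=max_depth,
    )
    return feature_matrix, feature_defs
```

The first return value is a DataFrame with one row per customer, and the second is the list of feature definitions that produced its columns.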

PyExplainer

PyExplainer is a library that allows you to easily explain and interpret the results of machine learning models. It includes tools for feature importance, partial dependence plots, and permutation feature importance.

PyExplainer

Any comments or suggestions? Let me know.

References

There is an interesting article that reviews EDA tools:

Modern Exploratory Data Analysis. Review of 4 libraries for automatic EDA | by ChiefHustler | Towards Data Science

It covers:

  • pandas-profiling (python)
  • summarytools (R)
  • explore (R)
  • dataMaid (R)

To cite this article:

@article{Saf2023Libraries,
    author  = {Krystian Safjan},
    title   = {Libraries for Automated Exploratory Data Analysis (EDA)},
    journal = {Krystian's Safjan Blog},
    year    = {2023},
}