2023-02-21

List of Features With Strongest Correlation

The code in this note is useful when there are many features (e.g. 1k+). In that case it is difficult to analyse a heatmap of the correlation matrix visually (e.g. one plotted with sns.heatmap(), see beautiful example here). Instead, we extract the pairs with the strongest correlation.

To get a list of features with the strongest correlation in a pandas DataFrame, you can use the corr() method to calculate the correlation between all pairs of columns. Here is the Python code to do so:

import pandas as pd
import seaborn as sns

# Load the dataset
df = sns.load_dataset('tips')

# Calculate the correlation matrix (numeric columns only; 'tips' also has categorical columns)
corr_matrix = df.corr(numeric_only=True)

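# Get the top n pairs with the highest correlation
n = 5  # change this to the number of pairs you want to get
# Keep 2*n entries (each pair appears twice) plus one per feature for the
# self-correlations on the diagonal, which are always 1.0 and sort first
top_pairs = corr_matrix.unstack().sort_values(ascending=False)[:n*2 + len(corr_matrix)]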

# Create a list to store the top pairs without duplicates
unique_pairs = []

# Iterate over the top pairs and add only unique pairs to the list
for pair in top_pairs.index:
    if pair[0] != pair[1] and (pair[1], pair[0]) not in unique_pairs:
        unique_pairs.append(pair)

# Create a dataframe with the top pairs and their correlation coefficients
top_pairs_df = pd.DataFrame(columns=['feature_1', 'feature_2', 'corr_coef'])
for i, pair in enumerate(unique_pairs[:n]):
    top_pairs_df.loc[i] = [pair[0], pair[1], corr_matrix.loc[pair[0], pair[1]]]

# Print the top pairs as a dataframe (display() is only available in Jupyter/IPython)
print(top_pairs_df)

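In this code, we use the unstack() method to transform the correlation matrix into a Series indexed by pairs of column names, with the correlation coefficients as values. We then sort the Series in descending order and slice off the top entries. The slice has to be larger than n because every pair appears twice in the correlation matrix, as (a, b) and (b, a), and each feature's correlation with itself (always 1.0) sorts to the top. The loop then skips the self-correlations and the mirrored duplicates. Note that the sort is on the signed value, so strong negative correlations are not ranked highly.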
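With 1k+ features the unstacked Series has over a million entries, so the Python loop with a list membership check can become slow. One possible alternative, shown here only as a sketch, is to mask everything except the upper triangle of the matrix so that each pair appears exactly once and the diagonal disappears, and then to rank by absolute value so that strong negative correlations are included. The helper name strongest_pairs and the small demo frame are made up for illustration. The sketch assumes pandas 1.5 or newer (for numeric_only) and NumPy.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n column pairs with the largest absolute correlation.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    corr = df.corr(numeric_only=True)

    # Boolean mask that is True strictly above the diagonal: each pair is
    # kept once and the self-correlations are masked out.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

    # Masked cells become NaN and are dropped. Pairs whose correlation is
    # genuinely undefined (e.g. a constant column) are dropped as well.
    pairs = corr.where(upper).unstack().dropna()

    # Rank by absolute value, but keep the signed coefficient in the result.
    top = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

    return top.rename_axis(["feature_1", "feature_2"]).reset_index(name="corr_coef")


# Made-up demo data: 'c' decreases as 'a' increases, 'label' is non-numeric.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 4, 3, 2, 1],
    "label": list("xyxyx"),
})
print(strongest_pairs(demo, n=2))
```

Because the mask removes duplicates before sorting, no over-sized slice is needed: head(n) returns exactly n pairs, or fewer if the data has fewer numeric column pairs.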