In our last tutorial on dimensionality reduction with PCA, we explained how you can reduce dimensions of your dataset using the principal component analysis algorithm.

In this tutorial, we’re going to show you another way to reduce dimensions. We’ll walk through how to reduce the number of features in a dataset using linear discriminant analysis (LDA), another commonly used dimensionality reduction technique.

The choice between PCA and LDA for dimensionality reduction is driven mainly by your dataset. PCA is an unsupervised technique, which means it doesn’t require labeled data. LDA, on the other hand, is a supervised technique and requires labeled data.

This article shows how to use the LDA algorithm to reduce the features in a labeled dataset using Python and the scikit-learn library. In this context, features are also called components or dimensions.

Machine Learning Model with Default Features

Establishing a Baseline

In this section, you’ll train a machine learning model on 8 attributes of different patients to solve a binary classification task: predicting whether or not a patient is diabetic. We’re not actually going to be using LDA in this section - we’re simply establishing our baseline accuracy with all the features in our dataset. In the next section, we’ll use LDA to remove some of the features and see whether we can still get comparable prediction performance.

The dataset we’ll use to train our machine learning model can be downloaded in CSV format from this Kaggle link. It’s been mirrored here in case you prefer to download it directly. This is the same dataset we used to train the machine learning models in our Python PCA tutorial. We’re using the same dataset and model here so we can compare results between the two dimensionality reduction techniques.

To begin, first execute the following script to import our required libraries:

import pandas as pd
import numpy as np
import seaborn as sns

Our next script uses the Pandas read_csv() method to import our dataset. You’ll need to update the file path to match where you saved your CSV file. The dataset doesn’t include a header row, so we also pass a custom list of column names in the script below.

header_list = ["Preg", "Glucose", "BP", "skinThick", "Insulin", "BMI", "DPF", "Age", "Class"]
diabetes_ds = pd.read_csv(r"C:\Datasets\pima-indians-diabetes.csv", names=header_list)

diabetes_ds.head()

The output shows that the dataset contains 9 columns. The first 8 columns contain patient attributes on the basis of which the value in the 9th column, Class, will be predicted. The Class column is a binary column reflecting whether or not the patient is diabetic.

Output:

dataset header
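
Before splitting the data, it can also help to double-check the number of rows and the class balance. The quick check below is optional and simply reuses the diabetes_ds DataFrame we loaded above; the exact counts will depend on your copy of the dataset.

# Optional sanity checks on the loaded DataFrame
print(diabetes_ds.shape)                     # (rows, columns) - we expect 9 columns
print(diabetes_ds["Class"].value_counts())   # counts of diabetic vs. non-diabetic records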

The next couple of lines divide the data into a feature set and a label set. This is an essential step when building a supervised machine learning model.

features = diabetes_ds.drop(['Class'], axis=1)
labels = diabetes_ds["Class"]

Once that’s done, we can further divide our dataset into a training set and a test set. We’ll use 80% of our data to train and the remaining 20% to test the performance of our algorithm. You can easily split the data using the train_test_split() function from the sklearn.model_selection module, where the test_size argument controls the fraction of the data used for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.20, random_state=0)

The following script normalizes our feature set using the StandardScaler class from the sklearn.preprocessing module:

from sklearn.preprocessing import StandardScaler
stand_scal = StandardScaler()
X_train = stand_scal.fit_transform(X_train)
X_test = stand_scal.transform(X_test)

So far all of this should look really familiar since these are the same steps we used to establish our baseline in our PCA tutorial. Just like last time, we’ll use a random forest classifier to train our model. The random forest is one of the most commonly used classifiers in machine learning and often achieves some of the best accuracy among traditional machine learning classifiers.

To train the model, use the fit() method of the RandomForestClassifier class from the sklearn.ensemble module, as shown in the following script. To make predictions on the test set, call the predict() method and pass it your test set.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

The model accuracy can be calculated via the accuracy_score() function from the sklearn.metrics module as shown in the script below, which compares our predicted values, y_pred, with the actual known test results in y_test.

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

Output:

0.7987012987012987

The output shows that our algorithm, when trained using all 8 features in our dataset, achieves an accuracy of 79.87% at predicting whether or not a person is diabetic. This is the same score as our baseline PCA test because it’s an identical dataset and we haven’t reduced any dimensions yet.
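
If you want more detail than a single accuracy number, you can also look at a per-class breakdown. The snippet below is an optional extra rather than part of the original baseline; it uses the classification_report function from the sklearn.metrics module on the same predictions.

from sklearn.metrics import classification_report

# Precision, recall and F1 score for each class in the baseline predictions
print(classification_report(y_test, y_pred))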


Dimensionality Reduction with Linear Discriminant Analysis

Now that we have our baseline, we’re going to reduce the number of features in our dataset using Linear Discriminant Analysis. We’ll train our machine learning model once again on the reduced set of features to see how well the algorithm performs compared to our baseline.

You can use the LinearDiscriminantAnalysis class from the sklearn.discriminant_analysis module to implement LDA in Python. The number of reduced features you want should be passed to the n_components parameter of the LinearDiscriminantAnalysis constructor. Keep in mind that LDA can produce at most one fewer components than the number of classes, so for a binary classification task like ours, n_components can be at most 1.

To apply the LDA algorithm to your feature set, you need to pass your feature set and labels to the fit_transform() method of the LinearDiscriminantAnalysis class, which returns the data reduced to the specified number of dimensions. Notice the difference here between LDA and PCA. Since LDA is a supervised dimensionality reduction technique that depends on the output labels, you need to pass your label set along with your feature set. Because PCA is unsupervised, you only had to pass the feature set to its fit_transform() method.

Finally, you’ll need to reduce the features in your test set since this test set will be used for evaluation. To do so, you can use the transform() method of the LinearDiscriminantAnalysis class and pass it your test set.

Let’s reduce our dataset to 1 feature using the LDA algorithm and see how our performance looks.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda_model = LDA(n_components=1)
X_train_lda = lda_model.fit_transform(X_train, y_train)
X_test_lda = lda_model.transform(X_test)

To confirm our dataset has actually been reduced to a single feature, let’s print the shape of our training and test sets:

print(X_train_lda.shape)
print(X_test_lda.shape)

Output:

(614, 1)
(154, 1)

The output above confirms we only have 1 feature for all the records in our training and test sets.

Let’s again train the Random Forest Classifier algorithm on the reduced feature set with 1 feature and see how well our model performs.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_lda, y_train)
y_pred = model.predict(X_test_lda)

print(accuracy_score(y_test, y_pred))

The output below shows that with only a single feature, our machine learning model achieves an accuracy of 77.92%, which is only about 2 percentage points lower than the accuracy achieved using all 8 features. That is impressive!

Output:

0.7792207792207793

Comparing LDA to PCA

For the sake of comparison with PCA, let’s see what our accuracy would be with 1 principal component using the PCA algorithm.

from sklearn.decomposition import PCA

pca_model = PCA(n_components=1)
X_train_pca = pca_model.fit_transform(X_train)
X_test_pca = pca_model.transform(X_test)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_pca, y_train)
y_pred = model.predict(X_test_pca)

print(accuracy_score(y_test, y_pred))

Output:

0.6623376623376623

The above output shows that when we reduce our feature set to 1 component using the principal component analysis (PCA) technique, we only get an accuracy of 66.23%, which is nearly 12 percentage points lower than the 77.92% achieved using the LDA technique. This clearly shows that LDA is the better-suited dimensionality reduction technique for our dataset, and it should convince you to experiment with different techniques when reducing dimensions in your own datasets.
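
If you’d like to reproduce the whole comparison in one place, here’s a compact sketch that loops over both reducers. It reuses the scaled X_train, X_test, y_train and y_test arrays from earlier and should print the same two accuracy scores.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for name, reducer in [("LDA", LDA(n_components=1)), ("PCA", PCA(n_components=1))]:
    # LDA uses the labels during fitting; PCA simply ignores them
    X_train_red = reducer.fit_transform(X_train, y_train)
    X_test_red = reducer.transform(X_test)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train_red, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test_red)))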

If you liked this tutorial and you want more Python machine learning tips, subscribe using the form below.


Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Yes, I'll take a free Python Developer Kit