Week #4 in Machine Learning

Eliud Nduati
4 min read · Feb 7, 2022

Diabetes classification — Supervised ML classification problem


In this notebook, I apply supervised machine learning classification to a diabetes dataset. The aim is to determine whether a tested patient has diabetes or not. I will use the KNN, decision tree, random forest, support vector machine, logistic regression, and Naive Bayes algorithms, and I will evaluate all of the models using confusion matrices.

Data Loading
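
The loading code itself isn't shown in the post; here is a minimal sketch, assuming the data sits in a CSV file named diabetes.csv (the Pima Indians Diabetes dataset):

import pandas as pd

# Load the dataset; the file name "diabetes.csv" is an assumption
df = pd.read_csv("diabetes.csv")
df.head()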

[Table: first rows of the dataset]
df.shape
(768, 9)

Descriptive statistics

This step shows the descriptive statistics of all the numerical columns in the dataset.

df.describe()
[Table: summary statistics from df.describe()]

EDA (Exploratory Data Analysis)

This step focuses on exploring the data to determine the data types, check for missing data points and fix them, and so on.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
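
An explicit missing-value check is a one-liner (a minimal sketch):

df.isnull().sum()  # per-column count of missing values; all zeros, matching the non-null counts above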

import seaborn as sns

sns.countplot(x="Outcome", data=df)
<AxesSubplot:xlabel='Outcome', ylabel='count'>
[Plot: count of each Outcome class]

This chart shows how the Outcomes are distributed: there are more negative outcomes (0) than positive outcomes (1). There is not much cleaning to be done on our dataset, so we can go directly to the machine learning steps.

Machine Learning

x_data.head()
[Table: first rows of x_data]
y[1:10]
array([0, 1, 0, 1, 0, 1, 0, 1, 1])
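
The preprocessing that produces x_data, y, and the train/test splits used below is not shown in the post; here is a plausible sketch. The min-max normalization, test_size=0.1, and random_state=42 are assumptions, not the post's exact settings:

from sklearn.model_selection import train_test_split

# Separate the features from the Outcome label
x_data = df.drop("Outcome", axis=1)
y = df["Outcome"].values

# Min-max normalize the features to the [0, 1] range (assumed preprocessing)
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())

# Hold out a test set; test_size=0.1 and random_state=42 are assumptions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)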

Logistic Regression Classification

Logistic regression is a powerful algorithm when you have a binary classification problem.
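
A minimal fit-and-score sketch with scikit-learn (max_iter=1000 is an assumption, just to ensure convergence):

from sklearn.linear_model import LogisticRegression

# Fit on the training split and score on the held-out test set
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
print("test accuracy", lr.score(x_test, y_test))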

test accuracy 0.7662337662337663

KNN Classification

We need to choose a k value that is small, but not so small that it causes overfitting, while a very big k value causes underfitting. A common starting point is k = 3; from there we evaluate a range of k values and keep the one that scores best on the test set, as in the sketch below.
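
A sketch of the search over k (the 1–20 range is an assumption):

from sklearn.neighbors import KNeighborsClassifier

# Score a range of k values on the test set
scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores.append(knn.score(x_test, y_test))

# Refit at the best-scoring k and report it
best_k = scores.index(max(scores)) + 1
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(x_train, y_train)
print(f"{best_k} nn score: {knn.score(x_test, y_test)}")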

11 nn score: 0.7207792207792207

k = 11 and k = 12 give the best accuracy in our case.

[Plot: test accuracy for different values of k]

Decision Tree Classification

Decision trees build classification or regression models in the form of a tree structure. The algorithm breaks the dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. It can be used for both binary and multiclass problems.
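
A minimal sketch of the fit (random_state=42 is an assumption, for reproducibility):

from sklearn.tree import DecisionTreeClassifier

# Fit a single tree and score it on the test set
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)
print("score: ", dt.score(x_test, y_test))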

score:  0.7467532467532467

Random Forest Classification

This approach uses multiple decision trees and takes the average of their results (for classification, a majority vote) to determine the class of the test points. It is one of the ensemble methods, which combine multiple classifiers to predict the target.
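
A minimal sketch (n_estimators=100 is scikit-learn's default; random_state=42 is an assumption):

from sklearn.ensemble import RandomForestClassifier

# Fit an ensemble of 100 trees and score it on the test set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
print("random forest model score: ", rf.score(x_test, y_test))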

random forest model score:  0.7207792207792207

Support Vector Machine(SVM)

SVM is used for both regression and classification problems. The C parameter of the SVM algorithm has a default value of 1. If C is kept very small, it can cause misclassification (underfitting); if it is too big, it can cause overfitting. As a result, different C values are tried to find the best one, as in the sketch below.
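
A sketch of the C search (the particular grid of C values is an assumption):

from sklearn.svm import SVC

# Try several C values and keep the best test accuracy
best_score, best_C = 0.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    svm = SVC(C=C)
    svm.fit(x_train, y_train)
    score = svm.score(x_test, y_test)
    if score > best_score:
        best_score, best_C = score, C

# Refit at the best C for later use
svm = SVC(C=best_C)
svm.fit(x_train, y_train)
print(f"Accuracy of SVM: {best_score} (C={best_C})")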

Accuracy of SVM:  0.7467532467532467

Naive Bayes Classification

This is a probabilistic classifier which applies Bayes' theorem with a strong independence assumption between the features. It works by estimating the likelihood of each feature value given a class, P(A_feature | class), and combining these likelihoods to compute the probability of each class for a point x.
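
A minimal sketch using Gaussian Naive Bayes (the choice of the Gaussian variant is an assumption, suited to continuous-valued features like these):

from sklearn.naive_bayes import GaussianNB

# Fit the naive Bayes model and score it on the test set
nb = GaussianNB()
nb.fit(x_train, y_train)
print("accuracy of naive bayes: ", nb.score(x_test, y_test))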

accuracy of naive bayes:  0.7662337662337663

Comparison Using Confusion Matrices

Below, I visualize the confusion matrices for all the classifiers.
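
A plotting sketch, assuming the fitted models from the sections above are still in scope (the 2×3 grid layout is an assumption):

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# One confusion matrix per fitted model from the sections above
models = {"Logistic Regression": lr, "KNN": knn, "Decision Tree": dt,
          "Random Forest": rf, "SVM": svm, "Naive Bayes": nb}

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (name, model) in zip(axes.ravel(), models.items()):
    cm = confusion_matrix(y_test, model.predict(x_test))
    sns.heatmap(cm, annot=True, fmt="d", cbar=False, ax=ax)
    ax.set_title(name)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
plt.tight_layout()
plt.show()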

[Plot: confusion matrices for all six classifiers]

Conclusion

Over the past three weeks' posts, we looked at the theoretical side of these classification algorithms; here, we have applied them. Watch out for next week's post.
