Week #4 in Machine Learning

Eliud Nduati
4 min read · Feb 7, 2022

Diabetes classification — Supervised ML classification problem


In this notebook, I apply supervised machine learning classification to a diabetes dataset. The aim is to determine whether a tested patient has diabetes or not. I will use the KNN, decision tree, random forest, support vector machine, logistic regression, and Naive Bayes algorithms, and I will evaluate all of the models using confusion matrices.

Data Loading
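
The loading code itself isn't shown in the post; here is a minimal sketch, assuming the data sits in a CSV file named diabetes.csv (the Pima Indians Diabetes dataset):

import pandas as pd

# Load the dataset; the file name "diabetes.csv" is an assumption
df = pd.read_csv("diabetes.csv")
df.head()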

[Table: first rows of the dataset]
df.shape
(768, 9)

Descriptive statistics

This step shows the descriptive statistics of all the numerical columns in the dataset.

df.describe()
[Table: summary statistics from df.describe()]

EDA (Exploratory Data Analysis)

This step focuses on exploring the data to determine the data types, check for missing data points and fix them, and so on.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
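
An explicit missing-value check is a one-liner (a minimal sketch):

df.isnull().sum()  # per-column count of missing values; all zeros, matching the non-null counts above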

import seaborn as sns

sns.countplot(x="Outcome", data=df)
<AxesSubplot:xlabel='Outcome', ylabel='count'>
[Plot: count of each Outcome class]

This chart shows how the Outcomes are distributed: there are more negative outcomes (0) than positive outcomes (1). There is not much cleaning to be done on our dataset, so we can go directly to the machine learning steps.

Machine Learning

x_data.head()
[Table: first rows of x_data]
y[1:10]
array([0, 1, 0, 1, 0, 1, 0, 1, 1])
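
The preprocessing that produces x_data, y, and the train/test splits used below is not shown in the post; here is a plausible sketch. The min-max normalization, test_size=0.1, and random_state=42 are assumptions, not the post's exact settings:

from sklearn.model_selection import train_test_split

# Separate the features from the Outcome label
x_data = df.drop("Outcome", axis=1)
y = df["Outcome"].values

# Min-max normalize the features to the [0, 1] range (assumed preprocessing)
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())

# Hold out a test set; test_size=0.1 and random_state=42 are assumptions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)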

Logistic Regression Classification

Logistic regression is a powerful algorithm when you have a binary classification problem.
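
A minimal fit-and-score sketch with scikit-learn (max_iter=1000 is an assumption, just to ensure convergence):

from sklearn.linear_model import LogisticRegression

# Fit on the training split and score on the held-out test set
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
print("test accuracy", lr.score(x_test, y_test))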

test accuracy 0.7662337662337663

KNN Classification

We need to choose a k value that is small, but not so small that it causes overfitting, while a very big k value causes underfitting. A common starting point is k = 3; from there we evaluate a range of k values and keep the one that scores best on the test set, as in the sketch below.
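
A sketch of the search over k (the 1–20 range is an assumption):

from sklearn.neighbors import KNeighborsClassifier

# Score a range of k values on the test set
scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores.append(knn.score(x_test, y_test))

# Refit at the best-scoring k and report it
best_k = scores.index(max(scores)) + 1
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(x_train, y_train)
print(f"{best_k} nn score: {knn.score(x_test, y_test)}")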

11 nn score: 0.7207792207792207

k = 11 and k = 12 give the best accuracy in our case.

[Plot: test accuracy for different values of k]

Decision Tree Classification

Decision trees build classification or regression models in the form of a tree structure. The algorithm breaks the dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. It can be used for both binary and multiclass problems.
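
A minimal sketch of the fit (random_state=42 is an assumption, for reproducibility):

from sklearn.tree import DecisionTreeClassifier

# Fit a single tree and score it on the test set
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)
print("score: ", dt.score(x_test, y_test))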

score:  0.7467532467532467

Random Forest Classification

This approach uses multiple decision trees and takes the average of their results (for classification, a majority vote) to determine the class of the test points. It is one of the ensemble methods, which combine multiple classifiers to predict the target.
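
A minimal sketch (n_estimators=100 is scikit-learn's default; random_state=42 is an assumption):

from sklearn.ensemble import RandomForestClassifier

# Fit an ensemble of 100 trees and score it on the test set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
print("random forest model score: ", rf.score(x_test, y_test))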

random forest model score:  0.7207792207792207

Support Vector Machine(SVM)

SVM is used for both regression and classification problems. The C parameter of the SVM algorithm has a default value of 1. If C is kept very small, it can cause misclassification (underfitting); if it is too big, it can cause overfitting. As a result, different C values are tried to find the best one, as in the sketch below.
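
A sketch of the C search (the particular grid of C values is an assumption):

from sklearn.svm import SVC

# Try several C values and keep the best test accuracy
best_score, best_C = 0.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    svm = SVC(C=C)
    svm.fit(x_train, y_train)
    score = svm.score(x_test, y_test)
    if score > best_score:
        best_score, best_C = score, C

# Refit at the best C for later use
svm = SVC(C=best_C)
svm.fit(x_train, y_train)
print(f"Accuracy of SVM: {best_score} (C={best_C})")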

Accuracy of SVM:  0.7467532467532467

Naive Bayes Classification

This is a probabilistic classifier which applies Bayes' theorem with a strong independence assumption between the features. It works by estimating the likelihood of each feature value given a class, P(A_feature | class), and combining these likelihoods to compute the probability of each class for a point x.
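
A minimal sketch using Gaussian Naive Bayes (the choice of the Gaussian variant is an assumption, suited to continuous-valued features like these):

from sklearn.naive_bayes import GaussianNB

# Fit the naive Bayes model and score it on the test set
nb = GaussianNB()
nb.fit(x_train, y_train)
print("accuracy of naive bayes: ", nb.score(x_test, y_test))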

accuracy of naive bayes:  0.7662337662337663

Comparison Using Confusion Matrices

Below, I visualize the confusion matrices for all the classifiers.
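
A plotting sketch, assuming the fitted models from the sections above are still in scope (the 2×3 grid layout is an assumption):

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# One confusion matrix per fitted model from the sections above
models = {"Logistic Regression": lr, "KNN": knn, "Decision Tree": dt,
          "Random Forest": rf, "SVM": svm, "Naive Bayes": nb}

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (name, model) in zip(axes.ravel(), models.items()):
    cm = confusion_matrix(y_test, model.predict(x_test))
    sns.heatmap(cm, annot=True, fmt="d", cbar=False, ax=ax)
    ax.set_title(name)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
plt.tight_layout()
plt.show()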

[Plot: confusion matrices for all six classifiers]

Conclusion

Over the past three weeks' posts, we looked at the theoretical side of these classification algorithms; here, we have applied them. Watch out for next week's post.
