# Week #4 in Machine Learning

*Diabetes classification — Supervised ML classification problem*

In this notebook, I apply supervised machine learning classifiers to a diabetes dataset, aiming to determine whether a tested patient has diabetes or not. I will use the KNN, decision tree, random forest, support vector machine, logistic regression, and Naive Bayes algorithms, and I will evaluate all the models using confusion matrices.

## Data Loading

`df.shape` returns `(768, 9)`
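The loading step itself is a one-liner; here is a minimal sketch (the CSV filename is an assumption, and a tiny stand-in frame with the same 9 columns is used so the shape check can run anywhere):

```python
import pandas as pd

# The notebook reads the Pima Indians diabetes CSV; the filename is an assumption.
# df = pd.read_csv("diabetes.csv")

# Tiny stand-in frame with the same 9 columns, just to illustrate the shape check:
df = pd.DataFrame({
    "Pregnancies": [6, 1], "Glucose": [148, 85], "BloodPressure": [72, 66],
    "SkinThickness": [35, 29], "Insulin": [0, 0], "BMI": [33.6, 26.6],
    "DiabetesPedigreeFunction": [0.627, 0.351], "Age": [50, 31], "Outcome": [1, 0],
})
print(df.shape)  # (rows, columns) — the full dataset is (768, 9)
```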

## Descriptive statistics

This step shows the descriptive statistics of all the numerical columns in the dataset.

`df.describe()`

## EDA

Exploratory Data Analysis

This step focuses on exploring the data to determine the data types, check for missing data points and fix them, and so on.

`df.info()`

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
```

`sns.countplot(x="Outcome", data=df)`

This chart shows how the outcomes are distributed. There are more negative outcomes (0) than positive outcomes (1). There is not much cleaning to be done on our dataset, so we can go directly to the machine learning steps.
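The same class balance can be checked numerically with `value_counts`; a sketch on a small stand-in for the `Outcome` column:

```python
import pandas as pd

# Stand-in for df["Outcome"]; the real column holds 768 binary labels.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1])
counts = outcome.value_counts()
print(counts.to_dict())  # {0: 5, 1: 3} — negatives outnumber positives
```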

## Machine Learning

`x_data.head()`

`y[1:10]` returns `array([0, 1, 0, 1, 0, 1, 0, 1, 1])`
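Before fitting any model, the features and target are scaled and split into train and test sets. A minimal sketch on synthetic stand-in data (the min-max scaler and the 80/20 split ratio are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 8 feature columns and the binary Outcome target.
rng = np.random.default_rng(0)
x_data = rng.random((768, 8))
y = rng.integers(0, 2, size=768)

# Min-max scaling helps distance-based models such as KNN and SVM.
x = MinMaxScaler().fit_transform(x_data)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape)  # (614, 8) (154, 8)
```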

## Logistic Regression Classification

Logistic regression is a powerful algorithm when you have a binary classification problem.

`test accuracy 0.7662337662337663`
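A sketch of the fit-and-score step; `make_classification` stands in here for the real diabetes features, so the exact score will differ from the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
acc = lr.score(x_test, y_test)
print("test accuracy", acc)
```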

## KNN Classification

We need to choose a k value that is not too small, since a very small k causes overfitting, while a very large k causes underfitting. For this case, we start with the standard value k = 3.

`11 nn score: 0.7207792207792207`

k = 11 and k = 12 give the best accuracy in our case.
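One way to find the best k is to loop over a range of candidates and keep the top score. A sketch on synthetic stand-in data (the range 1–15 is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; swap in the real x_train/x_test from the notebook.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Score each candidate k on the held-out test set and keep the best.
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    scores[k] = knn.score(x_test, y_test)
best_k = max(scores, key=scores.get)
print(best_k, "nn score:", scores[best_k])
```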

## Decision Tree Classification

Decision trees build a classification or regression model in the form of a tree structure. The algorithm breaks the dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. It can be used for both binary and multiclass problems.

`score: 0.7467532467532467`
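The fitting step sketched on stand-in data (default hyperparameters are an assumption; the real notebook may set a `max_depth` or other options):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42)  # defaults assumed
dt.fit(x_train, y_train)
acc = dt.score(x_test, y_test)
print("score:", acc)
```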

## Random Forest Classification

This approach trains multiple decision trees and averages their results; this average is used to determine the class of a test point. It is one of the ensemble methods, which combine multiple learners to predict the target.

`random forest model score: 0.7207792207792207`
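A sketch of the ensemble on stand-in data; `n_estimators=100` (the number of trees) is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees vote; the majority class becomes the prediction.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
acc = rf.score(x_test, y_test)
print("random forest model score:", acc)
```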

## Support Vector Machine(SVM)

SVM is used for both regression and classification problems. The C parameter of the SVM algorithm has a default value of 1. If C is kept very small, it can cause misclassification; if it is too big, it can cause overfitting. As a result, different C values are tried to find the best one.

`Accuracy of SVM: 0.7467532467532467`
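The C search described above can be sketched as a small grid loop on stand-in data (the candidate grid `0.1, 1, 10, 100` is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try several C values and keep the one with the best test accuracy.
results = {}
for C in (0.1, 1.0, 10.0, 100.0):
    svm = SVC(C=C, random_state=42).fit(x_train, y_train)
    results[C] = svm.score(x_test, y_test)
best_C = max(results, key=results.get)
print("Accuracy of SVM:", results[best_C], "at C =", best_C)
```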

## Naive Bayes Classification

This is a probabilistic classifier which applies Bayes' theorem with a strong independence assumption between the features. For each class, it estimates the likelihood of the observed feature values given that class, P(features | class), and predicts the class with the highest posterior probability.

`accuracy of naive bayes: 0.7662337662337663`
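A sketch using Gaussian Naive Bayes (an assumption; it is the usual variant for continuous features like these) on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GaussianNB models each feature as a per-class normal distribution.
nb = GaussianNB().fit(x_train, y_train)
acc = nb.score(x_test, y_test)
print("accuracy of naive bayes:", acc)
```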

## Comparison Using Confusion Matrices

Below I visualize the confusion matrices of all the classifiers.
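A sketch of computing the matrices for a few of the models on stand-in data; for the visual grid, each matrix can then be passed to `sns.heatmap`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit each classifier and print its confusion matrix as (tn, fp, fn, tp).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "knn (k=3)": KNeighborsClassifier(n_neighbors=3),
    "naive bayes": GaussianNB(),
}
matrices = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    matrices[name] = confusion_matrix(y_test, model.predict(x_test))
    print(name, matrices[name].ravel())
```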

## Conclusion

In the past three weeks' posts we looked at the theoretical side of classification algorithms; here we have applied them. Check out next week's post.