# Week #4 in Machine Learning

*Diabetes classification — Supervised ML classification problem*

In this notebook, I apply supervised machine learning classifiers to a diabetes dataset, aiming to determine whether a tested patient has diabetes or not. I will use the KNN, decision tree, random forest, support vector machine, logistic regression, and Naive Bayes algorithms, and I will evaluate all the models using confusion matrices.

## Data Loading

`df.shape` returns `(768, 9)`
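The loading step itself is a one-liner; here is a minimal sketch (the CSV filename is an assumption, and a tiny stand-in frame with the same 9 columns is used so the shape check can run anywhere):

```python
import pandas as pd

# The notebook reads the Pima Indians diabetes CSV; the filename is an assumption.
# df = pd.read_csv("diabetes.csv")

# Tiny stand-in frame with the same 9 columns, just to illustrate the shape check:
df = pd.DataFrame({
    "Pregnancies": [6, 1], "Glucose": [148, 85], "BloodPressure": [72, 66],
    "SkinThickness": [35, 29], "Insulin": [0, 0], "BMI": [33.6, 26.6],
    "DiabetesPedigreeFunction": [0.627, 0.351], "Age": [50, 31], "Outcome": [1, 0],
})
print(df.shape)  # (rows, columns) — the full dataset is (768, 9)
```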

## Descriptive statistics

This step shows the descriptive statistics of all the numerical columns in the dataset.

`df.describe()`

## EDA

Exploratory Data Analysis

This step focuses on exploring the data to determine the data types, check for missing data points and fix them, and so on.

`df.info()`

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
```

`sns.countplot(x="Outcome", data=df)`

This chart shows how the outcomes are distributed. There are more negative outcomes (0) than positive outcomes (1). There is not much cleaning to be done on our dataset, so we can go directly to the machine learning steps.
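The same class balance can be checked numerically with `value_counts`; a sketch on a small stand-in for the `Outcome` column:

```python
import pandas as pd

# Stand-in for df["Outcome"]; the real column holds 768 binary labels.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1])
counts = outcome.value_counts()
print(counts.to_dict())  # {0: 5, 1: 3} — negatives outnumber positives
```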

## Machine Learning

`x_data.head()`

`y[1:10]` returns `array([0, 1, 0, 1, 0, 1, 0, 1, 1])`
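Before fitting any model, the features and target are scaled and split into train and test sets. A minimal sketch on synthetic stand-in data (the min-max scaler and the 80/20 split ratio are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 8 feature columns and the binary Outcome target.
rng = np.random.default_rng(0)
x_data = rng.random((768, 8))
y = rng.integers(0, 2, size=768)

# Min-max scaling helps distance-based models such as KNN and SVM.
x = MinMaxScaler().fit_transform(x_data)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape)  # (614, 8) (154, 8)
```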

## Logistic Regression Classification

Logistic regression is a powerful algorithm when you have a binary classification problem.

`test accuracy 0.7662337662337663`
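A sketch of the fit-and-score step; `make_classification` stands in here for the real diabetes features, so the exact score will differ from the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
acc = lr.score(x_test, y_test)
print("test accuracy", acc)
```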

## KNN Classification

We need to choose a k value that is not too small, since a very small k causes overfitting, while a very large k causes underfitting. For this case, we start with the standard value k = 3.

`11 nn score: 0.7207792207792207`

k = 11 and k = 12 give the best accuracy in our case.
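One way to find the best k is to loop over a range of candidates and keep the top score. A sketch on synthetic stand-in data (the range 1–15 is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; swap in the real x_train/x_test from the notebook.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Score each candidate k on the held-out test set and keep the best.
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    scores[k] = knn.score(x_test, y_test)
best_k = max(scores, key=scores.get)
print(best_k, "nn score:", scores[best_k])
```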

## Decision Tree Classification

Decision trees build a classification or regression model in the form of a tree structure. The algorithm breaks the dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. It can be used for both binary and multiclass problems.

`score: 0.7467532467532467`
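The fitting step sketched on stand-in data (default hyperparameters are an assumption; the real notebook may set a `max_depth` or other options):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42)  # defaults assumed
dt.fit(x_train, y_train)
acc = dt.score(x_test, y_test)
print("score:", acc)
```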

## Random Forest Classification

This approach trains multiple decision trees and averages their results; this average is used to determine the class of a test point. It is one of the ensemble methods, which combine multiple learners to predict the target.

`random forest model score: 0.7207792207792207`
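A sketch of the ensemble on stand-in data; `n_estimators=100` (the number of trees) is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees vote; the majority class becomes the prediction.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, y_train)
acc = rf.score(x_test, y_test)
print("random forest model score:", acc)
```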

## Support Vector Machine(SVM)

SVM is used for both regression and classification problems. The C parameter of the SVM algorithm has a default value of 1. If C is kept very small, it can cause misclassification; if it is too big, it can cause overfitting. As a result, different C values are tried to find the best one.

`Accuracy of SVM: 0.7467532467532467`
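The C search described above can be sketched as a small grid loop on stand-in data (the candidate grid `0.1, 1, 10, 100` is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try several C values and keep the one with the best test accuracy.
results = {}
for C in (0.1, 1.0, 10.0, 100.0):
    svm = SVC(C=C, random_state=42).fit(x_train, y_train)
    results[C] = svm.score(x_test, y_test)
best_C = max(results, key=results.get)
print("Accuracy of SVM:", results[best_C], "at C =", best_C)
```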

## Naive Bayes Classification

This is a probabilistic classifier which applies Bayes' theorem with a strong independence assumption between the features. For each class, it estimates the likelihood of the observed feature values given that class, P(features | class), and predicts the class with the highest posterior probability.

`accuracy of naive bayes: 0.7662337662337663`
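A sketch using Gaussian Naive Bayes (an assumption; it is the usual variant for continuous features like these) on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GaussianNB models each feature as a per-class normal distribution.
nb = GaussianNB().fit(x_train, y_train)
acc = nb.score(x_test, y_test)
print("accuracy of naive bayes:", acc)
```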

## Comparison Using Confusion Matrices

Below I visualize the confusion matrices of all the classifiers.
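A sketch of computing the matrices for a few of the models on stand-in data; for the visual grid, each matrix can then be passed to `sns.heatmap`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data replacing the real diabetes features.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit each classifier and print its confusion matrix as (tn, fp, fn, tp).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "knn (k=3)": KNeighborsClassifier(n_neighbors=3),
    "naive bayes": GaussianNB(),
}
matrices = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    matrices[name] = confusion_matrix(y_test, model.predict(x_test))
    print(name, matrices[name].ravel())
```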

## Conclusion

In the past three weeks' posts we looked at the theoretical side of classification algorithms; here we have applied them. Check out next week's post.