Advanced Classification techniques with Model Prediction Analysis

Machine Learning

Implementing advanced classification techniques for precise model prediction analysis to enhance accuracy and efficiency

Author

Affiliation

Nishant Bharali

Published

November 24, 2023

Advanced Classification techniques with Model Prediction Analysis

Implementing advanced classification techniques for precise model prediction analysis to enhance accuracy and efficiency

Introduction

Machine learning is a fascinating field that empowers computers to learn and make predictions or decisions without being explicitly programmed. One of the fundamental tasks in machine learning is classification, where the goal is to categorize data points into predefined classes or labels. In this blog post, we dive deep into the world of classification, exploring advanced techniques and their application on a real-world dataset.

The Iris dataset is a well-known benchmark in the machine learning community. It consists of measurements of four features from three different species of iris flowers. This seemingly simple dataset serves as an excellent playground for understanding and implementing classification algorithms. However, we won’t stop at the basics; we’ll explore advanced classification techniques, model tuning, and even dive into ensemble methods and neural networks.

Data Loading and Preliminary Analysis

# Importing essential libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load dataset
iris_df = pd.read_csv('Iris_dataset.csv')

# Basic dataset information
print(iris_df.head())
print(iris_df.describe())
print(iris_df.info())

# Visualizing the distribution of classes
sns.countplot(x='species', data=iris_df)
plt.show()

# Pairplot to explore relationships between features
sns.pairplot(iris_df, hue='species')
plt.show()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0           6.370695          2.771952           5.118790          1.542084   
1           5.979286          2.612095           5.086595          1.560228   
2           6.741563          2.804321           4.758669          1.443702   
3           6.346538          2.796799           5.601084          2.114922   
4           5.558280          2.831451           4.876331          2.045125   

      species  
0   virginica  
1  versicolor  
2  versicolor  
3   virginica  
4   virginica  
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count        3000.000000       3000.000000        3000.000000   
mean            5.844003          3.055776           3.756746   
std             0.825203          0.435899           1.761100   
min             4.180767          1.918769           0.907839   
25%             5.118694          2.780525           1.559288   
50%             5.777010          3.025634           4.347984   
75%             6.416866          3.342169           5.098831   
max             7.963257          4.516746           7.046886   

       petal width (cm)  
count       3000.000000  
mean           1.199022  
std            0.760822  
min           -0.048888  
25%            0.310048  
50%            1.326659  
75%            1.818514  
max            2.601413  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  3000 non-null   float64
 1   sepal width (cm)   3000 non-null   float64
 2   petal length (cm)  3000 non-null   float64
 3   petal width (cm)   3000 non-null   float64
 4   species            3000 non-null   object 
dtypes: float64(4), object(1)
memory usage: 117.3+ KB
None

Data Preprocessing

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encoding categorical data
encoder = LabelEncoder()
iris_df['species'] = encoder.fit_transform(iris_df['species'])

# Splitting dataset into features and target variable
X = iris_df.drop('species', axis=1)
y = iris_df['species']

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exploratory Data Analysis (EDA)

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(iris_df.corr(), annot=True, cmap='viridis')
plt.show()

# Advanced pairplot with distribution and regression
sns.pairplot(iris_df, kind='reg', hue='species')
plt.show()

Model Building and Evaluation For Classification

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Function to train and evaluate models
def train_evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f'Model: {model.__class__.__name__}')
    print(classification_report(y_test, predictions))
    # Confusion matrix
    cm = confusion_matrix(y_test, predictions)
    sns.heatmap(cm, annot=True)
    plt.show()

# Decision Tree Classifier
train_evaluate_model(DecisionTreeClassifier(), X_train, y_train, X_test, y_test)

# RandomForestClassifier
train_evaluate_model(RandomForestClassifier(), X_train, y_train, X_test, y_test)

# GradientBoostingClassifier
train_evaluate_model(GradientBoostingClassifier(), X_train, y_train, X_test, y_test)

# Support Vector Machine (SVC)
train_evaluate_model(SVC(), X_train, y_train, X_test, y_test)

# K-Nearest Neighbors (KNN)
train_evaluate_model(KNeighborsClassifier(), X_train, y_train, X_test, y_test)

# Logistic Regression
train_evaluate_model(LogisticRegression(), X_train, y_train, X_test, y_test)

Model: DecisionTreeClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.98      0.99       299
           2       0.98      0.99      0.99       312

    accuracy                           0.99       900
   macro avg       0.99      0.99      0.99       900
weighted avg       0.99      0.99      0.99       900

Model: RandomForestClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.99      0.99       299
           2       0.99      0.99      0.99       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

Model: GradientBoostingClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.99      0.99       299
           2       0.99      0.99      0.99       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

Model: SVC
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.96      0.98       299
           2       0.97      0.99      0.98       312

    accuracy                           0.98       900
   macro avg       0.99      0.98      0.98       900
weighted avg       0.98      0.98      0.98       900

Model: KNeighborsClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       1.00      1.00      1.00       299
           2       1.00      1.00      1.00       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

Model: LogisticRegression
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.98      0.95      0.97       299
           2       0.96      0.98      0.97       312

    accuracy                           0.98       900
   macro avg       0.98      0.98      0.98       900
weighted avg       0.98      0.98      0.98       900

In this code, we introduce different types of classification models, including Random Forest Classifier, Gradient Boosting Classifier, Support Vector Classifier (SVC), K-Nearest Neighbors Classifier (KNN), and Logistic Regression.

For each model, we train it on the training data and evaluate its performance using accuracy and a classification report that includes precision, recall, and F1-score. This allows you to compare the performance of various classification algorithms on the Iris dataset.

Advanced Model Tuning and Analysis

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for RandomForestClassifier
param_grid = {'n_estimators': [10, 50, 100], 'max_features': ['sqrt', 'log2', None]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_

# Evaluating the tuned model
train_evaluate_model(best_rf, X_train, y_train, X_test, y_test)

Model: RandomForestClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.99      0.99       299
           2       0.99      0.99      0.99       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

Hyperparameter Tuning for the Classification Model

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Proper Hyperparameter tuning for Random Forest Classifier
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search_rf = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv=5)
grid_search_rf.fit(X_train, y_train)

best_rf_classifier = grid_search_rf.best_estimator_

# Evaluating the tuned Random Forest Classifier
tuned_rf_predictions = best_rf_classifier.predict(X_test)
print("Tuned Random Forest Classifier - Model Evaluation")
print("Accuracy:", accuracy_score(y_test, tuned_rf_predictions))
print("Classification Report:")
print(classification_report(y_test, tuned_rf_predictions))

Tuned Random Forest Classifier - Model Evaluation
Accuracy: 0.9955555555555555
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       0.99      0.99      0.99       299
           2       0.99      0.99      0.99       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

ROC Curve Analysis for Multiple Models

Comparing the performance of the various classification models using ROC Curve analysis, here we discuss the plots of the ROC Curve for each model. This will involve calculating the True Positive Rate (TPR) and False Positive Rate (FPR) for each model and plotting them.

from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
import matplotlib.pyplot as plt

# Binarize the output classes for ROC analysis
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]

# Splitting the data again for multiclass ROC analysis
X_train, X_test, y_train_bin, y_test_bin = train_test_split(X, y_bin, test_size=0.3, random_state=42)

# Classifier list
classifiers = [
    OneVsRestClassifier(DecisionTreeClassifier()),
    OneVsRestClassifier(RandomForestClassifier()),
    OneVsRestClassifier(GradientBoostingClassifier()),
    OneVsRestClassifier(SVC(probability=True)),
    OneVsRestClassifier(KNeighborsClassifier()),
    OneVsRestClassifier(LogisticRegression())
]

# Plotting ROC Curves
plt.figure(figsize=(10, 8))

# Compute ROC curve and ROC area for each class
for classifier in classifiers:
    classifier.fit(X_train, y_train_bin)
    y_score = classifier.predict_proba(X_test)

    # Compute ROC curve and ROC area for each class
    for i in range(n_classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f'{classifier.estimator.__class__.__name__} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Multi-class')
plt.legend(loc="lower right")
plt.show()

This code will generate ROC curves for each of the classifiers used, providing a visual comparison of their performance in terms of the trade-off between the True Positive Rate and False Positive Rate. The area under the curve (AUC) is also displayed as a measure of the model’s performance, with a higher AUC indicating a better model. This analysis is crucial for understanding the performance of classification models, especially in multi-class settings.

Dimensionality Reduction and Visualization

from sklearn.decomposition import PCA

# PCA for dimensionality reduction
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Visualizing PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis', label='Train set')
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=y_test, cmap='plasma', label='Test set', marker='x')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.title('PCA of Iris Dataset')
plt.show()

Additional Advanced Analysis

1. Cross-Validation

Cross-validation is a technique used to evaluate the generalizability of a model by training and testing it on different subsets of the dataset.

from sklearn.model_selection import cross_val_score

# Example using RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Performing 10-fold cross-validation
cv_scores = cross_val_score(rf_classifier, X, y, cv=10)

print("Cross-Validation Scores for RandomForestClassifier:", cv_scores)
print("Average Score:", np.mean(cv_scores))

Cross-Validation Scores for RandomForestClassifier: [0.99666667 1.         0.99333333 0.99333333 1.         0.99666667
 0.99666667 0.99333333 1.         1.        ]
Average Score: 0.9969999999999999

2. Ensemble Methods

Ensemble methods combine multiple models to improve the overall performance. Here, I will use an ensemble of different classifiers.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Update Logistic Regression in the ensemble
classifiers = [
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('SVC', SVC(probability=True)),
    ('KNN', KNeighborsClassifier()),
    ('Logistic Regression', LogisticRegression(max_iter=1000))
]

ensemble = VotingClassifier(estimators=classifiers, voting='soft')
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)

print("Ensemble Model Classification Report:")
print(classification_report(y_test, ensemble_predictions))

Ensemble Model Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       289
           1       1.00      0.99      0.99       299
           2       0.99      1.00      1.00       312

    accuracy                           1.00       900
   macro avg       1.00      1.00      1.00       900
weighted avg       1.00      1.00      1.00       900

PREDICTION

Prediction: Data Preprocessing with Visualization

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Encoding categorical data
encoder = LabelEncoder()
iris_df['species'] = encoder.fit_transform(iris_df['species'])

# Visualizing the distribution of the target variable
plt.figure(figsize=(8, 5))
sns.countplot(x='species', data=iris_df)
plt.title('Distribution of Target Variable (Species)')
plt.show()

# Splitting dataset into features and target variable
X = iris_df.drop('species', axis=1)
y = iris_df['species']

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Pairplot to explore relationships between features
pairplot_df = iris_df.copy()
pairplot_df['species'] = encoder.inverse_transform(pairplot_df['species'])

plt.figure(figsize=(10, 8))
sns.pairplot(pairplot_df, hue='species')
plt.title('Pairplot of Features with Species as Hue')
plt.show()

<Figure size 960x768 with 0 Axes>

Model Building, Evaluation, and Visualization for Prediction

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Function to train and evaluate regression models
def train_evaluate_regression_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    # Model Evaluation
    print(f'Model: {model.__class__.__name__}')
    print(f'Mean Squared Error: {mean_squared_error(y_test, predictions)}')
    print(f'R-squared (R2) Score: {r2_score(y_test, predictions)}')
    
    # Visualization
    plt.figure(figsize=(8, 5))
    plt.scatter(y_test, predictions)
    plt.xlabel('True Values')
    plt.ylabel('Predictions')
    plt.title(f'{model.__class__.__name__} - True vs. Predicted Values')
    plt.show()

# Linear Regression
train_evaluate_regression_model(LinearRegression(), X_train, y_train, X_test, y_test)

# Decision Tree Regressor
train_evaluate_regression_model(DecisionTreeRegressor(), X_train, y_train, X_test, y_test)

# Random Forest Regressor
train_evaluate_regression_model(RandomForestRegressor(), X_train, y_train, X_test, y_test)

Model: LinearRegression
Mean Squared Error: 0.047703003079178956
R-squared (R2) Score: 0.9284946222241109
Model: DecisionTreeRegressor
Mean Squared Error: 0.0077777777777777776
R-squared (R2) Score: 0.9883413432623143
Model: RandomForestRegressor
Mean Squared Error: 0.006041444444444443
R-squared (R2) Score: 0.990944055102883

Advanced Model Tuning and Analysis for Prediction

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Random Forest Regressor
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_

# Evaluating the tuned model
predictions = best_rf.predict(X_test)
print("Tuned Random Forest Regressor - Model Evaluation")
print(f'Mean Squared Error: {mean_squared_error(y_test, predictions)}')
print(f'R-squared (R2) Score: {r2_score(y_test, predictions)}')

Tuned Random Forest Regressor - Model Evaluation
Mean Squared Error: 0.0072306666666666665
R-squared (R2) Score: 0.9891614464876909

Feature Selection and Importance Analysis

Feature selection is crucial for improving model performance and reducing overfitting. Here, we use techniques like Recursive Feature Elimination (RFE) and feature importance analysis to select the most relevant features for classification or prediction.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Using Recursive Feature Elimination (RFE) with Logistic Regression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)  # Select the top 3 features
fit = rfe.fit(X_train, y_train)

# List of selected features
selected_features = [feature for idx, feature in enumerate(X.columns) if fit.support_[idx]]
print("Selected Features:", selected_features)

Selected Features: ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Additionally, we can analyze feature importance for tree-based models like Random Forest or Gradient Boosting to understand which features contribute the most to predictions.

from sklearn.ensemble import RandomForestClassifier

# Feature Importance Analysis for Random Forest Classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Plot feature importance
feature_importance = pd.Series(rf_model.feature_importances_, index=X.columns)
feature_importance.nlargest(5).plot(kind='barh')
plt.title("Feature Importance (Random Forest)")
plt.show()

Handling Class Imbalance

In real-world datasets, class imbalance is common, where one class has significantly fewer samples than others. Techniques like oversampling, undersampling, and Synthetic Minority Over-sampling Technique (SMOTE) can be employed to address this issue.

from imblearn.over_sampling import SMOTE

# Using SMOTE to handle class imbalance
smote = SMOTE(sampling_strategy='auto')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

Conclusion

In this journey through the Iris dataset and the realm of classification, we’ve covered a wide range of topics. Starting with data loading and preprocessing, we explored the relationships between features, ensuring that we understood our data thoroughly. We then introduced a variety of classification models, from decision trees to support vector machines, and compared their performance using robust evaluation metrics.

But we didn’t stop there. We delved into advanced techniques, including cross-validation to ensure the generalizability of our models, ensemble methods that combined the strengths of multiple classifiers, and even a taste of neural networks for classification tasks.

Our exploration of ROC curves allowed us to visualize and compare the trade-offs between true positive and false positive rates across different models, providing valuable insights into their performance.

In the end, classification is a powerful tool in the machine learning toolkit, with applications ranging from medical diagnosis to spam email filtering. The Iris dataset served as an ideal playground to learn and experiment with these techniques, but the knowledge gained can be applied to more complex and real-world classification problems.

As you continue your journey in machine learning, remember that classification is just the tip of the iceberg. The world of machine learning is vast and ever-evolving, and there are countless exciting challenges and opportunities awaiting those who dare to explore it further.

Reuse

https://creativecommons.org/licenses/by/4.0/