Implementing Decision Trees, Random Forests, XGBoost, and AdaBoost for Weather Prediction in Python
Table of Contents
- Introduction
- Dataset Overview
- Data Preprocessing
- Model Implementation and Evaluation
- Visualizing Decision Regions
- Conclusion
- References
Introduction
Predicting weather conditions is a classic problem in machine learning, offering valuable insights for industries such as agriculture, aviation, and event planning. In this guide, we’ll implement several machine learning models—including Decision Trees, Random Forests, XGBoost, and AdaBoost—to predict whether it will rain tomorrow using the Weather Australia dataset. We’ll walk through data preprocessing, model training, and evaluation, and touch on how these models can be deployed in web applications.
Dataset Overview
The Weather Australia dataset, sourced from Kaggle, contains 24 features related to weather conditions recorded across various locations in Australia. The primary goal is to predict the RainTomorrow attribute, indicating whether it will rain the next day.
Dataset Features
- Date: Observation date.
- Location: Geographical location of the weather station.
- MinTemp: Minimum temperature in °C.
- MaxTemp: Maximum temperature in °C.
- Rainfall: Amount of rainfall in mm.
- Evaporation: Evaporation in mm.
- Sunshine: Number of hours of sunshine.
- WindGustDir: Direction of the strongest wind gust.
- WindGustSpeed: Speed of the strongest wind gust in km/h.
- WindDir9am: Wind direction at 9 AM.
- WindDir3pm: Wind direction at 3 PM.
- …and more.
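Before preprocessing, it helps to take a quick look at the raw data. The snippet below is a minimal sketch that loads the CSV (assuming the Kaggle file name weatherAUS.csv, the same one used in the preprocessing code later) and inspects the column types, target balance, and missing values:

```python
import pandas as pd

# Load the Weather Australia dataset and take a first look
data = pd.read_csv('weatherAUS.csv')
print(data.shape)                           # number of rows and columns
print(data.dtypes)                          # numerical vs. categorical features
print(data['RainTomorrow'].value_counts())  # class balance of the target
print(data.isna().sum().sort_values(ascending=False).head())  # columns with the most missing values
```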
Data Preprocessing
Effective data preprocessing is crucial for building accurate and reliable machine learning models. We’ll cover handling missing values, encoding categorical variables, feature selection, and scaling.
Handling Missing Values
Missing data can significantly impact model performance. We’ll address missing values separately for numerical and categorical data.
Numerical Data
For numerical columns, we’ll use Mean Imputation to fill missing values.
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Load data
data = pd.read_csv('weatherAUS.csv')

# Identify numerical columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns

# Impute missing values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
data[numerical_cols] = imp_mean.fit_transform(data[numerical_cols])
```
Categorical Data
For categorical columns, we’ll use Most Frequent Imputation.
```python
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Impute missing values with the most frequent value
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data[categorical_cols] = imp_freq.fit_transform(data[categorical_cols])
```
Encoding Categorical Variables
Machine learning algorithms require numerical inputs. We’ll employ both Label Encoding and One-Hot Encoding based on the number of unique categories in each feature.
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

def encode_features(df, threshold=10):
    # Low-cardinality categorical columns get label encoding,
    # high-cardinality columns get one-hot encoding
    label_enc_cols = [col for col in df.columns
                      if df[col].dtype == 'object' and df[col].nunique() <= threshold]
    onehot_enc_cols = [col for col in df.columns
                       if df[col].dtype == 'object' and df[col].nunique() > threshold]

    # Label Encoding
    le = LabelEncoder()
    for col in label_enc_cols:
        df[col] = le.fit_transform(df[col])

    # One-Hot Encoding (dense output so the scaler in the next step can handle it;
    # on scikit-learn < 1.2 use sparse=False instead of sparse_output=False)
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(sparse_output=False), onehot_enc_cols)],
        remainder='passthrough')
    df = ct.fit_transform(df)
    return df

# Drop the target and the Date column (nearly every date is unique,
# so one-hot encoding it would add thousands of uninformative columns)
X = data.drop(['RainTomorrow', 'Date'], axis=1)
y = data['RainTomorrow']
X = encode_features(X)
```
Feature Selection
To enhance model performance and reduce computational complexity, we’ll select the top features using the SelectKBest method with the Chi-Squared statistic.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] (the chi-squared test requires non-negative values)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Select the top 10 features
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)

# Indices of the selected columns, then keep only 2 features for visualization
best_features = selector.get_support(indices=True)
X_final = X_selected[:, :2]
```
Train-Test Split and Feature Scaling
Splitting the data into training and testing sets ensures that our model’s performance is evaluated on unseen data.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.20, random_state=1)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Model Implementation and Evaluation
We’ll implement various machine learning models and evaluate their performance using Accuracy Score.
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f'KNN Accuracy: {knn_accuracy:.2f}')
```
KNN Accuracy: 0.80
Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=0, max_iter=200)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Accuracy: {lr_accuracy:.2f}')
```
Logistic Regression Accuracy: 0.83
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
gnb_accuracy = accuracy_score(y_test, y_pred_gnb)
print(f'Gaussian Naive Bayes Accuracy: {gnb_accuracy:.2f}')
```
Gaussian Naive Bayes Accuracy: 0.80
Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print(f'SVM Accuracy: {svm_accuracy:.2f}')
```
SVM Accuracy: 0.83
Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
dtc_accuracy = accuracy_score(y_test, y_pred_dtc)
print(f'Decision Tree Accuracy: {dtc_accuracy:.2f}')
```
Decision Tree Accuracy: 0.83
Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=5)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')
```
Random Forest Accuracy: 0.83
XGBoost and AdaBoost
While the implementation above doesn’t cover XGBoost and AdaBoost, these ensemble methods can further enhance model performance. Here’s a brief example of each:
XGBoost
```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# XGBoost expects numeric class labels, so encode the 'Yes'/'No' target first
le_y = LabelEncoder()
y_train_enc = le_y.fit_transform(y_train)
y_test_enc = le_y.transform(y_test)

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train_enc)
y_pred_xgb = xgb.predict(X_test)
xgb_accuracy = accuracy_score(y_test_enc, y_pred_xgb)
print(f'XGBoost Accuracy: {xgb_accuracy:.2f}')
```
AdaBoost
```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
ada_accuracy = accuracy_score(y_test, y_pred_ada)
print(f'AdaBoost Accuracy: {ada_accuracy:.2f}')
```
Note: Ensure you have the xgboost library installed: `pip install xgboost`.
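To compare the models at a glance, the accuracy scores computed above can be collected into a small summary table. This is a minimal sketch that simply reuses the accuracy variables defined in the previous sections:

```python
import pandas as pd

# Gather the accuracy scores computed above into one summary table
results = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Gaussian Naive Bayes', 'SVM',
              'Decision Tree', 'Random Forest', 'XGBoost', 'AdaBoost'],
    'Accuracy': [knn_accuracy, lr_accuracy, gnb_accuracy, svm_accuracy,
                 dtc_accuracy, rf_accuracy, xgb_accuracy, ada_accuracy],
})
print(results.sort_values('Accuracy', ascending=False).to_string(index=False))
```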
Visualizing Decision Regions
Visualizing decision boundaries helps in understanding how different models classify the data. Below is an example using the Iris dataset:
```python
# Requires the mlxtend package: pip install mlxtend
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset, keeping the first two features so the regions can be drawn in 2D
iris = datasets.load_iris()
X_vis = iris.data[:, :2]
y_vis = iris.target

# Train KNN
knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_vis, y_vis)

# Plot decision regions
plot_decision_regions(X_vis, y_vis, clf=knn_vis)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Decision Regions')
plt.legend()
plt.show()
```
Visualization Output: A plot showcasing the decision boundaries created by the KNN classifier.
Conclusion
In this guide, we’ve explored the implementation of various machine learning models—Decision Trees, Random Forests, Logistic Regression, KNN, Gaussian Naive Bayes, and SVM—for predicting weather conditions using the Weather Australia dataset. Each model showcased competitive accuracy scores, with Logistic Regression, SVM, Decision Trees, and Random Forests achieving approximately 83% accuracy.
For enhanced performance, ensemble methods like XGBoost and AdaBoost can be integrated. Additionally, deploying these models into web applications can provide real-time weather predictions, making the insights actionable for end-users.
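As a starting point for such a deployment, the sketch below persists the trained Random Forest and the fitted StandardScaler with joblib and exposes a prediction endpoint with Flask. The route name, file names, and JSON format are illustrative assumptions, and a production setup should persist the full preprocessing pipeline (imputation, encoding, feature selection) rather than only the final scaler:

```python
import joblib
import numpy as np
from flask import Flask, request, jsonify

# Persist the trained model and scaler from the sections above (file names are illustrative)
joblib.dump(rf, 'rain_model.joblib')
joblib.dump(scaler, 'rain_scaler.joblib')

app = Flask(__name__)
model = joblib.load('rain_model.joblib')
feature_scaler = joblib.load('rain_scaler.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body such as {"features": [0.4, 0.7]} holding the two selected features
    payload = request.get_json()
    features = np.array(payload['features'], dtype=float).reshape(1, -1)
    features = feature_scaler.transform(features)
    prediction = model.predict(features)[0]
    return jsonify({'RainTomorrow': str(prediction)})

if __name__ == '__main__':
    app.run(debug=True)
```

With the server running, a request such as `curl -X POST -H "Content-Type: application/json" -d '{"features": [0.4, 0.7]}' http://localhost:5000/predict` would return the predicted RainTomorrow label.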