Implementing Decision Trees, Random Forests, XGBoost, and AdaBoost for Weather Prediction in Python
Table of Contents
- Introduction
- Dataset Overview
- Data Preprocessing
- Model Implementation and Evaluation
- Visualizing Decision Regions
- Conclusion
- References
Introduction
Predicting weather conditions is a classic problem in machine learning, offering valuable insights for industries such as agriculture, aviation, and event planning. In this guide, we’ll implement several machine learning models—including Decision Trees, Random Forests, XGBoost, and AdaBoost—to predict whether it will rain tomorrow using the Weather Australia dataset. We’ll walk through data preprocessing, model training, and evaluation, and touch on how these models can be deployed in web applications.
Dataset Overview
The Weather Australia dataset, sourced from Kaggle, contains 24 features related to weather conditions recorded across various locations in Australia. The primary goal is to predict the RainTomorrow attribute, indicating whether it will rain the next day.
Dataset Features
- Date: Observation date.
- Location: Geographical location of the weather station.
- MinTemp: Minimum temperature in °C.
- MaxTemp: Maximum temperature in °C.
- Rainfall: Amount of rainfall in mm.
- Evaporation: Evaporation in mm.
- Sunshine: Number of hours of sunshine.
- WindGustDir: Direction of the strongest wind gust.
- WindGustSpeed: Speed of the strongest wind gust in km/h.
- WindDir9am: Wind direction at 9 AM.
- WindDir3pm: Wind direction at 3 PM.
- …and more.
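Before preprocessing, it helps to take a quick look at the raw data. The snippet below is a minimal sketch that loads the CSV (assuming the Kaggle file name weatherAUS.csv, the same one used in the preprocessing code later) and inspects the column types, target balance, and missing values:

```python
import pandas as pd

# Load the Weather Australia dataset and take a first look
data = pd.read_csv('weatherAUS.csv')
print(data.shape)                           # number of rows and columns
print(data.dtypes)                          # numerical vs. categorical features
print(data['RainTomorrow'].value_counts())  # class balance of the target
print(data.isna().sum().sort_values(ascending=False).head())  # columns with the most missing values
```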
Data Preprocessing
Effective data preprocessing is crucial for building accurate and reliable machine learning models. We’ll cover handling missing values, encoding categorical variables, feature selection, and scaling.
Handling Missing Values
Missing data can significantly impact model performance. We’ll address missing values separately for numerical and categorical data.
Numerical Data
For numerical columns, we’ll use Mean Imputation to fill missing values.
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Load data
data = pd.read_csv('weatherAUS.csv')

# Identify numerical columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns

# Impute missing values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
data[numerical_cols] = imp_mean.fit_transform(data[numerical_cols])
```
Categorical Data
For categorical columns, we’ll use Most Frequent Imputation.
```python
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Impute missing values with the most frequent value
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data[categorical_cols] = imp_freq.fit_transform(data[categorical_cols])
```
Encoding Categorical Variables
Machine learning algorithms require numerical inputs. We’ll employ both Label Encoding and One-Hot Encoding based on the number of unique categories in each feature.
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

def encode_features(df, threshold=10):
    # Low-cardinality categorical columns get label encoding,
    # high-cardinality columns get one-hot encoding
    label_enc_cols = [col for col in df.columns
                      if df[col].dtype == 'object' and df[col].nunique() <= threshold]
    onehot_enc_cols = [col for col in df.columns
                       if df[col].dtype == 'object' and df[col].nunique() > threshold]

    # Label Encoding
    le = LabelEncoder()
    for col in label_enc_cols:
        df[col] = le.fit_transform(df[col])

    # One-Hot Encoding (dense output so the scaler in the next step can handle it;
    # on scikit-learn < 1.2 use sparse=False instead of sparse_output=False)
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(sparse_output=False), onehot_enc_cols)],
        remainder='passthrough')
    df = ct.fit_transform(df)
    return df

# Drop the target and the Date column (nearly every date is unique,
# so one-hot encoding it would add thousands of uninformative columns)
X = data.drop(['RainTomorrow', 'Date'], axis=1)
y = data['RainTomorrow']
X = encode_features(X)
```
Feature Selection
To enhance model performance and reduce computational complexity, we’ll select the top features using the SelectKBest method with the Chi-Squared statistic.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] (the chi-squared test requires non-negative values)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Select the top 10 features
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)

# Indices of the selected columns, then keep only 2 features for visualization
best_features = selector.get_support(indices=True)
X_final = X_selected[:, :2]
```
Train-Test Split and Feature Scaling
Splitting the data into training and testing sets ensures that our model’s performance is evaluated on unseen data.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.20, random_state=1)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Model Implementation and Evaluation
We’ll implement various machine learning models and evaluate their performance using Accuracy Score.
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f'KNN Accuracy: {knn_accuracy:.2f}')
```
KNN Accuracy: 0.80
Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=0, max_iter=200)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Accuracy: {lr_accuracy:.2f}')
```
Logistic Regression Accuracy: 0.83
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
gnb_accuracy = accuracy_score(y_test, y_pred_gnb)
print(f'Gaussian Naive Bayes Accuracy: {gnb_accuracy:.2f}')
```
Gaussian Naive Bayes Accuracy: 0.80
Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print(f'SVM Accuracy: {svm_accuracy:.2f}')
```
SVM Accuracy: 0.83
Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
dtc_accuracy = accuracy_score(y_test, y_pred_dtc)
print(f'Decision Tree Accuracy: {dtc_accuracy:.2f}')
```
Decision Tree Accuracy: 0.83
Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=5)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')
```
Random Forest Accuracy: 0.83
XGBoost and AdaBoost
While the implementation above doesn’t cover XGBoost and AdaBoost, these ensemble methods can further enhance model performance. Here’s a brief example of each:
XGBoost
```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# XGBoost expects numeric class labels, so encode the 'Yes'/'No' target first
le_y = LabelEncoder()
y_train_enc = le_y.fit_transform(y_train)
y_test_enc = le_y.transform(y_test)

xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train_enc)
y_pred_xgb = xgb.predict(X_test)
xgb_accuracy = accuracy_score(y_test_enc, y_pred_xgb)
print(f'XGBoost Accuracy: {xgb_accuracy:.2f}')
```
AdaBoost
```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
ada_accuracy = accuracy_score(y_test, y_pred_ada)
print(f'AdaBoost Accuracy: {ada_accuracy:.2f}')
```
Note: Ensure you have the xgboost library installed: `pip install xgboost`.
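To compare the models at a glance, the accuracy scores computed above can be collected into a small summary table. This is a minimal sketch that simply reuses the accuracy variables defined in the previous sections:

```python
import pandas as pd

# Gather the accuracy scores computed above into one summary table
results = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Gaussian Naive Bayes', 'SVM',
              'Decision Tree', 'Random Forest', 'XGBoost', 'AdaBoost'],
    'Accuracy': [knn_accuracy, lr_accuracy, gnb_accuracy, svm_accuracy,
                 dtc_accuracy, rf_accuracy, xgb_accuracy, ada_accuracy],
})
print(results.sort_values('Accuracy', ascending=False).to_string(index=False))
```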
Visualizing Decision Regions
Visualizing decision boundaries helps in understanding how different models classify the data. Below is an example using the Iris dataset:
```python
# Requires the mlxtend package: pip install mlxtend
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset, keeping the first two features so the regions can be drawn in 2D
iris = datasets.load_iris()
X_vis = iris.data[:, :2]
y_vis = iris.target

# Train KNN
knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_vis, y_vis)

# Plot decision regions
plot_decision_regions(X_vis, y_vis, clf=knn_vis)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Decision Regions')
plt.legend()
plt.show()
```
Visualization Output: A plot showcasing the decision boundaries created by the KNN classifier.
Conclusion
In this guide, we’ve explored the implementation of various machine learning models—Decision Trees, Random Forests, Logistic Regression, KNN, Gaussian Naive Bayes, and SVM—for predicting weather conditions using the Weather Australia dataset. Each model showcased competitive accuracy scores, with Logistic Regression, SVM, Decision Trees, and Random Forests achieving approximately 83% accuracy.
For enhanced performance, ensemble methods like XGBoost and AdaBoost can be integrated. Additionally, deploying these models into web applications can provide real-time weather predictions, making the insights actionable for end-users.
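As a starting point for such a deployment, the sketch below persists the trained Random Forest and the fitted StandardScaler with joblib and exposes a prediction endpoint with Flask. The route name, file names, and JSON format are illustrative assumptions, and a production setup should persist the full preprocessing pipeline (imputation, encoding, feature selection) rather than only the final scaler:

```python
import joblib
import numpy as np
from flask import Flask, request, jsonify

# Persist the trained model and scaler from the sections above (file names are illustrative)
joblib.dump(rf, 'rain_model.joblib')
joblib.dump(scaler, 'rain_scaler.joblib')

app = Flask(__name__)
model = joblib.load('rain_model.joblib')
feature_scaler = joblib.load('rain_scaler.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body such as {"features": [0.4, 0.7]} holding the two selected features
    payload = request.get_json()
    features = np.array(payload['features'], dtype=float).reshape(1, -1)
    features = feature_scaler.transform(features)
    prediction = model.predict(features)[0]
    return jsonify({'RainTomorrow': str(prediction)})

if __name__ == '__main__':
    app.run(debug=True)
```

With the server running, a request such as `curl -X POST -H "Content-Type: application/json" -d '{"features": [0.4, 0.7]}' http://localhost:5000/predict` would return the predicted RainTomorrow label.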