A Comprehensive Guide to Data Preprocessing for Classification Problems in Machine Learning

Table of Contents

Introduction to Classification Problems
Data Import and Overview
Handling Missing Data
- a. Numerical Data
- b. Categorical Data
Encoding Categorical Variables
- a. Label Encoding
- b. One-Hot Encoding
Feature Selection
Train-Test Split
Feature Scaling
Conclusion

Introduction to Classification Problems

Classification is a supervised learning technique used to predict categorical labels. It assigns input data to predefined classes based on patterns learned from historical data. Classification models range from simple algorithms such as logistic regression to more complex ones such as random forests and neural networks. Their success depends not only on the chosen algorithm but also, to a large extent, on how the data is prepared and preprocessed.

Data Import and Overview

Before diving into preprocessing, it is essential to import and understand the dataset. For this guide, we will use the WeatherAUS dataset from Kaggle, which contains daily weather observations from across Australia.

# Importing necessary libraries
import pandas as pd 
import seaborn as sns

# Loading the dataset
data = pd.read_csv('weatherAUS.csv')

# Displaying the last five rows of the dataset
data.tail()

Output:

           Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
142188 2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E          31.0        ESE  ...        27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No
142189 2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E          31.0          SE  ...        24.0       1024.6       1020.3       NaN       NaN     10.1     22.4         No      0.0            No
142190 2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW          22.0          SE  ...        21.0       1023.5       1019.1       NaN       NaN     10.9     24.5         No      0.0            No
142191 2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N          37.0          SE  ...        24.0       1021.0       1016.8       NaN       NaN     12.5     26.1         No      0.0            No
142192 2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE          28.0         SSE  ...        24.0       1019.4       1016.5         3.0         2.0     15.1     26.0         No      0.0            No

[5 rows x 24 columns]

The dataset includes temperature, rainfall, humidity, wind speed, and various other features that are important for predicting whether it will rain tomorrow (RainTomorrow).
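
The snippets in the following sections operate on a feature matrix X and a target y, which this guide does not define explicitly. A minimal sketch of one way to obtain them, assuming the target RainTomorrow is the last column and every other column is treated as a feature (the toy frame below is a hypothetical stand-in for the real CSV):

```python
import pandas as pd

# Tiny stand-in for the weather table; the real data comes from weatherAUS.csv
data = pd.DataFrame({
    "MinTemp": [3.5, 2.8, 3.6],
    "RainToday": ["No", "No", "Yes"],
    "RainTomorrow": ["No", "Yes", "No"],
})

# Assumption: the target is the last column, every other column is a feature
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1]

print(X.shape, y.shape)
```

Applied to the full 24-column table, the same slicing would leave 23 feature columns, which agrees with the shapes printed later in this guide.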

Handling Missing Data

Real-world datasets often come with missing or incomplete data. Handling these gaps is important for ensuring model reliability. We will divide missing data into two categories: numerical and categorical.

a. Numerical Data

For numerical features, a common strategy is to replace missing values with a statistical measure such as the mean, median, or mode. Here, we will use the mean to fill missing values.

import numpy as np
from sklearn.impute import SimpleImputer

# Identifying numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initializing the imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the imputer on numerical columns
imp_mean.fit(X.iloc[:, numerical_cols])

# Transforming the data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
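
To make the effect of mean imputation concrete, here is a small sketch on a hypothetical toy frame (X_demo and its values are made up for illustration). It also shows select_dtypes as an alternative way to pick numerical columns by name rather than by position:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a gap in each numeric column
X_demo = pd.DataFrame({
    "MinTemp": [2.0, np.nan, 6.0],
    "Rainfall": [0.0, 3.0, np.nan],
    "Location": ["Uluru", "Uluru", "Perth"],
})

# select_dtypes picks the numeric columns by name
num_cols = X_demo.select_dtypes(include="number").columns

# Each gap is replaced by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_demo[num_cols] = imputer.fit_transform(X_demo[num_cols])

print(X_demo)
```

The missing MinTemp becomes 4.0 (the mean of 2.0 and 6.0) and the missing Rainfall becomes 1.5, while the string column is left untouched.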

b. Categorical Data

For categorical features, the most frequent value (the mode) is a suitable replacement for missing data.

# Identifying categorical (object-dtype) columns by position
# (np.object was removed in NumPy 1.24; the builtin object is equivalent)
string_cols = list(np.where(X.dtypes == object)[0])

# Initializing the imputer with most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the imputer on categorical columns
imp_freq.fit(X.iloc[:, string_cols])

# Transforming the data
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
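
The same idea can be checked on a small example. This is only a sketch on hypothetical data (X_demo is made up, and its columns are given object dtype explicitly), showing which value the most_frequent strategy picks for each column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns, each with one missing entry
X_demo = pd.DataFrame({
    "WindGustDir": ["E", "E", np.nan, "N"],
    "RainToday": ["No", np.nan, "No", "Yes"],
}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(X_demo)

# statistics_ holds the fill value learned for each column
print(imputer.statistics_.tolist())
print(filled.tolist())
```

Here the gaps are filled with "E" and "No", the modes of their respective columns.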

Encoding Categorical Variables

Machine learning models require numerical input, so categorical variables must be converted to a numerical format. We can do this with label encoding and one-hot encoding.

a. Label Encoding

Label encoding assigns a unique integer to each distinct category in a feature. It is simple, but it can introduce ordinal relationships that do not actually exist.

from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    return le.fit_transform(series) 

# Encoding the target variable
y = LabelEncoderMethod(y)
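
As a quick illustration of what the encoder produces, the following sketch encodes a hypothetical list of target labels and then inspects the learned mapping. LabelEncoder sorts the classes, so "No" maps to 0 and "Yes" to 1:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values
labels = ["No", "Yes", "No", "No"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_.tolist())                       # ['No', 'Yes']
print(encoded.tolist())                           # [0, 1, 0, 0]
print(le.inverse_transform(encoded).tolist())     # ['No', 'Yes', 'No', 'No']
```

Keeping a reference to the fitted encoder, rather than discarding it inside a helper, makes it possible to map predictions back to the original labels with inverse_transform.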

b. One-Hot Encoding

One-hot encoding creates a binary column for each category, which avoids implying an order and ensures that each category is treated as distinct.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)
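
The following sketch shows the shape of the result on a hypothetical two-column frame. The column names are made up, and sparse_threshold=0 is an added setting (not used in the guide's helper) that forces a dense array so the output is easy to read:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: one categorical column (index 0) and one numeric column
X_demo = pd.DataFrame({
    "WindDir": ["E", "N", "SE", "E"],
    "MinTemp": [3.5, 2.8, 3.6, 5.4],
})

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,   # assumption for readability: always return a dense array
)
encoded = ct.fit_transform(X_demo)

print(encoded.shape)
print(encoded[0].tolist())
```

The three wind directions become three binary columns (in sorted order E, N, SE), followed by the untouched MinTemp column, so the first row reads 1, 0, 0, 3.5.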

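Encoding Selection for Features

The choice between label encoding and one-hot encoding can be made from the number of unique categories in each column. The helper here label-encodes columns that have exactly two categories or more categories than a threshold, and one-hot encodes the rest.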

def EncodingSelection(X, threshold=10):
    # Step 1: Select the string (object-dtype) columns by position
    # (np.object was removed in NumPy 1.24; the builtin object is equivalent)
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []

    # Step 2: Label-encode binary and high-cardinality columns;
    # mark the remaining columns for One-Hot Encoding
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 3: Apply One-Hot Encoding where necessary
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Applying encoding selection and checking the resulting shape
X = EncodingSelection(X)
print(X.shape)

Output:

(142193, 23)

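The feature matrix still has 23 columns after this step. Encoding converts the string columns to numbers but does not remove any features; the unchanged column count indicates that, with the default threshold, no column was expanded by one-hot encoding.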
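
To see which branch each column would take before running the encoders, the decision rule can be previewed on its own. This is a sketch only: plan_encoding is a hypothetical helper, not part of the guide's code, and the toy frame and threshold are chosen purely for illustration:

```python
import pandas as pd

def plan_encoding(X, threshold=10):
    """Return {column: 'label' or 'one-hot'} using the rule described above."""
    plan = {}
    object_cols = X.columns[(X.dtypes == object).to_numpy()]
    for col in object_cols:
        n_unique = X[col].nunique(dropna=False)
        # Binary or high-cardinality columns are label-encoded
        plan[col] = "label" if n_unique == 2 or n_unique > threshold else "one-hot"
    return plan

# Hypothetical frame: 2, 3 and 4 distinct categories plus one numeric column
X_demo = pd.DataFrame({
    "RainToday": pd.Series(["No", "Yes", "No", "No"], dtype=object),
    "Season": pd.Series(["summer", "winter", "spring", "summer"], dtype=object),
    "Location": pd.Series(["A", "B", "C", "D"], dtype=object),
    "MinTemp": [3.5, 2.8, 3.6, 5.4],
})

plan = plan_encoding(X_demo, threshold=3)
print(plan)
```

With a threshold of 3, RainToday (two categories) and Location (four categories) are label-encoded, while Season (three categories) is marked for one-hot encoding.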

Feature Selection

Not all features contribute equally to the prediction task. Feature selection helps identify and retain the most informative features, which can improve model performance and reduce computational overhead.

from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Number of features to keep
k = 13

# chi2 requires non-negative inputs, so scale features to [0, 1] first
MMS = preprocessing.MinMaxScaler()
x_scaled = MMS.fit_transform(X)

# Scoring every feature against the target with the chi-squared statistic
kbest = SelectKBest(score_func=chi2, k=k)
kbest.fit(x_scaled, y)

# Ranking features by score: the last k indices are kept, the rest are dropped
ranked = np.argsort(kbest.scores_)
best_features = ranked[-k:]
features_to_delete = ranked[:-k]

# Dropping the least important features
X = np.delete(X, features_to_delete, axis=1)

# Verifying the new shape
print(X.shape)

Output:

(142193, 13)

This process reduces the feature set from 23 to 13, focusing on the features that are most influential for our classification task.
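
An alternative to sorting the scores by hand is to ask the fitted selector which columns it kept. The sketch below uses synthetic data (one column built from the target plus three noise columns, all hypothetical) and the get_support method of SelectKBest:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data: column 1 follows the target closely, the others are noise
y_demo = rng.integers(0, 2, size=200)
informative = y_demo + rng.normal(0, 0.1, size=200)
noise = rng.normal(0, 1, size=(200, 3))
X_demo = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# chi2 needs non-negative values, hence the min-max scaling
X_scaled = MinMaxScaler().fit_transform(X_demo)

selector = SelectKBest(score_func=chi2, k=1).fit(X_scaled, y_demo)
kept = selector.get_support(indices=True)   # indices of the selected columns
X_selected = X_demo[:, kept]

print(kept, X_selected.shape)
```

The selector keeps column 1, the one constructed from the target. Indexing the original matrix with the kept indices has the same effect as deleting the low-scoring columns.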

Train-Test Split

To evaluate the performance of our classification model, we need to split the dataset into training and testing subsets.

from sklearn.model_selection import train_test_split

# Splitting the data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Displaying the shape of training data
print(X_train.shape)

Output:

(113754, 13)
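
When the classes are imbalanced, it can help to keep the class proportions identical in both subsets. The guide's split does not do this; the sketch below is an optional variation that passes the stratify argument of train_test_split, shown on hypothetical data with 20% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 40 negatives and 10 positives
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

# Both subsets keep the original 20% share of positives
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratification the 10 test rows contain exactly 2 positives and the 40 training rows contain 8, so the positive rate is 0.2 in each subset.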

Feature Scaling

Feature scaling puts all features on a comparable scale so that none dominates the result simply because of its units. It is especially important for algorithms that are sensitive to feature magnitude, such as support vector machines or K-nearest neighbors.

Standardization

Standardization rescales the data so that it has a mean of zero and a standard deviation of one.

from sklearn import preprocessing

# Initializing the StandardScaler
sc = preprocessing.StandardScaler(with_mean=False)

# Fitting the scaler on training data
sc.fit(X_train)

# Transforming both training and testing data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# Verifying the shape after scaling
print(X_train.shape)
print(X_test.shape)

Output:

(113754, 13)
(28439, 13)

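Note: The parameter with_mean=False is used to avoid problems with the sparse data matrix that one-hot encoding can produce. With this setting the features are scaled to unit variance but are not centered, so their means are not shifted to zero.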
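
The difference between the two settings is easy to verify on a single hypothetical column. Standardization computes z = (x - mean) / std; with with_mean=False only the division by the standard deviation is applied:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training data
X_tr = np.array([[1.0], [2.0], [3.0]])

# Default: subtract the mean, then divide by the standard deviation
centered = StandardScaler().fit_transform(X_tr)

# with_mean=False: divide by the standard deviation only
scaled_only = StandardScaler(with_mean=False).fit_transform(X_tr)

print(centered.ravel(), centered.mean(), centered.std())
print(scaled_only.ravel(), scaled_only.mean(), scaled_only.std())
```

Both results have a standard deviation of one, but only the default setting produces a mean of zero; the with_mean=False result keeps a positive mean because the values are never shifted.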

Conclusion

Data preprocessing is a critical step in building robust and accurate classification models. By systematically handling missing data, encoding categorical variables, selecting relevant features, and scaling, we lay a solid foundation for any machine learning model. This guide has presented a practical approach using Python and its libraries, so that your classification problems are well prepared for model training and evaluation. Remember, the adage "garbage in, garbage out" holds true in machine learning; time invested in data preprocessing pays off in model performance.

Keywords: classification problems, data preprocessing, machine learning, data cleaning, feature selection, label encoding, one-hot encoding, feature scaling, Python, Pandas, Scikit-learn, classification models