S18L08 – Short Discussion

A Comprehensive Guide to Data Preprocessing for Classification Problems in Machine Learning

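Table of Contents

    Introduction to Classification Problems
    Importing and Previewing the Data
    Handling Missing Data
        A. Numerical Data
        B. Categorical Data
    Encoding Categorical Variables
        A. Label Encoding
        B. One-Hot Encoding
    Feature Selection
    Train-Test Split
    Feature Scaling
    Conclusion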




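Introduction to Classification Problems

Classification is a supervised learning technique for predicting categorical labels: it assigns each input to one of a set of predefined categories based on historical data. Classification models range from simple algorithms such as logistic regression to more complex ones such as random forests and neural networks. How well they perform depends not only on the algorithm chosen but also, to a large degree, on how the data is prepared and preprocessed.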

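Importing and Previewing the Data

Before diving into preprocessing, it is important to import the dataset and understand what it contains. This guide uses the WeatherAUS dataset from Kaggle, which contains daily weather observations from across Australia.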





		
		
			
			
# Importing necessary libraries
import pandas as pd
import seaborn as sns

# Loading the dataset
data = pd.read_csv('weatherAUS.csv')

# Displaying the last five rows of the dataset
data.tail()
					
				
			
		



Output:

           Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
142188 2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E          31.0        ESE  ...        27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No
142189 2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E          31.0          SE  ...        24.0       1024.6       1020.3       NaN       NaN     10.1     22.4         No      0.0            No
142190 2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW          22.0          SE  ...        21.0       1023.5       1019.1       NaN       NaN     10.9     24.5         No      0.0            No
142191 2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N          37.0          SE  ...        24.0       1021.0       1016.8       NaN       NaN     12.5     26.1         No      0.0            No
142192 2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE          28.0         SSE  ...        24.0       1019.4       1016.5         3.0         2.0     15.1     26.0         No      0.0            No

[5 rows x 24 columns]
					
				
			
		



The dataset contains features such as temperature, rainfall, humidity, and wind speed, which are used to predict whether it will rain the next day (RainTomorrow).
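The later steps refer to a feature table `X` and a target `y` that the text does not define. A minimal sketch of one way to create them, assuming the target is the `RainTomorrow` column and every other column is a feature. The tiny DataFrame below is a made-up stand-in for `data`, so the snippet runs without the CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `data` loaded from weatherAUS.csv
data = pd.DataFrame({
    "MinTemp": [3.5, 2.8, np.nan, 5.4],
    "WindGustDir": ["E", "E", None, "N"],
    "RainToday": ["No", "No", "Yes", "No"],
    "RainTomorrow": ["No", "Yes", "No", "No"],
})

# Assumption: the target is RainTomorrow and every other column is a feature
X = data.drop(columns=["RainTomorrow"]).copy()
y = data["RainTomorrow"]

# Count missing values per feature to see which columns need imputing
missing_per_column = X.isna().sum()
print(X.shape, y.shape)
print(missing_per_column)
```

Taking a copy of the feature columns keeps later in-place edits to `X` from touching `data`.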

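Handling Missing Data

Real-world datasets often contain missing or incomplete data. Handling these gaps is important for model reliability. We treat missing data in two groups: numerical and categorical.

A. Numerical Data

For numerical features, a common strategy is to replace missing values with a statistic such as the mean, median, or mode. Here we use the mean.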





		
		
			
			
import numpy as np
from sklearn.impute import SimpleImputer

# X is assumed to be a pandas DataFrame holding the feature columns

# Identifying numerical columns (positions of int64 and float64 columns)
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initializing the imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the imputer on numerical columns
imp_mean.fit(X.iloc[:, numerical_cols])

# Transforming the data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
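The choice of statistic matters when a column is skewed. The toy example below (made-up values, not taken from the dataset) fills the same gap with the mean and with the median to show how a single outlier pulls the mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value and one outlier (values are made up)
rainfall = np.array([[0.0], [0.2], [0.4], [np.nan], [50.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(rainfall)
median_filled = SimpleImputer(strategy="median").fit_transform(rainfall)

print(mean_filled[3, 0])    # mean of 0.0, 0.2, 0.4 and 50.0
print(median_filled[3, 0])  # median of the same four values
```

Here the mean-filled value is far larger than most observations, while the median stays close to the typical ones, so the median may be the safer default for heavily skewed columns.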
					
				
			
		



B. Categorical Data

For categorical features, the most frequent value (the mode) is a suitable replacement for missing data.





		
		
			
			
# Identifying categorical columns (pandas stores strings with the object dtype)
string_cols = list(np.where((X.dtypes == object))[0])

# Initializing the imputer with most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the imputer on categorical columns
imp_freq.fit(X.iloc[:, string_cols])

# Transforming the data
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
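To see what the most-frequent strategy does, here is a small self-contained sketch on a made-up wind-direction column. The column is built with an explicit object dtype so that missing entries are plain `np.nan`, as in the guide's code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy wind-direction column with two missing entries (values are made up)
wind = pd.DataFrame({
    "WindGustDir": pd.Series(["E", "SE", "E", np.nan, "N", np.nan], dtype=object)
})

imp_freq = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imp_freq.fit_transform(wind)

print(imp_freq.statistics_)  # the fill value learned for each column
print(filled.ravel())        # the column with gaps replaced
```

The learned fill value is kept in `statistics_`, so the same imputer can later be applied to new data with `transform`.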
					
				
			
		



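Encoding Categorical Variables

Most machine learning models require numerical input, so categorical variables must be converted to a numerical form. Label encoding and one-hot encoding are two ways to do this.

A. Label Encoding

Label encoding assigns a unique integer to each distinct category. It is simple, but it can introduce an ordinal relationship that does not exist in the data.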





		
		
			
			
from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    return le.fit_transform(series)

# Encoding the target variable
y = LabelEncoderMethod(y)
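The artificial ordering mentioned above is easy to see on a toy column. In this sketch (made-up values), `LabelEncoder` numbers the categories in sorted order, so the integers say nothing about how the wind directions relate to each other:

```python
from sklearn.preprocessing import LabelEncoder

# Toy wind-direction labels (values are made up)
labels = ["N", "E", "SE", "N"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # categories in sorted order
print(encoded)                        # integer code of each label
print(le.inverse_transform(encoded))  # codes mapped back to the strings
```

A model that treats these codes as magnitudes would read "SE" (2) as larger than "E" (0), which is the risk the text describes.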
					
				
			
		



B. One-Hot Encoding

One-hot encoding creates a binary column for each category, which removes any implied ordering and treats every category independently.





		
		
			
			
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)
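For comparison with label encoding, the following sketch one-hot encodes a toy column (made-up values) directly with `OneHotEncoder`, without the `ColumnTransformer` wrapper used above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with three categories (values are made up)
directions = np.array([["N"], ["E"], ["SE"], ["N"]])

encoder = OneHotEncoder()
# The default result is a SciPy sparse matrix; toarray() makes it printable
one_hot = encoder.fit_transform(directions).toarray()

print(encoder.categories_[0])  # column order follows the sorted categories
print(one_hot)                 # one row per sample, a single 1 per row
```

Each row contains exactly one 1, so no category is numerically larger than another.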
					
				
			
		



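Choosing an Encoding per Feature

It is efficient to choose between label encoding and one-hot encoding based on the number of distinct categories in each column. In the function below, columns with exactly two categories or with more than `threshold` categories are label-encoded, and the remaining columns are one-hot encoded.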





		
		
			
			
def EncodingSelection(X, threshold=10):
    # Step 1: Select the string columns (object dtype)
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    # Step 2: Apply Label Encoding or mark for One-Hot Encoding based on category count
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 3: Apply One-Hot Encoding where necessary
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Applying encoding selection
X = EncodingSelection(X)

# Checking the shape after encoding
print(X.shape)
					
				
			
		



Output:

(142193, 23)
					
				
			
		



The shape still shows 23 columns: this step converts the categorical columns to numbers and does not by itself select or remove features.
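The decision rule inside `EncodingSelection` can be looked at in isolation. The sketch below only plans the encoding and does not transform anything. The table, the `Season` column, and the lowered threshold of 5 are all made up so that the small example exercises both branches:

```python
import pandas as pd

# Toy feature table (values and the Season column are made up)
X_demo = pd.DataFrame({
    "RainToday": ["No", "Yes", "No", "No", "Yes", "No"],  # 2 categories
    "Season": ["Su", "Au", "Wi", "Sp", "Su", "Wi"],       # 4 categories
    "Date": ["d1", "d2", "d3", "d4", "d5", "d6"],         # 6 categories
    "MinTemp": [3.5, 2.8, 3.6, 5.4, 7.8, 6.1],            # numeric, skipped
})

threshold = 5  # lower than the guide's 10 so the toy data hits both branches
plan = {}
for name in X_demo.columns:
    if pd.api.types.is_numeric_dtype(X_demo[name]):
        continue  # only non-numeric columns need encoding
    n_categories = X_demo[name].nunique()
    if n_categories == 2 or n_categories > threshold:
        plan[name] = "label"
    else:
        plan[name] = "one-hot"

print(plan)
```

Two-category columns need only a single 0/1 column, and high-cardinality columns would add many columns if one-hot encoded, which is why both cases fall back to label encoding.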

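Feature Selection

Not all features contribute equally to the prediction task. Feature selection identifies and keeps the most useful features, which can improve model performance and reduces the computational load.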





		
		
			
			
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Initializing SelectKBest with the chi-squared statistic, keeping 13 features
kbest = SelectKBest(score_func=chi2, k=13)

# chi2 requires non-negative inputs, so scale the features to [0, 1] first
MMS = preprocessing.MinMaxScaler()
x_scaled = MMS.fit_transform(X)

# Fitting SelectKBest to compute one score per feature
kbest.fit(x_scaled, y)

# Ranking features by score: the last 13 indices are kept, the rest are dropped
best_features = np.argsort(kbest.scores_)[-13:]
features_to_delete = np.argsort(kbest.scores_)[:-13]

# Dropping the least important features from the unscaled data
X = np.delete(X, features_to_delete, axis=1)

# Verifying the new shape
print(X.shape)
					
				
			
		



Output:

(142193, 13)
					
				
			
		



This reduces the feature set from 23 columns to the 13 with the highest chi-squared scores.
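A small, fully made-up example may help show what the chi-squared scores measure. Column 0 follows the label exactly, while the other two columns are unrelated to it, so column 0 should be the one feature kept. This sketch uses `get_support()` to read off the selection instead of the manual `argsort` approach above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up): column 0 tracks the label, columns 1 and 2 do not
X_demo = np.array([
    [80.0, 5.0, -2.0],
    [80.0, 9.0, -4.0],
    [80.0, 5.0, -2.0],
    [40.0, 9.0, -2.0],
    [40.0, 5.0, -4.0],
    [40.0, 9.0, -2.0],
])
y_demo = np.array([1, 1, 1, 0, 0, 0])

# chi2 rejects negative values, so map every column to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_demo)

selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X_scaled, y_demo)

print(selector.scores_)        # one score per column; higher means more label-dependent
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_selected.shape)
```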

Train-Test Split

To evaluate a classification model, the dataset must be split into training and test subsets.





		
		
			
			
from sklearn.model_selection import train_test_split

# Splitting the data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Displaying the shape of training data
print(X_train.shape)
					
				
			
		



Output:

(113754, 13)
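The guide uses a plain random split. As an optional variation that the guide itself does not use, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both subsets, which can help when one class is much rarer than the other. A sketch on made-up, imbalanced toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (made up): 80 samples of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset
```

Both subsets keep the original 20% share of class 1.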
					
				
			
		



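Feature Scaling

Feature scaling puts all features on a comparable scale so that no feature dominates simply because of its units. This matters most for algorithms that are sensitive to feature magnitude, such as support vector machines and k-nearest neighbors.

Standardization

Standardization rescales the data to have a mean of 0 and a standard deviation of 1.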





		
		
			
			
from sklearn import preprocessing

# Initializing the StandardScaler
sc = preprocessing.StandardScaler(with_mean=False)

# Fitting the scaler on training data
sc.fit(X_train)

# Transforming both training and testing data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# Verifying the shape after scaling
print(X_train.shape)
print(X_test.shape)
					
				
			
		



Output:

(113754, 13)
(28439, 13)
					
				
			
		



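Note: the with_mean=False parameter skips the centering step, so features are scaled to unit variance but not shifted to zero mean. It is used to avoid problems with the sparse data matrices that one-hot encoding can produce.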
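The sparse-matrix issue in the note can be reproduced on a toy input. This sketch assumes SciPy is available and uses a made-up 3x2 matrix in place of real one-hot output:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy sparse matrix standing in for one-hot encoder output (values are made up)
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 4.0]]))

# Centering would turn the stored zeros into non-zeros, so scikit-learn refuses it
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("with_mean=True:", err)

# Scaling without centering keeps the matrix sparse
scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(scaled))
print(scaled.toarray().std(axis=0))  # standard deviation of each scaled column
```

If the data is a dense array, the default `with_mean=True` works and gives the zero-mean result described under Standardization.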

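Conclusion

Data preprocessing is a critical step in building robust and accurate classification models. Handling missing data systematically, encoding categorical variables, selecting relevant features, and scaling them lays a solid foundation for any machine learning model. This guide has walked through a hands-on approach using Python and its libraries to prepare a classification problem for model training and evaluation. The adage "garbage in, garbage out" applies fully to machine learning, so time invested in data preprocessing pays off in model performance.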



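Keywords: classification problems, data preprocessing, machine learning, data cleaning, feature selection, label encoding, one-hot encoding, feature scaling, Python, Pandas, Scikit-learn, classification models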