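S20L05 – Logistic Regression for Multiclass Classification in Python

Implementing Logistic Regression for Multiclass Classification in Python: A Comprehensive Guide

Multiclass classification, the task of assigning each sample to one of several categories, is a core problem in machine learning. Among the many algorithms available, logistic regression is a robust and interpretable choice. This guide walks through implementing logistic regression for multiclass classification in Python, using Scikit-learn and the Bangla music dataset from Kaggle.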

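Table of Contents

    Introduction to Multiclass Classification
    Understanding the Dataset
    Data Preprocessing
        Handling Missing Data
        Encoding Categorical Variables
    Feature Selection
    Model Training and Evaluation
        K-Nearest Neighbors (KNN) Classifier
        Logistic Regression Model
    Comparative Analysis
    Conclusion
    Full Python Implementation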


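Introduction to Multiclass Classification

Multiclass classification is a type of classification task in which each instance is assigned to one of three or more classes. Unlike binary classification, which deals with two classes, it requires algorithms that can separate several categories at once.

Logistic regression is traditionally known as a binary classifier, but it handles multiclass problems through strategies such as One-vs-Rest (OvR) or the multinomial approach. OvR trains one binary classifier per class, while the multinomial approach fits a single model that outputs a probability for every class. Its simplicity, interpretability, and efficiency make logistic regression a popular choice for many classification tasks.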
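
To make the multinomial approach concrete, the following is a minimal NumPy sketch of how per-class linear scores are turned into probabilities with the softmax function. The weights `W`, biases `b`, and the two sample points are made-up illustrative values, not parameters learned from the Bangla dataset.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert per-class linear scores into probabilities, row by row."""
    # Subtracting the row maximum improves numerical stability
    # and does not change the resulting probabilities.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical weights for 3 classes and 2 features (illustrative values only).
W = np.array([[1.0, -0.5],
              [0.2, 0.8],
              [-1.0, 0.3]])
b = np.array([0.0, 0.1, -0.1])

X_demo = np.array([[2.0, 1.0],
                   [-1.0, 3.0]])

scores = X_demo @ W.T + b          # shape (n_samples, n_classes)
probs = softmax(scores)            # each row sums to 1
predicted = probs.argmax(axis=1)   # index of the most probable class
print(probs.round(3))
print(predicted)
```

Each row of `probs` is a full probability distribution over the classes, which is what distinguishes the multinomial formulation from a set of independent OvR classifiers.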

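Understanding the Dataset

This guide uses the Bangla Music Dataset, which contains features extracted from Bangla songs. The goal is to classify songs by genre based on these features. The dataset includes audio features such as spectral centroid, spectral bandwidth, chroma frequency, and Mel-frequency cepstral coefficients (MFCCs).

Dataset source: Kaggle - Bangla Music Dataset

Sample Data Overview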





		
		
			
			
import pandas as pd

# Load the dataset (bangla.csv must be in the working directory)
data = pd.read_csv('bangla.csv')

# Show the last five rows
print(data.tail())







		
		
			
			
                                               file_name  zero_crossing  \
1737  Tumi Robe Nirobe, Artist - DWIJEN  MUKHOPADHYA...          78516   
1738  TUMI SANDHYAR MEGHMALA  Srikanta Acharya  Rabi...         176887   
1739  Utal Haowa Laglo Amar Gaaner Taranite  Sagar S...         133326   
1740  venge mor ghorer chabi by anima roy.. album ro...         179932   
1741   vora thak vora thak by anima roy ( 160kbps ).mp3         175244   

          spectral_centroid  spectral_rolloff  spectral_bandwidth  \
1737         800.797115       1436.990088         1090.389766   
1738        1734.844686       3464.133429         1954.831684   
1739        1380.139172       2745.410904         1775.717428   
1740        1961.435018       4141.554401         2324.507425   
1741        1878.657768       3877.461439         2228.147952   

          chroma_frequency      rmse         delta  melspectogram       tempo  \
1737          0.227325  0.108344  2.078194e-08       3.020211  117.453835   
1738          0.271189  0.124934  5.785562e-08       4.098559  129.199219   
1739          0.263462  0.111411  4.204189e-08       3.147722  143.554688   
1740          0.261823  0.168673  3.245319e-07       7.674615  143.554688   
1741          0.232985  0.311113  1.531590e-07      26.447679  129.199219   

          ...    mfcc11     mfcc12     mfcc13    mfcc14    mfcc15    mfcc16  \
1737  ... -2.615630   2.119485 -12.506942 -1.148996  0.090582 -8.694072   
1738  ...  1.693247  -4.076407  -2.017894 -7.419591 -0.488603 -8.690254   
1739  ...  2.487961  -3.434017  -6.099467 -6.008315 -7.483330 -2.908477   
1740  ...  1.192605 -13.142963   0.281834 -5.981567 -1.066383  0.677886   
1741  ... -5.636770 -12.078487   1.692546 -6.005674  1.502304 -0.415201   

          mfcc17    mfcc18    mfcc19     label  
1737 -6.597594  2.925687 -6.154576  rabindra  
1738 -7.090489 -6.530357 -5.593533  rabindra  
1739  0.783345 -3.394053 -3.157621  rabindra  
1740  0.803132 -3.304548  4.309490  rabindra  
1741  2.389623 -3.135799  0.225479  rabindra  

[5 rows x 31 columns]
			
				
					
				
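
Before preprocessing, it can help to check the shape of the data, the number of missing values per column, and the class balance. The sketch below does this on a tiny hand-made frame; the numbers and the genre name `genre_b` are placeholders for illustration, not values taken from bangla.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bangla.csv: two feature columns and a label column.
# All values and the second genre name are illustrative placeholders.
toy = pd.DataFrame({
    "tempo": [117.45, 129.20, np.nan, 143.55],
    "rmse": [0.108, 0.125, 0.111, 0.169],
    "label": ["rabindra", "rabindra", "genre_b", "genre_b"],
})

print(toy.shape)                             # (rows, columns)
missing_per_column = toy.isna().sum()        # NaN count in each column
class_counts = toy["label"].value_counts()   # samples per genre
print(missing_per_column)
print(class_counts)
```

The same three calls can be applied to the real `data` frame once it is loaded.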
					
				
			
		



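Data Preprocessing

Effective data preprocessing is essential for building reliable machine learning models. This section describes the steps that prepare the data for modeling.

Handling Missing Data

Missing data can hurt the performance of a machine learning model, so it is important to identify missing values and handle them appropriately.

Numerical Data

For numerical features, missing values are replaced with the column mean.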





		
		
			
			
import numpy as np
from sklearn.impute import SimpleImputer

# Separate the features (all columns except the last) from the target label
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1]

# Positional indices of the numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Replace missing values (NaN) with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
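
As a quick illustration of what the mean strategy does, here is a small sketch on a hand-made array (the values are illustrative and unrelated to the dataset). Each NaN is replaced by the mean of the non-missing entries in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

print(imputer.statistics_)  # per-column means learned from the non-missing values
print(filled)
```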
					
				
			
		



Categorical Data

For categorical features, missing values are replaced with the most frequent value in the column.





		
		
			
			
# Positional indices of the string (object) columns
string_cols = list(np.where((X.dtypes == object))[0])

# Replace missing values with the most frequent value in each column
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
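
The most-frequent strategy can be sketched the same way on a toy string column. The column name `mood` and its values are hypothetical; the frame is created with `dtype=object` so that it matches the `X.dtypes == object` check used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry.
toy = pd.DataFrame({"mood": ["calm", "upbeat", "calm", np.nan]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(toy)   # returns a NumPy array

print(filled.ravel().tolist())
```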
					
				
			
		



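Encoding Categorical Variables

Machine learning algorithms require numerical input, so categorical variables must be encoded appropriately.

One-Hot Encoding

For categorical features with more than two categories but no more than a chosen threshold, one-hot encoding is used so that no artificial ordering is introduced between categories.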





		
		
			
			
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at `indices`; pass all other columns through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
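
To see what one-hot encoding produces, here is a minimal sketch on a made-up single-column input (the color values are illustrative and not part of the dataset). Each category becomes its own 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)
```

Because no column is "larger" than another, a linear model such as logistic regression cannot mistake the category codes for an ordered quantity.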
					
				
			
		



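Label Encoding

For binary categorical features, and for features with so many categories that one-hot encoding would create too many columns, label encoding is used.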





		
		
			
			
from sklearn import preprocessing

def LabelEncoderMethod(series):
    # Map each distinct category to an integer code (0, 1, 2, ...)
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
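
A short sketch of label encoding on a hand-made list follows. Only `rabindra` appears in the sample output above; `genre_a` and `genre_b` are placeholder names used here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# "genre_a" and "genre_b" are placeholder labels, not names from the dataset.
genres = ["rabindra", "genre_b", "rabindra", "genre_a"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)        # integer code per sample
decoded = encoder.inverse_transform(codes)   # map the codes back to strings

print(encoder.classes_)   # classes in sorted order; the position is the code
print(codes)
```

Keeping the fitted encoder around makes it possible to translate predicted integer codes back into readable labels with `inverse_transform`.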
					
				
			
		



Choosing the Encoding for X

The encoding strategy for each feature is chosen according to its number of unique categories.





		
		
			
			
def EncodingSelection(X, threshold=10):
    # Step 1: find the string (object) columns
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    # Step 2: label encode columns with exactly 2 categories or more than `threshold`
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 3: one-hot encode the remaining string columns
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")



Output:




		
		
			
			
Encoded feature shape: (1742, 30)
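
The decision rule inside `EncodingSelection` can be isolated in a few lines. The helper `choose_encoding` below is a hypothetical name introduced only for this sketch; it mirrors the `length == 2 or length > threshold` condition from the code above.

```python
def choose_encoding(n_unique: int, threshold: int = 10) -> str:
    """Return 'label' or 'one-hot' following the rule used by EncodingSelection."""
    if n_unique == 2 or n_unique > threshold:
        return "label"
    return "one-hot"

# Try the rule on a few category counts.
for n in (2, 5, 10, 11, 1742):
    print(n, choose_encoding(n))
```

A column with a very large number of distinct values, such as a file name, would fall on the label-encoding side of this rule, which is consistent with the column count staying at 30 after encoding.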
					
				
			
		



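Feature Selection

Selecting the most relevant features can improve model performance and reduce computational cost. Here the features are first scaled to the range [0, 1] with MinMaxScaler, because the chi-squared test requires non-negative inputs, and the 12 highest-scoring features are kept.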





		
		
			
			
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Number of best features to keep
K_features = 12

# chi2 requires non-negative inputs, so scale every feature to [0, 1]
MMS = preprocessing.MinMaxScaler()
x_scaled = MMS.fit_transform(X)

# Score every feature against the target with the chi-squared test
kbest = SelectKBest(score_func=chi2, k=K_features)
kbest.fit(x_scaled, y)

# Rank the features by score (ascending order)
ranked = np.argsort(kbest.scores_)

# Indices of the top K_features and of the features to drop
best_features = ranked[-K_features:]
features_to_delete = ranked[:-K_features]

# Reduce X to the selected features
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")



Output:




		
		
			
			
Reduced feature shape: (1742, 12)
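
The same selection idea can be tried on a small dataset that ships with scikit-learn. This sketch uses the iris data as a stand-in (an assumption made for illustration, not the Bangla dataset) and lets `SelectKBest` do the column selection directly through `fit_transform` and `get_support`, instead of deleting columns by hand.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)   # 150 samples, 4 features

# chi2 needs non-negative inputs, so scale to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X_demo)

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_scaled, y_demo)

kept = selector.get_support(indices=True)     # column indices of the kept features
print(selector.scores_.round(2))
print(kept, X_selected.shape)
```

Note one difference from the guide's code: `fit_transform` returns the selected columns of the scaled data, whereas the guide uses the scores only to pick columns and keeps the unscaled values for the later standardization step.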
					
				
			
		



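Model Training and Evaluation

With preprocessing and feature selection complete, the models can be trained and evaluated. The data is first split 80/20 into training and test sets and then standardized, as shown in the full implementation at the end of this guide.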
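
The split-and-scale step can be sketched on synthetic data. `make_classification` is used here only as a stand-in for the 12 selected features, and the sketch uses the default `StandardScaler()`, whereas the full implementation passes `with_mean=False`, which rescales by the standard deviation without centering.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 selected features (not the Bangla dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6,
    n_classes=3, random_state=1,
)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape, X_test.shape)
```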

K-Nearest Neighbors (KNN) Classifier

KNN is a simple instance-based learning algorithm that serves as a useful baseline for classification tasks.





		
		
			
			
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with 8 neighbors
knnClassifier = KNeighborsClassifier(n_neighbors=8)

# Train the model
knnClassifier.fit(X_train, y_train)

# Make predictions
y_pred_knn = knnClassifier.predict(X_test)

# Evaluate accuracy (accuracy_score expects y_true first, then y_pred)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")



Output:




		
		
			
			
KNN Accuracy: 0.68
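
To show what "instance-based" means, here is a minimal from-scratch sketch of the KNN prediction rule in NumPy. It is a simplified illustration using Euclidean distance and a plain majority vote on made-up points, not a reproduction of scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of one point x by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters with illustrative labels "a" and "b".
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```

There is no training phase beyond storing the data: all the work happens at prediction time, which is why KNN becomes slower as the training set grows.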
					
				
			
		



Logistic Regression Model

Logistic regression is extended to multiclass classification using the multinomial approach.





		
		
			
			
from sklearn.linear_model import LogisticRegression

# With the lbfgs solver, multiclass targets are fitted with the multinomial (softmax)
# loss by default; the explicit multi_class argument is deprecated in recent
# scikit-learn releases, so it is omitted here.
# max_iter is raised so the solver has enough iterations to converge.
LRM = LogisticRegression(random_state=0, max_iter=1000, solver='lbfgs')

# Train the model
LRM.fit(X_train, y_train)

# Make predictions
y_pred_lr = LRM.predict(X_test)

# Evaluate accuracy (accuracy_score expects y_true first, then y_pred)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")



Output:




		
		
			
			
Logistic Regression Accuracy: 0.65
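
Part of the appeal of logistic regression is that the fitted model can be inspected. The sketch below uses the iris data bundled with scikit-learn as a stand-in (an assumption for illustration, not the Bangla dataset) to show the per-class weight matrix and the class probabilities for a few samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)        # 3 classes, 4 features
X_demo = StandardScaler().fit_transform(X_demo)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

print(model.coef_.shape)                  # one row of weights per class
probs = model.predict_proba(X_demo[:3])   # class probabilities for the first 3 samples
print(probs.round(3))
```

With standardized inputs, the magnitude of each weight gives a rough indication of how strongly a feature pushes a sample toward or away from a class.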
					
				
			
		



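Comparative Analysis

Evaluating both models shows that, in this particular scenario, the K-Nearest Neighbors classifier outperformed logistic regression.

    KNN accuracy: 67.9%
    Logistic regression accuracy: 65.0%

However, the following observations are worth noting:

    Iteration limit warning: logistic regression initially failed to converge; raising the max_iter parameter from 300 to 1000 resolved the issue.
    Model performance: KNN achieved higher accuracy, but logistic regression is more interpretable and may scale better to larger datasets.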
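
The convergence issue can be reproduced in a controlled way. The sketch below is an illustration on the iris data (a stand-in, not the Bangla dataset): it fits the same model with a deliberately tiny iteration budget and with a generous one, and records whether scikit-learn emitted a `ConvergenceWarning`. The helper name `fit_and_check` is hypothetical.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

def fit_and_check(max_iter):
    """Fit a model and report whether the solver raised a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned

_, warned_small = fit_and_check(max_iter=2)      # far too few iterations
_, warned_large = fit_and_check(max_iter=1000)   # generous budget
print(warned_small, warned_large)
```

Besides raising `max_iter`, scaling the features usually helps the solver converge in fewer iterations.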


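Future improvements:

    Hyperparameter tuning: adjusting logistic regression parameters such as C and penalty may improve performance.
    Cross-validation: cross-validation gives a more robust estimate of model performance than a single train-test split.
    Feature engineering: creating or selecting more informative features can improve classification accuracy.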
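
The first two suggestions can be combined in one step. The following is a minimal sketch, on synthetic data from `make_classification` and with an illustrative grid of `C` values, of tuning logistic regression with 5-fold cross-validation. None of the numbers come from the Bangla experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem used as a stand-in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, random_state=0,
)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for the inverse regularization strength C (illustrative grid).
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the scaler inside the pipeline matters: scaling the whole dataset before cross-validation would let each validation fold influence the preprocessing.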


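Conclusion

This guide demonstrated an implementation of logistic regression for multiclass classification in Python, from data preprocessing through model evaluation. KNN achieved better accuracy in this case, but logistic regression remains a strong tool when interpretability matters. With structured preprocessing, feature selection, and careful model training, multiclass classification problems can be tackled effectively across many domains.

Full Python Implementation

The complete Python code covering all the steps discussed above follows: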





		
		
			
			
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('bangla.csv')

# Separate features and target (copy so later assignments do not touch `data`)
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1]

# Handling missing data - numeric columns (mean imputation)
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

# Handling missing data - string columns (most frequent value)
string_cols = list(np.where((X.dtypes == object))[0])
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])

# Encoding methods
def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at `indices`; pass all other columns through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

def LabelEncoderMethod(series):
    # Map each distinct category to an integer code
    le = LabelEncoder()
    le.fit(series)
    return le.transform(series)

def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    # Label encode columns with exactly 2 categories or more than `threshold`;
    # one-hot encode the rest
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")

# Feature selection (chi2 requires non-negative inputs, hence the [0, 1] scaling)
MMS = MinMaxScaler()
K_features = 12
x_scaled = MMS.fit_transform(X)
kbest = SelectKBest(score_func=chi2, k=K_features)
kbest.fit(x_scaled, y)
ranked = np.argsort(kbest.scores_)
best_features = ranked[-K_features:]
features_to_delete = ranked[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(f"Training set shape: {X_train.shape}")

# Feature scaling (fit on the training set only)
sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
print(f"Scaled Training set shape: {X_train.shape}")
print(f"Scaled Test set shape: {X_test.shape}")

# Building KNN model
knnClassifier = KNeighborsClassifier(n_neighbors=8)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")

# Building Logistic Regression model
# (lbfgs fits the multinomial loss for multiclass targets by default;
# the deprecated multi_class argument is omitted)
LRM = LogisticRegression(random_state=0, max_iter=1000, solver='lbfgs')
LRM.fit(X_train, y_train)
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")



Note: Before running the code, make sure the dataset file bangla.csv is placed in the working directory.
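
One way to fail early with a clear message when the file is missing is a small path check before loading. The helper name `resolve_dataset` is hypothetical and introduced only for this sketch.

```python
from pathlib import Path

def resolve_dataset(filename: str = "bangla.csv") -> Path:
    """Return the dataset path in the working directory, or raise a clear error."""
    path = Path.cwd() / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {Path.cwd()}; download it from Kaggle first."
        )
    return path
```

With this in place, the loading step could be written as `data = pd.read_csv(resolve_dataset())`.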

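Keywords

    Logistic regression
    Multiclass classification
    Python tutorial
    Machine learning
    Data preprocessing
    Feature selection
    K-Nearest Neighbors (KNN)
    Scikit-learn
    Data science
    Python machine learning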