Building a Robust Text Classifier: Pipelines and LinearSVC in Python

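Introduction to Text Classification

Text classification is a core NLP task that assigns predefined categories to text. Its applications range from sentiment analysis and topic labeling to content filtering. The main steps in building a text classifier are data collection, preprocessing, feature extraction, model training, and evaluation.

This guide focuses on converting text into numerical features with TF-IDF vectorization and training a linear SVC (LinearSVC) classifier, all inside a single pipeline. A pipeline manages the sequence of processing steps as one object, which reduces the risk of errors and improves reproducibility.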

Dataset Overview

This tutorial uses the movie review dataset from Kaggle, which contains 64,720 movie reviews labeled as positive (pos) or negative (neg). The dataset is well suited to binary sentiment analysis.

Sample data:

fold_id	cv_tag	html_id	sent_id	text	tag
0	cv000	29590	0	films adapted from comic books have had plenty…	pos
0	cv000	29590	1	for starters, it was created by Alan Moore (…)	pos
…	…	…	…	…	…

Environment Setup

Before working with the code, make sure the required libraries are installed. You can install them with pip:

pip install numpy pandas scikit-learn

Or, if you use Anaconda:

conda install numpy pandas scikit-learn
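
As a quick check that the installation worked, a short script along these lines prints the installed versions. This is a minimal sketch; the `versions` dictionary is only an illustrative name.

```python
# Print the installed versions to confirm the environment is ready.
import numpy as np
import pandas as pd
import sklearn

versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```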

Data Preprocessing

Data preprocessing prepares the dataset for modeling. It covers loading the data, handling missing values, and splitting the dataset into training and test sets.

Importing Libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

Loading the Dataset

# Load the dataset
data = pd.read_csv('movie_review.csv')

# Display the first few rows
data.head()

Sample output:

   fold_id  cv_tag  html_id  sent_id  \
0        0  cv000    29590        0   
1        0  cv000    29590        1   
2        0  cv000    29590        2   
3        0  cv000    29590        3   
4        0  cv000    29590        4   

                                                text tag  
0  films adapted from comic books have had plenty...  pos  
1  for starters, it was created by Alan Moore (...)  pos  
2  to say Moore and Campbell thoroughly researched...  pos  
3  the book (or "graphic novel,") if you will ...  pos  
4  in other words, don't dismiss this film because...  pos
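
The preprocessing overview mentions handling missing values, but the tutorial's code does not show that step. A minimal sketch of one way to do it, assuming rows with a missing text or tag should simply be dropped. The `toy` frame is a hypothetical stand-in for the loaded `data`, and dropping whitespace-only text is an extra assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for movie_review.csv
toy = pd.DataFrame({
    "text": ["a gripping thriller", np.nan, "dull and overlong", "   "],
    "tag": ["pos", "neg", "neg", "pos"],
})

# Count missing values per column before deciding how to handle them
print(toy.isna().sum())

# Drop rows where the text or the label is missing
clean = toy.dropna(subset=["text", "tag"])

# Optionally drop rows whose text is only whitespace
clean = clean[clean["text"].str.strip() != ""].reset_index(drop=True)
print(len(clean))
```

On the real dataset the same two calls would be applied to `data` before selecting features.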

Feature Selection

We will use the text column as the feature (X) and the tag column as the target variable (y).

X = data['text']
y = data['tag']

Splitting the Dataset

Splitting the data into training and test sets lets us evaluate the model's performance on unseen data.

from sklearn.model_selection import train_test_split

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
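
The split above is purely random. If the classes are imbalanced, passing `stratify` keeps the label proportions similar in both subsets. A minimal sketch on hypothetical toy data (`X_toy` and `y_toy` are illustrative names, not part of the tutorial):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 sentences, 6 positive and 4 negative
X_toy = pd.Series([f"sentence {i}" for i in range(10)])
y_toy = pd.Series(["pos"] * 6 + ["neg"] * 4)

# stratify=y_toy keeps the pos/neg ratio similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

# 80% of the rows go to training, 20% to testing
print(len(X_tr), len(X_te))
print(y_tr.value_counts(normalize=True))
```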

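Vectorization with TF-IDF

Machine learning models require numerical input. Vectorization converts text into numerical features. CountVectorizer simply counts word occurrences, whereas TF-IDF (Term Frequency-Inverse Document Frequency) produces a weighted representation that emphasizes informative words.

Why TF-IDF?

TF-IDF accounts for how often a term occurs in a document and also down-weights terms that appear frequently across all documents, so it captures how important a term is to an individual document.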

Building the Machine Learning Pipeline

Scikit-learn's Pipeline class combines several processing steps into a single object. This ensures that the steps run in order and simplifies model training and evaluation.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the pipeline
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

Pipeline components:

TF-IDF Vectorizer (tfidf): converts text into TF-IDF feature vectors.
Linear Support Vector Classifier (clf): performs the classification.
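
To show what the pipeline does on your behalf, the sketch below runs the same two components once as a Pipeline and once by hand, on hypothetical toy sentences. Fixing `random_state=0` is an assumption made here so the two runs are directly comparable; the tutorial itself uses the defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training and test sentences
train_texts = ["great fun", "loved it", "terrible plot", "boring and bad"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great plot", "bad and boring"]

# Pipeline version: one fit call runs both steps in order
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Manual version: the same two steps written out by hand
vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(train_texts)  # vocabulary learned from training text only
clf = LinearSVC(random_state=0).fit(X_train_vec, train_labels)
manual_pred = clf.predict(vec.transform(test_texts))  # transform only, no refit

print(list(pipe_pred), list(manual_pred))
```

The manual version makes the risk visible: it is easy to call `fit_transform` on the test text by mistake, whereas the pipeline applies only `transform` at prediction time.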

Training the Model

After defining the pipeline, fit the model to the training data.

# Train the model
text_clf.fit(X_train, y_train)

Output:

Pipeline(steps=[
  ('tfidf', TfidfVectorizer()),
  ('clf', LinearSVC())
])

Evaluating Model Performance

Measuring the model's accuracy on the test set gives insight into its predictive ability.

from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = text_clf.predict(X_test)

# Calculate accuracy (true labels first, then predictions)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2%}')

Sample output:

Accuracy: 69.83%

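An accuracy of about 69.83% means the model classified nearly 70% of the test-set reviews correctly, which is a promising starting point. Accuracy alone does not show how the errors are distributed across classes, so for further evaluation consider generating a classification report and a confusion matrix to inspect the model's precision, recall, and F1 score.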
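
A classification report and confusion matrix can be produced along the following lines. This is a minimal sketch on hypothetical labels; in the tutorial the inputs would be `y_test` and `y_pred` rather than the toy lists used here.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels; in the tutorial these would be y_test and y_pred
y_true_toy = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred_toy = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["neg", "pos"])
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(y_true_toy, y_pred_toy, labels=["neg", "pos"]))
```

In this toy case the matrix has two correct predictions and one error for each class, so the off-diagonal cells show exactly where the model confuses pos with neg.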

Making Predictions

Once the model is trained, it can classify new text. Here is how to predict the sentiment of individual reviews:

# Example predictions
sample_reviews = [
    'Fantastic movie! I really enjoyed it.',
    'Avoid this movie at any cost, just not good.'
]

predictions = text_clf.predict(sample_reviews)
for review, sentiment in zip(sample_reviews, predictions):
    print(f'Review: "{review}" - Sentiment: {sentiment}')

Sample output:

Review: "Fantastic movie! I really enjoyed it." - Sentiment: pos
Review: "Avoid this movie at any cost, just not good." - Sentiment: neg

The model correctly separates positive and negative sentiment in the examples provided.
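
LinearSVC does not output probabilities, but `decision_function` returns a signed margin that gives a rough sense of how far a prediction lies from the decision boundary. The sketch below trains a small stand-in pipeline on hypothetical sentences; `toy_clf` and the data are illustrative, not the tutorial's trained `text_clf`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data standing in for the movie review dataset
texts = ["fantastic and enjoyable", "really enjoyed it",
         "not good at all", "avoid this one"]
labels = ["pos", "pos", "neg", "neg"]

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
]).fit(texts, labels)

new_reviews = ["enjoyed it, fantastic", "not good, avoid"]
preds = toy_clf.predict(new_reviews)

# Signed distance to the separating hyperplane; a positive score
# corresponds to toy_clf.classes_[1], a non-positive score to classes_[0]
scores = toy_clf.decision_function(new_reviews)

for review, label, score in zip(new_reviews, preds, scores):
    print(f"{review!r}: {label} (score {score:+.3f})")
```

Scores near zero indicate borderline cases that may deserve a manual look.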

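Conclusion

Building a text classifier involves several key steps, from data preprocessing and feature extraction to model training and evaluation. Scikit-learn's pipelines streamline the workflow so that each step runs consistently and efficiently. This guide uses TF-IDF vectorization and a simple LinearSVC model, but the same framework lets you experiment with other vectorization techniques and classification algorithms to improve performance further.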
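
One way to run such experiments is a grid search over pipeline parameters. The sketch below is a minimal illustration on hypothetical toy data; the parameter values and `cv=2` are assumptions chosen to keep the example small, and a real search would use `X_train` and `y_train` with more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; a real search would use X_train and y_train
texts = [
    "great film", "loved the acting", "wonderful story", "really fun",
    "terrible film", "hated the acting", "boring story", "really bad",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])

# Step name + double underscore + parameter name addresses a step's setting
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# Every parameter combination is refit from scratch inside each fold
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"{search.best_score_:.2f}")
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the cross-validated scores are not inflated by vocabulary learned from held-out text.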

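Additional Resources

Scikit-learn documentation:
- TfidfVectorizer
- Pipeline
Tutorials:
- Working With Text Data
Dataset:
- Kaggle movie review dataset

By following this guide, you will have the foundation to build and refine your own text classifier and to move on to more advanced NLP applications.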
