S39L06 – Building a Text Classifier: Continuing with Pipelines

Building a Robust Text Classifier in Python: Using Pipelines and LinearSVC

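Table of Contents

    Introduction to Text Classification
    Dataset Overview
    Environment Setup
    Data Preprocessing
    Vectorization with TF-IDF
    Building the Machine Learning Pipeline
    Model Training
    Evaluating Model Performance
    Making Predictions
    Conclusion
    Additional Resources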


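Introduction to Text Classification

Text classification is a core task in natural language processing (NLP): assigning predefined categories to text. Its applications range from sentiment analysis and topic labeling to content filtering. Building a text classifier involves data collection, preprocessing, feature extraction, model training, and evaluation.

This guide focuses on converting text into numerical features with TF-IDF vectorization and building a classification model with LinearSVC inside a single pipeline. A pipeline manages the order of the processing steps, which reduces the risk of errors and improves reproducibility.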

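Dataset Overview

This tutorial uses a movie review dataset from Kaggle. It contains 64,720 entries of movie review text, each labeled as positive (pos) or negative (neg), which makes it well suited to binary sentiment analysis.

Sample of the data: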


    
    fold_id  cv_tag  html_id  sent_id  text                                                tag
    0        cv000   29590    0        films adapted from comic books have had plenty...   pos
    0        cv000   29590    1        for starters, it was created by Alan Moore (...)    pos
    ...      ...     ...      ...      ...                                                 ...


Environment Setup

Before writing any code, make sure the required libraries are installed. You can install them with pip:

    pip install numpy pandas scikit-learn

Alternatively, if you use Anaconda:

    conda install numpy pandas scikit-learn

Data Preprocessing

Data preprocessing prepares the dataset for modeling. It includes loading the data, handling missing values, and splitting the dataset into training and test sets.

Importing Libraries

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

Loading the Dataset

    # Load the dataset
    data = pd.read_csv('movie_review.csv')

    # Display the first few rows
    data.head()

Sample output:

       fold_id  cv_tag  html_id  sent_id  \
    0        0  cv000    29590        0
    1        0  cv000    29590        1
    2        0  cv000    29590        2
    3        0  cv000    29590        3
    4        0  cv000    29590        4

                                                    text tag
    0  films adapted from comic books have had plenty...  pos
    1  for starters, it was created by Alan Moore (...)  pos
    2  to say Moore and Campbell thoroughly researched...  pos
    3  the book (or "graphic novel,") if you will ...  pos
    4  in other words, don't dismiss this film because...  pos

Feature Selection

We use the text column as the feature (X) and the tag column as the target variable (y).

    X = data['text']
    y = data['tag']
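
The preprocessing overview mentions handling missing values, which the steps above do not show. The following is a minimal sketch of one way to do it, assuming a small hypothetical DataFrame with the same text and tag columns. The toy rows and the helper name drop_empty_text are invented for illustration and are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical stand-in for movie_review.csv; these rows are invented.
toy = pd.DataFrame({
    "text": ["a great film", None, "   ", "dull and slow"],
    "tag": ["pos", "neg", "pos", "neg"],
})

def drop_empty_text(df, text_col="text", label_col="tag"):
    """Return a copy without rows whose text or label is missing or blank."""
    # Drop rows where either column is NaN/None.
    cleaned = df.dropna(subset=[text_col, label_col])
    # Drop rows whose text contains only whitespace.
    not_blank = cleaned[text_col].str.strip() != ""
    return cleaned[not_blank].reset_index(drop=True)

clean = drop_empty_text(toy)
X_toy = clean["text"]
y_toy = clean["tag"]

print(clean)
print(y_toy.value_counts())
```

Checking value_counts() at this point is also a quick way to see whether the two classes are roughly balanced before splitting.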
					
				
			
		



Splitting the Dataset

Splitting the data into training and test sets lets us evaluate the model on data it has not seen during training.

    from sklearn.model_selection import train_test_split

    # Split the dataset (80% training, 20% testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
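
One optional refinement, not used in the tutorial itself, is to pass the stratify argument of train_test_split so that both splits keep the same pos/neg ratio. The sketch below demonstrates this on toy data invented for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 10 "pos" and 10 "neg" snippets.
texts = [f"review {i}" for i in range(20)]
labels = ["pos"] * 10 + ["neg"] * 10

# stratify=labels keeps the class proportions identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=1, stratify=labels
)

print(len(X_tr), len(X_te))
print(Counter(y_tr), Counter(y_te))
```

With 20 samples and test_size=0.20, the test split holds 4 samples, 2 from each class.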
					
				
			
		



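Vectorization with TF-IDF

Machine learning models require numerical input, and vectorization converts text into numerical features. CountVectorizer simply counts how often each word occurs, whereas TF-IDF (term frequency-inverse document frequency) produces a weighted representation that emphasizes informative words.

Why TF-IDF?

TF-IDF accounts for how often a word appears in a document and also down-weights words that appear frequently across all documents, so the resulting weight reflects how important a word is to an individual document.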

Building the Machine Learning Pipeline

Scikit-learn's Pipeline class combines multiple processing steps into a single object. This guarantees that the steps run in order and simplifies training and evaluation.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Define the pipeline
    text_clf = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LinearSVC()),
    ])

Pipeline components:

    TF-IDF Vectorizer (tfidf): converts text into TF-IDF feature vectors.
    Linear Support Vector Classifier (clf): performs the classification.
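
To illustrate what the two components do together, the sketch below compares the manual two-step workflow with the equivalent pipeline. The toy texts are invented for illustration, and random_state=0 is an added assumption so that both classifiers are fitted identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration.
train_texts = [
    "great film", "loved it", "wonderful acting",
    "terrible film", "hated it", "awful acting",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
new_texts = ["great acting", "awful film"]

# Manual workflow: fit the vectorizer, transform, then fit the classifier.
vec = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vec.fit_transform(train_texts), train_labels)
# New text must go through transform (not fit_transform) before predict.
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline workflow: the same two steps behind a single fit/predict.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

print(list(manual_pred), list(pipe_pred))
```

The pipeline removes the easy-to-miss detail of calling transform instead of fit_transform on new data, which is one of the error sources the manual version is prone to.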


Model Training

With the pipeline defined, training the model means fitting it to the training data.

    # Train the model
    text_clf.fit(X_train, y_train)

Output:

    Pipeline(steps=[
      ('tfidf', TfidfVectorizer()),
      ('clf', LinearSVC())
    ])

Evaluating Model Performance

Measuring accuracy on the test set indicates how well the model predicts unseen data.

    from sklearn.metrics import accuracy_score

    # Make predictions on the test set
    y_pred = text_clf.predict(X_test)

    # Calculate accuracy (true labels first, predictions second)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy:.2%}')

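Sample output:

    Accuracy: 69.83%

An accuracy of about 69.83% means the model classified roughly 70% of the test entries correctly, which is a reasonable starting point. A classification report and a confusion matrix give a fuller picture by showing the model's precision, recall, and F1 score for each class.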
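
A minimal sketch of those two tools follows, using scikit-learn's classification_report and confusion_matrix. The label lists here are hypothetical values invented for illustration, not results from the movie review model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions, invented for illustration.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_hat = ["pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"]

# Fix the label order so rows and columns are easy to read:
# rows are true classes, columns are predicted classes.
labels = ["neg", "pos"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(y_true, y_hat, labels=labels))
```

In the tutorial's own workflow, the same two calls would take y_test and y_pred as their first two arguments.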

Making Predictions

Once the model is trained, it can classify new text. Here is how to predict the sentiment of individual reviews:

    # Example predictions
    sample_reviews = [
        'Fantastic movie! I really enjoyed it.',
        'Avoid this movie at any cost, just not good.'
    ]

    predictions = text_clf.predict(sample_reviews)
    for review, sentiment in zip(sample_reviews, predictions):
        print(f'Review: "{review}" - Sentiment: {sentiment}')

Sample output:

    Review: "Fantastic movie! I really enjoyed it." - Sentiment: pos
    Review: "Avoid this movie at any cost, just not good." - Sentiment: neg

The model correctly separates the positive and negative sentiment in the examples provided.
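
LinearSVC does not provide predict_proba, but its decision_function returns a signed score whose sign determines the predicted class, which gives a rough sense of how far a review sits from the decision boundary. The sketch below trains a small stand-in pipeline on toy texts invented for illustration; the name demo_clf is hypothetical, and in the tutorial the same call would be made on text_clf.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data invented for illustration.
toy_texts = [
    "fantastic movie, really enjoyed it",
    "great film with a good story",
    "not good, avoid this movie",
    "terrible film, avoid at any cost",
]
toy_labels = ["pos", "pos", "neg", "neg"]

demo_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
]).fit(toy_texts, toy_labels)

sample_reviews = [
    "Fantastic movie! I really enjoyed it.",
    "Avoid this movie at any cost, just not good.",
]

# One signed score per review; a positive score maps to classes[1].
scores = demo_clf.decision_function(sample_reviews)
predictions = demo_clf.predict(sample_reviews)
classes = demo_clf.named_steps["clf"].classes_  # sorted labels: ['neg', 'pos']

for review, score, label in zip(sample_reviews, scores, predictions):
    print(f"{label:>3}  score={score:+.3f}  {review}")
```

Scores close to zero indicate reviews the model is least sure about; these are unscaled margins, not probabilities.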

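Conclusion

Building a text classifier involves several key steps, from data preprocessing and feature extraction to model training and evaluation. Scikit-learn pipelines simplify the workflow and ensure each step runs consistently and efficiently. Although this guide uses a simple LinearSVC model with TF-IDF vectorization, the same framework lets you experiment with other vectorization techniques and classification algorithms to improve performance.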
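
One way to run such experiments is a grid search over the pipeline's parameters. The sketch below is an illustration under stated assumptions: the toy texts are invented, the parameter values are arbitrary examples, and cv=2 is chosen only because the toy set is tiny. Step parameters are addressed with the step name, a double underscore, and the parameter name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data invented for illustration.
toy_texts = [
    "a great and moving film", "wonderful acting and a great plot",
    "i loved this movie", "a fun and clever story",
    "a dull and boring film", "terrible acting and a weak plot",
    "i hated this movie", "a slow and awful story",
]
toy_labels = ["pos"] * 4 + ["neg"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])

# <step name>__<parameter name> selects a parameter of a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "clf__C": [0.1, 1.0],                    # regularization strength
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

Because the vectorizer is inside the pipeline, it is refitted on each training fold, so the cross-validation scores are not contaminated by vocabulary learned from held-out text.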

Additional Resources

    Scikit-learn documentation:
        TfidfVectorizer
        Pipeline
    Tutorials:
        Working With Text Data
    Dataset:
        Kaggle movie review dataset

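By following this guide, you now have the fundamentals needed to build and tune your own text classifiers, paving the way for more advanced NLP applications.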