S39L05 – Building a Text Classifier

Building an Effective Text Classifier with Scikit-Learn: A Comprehensive Guide

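Meta Description: Explore text classification for natural language processing (NLP) with Scikit-Learn. Learn how to preprocess text data, use CountVectorizer and TfidfVectorizer, train a LinearSVC model, and overcome common challenges when building a robust text classifier.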



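In the era of big data, natural language processing (NLP) has become indispensable for extracting meaningful insights from large volumes of text. Whether the goal is sentiment analysis, spam detection, or topic categorization, text classification is one of the most widely used NLP applications. This guide, which includes practical code snippets from a Jupyter Notebook, walks you through building an effective text classifier with Scikit-Learn. We cover data preprocessing, vectorization methods, model training, and strategies for solving common problems.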

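Table of Contents

    Introduction to Text Classification
    Dataset Overview
    Data Preprocessing
        Importing Libraries
        Loading the Data
    Feature Extraction
        CountVectorizer
        TfidfVectorizer
    Model Training and Evaluation
        Train/Test Split
        Training a LinearSVC Model
        Evaluating Model Performance
    Common Challenges and Solutions
    Conclusion and Next Steps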




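Introduction to Text Classification

Text classification is a fundamental NLP task that assigns predefined categories to text. Applications range from sentiment analysis (deciding whether a review is positive or negative) to more complex tasks such as topic labeling and spam detection. Once text is converted into a numerical representation, a machine learning model can learn these categories from labeled examples and predict them for new text.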

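Dataset Overview

In this guide we use the movie review dataset available on Kaggle. It contains 64,720 movie review entries, each labeled with a sentiment (pos for positive, neg for negative), which makes it well suited to binary sentiment classification.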

Loading the Data

First, import the necessary libraries and load the dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (the relative path assumes movie_review.csv is in the working directory)
data = pd.read_csv('movie_review.csv')
data.head()



Sample Output:

fold_id  cv_tag  html_id  sent_id  text                                          tag
0        cv000   29590    0        films adapted from comic books have ...       pos
1        cv000   29590    1        for starters, it was created by Alan ...      pos
2        cv000   29590    2        to say Moore and Campbell thoroughly r...     pos
3        cv000   29590    3        the book (or "graphic novel," if you wi...    pos
4        cv000   29590    4        in other words, don't dismiss this film b...  pos


Data Preprocessing

Proper preprocessing is essential before feature extraction and model training.

Importing Libraries

Make sure all the required libraries are installed. Scikit-Learn provides the tools for both text vectorization and model building.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score



Loading the Data

The dataset is already loaded. Now separate the features from the labels.

X = data.iloc[:, -2]  # Second-to-last column: 'text' (the features)
y = data.iloc[:, -1]  # Last column: 'tag' (the labels)



Feature Extraction

Machine learning models require numerical input, so the text must be converted into numerical features. Two common approaches are CountVectorizer and TfidfVectorizer.

CountVectorizer

CountVectorizer converts text into a document-term count matrix: each row is a document, each column is a word in the vocabulary, and each cell holds the number of times that word occurs in that document.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)  # learn the vocabulary and count the words

print(vectorizer.get_feature_names_out())  # vocabulary, in column order
print(X_counts.toarray())                  # dense view of the sparse count matrix



Output:

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
					
				
			
		



TfidfVectorizer

TfidfVectorizer goes beyond raw counts: it scales each term's count by how rare the term is across the documents (its inverse document frequency). This lowers the weight of words that appear in most documents and highlights the more informative ones.

from sklearn.feature_extraction.text import TfidfVectorizer

# X_train is the training split produced by train_test_split later in this guide
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF weights from the training text
print(vectorizer.get_feature_names_out())
# toarray() builds a dense copy, which can use a lot of memory on the full dataset
print(X_tfidf.toarray())



Output (array only):

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]]



Note: The actual output is a large sparse matrix, mostly zeros, that holds the TF-IDF weight of each term in each document.

Model Training and Evaluation

With a numerical representation in place, we can move on to training a classifier.

Train/Test Split

Splitting the dataset into training and test sets lets us evaluate the model on data it has not seen.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
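
The split above is purely random. As an optional variation, passing stratify keeps the proportion of each label the same in both splits, which can matter when the classes are uneven. The sketch below uses hypothetical toy data (texts and tags are stand-ins for the 'text' and 'tag' columns, not values from the dataset).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 positive and 20 negative entries
texts = pd.Series([f"review {i}" for i in range(50)])
tags = pd.Series(['pos'] * 30 + ['neg'] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    texts, tags,
    test_size=0.20,    # hold out 20% of the rows for testing
    random_state=1,    # make the shuffle reproducible
    stratify=tags,     # preserve the pos/neg ratio in both splits
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```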
					
				
			
		



Training a LinearSVC Model

LinearSVC is a linear support vector machine (SVM) classifier. It is a common choice for text classification because it copes well with high-dimensional sparse features.

from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(X_tfidf, y_train)  # X_tfidf is the TF-IDF matrix built from X_train
					
				
			
		



Evaluating Model Performance

Evaluate the model's accuracy on the test set.

from sklearn.metrics import accuracy_score

# Transform the raw test text with the vectorizer that was fitted on the training data
X_test_tfidf = vectorizer.transform(X_test)
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")



Potential Output:

Model Accuracy: 0.85



Note: The actual accuracy will vary with the dataset and the preprocessing steps.
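
Accuracy alone does not show which class the errors fall in. As a hedged extension, Scikit-Learn's confusion_matrix and classification_report break the result down per class. The labels below are hypothetical stand-ins for y_test and y_pred, chosen only so the example runs on its own.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (stand-ins for y_test and y_pred)
y_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos']
y_hat = ['pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos']

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=['neg', 'pos'])

print(f"Accuracy: {acc:.2f}")
print(cm)
print(classification_report(y_true, y_hat))  # per-class precision, recall and F1
```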

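Common Challenges and Solutions

Handling Sparse Matrices

Text data usually produces high-dimensional sparse matrices in which most elements are zero. Storing such a matrix as a dense array wastes memory, which is why Scikit-Learn's vectorizers return SciPy sparse matrices.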
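
A rough way to see the difference is to compare the memory used by the sparse representation with that of a dense copy. The sketch below uses synthetic documents (an assumption made for illustration) in which every document has one unique word and one shared word, imitating a large vocabulary with little overlap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic documents: one unique word plus one shared word each
docs = [f"word{i} common" for i in range(200)]

X = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix in CSR format

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)        # fraction of cells that are non-zero

# Memory held by the three CSR arrays versus a dense copy of the same matrix
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes

print(f"shape={X.shape}, non-zeros={X.nnz}, density={density:.4f}")
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

Estimators such as LinearSVC accept the sparse matrix directly, so calling toarray() is usually only needed for inspecting small examples.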

Problem:
If X_test is passed to the model without first being transformed by the same vectorizer that was fitted on X_train, the model may raise an error or produce unreliable predictions.

Solution:
Always use the same vectorizer instance to transform both the training and the test data, and never fit the vectorizer on the test data.

# Correct Transformation
X_test_tfidf = vectorizer.transform(X_test)  # reuse the vocabulary learned from X_train
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)



Avoid:

# Incorrect Transformation
X_test_tfidf = vectorizer.fit_transform(X_test)  # This refits the vectorizer on test data



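Inconsistent Data Shapes

The transformed test data must have the same number of columns as the training data for the model to make predictions.

Problem:
If the test data contains words that were not seen during training and the vectorizer is refitted on it, the vectorizer builds a new vocabulary, so the test feature matrix can have a different number of columns than the one the model was trained on.

Solution:
Use transform rather than fit_transform on the test data. Words that were not seen during training are then ignored, and the column layout stays consistent.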
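
The effect is easy to reproduce on a tiny example. In the sketch below (the documents are made up for illustration), the word "soundtrack" appears only in the test documents: transform keeps the four training columns and ignores it, while refitting produces a matrix with a different width.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only for illustration
train_docs = ["good film", "bad film", "good acting"]
test_docs = ["good soundtrack", "bad soundtrack and bad acting"]

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test_vec = vectorizer.transform(test_docs)        # reuses that vocabulary

print(X_train_vec.shape, X_test_vec.shape)          # same number of columns
print('soundtrack' in vectorizer.vocabulary_)       # unseen word was ignored

# Refitting on the test documents builds a different vocabulary
refit = TfidfVectorizer().fit_transform(test_docs)
print(refit.shape)
```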

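Model Overfitting

A model may perform exceptionally well on the training data yet poorly on unseen data.

Solution:
Use techniques such as cross-validation and regularization, and make sure the dataset is balanced, to guard against overfitting.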
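
As a minimal sketch of both ideas, the code below runs 3-fold cross-validation for several values of LinearSVC's C parameter, where a smaller C means stronger regularization. The reviews are hypothetical toy data, so the scores themselves carry no meaning; the point is the pattern of comparing settings by their cross-validated score instead of their training score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews: six positive followed by six negative
texts = [
    "a wonderful and moving film",
    "wonderful acting and a great story",
    "great fun from start to finish",
    "a moving story with great acting",
    "fun, clever and wonderful",
    "a clever script and a moving ending",
    "a dull and boring film",
    "boring plot and terrible acting",
    "terrible pacing and a dull script",
    "a boring story with a dull ending",
    "dull, slow and terrible",
    "a slow plot and a boring finish",
]
labels = ['pos'] * 6 + ['neg'] * 6

mean_scores = {}
for C in (0.01, 1.0, 100.0):
    # The vectorizer sits inside the pipeline, so it is refitted on each training fold
    clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=C))
    scores = cross_val_score(clf, texts, labels, cv=3)
    mean_scores[C] = scores.mean()
    print(f"C={C}: mean accuracy {scores.mean():.2f}")
```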

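Overcoming Challenges with Pipelines

As emphasized above, managing every preprocessing and modeling step by hand is tedious and error-prone. Scikit-Learn's Pipeline class chains these steps together, which ensures consistency and improves code readability.

Benefits of using a pipeline:

    Simpler workflow: the entire workflow is encapsulated in a single object.
    Consistency: the same preprocessing steps are applied during training and prediction.
    Easier hyperparameter tuning: grid search and cross-validation work seamlessly.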


Example:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC())
])

# Training the pipeline on the raw training text
pipeline.fit(X_train, y_train)

# Predicting and evaluating; the pipeline vectorizes X_test itself
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline Model Accuracy: {accuracy:.2f}")



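This approach removes the need to transform the test data separately, because the pipeline applies every required transformation in the correct order.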
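
The hyperparameter tuning benefit can be sketched with GridSearchCV. Parameters of a pipeline step are addressed as step name, two underscores, then parameter name. The grid values and the toy reviews below are assumptions chosen only to keep the example small and self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews: six positive followed by six negative
texts = [
    "a wonderful and moving film",
    "wonderful acting and a great story",
    "great fun from start to finish",
    "a moving story with great acting",
    "fun, clever and wonderful",
    "a clever script and a moving ending",
    "a dull and boring film",
    "boring plot and terrible acting",
    "terrible pacing and a dull script",
    "a boring story with a dull ending",
    "dull, slow and terrible",
    "a slow plot and a boring finish",
]
labels = ['pos'] * 6 + ['neg'] * 6

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC()),
])

# '<step name>__<parameter>' selects a parameter of one pipeline step
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")

# The best pipeline is refitted on all the data and can predict on raw text
prediction = search.predict(["a wonderful film"])
print(prediction)
```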

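Conclusion and Next Steps

Building a robust text classifier involves careful preprocessing, feature extraction, model selection, and evaluation. Scikit-Learn's tools, such as CountVectorizer, TfidfVectorizer, LinearSVC, and Pipeline, streamline that process and help you reach good accuracy on NLP tasks.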

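Next steps:

    Try different models: explore other classifiers, such as Naive Bayes or deep learning models, to see whether they perform better.
    Hyperparameter tuning: use grid search or random search to optimize model parameters and improve accuracy.
    Advanced feature extraction: incorporate techniques such as n-grams, word embeddings, or TF-IDF with different normalization strategies.
    Handling imbalanced data: apply undersampling or oversampling, or use specialized metrics, for datasets with class imbalance.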
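
As a starting point for the first and third suggestions, the sketch below swaps the classifier for Multinomial Naive Bayes and adds bigrams to the TF-IDF features. It is one possible variation under toy assumptions (the reviews are made up), not a recommendation that it will outperform LinearSVC on the movie review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews, used only for illustration
texts = [
    "a wonderful and moving film",
    "wonderful acting, great story",
    "great fun from start to finish",
    "a dull and boring film",
    "boring plot, terrible acting",
    "terrible pacing and a dull script",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),  # single words and word pairs
    ('nb', MultinomialNB()),
])
nb_pipeline.fit(texts, labels)

# Bigram features contain a space, e.g. 'great story'
vocab = nb_pipeline.named_steps['tfidf'].vocabulary_
bigrams = [term for term in vocab if ' ' in term]
print(len(vocab), "features, including bigrams such as", bigrams[:3])

preds = nb_pipeline.predict(["a wonderful story", "a boring plot"])
print(preds)
```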


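Text classification opens the door to many applications, from understanding customer sentiment to automating content moderation. With the foundation laid in this guide, you are well equipped to explore NLP further.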



References:

    Scikit-Learn TfidfVectorizer documentation
    Scikit-Learn tutorial on working with text data
    Kaggle movie review dataset