Python Text Analysis: Study Notes

This article shows how to analyze text with Python, with a code example for each step.

I. Reading and Processing Text

1. Reading and writing text files


# Read a text file
with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Write a text file
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text)
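
For files too large to load with a single `read()` call, reading line by line is a common alternative. The following is a minimal sketch; the helper name `iter_lines` and the temporary sample file are illustrative assumptions, not part of the example above.

```python
import os
import tempfile

def iter_lines(path, encoding='utf-8'):
    """Yield the lines of a text file one at a time, without trailing newlines."""
    with open(path, 'r', encoding=encoding) as file:
        for line in file:
            yield line.rstrip('\n')

# Usage with a small temporary file (hypothetical sample content)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as tmp:
    tmp.write('第一行\n第二行\n')
    sample_path = tmp.name

lines = list(iter_lines(sample_path))
os.remove(sample_path)
print(lines)
```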

2. Cleaning and preprocessing text


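import re

# Remove punctuation and special characters
# (anything that is neither a word character nor whitespace)
clean_text = re.sub(r'[^\w\s]', '', text)

# Split the string into a list on whitespace.
# Note: this assumes the words are already separated by spaces;
# unsegmented Chinese text would need a word segmenter first.
word_list = clean_text.split()

# Remove stop words
stop_words = ['的', '了', '是', '在', '我', '你']
filtered_words = [word for word in word_list if word not in stop_words]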
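
The cleaning steps above can be wrapped in one reusable function. This is a sketch under stated assumptions: the function name `preprocess` and the sample sentence are hypothetical, and the input is assumed to be already separated by spaces.

```python
import re

def preprocess(text, stop_words):
    """Strip punctuation, split on whitespace, and drop stop words."""
    stop_set = set(stop_words)  # set lookup is faster than list lookup
    clean_text = re.sub(r'[^\w\s]', '', text)
    return [word for word in clean_text.split() if word not in stop_set]

# Hypothetical, pre-segmented sample sentence
sample = "我 喜欢 Python ， 你 呢 ？"
tokens = preprocess(sample, ['的', '了', '是', '在', '我', '你'])
print(tokens)
```

Converting the stop-word list to a set keeps the membership test fast when the list grows to hundreds of entries.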

II. Text Statistics and Visualization

1. Counting word frequencies


from collections import Counter

# Count word frequencies
word_counts = Counter(filtered_words)

# Print the 10 most frequent words
top_10_words = word_counts.most_common(10)
print(top_10_words)
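
Raw counts are hard to compare across texts of different lengths, so it can help to convert them to proportions. The sketch below assumes a hypothetical helper named `relative_frequencies` and a made-up word list.

```python
from collections import Counter

def relative_frequencies(words, n=10):
    """Return the n most common words with their share of all tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(n)]

# Hypothetical sample word list
sample_words = ['文本', '分析', '文本', '学习']
top_words = relative_frequencies(sample_words, n=2)
print(top_words)
```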

2. Generating a word cloud


import matplotlib.pyplot as plt
from wordcloud import WordCloud

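# Generate the word cloud.
# Note: for Chinese text, pass a font that contains CJK glyphs via
# WordCloud(font_path=...); the default font may not render Chinese characters.
wordcloud = WordCloud().generate(clean_text)

# Display the word cloud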
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

III. Sentiment Analysis

1. Classifying sentiment with a sentiment lexicon


positive_words = ['喜欢', '开心', '幸福', '快乐']
negative_words = ['悲伤', '失望', '痛苦', '愤怒']

# Count the positive and negative sentiment words in the text
# (a Counter returns 0 for missing keys, so no membership check is needed)
positive_count = sum(word_counts[word] for word in positive_words)
negative_count = sum(word_counts[word] for word in negative_words)

# Print the sentiment classification
if positive_count > negative_count:
    print("This is a positive text")
elif positive_count < negative_count:
    print("This is a negative text")
else:
    print("This is a neutral text")
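
The lexicon approach can also be written as a function that returns a label instead of printing it, which makes it easier to apply to many texts. This is a minimal sketch; the name `classify_sentiment` and the sample tokens are assumptions for illustration.

```python
from collections import Counter

POSITIVE_WORDS = ['喜欢', '开心', '幸福', '快乐']
NEGATIVE_WORDS = ['悲伤', '失望', '痛苦', '愤怒']

def classify_sentiment(words):
    """Label a token list by comparing positive and negative lexicon hits."""
    counts = Counter(words)
    positive = sum(counts[word] for word in POSITIVE_WORDS)
    negative = sum(counts[word] for word in NEGATIVE_WORDS)
    if positive > negative:
        return 'positive'
    if positive < negative:
        return 'negative'
    return 'neutral'

# Hypothetical token list: two positive hits, one negative hit
label = classify_sentiment(['今天', '开心', '开心', '失望'])
print(label)
```

Note that this counts lexicon hits only; it does not handle negation such as "不开心".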

2. Sentiment analysis with machine learning


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

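# Build the feature matrix; each word is treated as its own one-word "document".
# This token_pattern keeps single-character words, which the default pattern drops.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(filtered_words)

# Train the sentiment classifier.
# Toy labels: a word is 'positive' if it is in the lexicon, otherwise 'negative'.
y = ['positive' if word in positive_words else 'negative' for word in filtered_words]
model = MultinomialNB()
model.fit(X, y)

# Predict the sentiment of new text. transform() expects a list of documents,
# and the text must be space-separated in the same way as the training words.
new_text = "我 很 开心"
X_new = vectorizer.transform([new_text])
predicted = model.predict(X_new)

# Print the sentiment classification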
print(predicted)
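
To show what a multinomial naive Bayes classifier computes, here is a from-scratch sketch trained on labeled sentences rather than single words. It is a hand-rolled stand-in using only the standard library, not scikit-learn's implementation; the function names and the tiny training set are hypothetical, and every sentence is assumed to be already segmented into tokens.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class document counts, token counts, and the vocabulary."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, token_counts, vocabulary

def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_doc_counts, token_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip tokens never seen in training
            smoothed = (token_counts[label][token] + 1) / (total_tokens + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, pre-segmented training sentences
train_docs = [
    ['我', '很', '开心'], ['今天', '非常', '快乐'],
    ['我', '喜欢', '这部', '电影'], ['生活', '很', '幸福'],
    ['我', '很', '悲伤'], ['结果', '令人', '失望'],
    ['他', '非常', '愤怒'], ['这', '太', '痛苦', '了'],
]
train_labels = ['positive'] * 4 + ['negative'] * 4

nb_model = train_naive_bayes(train_docs, train_labels)
prediction = predict_naive_bayes(nb_model, ['我', '很', '开心'])
print(prediction)
```

Training on whole sentences lets the model learn from the words that surround the sentiment terms, which word-level labels cannot capture. A training set this small is only for illustration.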

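This article covered the basic steps and common techniques of text analysis in Python, with code for each: reading and cleaning text, counting word frequencies, drawing a word cloud, and classifying sentiment with a lexicon and with a naive Bayes model. These tools make it easier to process, summarize, and classify text and to extract the information it contains.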