Applying the TF-IDF Algorithm in Python

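TF-IDF is an algorithm widely used in text analysis and information retrieval. It scores the importance of each word by combining the word's term frequency (TF) within a document with its inverse document frequency (IDF) across the whole document collection.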

I. Introduction to the TF-IDF Algorithm

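TF stands for term frequency and measures how important a word is within a single document. The TF of a word is the number of times the word occurs in the document divided by the total number of words in that document.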

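import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords

# Download the NLTK data used below (only needed once)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Define the stop word set (NLTK's English stop words are all lowercase)
stopwords = set(nltk_stopwords.words('english'))

# Define the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function: compute the TF of each word in a tokenized sentence
def calculate_tf(tokens):
    tf_dict = {}
    total_words = len(tokens)

    for word in tokens:
        # Lowercase so that 'This' matches the stop word 'this'
        word = word.lower()
        # Skip stop words and punctuation (keep alphanumeric tokens only)
        if word not in stopwords and word.isalnum():
            # Lemmatize
            word = lemmatizer.lemmatize(word)
            # Count the occurrences of each word
            tf_dict[word] = tf_dict.get(word, 0) + 1

    # Divide each count by the total number of tokens in the sentence
    for word, count in tf_dict.items():
        tf_dict[word] = count / total_words

    return tf_dict

# Example
tokens = ['This', 'is', 'an', 'example', 'sentence']
tf = calculate_tf(tokens)
print(tf)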

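IDF stands for inverse document frequency and measures how informative a word is across the whole document collection. The IDF of a word is the logarithm of the total number of documents divided by the number of documents that contain the word, so words that occur in most documents receive a low IDF.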
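As a quick numeric check of the definition above, the sketch below evaluates the IDF formula for a hypothetical collection of 3 documents. The helper names `plain_idf` and `smoothed_idf` are illustrative and not part of the original article. The smoothed variant follows the formula documented for scikit-learn's default `smooth_idf=True` setting, which is one reason library output can differ from hand-computed values.

```python
import math

def plain_idf(total_docs, doc_freq):
    """IDF as defined above: log(total documents / documents containing the word)."""
    return math.log(total_docs / doc_freq)

def smoothed_idf(total_docs, doc_freq):
    """Smoothed variant: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + total_docs) / (1 + doc_freq)) + 1

# Hypothetical collection of 3 documents
print(plain_idf(3, 1))     # word found in 1 document: log(3), about 1.0986
print(plain_idf(3, 3))     # word found in every document: log(1) = 0.0
print(smoothed_idf(3, 3))  # smoothed value stays positive: log(1) + 1 = 1.0
```

With the plain formula, a word that appears in every document gets an IDF of exactly 0 and therefore contributes nothing to any TF-IDF score.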

import math

# Function: compute the IDF of each word
def calculate_idf(corpus):
    idf_dict = {}
    total_docs = len(corpus)

    for doc in corpus:
        # Collect the distinct words of this document so that each
        # document is counted at most once per word
        seen = set()
        for word in doc:
            word = word.lower()
            # Skip stop words and punctuation
            if word not in stopwords and word.isalnum():
                # Lemmatize
                seen.add(lemmatizer.lemmatize(word))

        # Count the number of documents containing each word
        for word in seen:
            idf_dict[word] = idf_dict.get(word, 0) + 1

    # Convert document counts into IDF values
    for word, count in idf_dict.items():
        idf_dict[word] = math.log(total_docs / count)

    return idf_dict

# Example
corpus = [
    ['This', 'is', 'the', 'first', 'document'],
    ['This', 'is', 'the', 'second', 'document'],
    ['And', 'this', 'is', 'the', 'third', 'one']
]
idf = calculate_idf(corpus)
print(idf)
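Multiplying a word's TF by its IDF gives the word's TF-IDF weight in a document. The following is a minimal, self-contained sketch of that combination using only the standard library. It deliberately skips stop word removal and lemmatization, and the names `tf_idf` and `toy_corpus` are hypothetical helpers introduced for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: tf-idf weight} dict per document of a tokenized corpus."""
    total_docs = len(corpus)
    docs = [[word.lower() for word in doc] for doc in corpus]

    # Document frequency: the number of documents containing each word
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF (count / document length) times IDF (log of N / document frequency)
            word: (count / len(doc)) * math.log(total_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

toy_corpus = [
    ['This', 'is', 'the', 'first', 'document'],
    ['This', 'is', 'the', 'second', 'document'],
    ['And', 'this', 'is', 'the', 'third', 'one'],
]
weights = tf_idf(toy_corpus)
print(weights[0])
```

In this toy corpus, words such as 'this' and 'is' occur in all three documents and end up with a weight of 0, while 'first' occurs in only one document and receives the highest weight in the first document.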

II. Applications of the TF-IDF Algorithm

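1. Text similarity computation:
TF-IDF can be used to measure how similar two documents are. Each document is first converted into a vector of TF-IDF weights, and the cosine similarity between two vectors then indicates how closely related the documents are.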

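from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# TfidfVectorizer expects raw strings, so join each token list back into a string
documents = [' '.join(doc) for doc in corpus]

# Compute the TF-IDF matrix of the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Compute the pairwise cosine similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)

# Example
# similarity_matrix[i][j] is the similarity between documents i and j,
# so similarity_matrix[0][1] compares the first two documents of the corpus
print(similarity_matrix[0][1])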
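To make the similarity measure itself concrete, here is a small sketch of cosine similarity computed by hand on sparse TF-IDF vectors stored as dictionaries. The function name `cosine` and the weights in `doc_a` and `doc_b` are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for two short documents
doc_a = {'first': 0.22, 'document': 0.08}
doc_b = {'second': 0.22, 'document': 0.08}

print(cosine(doc_a, doc_b))  # low: only 'document' is shared
print(cosine(doc_a, doc_a))  # approximately 1.0: identical vectors
```

Because the score depends only on the angle between the vectors, two documents of very different lengths can still be rated as highly similar when they use the same words in similar proportions.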

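2. Keyword extraction:
TF-IDF can also be used to extract the keywords of a document. Multiplying each word's TF by its IDF gives an importance score; the words are then sorted by score and the top few are selected as keywords.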

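# Function: extract the keywords of a document
def extract_keywords(doc, top_n):
    counts = {}

    # Count each word that survives preprocessing
    for word in doc:
        word = word.lower()
        # Skip stop words and punctuation
        if word not in stopwords and word.isalnum():
            # Lemmatize
            word = lemmatizer.lemmatize(word)
            counts[word] = counts.get(word, 0) + 1

    # TF-IDF score = TF in this document * IDF from the corpus above.
    # Simplifying assumption: words that never occur in the corpus get an IDF of 0
    tfidf_scores = [
        (word, count / len(doc) * idf.get(word, 0.0))
        for word, count in counts.items()
    ]

    # Sort by score, highest first
    tfidf_scores.sort(key=lambda x: x[1], reverse=True)

    # Take the top_n keywords
    keywords = [x[0] for x in tfidf_scores[:top_n]]

    return keywords

# Example
doc = ['This', 'is', 'a', 'sample', 'document']
keywords = extract_keywords(doc, 3)
print(keywords)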

III. Summary

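TF-IDF is a commonly used algorithm in text analysis and information retrieval that measures the importance of a word by combining term frequency (TF) with inverse document frequency (IDF). In Python, it can be implemented and applied with libraries such as nltk and sklearn.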

TF-IDF is widely applied to tasks such as text similarity computation and keyword extraction, and it helps us better understand and analyze text data.