Comparing String Similarity in Python

Solution:

from difflib import SequenceMatcher
s1 = "abcdef"
s2 = "abcdeg"
s3 = "abcdeh"

print(f"Similarity between s1 and s2: {SequenceMatcher(None, s1, s2).ratio()}")
print(f"Similarity between s1 and s3: {SequenceMatcher(None, s1, s3).ratio()}")

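String similarity comparison in Python means measuring how alike two strings are. Many tasks need to match and compare text: spelling correction in search engines, password cracking, and text similarity comparison, for example, all rely on string similarity algorithms.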

1. Similarity comparison based on SequenceMatcher

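Python's difflib library provides the SequenceMatcher class for computing string similarity. SequenceMatcher scores similarity from the contiguous runs of characters that the two strings have in common.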

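In the example code at the top of this article, comparing s1 with s2 prints a similarity of about 0.8333, and comparing s1 with s3 prints the same value, about 0.8333. Each pair shares the five-character block "abcde" and differs only in the last character, and ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings: 2*5/12 ≈ 0.8333. The score therefore depends on how much contiguous text the strings share, not on which character replaces the mismatched one.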
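To make the 2*M/T formula concrete, the sketch below recomputes the score from the matching blocks that SequenceMatcher reports. The helper name manual_ratio is an assumption introduced here for illustration; it is not part of difflib.

```python
from difflib import SequenceMatcher


def manual_ratio(a: str, b: str) -> float:
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is (start in a, start in b, size); the last block is a
    # zero-length sentinel, so summing the sizes gives M directly.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib defines the ratio of two empty sequences as 1.0
    return 2.0 * matched / total if total else 1.0


for other in ("abcdeg", "abcdeh"):
    library_value = SequenceMatcher(None, "abcdef", other).ratio()
    print(other, round(library_value, 4), round(manual_ratio("abcdef", other), 4))
```

Both lines print 0.8333 twice, which confirms that the two sample pairs are scored identically.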

2. Edit distance based on Levenshtein distance

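Levenshtein distance, also called edit distance, is the minimum number of edits needed to turn one string into another. Three edit operations are allowed: substitution, insertion, and deletion, and each operation changes exactly one character. The distance can be computed with dynamic programming.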
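The dynamic-programming formulation can be written as a recurrence. The notation is an assumption chosen here for illustration: a and b are the two strings, and D(i, j) is the distance between the first i characters of a and the first j characters of b.

```latex
\[
D(i,0) = i, \qquad D(0,j) = j,
\]
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) + [a_i \neq b_j] & \text{(substitution, free on a match)}
\end{cases}
\]
```

The final answer is D(|a|, |b|). Each row depends only on the row before it, so an implementation only needs to keep two rows in memory.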

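def levenshtein_distance(s1, s2):
    """Return the minimum number of single-character edits turning s1 into s2."""
    # Keep s2 as the shorter string so each row is as short as possible
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    # Reaching the empty string takes one deletion per character of s1
    if len(s2) == 0:
        return len(s1)

    # Row 0: distance from the empty prefix of s1 to every prefix of s2
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1          # cell above
            deletions = current_row[j] + 1                # cell to the left
            substitutions = previous_row[j] + (c1 != c2)  # diagonal cell, free on a match
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]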

s1 = "abcdef"
s2 = "abcdeg"
s3 = "abcdeh"

print(f"Levenshtein distance between s1 and s2: {levenshtein_distance(s1, s2)}")
print(f"Levenshtein distance between s1 and s3: {levenshtein_distance(s1, s3)}")

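In the example code above, comparing s1 with s2 prints an edit distance of 1, and comparing s1 with s3 also prints 1, because each pair differs by a single substitution of the last character. Levenshtein distance counts how many single-character edits are needed and does not care which character is substituted.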
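An edit distance is a count, not a score between 0 and 1, so it is not directly comparable with SequenceMatcher's ratio. One common convention, shown here as a sketch and not as something the code above defines, divides the distance by the length of the longer string. The names edit_distance and normalized_similarity are assumptions introduced for this example; edit_distance is a compact single-row variant of the same dynamic programme.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using one rolling row of the DP table."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # diagonal holds the value from the previous row, one column to the left
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            diagonal, row[j] = row[j], min(
                row[j] + 1,             # cell above
                row[j - 1] + 1,         # cell to the left
                diagonal + (ca != cb),  # diagonal cell, free on a match
            )
    return row[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(round(normalized_similarity("abcdef", "abcdeg"), 4))  # 0.8333
print(round(normalized_similarity("abcdef", "abcdeh"), 4))  # 0.8333
```

Other normalizations exist, such as dividing by the combined length, so scores are only comparable when the same convention is used throughout.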

3. Word-vector comparison based on cosine similarity

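Besides the two algorithms above, strings can also be turned into vectors and compared by cosine similarity. This approach is usually implemented with a natural language processing library such as NLTK or gensim.

The following example shows how to use the gensim library to compute the similarity between two texts: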

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Train a Doc2Vec model with a set of sample texts
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

# Compare two texts based on their cosine similarity
s1 = "abc efg hij"
s2 = "abc def hij"
s3 = "foo bar zoo"
s1_vec = model.infer_vector(s1.split())
s2_vec = model.infer_vector(s2.split())
s3_vec = model.infer_vector(s3.split())

from scipy.spatial.distance import cosine
# scipy's cosine() returns the cosine distance, so 1 - distance is the cosine similarity
print(f"Cosine similarity between s1 and s2: {1 - cosine(s1_vec, s2_vec)}")
print(f"Cosine similarity between s1 and s3: {1 - cosine(s1_vec, s3_vec)}")

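The code above trains a small Doc2Vec model on gensim's common_texts sample corpus (it does not load pre-trained word vectors), infers a vector for each of the three texts, and compares s1 with s2 and s1 with s3. The closer the cosine similarity is to 1, the more similar the two texts are. Note that none of the tokens in s1, s2, and s3 occur in common_texts, so the model has learned nothing about them and the printed scores for these sample strings should not be read as meaningful; in practice the model should be trained on a corpus that covers the vocabulary being compared.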
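Cosine similarity itself does not require a trained model. As a minimal sketch of a different vectorization, not of the Doc2Vec approach above, the example below represents each string as a vector of character-bigram counts and compares the vectors using only the standard library. The function names char_ngrams and cosine_similarity and the choice of n=2 are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count every overlapping run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for missing keys, so only shared n-grams contribute
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n has no n-grams to compare
    return dot / (norm_a * norm_b)


s1 = "abc efg hij"
s2 = "abc def hij"
s3 = "foo bar zoo"

print(round(cosine_similarity(char_ngrams(s1), char_ngrams(s2)), 4))  # 0.7
print(round(cosine_similarity(char_ngrams(s1), char_ngrams(s3)), 4))  # 0.0
```

This variant measures surface overlap only: s1 and s2 share seven of their ten bigrams, while s1 and s3 share none. It says nothing about meaning, which is what a model trained on a suitable corpus is intended to capture.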