KNN algorithm Python code, Python algorithms explained

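DFA is the classic algorithm for sensitive-word filtering. After reading up on it, I implemented it myself and ran a benchmark against the plain approach.

Code first: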

#!/usr/bin/python2.6
# -*- coding: utf-8 -*-

import time

class Node(object):
    def __init__(self):
        # None marks a leaf, i.e. the end of a sensitive word
        self.children = None

# The encoding of word is UTF-8
def add_word(root, word):
    """Insert one word into the tree, one element (byte) at a time."""
    node = root
    for i in range(len(word)):
        if node.children is None:
            node.children = {}
            node.children[word[i]] = Node()
        elif word[i] not in node.children:
            node.children[word[i]] = Node()
        node = node.children[word[i]]

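def init(path):
    """Build the word tree from a file with one sensitive word per line."""
    root = Node()
    fp = open(path, 'r')
    for line in fp:
        line = line[0:-1]  # drop the trailing newline
        # print len(line)
        # print line
        # print type(line)
        add_word(root, line)
    fp.close()
    return root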

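# The encoding of word is UTF-8
# The encoding of message is UTF-8
def is_contain(message, root):
    """Return True if some sensitive word starts at any position of message."""
    for i in range(len(message)):
        p = root
        j = i
        # Walk down the tree for as long as the message keeps matching
        while (j < len(message) and p.children is not None
               and message[j] in p.children):
            p = p.children[message[j]]
            j = j + 1
        # Stopping on a leaf means a whole word was matched
        if p.children is None:
            # print '---word---', message[i:j]
            return True
    return False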

def dfa():
    print '----------------DFA----------------'
    root = init('/tmp/word.txt')
    message = '到处乱叫，吓得呆在家里的11岁女儿躲在房间里出不来，管辖派出所的警察赶到后，把孩子从家里救了出来。最后征得主人同意后，民警和村民联手杀死了这只疯狗。'
    # message = '忽略'
    print '***message***', len(message)
    # Time 1000 scans of the same message
    start_time = time.time()
    for i in range(1000):
        res = is_contain(message, root)
        # print res
    end_time = time.time()
    print (end_time - start_time)

def is_contain2(message, word_list):
    """Plain approach: search the message for each word in turn."""
    for item in word_list:
        if message.find(item) != -1:
            return True
    return False

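def normal():
    print '----------------normal----------------'
    path = '/tmp/word.txt'
    fp = open(path, 'r')
    word_list = []
    # Same message as in dfa(); it must be defined here before it is used
    message = '到处乱叫，吓得呆在家里的11岁女儿躲在房间里出不来，管辖派出所的警察赶到后，把孩子从家里救了出来。最后征得主人同意后，民警和村民联手杀死了这只疯狗。'
    print '***message***', len(message)
    for line in fp:
        line = line[0:-1]  # drop the trailing newline
        word_list.append(line)
    fp.close()
    print 'The count of word:', len(word_list)
    # Time 1000 scans of the same message
    start_time = time.time()
    for i in range(1000):
        res = is_contain2(message, word_list)
        # print res
    end_time = time.time()
    print (end_time - start_time)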

if __name__ == '__main__':
    dfa()
    normal()
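One caveat about the matcher above: it only reports a hit when it stops on a node without children, so a word that is a prefix of a longer word in the list (for example "ab" next to "abc") is never reported on its own. The following is a minimal Python 3 sketch of the same idea with an explicit end-of-word flag. It is not the code used in the benchmark; the names `is_end`, `build_trie` and `contains_sensitive`, and the in-memory word list, are assumptions made for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}    # next character -> Node
        self.is_end = False   # assumption: explicit end-of-word flag


def build_trie(words):
    """Build the word tree from an iterable of non-empty strings."""
    root = Node()
    for word in words:
        if not word:
            continue  # skip empty lines
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_end = True
    return root


def contains_sensitive(message, root):
    """Return True if any word in the tree occurs somewhere in message."""
    for i in range(len(message)):
        node = root
        for j in range(i, len(message)):
            node = node.children.get(message[j])
            if node is None:
                break          # no word continues with this character
            if node.is_end:
                return True    # a complete word ends at position j
    return False


# Hypothetical word list, used only for this example
root = build_trie(["bad", "badge", "evil"])
print(contains_sensitive("a bad day", root))
print(contains_sensitive("a good day", root))
```

Because the flag is stored on the node itself, "bad" is still found even though "badge" extends it.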

Test results (elapsed seconds for 1000 scans of the message):

1) 100 sensitive words

----------------DFA----------------
***message*** 224
0.325479984283
----------------normal----------------
***message*** 224
The count of word: 100
0.107350111008

2) 1000 sensitive words

----------------DFA----------------
***message*** 224
0.324251890182
----------------normal----------------
***message*** 224
The count of word: 1000
1.05939006805

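The experiment shows that the DFA algorithm only pays off when there are many sensitive words. With 100 sensitive words it is actually slower than the plain approach (about 0.325 s against 0.107 s), while with 1000 words it is roughly three times faster (about 0.324 s against 1.059 s).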

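Next, the time complexity derived in theory. To keep the analysis simple, assume that the message text has length lenA, that every sensitive word has the same length lenB, and that there are m sensitive words.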

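1) The core of the DFA algorithm is building a multi-way tree. Since all sensitive words are assumed to have the same length, the maximum depth of the tree is lenB. Deciding whether a substring that starts at a given position (byte) of the message lies on the sensitive-word tree therefore takes at most lenB comparisons. There are lenA start positions, so deciding whether a message contains a sensitive word has time complexity O(lenA * lenB).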

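2) Now the plain approach: a for loop that searches the message for each sensitive word in turn. Assume the string search uses the KMP algorithm, whose time complexity is O(lenA + lenB).

Searching for m sensitive words then has time complexity O((lenA + lenB) * m).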

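In summary, the time complexity of the DFA algorithm is essentially independent of the number of sensitive words, while the cost of the plain approach grows linearly with it.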
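The two bounds can be checked by counting operations instead of measuring time. The following is a rough Python 3 sketch under the same assumptions (equal-length words, worst case with no early match). The nested-dict tree, the `END` marker, the function names and the generated word list are all hypothetical, and `str.find` stands in for the substring search without being KMP.

```python
import itertools

END = None  # end-of-word marker; a None key cannot collide with a character


def build_trie(words):
    """Build a nested-dict tree: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def trie_lookups(message, root):
    """Scan every start position (no early exit) and count dict lookups."""
    lookups = 0
    for i in range(len(message)):
        node = root
        for j in range(i, len(message)):
            lookups += 1
            node = node.get(message[j])
            if node is None or END in node:
                break  # fell off the tree, or reached the end of a word
    return lookups


def naive_searches(message, words):
    """Count the substring searches run by the per-word loop."""
    searches = 0
    for word in words:
        searches += 1
        if message.find(word) != -1:
            break
    return searches


len_b = 3
# Hypothetical word list: every 3-letter string over "abc" (27 words)
all_words = ["".join(p) for p in itertools.product("abc", repeat=len_b)]
message = "abd" * 10  # lenA = 30; contains none of the words

for m in (9, 27):
    words = all_words[:m]
    print("m =", m,
          "| trie lookups:", trie_lookups(message, build_trie(words)),
          "| bound lenA*lenB:", len(message) * len_b,
          "| naive searches:", naive_searches(message, words))
```

The lookup count of the tree scan stays under lenA * lenB however many words are loaded, while the plain loop runs one full substring search per word when nothing matches.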