Python big data mining, Python Q&A site

Word segmentation

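def getseg(text, wd_dict):  # stopping conditions for one segmentation step
    if not text:
        return ""
    if len(text) == 1:
        return text
    if text in wd_dict:
        return text
    else:
        # not a dictionary word: drop the last character and try again
        new_length = len(text) - 1
        text = text[0:new_length]
        res = getseg(text, wd_dict)
        return res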

def main(text_str, n_len, dict_name):  # forward maximum matching
    text_str = text_str.strip()  # remove whitespace at both ends of the string
    max_len = n_len  # maximum length of a candidate word
    result_str = ''  # holds the result to output
    while text_str:
        new_text = text_str[0:max_len]
        seg_str = getseg(new_text, dict_name)  # dict_name is the word dictionary
        result_str = result_str + seg_str + '/'
        seg_len = len(seg_str)
        text_str = text_str[seg_len:]
    return result_str

print('Segmentation finished')
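The greedy behaviour of forward maximum matching is easiest to see on a tiny input. The sketch below is a self-contained variant of the same idea (iterative rather than recursive, and returning a list); the name `forward_max_match` and the four-word `toy_dict` are invented for illustration only.

```python
def getseg(text, wd_dict):
    """Shrink text from the right until it is a dictionary word or one character."""
    while len(text) > 1 and text not in wd_dict:
        text = text[:-1]
    return text


def forward_max_match(text_str, max_len, wd_dict):
    """Segment greedily, taking the longest dictionary word at each position."""
    tokens = []
    text_str = text_str.strip()
    while text_str:
        seg = getseg(text_str[:max_len], wd_dict)
        tokens.append(seg)
        text_str = text_str[len(seg):]
    return tokens


# Toy dictionary, invented for this example
toy_dict = {"研究", "研究生", "生命", "起源"}
tokens = forward_max_match("研究生命起源", 3, toy_dict)
print("/".join(tokens))  # prints 研究生/命/起源
```

The intended reading is 研究/生命/起源, but the algorithm commits to the longest match 研究生 first and leaves 命 stranded. This is the known weakness of a purely greedy matcher, and one reason to use a statistical segmenter such as jieba.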

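jieba segmentation: install with pip install jieba

jieba supports three segmentation modes:

Accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis

Full mode: scans out every word in the sentence that can form a word

Search engine mode: based on accurate mode, cuts long words again; suited to search engine indexing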

import jieba

# test_str2 is the string to segment (defined elsewhere)
seg = jieba.cut(test_str2, cut_all=True, HMM=True)
print('Full mode: ' + '/'.join(seg))
seg = jieba.cut(test_str2, cut_all=False, HMM=True)
print('Accurate mode: ' + '/'.join(seg))
seg = jieba.cut_for_search(test_str2, HMM=True)
print('Search engine mode: ' + '/'.join(seg))

jieba.cut and jieba.cut_for_search return generators; jieba.lcut and jieba.lcut_for_search return a list directly
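One practical consequence: the result of jieba.cut is a lazy generator, so it can be iterated only once. The sketch below shows the pitfall without needing jieba installed; `fake_cut` is a hypothetical stand-in that only mimics the generator behaviour (it splits on spaces, which is not how jieba segments).

```python
def fake_cut(sentence):
    """Hypothetical stand-in for jieba.cut: yields tokens lazily, one at a time."""
    for token in sentence.split():
        yield token


seg = fake_cut("we like word clouds")
first = "/".join(seg)    # consumes the generator
second = "/".join(seg)   # nothing left to iterate over

# Materialize once if the tokens are needed more than once
tokens = list(fake_cut("we like word clouds"))

print(first)         # prints we/like/word/clouds
print(repr(second))  # prints ''
print(len(tokens))   # prints 4
```

With jieba itself, calling jieba.lcut (or wrapping jieba.cut in list()) gives the reusable form.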

import jieba.posseg as psg

for x in psg.cut(test_str2):
    print('word:', x.word)
    print('part of speech:', x.flag)

Add a custom dictionary to the jieba tokenizer: jieba.load_userdict(r'C:\Users\CDA\Desktop\my_dict.txt')
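The dictionary file itself is plain UTF-8 text with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. A minimal sketch of writing such a file follows; the entries, the helper `format_entry`, and the temporary path are all made up for illustration.

```python
import os
import tempfile

# Hypothetical entries: (word, optional frequency, optional part-of-speech tag)
entries = [("云计算", 5, "n"), ("词云", None, None), ("最大匹配", 3, None)]


def format_entry(word, freq=None, tag=None):
    """Build one dictionary line: fields in fixed order, separated by spaces."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


dict_path = os.path.join(tempfile.mkdtemp(), "my_dict.txt")
with open(dict_path, "w", encoding="utf-8") as f:  # the file should be UTF-8
    for word, freq, tag in entries:
        f.write(format_entry(word, freq, tag) + "\n")

# Read the file back to check what was written
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
# The file could then be loaded with jieba.load_userdict(dict_path)
```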

Word cloud: install with pip install wordcloud

from wordcloud import WordCloud, STOPWORDS

Example:

# Read the text data
with open(r'C:\Users\CDA\Desktop\西游记.txt', encoding='GB18030') as f:
    text = f.read()

# Segment the text and join the words with spaces
text_jieba = ' '.join(jieba.cut(text, cut_all=False, HMM=True))

# Set the basic word cloud parameters (a Chinese font is required for Chinese text)
my_wordcloud = WordCloud(background_color='white', stopwords=STOPWORDS,
                         font_path=r'C:\Windows\Fonts\simsun.ttc')

# Draw the word cloud from the segmented text
my_wordcloud.generate(text_jieba)

# Display the word cloud
import matplotlib.pyplot as plt
plt.imshow(my_wordcloud)
plt.axis('off')  # hide the axes
plt.show()
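The built-in STOPWORDS set in wordcloud is English, so it will not filter common Chinese function words. One option, sketched below under that assumption, is to count the segmented tokens yourself and drop unwanted ones before drawing. The token list, the stop-word set, and the helper `word_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical inputs: already-segmented tokens and a small Chinese stop-word set
tokens = ["悟空", "的", "师父", "是", "唐僧", "悟空", "打", "妖怪", "的", "悟空"]
stop_words = {"的", "是"}


def word_frequencies(tokens, stop_words):
    """Count tokens, skipping stop words and single-character tokens."""
    kept = [t for t in tokens if t not in stop_words and len(t) > 1]
    return Counter(kept)


freqs = word_frequencies(tokens, stop_words)
print(freqs.most_common(1))  # prints [('悟空', 3)]
# A mapping like this could be passed to WordCloud.generate_from_frequencies(freqs)
```

Dropping single-character tokens is a design choice for this sketch, not a rule; keep them if they carry meaning in your corpus.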