
word2vec word vectors (Tencent AI Lab embeddings)

Date: 2023-05-03 18:57:45 Views: 33165 Author: 4051

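import json
import time
from collections import OrderedDict

from gensim.models import KeyedVectors
from annoy import AnnoyIndex

# The path is truncated in the source; point it at the Tencent AI Lab embedding txt file.
EMBEDDING_TXT = './data/tencent_ailab_...'

t1 = time.time()
tc_wv_model = KeyedVectors.load_word2vec_format(EMBEDDING_TXT, binary=False)
print('load cost: {:.3f}'.format(time.time() - t1))

# Put every word from the txt file into an ordered dict, mapped to its row number.
t2 = time.time()
word_index = OrderedDict()
for counter, key in enumerate(tc_wv_model.key_to_index.keys()):
    word_index[key] = counter
print('build word_index cost: {:.3f}'.format(time.time() - t2))

# Save the ordered dict locally.
t3 = time.time()
with open('tc_word_index.json', 'w') as fp:
    json.dump(word_index, fp)
print('save tc_word_index cost: {:.3f}'.format(time.time() - t3))

# The Tencent word vectors are 200-dimensional.
t4 = time.time()
tc_index = AnnoyIndex(200, metric='angular')
i = 0
for key in tc_wv_model.key_to_index.keys():
    v = tc_wv_model[key]
    tc_index.add_item(i, v)
    i += 1
tc_index.build(10)
print('build tc_index tree cost: {:.3f}'.format(time.time() - t4))

t5 = time.time()
tc_index.save('tc_index_build10.index')
print('save tc_index tree cost: {:.3f}'.format(time.time() - t5))

# Reload both result files from disk.
t6 = time.time()
with open('tc_word_index.json', 'r', encoding='utf-8') as fp:
    word_index = json.load(fp)
print('load tc_word_index.json cost: {:.3f}'.format(time.time() - t6))

tc_index = AnnoyIndex(200, metric='angular')
tc_index.load('tc_index_build10.index')
print('load tc_index_build10.index cost: {:.3f}'.format(time.time() - t6))

# Reverse mapping: id ==> word.
t7 = time.time()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
print('get reverse_word_index cost: {:.3f}'.format(time.time() - t7))

# get_nns_by_item returns the 10 nearest items to the query word as a list of ids.
t8 = time.time()
for item in tc_index.get_nns_by_item(word_index[u'优惠'], 10):
    print(reverse_word_index[item])  # look up the word for each id
print('query cost: {:.3f}'.format(time.time() - t8))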

Dependencies used:

gensim==4.1.2

annoy==1.17.0

tqdm==4.62.3

The code above reads 8.82 million vectors from the txt file and, after processing, produces two result files:
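Annoy identifies items only by integer id, which is why the word-to-id mapping has to be saved in a separate JSON file next to the index. The following is a minimal standard-library sketch of that mapping and its reverse, assuming a three-word toy vocabulary and a temporary file name (both hypothetical, not from the original data):

```python
import json
import os
import tempfile
from collections import OrderedDict

# Hypothetical toy vocabulary standing in for the 8.82 million real words.
toy_vocab = ["优惠", "折扣", "促销"]

# word -> integer id, in file order (the same idea as word_index in the article).
word_index = OrderedDict((word, i) for i, word in enumerate(toy_vocab))

# Save the mapping, then load it back as the query side would.
path = os.path.join(tempfile.mkdtemp(), "toy_word_index.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(word_index, fp, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as fp:
    loaded = json.load(fp)

# id -> word, needed to turn the ids returned by the index back into words.
reverse_word_index = {idx: word for word, idx in loaded.items()}
print(reverse_word_index[loaded["折扣"]])
```

The ids survive the JSON round trip as integers because they are stored as values, not keys; JSON object keys are always strings.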

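tc_word_index.json
tc_index_build10.index

The class below uses the two result files generated above to build a utility class for producing similar words. After adjusting the file directories below, it should be usable as is.

import json
import logging
import time

from annoy import AnnoyIndex

# The logger import is garbled in the source; the standard logging module is assumed here.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TencentAIChiEmbedding(object):
    def __init__(self, word_index_path, tc_index_path):
        self._word_index = self.load_word_index(word_index_path)
        self._tc_index = self.load_tc_index(tc_index_path)
        self._reverse_word_index = self.gen_reverse_word_index()

    def load_word_index(self, word_index_path):
        word_index = None
        try:
            st = time.time()
            with open(word_index_path, 'r', encoding='utf-8') as fp:
                word_index = json.load(fp)
            logger.info('load {} cost: {:.3f}'.format(word_index_path, time.time() - st))
        except Exception as e:
            logger.error('load_word_index error:')
            logger.exception(e)
        return word_index

    def load_tc_index(self, tc_index_path):
        tc_index = None
        try:
            st = time.time()
            tc_index = AnnoyIndex(200, metric='angular')
            tc_index.load(tc_index_path)
            logger.info('load {} cost: {:.3f}'.format(tc_index_path, time.time() - st))
        except Exception as e:
            logger.error('load_tc_index error:')
            logger.exception(e)
        return tc_index

    def gen_reverse_word_index(self):
        reverse_word_index = None
        try:
            st = time.time()
            # Reverse mapping: id ==> word.
            reverse_word_index = dict([(value, key) for (key, value) in self._word_index.items()])
            logger.info('get reverse_word_index cost: {:.3f}'.format(time.time() - st))
        except Exception as e:
            logger.error('gen_reverse_word_index error:')
            logger.exception(e)
        return reverse_word_index

    def get_simi_words(self, keyword, topn=10):
        '''Query Annoy for the topn vectors nearest to keyword.

        get_nns_by_item returns a list of ids, which are mapped back to words here.
        '''
        simi_words = []
        try:
            st = time.time()
            for item in self._tc_index.get_nns_by_item(self._word_index[keyword], topn):
                simi_words.append(self._reverse_word_index[item])
            logger.info('get_simi_words cost: {:.3f}, keyword: {}, simi_words: {}'.format(
                time.time() - st, keyword, simi_words))
        except Exception as e:
            logger.error('get_simi_words error:')
            logger.exception(e)
        return simi_words


logger.info('Initializing tencent word vec...')
st = time.time()
# Adjust these two paths to wherever the result files were saved.
tencent_embedding = TencentAIChiEmbedding(
    './data/tencent_ailab_chinese_embedding/tc_word_index.json',
    './data/tencent_ailab_chinese_embedding/tc_index_build10.index')
logger.info('Initialized tencent word vec, cost: {:.3f}'.format(time.time() - st))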
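Annoy returns approximate neighbours, so it can help to compare its output against an exact search on a small sample. The sketch below is a brute-force reference, not part of the original article: it ranks every word by Annoy's documented angular distance, sqrt(2 * (1 - cos(u, v))), using only the standard library. The function names and the toy vectors are hypothetical.

```python
import math


def angular_distance(u, v):
    """Angular distance as Annoy defines it: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against floating-point overshoot of cos above 1.0.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def brute_force_simi_words(keyword, vectors, topn=10):
    """Exact nearest words by angular distance; the keyword itself ranks first."""
    query = vectors[keyword]
    ranked = sorted(vectors, key=lambda w: angular_distance(query, vectors[w]))
    return ranked[:topn]


# Hypothetical 3-dimensional toy vectors (the real ones are 200-dimensional).
toy_vectors = {
    "优惠": [1.0, 0.0, 0.0],
    "折扣": [0.9, 0.1, 0.0],
    "促销": [0.7, 0.3, 0.0],
    "天气": [0.0, 0.0, 1.0],
}

result = brute_force_simi_words("优惠", toy_vectors, topn=3)
print(result)
```

An exact scan like this is linear in the vocabulary size per query, so it is only practical as a spot check on a few thousand words; that cost is the reason for building the Annoy index in the first place.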
