
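LTP (Language Technology Platform) is an open-source Chinese language processing system from Harbin Institute of Technology. It covers the basic tasks: word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic role labeling, and semantic dependency analysis.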

This post is part of the series "An Analysis of Open-Source Chinese Word Segmentation Tools".

1. Preface

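Like THULAC, LTP is based on the structured perceptron (SP). It models the score function of a label sequence $Y$ given an input sequence $X$ under the maximum entropy criterion:

$$S(Y, X) = \sum_s \alpha_s \Phi_s(Y, X)$$

where $\Phi_s(Y, X)$ is a local feature function and $\alpha_s$ is its weight. Chinese word segmentation is then equivalent to finding, for a given sequence $X$, the sequence $Y$ that maximizes the score function:

$$\mathop{\arg\max}_Y S(Y, X)$$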
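To make the score function concrete, here is a minimal Java sketch. It assumes features are plain strings and weights live in a `HashMap`; the class `LinearScore` and its methods are hypothetical names for illustration and are not part of LTP. It sums the weights of the features that fire for a candidate and picks the best candidate by brute force, whereas a real segmenter searches the label space with Viterbi decoding.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a linear score over sparse binary features. */
final class LinearScore {
    // Maps a feature string to its weight alpha_s.
    private final Map<String, Double> weights = new HashMap<>();

    void setWeight(String feature, double alpha) {
        weights.put(feature, alpha);
    }

    /** S(Y, X): sum of alpha_s over the features that fire for (Y, X). */
    double score(List<String> firedFeatures) {
        double s = 0.0;
        for (String f : firedFeatures) {
            s += weights.getOrDefault(f, 0.0); // unseen features contribute 0
        }
        return s;
    }

    /** Brute-force argmax over explicit candidates; returns the best index, or -1 if empty. */
    int argmax(List<List<String>> candidates) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < candidates.size(); i++) {
            double s = score(candidates.get(i));
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }
}
```

Each candidate here is simply the list of features that fire for one labeling of the input; enumerating candidates is only feasible for toy inputs.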

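2. Breakdown

The source code analysis below is based on version 3.4.0.

Segmentation pipeline

The segmentation pipeline is the same as in other segmenters: extract character features, compute the scores from the feature weights, and then run Viterbi decoding. For the code, see __ltp_dll_segmentor_wrapper::segment():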

    int segment(const char *str, std::vector<std::string> &words) {
      ltp::framework::ViterbiFeatureContext ctx;
      ltp::framework::ViterbiScoreMatrix scm;
      ltp::framework::ViterbiDecoder decoder;
      ltp::segmentor::Instance inst;

      int ret = preprocessor.preprocess(str, inst.raw_forms, inst.forms,
        inst.chartypes);

      if (-1 == ret || 0 == ret) {
        words.clear();
        return 0;
      }

      ltp::segmentor::SegmentationConstrain con;
      con.regist(&(inst.chartypes));
      build_lexicon_match_state(lexicons, &inst);
      extract_features(inst, model, &ctx, false);
      calculate_scores(inst, (*model), ctx, true, &scm);

      // allocate a new decoder so that the segmentor support multithreaded
      // decoding. this modification was committed by niuox
      decoder.decode(scm, con, inst.predict_tagsidx);
      build_words(inst.raw_forms, inst.predict_tagsidx, words);

      return words.size();
    }
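The decoding step at the end of this pipeline can be illustrated with a generic first-order Viterbi search. The Java sketch below is not LTP's `ViterbiDecoder`: the class name `ViterbiSketch`, the `emit` and `trans` matrices, and the absence of segmentation constraints are all simplifying assumptions made for illustration.

```java
/** Illustrative only: first-order Viterbi over a score matrix. */
final class ViterbiSketch {
    /**
     * @param emit  emit[i][t] is the score of tag t at position i
     * @param trans trans[p][t] is the score of moving from tag p to tag t
     * @return the highest-scoring tag index for each position
     */
    static int[] decode(double[][] emit, double[][] trans) {
        int n = emit.length;
        if (n == 0) {
            return new int[0];
        }
        int numTags = emit[0].length;
        double[][] best = new double[n][numTags]; // best path score ending in tag t at i
        int[][] back = new int[n][numTags];       // back-pointer to the previous tag

        for (int t = 0; t < numTags; t++) {
            best[0][t] = emit[0][t];
        }
        for (int i = 1; i < n; i++) {
            for (int t = 0; t < numTags; t++) {
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < numTags; p++) {
                    double s = best[i - 1][p] + trans[p][t];
                    if (s > bestScore) {
                        bestScore = s;
                        bestPrev = p;
                    }
                }
                best[i][t] = bestScore + emit[i][t];
                back[i][t] = bestPrev;
            }
        }

        // Pick the best final tag, then follow the back-pointers.
        int last = 0;
        for (int t = 1; t < numTags; t++) {
            if (best[n - 1][t] > best[n - 1][last]) {
                last = t;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int i = n - 1; i > 0; i--) {
            path[i - 1] = back[i][path[i]];
        }
        return path;
    }
}
```

The search costs O(n * T^2) for n characters and T tags, which is why decoding is cheap compared with feature extraction when the tag set is small.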

Trained model

The model file cws.model contains the labels, features, weights, and an internal lexicon. I rewrote the model parsing in Java; the code is as follows.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.IntBuffer;

    DataInputStream is = new DataInputStream(new FileInputStream(path));
    char[] octws = readCharArray(is, 128);
    // 1. read labels
    SmartMap label = readSmartMap(is);
    int[] entries = readIntArray(is, label.numEntries);
    // 2. read feature space
    char[] space = readCharArray(is, 16);
    int offset = readInt(is);
    int sz = readInt(is);
    SmartMap[] dicts = new SmartMap[sz];
    for (int i = 0; i < sz; i++) {
        dicts[i] = readSmartMap(is);
    }
    // 3. read parameters
    char[] param = readCharArray(is, 16);
    int dim = readInt(is);
    double[] w = readDoubleArray(is, dim);
    double[] wSum = readDoubleArray(is, dim);
    int lastTimestamp = readInt(is);
    // 4. read internal lexicon
    SmartMap internalLexicon = readSmartMap(is);

    // read char array (one byte per char)
    private static char[] readCharArray(DataInputStream is, int length) throws IOException {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) is.read();
        }
        return chars;
    }

    // read int array (little-endian, 4 bytes per int)
    private static int[] readIntArray(DataInputStream is, int length) throws IOException {
        byte[] bytes = new byte[4 * length];
        // readFully blocks until the buffer is filled; read(bytes) may return early
        is.readFully(bytes);
        IntBuffer intBuffer = ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asIntBuffer();
        int[] array = new int[length];
        intBuffer.get(array);
        return array;
    }
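The parsing code calls `readInt` and `readDoubleArray` without showing them. A possible sketch follows, assuming the same little-endian layout that `readIntArray` uses and 8-byte IEEE 754 doubles; both are assumptions about the file format rather than something stated above, and the wrapper class `LittleEndianReaders` is a hypothetical name. `readSmartMap` is left out because its on-disk layout is not described here.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: little-endian readers in the style of readIntArray. */
final class LittleEndianReaders {
    /** Reads one 4-byte little-endian int (DataInputStream.readInt is big-endian). */
    static int readInt(DataInputStream is) throws IOException {
        byte[] bytes = new byte[4];
        is.readFully(bytes);
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Reads `length` 8-byte little-endian doubles (assumed layout). */
    static double[] readDoubleArray(DataInputStream is, int length) throws IOException {
        byte[] bytes = new byte[8 * length];
        is.readFully(bytes);
        double[] array = new double[length];
        ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asDoubleBuffer()
            .get(array);
        return array;
    }
}
```

The manual byte-order handling is needed because `DataInputStream` itself always decodes multi-byte values as big-endian.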

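LTP uses 15 feature templates in total, so sz is 15. Features are stored in a map that LTP calls SmartMap; judging from the code, it is essentially a hash map. Benchmarks of segmentation tools show that LTP segments more slowly than THULAC. The underlying reason is that THULAC represents its model with a double-array trie, which gives faster feature lookup than LTP's map.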

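Features

The features LTP uses fall roughly into the following categories:

Unigram character features: ch[-2], ch[-1], ch[0], ch[1], ch[2]

Bigram character features: ch[-2]ch[-1], ch[-1]ch[0], ch[0]ch[1], ch[1]ch[2]

Character type features: ct[-1], ct[0], ct[1]

Lexicon features: whether ch[0] is the first, a middle, or the last character of a lexicon word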
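As a rough illustration of the first two categories, the Java sketch below builds the unigram and bigram character features for one position. The class `CharFeatureSketch`, the feature-string format, and the `<BOS>`/`<EOS>` padding markers are hypothetical choices for this example; character type and lexicon features are omitted because their encodings are not described at this point.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: unigram and bigram character features for one position. */
final class CharFeatureSketch {
    static final String BOS = "<BOS>"; // hypothetical padding before the sentence
    static final String EOS = "<EOS>"; // hypothetical padding after the sentence

    /** Returns the character at index i, or a padding marker when out of range. */
    private static String ch(String[] chars, int i) {
        if (i < 0) {
            return BOS;
        }
        if (i >= chars.length) {
            return EOS;
        }
        return chars[i];
    }

    static List<String> extract(String[] chars, int idx) {
        List<String> feats = new ArrayList<>();
        // Unigrams: ch[-2] .. ch[2]
        for (int off = -2; off <= 2; off++) {
            feats.add("u" + off + "=" + ch(chars, idx + off));
        }
        // Bigrams: ch[-2]ch[-1] .. ch[1]ch[2]
        for (int off = -2; off <= 1; off++) {
            feats.add("b" + off + "=" + ch(chars, idx + off) + "-" + ch(chars, idx + off + 1));
        }
        return feats;
    }
}
```

Each call yields nine strings (five unigrams and four bigrams) for the window of five characters centered on `idx`.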

For the source, see extractor.cpp:

    Extractor::Extractor() {
      // delimit feature templates
      templates.push_back(new Template("1={c-2}"));
      templates.push_back(new Template("2={c-1}"));
      templates.push_back(new Template("3={c-0}"));
      templates.push_back(new Template("4={c+1}"));
      templates.push_back(new Template("5={c+2}"));
      templates.push_back(new Template("6={c-2}-{c-1}"));
      templates.push_back(new Template("7={c-1}-{c-0}"));
      templates.push_back(new Template("8={c-0}-{c+1}"));
      templates.push_back(new Template("9={c+1}-{c+2}"));
      templates.push_back(new Template("14={ct-1}"));
      templates.push_back(new Template("15={ct-0}"));
      templates.push_back(new Template("16={ct+1}"));
      templates.push_back(new Template("17={lex1}"));
      templates.push_back(new Template("18={lex2}"));
      templates.push_back(new Template("19={lex3}"));
    }

    #define TYPE(x) (strutils::to_str(inst.chartypes[(x)]&0x07))
      data.set("c-2", (idx - 2 < 0 ? BOS : inst.forms[idx - 2]));
      data.set("c-1", (idx - 1 < 0 ? BOS : inst.forms[idx - 1]));
      data.set("c-0", inst.forms[idx]);
      data.set("c+1", (idx + 1 >= len ? EOS : inst.forms[idx + 1]));
      data.set("c+2", (idx + 2 >= len ? EOS : inst.forms[idx + 2]));
      data.set("ct-1", (idx - 1 < 0 ? BOT : TYPE(idx - 1)));
      data.set("ct-0", TYPE(idx));
      data.set("ct+1", (idx + 1 >= len ? EOT : TYPE(idx + 1)));
      data.set("lex1", strutils::to_str(inst.lexicon_match_state[idx] & 0x0f));
      data.set("lex2", strutils::to_str((inst.lexicon_match_state[idx] >> 4) & 0x0f));
      data.set("lex3", strutils::to_str((inst.lexicon_match_state[idx] >> 8) & 0x0f));
    #undef TYPE
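The last three lines read three 4-bit fields out of a single lexicon_match_state integer. The Java sketch below mirrors only that bit arithmetic; the class `LexiconStateSketch` and the `pack` helper are hypothetical and added for illustration, and the sketch makes no claim about what value LTP stores in each field.

```java
/** Illustrative only: the 4-bit field layout used for lex1, lex2, lex3. */
final class LexiconStateSketch {
    /** Returns {lex1, lex2, lex3}, taken from bits 0-3, 4-7, and 8-11. */
    static int[] unpack(int state) {
        return new int[] {
            state & 0x0f,
            (state >> 4) & 0x0f,
            (state >> 8) & 0x0f
        };
    }

    /** Hypothetical inverse of unpack; each field is truncated to 4 bits. */
    static int pack(int lex1, int lex2, int lex3) {
        return (lex1 & 0x0f) | ((lex2 & 0x0f) << 4) | ((lex3 & 0x0f) << 8);
    }
}
```

Because each field is 4 bits wide, any value stored in it is capped at 15.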