Dictionary Format Analysis of ICTCLAS, the Chinese Lexical Analysis System from the Institute of Computing Technology, Chinese Academy of Sciences
For a while now, my word segmentation module has had no major update. It is not that I don't want to update it; I feel as if I have hit a ceiling and don't know how to extend the segmentation functionality any further. Of course, word segmentation is not the goal in itself; it is only an intermediate step toward letting my program understand natural language. I position the program as an intelligent knowledge question-answering system, which makes understanding what the user types the most critical step. When we learn a language, we first learn how sentences are built and what their components are: subject, predicate, object, attributive, adverbial, complement, and so on. Getting a machine to understand human language should follow roughly the same steps.
Word segmentation is obviously the first step. In my view it has two levels: 1. Splitting a sentence into words according to their meaning (the segmentation used by search engines basically reaches this level); 2. On top of the first level, labeling each word with its part of speech (verb, noun, etc.).
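To make the first level concrete, here is a minimal Python sketch of dictionary-based segmentation using forward maximum matching. This is a common textbook baseline chosen only for illustration; it is not the algorithm used by my module or by ICTCLAS. The function name `segment`, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest
    substring (up to max_len characters) that appears in the lexicon.
    A single character is always accepted as a fallback."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, invented for this example only.
TOY_LEXICON = {"我们", "学习", "语言", "自然", "自然语言"}

print(segment("我们学习自然语言", TOY_LEXICON))
# ['我们', '学习', '自然语言']
```

Even this toy version shows why the dictionary matters so much: the result depends entirely on which words the lexicon contains.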
So far my segmentation module has basically completed only the first of these two parts. The second part is much harder: it first requires a dictionary annotated with parts of speech, and then a good algorithm to tag the words.
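As a rough illustration of what the second level needs, the Python sketch below pairs a part-of-speech dictionary with the simplest possible tagging rule: pick the tag seen most often for each word. Real taggers use context; this baseline is only an assumption for illustration and is not how ICTCLAS tags words. The lexicon contents, the counts, the tag letters (`r`, `v`, `n`) and the `unknown_tag` value are all invented for the example.

```python
# Hypothetical toy lexicon: word -> {tag: count}. All entries are invented.
TOY_POS_LEXICON = {
    "我们": {"r": 10},          # pronoun
    "学习": {"v": 8, "n": 2},   # mostly a verb, sometimes a noun
    "语言": {"n": 9},           # noun
}


def tag(words, pos_lexicon, unknown_tag="x"):
    """Label each word with its most frequent tag in pos_lexicon.
    Words missing from the lexicon get unknown_tag."""
    tagged = []
    for word in words:
        counts = pos_lexicon.get(word)
        if counts:
            # Choose the tag with the highest count for this word.
            best = max(counts, key=counts.get)
        else:
            best = unknown_tag
        tagged.append((word, best))
    return tagged


print(tag(["我们", "学习", "语言", "ICTCLAS"], TOY_POS_LEXICON))
# [('我们', 'r'), ('学习', 'v'), ('语言', 'n'), ('ICTCLAS', 'x')]
```

The word "学习" shows the weakness of this rule: it is always tagged as a verb, even in sentences where it acts as a noun, which is why a good tagging algorithm is needed on top of the dictionary.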
This is probably why my module has not been able to improve. So I plan to study the Chinese lexical analysis system ICTCLAS and first look at how others have implemented it.
ICTCLAS is a practical Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. Its functions are: Chinese word segmentation; part-of-speech tagging; recognition of unregistered (out-of-dictionary) words. Details can be seen here. Since the source code is provided, ICTCLAS is a good starting point. (ICTCLAS currently provides dynamic link libraries for Windows and Linux, but there are no Java or C# versions. I think that once this series of articles is finished, it should be possible to implement C# and Java versions.) Of course, good things have shortcomings too. In my view, the biggest shortcoming of ICTCLAS is that it has no documentation. It is just like JBoss: the code is free, but there is no documentation, so many people have to pay a service fee or buy the documents. This is also a business model, one that gives the author some compensation.
The two keys to word segmentation are a good dictionary and a good segmentation algorithm. ICTCLAS is without doubt strong in both. This article focuses on the format of the dictionary used by ICTCLAS. The dictionaries used by ICTCLAS are the files ending in .dct. I have implemented ICTCLAS4J, which can be imported directly into Eclipse and run. An example of the dictionary is available here.
That's all for today. I have never written an article about a file format before, and I don't know how to describe it.... Tian Chunfeng 20041223