ICTCLAS, the Chinese lexical analysis system from the Institute of Computing Technology: dictionary format analysis (dictionary format description)

xiaoxiao2021-03-06  45

My earlier post on parsing the ICTCLAS dictionary format gave a brief introduction to ICTCLAS. I had wanted to write up the dictionary format at that point, but I did not know how to describe it. Now that I have finished a first Java version of the code, my thinking is clearer, and the file format can be described as follows:

First, a description in words. An ICTCLAS dictionary file is made up of segments that all share the same structure (by analogy, an English dictionary can be divided into 26 segments according to the first letter of each word). Each segment is in turn made up of sections that all share the same structure. A section is the smallest unit and describes a single word. The information a section holds about its word is: the length of the word, the word itself, the word's frequency of use, and the word's handle. The figure below describes the structure of a segment:

The figure below shows how ICTCLAS describes this format in C:

From the perspective of the program:

The number of sections in a segment is given by the first 4 bytes of that segment (the four bytes form one int). The length of a section equals a fixed 12 bytes plus the length of the word. Of those 12 bytes, the first 4 are the word frequency, the middle 4 are the length of the word, and the last 4 are the word's handle. The bytes of the word itself follow.
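The layout above can be sketched as a small reader. This is only an illustration in Python, not the Java code mentioned earlier. It assumes that the integers are 32-bit little-endian, that the word length counts bytes, and that segments follow one another until the data ends. None of these three points is stated above. The names `parse_segment` and `parse_segments` are made up for this sketch, and the word is kept as raw bytes because the character encoding is not described here.

```python
import struct

# Assumption: 32-bit little-endian integers.
COUNT = struct.Struct("<i")    # number of sections in a segment
HEADER = struct.Struct("<3i")  # frequency, word length, handle (12 bytes)


def parse_segment(data, offset=0):
    """Parse one segment starting at offset; return (sections, next_offset)."""
    (count,) = COUNT.unpack_from(data, offset)
    offset += COUNT.size
    sections = []
    for _ in range(count):
        frequency, word_len, handle = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        # The word bytes follow the 12-byte header.
        word = data[offset:offset + word_len]
        if len(word) != word_len:
            raise ValueError("truncated section")
        offset += word_len
        sections.append((frequency, word, handle))
    return sections, offset


def parse_segments(data):
    """Parse consecutive segments until the data is used up (an assumption)."""
    segments = []
    offset = 0
    while offset < len(data):
        sections, offset = parse_segment(data, offset)
        segments.append(sections)
    return segments


# Hand-built sample: one segment with two sections, then an empty segment.
sample = (
    COUNT.pack(2)
    + HEADER.pack(10, 2, 1) + b"ab"
    + HEADER.pack(5, 3, 2) + b"xyz"
    + COUNT.pack(0)
)
print(parse_segments(sample))
```

In the sample, the first section occupies 12 + 2 = 14 bytes and the second 12 + 3 = 15 bytes, which matches the rule that a section is 12 bytes plus the length of the word.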

I hope this is a good starting point for anyone interested in studying this project. Related download: ICTCLAS dictionary format analysis

When reposting, please credit the original address: https://www.9cbs.com/read-73754.html
