ICTCLAS Word Segmentation System Research (2) - Dictionary Structure

xiaoxiao2021-03-19  201

The dictionary structure of ICTCLAS is an important basis for understanding its word segmentation. A well-designed data structure gives fast access to the dictionary, which in turn makes fast segmentation possible.

Reading and analyzing the source code shows that, when the program starts, it first loads the dictionary into memory to speed up access. The source code loads the dictionary and the lexical rule library in consult.cpp, as the following code shows:

CResult::CResult()
{
    ...
    m_dictCore.Load("Data//CoreDict.dct");
    m_POSTagger.LoadContext("Data//Lexical.ctx");
    ......
}

Let's step into the Load method to see how it reads the dictionary data. Here is the source code of Load:

bool CDictionary::Load(char *sFilename, bool bReset)
{
    FILE *fp;
    int i, j, nBuffer[3];

    // First check whether the dictionary file can be opened
    if ((fp = fopen(sFilename, "rb")) == NULL)
        return false; // Fail while opening the file

    // Release the memory in use to make room for the new file
    for (i = 0; i < CC_NUM; i++) ......

    // CC_NUM: 6768, i.e. the 6763 common Chinese characters of the GB2312 encoding plus 5 empty code positions
    for (i = 0; i < CC_NUM; i++)
    {
        // Read one integer (the number of words in this block)
        fread(&(m_IndexTable[i].nCount), sizeof(int), 1, fp);
        if (m_IndexTable[i].nCount > 0)
            m_IndexTable[i].pWordItemHead = new WORD_ITEM[m_IndexTable[i].nCount];
        else
        {
            m_IndexTable[i].pWordItemHead = 0;
            continue;
        }
        j = 0;

        // Read the word entries of this block in a loop, as many as the count just read
        while (j < m_IndexTable[i].nCount)
        {
            // Read three integers: frequency / word content length / handle
            fread(nBuffer, sizeof(int), 3, fp);
            m_IndexTable[i].pWordItemHead[j].sWord = new char[nBuffer[1] + 1];

            // Read the word content
            if (nBuffer[1]) // String length is more than 0
            {
                fread(m_IndexTable[i].pWordItemHead[j].sWord, sizeof(char), nBuffer[1], fp);
            }
            m_IndexTable[i].pWordItemHead[j].sWord[nBuffer[1]] = 0;
            if (bReset) // Reset the frequency
                m_IndexTable[i].pWordItemHead[j].nFrequency = 0;
            else
                m_IndexTable[i].pWordItemHead[j].nFrequency = nBuffer[0];
            m_IndexTable[i].pWordItemHead[j].nWordLen = nBuffer[1];
            m_IndexTable[i].pWordItemHead[j].nHandle = nBuffer[2];
            j += 1; // Get next item in the original table.
        }
    }
    fclose(fp);
    return true;
}

After reading the source code above, the structure of the dictionary should be basically clear, as shown below:

Figure 1
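The block layout that Load reads can be tried out without the real dictionary file. The following is a minimal sketch under stated assumptions, not the ICTCLAS code: the names Entry, readBlock and writeBlock are invented for illustration, and it assumes 32-bit integers stored in the host's byte order, which is what reading an int array with fread implies on typical platforms. It reads one block: a count, then for each entry three integers (frequency, word length, handle) followed by the word bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of one dictionary entry (illustrative names).
struct Entry {
    int32_t frequency;
    int32_t handle;
    std::string word;  // word content as stored in the block
};

// Reads one block: a count, then `count` records made of
// (frequency, length, handle) followed by `length` bytes of word content.
// Assumes 32-bit integers in the host's byte order.
inline bool readBlock(std::istream &in, std::vector<Entry> &out) {
    int32_t count = 0;
    if (!in.read(reinterpret_cast<char *>(&count), sizeof count) || count < 0)
        return false;
    out.clear();
    for (int32_t j = 0; j < count; ++j) {
        int32_t header[3];  // frequency, word length, handle
        if (!in.read(reinterpret_cast<char *>(header), sizeof header) || header[1] < 0)
            return false;
        Entry e{header[0], header[2],
                std::string(static_cast<std::size_t>(header[1]), '\0')};
        if (header[1] > 0 && !in.read(&e.word[0], header[1]))
            return false;
        out.push_back(e);
    }
    return true;
}

// Writes one block in the same layout, so the reader can be checked on a round trip.
inline void writeBlock(std::ostream &os, const std::vector<Entry> &entries) {
    int32_t count = static_cast<int32_t>(entries.size());
    os.write(reinterpret_cast<const char *>(&count), sizeof count);
    for (const Entry &e : entries) {
        int32_t header[3] = {e.frequency, static_cast<int32_t>(e.word.size()), e.handle};
        os.write(reinterpret_cast<const char *>(header), sizeof header);
        os.write(e.word.data(), static_cast<std::streamsize>(e.word.size()));
    }
}
```

Under these assumptions, calling readBlock once per block (6768 times in a row) on the opened file would walk the whole dictionary.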

The data structure of the modification table is almost the same, except that the word count is followed by an nDelete field, which holds the number of deleted entries. The data structure is shown in Figure 2:

Figure 2

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. For the Chinese characters, the high byte of the internal code ranges from B0 to F7 and the low byte from A1 to FE, which occupies 72 * 94 = 6768 code positions; 5 of them, D7FA-D7FE, are empty. The 6768 blocks in the dictionary correspond to these 6768 code positions of the GB2312 encoding. Each block in Figure 1 holds all the words that begin with the corresponding character. In the listing below, the character in parentheses is the Chinese character the block corresponds to; it is not actually stored in the dictionary and is shown only for explanation:

Block 6759
  Count: 5
  WordLen: 2  Frequency: 0  Handle: 24832  Word: () Light
  WordLen: 2  Frequency: 1  Handle: 24942  Word: () Light
  WordLen: 2  Frequency: 3  Handle: 31232  Word: ()
  WordLen: 6  Frequency: 0  Handle: 27648  Word: () 神 神
  WordLen: 6  Frequency: 0  Handle: 26880  Word: () Rust

Block 6760
  Count: 1
  WordLen: 2  Frequency: 0  Handle: 28160  Word: (鼢) mouse

Block 6761
  Count: 2
  WordLen: 4  Frequency: 0  Handle: 28160  Word: () Rats
  WordLen: 2  Frequency: 0  Handle: 28160  Word: ()
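The block numbers in the listing are consistent with numbering the 72 * 94 code positions row by row. The following sketch shows that mapping; the names kBlockCount and blockIndex are invented here, and the row-by-row numbering is an assumption inferred from the listing rather than taken from the quoted source. For example, 鼢 has the GB2312 code F7F7, and (0xF7 - 0xB0) * 94 + (0xF7 - 0xA1) = 6760, which matches the block shown for it.

```cpp
#include <cassert>

// 72 high bytes (0xB0..0xF7) times 94 low bytes (0xA1..0xFE).
constexpr int kBlockCount = 72 * 94;  // 6768

// Returns the block index for a two-byte GB2312 code, or -1 when the bytes
// fall outside the Chinese-character area (B0A1..F7FE).
constexpr int blockIndex(unsigned char hi, unsigned char lo) {
    return (hi < 0xB0 || hi > 0xF7 || lo < 0xA1 || lo > 0xFE)
               ? -1
               : (hi - 0xB0) * 94 + (lo - 0xA1);
}
```

With this numbering the five empty positions D7FA-D7FE fall on indices 3755 to 3759, directly after the 3755 first-level characters.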

Next, let's analyze how the source code saves the dictionary after it has been modified:

bool CDictionary::Save(char *sFilename)
{
    FILE *fp;
    int i, j, nCount, nBuffer[3];
    PWORD_CHAIN pCur;
    if ((fp = fopen(sFilename, "wb")) == NULL)
        return false; // Fail while opening the file

    // Traverse the 6768 data blocks shown in Figure 1
    for (i = 0; i < CC_NUM; i++)
    {
        ......
            nCount = m_IndexTable[i].nCount + m_pModifyTable[i].nCount - m_pModifyTable[i].nDelete;
            fwrite(&nCount, sizeof(int), 1, fp);
            pCur = m_pModifyTable[i].pWordItemHead;
            j = 0;

            // Traverse the words of the original table and of the modification table together,
            // merging the added words into the original table
            while (pCur != NULL && j < m_IndexTable[i].nCount)
            {
                // If the word in the modification table sorts before the word in the original table,
                // or the two words are equal and its nHandle value is smaller,
                // write the modification-table entry to the dictionary file
                if (strcmp(pCur->data.sWord, m_IndexTable[i].pWordItemHead[j].sWord) < 0 ||
                    (strcmp(pCur->data.sWord, m_IndexTable[i].pWordItemHead[j].sWord) == 0 &&
                     pCur->data.nHandle < m_IndexTable[i].pWordItemHead[j].nHandle))
                {
                    nBuffer[0] = pCur->data.nFrequency;
                    nBuffer[1] = pCur->data.nWordLen;
                    nBuffer[2] = pCur->data.nHandle;
                    fwrite(nBuffer, sizeof(int), 3, fp);
                    if (nBuffer[1]) // String length is more than 0
                        fwrite(pCur->data.sWord, sizeof(char), nBuffer[1], fp);
                    pCur = pCur->next; // Get next item in the modify table.
                }
                // A frequency (nFrequency) of -1 means the word has been deleted, so skip it
                else if (m_IndexTable[i].pWordItemHead[j].nFrequency == -1)
                {
                    j += 1;
                }
                // If the word in the modification table sorts after the word in the original table,
                // or the two words are equal and its nHandle value is larger,
                // write the original-table entry to the dictionary file
                else if (strcmp(pCur->data.sWord, m_IndexTable[i].pWordItemHead[j].sWord) > 0 ||
                         (strcmp(pCur->data.sWord, m_IndexTable[i].pWordItemHead[j].sWord) == 0 &&
                          pCur->data.nHandle > m_IndexTable[i].pWordItemHead[j].nHandle))
                {
                    // Output the index table data to the file
                    nBuffer[0] = m_IndexTable[i].pWordItemHead[j].nFrequency;
                    nBuffer[1] = m_IndexTable[i].pWordItemHead[j].nWordLen;
                    nBuffer[2] = m_IndexTable[i].pWordItemHead[j].nHandle;
                    fwrite(nBuffer, sizeof(int), 3, fp);
                    if (nBuffer[1]) // String length is more than 0
                        fwrite(m_IndexTable[i].pWordItemHead[j].sWord, sizeof(char), nBuffer[1], fp);
                    j += 1; // Get next item in the original table.
                }
            }

            // Write the remaining entries of the original table to the dictionary file
            if (j ......
                nBuffer[0] = pCur->data.nFrequency;
                nBuffer[1] = pCur->data.nWordLen;
                nBuffer[2] = pCur->data.nHandle;
                fwrite(nBuffer, sizeof(int), 3, fp);
                if (nBuffer[1]) // String length is more than 0
                    fwrite(pCur->data.sWord, sizeof(char), nBuffer[1], fp);
                pCur = pCur->next; // Get next item in the modify table.
            }

        // No modification has been made, so write all the data of the original table to the dictionary file
        else
        {
            fwrite(&m_IndexTable[i].nCount, sizeof(int), 1, fp); // Write to the file
            j = 0;
            while (j ......
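The core of Save is a merge of two sequences that are both sorted by word and then by handle. The following sketch shows that idea in isolation, under simplifying assumptions: Item, before and mergeForSave are names invented here, both inputs live in vectors rather than an array plus a linked list, and when the same word and handle occur on both sides the original entry is taken first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified entry; frequency == -1 marks a deleted original entry.
struct Item {
    std::string word;
    int handle;
    int frequency;
};

// True when a sorts strictly before b by (word, handle).
inline bool before(const Item &a, const Item &b) {
    return a.word < b.word || (a.word == b.word && a.handle < b.handle);
}

// Merges the sorted original entries with the sorted added entries,
// dropping original entries whose frequency is -1.
inline std::vector<Item> mergeForSave(const std::vector<Item> &original,
                                      const std::vector<Item> &added) {
    std::vector<Item> out;
    std::size_t j = 0, k = 0;
    while (j < original.size() && k < added.size()) {
        if (before(added[k], original[j])) {
            out.push_back(added[k++]);
        } else if (original[j].frequency == -1) {
            ++j;  // deleted entry: skip it
        } else {
            out.push_back(original[j++]);
        }
    }
    // One side is exhausted; flush what is left on the other side.
    for (; j < original.size(); ++j)
        if (original[j].frequency != -1) out.push_back(original[j]);
    for (; k < added.size(); ++k) out.push_back(added[k]);
    return out;
}
```

The size of the result equals the original count plus the added count minus the deleted count, which is the same quantity Save writes as nCount at the start of each block.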

bool CDictionary::AddItem(char *sWord, int nHandle, int nFrequency)
{
    char sWordAdd[WORD_MAXLENGTH - 2];
    int nPos, nFoundPos;
    PWORD_CHAIN pRet, pTemp, pNext;
    int i = 0;

    // Preprocess the word: strip its leading and trailing spaces
    if (!PreProcessing(sWord, &nPos, sWordAdd, true))
        return false;

    // Check whether the word already exists in the original table of the dictionary
    if (FindInOriginalTable(nPos, sWordAdd, nHandle, &nFoundPos))
    {
        // The word exists in the original table, so add the frequency
        // Operation in the index table and its items
        if (m_IndexTable[nPos].pWordItemHead[nFoundPos].nFrequency == -1)
        {
            // The word item has been removed
            m_IndexTable[nPos].pWordItemHead[nFoundPos].nFrequency = nFrequency;
            if (!m_pModifyTable) // Not prepare the buffer
            {
                m_pModifyTable = new MODIFY_TABLE[CC_NUM];
                memset(m_pModifyTable, 0, CC_NUM * sizeof(MODIFY_TABLE));
            }
            m_pModifyTable[nPos].nDelete -= 1;
        }
        else
            m_IndexTable[nPos].pWordItemHead[nFoundPos].nFrequency += nFrequency;
        return true;
    }

    // If the modification table is empty, allocate space for it
    if (!m_pModifyTable) // Not prepare the buffer
    {
        m_pModifyTable = new MODIFY_TABLE[CC_NUM];
        memset(m_pModifyTable, 0, CC_NUM * sizeof(MODIFY_TABLE));
    }

    // Check whether the word exists in the modification table; if it does, increase its frequency
    if (FindInModifyTable(nPos, sWordAdd, nHandle, &pRet))
    {
        if (pRet != NULL)
            pRet = pRet->next;
        else
            pRet = m_pModifyTable[nPos].pWordItemHead;
        pRet->data.nFrequency += nFrequency;
        return true;
    }

    // If the word is not found in the modification table, add it
    pTemp = new WORD_CHAIN; // Allocate the word chain node
    if (pTemp == NULL) // Allocate memory failure
        return false;
    memset(pTemp, 0, sizeof(WORD_CHAIN)); // Init it with 0
    pTemp->data.nHandle = nHandle; // Store the handle
    pTemp->data.nWordLen = strlen(sWordAdd);
    pTemp->data.sWord = new char[1 + pTemp->data.nWordLen];
    strcpy(pTemp->data.sWord, sWordAdd);
    pTemp->data.nFrequency = nFrequency;
    pTemp->next = NULL;

    // Insert
    if (pRet != NULL)
    {
        pNext = pRet->next; // Get the next item
        pRet->next = pTemp; // Link the node to the chain
    }
    else
    {
        pNext = m_pModifyTable[nPos].pWordItemHead;
        m_pModifyTable[nPos].pWordItemHead = pTemp; // Set the pAdd as the head node
    }
    pTemp->next = pNext; // Very important!!!! or else it will lose some node
    m_pModifyTable[nPos].nCount++; // The number increases by one
    return true;
}
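The modification table keeps each block's added words in a singly linked chain sorted by word and handle, and AddItem links a new node in after its predecessor. The following sketch isolates that pattern. It is a simplified illustration, not the ICTCLAS code: Node, addSorted and clearChain are invented names, std::string replaces the raw char buffers, and the lookup and the insertion are folded into one function.

```cpp
#include <cassert>
#include <string>

// Hypothetical simplified chain node, standing in for WORD_CHAIN.
struct Node {
    std::string word;
    int handle;
    int frequency;
    Node *next;
};

// Adds (word, handle) to a chain kept sorted by (word, handle).
// If the key is already present, its frequency is increased instead.
// Returns true when a new node was linked in.
inline bool addSorted(Node *&head, const std::string &word, int handle, int frequency) {
    Node *prev = nullptr;
    Node *cur = head;
    // Advance while the current node sorts before the key.
    while (cur && (cur->word < word || (cur->word == word && cur->handle < handle))) {
        prev = cur;
        cur = cur->next;
    }
    if (cur && cur->word == word && cur->handle == handle) {
        cur->frequency += frequency;
        return false;
    }
    Node *node = new Node{word, handle, frequency, cur};
    if (prev)
        prev->next = node;  // link after the predecessor
    else
        head = node;        // the new node becomes the head
    return true;
}

// Frees every node of the chain and resets the head to null.
inline void clearChain(Node *&head) {
    while (head) {
        Node *next = head->next;
        delete head;
        head = next;
    }
}
```

Setting the new node's next pointer to the old successor before relinking is the step the source flags as "Very important": without it the tail of the chain would be lost.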

Delete the modified entries:

bool CDictionary::DelModified()
{
    PWORD_CHAIN pTemp, pCur;
    if (!m_pModifyTable)
        return true;

    for (int i = 0; i < CC_NUM; i++)
    {
        ......

        // Delete the nodes on the linked list
        while (pCur != NULL)
        {
            pTemp = pCur;
            pCur = pCur->next;
            delete pTemp->data.sWord; // Note: sWord is allocated with new char[], so delete[] would be the matching form
            delete pTemp;
        }
    }
    delete[] m_pModifyTable;
    m_pModifyTable = NULL;
    return true;
}

Use binary search to look up a word in the original table:

bool CDictionary::FindInOriginalTable(int nInnerCode, char *sWord, int nHandle, int *nPosRet)
{
    PWORD_ITEM pItems = m_IndexTable[nInnerCode].pWordItemHead;
    int nStart = 0, nEnd = m_IndexTable[nInnerCode].nCount - 1, nMid = (nStart + nEnd) / 2, nCount = 0, nCmpValue;
    while (nStart <= nEnd) // Binary search
    {
        nCmpValue = strcmp(pItems[nMid].sWord, sWord);

        // The middle item is exactly the one being looked for
        if (nCmpValue == 0 && (pItems[nMid].nHandle == nHandle || nHandle == -1))
        {
            if (nPosRet)
            {
                if (nHandle == -1) // Not very strict match
                {
                    // Add in 2002-1-28
                    nMid -= 1;

                    // Get the first item which matches the current word
                    while (nMid >= 0 && strcmp(pItems[nMid].sWord, sWord) == 0)
                        nMid--;
                    if (nMid < 0 || strcmp(pItems[nMid].sWord, sWord) != 0)
                        nMid++;
                }
                *nPosRet = nMid;
                return true;
            }
            if (nPosRet)
                *nPosRet = nMid;
            return true; // Find it
        }
        else if (nCmpValue < 0 || (nCmpValue == 0 && pItems[nMid].nHandle < nHandle && nHandle != -1))
        {
            nStart = nMid + 1;
        }
        else if (nCmpValue > 0 || (nCmpValue == 0 && pItems[nMid].nHandle > nHandle && nHandle != -1))
        {
            nEnd = nMid - 1;
        }
        nMid = (nStart + nEnd) / 2;
    }
    if (nPosRet)
    {
        // Get the previous position
        *nPosRet = nMid - 1;
    }
    return false;
}
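The lookup in the original table is a binary search over a compound key, with a handle of -1 acting as a wildcard that selects the first entry for the word. The following sketch restates that logic in a compact form. Record and findRecord are names invented here, and unlike the source it simply returns -1 on a miss instead of reporting a neighbouring position through an output parameter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical simplified entry with only the two key fields.
struct Record {
    std::string word;
    int handle;
};

// Searches a vector sorted by (word, handle).
// handle == -1 matches any handle and yields the first entry with that word.
// Returns the index of the match, or -1 when nothing matches.
inline int findRecord(const std::vector<Record> &items, const std::string &word, int handle) {
    int lo = 0, hi = static_cast<int>(items.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = items[mid].word.compare(word);
        if (cmp == 0 && (handle == -1 || items[mid].handle == handle)) {
            if (handle == -1) {
                // Back up to the first entry that carries this word.
                while (mid > 0 && items[mid - 1].word == word) --mid;
            }
            return mid;
        }
        if (cmp < 0 || (cmp == 0 && items[mid].handle < handle))
            lo = mid + 1;  // the key lies in the upper half
        else
            hi = mid - 1;  // the key lies in the lower half
    }
    return -1;
}
```

Because entries with the same word sit next to each other in handle order, backing up from any match reaches the first of them in a few steps.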

Query in the modification table:

bool CDictionary::FindInModifyTable(int nInnerCode, char *sWord, int nHandle, PWORD_CHAIN *pFindRet)
{
    PWORD_CHAIN pCur, pPre;
    if (m_pModifyTable == NULL) // Empty
        return false;
    pCur = m_pModifyTable[nInnerCode].pWordItemHead;
    pPre = NULL;

    // Advance while the current word sorts before sWord, or the words are equal and the handle is smaller than nHandle
    while (pCur != NULL && (_stricmp(pCur->data.sWord, sWord) < 0 ||
                            (_stricmp(pCur->data.sWord, sWord) == 0 && pCur->data.nHandle < nHandle)))
    {
        pPre = pCur;
        pCur = pCur->next;
    }
    if (pFindRet)
        *pFindRet = pPre;
    if (pCur != NULL && _stricmp(pCur->data.sWord, sWord) == 0 &&
        (pCur->data.nHandle == nHandle || nHandle < 0))
    {
        // The node exists, delete the node and return
        return true;
    }
    return false;
}

Get the type of a word. There are three types in total: Chinese, delimiter, and other:

int CDictionary::GetWordType(char *sWord)
{
    int nType = charType((unsigned char *)sWord), nLen = strlen(sWord);
    if (nLen > 0 && nType == CT_CHINESE && IsAllChinese((unsigned char *)sWord))
        return WT_CHINESE; // Chinese word
    else if (nLen > 0 && nType == CT_DELIMITER)
        return WT_DELIMITER; // Delimiter
    else
        return WT_OTHER; // Other invalid
}

When reposting, please credit the original address: https://www.9cbs.com/read-130190.html
