Quoted at http://blog.liu21st.com/index.php?job=art&articleid=a_20050408_135759
I have recently been working with Oracle's full-text search technology and consulted several articles on it.

Abstract: Full-text search is one of the key technologies of intelligent information management. Oracle Text, a component of Oracle9i, provides powerful full-text search. With Oracle9i as the back-end database, you can make full use of its full-text search technology to build complex, large-scale document management systems. This paper introduces the architecture of Oracle Text and how to use it.

Key words: Oracle Text, full-text retrieval

Oracle has long been working on full-text search technology. By the time Oracle9i Release 2 was released, full-text search in the Oracle database had become mature, and Oracle Text offers powerful text retrieval and intelligent text management capabilities. Oracle Text is the name adopted in Oracle9i; in Oracle8i the product was called Oracle interMedia Text, and in Oracle8 it was the Oracle ConText Cartridge. With Oracle9i and Oracle Text, you can use standard SQL tools to build new applications or extend existing ones easily and effectively. Application developers can use Oracle Text search in any Oracle database application that works with text, ranging from searching text in existing applications to document management systems that handle many document formats and complex search criteria. Oracle Text supports basic full-text search for most of the languages supported by the Oracle database. This article describes how to use Oracle9i's full-text search technology to provide a good solution for such applications.

1 Architecture of Oracle Text

The figure below shows the architecture of Oracle Text.

Figure 1 The architecture of Oracle Text

Based on this architecture, the main logical steps Oracle Text performs when indexing documents are as follows:

(1) The datastore scans all rows of the table being indexed and reads the data in the column.
Typically, this is just the column data, but some datastores use the column data as a pointer to the document data. For example, URL_DATASTORE treats the column data as a URL.

(2) The filter takes the document data and converts it into a text representation. This is needed when you store binary documents (such as Word or Acrobat files). The output of the filter does not have to be plain text; it can be a text format such as XML or HTML.

(3) The sectioner takes the output of the filter and converts it into plain text. Different text formats, including XML and HTML, have different sectioners. Converting to plain text involves detecting the important document sections, removing invisible information, and reformatting the text.

(4) The lexer takes the plain text from the sectioner and splits it into discrete tokens. There are lexers for whitespace-delimited languages as well as specialized lexers for segmenting complex Asian languages.

(5) The index engine takes all the tokens from the lexer, the offsets of the document sections from the sectioner, and a list of low-information words that are not indexed (stopwords), and builds an inverted index. The inverted index stores the tokens and the documents that contain them.

2 A simple example

Here is a brief example of the method and steps for implementing full-text retrieval with Oracle Text; the details are described later. Oracle9i provides Oracle Text Manager to simplify much of the work, and everything that can be done in Oracle Text Manager can also be done through PL/SQL. To use Oracle Text, you must have the CTXAPP role or be the CTXSYS user. Oracle Text provides the CTXSYS user for system administrators and the CTXAPP role for application developers.
The CTXSYS user can perform the following tasks: start the Oracle Text server, and perform all the tasks of the CTXAPP role. Users with the CTXAPP role can perform the following tasks: create indexes, manage the Oracle Text data dictionary (including creating and deleting preferences), and execute the Oracle Text PL/SQL packages.

Steps to use Oracle Text:

(1) Create a table to hold some documents. This example uses a primary key column to identify each document and a small VARCHAR2 column to hold its text.

    CREATE TABLE docs (id NUMBER PRIMARY KEY, text VARCHAR2(80));

(2) Insert two sample documents into the table:

    INSERT INTO docs VALUES (1, 'the first doc');
    INSERT INTO docs VALUES (2, 'the second doc');
    COMMIT;

(3) Use Oracle Text Manager to create and modify preferences; the preferences will be associated with the index.

(4) Create a text index using Oracle Text Manager. Alternatively, enter the following SQL statement, which uses the default preferences:

    CREATE INDEX doc_index ON docs(text) INDEXTYPE IS ctxsys.context;

(5) Use the CONTAINS function to issue a content-based document query. For example:

    SELECT id FROM docs WHERE CONTAINS(text, 'first') > 0;

This finds all rows in docs whose text column contains the word first (that is, document 1). The > 0 part is required to make the statement valid Oracle SQL, because Oracle SQL does not support Boolean return values from functions.
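Step (3) can also be done in PL/SQL instead of Oracle Text Manager. The following is a minimal sketch, not a definitive recipe: the preference name my_lexer is hypothetical, the choice of the mixed_case attribute is only an illustration, and it assumes the connected user has the CTXAPP role. It would be used in place of the default CREATE INDEX statement in step (4).

```sql
-- Hypothetical preference name: my_lexer (not from the original article).
BEGIN
  -- Create a lexer preference based on the BASIC_LEXER object.
  ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER');
  -- Illustrative setting: keep the original letter case of tokens.
  ctx_ddl.set_attribute('my_lexer', 'mixed_case', 'YES');
END;
/

-- Build the index with the custom preference instead of the defaults.
-- If doc_index already exists from step (4), drop it first.
CREATE INDEX doc_index ON docs(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_lexer');
```

Any class not named in the PARAMETERS string keeps its default preference, so only the settings that differ from the defaults need to be specified.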
The docs example above is deliberately simple; it is intended to show the complete sequence of steps for building a full-text index with Oracle Text, summarized as follows:

(1) Create a table and load the text (the table includes the text column to be searched)
(2) Configure the index
(3) Build the index
(4) Issue queries
(5) Maintain the index: synchronization and optimization (introduced later)

3 Loading text

To implement full-text retrieval, the text must first be loaded correctly into a database table. The default indexing behavior requires documents to be loaded into a text column, although documents can also be stored in other ways, including on the file system or as URLs (set through the 'data storage' option).

By default, the system expects the documents to be loaded into a text column. The text column can be VARCHAR2, CLOB, BLOB, CHAR, or BFILE. Note that the deprecated column types LONG and LONG RAW are supported only for migrating Oracle7 systems to Oracle8. You cannot create an index on columns of type NCLOB, DATE, or NUMBER.

As for the document format, the system can index most document formats, including HTML, PDF, Microsoft Word, and plain text, so you can load any of these document types into the text column (set through the 'filter' option).
For more information on the supported document formats, see the appendix 'Supported Filter Formats' in the Oracle Text User's Guide and Reference.

The main loading methods are:

(1) SQL INSERT statements
(2) The ctxload executable
(3) SQL*Loader
(4) The DBMS_LOB.LOADFROMFILE() PL/SQL procedure, which loads a LOB from a BFILE
(5) Oracle Call Interface

4 Creating an Oracle Text index

After loading text into the text column, you can create an Oracle Text index. Documents are stored in many different schemes, formats, and languages, so each Oracle Text index has many options that must be set to configure it for a specific situation. When creating an index, Oracle Text can use a number of default values, but in most cases users need to configure the index by specifying preferences.

The options of an index are organized into functional groups called 'classes'. Each class covers one aspect of the configuration, and the classes can be thought of as questions about the document database, for example: datastore, filter, lexer, wordlist, storage, and so on.

Each class has a number of predefined behaviors called objects. Each object is one possible answer to the question posed by its class, and most objects have attributes. Customizing objects through their attributes makes the index configuration flexible enough to suit different applications.

(1) Storage class
The storage class specifies the tablespace and creation parameters of the database tables and indexes that make up the Oracle Text index. It has only one basic object, BASIC_STORAGE, whose attributes are: I_INDEX_CLAUSE, I_TABLE_CLAUSE, K_TABLE_CLAUSE, N_TABLE_CLAUSE, P_TABLE_CLAUSE, R_TABLE_CLAUSE.

(2) Datastore class
The datastore class describes where and how the text in the column is stored. By default, the text is stored directly in the column, and each row in the table represents a separate, complete document.
Other storage locations include separate files and web pages identified by their URLs. There are seven basic objects: DEFAULT_DATASTORE, DETAIL_DATASTORE, DIRECT_DATASTORE, FILE_DATASTORE, MULTI_COLUMN_DATASTORE, URL_DATASTORE, USER_DATASTORE.

(3) Section group class
A section group is an object that specifies a set of document sections. Sections must be defined before you can query within them using the WITHIN operator, and sections are defined as part of a section group. There are seven basic objects: AUTO_SECTION_GROUP, BASIC_SECTION_GROUP, HTML_SECTION_GROUP, NEWS_SECTION_GROUP, NULL_SECTION_GROUP, XML_SECTION_GROUP, PATH_SECTION_GROUP.
(4) Wordlist class
The wordlist class identifies the language used for the stemming and fuzzy-matching query options. It has only one basic object, BASIC_WORDLIST, whose attributes are: FUZZY_MATCH, FUZZY_NUMRESULTS, FUZZY_SCORE, STEMMER, SUBSTRING_INDEX, WILDCARD_MAXTERMS, PREFIX_INDEX, PREFIX_MAX_LENGTH, PREFIX_MIN_LENGTH.

(5) Index set class
An index set is a collection of one or more Oracle indexes (not Oracle Text indexes) used to create an Oracle Text index of type CTXCAT. It has only one basic object, BASIC_INDEX_SET.

(6) Lexer class
The lexer class identifies the language of the text and determines how tokens are recognized in it. The default lexer is for English and other Western European languages; it uses whitespace, standard punctuation, and non-alphanumeric characters to delimit tokens, and case sensitivity is disabled. There are eight basic objects: BASIC_LEXER, CHINESE_LEXER, CHINESE_VGRAM_LEXER, JAPANESE_LEXER, JAPANESE_VGRAM_LEXER, KOREAN_LEXER, KOREAN_MORPH_LEXER, MULTI_LEXER.

(7) Filter class
The filter determines how text is filtered for indexing. Filters can be used to index formatted documents (such as word-processor files), plain text, and HTML documents. There are five basic objects: CHARSET_FILTER, INSO_FILTER, NULL_FILTER, PROCEDURE_FILTER, USER_FILTER.

(8) Stoplist class
The stoplist class specifies a list of words (called stopwords) that are not included in the index. There are two basic objects: BASIC_STOPLIST (all the stopwords are in one language) and MULTI_STOPLIST (a multilingual stoplist containing stopwords in several languages).

5 Querying

Once an index has been built, you can issue text queries with the CONTAINS operator in a SELECT statement. Two kinds of query can be made with CONTAINS: word queries and ABOUT queries.

5.1 Word query example

A word query is a query on the exact word or phrase entered between the single quotes in the CONTAINS operator.
The following example finds all documents whose text column contains the word Oracle. The score of each row is selected with the SCORE operator, using the label 1:

    SELECT SCORE(1), title FROM news WHERE CONTAINS(text, 'Oracle', 1) > 0;

In the query expression, you can use text operators such as AND and OR to get different results. Structured predicates can also be added to the WHERE clause. To count the hits (matches) of a query, you can use count(*), CTX_QUERY.COUNT_HITS, or CTX_QUERY.EXPLAIN.

5.2 ABOUT query example

In all languages, an ABOUT query increases the number of relevant documents returned by a query. In English, ABOUT queries can use the theme component of the index, which is created by default. The operator then returns documents based on the concept of the query, not only on the exact word or phrase specified.
For example, the following query finds all documents in the text column that are about the subject politics, not only the documents that contain the word politics:

    SELECT SCORE(1), title FROM news WHERE CONTAINS(text, 'about(politics)', 1) > 0;

6 Displaying documents that satisfy the query

Typically, an application that queries with Oracle Text lets users view the documents returned by a query. The user selects a document from the hit list, and the application then displays it in some form. With Oracle Text, the document can be rendered in different ways. For example, it can be displayed with the query terms highlighted. The highlighted terms can be the words of a word query or the themes of an English ABOUT query.

The available output forms and the procedure used for each are:

    Highlighted document, plain-text version             CTX_DOC.MARKUP
    Highlighted document, HTML version                   CTX_DOC.MARKUP
    Highlight offset information, plain-text version     CTX_DOC.HIGHLIGHT
    Highlight offset information, HTML version           CTX_DOC.HIGHLIGHT
    Plain-text version of the document, no highlights    CTX_DOC.FILTER
    HTML version of the document, no highlights          CTX_DOC.FILTER

7 Index maintenance

After the index is built, what should be done when the data in the table changes, for example when records are added or modified? Because DML statements on the table do not automatically update the index, the index must be synchronized (SYNC) and optimized so that it correctly reflects the changes in the data.

After the index is built, you can see that Oracle has automatically generated the following tables (assuming the index is named myindex): DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R, DR$MYINDEX$N. The $I table is the most important one, and you can query it:

    SELECT token_text, token_count FROM dr$myindex$i WHERE rownum <= 20;

The query results are omitted here.
It can be seen that this table stores the term records that Oracle generates when it analyzes the documents, including each term's location, number of occurrences, hash value, and so on. When the content of the documents changes, the content of the $I table must change accordingly, so that Oracle retrieves the correct content when performing a full-text search (the core of a full-text search is, after all, a query against this table). So how is the content of this table maintained? The index cannot be rebuilt every time the data changes; this is what SYNC and OPTIMIZE are for.

Synchronize (SYNC): saves new terms to the $I table.
Optimize (OPTIMIZE): clears garbage from the $I table, mainly by removing terms that belong to deleted documents.

Oracle provides a so-called CTX server to do the synchronization and optimization work. Just run this process in the background; it monitors changes in the data and synchronizes the index in time.
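Synchronization and optimization can also be triggered by hand from PL/SQL. The following is a minimal sketch, assuming the index is named myindex as in the tables above and that the connected user owns the index and is allowed to execute the CTX_DDL package (for example through the CTXAPP role). The choice of the FULL optimization level is only an illustration.

```sql
-- Rows changed by DML since the last synchronization wait in the
-- pending queue, visible through the CTX_USER_PENDING view.
SELECT pnd_index_name, pnd_rowid FROM ctx_user_pending;

BEGIN
  -- SYNC: process the pending rows and add their terms to the $I table.
  ctx_ddl.sync_index('myindex');
  -- OPTIMIZE: compact the $I table and remove entries left behind by
  -- deleted documents ('FAST' only defragments, 'FULL' also cleans up).
  ctx_ddl.optimize_index('myindex', 'FULL');
END;
/
```

Running the two calls from a scheduled job is one way to keep the index current when no background server process is running.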