Research and Implementation of Search Engine Based on Java Technology
Table of Contents

Table of Contents
Summary
Chapter 1 Introduction
Chapter 2 Search Engine Structure
2.1 System Overview
2.2 Composition of a Search Engine
2.2.1 Network Robot
2.2.2 Indexing and Search
2.2.3 Web Server
2.3 Main Indicators of a Search Engine and Their Analysis
2.4 Summary
Chapter 3 The Network Robot
3.1 What Is a Network Robot
3.2 Structural Analysis of the Network Robot
3.2.1 How to Parse HTML
3.2.2 Spider Program Structure
3.2.3 How to Construct a Spider Program
3.2.4 How to Improve Program Performance
3.2.5 Code Analysis of the Network Robot
3.3 Summary
Chapter 4 Indexing and Search Based on Lucene
4.1 What Is Lucene Full-Text Search
4.2 Analysis of Lucene's Principles
4.2.1 The Implementation Mechanism of Full-Text Search
4.2.2 Lucene's Indexing Efficiency
4.2.3 Chinese Word Segmentation
4.3 Combining Lucene with the Spider
4.4 Summary
Chapter 5 Web Server Based on Tomcat
5.1 What Is a Tomcat-Based Web Server
5.2 User Interface Design
5.3.1 Client Design
5.3.2 Server Design
5.3 Deploying the Project on Tomcat
5.4 Summary
Chapter 6 Search Engine Strategies
6.1 Introduction
6.2 Topic-Oriented Search Strategies
6.2.1 Guide Words
6.2.3 Authority Pages and Hub Pages
6.3 Summary
References
Summary
Resources on the network are abundant, but searching for information effectively is difficult. Building a search engine is the best way to solve this problem. This article first introduces the system structure of an Internet-based search engine, and then explains it in detail from three aspects: the network robot, the index engine, and the web server. To understand this technology more deeply, I also implemented a news search engine of my own.
Starting from a specified web page, the news search engine follows hyperlinks to collect news pages, indexes each news item, and adds it to the index database. When the web server receives a client request, it searches the index database for the matching news and returns it.
In the chapters that introduce the search engine, besides explaining the core technology in detail, I combine the explanation with the news search engine itself and include diagrams, so that it is easy to understand.
Abstract
The resources on the Internet are abundant, but it is a difficult job to search out useful information, so a search engine is the best method to solve this problem. This article first introduces in detail the system structure of a search engine based on the Internet, then discusses its three core parts: the network robot, the index engine, and the web server. To understand the technology better, I have also programmed a news search engine myself.
The news search engine collects news by following hyperlinks from an appointed web page, then indexes every piece of news found and adds it to the index database. After receiving a customer's request through the web server, it quickly retrieves the matching news from the index database and returns it.
The chapters introducing the search engine not only elaborate the core technology but also combine it with the actual code; pictures are included, making it easy to understand.
Chapter 1 Introduction
Faced with vast network resources, the search engine provides an entry point for every user surfing the web; through it, users can reach almost any page they want to visit. It has therefore become the most widely used online service apart from e-mail.
Search engine technology has developed together with the WWW, and search engines have gone through three generations:
The first generation of search engines appeared in 1994. These engines generally indexed fewer than 1,000,000 web pages, rarely re-collected pages or refreshed their indexes, and their retrieval speed was very slow, often requiring a wait of 10 seconds or longer. They were built on relatively mature technologies such as IR (Information Retrieval), networking, and databases, applying existing techniques to the WWW. In March and April 1994, the web crawler World Wide Web Worm (WWWW) averaged about 1,500 queries per day. The second-generation systems, around 1996, mostly used distributed architectures (multiple machines working together) to improve data scale, response speed, and the number of users served; they generally maintained an index of about 50,000,000 web pages and could answer 10,000,000 user retrieval requests per day. In November 1997, the most advanced search engines of the time had built indexes of 2,000,000 to 100,000,000 web pages; the AltaVista search engine claimed to handle 20,000,000 queries every day.
At a search engine conference in 2000, according to a speech by Google founder Larry Page, Google was crawling the web with 3,000 PCs running Linux, and was adding machines to this cluster at a rate of 30 per day to keep pace with the growth of the web. Each machine ran multiple crawlers; at an average speed of 48.5 pages per second, more than 4,000,000 web pages could be collected in one day.
The term "search engine" is widely used on the Internet both in China and abroad, but its meaning varies. In the United States it usually refers to Internet-based search engines: they collect tens of millions to billions of web pages with a network robot program and index every word in them, which is what we call full-text search. Well-known Internet search engines include First Search, Google, and HotBot. In China, the term usually refers to search services based on website directories, or to search within a particular site. What I study here is Internet-based search technology.
Chapter 2 Search Engine Structure
2.1 System Overview
A search engine takes the user's query request, finds matching information in the index database according to certain algorithms, and returns it to the user. To ensure the accuracy and freshness of the information found, the search engine must establish and maintain a huge index database. A typical search engine consists of a network robot program, indexing and search programs, an index database, and several other parts.
System structure diagram
2.2 Composition of search engines
2.2.1 Network Robot
The network robot, also called a "Spider", is a powerful web scanner. It extracts the hyperlinks from a web page and adds them to the queue of pages waiting to be scanned. Because hyperlinks pervade the Web, a Spider program can in theory reach every page on the Web.
To guarantee both the breadth and the depth of its crawl, the network robot must start from some important links and follow appropriate scanning strategies.
2.2.2 Indexing and Search
The network robot stores the pages it collects in a temporary database. If this information were queried directly with SQL, the speed would be unbearable. To improve retrieval efficiency, an index must be built and stored in the inverted-file format; if the index is not refreshed in time, users cannot retrieve newly collected pages.
When the user enters search criteria, the search program looks them up in the index database, ranks the matching records according to a certain policy, and returns them to the user.
2.2.3 Web Server
Customers generally query through a browser, so the system must provide a web server connected to the index database. When a customer enters a query in the browser, the web server receives the query, looks it up in the index database, and returns the results to the client.
2.3 Main Indicators of a Search Engine and Their Analysis
The main indicators of a search engine are response time, recall, precision, and relevance. These determine the engine's technical quality, and the technical quality in turn determines how the engine is evaluated. A good search engine should respond quickly and achieve high recall and precision, all of which must be guaranteed by its technical design.
Recall: the ratio of the number of relevant documents in one search result to the total number of relevant documents, i.e. recall = relevant results retrieved / all relevant documents.
Precision: the ratio of the number of relevant documents in one search result to the total number of results returned, i.e. precision = relevant results retrieved / all results retrieved.
Relevance: a measure of the similarity between the user's query and the search results.
Ranking quality: the ability to sort and grade search results, and to resist interference from spam pages.
2.4 Summary
This chapter analyzed the structure and performance indicators of an Internet-based search engine. On the basis of this study, I implemented a simple search engine using JavaTM technology and some open-source tools; the following chapters analyze this design in detail.
Chapter 3 The Network Robot
3.1 What is a network robot?
The network robot, also known as a Spider program, is a specialized bot used to find large numbers of web pages. It starts from an initial web page, then follows that page's hyperlinks to visit other pages, and by repeating this process it can in principle scan all the pages on the Internet.
Internet-based search engines are the earliest application of Spiders. For example, the search giant Google uses network robot programs to traverse web sites in order to create and maintain its large databases.
A network robot can also obtain the file list and hierarchy of a site by scanning from its home page, and can be used to find broken hyperlinks and spelling errors.
3.2 Structure analysis of network robots
The Internet is built on many related protocols, with the more complex protocols layered on top of lower-level ones. The Web is based on HTTP (Hypertext Transfer Protocol), HTTP in turn runs over TCP/IP (Transmission Control Protocol / Internet Protocol), and TCP/IP is exposed to programs through the socket interface. So a network robot is, in essence, a socket-based network program.
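To make this concrete, here is a minimal sketch (not part of the thesis system) of fetching a page over a raw TCP socket with HTTP/1.0; the host name is a placeholder.

// Sketch: fetch a page over a raw socket; "example.com" is a placeholder host.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        String host = "example.com";
        try (Socket s = new Socket(host, 80)) {
            PrintWriter out = new PrintWriter(s.getOutputStream(), true);
            out.print("GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // HTTP status line and headers, then the HTML
            }
        }
    }
}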
3.2.1 How to Parse HTML
Because the information on the Web is expressed in HTML, the first problem a network robot faces when retrieving pages is how to parse HTML. Before explaining how parsing is done, let us first introduce the kinds of data found in HTML:
Text: all data other than scripts and tags.
Comments: explanatory text left by the page author, invisible to the user.
Simple tags: HTML tags expressed as a single element.
Start and end tags: paired HTML tags that control the HTML code they enclose.
When parsing, we do not need to care about all tags, only a few important ones.
Hyperlink tag
Hyperlinks define the Web's ability to link documents across the Internet. Their main purpose is to let the user move to a new page, which makes them the tag the network robot cares about most.
Image map tag
The image map is another very important tag. It lets the user move to a new page by clicking on a region of an image.
Form tag
A form is a unit of a web page where data can be entered. Many sites let users fill in data and submit it by clicking a button; this is the typical use of a form.
Table tag
Tables are an HTML component usually used to format and display data.
We have two ways to parse these HTML tags: with the Swing HTML classes in JavaTM, or with the HTMLPage class in the Bot package. I use the latter in the actual program.
The HTMLPage class in the Bot package reads data from a specified URL and extracts useful information from it. Several important methods are listed below.
HTMLPage constructor: constructs the object and specifies the HTTP object used for communication.
public HTMLPage(HTTP http)
getForms method: gets the list of forms found by the last call to the open method.
public Vector getForms()
getHTTP method: gets the HTTP object that was passed to the constructor.
public HTTP getHTTP()
getImage method: gets the list of images on the page.
public Vector getImage()
getLinks method: gets the list of links on the page.
public Vector getLinks()
open method: opens and reads a page; if a callback object is specified, it receives the page data.
public void open(String url, HTMLEditorKit.ParserCallback a)
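A short usage sketch based on the signatures above (the URL is a placeholder, and the Link accessor names follow the parsing code of Chapter 4):

import com.heaton.bot.HTMLPage;
import com.heaton.bot.HTTPSocket;
import com.heaton.bot.Link;
import java.util.Vector;

// Sketch: open one page with the Bot package and list its hyperlinks.
public class PageDump {
    public static void main(String[] args) throws Exception {
        HTTPSocket http = new HTTPSocket();
        HTMLPage page = new HTMLPage(http);
        page.open("http://127.0.0.1/news.htm", null);   // no parser callback
        Vector links = page.getLinks();
        for (int i = 0; i < links.size(); i++) {
            Link link = (Link) links.elementAt(i);
            System.out.println(link.getHREF() + " | " + link.getPrompt());
        }
    }
}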
3.2.2 Spider Program Structure
The network robot must move from one web page to another, so it must find the hyperlinks on each page. The program first parses the HTML code of the page and finds the hyperlinks within it; the Spider program can then be implemented with either a recursive or a non-recursive structure.
Recursive structure
Recursion is a programming technique in which a method calls itself. Although it is easy to implement, it consumes a lot of memory and cannot take advantage of multithreading, so it is not suitable for large projects.
Non-recursive structure
This method uses a queue data structure: when the Spider program finds a hyperlink, it does not call itself but adds the link to a waiting queue. When the Spider finishes scanning the current page, it visits the next hyperlink in the queue according to the chosen policy.
Although only one queue was described above, four queues are used in the actual program, each holding URLs in the same processing state:
Waiting queue: URLs in this queue are waiting to be processed by the Spider program; newly discovered URLs are added here.
Processing queue: URLs are moved here when the Spider program starts processing them.
Error queue: if an error occurs while parsing a page, its URL is moved here; URLs in this queue are never moved to another queue.
Completion queue: if a page is parsed without error, its URL is moved here; URLs in this queue are never moved to another queue.
A URL can be in only one queue at a time; we call this the state of the URL. A sketch of this bookkeeping appears after the next paragraph.
The figure above shows how URLs move between the queues. The Spider program starts running as soon as a URL is added to the waiting queue. As long as the waiting queue holds a page, or the Spider is still processing one, the program continues its work. When the waiting queue is empty and no page is currently being processed, the Spider stops.
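A minimal sketch of the four-queue state machine described above (the class and method names are mine, not from the Bot package):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch: each URL lives in exactly one of four containers, i.e. one state.
public class UrlWorkload {
    private final Queue<String> waiting = new ArrayDeque<>();
    private final Set<String> processing = new HashSet<>();
    private final Set<String> error = new HashSet<>();
    private final Set<String> complete = new HashSet<>();

    // Newly discovered URLs enter the waiting queue unless already known.
    public synchronized void add(String url) {
        if (!waiting.contains(url) && !processing.contains(url)
                && !error.contains(url) && !complete.contains(url)) {
            waiting.add(url);
        }
    }

    // The spider takes the next URL and marks it as being processed.
    public synchronized String next() {
        String url = waiting.poll();
        if (url != null) processing.add(url);
        return url;
    }

    // After parsing, the URL moves to exactly one terminal queue.
    public synchronized void done(String url, boolean failed) {
        processing.remove(url);
        if (failed) error.add(url); else complete.add(url);
    }

    // The spider stops when nothing is waiting and nothing is in flight.
    public synchronized boolean finished() {
        return waiting.isEmpty() && processing.isEmpty();
    }
}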
3.2.3 How to Construct a Spider Program
Before constructing a Spider program, let us look at how its parts work together and how the program can be extended.
The flow chart is as follows:
The ISpiderReportable interface
This is an interface the controller must implement; through its callback methods it receives the pages the Spider encounters. The interface defines the events the Spider sends to its controller, and different Spider programs can be created by supplying handlers for these events. Here is the interface declaration:
public interface ISpiderReportable {
    public boolean foundInternalLink(String url);
    public boolean foundExternalLink(String url);
    public boolean foundOtherLink(String url);
    public void processPage(HTTP page);
    public void completePage(HTTP page, boolean error);
    public boolean getRemoveQuery();
    public void spiderComplete();
}
3.2.4 How to Improve Program Performance
There is a massive number of web pages on the Internet, so developing an efficient Spider program is very important. Several techniques for improving performance are introduced below.
Java multithreading
A thread is a single path of execution through a program; multithreading is the ability to run several tasks at the same time, a division of labor inside one program.
The usual way to optimize a program is to find its bottleneck and improve it. The bottleneck is the slowest part of a program, and it holds back every other task. For example, suppose a Spider program needs to download ten pages: to complete this task it must issue requests to the server and wait for the responses, and while it is waiting it can do nothing else, which hurts efficiency. With multithreading, the waiting times for these pages overlap instead of following one another, which greatly improves performance; a sketch follows.
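A short sketch of overlapping downloads with a thread pool (the URLs are placeholders; the Bot package's own threading is organized differently):

import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: download several pages concurrently so network waits overlap.
public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "http://127.0.0.1/news.htm",    // placeholder URLs
                "http://127.0.0.1/a.htm",
                "http://127.0.0.1/b.htm");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String u : urls) {
            pool.submit(() -> {
                try (var in = new URL(u).openStream()) {
                    System.out.println(u + ": " + in.readAllBytes().length + " bytes");
                } catch (Exception e) {
                    System.err.println(u + ": " + e);   // would go to the error queue
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}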
Database technology
When a Spider program visits a large web site, it must store its queues in an effective way, because these queues hold the large lists of pages the Spider must maintain. Keeping them entirely in memory degrades performance, so we can put them in a database to reduce the consumption of system resources; a sketch follows.
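A minimal sketch of a database-backed waiting queue over JDBC (the table layout and JDBC URL are assumptions of mine; the Bot package's IWorkloadStorable interface, used in the code below, is the natural pluggable point for such a store):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: persist the URL workload in a table, assumed to be created as
//   CREATE TABLE workload (url VARCHAR(255) PRIMARY KEY, status CHAR(1))
// with status W(aiting), P(rocessing), E(rror) or C(omplete).
public class SqlWorkload {
    private final Connection conn;

    public SqlWorkload(String jdbcUrl) throws Exception {
        conn = DriverManager.getConnection(jdbcUrl);   // placeholder JDBC URL
    }

    public void add(String url) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO workload (url, status) VALUES (?, 'W')")) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    public String next() throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT url FROM workload WHERE status = 'W'");
             ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) return null;
            String url = rs.getString(1);
            mark(url, "P");
            return url;
        }
    }

    void mark(String url, String status) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE workload SET status = ? WHERE url = ?")) {
            ps.setString(1, status);
            ps.setString(2, url);
            ps.executeUpdate();
        }
    }
}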
3.2.5 Code Analysis of the Network Robot
The program structure diagram is as follows:
The program code is implemented as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import com.heaton.bot.HTTP;
import com.heaton.bot.HTTPSocket;
import com.heaton.bot.ISpiderReportable;
import com.heaton.bot.IWorkloadStorable;
import com.heaton.bot.Spider;
import com.heaton.bot.SpiderInternalWorkload;
/**
 * Constructs the bot program
 */
public class Searcher implements ISpiderReportable {
    public static void main(String[] args) throws Exception {
        IWorkloadStorable wl = new SpiderInternalWorkload();
        Searcher _searcher = new Searcher();
        Spider _spider = new Spider(_searcher,
                "http://127.0.0.1/news.htm", new HTTPSocket(), 100, wl);
        _spider.setMaxBody(100);
        _spider.start();
    }
    // Called when an internal link is found; url is the discovered URL.
    // Returning true adds it to the workload, otherwise it is ignored.
    public boolean foundInternalLink(String url) {
        return false;
    }
    // Called when an external link is found; returning true adds the URL
    // to the workload, otherwise it is ignored.
    public boolean foundExternalLink(String url) {
        return false;
    }
    // Called when another kind of link is found, i.e. not an HTML page:
    // it may be e-mail or FTP.
    public boolean foundOtherLink(String url) {
        return false;
    }
    // Processes a page; this is the actual work the Spider performs.
    public void processPage(HTTP http) {
        System.out.println("Scanning page: " + http.getURL());
        new HTMLParse(http).start();
    }
    // Called when a page has been completely processed.
    public void completePage(HTTP http, boolean error) {
    }
    // Used by the Spider to decide whether query strings should be
    // stripped from URLs; returns true if they should be removed.
    public boolean getRemoveQuery() {
        return true;
    }
    // Called when the Spider has no work left.
    public void spiderComplete() {
    }
}
3.3 Summary
This chapter first introduced the basic concepts of the network robot, then analyzed the structure and functionality of the Spider program, and finally explained the concrete code.
I used JavaTM technology in the program, mainly the net and io packages, plus a third-party development package, Bot (provided by Jeff Heaton).
Chapter 4 Indexing and Search Based on Lucene
4.1 What is Lucene full-text search
Lucene is an open-source project under Jakarta Apache. It is a full-text index engine toolkit written in Java that can easily be embedded in all kinds of applications to give them full-text indexing and retrieval.
4.2 Analysis of the Principle of Lucene
4.2.1 The Implementation Mechanism of Full-Text Search
Lucene's API is designed to be quite generic: its input and output structures resemble a database's table ==> record ==> field scheme, so many traditional applications, databases, and so on can be conveniently mapped onto Lucene's storage structure and interfaces.
Overall, you can think of Lucene as a database system that supports full-text indexing.
Index data source: doc(field1, field2...) doc(field1, field2...)
                         \  indexer  /
                        ______________
                       | Lucene index |
                        --------------
                         /  searcher  \
Result output: Hits(doc(field1, field2) doc(field1...))
Document: a "unit" to be indexed; a document consists of multiple fields.
Field: one field of a document.
Hits: the query result set, consisting of the matched documents.
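A minimal end-to-end sketch using the Lucene 1.x-era API that the code later in this chapter also uses (the index path, analyzer, and field names here are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: index one document, then search it back.
public class MiniLucene {
    public static void main(String[] args) throws Exception {
        // Indexing: "title" is analyzed and indexed, "url" is stored as-is.
        IndexWriter writer = new IndexWriter("c:\\demo\\index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("title", "hello lucene"));
        doc.add(Field.UnIndexed("url", "http://127.0.0.1/hello.htm"));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
        // Searching: parse a query against the "title" field and print hits.
        IndexSearcher searcher = new IndexSearcher("c:\\demo\\index");
        Query query = QueryParser.parse("hello", "title", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("url"));
        }
        searcher.close();
    }
}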
4.2.2 Lucene's Indexing Efficiency
Books often come with a keyword index at the back (for example: Beijing: 12, 34; Shanghai: 3, 77 ...), which helps readers find pages on a topic much faster than leafing through the whole book. Database indexes speed up queries for the same reason; imagine how much slower a lookup would be if you had to scan a book page by page. Another reason an index is fast is that it is kept sorted; for a retrieval system, the core problem is a sorting problem.
Because database indexes are not designed for full-text retrieval, they are useless for LIKE "%keyword%" queries: a LIKE query degenerates into a traversal, like flipping through a book page by page, so for database applications involving fuzzy queries, LIKE is extremely harmful to performance. Matching several keywords at once, LIKE "%keyword1%" AND LIKE "%keyword2%" ..., is worse still. The key to an efficient retrieval system is therefore an inverted-index mechanism like a book's keyword index: alongside the sequentially stored data source (say, a collection of articles), keep a sorted keyword list that stores the keyword ==> article mapping. Each index entry records [keyword ==> article number, number of occurrences (possibly even positions: start offset, end offset), frequency of occurrence], so retrieval turns a fuzzy query into several exact queries that can all use the index. This greatly improves the efficiency of multi-keyword queries, and the full-text retrieval problem ultimately reduces to a sorting (ranking) problem. Compared with a database's exact queries, fuzzy querying is an inherently imprecise problem, which is also why ordinary databases are of limited use for full-text retrieval. The core of Lucene is that, through a special index structure, it implements the full-text indexing that traditional databases are not good at, and it provides extension interfaces that make customization for different applications easy. The following table compares Lucene with a database's fuzzy query (a toy sketch of the keyword ==> document mapping appears after the table):
Lucene full-text index engine vs. database:

Index
- Lucene: builds a reverse (inverted) index over all the data in the data source.
- Database: for LIKE queries, the data index is not used at all; every record must be matched against the fuzzy pattern one by one, which is several orders of magnitude slower than an indexed search.

Matching
- Lucene: matches by term; through the language-analysis interface, support for non-English languages such as Chinese can be implemented.
- Database: LIKE "%net%" also matches "netherlands", and the multi-keyword fuzzy match LIKE "%com%net%" cannot match "xxx.net...xxx.com" with the words in reverse order.

Relevance
- Lucene: has a ranking algorithm; results with a higher degree of match (similarity) are returned first.
- Database: no control over the degree of match; a record where a keyword appears 5 times ranks the same as one where it appears once.

Result output
- Lucene: outputs only the best-matching first 100 or so results through a special algorithm, and reads the result set in small batches through a buffer.
- Database: returns the entire result set; when there are very many matching entries (say, tens of thousands), a lot of memory is needed to hold the temporary result set.

Customizability
- Lucene: by implementing language-analysis interfaces, it is easy to customize index rules for the application's needs (including Chinese support).
- Database: no interface, or the interface is too complex to customize.

Conclusion
- Lucene: suited to high-load fuzzy-query applications where the fuzzy-query rules are application-specific and the amount of indexed data is large.
- Database: suited to low load, simple fuzzy-matching rules, or applications that need little fuzzy querying.
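A toy sketch of the keyword ==> document mapping described above (all names are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy inverted index: maps each word to the IDs of the documents containing it.
public class InvertedIndex {
    // A TreeMap keeps the keyword list sorted, part of why lookups are cheap.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new ArrayList<>()).add(docId);
            }
        }
    }

    // A multi-keyword "fuzzy" query becomes several exact index lookups.
    public List<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "Lucene is a full text search engine");
        idx.addDocument(2, "Databases use LIKE for fuzzy search");
        System.out.println(idx.search("search"));   // [1, 2]
        System.out.println(idx.search("lucene"));   // [1]
    }
}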
4.2.3 Chinese Word Segmentation
For Chinese, full-text indexing must first solve a language-analysis problem. In English, the words of a sentence are naturally separated by spaces, but in Chinese, Japanese, and Korean the characters of a sentence run together, so deciding which characters form a word is itself a major problem if we want to index by word.
First, single characters cannot be used as the indexing unit, otherwise a search for "Shanghai" (上海) would also match "sea" (海) alone. But given a phrase such as 北京天安门 ("Beijing Tiananmen"), how should a computer segment it according to Chinese language habits: 北京 / 天安门 or 北 / 京天安门? Making the computer split text the way a speaker would usually requires a fairly rich dictionary to identify the words in a sentence accurately. Another solution is automatic segmentation: split the text into overlapping character pairs according to a 2-gram (bigram) scheme, for example 北京天安门 ==> 北京 / 京天 / 天安 / 安门. At query time, the query string is segmented by the same rule and the resulting bigrams are combined with an "and" relation, so whether the user queries 北京 or 天安门, the query maps onto the corresponding index entries. This approach works just as well for other Asian languages such as Korean and Japanese. The advantage of automatic segmentation is that it has no dictionary maintenance cost and is simple to implement; the disadvantage is lower index efficiency, but for small and medium applications bigram segmentation is quite sufficient. A bigram-based index is roughly the same size as the source files, whereas for English an index is generally only 30%-40% of the size of the original text. A tiny tokenizer sketch follows, and then a comparison table.
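A tiny sketch of such a bigram tokenizer (not Lucene's ChineseAnalyzer, just the idea):

import java.util.ArrayList;
import java.util.List;

// Sketch: split a CJK string into overlapping 2-grams (bigrams).
public class Bigram {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));   // each adjacent character pair
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("北京天安门"));   // [北京, 京天, 天安, 安门]
    }
}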
Automatic segmentation vs. dictionary-based segmentation:

Implementation
- Automatic segmentation: very simple to implement.
- Dictionary-based: complex.

Queries
- Automatic segmentation: adds complexity to query analysis.
- Dictionary-based: suitable for implementing more complex query syntax rules.

Storage efficiency
- Automatic segmentation: the index is redundant; it is almost as large as the original text.
- Dictionary-based: high index efficiency; the index is about 30% of the size of the original text.

Maintenance cost
- Automatic segmentation: no dictionary maintenance cost.
- Dictionary-based: dictionary maintenance cost is very high; Chinese, Japanese, and Korean need separately maintained dictionaries, and word-frequency statistics are also required.

Applicable fields
- Automatic segmentation: embedded systems (limited runtime resources); distributed systems (no dictionary-synchronization problem); multilingual environments (no dictionary maintenance cost).
- Dictionary-based: professional search engines with high requirements on query and storage efficiency.
4.3 Combination of Lucene and Spider
First, construct an Index class to index the collected content.
The code is as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import java.io.IOException;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Index {
    IndexWriter _writer = null;
    Index() throws Exception {
        _writer = new IndexWriter("c:\\news\\index", new ChineseAnalyzer(), true);
    }
    /**
     * Adds one piece of news to the index.
     * @param url   the news URL
     * @param title the news title
     * @throws java.lang.Exception
     */
    void addNews(String url, String title) throws Exception {
        Document _doc = new Document();
        _doc.add(Field.Text("title", title));
        _doc.add(Field.UnIndexed("url", url));
        _writer.addDocument(_doc);
    }
    /**
     * Optimizes the index and releases resources.
     * @throws java.lang.Exception
     */
    void close() throws Exception {
        _writer.optimize();
        _writer.close();
    }
}
Then, construct an HTML parsing class to index the news content collected by the bot program.
The code is as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import java.util.Iterator;
import java.util.Vector;
import com.heaton.bot.HTMLPage;
import com.heaton.bot.HTTP;
import com.heaton.bot.Link;

public class HTMLParse {
    HTTP _http = null;
    public HTMLParse(HTTP http) {
        _http = http;
    }
    /** Parses the web page, then builds the index. */
    public void start() {
        try {
            HTMLPage _page = new HTMLPage(_http);
            _page.open(_http.getURL(), null);
            Vector _links = _page.getLinks();
            Index _index = new Index();
            Iterator _it = _links.iterator();
            int n = 0;
            while (_it.hasNext()) {
                Link _link = (Link) _it.next();
                String _href = input(_link.getHREF().trim());
                String _title = input(_link.getPrompt().trim());
                _index.addNews(_href, _title);
                n++;
            }
            System.out.println("Indexed " + n + " news items");
            _index.close();
        }
        catch (Exception ex) {
            System.out.println(ex);
        }
    }
    /**
     * Works around the Chinese encoding problem in Java.
     * @param str the input Chinese string
     * @return the decoded Chinese string
     */
    public static String input(String str) {
        String temp = null;
        if (str != null) {
            try {
                temp = new String(str.getBytes("ISO8859_1"));
            }
            catch (Exception e) {
            }
        }
        return temp;
    }
}
4.4 Summary
When searching large amounts of data, simple database technology is very painful to use: speed becomes a huge bottleneck. This chapter therefore proposed using the full-text search engine Lucene for indexing and search.
Finally, concrete code showed how the Lucene full-text search engine and the Spider program are integrated with each other to implement news search.
Chapter 5 Web Server Based on Tomcat
5.1 What is a Tomcat-based web server
A web server is a server that provides the basic platform for information publishing, data query, data processing, and similar services on a network. The way a web server handles a page can be divided into three steps: first, the web browser sends a request for a specific page to the server; second, the web server receives the request, locates the requested page, and transfers it to the browser; third, the web browser receives the requested page and displays it. Tomcat is an open-source web application container for Java-based Servlet and JSP web applications. Tomcat is developed under the Apache Jakarta subproject and is maintained by volunteers from the open-source Java community. Tomcat Server follows the Servlet and JSP specifications, so we can say that Tomcat implements the Apache Jakarta specifications, and it does so better than most commercial application servers.
5.2 User Interface Design
5.3.1 Client Design
A good query interface is very important; Google, for example, is known for its simple query interface. In my design I likewise aimed for practicality and simplicity.
The screenshot of the query interface is as follows:
The screenshot of the search results is as follows:
5.3.2 Server Design
The server side mainly uses JavaTM Servlet technology. The user submits the query from the client through the GET method; Tomcat on the server accepts and parses the submitted parameters, then calls the Lucene development package to perform the search. Finally, the search results are sent back to the client in an HTTP response, completing one search operation.
The structure of the server servlet program is as follows:
The key code is implemented as follows:
public void search(String qc, PrintWriter out) throws Exception {
    // Create an IndexSearcher from the index directory
    IndexSearcher _searcher = new IndexSearcher("c:\\news\\index");
    // Create the Chinese analyzer
    Analyzer analyzer = new ChineseAnalyzer();
    // The query condition
    String line = qc;
    // Query is an abstract class
    Query query = QueryParser.parse(line, "title", analyzer);
    // Run the query and write each hit back to the client
    // (the HTML below is a sketch; the original output lines were truncated)
    Hits hits = _searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        out.println("<a href=\"" + doc.get("url") + "\">" + doc.get("title") + "</a><br>");
    }
    _searcher.close();
}
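A minimal sketch of the surrounding servlet (the class name and request parameter name are assumptions) showing how the search method above might be invoked:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet wrapper: reads the query sent by the client form
// via GET and delegates to the search method shown above.
public class SearchServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        String qc = request.getParameter("qc");   // assumed parameter name
        try {
            search(qc, out);                      // the method shown above
        } catch (Exception e) {
            throw new ServletException(e.toString());
        }
    }

    void search(String qc, PrintWriter out) throws Exception {
        // body as shown above
    }
}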