Research and Implementation of Search Engine Based on Java Technology
Table of Contents

Table of Contents
Summary
Chapter 1 Introduction
Chapter 2 Search Engine Structure
2.1 System Overview
2.2 Composition of a Search Engine
2.2.1 Network Robot
2.2.2 Indexing and Search
2.2.3 Web Server
2.3 Main Indicators of a Search Engine and Their Analysis
2.4 Summary
Chapter 3 The Network Robot
3.1 What Is a Network Robot
3.2 Structural Analysis of the Network Robot
3.2.1 How to Parse HTML
3.2.2 Spider Program Structure
3.2.3 How to Construct a Spider Program
3.2.4 How to Improve Program Performance
3.2.5 Code Analysis of the Network Robot
3.3 Summary
Chapter 4 Indexing and Search Based on Lucene
4.1 What Is Lucene Full-Text Search
4.2 Analysis of Lucene's Principles
4.2.1 The Implementation Mechanism of Full-Text Search
4.2.2 Lucene's Indexing Efficiency
4.2.3 Chinese Word Segmentation
4.3 Combining Lucene with the Spider
4.4 Summary
Chapter 5 Web Server Based on Tomcat
5.1 What Is a Tomcat-Based Web Server
5.2 User Interface Design
5.3.1 Client Design
5.3.2 Server Design
5.3 Deploying the Project on Tomcat
5.4 Summary
Chapter 6 Search Engine Strategies
6.1 Introduction
6.2 Topic-Oriented Search Strategies
6.2.1 Guide Words
6.2.3 Authority Pages and Hub Pages
6.3 Summary
References
Summary
Resources on the network are abundant, but searching for information effectively is difficult. Building a search engine is the best way to solve this problem. This article first introduces the system structure of an Internet-based search engine, and then explains it in detail from three aspects: the network robot, the index engine, and the web server. To understand this technology more deeply, I also implemented a news search engine of my own.
Starting from a specified web page, the news search engine follows hyperlinks to collect news pages, indexes each news item, and adds it to the index database. When the web server receives a client request, it searches the index database for the matching news and returns it.
In the chapters that introduce the search engine, besides explaining the core technology in detail, I combine the explanation with the news search engine itself and include diagrams, so that it is easy to understand.
Abstract
The resources on the Internet are abundant, but it is a difficult job to search out useful information, so a search engine is the best method to solve this problem. This article first introduces in detail the system structure of a search engine based on the Internet, then discusses its three core parts: the network robot, the index engine, and the web server. To understand the technology better, I have also programmed a news search engine myself.
The news search engine collects news by following hyperlinks from an appointed web page, then indexes every piece of news found and adds it to the index database. After receiving a customer's request through the web server, it quickly retrieves the matching news from the index database and returns it.
The chapters introducing the search engine not only elaborate the core technology but also combine it with the actual code; pictures are included, making it easy to understand.
Chapter 1 Introduction
Faced with vast network resources, the search engine provides an entry point for every user surfing the web; through it, users can reach almost any page they want to visit. It has therefore become the most widely used online service apart from e-mail.
Search engine technology has developed together with the WWW, and search engines have gone through three generations:
The first generation of search engines appeared in 1994. These engines generally indexed fewer than 1,000,000 web pages, rarely re-collected pages or refreshed their indexes, and their retrieval speed was very slow, often requiring a wait of 10 seconds or longer. They were built on relatively mature technologies such as IR (Information Retrieval), networking, and databases, applying existing techniques to the WWW. In March and April 1994, the web crawler World Wide Web Worm (WWWW) averaged about 1,500 queries per day. The second-generation systems, around 1996, mostly used distributed architectures (multiple machines working together) to improve data scale, response speed, and the number of users served; they generally maintained an index of about 50,000,000 web pages and could answer 10,000,000 user retrieval requests per day. In November 1997, the most advanced search engines of the time had built indexes of 2,000,000 to 100,000,000 web pages; the AltaVista search engine claimed to handle 20,000,000 queries every day.
At a search engine conference in 2000, according to a speech by Google founder Larry Page, Google was crawling the web with 3,000 PCs running Linux, and was adding machines to this cluster at a rate of 30 per day to keep pace with the growth of the web. Each machine ran multiple crawlers; at an average speed of 48.5 pages per second, more than 4,000,000 web pages could be collected in one day.
The term "search engine" is widely used on the Internet both in China and abroad, but its meaning varies. In the United States it usually refers to Internet-based search engines: they collect tens of millions to billions of web pages with a network robot program and index every word in them, which is what we call full-text search. Well-known Internet search engines include First Search, Google, and HotBot. In China, the term usually refers to search services based on website directories, or to search within a particular site. What I study here is Internet-based search technology.
Chapter 2 Search Engine Structure
2.1 System Overview
A search engine takes the user's query request, finds matching information in the index database according to certain algorithms, and returns it to the user. To ensure the accuracy and freshness of the information found, the search engine must establish and maintain a huge index database. A typical search engine consists of a network robot program, indexing and search programs, an index database, and several other parts.
System structure diagram
2.2 Composition of search engines
2.2.1 Network Robot
The network robot, also called a "Spider", is a powerful web scanner. It extracts the hyperlinks from a web page and adds them to the queue of pages waiting to be scanned. Because hyperlinks pervade the Web, a Spider program can in theory reach every page on the Web.
To guarantee both the breadth and the depth of its crawl, the network robot must start from some important links and follow appropriate scanning strategies.
2.2.2 Indexing and Search
The network robot stores the pages it collects in a temporary database. If this information were queried directly with SQL, the speed would be unbearable. To improve retrieval efficiency, an index must be built and stored in the inverted-file format; if the index is not refreshed in time, users cannot retrieve newly collected pages.
When the user enters search criteria, the search program looks them up in the index database, ranks the matching records according to a certain policy, and returns them to the user.
2.2.3 Web Server
Customers generally query through a browser, so the system must provide a web server connected to the index database. When a customer enters a query in the browser, the web server receives the query, looks it up in the index database, and returns the results to the client.
2.3 Main Indicators of a Search Engine and Their Analysis
The main indicators of a search engine are response time, recall, precision, and relevance. These determine the engine's technical quality, and the technical quality in turn determines how the engine is evaluated. A good search engine should respond quickly and achieve high recall and precision, all of which must be guaranteed by its technical design.
Recall: the ratio of the number of relevant documents in one search result to the total number of relevant documents, i.e. recall = relevant results retrieved / all relevant documents.
Precision: the ratio of the number of relevant documents in one search result to the total number of results returned, i.e. precision = relevant results retrieved / all results retrieved.
Relevance: a measure of the similarity between the user's query and the search results.
Ranking quality: the ability to sort and grade search results, and to resist interference from spam pages.
2.4 Summary
This chapter analyzed the structure and performance indicators of an Internet-based search engine. On the basis of this study, I implemented a simple search engine using JavaTM technology and some open-source tools; the following chapters analyze this design in detail.
Chapter 3 The Network Robot
3.1 What is a network robot?
The network robot, also known as a Spider program, is a specialized bot used to find large numbers of web pages. It starts from an initial web page, then follows that page's hyperlinks to visit other pages, and by repeating this process it can in principle scan all the pages on the Internet.
Internet-based search engines are the earliest application of Spiders. For example, the search giant Google uses network robot programs to traverse web sites in order to create and maintain its large databases.
A network robot can also obtain the file list and hierarchy of a site by scanning from its home page, and can be used to find broken hyperlinks and spelling errors.
3.2 Structure analysis of network robots
The Internet is built on many related protocols, with the more complex protocols layered on top of lower-level ones. The Web is based on HTTP (Hypertext Transfer Protocol), HTTP in turn runs over TCP/IP (Transmission Control Protocol / Internet Protocol), and TCP/IP is exposed to programs through the socket interface. So a network robot is, in essence, a socket-based network program.
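To make this concrete, here is a minimal sketch (not part of the thesis system) of fetching a page over a raw TCP socket with HTTP/1.0; the host name is a placeholder.

// Sketch: fetch a page over a raw socket; "example.com" is a placeholder host.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        String host = "example.com";
        try (Socket s = new Socket(host, 80)) {
            PrintWriter out = new PrintWriter(s.getOutputStream(), true);
            out.print("GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // HTTP status line and headers, then the HTML
            }
        }
    }
}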
3.2.1 How to Parse HTML
Because the information on the Web is expressed in HTML, the first problem a network robot faces when retrieving pages is how to parse HTML. Before explaining how parsing is done, let us first introduce the kinds of data found in HTML:
Text: all data other than scripts and tags.
Comments: explanatory text left by the page author, invisible to the user.
Simple tags: HTML tags expressed as a single element.
Start and end tags: paired HTML tags that control the HTML code they enclose.
When parsing, we do not need to care about all tags, only a few important ones.
Hyperlink tag
Hyperlinks define the Web's ability to link documents across the Internet. Their main purpose is to let the user move to a new page, which makes them the tag the network robot cares about most.
Image map tag
The image map is another very important tag. It lets the user move to a new page by clicking on a region of an image.
Form tag
A form is a unit of a web page where data can be entered. Many sites let users fill in data and submit it by clicking a button; this is the typical use of a form.
Table tag
Tables are an HTML component usually used to format and display data.
We have two ways to parse these HTML tags: with the Swing HTML classes in JavaTM, or with the HTMLPage class in the Bot package. I use the latter in the actual program.
The HTMLPage class in the Bot package reads data from a specified URL and extracts useful information from it. Several important methods are listed below.
HTMLPage constructor: constructs the object and specifies the HTTP object used for communication.
public HTMLPage(HTTP http)
getForms method: gets the list of forms found by the last call to the open method.
public Vector getForms()
getHTTP method: gets the HTTP object that was passed to the constructor.
public HTTP getHTTP()
getImage method: gets the list of images on the page.
public Vector getImage()
getLinks method: gets the list of links on the page.
public Vector getLinks()
open method: opens and reads a page; if a callback object is specified, it receives the page data.
public void open(String url, HTMLEditorKit.ParserCallback a)
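A short usage sketch based on the signatures above (the URL is a placeholder, and the Link accessor names follow the parsing code of Chapter 4):

import com.heaton.bot.HTMLPage;
import com.heaton.bot.HTTPSocket;
import com.heaton.bot.Link;
import java.util.Vector;

// Sketch: open one page with the Bot package and list its hyperlinks.
public class PageDump {
    public static void main(String[] args) throws Exception {
        HTTPSocket http = new HTTPSocket();
        HTMLPage page = new HTMLPage(http);
        page.open("http://127.0.0.1/news.htm", null);   // no parser callback
        Vector links = page.getLinks();
        for (int i = 0; i < links.size(); i++) {
            Link link = (Link) links.elementAt(i);
            System.out.println(link.getHREF() + " | " + link.getPrompt());
        }
    }
}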
3.2.2 Spider Program Structure
The network robot must move from one web page to another, so it must find the hyperlinks on each page. The program first parses the HTML code of the page and finds the hyperlinks within it; the Spider program can then be implemented with either a recursive or a non-recursive structure.
Recursive structure
Recursion is a programming technique in which a method calls itself. Although it is easy to implement, it consumes a lot of memory and cannot take advantage of multithreading, so it is not suitable for large projects.
Non-recursive structure
This method uses a queue data structure: when the Spider program finds a hyperlink, it does not call itself but adds the link to a waiting queue. When the Spider finishes scanning the current page, it visits the next hyperlink in the queue according to the chosen policy.
Although only one queue was described above, four queues are used in the actual program, each holding URLs in the same processing state:
Waiting queue: URLs in this queue are waiting to be processed by the Spider program; newly discovered URLs are added here.
Processing queue: URLs are moved here when the Spider program starts processing them.
Error queue: if an error occurs while parsing a page, its URL is moved here; URLs in this queue are never moved to another queue.
Completion queue: if a page is parsed without error, its URL is moved here; URLs in this queue are never moved to another queue.
A URL can be in only one queue at a time; we call this the state of the URL. A sketch of this bookkeeping appears after the next paragraph.
The figure above shows how URLs move between the queues. The Spider program starts running as soon as a URL is added to the waiting queue. As long as the waiting queue holds a page, or the Spider is still processing one, the program continues its work. When the waiting queue is empty and no page is currently being processed, the Spider stops.
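A minimal sketch of the four-queue state machine described above (the class and method names are mine, not from the Bot package):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch: each URL lives in exactly one of four containers, i.e. one state.
public class UrlWorkload {
    private final Queue<String> waiting = new ArrayDeque<>();
    private final Set<String> processing = new HashSet<>();
    private final Set<String> error = new HashSet<>();
    private final Set<String> complete = new HashSet<>();

    // Newly discovered URLs enter the waiting queue unless already known.
    public synchronized void add(String url) {
        if (!waiting.contains(url) && !processing.contains(url)
                && !error.contains(url) && !complete.contains(url)) {
            waiting.add(url);
        }
    }

    // The spider takes the next URL and marks it as being processed.
    public synchronized String next() {
        String url = waiting.poll();
        if (url != null) processing.add(url);
        return url;
    }

    // After parsing, the URL moves to exactly one terminal queue.
    public synchronized void done(String url, boolean failed) {
        processing.remove(url);
        if (failed) error.add(url); else complete.add(url);
    }

    // The spider stops when nothing is waiting and nothing is in flight.
    public synchronized boolean finished() {
        return waiting.isEmpty() && processing.isEmpty();
    }
}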
3.2.3 How to Construct a Spider Program
Before constructing a Spider program, let us look at how its parts work together and how the program can be extended.
The flow chart is as follows:
The ISpiderReportable interface
This is an interface the controller must implement; through its callback methods it receives the pages the Spider encounters. The interface defines the events the Spider sends to its controller, and different Spider programs can be created by supplying handlers for these events. Here is the interface declaration:
public interface ISpiderReportable {
    public boolean foundInternalLink(String url);
    public boolean foundExternalLink(String url);
    public boolean foundOtherLink(String url);
    public void processPage(HTTP page);
    public void completePage(HTTP page, boolean error);
    public boolean getRemoveQuery();
    public void spiderComplete();
}
3.2.4 How to Improve Program Performance
There is a massive number of web pages on the Internet, so developing an efficient Spider program is very important. Several techniques for improving performance are introduced below.
Java multithreading
A thread is a single path of execution through a program; multithreading is the ability to run several tasks at the same time, a division of labor inside one program.
The usual way to optimize a program is to find its bottleneck and improve it. The bottleneck is the slowest part of a program, and it holds back every other task. For example, suppose a Spider program needs to download ten pages: to complete this task it must issue requests to the server and wait for the responses, and while it is waiting it can do nothing else, which hurts efficiency. With multithreading, the waiting times for these pages overlap instead of following one another, which greatly improves performance; a sketch follows.
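A short sketch of overlapping downloads with a thread pool (the URLs are placeholders; the Bot package's own threading is organized differently):

import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: download several pages concurrently so network waits overlap.
public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "http://127.0.0.1/news.htm",    // placeholder URLs
                "http://127.0.0.1/a.htm",
                "http://127.0.0.1/b.htm");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String u : urls) {
            pool.submit(() -> {
                try (var in = new URL(u).openStream()) {
                    System.out.println(u + ": " + in.readAllBytes().length + " bytes");
                } catch (Exception e) {
                    System.err.println(u + ": " + e);   // would go to the error queue
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}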
Database technology
When a Spider program visits a large web site, it must store its queues in an effective way, because these queues hold the large lists of pages the Spider must maintain. Keeping them entirely in memory degrades performance, so we can put them in a database to reduce the consumption of system resources; a sketch follows.
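A minimal sketch of a database-backed waiting queue over JDBC (the table layout and JDBC URL are assumptions of mine; the Bot package's IWorkloadStorable interface, used in the code below, is the natural pluggable point for such a store):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: persist the URL workload in a table, assumed to be created as
//   CREATE TABLE workload (url VARCHAR(255) PRIMARY KEY, status CHAR(1))
// with status W(aiting), P(rocessing), E(rror) or C(omplete).
public class SqlWorkload {
    private final Connection conn;

    public SqlWorkload(String jdbcUrl) throws Exception {
        conn = DriverManager.getConnection(jdbcUrl);   // placeholder JDBC URL
    }

    public void add(String url) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO workload (url, status) VALUES (?, 'W')")) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    public String next() throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT url FROM workload WHERE status = 'W'");
             ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) return null;
            String url = rs.getString(1);
            mark(url, "P");
            return url;
        }
    }

    void mark(String url, String status) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE workload SET status = ? WHERE url = ?")) {
            ps.setString(1, status);
            ps.setString(2, url);
            ps.executeUpdate();
        }
    }
}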
3.2.5 Code Analysis of the Network Robot
The program structure diagram is as follows:
The program code is implemented as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import com.heaton.bot.HTTP;
import com.heaton.bot.HTTPSocket;
import com.heaton.bot.ISpiderReportable;
import com.heaton.bot.IWorkloadStorable;
import com.heaton.bot.Spider;
import com.heaton.bot.SpiderInternalWorkload;
/**
 * Constructs the bot program
 */
public class Searcher implements ISpiderReportable {
    public static void main(String[] args) throws Exception {
        IWorkloadStorable wl = new SpiderInternalWorkload();
        Searcher _searcher = new Searcher();
        Spider _spider = new Spider(_searcher,
                "http://127.0.0.1/news.htm", new HTTPSocket(), 100, wl);
        _spider.setMaxBody(100);
        _spider.start();
    }
    // Called when an internal link is found; url is the discovered URL.
    // Returning true adds it to the workload, otherwise it is ignored.
    public boolean foundInternalLink(String url) {
        return false;
    }
    // Called when an external link is found; returning true adds the URL
    // to the workload, otherwise it is ignored.
    public boolean foundExternalLink(String url) {
        return false;
    }
    // Called when another kind of link is found, i.e. not an HTML page:
    // it may be e-mail or FTP.
    public boolean foundOtherLink(String url) {
        return false;
    }
    // Processes a page; this is the actual work the Spider performs.
    public void processPage(HTTP http) {
        System.out.println("Scanning page: " + http.getURL());
        new HTMLParse(http).start();
    }
    // Called when a page has been completely processed.
    public void completePage(HTTP http, boolean error) {
    }
    // Used by the Spider to decide whether query strings should be
    // stripped from URLs; returns true if they should be removed.
    public boolean getRemoveQuery() {
        return true;
    }
    // Called when the Spider has no work left.
    public void spiderComplete() {
    }
}
3.3 Summary
This chapter first introduced the basic concepts of the network robot, then analyzed the structure and functionality of the Spider program, and finally explained the concrete code.
I used JavaTM technology in the program, mainly the net and io packages, plus a third-party development package, Bot (provided by Jeff Heaton).
Chapter 4 Indexing and Search Based on Lucene
4.1 What is Lucene full-text search
Lucene is an open-source project under Jakarta Apache. It is a full-text index engine toolkit written in Java that can easily be embedded in all kinds of applications to give them full-text indexing and retrieval.
4.2 Analysis of the Principle of Lucene
4.2.1 The Implementation Mechanism of Full-Text Search
Lucene's API is designed to be quite generic: its input and output structures resemble a database's table ==> record ==> field scheme, so many traditional applications, databases, and so on can be conveniently mapped onto Lucene's storage structure and interfaces.
Overall, you can think of Lucene as a database system that supports full-text indexing.
Index data source: doc(field1, field2...) doc(field1, field2...)
                         \  indexer  /
                        ______________
                       | Lucene index |
                        --------------
                         /  searcher  \
Result output: Hits(doc(field1, field2) doc(field1...))
Document: a "unit" to be indexed; a document consists of multiple fields.
Field: one field of a document.
Hits: the query result set, consisting of the matched documents.
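A minimal end-to-end sketch using the Lucene 1.x-era API that the code later in this chapter also uses (the index path, analyzer, and field names here are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: index one document, then search it back.
public class MiniLucene {
    public static void main(String[] args) throws Exception {
        // Indexing: "title" is analyzed and indexed, "url" is stored as-is.
        IndexWriter writer = new IndexWriter("c:\\demo\\index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("title", "hello lucene"));
        doc.add(Field.UnIndexed("url", "http://127.0.0.1/hello.htm"));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
        // Searching: parse a query against the "title" field and print hits.
        IndexSearcher searcher = new IndexSearcher("c:\\demo\\index");
        Query query = QueryParser.parse("hello", "title", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("url"));
        }
        searcher.close();
    }
}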
4.2.2 Lucene's Indexing Efficiency
Books often come with a keyword index at the back (for example: Beijing: 12, 34; Shanghai: 3, 77 ...), which helps readers find pages on a topic much faster than leafing through the whole book. Database indexes speed up queries for the same reason; imagine how much slower a lookup would be if you had to scan a book page by page. Another reason an index is fast is that it is kept sorted; for a retrieval system, the core problem is a sorting problem.
Because database indexes are not designed for full-text retrieval, they are useless for LIKE "%keyword%" queries: a LIKE query degenerates into a traversal, like flipping through a book page by page, so for database applications involving fuzzy queries, LIKE is extremely harmful to performance. Matching several keywords at once, LIKE "%keyword1%" AND LIKE "%keyword2%" ..., is worse still. The key to an efficient retrieval system is therefore an inverted-index mechanism like a book's keyword index: alongside the sequentially stored data source (say, a collection of articles), keep a sorted keyword list that stores the keyword ==> article mapping. Each index entry records [keyword ==> article number, number of occurrences (possibly even positions: start offset, end offset), frequency of occurrence], so retrieval turns a fuzzy query into several exact queries that can all use the index. This greatly improves the efficiency of multi-keyword queries, and the full-text retrieval problem ultimately reduces to a sorting (ranking) problem. Compared with a database's exact queries, fuzzy querying is an inherently imprecise problem, which is also why ordinary databases are of limited use for full-text retrieval. The core of Lucene is that, through a special index structure, it implements the full-text indexing that traditional databases are not good at, and it provides extension interfaces that make customization for different applications easy. The following table compares Lucene with a database's fuzzy query (a toy sketch of the keyword ==> document mapping appears after the table):
Lucene full-text index engine vs. database:

Index
- Lucene: builds a reverse (inverted) index over all the data in the data source.
- Database: for LIKE queries, the data index is not used at all; every record must be matched against the fuzzy pattern one by one, which is several orders of magnitude slower than an indexed search.

Matching
- Lucene: matches by term; through the language-analysis interface, support for non-English languages such as Chinese can be implemented.
- Database: LIKE "%net%" also matches "netherlands", and the multi-keyword fuzzy match LIKE "%com%net%" cannot match "xxx.net...xxx.com" with the words in reverse order.

Relevance
- Lucene: has a ranking algorithm; results with a higher degree of match (similarity) are returned first.
- Database: no control over the degree of match; a record where a keyword appears 5 times ranks the same as one where it appears once.

Result output
- Lucene: outputs only the best-matching first 100 or so results through a special algorithm, and reads the result set in small batches through a buffer.
- Database: returns the entire result set; when there are very many matching entries (say, tens of thousands), a lot of memory is needed to hold the temporary result set.

Customizability
- Lucene: by implementing language-analysis interfaces, it is easy to customize index rules for the application's needs (including Chinese support).
- Database: no interface, or the interface is too complex to customize.

Conclusion
- Lucene: suited to high-load fuzzy-query applications where the fuzzy-query rules are application-specific and the amount of indexed data is large.
- Database: suited to low load, simple fuzzy-matching rules, or applications that need little fuzzy querying.
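A toy sketch of the keyword ==> document mapping described above (all names are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy inverted index: maps each word to the IDs of the documents containing it.
public class InvertedIndex {
    // A TreeMap keeps the keyword list sorted, part of why lookups are cheap.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new ArrayList<>()).add(docId);
            }
        }
    }

    // A multi-keyword "fuzzy" query becomes several exact index lookups.
    public List<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "Lucene is a full text search engine");
        idx.addDocument(2, "Databases use LIKE for fuzzy search");
        System.out.println(idx.search("search"));   // [1, 2]
        System.out.println(idx.search("lucene"));   // [1]
    }
}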
4.2.3 Chinese Word Segmentation
For Chinese, full-text indexing must first solve a language-analysis problem. In English, the words of a sentence are naturally separated by spaces, but in Chinese, Japanese, and Korean the characters of a sentence run together, so deciding which characters form a word is itself a major problem if we want to index by word.
First, single characters cannot be used as the indexing unit, otherwise a search for "Shanghai" (上海) would also match "sea" (海) alone. But given a phrase such as 北京天安门 ("Beijing Tiananmen"), how should a computer segment it according to Chinese language habits: 北京 / 天安门 or 北 / 京天安门? Making the computer split text the way a speaker would usually requires a fairly rich dictionary to identify the words in a sentence accurately. Another solution is automatic segmentation: split the text into overlapping character pairs according to a 2-gram (bigram) scheme, for example 北京天安门 ==> 北京 / 京天 / 天安 / 安门. At query time, the query string is segmented by the same rule and the resulting bigrams are combined with an "and" relation, so whether the user queries 北京 or 天安门, the query maps onto the corresponding index entries. This approach works just as well for other Asian languages such as Korean and Japanese. The advantage of automatic segmentation is that it has no dictionary maintenance cost and is simple to implement; the disadvantage is lower index efficiency, but for small and medium applications bigram segmentation is quite sufficient. A bigram-based index is roughly the same size as the source files, whereas for English an index is generally only 30%-40% of the size of the original text. A tiny tokenizer sketch follows, and then a comparison table.
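A tiny sketch of such a bigram tokenizer (not Lucene's ChineseAnalyzer, just the idea):

import java.util.ArrayList;
import java.util.List;

// Sketch: split a CJK string into overlapping 2-grams (bigrams).
public class Bigram {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));   // each adjacent character pair
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("北京天安门"));   // [北京, 京天, 天安, 安门]
    }
}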
Automatic segmentation vs. dictionary-based segmentation:

Implementation
- Automatic segmentation: very simple to implement.
- Dictionary-based: complex.

Queries
- Automatic segmentation: adds complexity to query analysis.
- Dictionary-based: suitable for implementing more complex query syntax rules.

Storage efficiency
- Automatic segmentation: the index is redundant; it is almost as large as the original text.
- Dictionary-based: high index efficiency; the index is about 30% of the size of the original text.

Maintenance cost
- Automatic segmentation: no dictionary maintenance cost.
- Dictionary-based: dictionary maintenance cost is very high; Chinese, Japanese, and Korean need separately maintained dictionaries, and word-frequency statistics are also required.

Applicable fields
- Automatic segmentation: embedded systems (limited runtime resources); distributed systems (no dictionary-synchronization problem); multilingual environments (no dictionary maintenance cost).
- Dictionary-based: professional search engines with high requirements on query and storage efficiency.
4.3 Combination of Lucene and Spider
First, construct an Index class to index the collected content.
The code is as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import java.io.IOException;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Index {
    IndexWriter _writer = null;
    Index() throws Exception {
        _writer = new IndexWriter("c:\\news\\index", new ChineseAnalyzer(), true);
    }
    /**
     * Adds one piece of news to the index.
     * @param url   the news URL
     * @param title the news title
     * @throws java.lang.Exception
     */
    void addNews(String url, String title) throws Exception {
        Document _doc = new Document();
        _doc.add(Field.Text("title", title));
        _doc.add(Field.UnIndexed("url", url));
        _writer.addDocument(_doc);
    }
    /**
     * Optimizes the index and releases resources.
     * @throws java.lang.Exception
     */
    void close() throws Exception {
        _writer.optimize();
        _writer.close();
    }
}
Then, construct an HTML parsing class to index the news content collected by the bot program.
The code is as follows:
package news;
/**
 * News search engine
 * version 1.0
 */
import java.util.Iterator;
import java.util.Vector;
import com.heaton.bot.HTMLPage;
import com.heaton.bot.HTTP;
import com.heaton.bot.Link;

public class HTMLParse {
    HTTP _http = null;
    public HTMLParse(HTTP http) {
        _http = http;
    }
    /** Parses the web page, then builds the index. */
    public void start() {
        try {
            HTMLPage _page = new HTMLPage(_http);
            _page.open(_http.getURL(), null);
            Vector _links = _page.getLinks();
            Index _index = new Index();
            Iterator _it = _links.iterator();
            int n = 0;
            while (_it.hasNext()) {
                Link _link = (Link) _it.next();
                String _href = input(_link.getHREF().trim());
                String _title = input(_link.getPrompt().trim());
                _index.addNews(_href, _title);
                n++;
            }
            System.out.println("Indexed " + n + " news items");
            _index.close();
        }
        catch (Exception ex) {
            System.out.println(ex);
        }
    }
    /**
     * Works around the Chinese encoding problem in Java.
     * @param str the input Chinese string
     * @return the decoded Chinese string
     */
    public static String input(String str) {
        String temp = null;
        if (str != null) {
            try {
                temp = new String(str.getBytes("ISO8859_1"));
            }
            catch (Exception e) {
            }
        }
        return temp;
    }
}
4.4 Summary
When searching large amounts of data, simple database technology is very painful to use: speed becomes a huge bottleneck. This chapter therefore proposed using the full-text search engine Lucene for indexing and search.
Finally, concrete code showed how the Lucene full-text search engine and the Spider program are integrated with each other to implement news search.
Chapter 5 Web Server Based on Tomcat
5.1 What is a Tomcat-based web server
A web server is a server that provides the basic platform for information publishing, data query, data processing, and similar services on a network. The way a web server handles a page can be divided into three steps: first, the web browser sends a request for a specific page to the server; second, the web server receives the request, locates the requested page, and transfers it to the browser; third, the web browser receives the requested page and displays it. Tomcat is an open-source web application container for Java-based Servlet and JSP web applications. Tomcat is developed under the Apache Jakarta subproject and is maintained by volunteers from the open-source Java community. Tomcat Server follows the Servlet and JSP specifications, so we can say that Tomcat implements the Apache Jakarta specifications, and it does so better than most commercial application servers.
5.2 User Interface Design
5.3.1 Client Design
A good query interface is very important; Google, for example, is known for its simple query interface. In my design I likewise aimed for practicality and simplicity.
The screenshot of the query interface is as follows:
The screenshot of the search results is as follows:
5.3.2 Server Design
The server side mainly uses JavaTM Servlet technology. The user submits the query from the client through the GET method; Tomcat on the server accepts and parses the submitted parameters, then calls the Lucene development package to perform the search. Finally, the search results are sent back to the client in an HTTP response, completing one search operation.
The structure of the server servlet program is as follows:
The key code is implemented as follows:
public void search(String qc, PrintWriter out) throws Exception {
    // Create an IndexSearcher from the index directory
    IndexSearcher _searcher = new IndexSearcher("c:\\news\\index");
    // Create the Chinese analyzer
    Analyzer analyzer = new ChineseAnalyzer();
    // The query condition
    String line = qc;
    // Query is an abstract class
    Query query = QueryParser.parse(line, "title", analyzer);
    // Run the query and write each hit back to the client
    // (the HTML below is a sketch; the original output lines were truncated)
    Hits hits = _searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        out.println("<a href=\"" + doc.get("url") + "\">" + doc.get("title") + "</a><br>");
    }
    _searcher.close();
}
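A minimal sketch of the surrounding servlet (the class name and request parameter name are assumptions) showing how the search method above might be invoked:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet wrapper: reads the query sent by the client form
// via GET and delegates to the search method shown above.
public class SearchServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        String qc = request.getParameter("qc");   // assumed parameter name
        try {
            search(qc, out);                      // the method shown above
        } catch (Exception e) {
            throw new ServletException(e.toString());
        }
    }

    void search(String qc, PrintWriter out) throws Exception {
        // body as shown above
    }
}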