Search Engine Retrieval Efficiency and Spam



Student ID: 0107390143  Name: Zhang Wei

Abstract: On the Internet, the amount of information is growing at an astonishing speed, and search engines are the primary means by which users find information. However, because of users' own shortcomings, the limitations of search engine design, and the relatively weak maintenance of websites, search engine efficiency (that is, precision, recall, repetition rate, dead-link rate, and so on) remains unsatisfactory, and the large volume of useless spam in results is a great inconvenience to users. This article mainly discusses how to reduce spam in search results.

Keywords: search engine, search, spam

· 1 How search engines work

The principle of search engines originated in traditional full-text information retrieval theory: a computer program builds an index file over each article, and the retrieval program ranks the articles containing the search terms according to the frequency with which each term occurs in each article and the positions at which it appears, finally outputting the sorted result.
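The frequency-based ranking described above can be sketched in a few lines. This is a minimal illustration, not a real engine; the document names and contents are invented.

```python
# Minimal sketch of frequency-based full-text ranking: score each document
# by how often the query terms occur in it, then sort descending.
from collections import Counter

def score(doc_words, query_terms):
    """Total number of occurrences of the query terms in one document."""
    counts = Counter(doc_words)
    return sum(counts[t] for t in query_terms)

def rank(docs, query_terms):
    """Return document ids sorted by descending query-term frequency."""
    scored = [(score(words, query_terms), doc_id) for doc_id, words in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in scored if s > 0]

docs = {
    "d1": "search engine spam search".split(),
    "d2": "spam filter".split(),
    "d3": "weather report".split(),
}
print(rank(docs, ["search", "spam"]))  # d1 has 3 hits, d2 has 1, d3 has none
```

Real engines refine this with term positions and link analysis, but the core idea of ordering by term frequency is the same.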

Besides the full-text retrieval system, an Internet search engine must also collect pages from the Internet automatically; in addition, it needs a database-building system and other auxiliary systems that support retrieval. In fact, a search engine consists of several subsystems, of which three are the main ones: the information collection subsystem, the indexing subsystem, and the retrieval subsystem. The information collection subsystem gathers index information, such as keywords, authors, and titles, from the resources provided by information source sites; the indexing subsystem classifies and organizes the index information, removes duplicates, and stores the result in the search engine's database; the retrieval subsystem provides the user interface and returns results to the user according to the user's request.

· 2 Common types of search engines

The earliest search engines on the Internet were robot-based. They use a program called a robot (or spider) to visit web sites automatically, extract the pages on a site, and follow the links on each page to extract further pages, possibly moving on to other sites. The pages collected by the robot are added to the search engine's database for users to query. The original meaning of "search engine" referred only to such robot-based engines. Besides these, there are directory search engines and, more recently, meta search engines. A directory engine's database is compiled by full-time editors, while a meta search engine stores no database of web information itself: when a user queries a keyword, it converts the user's request into the command formats accepted by other search engines, submits the keyword to several of them, processes the returned results, and passes them back to the user. In the end, all of these query processes rest on robot-based engines, and all complete the user's request through full-text retrieval.
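The meta-search idea above can be sketched as follows. The "engines" here are just local functions standing in for remote search services, and all names and result lists are invented.

```python
# A toy meta search engine: it holds no index of its own, forwards the
# query to several engines, and merges their results without duplicates.
def engine_a(query):
    # Stand-in for a remote engine's response to the query.
    return ["a.com/1", "b.com/2"] if "spam" in query else []

def engine_b(query):
    return ["b.com/2", "c.com/3"] if "spam" in query else []

def meta_search(query, engines):
    """Forward the query to each engine and merge results, keeping order
    of first appearance and dropping duplicate URLs."""
    merged, seen = [], set()
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("spam", [engine_a, engine_b]))
```

A real meta engine would also translate the query into each engine's syntax and re-rank the merged list; this sketch shows only the forward-and-merge step.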

· 3 Defining spam

For users, spam in search results is a real headache: it wastes time and, worse, interferes with obtaining correct information. So what exactly is spam?

Different search engines define spam somewhat differently. Google, for example, lists the following as forms of spam:

(1) Hidden text or hidden links

(2) Misleading or repeatedly stuffed keywords

(3) Pages that send automated queries to Google

(4) Cloaked (disguised) pages

(5) Deceptive redirects and URLs

(6) Doorway pages created just for search engines

(7) Duplicated sites or pages

In addition, Google considers several other practices to be spam, such as pairing images with unrelated words, presenting the same content under several domain names or subdomains, linking to sites regarded as low quality, and using URLs easily confused with those of well-known websites. The search engine Inktomi considers the main forms of spam to be:

(1) Hidden or deceptive text

(2) Meta-tag content unrelated to the actual content of the page

(3) Redirecting URLs with no clear design purpose

(4) Techniques that make large numbers of identical pages appear in the search results

(5) Deliberately arranged link exchanges (link farms)

(6) Doorway or hidden pages that do not reflect the actual site

(7) Automatically generated masses of unrelated junk links

As can be seen from the above, the well-known search engines are largely consistent in how they define spam.

· 4 Methods of reducing spam in search results

· 4.1 The user's approach

There are still many techniques for using a search engine well. The following takes Google as an example to introduce the application of traditional retrieval strategies to network information retrieval.

When using Google, spam can be reduced by the following methods:

· 4.1.1 Use more search terms

Since Google's default Boolean operator between search terms is AND, using more search terms is helpful: related terms connected together further constrain the topic and restrict one another. ANDing several search terms narrows the search range and raises precision.
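The narrowing effect of AND can be shown concretely: each added term intersects the set of matching documents, so the result set can only shrink. The documents below are invented for illustration.

```python
# Boolean AND retrieval: a document matches only if it contains every term,
# so each extra term narrows the result set.
docs = {
    "d1": {"search", "engine", "spam"},
    "d2": {"search", "engine"},
    "d3": {"search"},
}

def and_search(docs, terms):
    """Return ids of documents containing every query term."""
    return sorted(d for d, words in docs.items() if all(t in words for t in terms))

print(and_search(docs, ["search"]))                    # 3 hits
print(and_search(docs, ["search", "engine"]))          # 2 hits
print(and_search(docs, ["search", "engine", "spam"]))  # 1 hit
```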

· 4.1.2 Methods of increasing recall

(1) Use broader terms.

(2) Set the language to any language, the time to any time, and the region to any region.

(3) Use the "similar pages" feature to find related pages starting from one already retrieved. In some search engines, search terms can also be connected with OR, and wildcards (*) can be used to broaden the search range. Conversely, to keep a word out of the results: in Google, add a minus sign ("-", the English character) directly in front of the word to be excluded, leaving a space before the minus.
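The broadening and excluding operators above can be sketched together: OR widens the match set, while the minus operator removes documents containing an unwanted term. Documents are invented.

```python
# Boolean OR broadens retrieval; an exclusion list narrows it again,
# mimicking the minus operator described above.
docs = {
    "d1": {"search", "spam"},
    "d2": {"retrieval", "spam"},
    "d3": {"search", "advert"},
}

def or_search(docs, any_of, none_of=()):
    """Match documents containing at least one `any_of` term (OR) and
    none of the `none_of` terms (minus/exclusion)."""
    return sorted(
        d for d, words in docs.items()
        if any(t in words for t in any_of)
        and not any(t in words for t in none_of)
    )

print(or_search(docs, ["search", "retrieval"]))            # OR broadens: 3 hits
print(or_search(docs, ["search", "retrieval"], ["spam"]))  # minus excludes two
```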

· 4.1.3 Restrict the fields in which terms appear

Certain words followed by a colon have a special meaning to Google; one of them is "site:". To search within a specific domain or site, enter "site:xxx.com" in the Google search box; to require a term to appear in the page title, use "intitle:".

· 4.1.4 Restrict the document type

To restrict results to Word documents, use "filetype:doc"; for PDF documents, use "filetype:pdf".

· 4.1.5 Use phrases

In Google, if two words should always appear one immediately after the other with nothing in between, enclosing them in quotation marks retrieves them as a single unit. This is equivalent to the adjacency-restriction method in traditional retrieval strategy.
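The difference between plain AND matching and phrase matching can be sketched as follows: a quoted phrase requires the words to be adjacent and in order, not merely present somewhere in the page. The example document is invented.

```python
# Phrase matching: the phrase's words must occur consecutively and in
# order in the document, unlike plain AND matching.
def contains_phrase(words, phrase):
    """True if the phrase's words occur consecutively in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

doc = "the search engine returns ranked results".split()
print(contains_phrase(doc, ["search", "engine"]))   # adjacent, in order: True
print(contains_phrase(doc, ["engine", "search"]))   # wrong order: False
print(contains_phrase(doc, ["search", "results"]))  # not adjacent: False
```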

· 4.2 The technical approach

· 4.2.1 Linguistics in network retrieval

The current application of linguistics to network information retrieval is manifested mainly in the construction of the Semantic Web and in semantic retrieval.

The foundation of the Semantic Web is the ontology. The widely accepted definition is Gruber's (1993): an ontology is an explicit formal specification of a conceptualization shared within a particular domain. In general, an ontology consists of a set of concepts, a set of properties and the relations between them, and a set of inference rules. The Semantic Web consists of three layers: the metadata layer, the schema layer, and the logic layer.

Metadata layer: the data model of this layer contains the concepts of resources and properties. RDF (Resource Description Framework) is currently considered the most popular data model for the metadata layer. Schema layer: this layer introduces web ontology languages to give a systematic description of concepts and properties; RDFS (RDF Schema) is currently a relatively mature such language.
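The metadata layer's RDF model can be sketched without any RDF library: resources are described by (subject, predicate, object) triples, and queries pattern-match against them. The resource names and properties below are invented for illustration.

```python
# A toy RDF-style triple store: each statement about a resource is a
# (subject, predicate, object) triple, and queries match patterns over it.
triples = {
    ("page1", "dc:title",   "Search engine spam"),
    ("page1", "dc:creator", "Zhang Wei"),
    ("page2", "dc:title",   "Information retrieval"),
}

def objects(triples, subject, predicate):
    """All objects matching the pattern (subject, predicate, ?)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

print(objects(triples, "page1", "dc:title"))
print(objects(triples, "page1", "dc:creator"))
```

Real RDF adds URIs, serialization formats, and the RDFS schema vocabulary on top, but the triple pattern is the core of the data model.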

Logic layer: this layer introduces more expressive web ontology languages that can capture the logic of descriptions well. OIL (Ontology Inference Layer) and DAML+OIL (DARPA Agent Markup Language plus Ontology Inference Layer) are two popular logic-layer languages.

The construction of the Semantic Web makes semantics-based search engines possible. In a semantic search engine, every query is executed within the context of some ontology, and guidance from the ontology can improve retrieval precision. Semantic retrieval uses concept matching: concepts are automatically extracted from documents, the user selects, with the system's help, appropriate terms to express the information need, and the two sides are then matched at the concept level, that is, on identical, similar, or related concepts.
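Concept matching can be sketched by mapping surface words to concept identifiers before comparing, so that synonyms match even when the literal words differ. The concept table below stands in for an ontology and is entirely invented.

```python
# Concept matching: map query and document words to concept ids first,
# then intersect at the concept level rather than the word level.
concept_of = {
    "car": "vehicle", "automobile": "vehicle",
    "spam": "junk",   "junk": "junk",
}

def concept_match(query_terms, doc_terms):
    """Shared concepts between query and document (words map to
    themselves when the ontology has no entry for them)."""
    q = {concept_of.get(t, t) for t in query_terms}
    d = {concept_of.get(t, t) for t in doc_terms}
    return sorted(q & d)

# "automobile" and "car" differ as words but share the concept "vehicle".
print(concept_match(["automobile"], ["car", "engine"]))
```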

· 4.2.2 Data mining technology in network retrieval

In information retrieval, data mining technology is applied mainly in four areas: first, the study of the users themselves; second, the study of information semantics; third, assisting web-page classification; and finally, the study of how information is expressed.

1) The study of the users themselves:

For the same search terms, different users may intend different meanings. This implies that information retrieval should be personalized to some degree: only when the retrieved information matches the user's need is the search successful. Data mining can track a user's retrieval behaviour, including retrieval times, frequency, browsed content, downloaded content, and the search terms used, as a basis for analysing the user's information needs. By mining user interests and retrieving according to them, personalized needs can be satisfied on the one hand, and precision will inevitably improve on the other.
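A crude version of this interest mining can be sketched from a search log alone: count the user's frequent terms and promote results that mention them. The log and result titles are invented.

```python
# Mine a user's search log for frequent terms, then re-rank results so
# that items matching the mined interests come first (stable sort).
from collections import Counter

log = ["python", "python", "search", "search", "python", "java"]

def interest_profile(log, top_n=2):
    """The user's most frequent search terms, as a crude interest model."""
    return [term for term, _ in Counter(log).most_common(top_n)]

def personalize(results, profile):
    """Move results mentioning a profiled interest to the front."""
    return sorted(results, key=lambda r: not any(p in r for p in profile))

profile = interest_profile(log)
print(profile)
print(personalize(["java tips", "python tips", "cooking"], profile))
```

A real system would weight recent behaviour more heavily and track clicks and downloads, not just query terms; this shows only the count-and-promote core.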

2) The study of information semantics:

By mining the relationships between terms, the system can, during retrieval, suggest a series of synonyms and related terms for the user to apply. Core vocabulary mined in this way can be used on each retrieval to improve precision. Google, for example, has introduced such semantic relationships: when a user mistypes a search term, the system corrects it and asks whether the suggested term is what the user intended.
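The "did you mean" behaviour just mentioned can be sketched with edit distance: the misspelled term is replaced by the vocabulary word requiring the fewest single-character edits. The vocabulary is invented; real systems also use query-log statistics.

```python
# Spelling suggestion via Levenshtein edit distance: suggest the
# vocabulary word closest to the (possibly misspelled) query term.
def edit_distance(a, b):
    """Classic Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(term, vocabulary):
    """The vocabulary word with the smallest edit distance to `term`."""
    return min(vocabulary, key=lambda w: edit_distance(term, w))

print(suggest("serch", ["search", "spam", "engine"]))  # one edit away
```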

3) Assisting web-page classification:

A web page expresses information in interconnected parts, and generally each part of a page may correspond to a different theme. In theory, results show high relevance only when the search terms appear within the same topic area; if the terms are scattered across different topics, the relevance to the retrieval topic is small. With a suitable mining algorithm, data mining can distinguish the different topics within a page, providing a new basis for judging relevance in search.

4) Research on information expression:

Information on the web is messy, and data must be organized to achieve accurate classification. Classification is in fact a field closely related to retrieval: if a query can be accompanied by category information, the retrieval range is inevitably reduced and precision rises. At the same time, classification gives information a good organizational structure, making it easy for users to browse and filter. Some websites already provide category browsing, such as Yahoo, but current site categories are often ambiguous, classification accuracy is relatively low, and updates are slow. Data mining technology can improve classification accuracy and thereby retrieval efficiency.
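The claim that a query category reduces the retrieval range can be sketched directly: with documents pre-classified, retrieval scans only the requested category instead of the whole collection. Categories and documents are invented.

```python
# Category-restricted retrieval: documents are pre-classified, and a
# query with a category scans only that category's documents.
classified = {
    "sports": {"d1": {"football", "score"}, "d2": {"tennis"}},
    "tech":   {"d3": {"search", "engine"}, "d4": {"python"}},
}

def search_in_category(classified, category, term):
    """Scan only one category's documents, not the whole collection."""
    return sorted(d for d, words in classified.get(category, {}).items()
                  if term in words)

print(search_in_category(classified, "tech", "search"))  # only 2 docs scanned
```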

· 4.2.3 Intelligent agent technology

"Agent" is generally used to refer to any software system with intelligent, autonomous properties. In 1995, Wooldridge and Jennings gave an authoritative definition of the agent: (1) weak definition: an agent is a software or hardware computer system with autonomy, social ability, reactivity, and pro-activeness.

(2) Strong definition: in addition to the features of the weak definition, an agent also has human-like characteristics such as knowledge, belief, and intention.

In general, an agent's basic characteristics include autonomy, pro-activeness, reactivity, and continuous execution; its non-basic characteristics include social ability, personalization, adaptability, mobility, emotion, and personality. In information retrieval, intelligent agent technology is used mainly to build models of users' information needs and to filter information in network retrieval. Information filtering generally takes three forms: content-based filtering, collaborative filtering, and economic filtering.
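Of the three filtering forms, content-based filtering is the easiest to sketch: score each incoming document against the user's profile vector and keep only items above a threshold. The profile, documents, and threshold below are invented.

```python
# Content-based filtering: compare each document's term-weight vector
# with the user profile by cosine similarity and keep close matches.
def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda x: sum(w * w for w in x.values()) ** 0.5
    return dot / (norm(u) * norm(v)) if u and v else 0.0

profile = {"search": 1.0, "engine": 1.0}
docs = {
    "d1": {"search": 1.0, "engine": 1.0},
    "d2": {"cooking": 1.0},
}

def filter_docs(docs, profile, threshold=0.5):
    """Keep documents whose similarity to the profile passes the threshold."""
    return sorted(d for d, vec in docs.items()
                  if cosine(profile, vec) >= threshold)

print(filter_docs(docs, profile))  # only the matching document survives
```

Collaborative filtering would instead compare this user's behaviour with other users', and economic filtering would weigh cost against value; both need more machinery than fits a sketch.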

· 4.2.4 Optimizing the robot algorithm

Since this topic is rather difficult and quite specialized, it is not discussed further here; interested readers may refer to the paper "Optimization of the Robot Search Algorithm in Search Engines" in Journal of Information, No. 4, 2002.

In summary, this paper has explored the working principles of search engines, the types of spam, and how to reduce spam in search engine results.

References

Li Lizhen, Wu Xingxing, Yu Xueli, Zhang Shumei. "Search Engine Retrieval Quality Control". Journal of Taiyuan University of Technology, No. 3, 2003.

Guo Jiayi. "Research on Network Information Retrieval Efficiency". Library and Information, November 4, 2002.

Song Jiping, Wang Yongcheng, Teng Wei, Xu Xiqing. "Optimization of the Robot Search Algorithm in Search Engines". Journal of Information, No. 4, 2002.

Marketing staff network, http://www.marketingman.net

http://www.google.com/contact/spamreport.html

http://www.searchenginewatch.com/searchday/Article.php/2159061

Ma Bun, Li Heng. "A Brief Discussion of Search Engine Performance Evaluation".

