Profile

xiaoxiao2021-03-06  16

Lu Liang: The angel of the six wings

While developing the Booso News Search Engine, I ran into a problem: a large share of the news consists of reprints. To determine whether a news item is a reprint, I experimented and ended up implementing the "translation" (shifting) algorithm.

The translation algorithm is very easy to use: it shifts one of two articles or strings along the other and compares them to find the position with the highest overlap rate. For example, take the titles of two articles:

"Report shows China IP video communication application earlier than Western countries _ communication and telecommunications _ technology era _ Sina.com" http://tech.sina.com.cn/t/2004-12-01/1231468255.SHTML

"Authoritative survey shows China IP video communication application earlier than Western _ Sohu IT" http://it.sohu.com/20041201/n223268718.shtml

The two news items above are reprinted from the same source, but each has been modified slightly. With the translation algorithm, we fix one string, align the other string with the beginning of the first, and then shift it along step by step, computing the overlap between the two strings at each position. Where the two characters are exactly the same we record 1, otherwise 0, and then all the values are added up.

A: "________ report China IP video communication application earlier than Western countries _ communication and telecommunications _ Technology era _ Sina"
B: "authoritative institution survey shows China IP video communication application earlier than Western _ Sohu IT"
Match: 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

As you can see, the maximum overlap is found when B has been shifted to a certain position relative to A. In the example above it is 14 consecutive characters (counted on the original Chinese titles).
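
The shifting step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual implementation: the function name `best_overlap` and the choice to let B slide past both ends of A are assumptions, and the score is the sum of matching positions at each shift, as the description suggests.

```python
def best_overlap(a: str, b: str):
    """Slide b along a and return (best match count, offset of b relative to a).

    Hypothetical helper: `offset` is the index in `a` where b[0] is placed.
    Negative offsets let b hang off the left edge of a.
    """
    best_count, best_offset = 0, 0
    for offset in range(-(len(b) - 1), len(a)):
        count = 0
        for j, ch in enumerate(b):
            i = offset + j
            # Score 1 when the aligned characters are identical, 0 otherwise.
            if 0 <= i < len(a) and a[i] == ch:
                count += 1
        if count > best_count:
            best_count, best_offset = count, offset
    return best_count, best_offset


print(best_overlap("xxHELLOyy", "zHELLO"))  # (5, 1)
```

This brute-force version costs on the order of len(a) × len(b) comparisons, which is fine for titles but may need optimizing for full article bodies.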

Similarity = overlap length of A and B / (length of A + length of B - overlap length of A and B) = 14 / (33 + 25 - 14) = 14 / 44 ≈ 31.8%

Typically, when the similarity is above 20%, the two items can be judged to cover the same topic or to be reprints of the same source.
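
The similarity formula and the 20% rule of thumb can be written down as a small sketch. The names `similarity`, `is_reprint` and `REPRINT_THRESHOLD` are hypothetical, and the overlap length is assumed to have been computed beforehand by the shifting step.

```python
REPRINT_THRESHOLD = 0.20  # rule-of-thumb cutoff taken from the text


def similarity(overlap: int, len_a: int, len_b: int) -> float:
    """Overlap length divided by (len A + len B - overlap length)."""
    union = len_a + len_b - overlap
    return overlap / union if union > 0 else 0.0


def is_reprint(overlap: int, len_a: int, len_b: int,
               threshold: float = REPRINT_THRESHOLD) -> bool:
    """Treat two items as the same topic or source above the threshold."""
    return similarity(overlap, len_a, len_b) > threshold


# Worked example from the text: overlap 14, lengths 33 and 25.
print(round(similarity(14, 33, 25), 3))  # 0.318
print(is_reprint(14, 33, 25))            # True
```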

Uses of the translation algorithm:

1] Recognizing highly similar articles: reprint detection and source identification.

2] Finding the subject and discovering the core content.

For example, take the part of the two titles that matches:

A & B = "China IP video communication application earlier than Western"

This completely matched part is the core content of the articles.
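
One way to pull out such a matched part is to record the longest consecutive run of identical characters seen over all shifts. The sketch below assumes that reading of "completely matched part"; the function name `core_content` is hypothetical, and the English titles are stand-ins for the original Chinese ones.

```python
def core_content(a: str, b: str) -> str:
    """Return the longest run of consecutive matching characters over all shifts."""
    best = ""
    for offset in range(-(len(b) - 1), len(a)):
        run_start = None  # index in b where the current matching run began
        # Iterate one step past the end of b so a trailing run gets flushed.
        for j in range(len(b) + 1):
            i = offset + j
            matched = j < len(b) and 0 <= i < len(a) and a[i] == b[j]
            if matched:
                if run_start is None:
                    run_start = j
            elif run_start is not None:
                if j - run_start > len(best):
                    best = b[run_start:j]
                run_start = None
    return best


title_a = "Report shows China IP video communication application earlier than Western countries"
title_b = "Authoritative survey shows China IP video communication application earlier than Western _ Sohu IT"
print(core_content(title_a, title_b).strip())
# shows China IP video communication application earlier than Western
```

On English text the run also picks up shared words such as "shows" and the surrounding spaces, so the result differs slightly from the character-based match on the Chinese titles.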

When reprinting, please credit the original address: https://www.9cbs.com/read-45492.html