Try Nutch
Looking through this site's logs today, I found several referrer links related to Nutch. In fact I had only mentioned the word in passing in my Java coding conventions post, so the results must have disappointed the friends who came. So here I am publishing some notes from my trial of Nutch for anyone interested. Note that Nutch does not yet have a stable release, is still being revised continually in response to feedback, and does not support Chinese retrieval. All in all, the current version is not practical for Chinese users. I think this is also why, although I have been studying and following Nutch, I had not written any notes on it. After a chat over MSN a few days ago, my feeling is that the best option for searching this site at present is a package built on WebLucene, an open source project based on Lucene. Nutch seems better suited to building a vertical search engine site; at least, that is my impression.
1, Download and installation
For some reason the Nutch website cannot be reached directly from here. I used the 2003-09-18 package (interested friends can download it here) and tried it on Red Hat Linux 8.0 with JRE 1.4.1 and Tomcat 4.1.
tar zxvf nutch-2003-09-18.tar.gz
cd nutch-2003-09-18    <---- this directory is referred to below as $NUTCH_HOME, purely for ease of description
ant
ant package
bin/nutch    <--- if everything is normal, a message such as "Usage: nutch COMMAND" should appear
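Before running the build, it can save time to confirm that the tools these steps rely on are actually on the PATH. The following is only a minimal sketch: the function name missing_tools is made up for illustration, and the list of tools (tar, java, ant) is simply what the commands above appear to need.

```shell
#!/bin/bash
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of Nutch or Ant.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# tar, java and ant are the tools the installation steps use.
missing=$(missing_tools tar java ant)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all build tools found"
fi
```

If anything is reported as missing, install it or fix the PATH before running ant.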
2, Notes on the test-run script
This script is trimmed and tidied up from the tutorial. The commands are best run one at a time, in order. The assignments to $s1, $s2 and $s3 use the same expression, but the three values differ because each depends on the moment it is run. When I first ran them I made a careless mistake on exactly this point, and the results came out wrong. :)
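To see why one and the same expression yields three different values, here is a small sketch that can be run without Nutch. It uses the same selection as the script, ls -d segments/2* | tail -1, wrapped in a hypothetical helper named latest_segment. The timestamp-style directory names are invented for the demonstration and only assume that newer segment directories sort after older ones.

```shell
#!/bin/bash
# Return the last directory matching <dir>/2*, the same selection as
# `ls -d segments/2* | tail -1` in the script.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Demonstration in a throwaway directory; the directory names are made up.
demo=$(mktemp -d)
mkdir "$demo/20030918101500"
first=$(latest_segment "$demo")
mkdir "$demo/20030918113000"      # stands in for a later "generate" run
second=$(latest_segment "$demo")
echo "after first generate:  $(basename "$first")"
echo "after second generate: $(basename "$second")"
rm -rf "$demo"
```

The value changes only because a new directory appeared between the two calls, which is why each assignment has to be run right after its own generate step.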
Initial preparation:
mkdir db    <--- create a directory to store the web database
mkdir segments
bin/nutch admin db -create    <--- create a new, empty database
First round of crawling:
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000    <--- take URLs from the DMOZ list and add them to the database
bin/nutch generate db segments    <--- generate a fetch list (fetchlist) from the database contents
s1=`ls -d segments/2* | tail -1`    <--- the newly generated fetch list is in the last directory; take its name
bin/nutch fetch $s1    <--- fetch the pages with the robot
bin/nutch updatedb db $s1    <--- update the database with the fetch results
Second round of crawling:
bin/nutch analyze db 5    <--- analyze the page links, 5 iterations
bin/nutch generate db segments -topN 1000    <--- generate a new fetch list from the top 1000 URLs
s2=`ls -d segments/2* | tail -1`    <--- then fetch, update, and analyze the links for 2 iterations
bin/nutch fetch $s2
bin/nutch updatedb db $s2
Third round of crawling:
bin/nutch analyze db 2
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2    <--- (in preparation for the next round?)
Indexing and removing duplicates:
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
Restart Tomcat:
catalina.sh start    <--- start it from the directory that contains ./segments
3, Script changes and download
The DMOZ file is very large and not easy to download, and for a simple experiment there seems to be no need to pick URLs from it. I changed the script: create a urls.txt file in the $NUTCH_HOME directory containing one URL per line for each website you intend to search, and Nutch will take the sites' URLs from this urls.txt.
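As an illustration of the one-URL-per-line format, the sketch below writes a small urls.txt and counts lines that do not look like URLs. The two sites are placeholders, bad_lines is a hypothetical helper, and treating only http:// lines as valid is my own assumption rather than a documented Nutch rule.

```shell
#!/bin/bash
# Write a seed list with one URL per line; the two sites are placeholders.
# Run this from the directory where the crawl script expects urls.txt.
cat > urls.txt <<'EOF'
http://www.example.com/
http://www.example.org/
EOF

# Count lines that are neither blank nor a single http:// URL.
# (Assumption: only http:// seeds are wanted.)
bad_lines() {
    grep -v -c -E '^(http://[^[:space:]]+)?$' "$1"
}

echo "suspicious lines in urls.txt: $(bad_lines urls.txt)"
```

A non-zero count usually means a missing scheme or stray text on a line.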
The script can be downloaded from the references at the end. Run it in the $NUTCH_HOME directory, for example with:
sh all.sh
4, Providing web search
All the work so far has only fetched the pages, parsed them, and built an index. Here is how to offer a search service using the JSP application that comes with Nutch.
cd $TOMCAT_HOME/webapps
mv ROOT rootold
mkdir ROOT
cd ROOT
cp $NUTCH_HOME/nutch-2003-09-18.war ./root.war
jar xvf root.war
cd $NUTCH_HOME
$TOMCAT_HOME/bin/shutdown.sh
$TOMCAT_HOME/bin/catalina.sh start
At this point, barring accidents, the search page should be accessible.
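If the page does not come up, one thing worth checking is whether the war file was really unpacked into the web application directory. The sketch below assumes that the unpacked application contains WEB-INF/web.xml, as web applications of this kind normally do, and that the directory is Tomcat's usual webapps/ROOT. Both are assumptions, and webapp_unpacked is a made-up helper name; pass a different path as the first argument if your directory differs.

```shell
#!/bin/bash
# Succeed if the directory looks like an unpacked web application,
# i.e. it contains WEB-INF/web.xml (an assumption about the war layout).
webapp_unpacked() {
    [ -f "$1/WEB-INF/web.xml" ]
}

# Default path assumes TOMCAT_HOME is set and the webapp lives in ROOT.
root="${1:-${TOMCAT_HOME:-.}/webapps/ROOT}"
if webapp_unpacked "$root"; then
    echo "$root looks unpacked"
else
    echo "$root has no WEB-INF/web.xml"
fi
```

This only checks the files on disk; it says nothing about whether Tomcat itself started correctly.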
My trial URL is http://cdls.nstl.gov.cn/se/ (I modified Nutch so that it does not sit in the root directory), for reference. For now, do not search for Chinese characters; only English works, for example Hedong or Lucene.
This trial was done in a hurry, so mistakes are inevitable. Friends are welcome to get in touch and compare notes.
References:
Lucene/XML-based full-text site search solution http://www.chedong.com/tech/weblucene.html
Lucene study notes (2) http://hedong.3322.org/archives/000208.html
running.sh script
--------------------------------------
#!/bin/bash
# preparation and first round
mkdir db
mkdir segments
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments
s1=`ls -d segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb db $s1
bin/nutch analyze db 5
# second round
bin/nutch generate db segments -topN 100
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
bin/nutch analyze db 2
# third round
bin/nutch generate db segments -topN 100
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2
# index each segment
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
# remove duplicates
bin/nutch dedup segments dedup.tmp
------------------------------
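The second and third rounds of the script are identical apart from the variable name, so they could be folded into one function. What follows is only a sketch of that idea, not the script I actually ran: crawl_round and last_seg are made-up names, and the NUTCH variable is a hypothetical override (not a Nutch feature) that lets the flow be tried with a stand-in command.

```shell
#!/bin/bash
# Sketch only: rounds 2 and 3 of the script expressed as one function.
# NUTCH is a hypothetical override so the flow can be tried without Nutch.
NUTCH=${NUTCH:-bin/nutch}

# crawl_round TOPN ITERATIONS
# Generates a fetch list, fetches it, updates the database and analyzes
# links. The segment that was used is left in the global variable last_seg.
crawl_round() {
    local topn=$1 iterations=$2
    $NUTCH generate db segments -topN "$topn" || return 1
    last_seg=$(ls -d segments/2* | tail -1)    # newest fetch list
    echo "$last_seg"
    $NUTCH fetch "$last_seg" || return 1
    $NUTCH updatedb db "$last_seg" || return 1
    $NUTCH analyze db "$iterations"
}

# Usage, from $NUTCH_HOME after the first round has run:
#   crawl_round 100 2; s2=$last_seg
#   crawl_round 100 2; s3=$last_seg
```

Copying last_seg into s2 and s3 straight after each call keeps the later index steps unchanged.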