From: http://hedong.3322.org/archives/000218.html
LARM is a web crawler, a robot that fetches web pages, written in pure Java.
According to the author's account, writing a crawler is far harder than you would imagine. The HTML specification is too lax, so there are all kinds of odd HTML files out there. The network is also highly unpredictable, and almost any problem can turn up. These surprises are what really test a crawler.
As a subproject of Lucene, LARM is still in development. There is not even a stable release; it is available only through CVS. The documentation is also inconsistent, which is a common problem for projects still in development. Even so, the scattered documents do explain LARM's concepts, and it has a Wiki page. I don't know why it still has a presence on SourceForge as well (there are a few more RTF documents there).
The LARM source code includes a GUI, so I ran it, but no matter how I clicked "START" nothing happened. Very depressing. I looked at the source code and found "// to do: code goes here."; there is simply no code to handle that click event. FT!
Leaving aside its relationship with Lucene, LARM still has some value as a standalone crawler. I fetched the project, compiled it, and started a crawl at http://hedong.3322.org. Because there was no restriction on domain names, it quickly reached more than 5,500 domains and had downloaded 300 MB by the time the crawl was interrupted.
mkdir jakarta
cd jakarta
cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic login
password: anoncvs
cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic checkout jakarta-lucene-sandbox
cd jakarta-lucene-sandbox/contributions/webcrawler-LARM
ant dist
Add build/webcrawler_larm-0.5.jar and all the JARs in the libs/ directory to the CLASSPATH.
java -server de.lanlab.larm.fetcher.FetcherMain -start http://hedong.3322.org
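
The crawl described above followed links to any host, which is how it spread to thousands of domains. The following is a minimal sketch of the kind of host filter that would keep a crawl on a single site. It is not LARM's own code or API: the class name HostFilter and the method accepts are hypothetical, and it assumes the crawler can check each extracted link before queueing it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper (not part of LARM): accepts only URLs on one host. */
final class HostFilter {
    private final String allowedHost;

    HostFilter(String allowedHost) {
        // Host names are case-insensitive, so compare in lower case.
        this.allowedHost = allowedHost.toLowerCase(Locale.ROOT);
    }

    /** Returns true if the URL is http(s) and its host equals the allowed host. */
    boolean accepts(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false; // relative link or no host: reject
            }
            boolean isHttp = scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
            return isHttp && host.toLowerCase(Locale.ROOT).equals(allowedHost);
        } catch (URISyntaxException e) {
            return false; // unparsable link: skip it instead of failing the crawl
        }
    }
}
```

With new HostFilter("hedong.3322.org"), a link such as http://hedong.3322.org/archives/000218.html would be queued, while links to any other host would be dropped. An exact host match is the strictest choice; a real crawler might prefer a suffix or pattern match to include subdomains.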