Quickly providing an RSS service "on behalf of" a website

xiaoxiao2021-03-06 146

Convenient RSS

Today RSS is everywhere: from the earliest news websites to today's weblogs, a great many sites publish RSS. RSS is very convenient for readers of news. Its biggest advantage is that passive browsing is replaced by active subscription: readers can see the news channels they care about immediately without visiting the website, which spares them the site's advertisements and uninteresting content. Netscape first proposed RSS, organizing its website content as RSS to make exchange, reading, and subscription between websites easier. Later, others recognized the convenience and potential business value of RSS, rushed to adopt it, and proposed several standards, such as RSS 0.91, RSS 0.92, RSS 1.0, RSS 2.0, and Atom. After Netscape, many websites began to provide RSS. The growing number of RSS feeds prompted the birth of news aggregator software, which makes it easy to subscribe to the RSS of your favorite channels. An RSS feed is an XML file composed mainly of <channel> and <item> tags: <channel> represents a news channel, and <item> represents a news entry in that channel.

Problem

RSS is convenient, but many websites, out of their own interests such as advertising revenue, do not provide it. What can you do when a site does not offer such a good thing? This article presents a very simple and effective method: analyze the web content of a news channel and generate an RSS file from it, which lets you provide an RSS service "on behalf of" the site in the shortest time :). We use China Daily's news channel, http://www.chinadaily.com.cn/ENGLISH/HOME/news.html, as the analysis target; China Daily is an English-language newspaper and does not provide RSS. Analyzing the target page shows that, although it contains a large number of hyperlinks, the links to news content follow a regular pattern:

http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373147.htm
http://www.chinadaily.com.cn/english/doc/2004-09/09/content_372925.htm
http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373132.htm

Evidently, the news pages produced by the publishing system that China Daily uses follow a specific format. So once we find the hyperlinks that fit this format, we have all the news links, and everything else, such as advertising, is filtered out. The whole process consists of a few steps:

1. Analyze the page: fetch the web page and collect all the hyperlinks in it.
2. Analyze the hyperlinks: pick out all the news hyperlinks.
3. Construct the RSS: organize the news links into an RSS channel.
4. Output the result: write the RSS file.
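The first step above, collecting every hyperlink on a page, can be illustrated with the JDK alone. This is only a rough sketch under assumptions: it works on an HTML string that has already been downloaded, it uses a regular expression instead of a real HTML parser (so it will miss unquoted or unusually formatted attributes), and the names LinkSketch and extractLinks are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkSketch {
    // Finds href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the href value of every anchor tag found in the given HTML text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, not taken from the target website.
        String html = "<p><a href=\"http://example.com/a.htm\">A</a> "
                + "<A class='x' HREF='/b.htm'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

A dedicated HTML parser copes with real-world markup far better than a pattern like this, which is why it is shown only to make the step concrete.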

Analyzing the page

First fetch the web page, then analyze it according to its HTML tags. A very good open source tool for web page analysis is HtmlParser, an open source project hosted on SourceForge that describes itself as "a super-fast real-time parser for real-world HTML".

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.visitors.ObjectFindingVisitor;

    Parser parserLink = new Parser("http://www.chinadaily.com.cn/english/home/news.html");
    ObjectFindingVisitor visitorLink = new ObjectFindingVisitor(LinkTag.class);
    parserLink.visitAllNodesWith(visitorLink);
    Node[] links = visitorLink.getTags();

First, construct a Parser from the URL of the page to analyze, then use a visitor to collect all the hyperlinks. The visitor returns a Node array; to HtmlParser, every HTML tag is a Node.

Analyzing the hyperlinks

After the previous step we have all the hyperlinks on the target page. Now we analyze them and pick out the ones that follow the pattern; those are the news links. Looking at the news hyperlinks in the target page, we find that they all start with http://www.chinadaily.com.cn/english/doc/ and end with content_ followed by several digits and the .htm suffix. So we only need to find the hyperlinks that fit this rule. A regular expression does this job efficiently. Written as a Java string, the regular expression is:

    ^http://www.chinadaily.com.cn/english/doc/.*/content_\\d+\\.htm$

"^" anchors the match at the start, so the link must begin with "http://www.chinadaily.com.cn/english/doc/"; "$" anchors it at the end, so the link must end with ".htm"; "\\d+" in the middle stands for one or more digits; ".*" stands for any number of arbitrary characters. The following loop uses the regular expression to go through all the hyperlinks:

    for (int i = 0; i
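The filtering step can be tried out with java.util.regex from the JDK. The following is a minimal sketch under assumptions, not the author's code: the class name NewsLinkFilter, the filter method, and the sample links in main (including the invented advertisement link) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NewsLinkFilter {
    // The article's pattern as a Java string literal (backslashes doubled).
    static final Pattern NEWS = Pattern.compile(
            "^http://www.chinadaily.com.cn/english/doc/.*/content_\\d+\\.htm$");

    // Keeps only the links that match the news pattern, preserving order.
    static List<String> filter(List<String> links) {
        List<String> news = new ArrayList<>();
        for (String link : links) {
            if (NEWS.matcher(link).matches()) {
                news.add(link);
            }
        }
        return news;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373147.htm",
                "http://www.chinadaily.com.cn/english/home/news.html",
                "http://ads.example.com/banner.htm"); // hypothetical ad link
        // Only the first link has the content_<digits>.htm form.
        System.out.println(filter(links));
    }
}
```

The dots in the host name are left unescaped, as in the article's expression, so each one matches any character; writing them as \\. would make the pattern stricter.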

Constructing the RSS

Now that we have all the news hyperlinks, the next step is to construct the RSS :) For this we use another very good open source tool, Informa. It is also hosted on SourceForge, and its goal is "to provide a news aggregation library based on the Java platform". First we construct an RSS channel.

    import java.net.URL;
    import de.nava.informa.core.ChannelIF;
    import de.nava.informa.impl.basic.Channel;

    ChannelIF channel = new Channel("chinadaily news");
    channel.setLanguage("en");
    channel.setSite(new URL("http://www.chinadaily.com.cn/ENGLISH/HOME/news.html"));
    channel.setDescription("this is a demo for topic 'extract news to rss'. All the test material is chinadaily.com.cn's news");

Then each news hyperlink found earlier is turned into an RSS item and added to the RSS channel. Here linkTag is one of the LinkTag nodes collected by HtmlParser, and newsPattern stands for the regular expression above.

    if (linkTag.getLink().matches(newsPattern)) {
        ItemIF item = new Item();
        item.setTitle(linkTag.getLinkText());
        item.setLink(new URL(linkTag.getLink()));
        item.setDescription("please visit " + linkTag.getLink() + " to find more.");
        // add item to rss channel
        channel.addItem(item);
    }

RSS output

OK, the work is basically done; all that remains is to output the result.

    ChannelExporterIF exporter = new RSS_0_91_Exporter("news.xml");
    exporter.write(channel);
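For readers without Informa at hand, the shape of such a file can be approximated with the XML classes that ship with the JDK. This is a minimal sketch, not Informa's implementation: the names RssSketch, child, and toRss are invented for illustration, and only the handful of elements discussed in this article are written.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class RssSketch {
    // Appends <name>text</name> (or an empty <name/> if text is null) to parent.
    static Element child(Document doc, Element parent, String name, String text) {
        Element e = doc.createElement(name);
        if (text != null) {
            e.setTextContent(text); // special characters are escaped on output
        }
        parent.appendChild(e);
        return e;
    }

    // Builds an RSS 0.91 style document; each items[i] is {title, link}.
    static String toRss(String title, String site, String description,
                        String language, String[][] items) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element rss = doc.createElement("rss");
            rss.setAttribute("version", "0.91");
            doc.appendChild(rss);

            Element channel = child(doc, rss, "channel", null);
            child(doc, channel, "title", title);
            child(doc, channel, "description", description);
            child(doc, channel, "link", site);
            child(doc, channel, "language", language);

            for (String[] it : items) {
                Element item = child(doc, channel, "item", null);
                child(doc, item, "title", it[0]);
                child(doc, item, "link", it[1]);
                child(doc, item, "description",
                        "please visit " + it[1] + " to find more.");
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build RSS document", e);
        }
    }

    public static void main(String[] args) {
        String[][] items = {{
            "China Sends 2 Satellites Into Preset Orbits",
            "http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373147.htm"
        }};
        System.out.println(toRss("chinadaily news",
                "http://www.chinadaily.com.cn/ENGLISH/HOME/news.html",
                "this is a demo for topic 'extract news to rss'.", "en", items));
    }
}
```

Building the document through DOM rather than string concatenation means characters such as & in a headline are escaped automatically.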

The exporter, like the channel classes, is provided by Informa, and the RSS 0.91 format is used here. Let us look at part of the result.

    <title>chinadaily news</title>
    <description>this is a demo for topic 'extract news to rss'. All the test material is chinadaily.com.cn's news</description>
    <link>http://www.chinadaily.com.cn/ENGLISH/HOME/news.html</link>
    <language>en</language>
    <item>
    <title>· China Sends 2 Satellites Into Preset Orbits</title>
    <link>http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373147.htm</link>
    <description>please visit http://www.chinadaily.com.cn/english/doc/2004-09/09/content_373147.htm to find more.</description>
    </item>

Conclusion

This article describes a very simple and effective method for extracting links from a web page and organizing them into RSS, which lets you provide RSS for a favorite website in the shortest possible time. It only extracts the news headlines, however. If you plan to extract the news content as well, some further work is needed.

References

Analysis target website: http://www.chinadaily.com.cn/ENGLISH/HOME/news.html

Web page analyzer (HtmlParser): http://htmlparser.sourceforge.net/

RSS library (Informa): http://informa.sourceforge.net/

An RSS aggregator written in Java (RSSOwl): http://rssowl.sourceforge.net/

About regular expressions: http://java.sun.com/docs/books/tutorial/extra/Regex/

RSS 0.91 standard: http://my.netscape.com/publish/mmats/rss-spec-0.91.html

RSS 2.0 standard: http://blogs.law.harvard.edu/tech/rss

About Atom: http://www.atomenabled.org/

Author: imibpig, Tuesday, October 12, 2004, 16:41:10