HTMLContentParser ASP.NET Project Using VB.NET

xiaoxiao2021-03-06 31

The project I am posting today does not have much practical or functional value to anyone, although I think it is pretty cool for web designers and developers. (I am sure there will be detractors out there.) It is just meant to showcase some ASP.NET objects, how easy they are to use, and some simple string manipulation.

This project is an HTML content parser. It fetches a stream of HTML content from the web page at a specified URL. It then goes through the whole extracted stream, picks out the HTML hyperlinks and images, and displays them in an HTML table as hyperlinks that users can click to go there directly. This will be particularly useful for users who are interested in some images on a website and find it tedious to look through the page source to extract the image sources.

Please do check out the live version of this project on my website. Please feel free to post any comments or criticisms on my project and articles.

Let's make use of MS.NET's more intuitive OOP features to separate, encapsulate and group different functions into classes and assemblies for easier maintenance. This code goes into a class file called HTMLContentParser.vb:

    Imports System
    Imports System.Collections
    Imports System.IO
    Imports System.Net
    Imports System.Text
    Imports System.Text.RegularExpressions

    Public Class HTMLContentParser

        'Request the page at sURL and return its HTML as a string.
        'If anything fails, the exception message is returned instead.
        Function Return_HTMLContent(ByVal sURL As String) As String
            Dim sStream As Stream
            Dim URLReq As HttpWebRequest
            Dim URLRes As HttpWebResponse
            Try
                URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
                URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
                sStream = URLRes.GetResponseStream()
                Return New StreamReader(sStream).ReadToEnd()
            Catch ex As Exception
                Return ex.Message
            End Try
        End Function

        'Collect the href value of every hyperlink found in sHTMLContent.
        Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList()
            rRegEx = New Regex("a.*href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
                RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function

        'Collect the src value of every image found in sHTMLContent.
        Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList()
            rRegEx = New Regex("img.*src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
                RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function

        'Turn a captured link or image source into an absolute URL.
        Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
            'Find out if sURL has a "/" after the domain name.
            'If not, add a "/" at the end.
            'Check for any slash after the double slashes of "http://";
            'if there is none, end the sURL string with a slash.
            If InStr(8, sURL, "/") = 0 Then
                sURL &= "/"
            End If

            'Filter down to the domain name directory from the right
            Dim iCount As Integer
            For iCount = sURL.Length To 1 Step -1
                If Mid(sURL, iCount, 1) = "/" Then
                    sURL = Left(sURL, iCount)
                    Exit For
                End If
            Next

            'Filter out the "&gt;" from the left
            For iCount = 1 To sInput.Length
                If Mid(sInput, iCount, 4) = "&gt;" Then
                    sInput = Left(sInput, iCount - 1) 'Stop and take the chars before it
                    Exit For
                End If
            Next

            'Filter out unnecessary characters
            sInput = sInput.Replace("&lt;", Chr(39))
            sInput = sInput.Replace("&gt;", Chr(39))
            sInput = sInput.Replace("&quot;", "")
            sInput = sInput.Replace("'", "")

            If (sInput.IndexOf("http://") < 0) Then
                If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                    Return sURL & "/" & sInput
                ElseIf (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) & sInput
                Else
                    Return sURL & sInput
                End If
            Else
                Return sInput
            End If
        End Function

    End Class
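
To see what a pattern like the one in ParseHTMLImages captures, the matching step can be tried in isolation on a small HTML string. The sketch below is only an illustration and makes several assumptions: the module name RegexSketch, the function ExtractImageSources and the sample markup are invented for this example, the pattern is a slightly stricter variant (it uses [^>]* instead of .* so that a match stays inside a single tag), and it walks Regex.Matches with For Each rather than the Match/NextMatch loop used in the class.

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module RegexSketch

    'Hypothetical helper: returns the raw src value of each img tag.
    'Group 1 holds either the quoted value or the unquoted value.
    Function ExtractImageSources(ByVal sHTMLContent As String) As ArrayList
        Dim aMatch As New ArrayList()
        Dim rRegEx As New Regex( _
            "<img[^>]*\ssrc\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase)
        For Each mMatch As Match In rRegEx.Matches(sHTMLContent)
            aMatch.Add(mMatch.Groups(1).Value)
        Next
        Return aMatch
    End Function

    Sub Main()
        'Sample markup, made up for this sketch
        Dim sHTML As String = "<p><img src=""a.gif""> text <IMG alt=""x"" SRC=b.png></p>"
        For Each sSrc As String In ExtractImageSources(sHTML)
            Console.WriteLine(sSrc)
        Next
    End Sub

End Module
```

With the sample string, the two values collected are a.gif and b.png. The values are still raw at this point; in the class they would next be passed through ProcessURL.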

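ProcessURL combines the page address and a captured link by hand with Mid, Left and Replace. As an alternative, and not what the article's code does, the .NET System.Uri class can resolve a relative reference against a base address. The following is a minimal sketch under assumptions: the module name UriSketch, the function ResolveUrl and the example.com addresses are all hypothetical.

```vb
Imports System

Module UriSketch

    'Hypothetical helper: resolve sInput against the page address sURL.
    'Absolute links come back unchanged; unparsable input gives "".
    Function ResolveUrl(ByVal sInput As String, ByVal sURL As String) As String
        Dim baseUri As New Uri(sURL)
        Dim resolved As Uri = Nothing
        If Uri.TryCreate(baseUri, sInput, resolved) Then
            Return resolved.AbsoluteUri
        End If
        Return ""
    End Function

    Sub Main()
        Dim sPage As String = "http://www.example.com/news/index.html"
        Console.WriteLine(ResolveUrl("images/logo.gif", sPage))
        Console.WriteLine(ResolveUrl("/images/logo.gif", sPage))
        Console.WriteLine(ResolveUrl("http://other.example.com/a.gif", sPage))
    End Sub

End Module
```

One difference worth knowing: Uri treats a link that starts with "/" as relative to the site root, so the second call gives http://www.example.com/images/logo.gif, whereas ProcessURL appends such a link to the page's directory.
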
The function Return_HTMLContent takes a URL as a string parameter. From there we use the HttpWebRequest and HttpWebResponse objects to send a request to the specified URL and get its HTML content as the response. Note the structured error handling (Try ... Catch ... End Try) implemented here; structured error handling is a separate topic altogether. The returned value should be placed and displayed in an HTML textbox so that it can be retrieved later.

The ParseHTMLLinks and ParseHTMLImages functions make use of the Regex object, which should be very familiar to Java and C# developers and may look alien to VB developers. It is a pattern-matching object that is used together with the Match object. Both belong to the System.Text.RegularExpressions namespace, which therefore MUST be imported into the VB.NET class. What the functions do is essentially run a regex pattern match into a Match object, with the HTML content (retrieved from the HTML textbox we used earlier to display it) as the source. Whenever the pattern specified by the Regex is matched, the function takes the captured string, processes it with the ProcessURL function and adds the result to an ArrayList. The ArrayList class is essentially the Collection class of classic VB. It can add items to and remove items from the collection, which is far more intuitive and easier to use than an array. ParseHTMLLinks and ParseHTMLImages each return an ArrayList, of links and of images respectively.

The ProcessURL function essentially uses some very intrinsic VB functions and some new VB.NET ones (I have been developing with them for years and am therefore familiar with them). I also realize that some detractors out there will propose using the StringBuilder class to manipulate the strings in this function. The StringBuilder class differs from the String class in that a StringBuilder is mutable: it changes its contents in place and does not create a new instance each time it is modified. It is therefore more efficient with the machine's resources. The String class is immutable, so a new instance is created whenever a string is modified; you can imagine the strain on resources when the same string is manipulated 10 times and 10 new instances are created. Hardly efficient. I use the String class here because, although it is much less efficient, it is much more familiar to VB developers moving to VB.NET, and this article focuses, more or less, on the ASP.NET HttpWebRequest and HttpWebResponse objects. I will save the StringBuilder class for later articles, although I am sure other developers and authors here will explain it, or already have.

This code goes into an ASP.NET ASPX page:

    'The parser class coded earlier
    Private objParser As New HTMLContentParser()

    'Fetch the HTML of the requested page and show it in txtHTMLContent
    Private Sub cmdGetHTML_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdGetHTML.ServerClick
        Dim sURL As String = "http://" & txtURL.Value
        txtHTMLContent.EnableViewState = False
        txtHTMLContent.Value = objParser.Return_HTMLContent(sURL)
    End Sub

    Private Sub cmdParse_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdParse.ServerClick
        Call PopulatetblParsedContent()
    End Sub

    Private Sub PopulatetblParsedContent()
        'Populate Links Table
        Dim sURL As String = "http://" & txtURL.Value
        Dim myAnchor As HtmlAnchor
        Dim intRows As Integer
        Dim intRowCount As Integer
        Dim objRow As HtmlTableRow
        Dim objCell As HtmlTableCell
        Dim sLinks As String
        Dim sImage As String
        Dim lstLinks As ArrayList = objParser.ParseHTMLLinks(txtHTMLContent.Value, sURL)
        Dim lstImages As ArrayList = objParser.ParseHTMLImages(txtHTMLContent.Value, sURL)
        tblParsedContent = Me.tblParsedContent
        tblParsedContent.EnableViewState = False

        'One row per hyperlink: anchor -> cell -> row -> table
        For Each sLinks In lstLinks
            objRow = New HtmlTableRow()
            objCell = New HtmlTableCell()
            myAnchor = New HtmlAnchor()
            myAnchor.Target = "_blank"
            myAnchor.InnerText = "Link:" & sLinks.ToString
            myAnchor.HRef = sLinks.ToString
            objCell.NoWrap = False
            objCell.Controls.Add(myAnchor)
            objRow.Cells.Add(objCell)
            tblParsedContent.Rows.Add(objRow)
        Next

        'One row per image source, built the same way
        For Each sImage In lstImages
            objRow = New HtmlTableRow()
            objCell = New HtmlTableCell()
            myAnchor = New HtmlAnchor()
            myAnchor.Target = "_blank"
            myAnchor.InnerText = "Img:" & sImage.ToString
            myAnchor.HRef = sImage.ToString
            objCell.NoWrap = False
            objCell.Controls.Add(myAnchor)
            objRow.Cells.Add(objCell)
            tblParsedContent.Rows.Add(objRow)
        Next
    End Sub

We now go on to focus on how to extract the HTML content and parse it. Design your ASPX page and have:
1) An HTMLTextbox for users to specify the URL for processing
2) An HTMLButton called cmdGetHTML with a ServerClick event handler to handle the click.

The event triggers a routine that uses the HTMLContentParser class we coded earlier and calls its Return_HTMLContent function to return a string of HTML content for display in the txtHTMLContent HTMLTextBox.
3) An HTMLTextBox called txtHTMLContent to hold the returned HTML content
4) An HTMLButton called cmdParse with a ServerClick event handler that calls PopulatetblParsedContent
5) An HTMLTable called tblParsedContent

Personally, I think the HTMLTable server control is amazing. I have used this example to show how intuitive it is to add cells and rows to it. Again, my detractors out there may question the routine that populates the table. I say that this article just explains one of the ways to populate an HTMLTable. It is very intuitive, and I am sure most developers out there can understand it without much explanation. It adds cells to rows and rows to the HTMLTable. Note the use of For Each ... In over the ArrayList collection to extract each link and image, assign it to another server control, the HTMLAnchor, add this HTMLAnchor to an HTMLTableCell, add the HTMLTableCell to an HTMLTableRow and finally add the HTMLTableRow to the HTMLTable. Very intuitive to program and code!

In the world of software development, the Holy Grail is seldom achieved, as there is No One Right Way to do things; however, there are Many Wrong Ways. This article is more or less a tutorial on certain ASP.NET objects and on how intuitive the web server controls are to program. Feel free to modify it to suit your learning curve in the transition to VB.NET / ASP.NET. Use the StringBuilder class instead of the String class in the ProcessURL function of the HTMLContentParser.vb class, and once you are familiar with the program structure of the HTMLTable, move on to the DataSource and DataBind techniques of the ASP DataList and the ASP DataGrid server controls. There is only one way to learn properly, and that is from SCRATCH. That way you can fully appreciate why and how you do things. After all, ain't the whole world of MS.NET developed from SCRATCH, which is much better than patches and add-ons to the imperfections of yesterday?
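
As a possible starting point for the StringBuilder exercise suggested above, the character clean-up step of ProcessURL could look something like the following. This is a minimal sketch under assumptions: the module name StringBuilderSketch and the function CleanInput are hypothetical, and it only strips quotes and angle brackets from a captured value, in the spirit of the Replace calls in ProcessURL, without covering the rest of that function.

```vb
Imports System
Imports System.Text

Module StringBuilderSketch

    'Hypothetical helper: strips quotes and angle brackets from a value.
    'All four Replace calls modify the same StringBuilder buffer.
    Function CleanInput(ByVal sInput As String) As String
        Dim sb As New StringBuilder(sInput)
        sb.Replace("""", "")
        sb.Replace("'", "")
        sb.Replace("<", "")
        sb.Replace(">", "")
        Return sb.ToString()
    End Function

    Sub Main()
        'Sample value, made up for this sketch
        Console.WriteLine(CleanInput("'images/logo.gif"">"))
    End Sub

End Module
```

For four short replacements the saving is negligible; the point is only to get used to the StringBuilder calls before applying them to longer string manipulation.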
