The Dive Into Python website has some examples of HTML processing at http://diveintopython.org/html_processing/extracting_data.html, such as the following class, which collects the content of the href attributes in an HTML page.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # collect the value of every href attribute of an <a> tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

In this file I changed href to src and start_a to start_img:

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_img(self, attrs):
        # collect the value of every src attribute of an <img> tag
        src = [v for k, v in attrs if k == 'src']
        if src:
            self.urls.extend(src)

Save this code as urllister.py and put it in the Python installation directory. It can then be used to extract the image addresses from a web page. The download program follows:

import urllib2
import urllib
import os
import urllister  # the class from http://diveintopython.org/html_processing/extracting_data.html#diaalect.extract.urlib
imagepath = []

# cd checks whether the path is valid; if it is, it changes the current working directory
def cd(ss):
    try:
        os.chdir(ss)
        print 'Changed the working directory to ' + ss
        return 0
    except OSError:
        print 'The image save path is incorrect, please re-enter it'
        return 1

# addimagepath adds an image address to imagepath
def addimagepath(surl):
    if 'http://' in surl:
        imagepath.append(surl)
        print 'Found image: ' + surl.split('/')[-1] + '  Image address: ' + surl
    else:
        # treat anything else as a path below the page URL
        surl = str_url + surl
        imagepath.append(surl)
        print 'Found image: ' + surl.split('/')[-1] + '  Image address: ' + surl

# download images
def image_down(list_image):
    if not list_image:
        print "This page does not have any pictures"
    else:
        for image in list_image:
            try:
                # image.split('/')[-1] gives the file name
                urllib.urlretrieve(image, image.split('/')[-1])
                print "Image from " + image + " saved successfully!"
            except Exception:
                print "Image from " + image + " could not be saved, continuing with the next one..."

print "Please enter the URL of the web page:"
str_url = raw_input()
print "Please enter the directory to save the images in; press Enter to use the default (My Documents)"
temp = 1
while temp:
    str_save = raw_input()
    if not str_save:
        str_save = 'E:/fei_doc'
    temp = cd(str_save)

try:
    sock = urllib2.urlopen(str_url)
    print "Page connection succeeded! Start getting the image addresses..."
except Exception:
    print "Sorry, the address is incorrect or the page cannot be reached; the program will exit"
    raise SystemExit(1)  # stop here, since there is no page to parse

parser = urllister.URLLister()
parser.feed(sock.read())
sock.close()
parser.close()
for url in parser.urls:
    addimagepath(url)
# call the image download function
image_down(imagepath)
# End of program

Although this program basically solves the problem, I found some shortcomings:

1. If the src attribute does not directly follow the img tag, as in more complex img markup, URLLister does not recognize it. This is easy to work around, though: analyze the HTML code line by line and use split('src') to get the content of every src attribute, then keep the image file addresses by checking for a suffix such as jpg or gif.

2. The program above only handles image addresses that start with http and images located directly below the current URL. If the content of src begins with "../images" or "/", it needs extra handling.
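Both shortcomings can be sketched together. This is a minimal illustration, not the original program. It is written for Python 3 and uses only the standard library (re and urllib.parse), whereas the program above is Python 2, where the same function is available as urlparse.urljoin. It uses a regular expression in place of a bare split('src'). The names IMAGE_SUFFIXES, find_image_srcs and resolve_image_urls, the suffix list, and the example.com addresses are all made up for the example.

```python
import re
from urllib.parse import urljoin

# Assumed list of image suffixes; extend it as needed.
IMAGE_SUFFIXES = ('.jpg', '.jpeg', '.gif', '.png', '.bmp')

# Matches src="..." or src='...' wherever it appears inside a tag,
# but not longer attribute names such as data-src.
SRC_PATTERN = re.compile(r'''(?<![\w-])src\s*=\s*["']([^"']+)["']''', re.IGNORECASE)


def find_image_srcs(html):
    """Return every src value in html that ends with an image suffix."""
    srcs = SRC_PATTERN.findall(html)
    # Ignore any query string when checking the suffix.
    return [s for s in srcs if s.lower().split('?')[0].endswith(IMAGE_SUFFIXES)]


def resolve_image_urls(page_url, html):
    """Turn each image src into an absolute URL, relative to page_url."""
    # urljoin handles "../images/a.jpg", "/b.gif" and full http URLs alike.
    return [urljoin(page_url, src) for src in find_image_srcs(html)]
```

For a page at http://example.com/dir/page.html, a src of "../images/a.jpg" resolves to http://example.com/images/a.jpg and a src of "/b.GIF" resolves to http://example.com/b.GIF, while a src that is already a full http address is returned unchanged. This replaces the plain str_url + surl concatenation, which only works for images below the current URL.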