1. The problem
I need a two-bedroom, one-living-room apartment priced at around 400,000 yuan, located in any of several districts of Shanghai (such as Xuhui or Minhang). My ideal house must of course satisfy quite a few conditions: for example, it must have two bedrooms, its unit price must be below 7,000 yuan per square meter, and so on.
Usually I use the second-hand housing site of the Shanghai Hotline (http://secondhand.online.sh.cn/) to find the latest second-hand housing listings.
But tracking the site's latest listings every day is exhausting. The site adds a great many new listings daily, and to pick out the houses that meet my conditions I have to read through all of them. The whole process wastes an enormous amount of time, for two reasons:
1. The homepage displays only a summary of each newly added listing. A summary alone is not enough to make a judgment, so I have to look at more information, which means clicking the "Details" hyperlink and downloading the listing's detail page.
2. Not all listings fit on one page. In fact, the Shanghai Hotline second-hand housing site displays only 10 listings per page, so to see everything I have to browse roughly 500-3000 listings (the exact number depends on how strict the search conditions are).
Points 1 and 2 above mean that even if I only need to read 500 listings, I have to view 550 pages:

500 / 10 (summary pages) + 500 * 1 (one detail page per listing) = 550

Each page takes about 10 seconds to load (an actual measurement), and reading each page takes me about 25 seconds (my reading speed is quite fast):

550 * (10 + 25) = 19250 seconds ≈ 320 minutes ≈ 5 hours

In other words, reading alone costs me 5 hours, and that is without counting the time to analyze the information (such as computing the average price of 10 properties), which takes far longer than the reading itself.
2. The essence of the problem
Now that so much information is published online, similar problems are everywhere; job listings, securities, and so on all pose the same problem: how to fetch data from the Internet efficiently and analyze it.
3. A solution
I have read some related articles published in China, and the programs they describe are all based on IE automation, usually by embedding the WebBrowser control in a VB program.
The biggest drawback of that approach is that WebBrowser wraps so much around the data operations that originally simple tasks become needlessly complicated (I suspect Microsoft designed WebBrowser for automating interactive IE sessions rather than for extracting data). For example, take something as simple as getting the source of the current page: with WebBrowser, issuing the request and receiving the result happen in two separate message-driven steps, and making those two steps communicate takes a good deal of extra work.
My solution is a script written in Python. Python's library support is comprehensive, and as a bonus the script is cross-platform. I use a structured design, because using a web robot to gather data is itself a structured activity. I wrap the operations specific to this particular website in a few functions, so with a suitable rewrite the script can serve other purposes: it is tied neither to the Shanghai Hotline second-hand housing site nor to searching for second-hand housing.
Before writing this script I had already built several web robots, in languages including VB, Java, and Python, so I think the web robots I write now are fairly mature.
The main flow of the program is: first submit the search conditions to the second-hand housing site, then fetch the page of search results and read from it the number of result pages and the individual listing entries (reading these requires hard-coded start and end markers). Based on that information the script then simulates paging through the results and clicking the "Details" links. All of these actions are nothing more than the GET and POST actions of the HTTP standard.
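In skeleton form, the fetch step is just a POST followed by a read. Here is a minimal sketch; the URL and field names are the ones used by the full script in section 4, and only a subset of the form fields is shown:

import urllib
import urllib2

# search conditions, as the site's form expects them (subset, for illustration)
post_data = {"fx_room": "2", "fx_hall": "1", "whichpage": "1"}

# supplying data to urlopen() turns the request into a POST
f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                    urllib.urlencode(post_data))
page = f.read()   # raw HTML of the first result page

Paging is then simulated by resubmitting the same form with a different "whichpage" value.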
The following parts of the program have to be hard-coded, if necessary by analyzing the browser-side client scripts (such as JavaScript); a sketch of how the markers are used follows the list:
* The start and end markers of each listing entry
* The start and end markers of the current page number
* The start and end markers of the total number of pages
* The hyperlink address used by the paging action
* The hyperlink address behind the "Details" hyperlink
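Each of these markers is used the same way: locate the start marker, locate the end marker, and cut out the text in between. A minimal sketch of that idea (the function name and the example markers are mine, for illustration only):

from string import find

def get_between(s, begin_mark, end_mark):
    # return the text between the two markers, or "" if either is missing
    b = find(s, begin_mark)
    if b == -1:
        return ""
    b = b + len(begin_mark)
    e = find(s, end_mark, b)
    if e == -1:
        return ""
    return s[b:e]

# e.g. pull one cell's content out of a table row:
# get_between(row_html, "<td>", "</td>")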
The data I obtain is printed to the screen, comma-separated; if you want the data in a file, you can use the shell's redirection mechanism.
For example, suppose the script I wrote is house.py and you want to put its output into the file house.csv; you can enter a command like this:

python house.py > house.csv
Then you can analyze house.csv. I usually use an editor that supports regular expressions to remove some of the leftover formatting characters in house.csv, and then analyze the data in Excel.
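That editing step can also be scripted. A minimal sketch, assuming the output file is named house.csv and that the junk to remove is runs of spaces and tabs (the actual pattern depends on what the site's pages leave behind):

import re

fin = open("house.csv")
fout = open("house_clean.csv", "w")
for line in fin:
    # collapse runs of spaces and tabs left over from the HTML
    fout.write(re.sub(r"[ \t]+", " ", line))
fin.close()
fout.close()

The cleaned file can then be opened directly in Excel.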
4. Specific implementation and code
A few small problems came up during implementation. For example, to support Chinese I had to begin the script with "# -*- coding: mbcs -*-", and because I have to go through a specified proxy server I use the urllib2 library instead of urllib. These issues are explained in the source code.
Below is the source code:
# -*- coding: mbcs -*-

# ---------------- User Configuration ---------------- begin

# Use an HTTP proxy that requires authentication?
# (only needed in very few companies; in most cases leave this 0)
config_using_proxy = 0

# Proxy address and port; only meaningful when config_using_proxy is 1
config_httpproxy = ""

# User name for the proxy server; ignored unless config_using_proxy is 1
config_username = ""

# Password for the proxy server; ignored unless config_using_proxy is 1
config_password = ""

# District to search ("", "Minhang", "Xuhui", "Yangpu", "Hongkou", "Changning",
# "Jing'an", "Luwan", "Huangpu", "Zhabei", "Putuo", "Pudong", "Baoshan",
# "Jiading", "Qingpu", "Fengxian", "Nanhui", "Jinshan", "Songjiang",
# "Chongming"); "" means all districts
config_szqx = "Hongkou"

# Number of bedrooms ("", "1", "2", "3", "4", "5"); "" means no limit
config_fx_room = "2"

# Number of halls ("", "0", "1", "2", "3", "4"); "" means no limit
config_fx_hall = "1"

# Lower bound of the total price (unit: 10,000 yuan)
# ("0", "10", "15", "20", "25", "30", "35", "40", "50", "60", "70", "80",
# "90", "100"); "0" means no lower limit
config_jg_min = "20"

# Upper bound of the total price (unit: 10,000 yuan)
# ("10", "15", "20", "25", "30", "40", "50", "60", "70", "80", "90", "100",
# "10000000"); "10000000" means no upper limit
config_jg_max = "60"

# Transaction type ("0", "1"); 0 means sale, 1 means exchange
config_type = "0"

# Registration date (unit: days) ("15", "30", "180", ""); "" means no limit
config_djrq = "15"

# Marker string near the first row of the result table
# (the search for the first row starts forward from this string)
config_tbl_begin_str = ">District<"

# Marker string near the last row of the result table
# (the search for the last row goes backwards from this string)
config_tbl_end_str = "qualifying sale listings found"
# ---------------- User Configuration ---------------- end


# ---------------- Administrator Configuration ---------------- begin
config_post_data = {"szqx": config_szqx, "fx_room": config_fx_room,
                    "fx_hall": config_fx_hall, "jg_min": config_jg_min,
                    "jg_max": config_jg_max, "type": config_type,
                    "djrq": config_djrq, "sortfield": "djrq",
                    "sorttype": "DESC", "whichpage": "1"}
# ---------------- Administrator Configuration ---------------- end


from string import *
import sys
import urllib
import urllib2


# ---------------- print routines ---------------- begin
def dump_row_end():
    sys.stdout.write('\n')


def dump_table_begin():
    # column headers, comma-separated: 7 summary fields, then 14 detail fields
    for field in ("District", "Property Address", "Layout", "Property Type",
                  "Floor Area", "Total Price", "Registration Date",
                  "Listing No.", "Property Name", "Age",
                  "Property Description", "Agent Quote", "Detailed Address",
                  "Orientation", "Floor", "Decoration", "Indoor Condition",
                  "Valid Period", "Note", "Contact", "Contact Phone"):
        sys.stdout.write(field + ",")
    sys.stdout.write('\n')


def dump_one_field(field):
    sys.stdout.write(field + ",")
# ---------------- print routines ---------------- end


# ---------------- house parser ---------------- begin
def get_last_page_number(s):
    # the number of the last page is hidden in the "Last" paging link:
    # ... href="javascript:form_submit('NN')">Last ...
    no_begin = find(s, "Last")
    if no_begin == -1:
        return 0

    no_begin = rfind(s, "javascript:form_submit('", 0, no_begin)
    no_begin = no_begin + len("javascript:form_submit('")
    no_end = find(s, "'", no_begin)
    if no_end == -1:
        return 0

    if no_begin > no_end:
        return 0
    return atoi(s[no_begin:no_end])


def get_data_in_one_tag(instr4, tag):
    # return the text inside the first <tag>...</tag> pair,
    # or the whole string if the tag is not found
    tag_begin = find(instr4, "<" + tag)
    if tag_begin == -1:
        return instr4

    tag_begin = find(instr4, ">", tag_begin)
    if tag_begin == -1:
        return instr4
    tag_begin = tag_begin + 1

    tag_end = find(instr4, "</" + tag + ">", tag_begin)
    if tag_end == -1:
        return instr4

    return instr4[tag_begin:tag_end]


def filter_rubbish_data(s):
    # maybe we will output the data in CSV format, so replace ASCII commas
    # with full-width ones and delete tabs and line breaks
    return strip(replace(replace(replace(replace(
        s, ",", "，"), '\n', ''), '\t', ''), '\r', ''))


def get_one_detailed_data(instr3, keyword):
    # find the keyword cell, then return the cleaned-up content
    # of the <td> that follows it
    data_begin = find(instr3, keyword)
    if data_begin == -1:
        return ""

    data_begin = find(instr3, "<td", data_begin)
    if data_begin == -1:
        return ""

    data_begin = find(instr3, ">", data_begin)
    if data_begin == -1:
        return ""
    data_begin = data_begin + 1

    data_end = find(instr3, "</td>", data_begin)
    if data_end == -1:
        return ""

    if data_begin > data_end:
        return ""
    # delete spaces, commas, tabs and linefeeds
    return filter_rubbish_data(instr3[data_begin:data_end])


def get_detailed_data(instr2):
    dump_one_field(get_one_detailed_data(instr2, "Listing No."))
    dump_one_field(get_one_detailed_data(instr2, "Property Name"))
    dump_one_field(get_one_detailed_data(instr2, "Age"))
    dump_one_field(get_one_detailed_data(instr2, "Property Description"))
    dump_one_field(get_one_detailed_data(instr2, "Agent Quote"))
    # delete the href from the address cell
    tmpstr = get_one_detailed_data(instr2, "Detailed Address")
    tmppos = find(tmpstr, "<")
    if tmppos != -1:
        tmpstr = strip(tmpstr[:tmppos])
    dump_one_field(tmpstr)
    dump_one_field(get_one_detailed_data(instr2, "Orientation"))
    dump_one_field(get_one_detailed_data(instr2, "Floor"))
    dump_one_field(get_one_detailed_data(instr2, "Decoration"))
    dump_one_field(get_one_detailed_data(instr2, "Indoor Condition"))
    dump_one_field(get_one_detailed_data(instr2, "Valid Period"))
    dump_one_field(get_one_detailed_data(instr2, "Note"))
    dump_one_field(get_data_in_one_tag(
        get_one_detailed_data(instr2, "Contact"), "div"))
    dump_one_field(get_data_in_one_tag(
        get_one_detailed_data(instr2, "Contact Phone"), "div"))


def get_data(instr, tbl_begin_str, tbl_end_str):
    # table begin
    idx = find(instr, tbl_begin_str)
    if idx == -1:
        return
    idx = find(instr, "<tr", idx)
    if idx == -1:
        return
    table_begin = idx

    # table end
    idx = find(instr, tbl_end_str, table_begin)
    if idx == -1:
        return
    idx = rfind(instr, "</tr>", table_begin, idx)
    if idx == -1:
        return
    table_end = idx + len("</tr>")

    # search rows
    tr_idx = table_begin
    while tr_idx < table_end:
        # tr begin
        tr_idx = find(instr, "<tr", tr_idx)
        if tr_idx == -1:
            return
        tr_idx = find(instr, ">", tr_idx)
        if tr_idx == -1:
            return
        tr_begin = tr_idx + 1

        # tr end
        tr_idx = find(instr, "</tr>", tr_begin)
        if tr_idx == -1:
            return
        tr_end = tr_idx

        # search cells in one row
        td_idx = tr_begin
        is_really_a_row_dumped = 0
        while td_idx < tr_end:
            # td data begin
            td_idx = find(instr, "<td", td_idx)
            if td_idx == -1:
                return
            td_idx = find(instr, ">", td_idx)
            if td_idx == -1:
                return
            tddata_begin = td_idx + 1

            # td data end
            td_idx = find(instr, "</td>", td_idx)
            if td_idx == -1:
                return
            tddata_end = td_idx

            if tddata_begin > tddata_end:
                continue
            if tddata_end > tr_end:
                continue
            if tddata_end > table_end:
                continue

            tddata = filter_rubbish_data(instr[tddata_begin:tddata_end])

            # if the cell is a href, fetch more data from the detail page
            href_mark = 'href="javascript:urll(\''
            href_begin = find(tddata, href_mark)
            if href_begin == -1:
                dump_one_field(tddata)
                continue

            href_begin = href_begin + len(href_mark)
            href_end = find(tddata, "'", href_begin)
            if href_end == -1:
                return

            view_url = ("http://secondhand.online.sh.cn/"
                        + tddata[href_begin:href_end])
            view_result = urllib2.urlopen(view_url)
            view_data = view_result.read()
            get_detailed_data(view_data)
            is_really_a_row_dumped = 1

        if is_really_a_row_dumped:  # sometimes a row produces no td output
            dump_row_end()
# ---------------- house parser ---------------- end


def install_proxy():
    httpproxy = config_httpproxy
    username = config_username
    password = config_password
    httpproxystring = 'http://' + username + ':' + password + '@' + httpproxy

    # build a new opener that uses a proxy requiring authorization
    proxy_support = urllib2.ProxyHandler({"http": httpproxystring})
    authinfo = urllib2.HTTPBasicAuthHandler()
    opener = urllib2.build_opener(proxy_support, authinfo, urllib2.HTTPHandler)

    # install it
    urllib2.install_opener(opener)


# ---------------- main ---------------- begin
if __name__ == "__main__":
    # use the proxy if configured
    if config_using_proxy:
        install_proxy()

    # get the first result page (a POST, since form data is supplied)
    f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                        urllib.urlencode(config_post_data))
    s = f.read()

    # parse the HTML page
    #s = '<table><tr><td>data11</td><td>data12</td></tr>' \
    #    '<tr><td>data21</td><td>data22</td></tr></table>'  #debug
    #config_tbl_begin_str = "<table"  #debug
    #config_tbl_end_str = "</table>"  #debug

    # print out the table header
    dump_table_begin()
    # print out the first page
    get_data(s, config_tbl_begin_str, config_tbl_end_str)

    # get the page count from the first page's data
    last_page = get_last_page_number(s)
    # print out the other pages (if any); range() excludes its upper
    # bound, hence last_page + 1
    for i in range(2, last_page + 1):
        config_post_data['whichpage'] = str(i)
        f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                            urllib.urlencode(config_post_data))
        s = f.read()
        get_data(s, config_tbl_begin_str, config_tbl_end_str)

    #s = '<table><tr><td>header1</td><td>data1</td></tr></table>'  #debug
    #print get_one_detailed_data(s, "header1")  #debug
    #print get_last_page_number(s)  #debug
# ---------------- main ---------------- end