1. The problem
I need a two-bedroom, one-living-room apartment priced at around 400,000 yuan, located in any of several districts of Shanghai (such as Xuhui or Minhang). My ideal house must of course satisfy quite a few conditions: for example, it must have two bedrooms, its unit price must be below 7,000 yuan per square meter, and so on.
Usually I use the second-hand housing site of the Shanghai Hotline (http://secondhand.online.sh.cn/) to find the latest second-hand housing listings.
But tracking the site's latest listings every day is exhausting. The site adds a great many new listings daily, and to pick out the houses that meet my conditions I have to read through all of them. The whole process wastes an enormous amount of time, for two reasons:
1. The homepage displays only a summary of each newly added listing. A summary alone is not enough to make a judgment, so I have to look at more information, which means clicking the "Details" hyperlink and downloading the listing's detail page.
2. Not all listings fit on one page. In fact, the Shanghai Hotline second-hand housing site displays only 10 listings per page, so to see everything I have to browse roughly 500-3000 listings (the exact number depends on how strict the search conditions are).
Points 1 and 2 above mean that even if I only need to read 500 listings, I have to view 550 pages:

500 / 10 (summary pages) + 500 * 1 (one detail page per listing) = 550

Each page takes about 10 seconds to load (an actual measurement), and reading each page takes me about 25 seconds (my reading speed is quite fast):

550 * (10 + 25) = 19250 seconds ≈ 320 minutes ≈ 5 hours

In other words, reading alone costs me 5 hours, and that is without counting the time to analyze the information (such as computing the average price of 10 properties), which takes far longer than the reading itself.
2. The essence of the problem
Now that so much information is published online, similar problems are everywhere; job listings, securities, and so on all pose the same problem: how to fetch data from the Internet efficiently and analyze it.
3. A solution
I have read some related articles published in China, and the programs they describe are all based on IE automation, usually by embedding the WebBrowser control in a VB program.
The biggest drawback of that approach is that WebBrowser wraps so much around the data operations that originally simple tasks become needlessly complicated (I suspect Microsoft designed WebBrowser for automating interactive IE sessions rather than for extracting data). For example, take something as simple as getting the source of the current page: with WebBrowser, issuing the request and receiving the result happen in two separate message-driven steps, and making those two steps communicate takes a good deal of extra work.
My solution is a script written in Python. Python's library support is comprehensive, and as a bonus the script is cross-platform. I use a structured design, because using a web robot to gather data is itself a structured activity. I wrap the operations specific to this particular website in a few functions, so with a suitable rewrite the script can serve other purposes: it is tied neither to the Shanghai Hotline second-hand housing site nor to searching for second-hand housing.
Before writing this script I had already built several web robots, in languages including VB, Java, and Python, so I think the web robots I write now are fairly mature.
The main flow of the program is: first submit the search conditions to the second-hand housing site, then fetch the page of search results and read from it the number of result pages and the individual listing entries (reading these requires hard-coded start and end markers). Based on that information the script then simulates paging through the results and clicking the "Details" links. All of these actions are nothing more than the GET and POST actions of the HTTP standard.
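In skeleton form, the fetch step is just a POST followed by a read. Here is a minimal sketch; the URL and field names are the ones used by the full script in section 4, and only a subset of the form fields is shown:

import urllib
import urllib2

# search conditions, as the site's form expects them (subset, for illustration)
post_data = {"fx_room": "2", "fx_hall": "1", "whichpage": "1"}

# supplying data to urlopen() turns the request into a POST
f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                    urllib.urlencode(post_data))
page = f.read()   # raw HTML of the first result page

Paging is then simulated by resubmitting the same form with a different "whichpage" value.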
The following parts of the program have to be hard-coded, if necessary by analyzing the browser-side client scripts (such as JavaScript); a sketch of how the markers are used follows the list:
* The start and end markers of each listing entry
* The start and end markers of the current page number
* The start and end markers of the total number of pages
* The hyperlink address used by the paging action
* The hyperlink address behind the "Details" hyperlink
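Each of these markers is used the same way: locate the start marker, locate the end marker, and cut out the text in between. A minimal sketch of that idea (the function name and the example markers are mine, for illustration only):

from string import find

def get_between(s, begin_mark, end_mark):
    # return the text between the two markers, or "" if either is missing
    b = find(s, begin_mark)
    if b == -1:
        return ""
    b = b + len(begin_mark)
    e = find(s, end_mark, b)
    if e == -1:
        return ""
    return s[b:e]

# e.g. pull one cell's content out of a table row:
# get_between(row_html, "<td>", "</td>")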
The data I obtain is printed to the screen, comma-separated; if you want the data in a file, you can use the shell's redirection mechanism.
For example, suppose the script I wrote is house.py and you want to put its output into the file house.csv; you can enter a command like this:

python house.py > house.csv
Then you can analyze house.csv. I usually use an editor that supports regular expressions to remove some of the leftover formatting characters in house.csv, and then analyze the data in Excel.
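That editing step can also be scripted. A minimal sketch, assuming the output file is named house.csv and that the junk to remove is runs of spaces and tabs (the actual pattern depends on what the site's pages leave behind):

import re

fin = open("house.csv")
fout = open("house_clean.csv", "w")
for line in fin:
    # collapse runs of spaces and tabs left over from the HTML
    fout.write(re.sub(r"[ \t]+", " ", line))
fin.close()
fout.close()

The cleaned file can then be opened directly in Excel.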
4. Specific implementation and code
A few small problems came up during implementation. For example, to support Chinese I had to begin the script with "# -*- coding: mbcs -*-", and because I have to go through a specified proxy server I use the urllib2 library instead of urllib. These issues are explained in the source code.
Below is the source code:
# -*- coding: mbcs -*-

# ---------------- User Configuration ---------------- begin

# Use an HTTP proxy that requires authentication?
# (only needed in very few companies; in most cases leave this 0)
config_using_proxy = 0

# Proxy address and port; only meaningful when config_using_proxy is 1
config_httpproxy = ""

# User name for the proxy server; ignored unless config_using_proxy is 1
config_username = ""

# Password for the proxy server; ignored unless config_using_proxy is 1
config_password = ""

# District to search ("", "Minhang", "Xuhui", "Yangpu", "Hongkou", "Changning",
# "Jing'an", "Luwan", "Huangpu", "Zhabei", "Putuo", "Pudong", "Baoshan",
# "Jiading", "Qingpu", "Fengxian", "Nanhui", "Jinshan", "Songjiang",
# "Chongming"); "" means all districts
config_szqx = "Hongkou"

# Number of bedrooms ("", "1", "2", "3", "4", "5"); "" means no limit
config_fx_room = "2"

# Number of halls ("", "0", "1", "2", "3", "4"); "" means no limit
config_fx_hall = "1"

# Lower bound of the total price (unit: 10,000 yuan)
# ("0", "10", "15", "20", "25", "30", "35", "40", "50", "60", "70", "80",
# "90", "100"); "0" means no lower limit
config_jg_min = "20"

# Upper bound of the total price (unit: 10,000 yuan)
# ("10", "15", "20", "25", "30", "40", "50", "60", "70", "80", "90", "100",
# "10000000"); "10000000" means no upper limit
config_jg_max = "60"

# Transaction type ("0", "1"); 0 means sale, 1 means exchange
config_type = "0"

# Registration date (unit: days) ("15", "30", "180", ""); "" means no limit
config_djrq = "15"

# Marker string near the first row of the result table
# (the search for the first row starts forward from this string)
config_tbl_begin_str = ">District<"

# Marker string near the last row of the result table
# (the search for the last row goes backwards from this string)
config_tbl_end_str = "qualifying sale listings found"
# ---------------- User Configuration ---------------- end


# ---------------- Administrator Configuration ---------------- begin
config_post_data = {"szqx": config_szqx, "fx_room": config_fx_room,
                    "fx_hall": config_fx_hall, "jg_min": config_jg_min,
                    "jg_max": config_jg_max, "type": config_type,
                    "djrq": config_djrq, "sortfield": "djrq",
                    "sorttype": "DESC", "whichpage": "1"}
# ---------------- Administrator Configuration ---------------- end


from string import *
import sys
import urllib
import urllib2


# ---------------- print routines ---------------- begin
def dump_row_end():
    sys.stdout.write('\n')


def dump_table_begin():
    # column headers, comma-separated: 7 summary fields, then 14 detail fields
    for field in ("District", "Property Address", "Layout", "Property Type",
                  "Floor Area", "Total Price", "Registration Date",
                  "Listing No.", "Property Name", "Age",
                  "Property Description", "Agent Quote", "Detailed Address",
                  "Orientation", "Floor", "Decoration", "Indoor Condition",
                  "Valid Period", "Note", "Contact", "Contact Phone"):
        sys.stdout.write(field + ",")
    sys.stdout.write('\n')


def dump_one_field(field):
    sys.stdout.write(field + ",")
# ---------------- print routines ---------------- end


# ---------------- house parser ---------------- begin
def get_last_page_number(s):
    # the number of the last page is hidden in the "Last" paging link:
    # ... href="javascript:form_submit('NN')">Last ...
    no_begin = find(s, "Last")
    if no_begin == -1:
        return 0

    no_begin = rfind(s, "javascript:form_submit('", 0, no_begin)
    no_begin = no_begin + len("javascript:form_submit('")
    no_end = find(s, "'", no_begin)
    if no_end == -1:
        return 0

    if no_begin > no_end:
        return 0
    return atoi(s[no_begin:no_end])


def get_data_in_one_tag(instr4, tag):
    # return the text inside the first <tag>...</tag> pair,
    # or the whole string if the tag is not found
    tag_begin = find(instr4, "<" + tag)
    if tag_begin == -1:
        return instr4

    tag_begin = find(instr4, ">", tag_begin)
    if tag_begin == -1:
        return instr4
    tag_begin = tag_begin + 1

    tag_end = find(instr4, "</" + tag + ">", tag_begin)
    if tag_end == -1:
        return instr4

    return instr4[tag_begin:tag_end]


def filter_rubbish_data(s):
    # maybe we will output the data in CSV format, so replace ASCII commas
    # with full-width ones and delete tabs and line breaks
    return strip(replace(replace(replace(replace(
        s, ",", "，"), '\n', ''), '\t', ''), '\r', ''))


def get_one_detailed_data(instr3, keyword):
    # find the keyword cell, then return the cleaned-up content
    # of the <td> that follows it
    data_begin = find(instr3, keyword)
    if data_begin == -1:
        return ""

    data_begin = find(instr3, "<td", data_begin)
    if data_begin == -1:
        return ""

    data_begin = find(instr3, ">", data_begin)
    if data_begin == -1:
        return ""
    data_begin = data_begin + 1

    data_end = find(instr3, "</td>", data_begin)
    if data_end == -1:
        return ""

    if data_begin > data_end:
        return ""
    # delete spaces, commas, tabs and linefeeds
    return filter_rubbish_data(instr3[data_begin:data_end])


def get_detailed_data(instr2):
    dump_one_field(get_one_detailed_data(instr2, "Listing No."))
    dump_one_field(get_one_detailed_data(instr2, "Property Name"))
    dump_one_field(get_one_detailed_data(instr2, "Age"))
    dump_one_field(get_one_detailed_data(instr2, "Property Description"))
    dump_one_field(get_one_detailed_data(instr2, "Agent Quote"))
    # delete the href from the address cell
    tmpstr = get_one_detailed_data(instr2, "Detailed Address")
    tmppos = find(tmpstr, "<")
    if tmppos != -1:
        tmpstr = strip(tmpstr[:tmppos])
    dump_one_field(tmpstr)
    dump_one_field(get_one_detailed_data(instr2, "Orientation"))
    dump_one_field(get_one_detailed_data(instr2, "Floor"))
    dump_one_field(get_one_detailed_data(instr2, "Decoration"))
    dump_one_field(get_one_detailed_data(instr2, "Indoor Condition"))
    dump_one_field(get_one_detailed_data(instr2, "Valid Period"))
    dump_one_field(get_one_detailed_data(instr2, "Note"))
    dump_one_field(get_data_in_one_tag(
        get_one_detailed_data(instr2, "Contact"), "div"))
    dump_one_field(get_data_in_one_tag(
        get_one_detailed_data(instr2, "Contact Phone"), "div"))


def get_data(instr, tbl_begin_str, tbl_end_str):
    # table begin
    idx = find(instr, tbl_begin_str)
    if idx == -1:
        return
    idx = find(instr, "<tr", idx)
    if idx == -1:
        return
    table_begin = idx

    # table end
    idx = find(instr, tbl_end_str, table_begin)
    if idx == -1:
        return
    idx = rfind(instr, "</tr>", table_begin, idx)
    if idx == -1:
        return
    table_end = idx + len("</tr>")

    # search rows
    tr_idx = table_begin
    while tr_idx < table_end:
        # tr begin
        tr_idx = find(instr, "<tr", tr_idx)
        if tr_idx == -1:
            return
        tr_idx = find(instr, ">", tr_idx)
        if tr_idx == -1:
            return
        tr_begin = tr_idx + 1

        # tr end
        tr_idx = find(instr, "</tr>", tr_begin)
        if tr_idx == -1:
            return
        tr_end = tr_idx

        # search cells in one row
        td_idx = tr_begin
        is_really_a_row_dumped = 0
        while td_idx < tr_end:
            # td data begin
            td_idx = find(instr, "<td", td_idx)
            if td_idx == -1:
                return
            td_idx = find(instr, ">", td_idx)
            if td_idx == -1:
                return
            tddata_begin = td_idx + 1

            # td data end
            td_idx = find(instr, "</td>", td_idx)
            if td_idx == -1:
                return
            tddata_end = td_idx

            if tddata_begin > tddata_end:
                continue
            if tddata_end > tr_end:
                continue
            if tddata_end > table_end:
                continue

            tddata = filter_rubbish_data(instr[tddata_begin:tddata_end])

            # if the cell is a href, fetch more data from the detail page
            href_mark = 'href="javascript:urll(\''
            href_begin = find(tddata, href_mark)
            if href_begin == -1:
                dump_one_field(tddata)
                continue

            href_begin = href_begin + len(href_mark)
            href_end = find(tddata, "'", href_begin)
            if href_end == -1:
                return

            view_url = ("http://secondhand.online.sh.cn/"
                        + tddata[href_begin:href_end])
            view_result = urllib2.urlopen(view_url)
            view_data = view_result.read()
            get_detailed_data(view_data)
            is_really_a_row_dumped = 1

        if is_really_a_row_dumped:  # sometimes a row produces no td output
            dump_row_end()
# ---------------- house parser ---------------- end


def install_proxy():
    httpproxy = config_httpproxy
    username = config_username
    password = config_password
    httpproxystring = 'http://' + username + ':' + password + '@' + httpproxy

    # build a new opener that uses a proxy requiring authorization
    proxy_support = urllib2.ProxyHandler({"http": httpproxystring})
    authinfo = urllib2.HTTPBasicAuthHandler()
    opener = urllib2.build_opener(proxy_support, authinfo, urllib2.HTTPHandler)

    # install it
    urllib2.install_opener(opener)


# ---------------- main ---------------- begin
if __name__ == "__main__":
    # use the proxy if configured
    if config_using_proxy:
        install_proxy()

    # get the first result page (a POST, since form data is supplied)
    f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                        urllib.urlencode(config_post_data))
    s = f.read()

    # parse the HTML page
    #s = '<table><tr><td>data11</td><td>data12</td></tr>' \
    #    '<tr><td>data21</td><td>data22</td></tr></table>'  #debug
    #config_tbl_begin_str = "<table"  #debug
    #config_tbl_end_str = "</table>"  #debug

    # print out the table header
    dump_table_begin()
    # print out the first page
    get_data(s, config_tbl_begin_str, config_tbl_end_str)

    # get the page count from the first page's data
    last_page = get_last_page_number(s)
    # print out the other pages (if any); range() excludes its upper
    # bound, hence last_page + 1
    for i in range(2, last_page + 1):
        config_post_data['whichpage'] = str(i)
        f = urllib2.urlopen("http://secondhand.online.sh.cn/selllist.php",
                            urllib.urlencode(config_post_data))
        s = f.read()
        get_data(s, config_tbl_begin_str, config_tbl_end_str)

    #s = '<table><tr><td>header1</td><td>data1</td></tr></table>'  #debug
    #print get_one_detailed_data(s, "header1")  #debug
    #print get_last_page_number(s)  #debug
# ---------------- main ---------------- end