Preemptive Multithreaded Web Spider

Author: Sim Ayers
Translation: Liu Jianqiang

The Win32 API supports preemptive multithreading, which is very useful when writing a web spider in MFC. The Spider project is an example of how to use preemptive multithreading to gather information on the Web with a spider (robot). The project produces a spider-like program that checks Web sites for broken URL links. Link verification is performed only on links specified by HREF. The program displays a continuously updated list of URLs in a list view (CListView) that reports the status of each hyperlink. The project can be used as a template for gathering and indexing information and storing it in a database file that can be queried.

Search engines gather information on the Web using programs called robots (also called crawlers, spiders, or worms), which automatically gather and index information on the Web and then store it in a database. (Note: a robot indexes a page, then uses the links on that page as starting points for new URLs to index.) Users can then build queries against these databases to find the information they need.

With preemptive multithreading, you can index a Web page of URL links and start a new thread to follow each new URL link, which becomes a new starting point for indexing.

The project uses the MDI document class with a custom MDI child frame to display an edit view while downloading a Web page and a list view while checking URL links. In addition, the project uses the CObArray, CInternetSession, CHttpConnection, CHttpFile, and CWinThread MFC classes. The CWinThread class is used to create multiple threads instead of using the asynchronous mode of the CInternetSession class, which is a holdover from Winsock on the 16-bit Windows platform.

The Spider project uses simple worker threads to check URL links or download a Web page.
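Because link verification covers only HREF links, the first step for any checked page is to collect the href values it contains. The following is a minimal sketch of that step in portable C++, not the project's actual code: the function name extractHrefs is hypothetical, it handles only quoted attribute values, and it is a simple text scan rather than a real HTML parser.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of the Spider project): collect the values of
// href="..." and href='...' attributes from a page. Unquoted values are
// skipped in this sketch.
std::vector<std::string> extractHrefs(const std::string& html) {
    // Work on a lowercase copy so that HREF, Href and href all match.
    std::string lower(html);
    for (char& c : lower) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    assert(lower.size() == html.size());  // indices are valid in both strings

    auto skipSpaces = [&lower](std::size_t p) {
        while (p < lower.size() &&
               std::isspace(static_cast<unsigned char>(lower[p]))) {
            ++p;
        }
        return p;
    };

    std::vector<std::string> links;
    std::size_t pos = 0;
    while ((pos = lower.find("href", pos)) != std::string::npos) {
        pos = skipSpaces(pos + 4);
        if (pos >= lower.size() || lower[pos] != '=') continue;
        pos = skipSpaces(pos + 1);
        if (pos >= lower.size()) break;

        const char quote = html[pos];
        if (quote != '"' && quote != '\'') continue;

        const std::size_t end = html.find(quote, pos + 1);
        if (end == std::string::npos) break;

        // Take the value from the original text to preserve its case.
        links.push_back(html.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return links;
}
```

Each string returned this way would become one entry in the URL list and one candidate for a new checking thread.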
The CSpiderThread class is derived from the CWinThread class, so each CSpiderThread object can use CWinThread's MESSAGE_MAP() function. Because DECLARE_MESSAGE_MAP() is declared in the CSpiderThread class, the user interface can still respond to user input. This means that you can check the URL links on one Web server while downloading or opening a Web page from another Web server. The user interface stops responding to user input only when the number of threads exceeds MAXIMUM_WAIT_OBJECTS, which is defined as 64. In the constructor of each CSpiderThread object, we supply the ThreadProc function and the thread parameters that will be passed to the ThreadProc function.
    CSpiderThread* pThread = NULL;
    // Create a new CSpiderThread object.
    pThread = new CSpiderThread(CSpiderThread::ThreadFunc, pThreadParams);

In the CSpiderThread class constructor we set the CWinThread* m_pThread pointer in the thread parameters, so that it points to the correct instance of this thread:

    pThreadParams->m_pThread = this;

The CSpiderThread ThreadProc function:

    // Simple worker thread function.
    UINT CSpiderThread::ThreadFunc(LPVOID pParam)
    {
        ThreadParams* lpThreadParams = (ThreadParams*) pParam;
        CSpiderThread* lpThread = (CSpiderThread*) lpThreadParams->m_pThread;

        lpThread->ThreadRun(lpThreadParams);

        // Use SendMessage instead of PostMessage here to keep the current
        // thread count synchronized. If the number of threads is greater than
        // MAXIMUM_WAIT_OBJECTS (64), the program stops responding to user input.
        // The handler deletes lpThreadParams and decrements the thread count.
        ::SendMessage(lpThreadParams->m_hwndNotifyProgress,
                      WM_USER_THREAD_DONE, 0, (LPARAM) lpThreadParams);

        return 0;
    }

The structure passed to the CSpiderThread ThreadProc function:

    typedef struct tagThreadParams
    {
        HWND          m_hwndNotifyProgress;
        HWND          m_hwndNotifyView;
        CWinThread*   m_pThread;
        CString       m_pszURL;
        CString       m_Contents;
        CString       m_strServerName;
        CString       m_strObject;
        CString       m_checkURLName;
        CString       m_string;
        DWORD         m_dwServiceType;
        DWORD         m_threadID;
        DWORD         m_Status;
        URLStatus     m_pStatus;
        INTERNET_PORT m_nPort;
        int           m_type;
        BOOL          m_RootLinks;
    } ThreadParams;

After the CSpiderThread object has been created, we start the new thread with the CreateThread function:

    if (!pThread->CreateThread())  // Starts execution of the CWinThread object
    {
        // The message text is assumed; it is missing from the source.
        AfxMessageBox(_T("Cannot start new thread"));
        delete pThread;
        pThread = NULL;
        delete pThreadParams;
        return FALSE;
    }

Once the new thread is running, we use the ::SendMessage function to send messages to the document's CListView. Each message carries the status structure of a URL link.
    if (pThreadParams->m_hwndNotifyView != NULL)
        ::SendMessage(pThreadParams->m_hwndNotifyView, WM_USER_CHECK_DONE, 0,
                      (LPARAM) &pThreadParams->m_pStatus);

The URL status structure:

    typedef struct tagURLStatus
    {
        CString m_URL;
        CString m_URLPage;
        CString m_StatusString;
        CString m_LastModified;
        CString m_ContentType;
        CString m_ContentLength;
        DWORD   m_Status;
    } URLStatus, *PURLStatus;

Each new thread creates a new CMyInternetSession object (derived from CInternetSession) and calls EnableStatusCallback with TRUE, so that we can check the status of every Internet session. The dwContext ID of the callback is set to the thread ID.

    BOOL CInetThread::InitServer()
    {
        try
        {
            m_pSession = new CMyInternetSession(AgentName, m_nThreadID);

            // Very important! A value that is too small causes server
            // time-outs; a value that is too large can hang the thread.
            int ntimeOut = 30;

            /* Time-out value, in milliseconds, for Internet connection
               requests. If a connection request takes longer than this value,
               the request is canceled. The default time-out is infinite. */
            m_pSession->SetOption(INTERNET_OPTION_CONNECT_TIMEOUT, 1000 * ntimeOut);

            /* Delay, in milliseconds, to wait between connection retries. */
            m_pSession->SetOption(INTERNET_OPTION_CONNECT_BACKOFF, 1000);

            /* Number of retries for an Internet connection request. If a
               connection attempt still fails after the specified number of
               retries, the request is canceled. The default is 5. */
            m_pSession->SetOption(INTERNET_OPTION_CONNECT_RETRIES, 1);

            m_pSession->EnableStatusCallback(TRUE);
        }
        catch (CInternetException* pEx)
        {
            // Catch errors from WinINet.
            // pEx->ReportError();
            m_pSession = NULL;
            pEx->Delete();
            return FALSE;
        }
        return TRUE;
    }

The key to using the MFC WinINet classes in a single-threaded or multithreaded program is to wrap every MFC WinINet class function call in a try and catch block.
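MFC throws its exceptions by pointer, so a handler has to both catch the pointer and release it with Delete(). The sketch below shows that pattern in portable C++ so it compiles without MFC. NetError is a hypothetical stand-in for CInternetException, and guardedFetch is an illustrative wrapper that turns a thrown error into a FALSE-style result; neither name comes from the Spider project.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for CInternetException: it is thrown by pointer and
// the handler is responsible for releasing it through Delete().
struct NetError {
    int code;
    explicit NetError(int c) : code(c) {}
    void Delete() { delete this; }
};

// Hypothetical wrapper: run one network operation inside try/catch and report
// failure through the return value instead of letting the exception escape
// from the worker thread.
bool guardedFetch(const std::function<std::string()>& fetch,
                  std::string& body, int& errorCode) {
    try {
        body = fetch();   // may throw NetError*
        errorCode = 0;
        return true;
    } catch (NetError* e) {
        assert(e != nullptr);
        errorCode = e->code;  // keep what the caller needs for the status list
        e->Delete();          // release the exception object exactly once
        return false;
    }
}
```

On failure the previous contents of body are left untouched, so the caller can record the error code in the link's status entry and move on to the next URL.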
The Internet is sometimes unstable, and the Web page you request may no longer exist; either case is certain to throw a CInternetException.

    try
    {
        // Some MFC WinINet class function
    }
    catch (CInternetException* pEx)
    {
        // Catch errors from WinINet.
        // pEx->ReportError();
        pEx->Delete();
        return FALSE;
    }

The maximum number of threads is initially set to 64; you can set it to any number from 1 to 100. Setting it too high causes connections to fail, which means you will have to recheck those URL links. A rapid succession of HTTP requests to a /cgi-bin/ directory can bring a server down.
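The thread limit described above can be sketched in portable C++ as a small gate that the launcher passes through before starting each worker and that each worker releases when it finishes, much like the thread-done notification decrements the thread count. This is an illustration under assumptions, not the project's implementation: clampThreadLimit, ThreadGate, and runJobs are hypothetical names, std::thread stands in for CWinThread, and a 1 ms sleep stands in for checking a URL.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: keep a requested limit inside the 1..100 range.
int clampThreadLimit(int requested) {
    return std::min(100, std::max(1, requested));
}

// Hypothetical gate that caps how many workers run at the same time.
class ThreadGate {
public:
    explicit ThreadGate(int limit) : limit_(clampThreadLimit(limit)) {}

    // Called by the launcher before it starts a worker; blocks while the
    // number of running workers has reached the limit.
    void enter() {
        std::unique_lock<std::mutex> lock(mutex_);
        freeSlot_.wait(lock, [this] { return active_ < limit_; });
        ++active_;
        peak_ = std::max(peak_, active_);
    }

    // Called by a worker when it is done; frees one slot.
    void leave() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(active_ > 0);
            --active_;
        }
        freeSlot_.notify_one();
    }

    // Highest number of workers that were running at once.
    int peak() {
        std::lock_guard<std::mutex> lock(mutex_);
        return peak_;
    }

private:
    const int limit_;
    int active_ = 0;
    int peak_ = 0;
    std::mutex mutex_;
    std::condition_variable freeSlot_;
};

// Hypothetical driver: run `jobs` placeholder tasks with at most `limit`
// of them active at once, and return the peak concurrency observed.
int runJobs(int limit, int jobs) {
    ThreadGate gate(limit);
    std::vector<std::thread> workers;
    for (int i = 0; i < jobs; ++i) {
        gate.enter();  // wait for a free slot before starting a thread
        workers.emplace_back([&gate] {
            // Placeholder for checking one URL.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            gate.leave();
        });
    }
    for (std::thread& w : workers) w.join();
    return gate.peak();
}
```

Updating the count under a lock when a worker finishes serves the same purpose as using SendMessage rather than PostMessage for the thread-done notification: the launcher never sees a stale thread count.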