Treasure hunt on the web page

xiaoxiao2021-03-06 60

Know yourself and know your enemy

While interacting with the Yahoo China server, we constantly need to analyze web pages and extract information to prepare for the next step. If a piece of information cannot be extracted from a page, the program has no way to proceed.

The first piece of information needed is the challenge value. It appears in http://cn.mail.yahoo.com/page (in fact not only there: the Yahoo China home page, from which you can also log in to the Yahoo China mailbox, contains it as well). The second is the inbox URL of the mailbox, which appears on the page shown right after you enter the mailbox. Finally, all mail information is collected from the inbox page and used to compute the attachment download addresses.

To do good work, first sharpen your tools

Bqyahoo uses regular expressions to analyze web pages. They find the desired information quickly and accurately, and they also keep the program robust and adaptable.

Not many regular expression libraries meet all of the following requirements:

- Runs on the Win32 platform
- Written for C/C++
- Compatible with the VC compiler

In the end I chose the regular expression class in ATL Server. Originally I wanted to follow the trend and use the regular expressions in the Boost library, but I never managed to get the Boost library working, so I gave up on that approach.

ATL Server is a set of libraries provided with ATL 7.0, and it includes regular expression support. Its main class is CAtlRegExp. For more information about this class, see MSDN.

(A brief introduction to using CAtlRegExp could be added here, depending on the situation.)

The key

In the previous article we mentioned that the challenge value is the indispensable "key" for generating the login URL. This "key" is easy to find once you know the context in which it appears in the page source.

Bqyahoo uses the following code to get the challenge value.

    bool XRegex::getchallengetext(char response[], XData* xData) // Get the challenge value used to generate the login URL
    {
        CAtlRegExp<> rect;
        rect.Parse("{challenge}[^<>]+value=\"{[^<>]+}\"");
        CAtlREMatchContext<> mcct;
        if (rect.Match(response, &mcct) == false) return false; // response stores the web page source
        const CAtlREMatchContext<>::RECHAR* szStart = 0;
        const CAtlREMatchContext<>::RECHAR* szEnd = 0;
        mcct.GetMatch(1, &szStart, &szEnd); // group 1 (zero-based) is the challenge value
        memset(xData->challengetext, 0, 40);
        strncpy(xData->challengetext, szStart, szEnd - szStart);
        return true;
    }

The focus of this code is:

    rect.Parse("{challenge}[^<>]+value=\"{[^<>]+}\"");

It describes how to match the challenge value, and it relies on the following three features (you can write a different pattern based on the same features):

- It is preceded by the string: challenge" value="
- It is followed by the string: ">
- The challenge value itself does not contain the characters "<" or ">"
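The three features above can be sketched with the standard C++ std::regex instead of CAtlRegExp, which makes the idea easy to try outside ATL. This is only an illustration under assumptions: the function name extractChallenge is hypothetical and not part of Bqyahoo, std::regex marks groups with ( ) where ATL uses { }, and the capture group additionally excludes the double quote so that it stops at the closing quote.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from Bqyahoo): returns the challenge value found
// in `html`, or an empty string if the page does not contain one.
std::string extractChallenge(const std::string& html) {
    // "challenge", then a run of characters other than '<' and '>',
    // then value="...". Group 1 is the challenge value itself.
    static const std::regex re("challenge[^<>]+value=\"([^<>\"]+)\"");
    std::smatch m;
    if (!std::regex_search(html, m, re)) return std::string();
    return m[1].str();
}
```

Called on a page fragment that contains a hidden field of this shape, the function returns only the text between the quotes after value=. For a page without the word "challenge" it returns an empty string, which plays the role of the "return false" branch in the original code.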

In the rest of this article you will see that, although the details differ, extracting a piece of information always comes down to matching it by its context. A good pattern finds the information quickly and accurately. There are no fixed rules for writing one; it depends entirely on personal experience.

Moving on

After logging in, we get the source of the mailbox page. The most important thing in this source is the inbox URL. It could be found directly in the source, just like the challenge value. Bqyahoo, however, takes an indirect route: it first gets the URL of the mail server, which looks like http://cn.f**.mail.yahoo.com/, and then derives the inbox URL from it. Both methods work, but the mail server URL will be needed again later, so it is worth getting it now.

The following code gets the mail server URL:

    bool XRegex::getmailandHosturl(const char response[], XData* xData) // Get the mail server URL
    {
        char p[200];
        memset(p, 0, 200);
        strcpy(p, "{http://cn\\.[^\\.]+\\.mail\\.yahoo\\.com/}");
        CAtlRegExp<> rect;
        CAtlREMatchContext<> mcct;
        CAtlREMatchContext<>::MatchGroup mg;
        rect.Parse(p);
        char s[100];
        if (!rect.Match(response, &mcct)) return false;
        const CAtlREMatchContext<>::RECHAR* szStart = 0;
        const CAtlREMatchContext<>::RECHAR* szEnd = 0;
        mcct.GetMatch(0, &szStart, &szEnd);
        memset(s, 0, 100);
        strncpy(s, szStart, szEnd - szStart);
        xData->mailhosturl = s; // mail server URL
        xData->logouturl = xData->mailhosturl + "ym/logout"; // mailbox logout URL
        memset(p, 0, 200);
        strcpy(p, "{http://cn\\.[^\\.]+\\.mail\\.yahoo\\.com/[^\"]+}");
        rect.Parse(p);
        if (!rect.Match(response, &mcct)) return false;
        mcct.GetMatch(0, &szStart, &szEnd);
        memset(s, 0, 100);
        strncpy(s, szStart, szEnd - szStart);
        xData->mailurl = s;
        return true;
    }

The key to this code is:

    strcpy(p, "{http://cn\\.[^\\.]+\\.mail\\.yahoo\\.com/}");

Its matching approach should be easy to see.
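For readers who want to experiment with this pattern without ATL, here is a minimal sketch using std::regex. The helper name extractMailHost is an assumption made for illustration, and the syntax is ECMAScript rather than ATL's, so treat it as an approximation of the idea and not as Bqyahoo's code.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first URL of the form
// http://cn.<server>.mail.yahoo.com/ in `html`; empty string if none.
std::string extractMailHost(const std::string& html) {
    // Literal "http://cn.", one server label without dots or slashes,
    // then the literal ".mail.yahoo.com/".
    static const std::regex re("http://cn\\.[^./]+\\.mail\\.yahoo\\.com/");
    std::smatch m;
    if (!std::regex_search(html, m, re)) return std::string();
    return m[0].str();
}
```

Because the pattern requires a server label between "cn." and ".mail", the login page address http://cn.mail.yahoo.com/ does not match; only per-server addresses such as http://cn.f157.mail.yahoo.com/ do.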

Once we have the mail server URL, the inbox URL is easy to derive from a fixed pattern. The following code fetches the content of the inbox page:

    bool WebClient::IninBox(const string mailhosturl)
    {
        string inboxurl;
        inboxurl = mailhosturl + "ym/ShowFolder?rb=Inbox&box=Inbox";
        hConnect = ::InternetOpenUrl(hSession, inboxurl.c_str(), NULL, NULL, CONNECT_FLAG, NULL);
        GetResponse();
        GetHead();
        xlog.log("Enter Inbox"); // record log information
        xlog.log(head);
        xlog.log(response);
        if (TestResponse() == false) return false;
        return true;
    }

In it, the line

    inboxurl = mailhosturl + "ym/ShowFolder?rb=Inbox&box=Inbox";

clearly expresses the relationship between the mail server URL and the inbox URL.
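The relationship can also be written as two small pure functions, which keeps the URL rules in one place. This is a sketch under assumptions: the names withSlash, inboxUrl and logoutUrl are hypothetical, and the trailing-slash normalization is an addition that the original code does not perform (it relies on the matched server URL already ending with "/").

```cpp
#include <string>

// Hypothetical helper: makes sure the server URL ends with '/'.
std::string withSlash(std::string host) {
    if (host.empty() || host.back() != '/') host += '/';
    return host;
}

// Inbox page URL derived from the mail server URL.
std::string inboxUrl(const std::string& mailHost) {
    return withSlash(mailHost) + "ym/ShowFolder?rb=Inbox&box=Inbox";
}

// Logout URL derived from the mail server URL.
std::string logoutUrl(const std::string& mailHost) {
    return withSlash(mailHost) + "ym/logout";
}
```

With the server URL http://cn.f157.mail.yahoo.com/ the first function yields http://cn.f157.mail.yahoo.com/ym/ShowFolder?rb=Inbox&box=Inbox, whether or not the caller supplies the final slash.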

Treasure

So far the page analysis has been simple: we only looked for individual pieces of information in the page source, each of them unique and rarely changing. The inbox page source, however, requires the most complex analysis.

The inbox page source is the last page source we fetch from the Yahoo server; all the information needed to download the files is on this page. From it we will eventually work out the download addresses of all the files that correspond to the download code in Bqyahoo.

First look at a fragment of the page source:

    href="/ym/ShowLetter?MsgId=6240_2993_1101_1232_369_0_373_-1_0&id=3&YY=26008&incy=25&order=down&sort=date&pos=0&view=a&head=b&box=Inbox">Test.part0

Here Test.part0 is the subject of the message, and MsgId=6240_2993_1101_1232_369_0_373_-1_0 is the identifier of the message (to the mail server, it denotes one specific message).

The address of the message itself is of no use to us, but from it we can compute the download address of the message's attachment. Below is the download address of the first attachment of this message (in Bqyahoo, each message carries only one attachment).

http://cn.f157.mail.yahoo.com/ym/ShowLetter/123.txt?box=Inbox&MsgId=6240_2993_1101_1232_369_0_373_-1_0&bodyPart=2&filename=123.txt&download=1&YY=82952&order=down&sort=date&pos=0&view=a&head=b

MsgId=6240_2993_1101_1232_369_0_373_-1_0 identifies the message this attachment belongs to, and bodyPart=2 indicates that it is the first attachment. You may ask how I know this. Obviously Yahoo China would not tell me, and it is not in the YahooPOPs! source code either (YahooPOPs! uses other methods to collect messages and attachments). In fact I guessed, and fortunately I guessed right; testing confirmed it.

As with the inbox URL, there are two ways to obtain an attachment's download URL: direct and indirect. The direct way is to have Bqyahoo open the message page and read the attachment download address from it. This is clearly inefficient: to fetch the attachments of five messages, you need five more round trips to the Yahoo China server. The indirect way is to compute the attachment URL from the message URL. This is more efficient, but it has a big drawback: you cannot learn the attachment's file name, that is, the filename=123.txt part is unavailable. In the end I chose to compute the URL and to solve the file name problem by other means, which the next article explains in detail.

Enough talk; let's look at the source code. The first part obtains the description file.

    bool XRegex::getDescribeURL(const char response[], XData* xData) // Get the URL of the description file
    {
        char p[200];
        memset(p, 0, 200);
        sprintf(p, "{/ym/ShowLetter}[^<]+MsgId={[0-9_]+}[^<]+{%s.describe}", xData->downloadcode.c_str());
        CAtlRegExp<> rect;
        CAtlREMatchContext<> mcct;
        rect.Parse(p);
        if (rect.Match(response, &mcct) == false) return false;
        const CAtlREMatchContext<>::RECHAR* szStart = 0;
        const CAtlREMatchContext<>::RECHAR* szEnd = 0;
        char i[300];
        char u[300];
        mcct.GetMatch(1, &szStart, &szEnd); // group 1 (zero-based) is the message ID
        memset(i, 0, 300);
        strncpy(i, szStart, szEnd - szStart);
        memset(u, 0, 300);
        sprintf(u, "%sym/ShowLetter?box=Inbox&MsgId=%s&bodyPart=2&order=down", xData->mailhosturl.c_str(), i);
        xData->describeurl = u;
        return true;
    }

The key line

    sprintf(p, "{/ym/ShowLetter}[^<]+MsgId={[0-9_]+}[^<]+{%s.describe}", xData->downloadcode.c_str());

describes how to match the URL of the message that contains the description file.

The key line

    sprintf(u, "%sym/ShowLetter?box=Inbox&MsgId=%s&bodyPart=2&order=down", xData->mailhosturl.c_str(), i);

shows how to compute the attachment URL from the message URL.
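The same computation can be sketched with std::string concatenation, which avoids the fixed-size buffers used with sprintf. The function name attachmentUrl is hypothetical; the parameter names and the bodyPart=2 value follow the download address shown earlier in this article.

```cpp
#include <string>

// Hypothetical helper: builds the download URL of the first attachment
// (bodyPart=2) of the message identified by `msgId`.
// `mailHost` is expected to end with '/', as the matched server URL does.
std::string attachmentUrl(const std::string& mailHost, const std::string& msgId) {
    return mailHost + "ym/ShowLetter?box=Inbox&MsgId=" + msgId
                    + "&bodyPart=2&order=down";
}
```

Note that the file name part of the real download address (ShowLetter/123.txt and filename=123.txt) is deliberately absent, because it cannot be known from the inbox page alone.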

Next, get the other attachments according to the download code:

    void XRegex::getFileInfo(char response[], XData* xData) // Analyze information about the shared files
    {
        char p[200];
        memset(p, 0, 200);
        if (xData->downloadcode.find(".part") != -1)
        {
            sprintf(p, "{/ym/ShowLetter}[^<]+MsgId={[0-9_]+}[^<]+{%s![0-9]}", xData->downloadcode.c_str());
        }
        else
        {
            sprintf(p, "{/ym/ShowLetter}[^<]+MsgId={[0-9_]+}[^<]+{%s.part[0-9]+}", xData->downloadcode.c_str());
        }
        CAtlRegExp<> rect;
        CAtlREMatchContext<> mcct;
        rect.Parse(p);
        char i[300];
        char f[300];
        char u[300];
        char* b = response;
        while (rect.Match(b, &mcct))
        {
            const CAtlREMatchContext<>::RECHAR* szStart = 0;
            const CAtlREMatchContext<>::RECHAR* szEnd = 0;
            mcct.GetMatch(1, &szStart, &szEnd); // message ID
            memset(i, 0, 300);
            strncpy(i, szStart, szEnd - szStart);
            xData->vmsgid.push_back(i);
            mcct.GetMatch(2, &szStart, &szEnd); // file name (the message subject)
            memset(f, 0, 300);
            strncpy(f, szStart, szEnd - szStart);
            xData->vFileName.push_back(f);
            memset(u, 0, 300);
            sprintf(u, "%sym/ShowLetter?box=Inbox&MsgId=%s&bodyPart=2&order=down", xData->mailhosturl.c_str(), i);
            xData->vfiledownloadurl.push_back(u);
            b = (char*)szEnd; // continue searching after this match
        }
    }

The key line

    sprintf(p, "{/ym/ShowLetter}[^<]+MsgId={[0-9_]+}[^<]+{%s.part[0-9]+}", xData->downloadcode.c_str());

describes how to match the URLs of the messages whose attachments belong to a given download code.

The while loop records the URLs of all the attachments that need to be collected.
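The "match, record, continue after the match" loop maps naturally onto std::sregex_iterator. The sketch below is an illustration, not Bqyahoo's code: findParts is a hypothetical name, the download code is assumed to contain no regular expression metacharacters, and the ID character class also admits "-" because the sample message ID shown earlier contains "-1".

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (message ID, file name) for every inbox link
// whose subject is "<code>.part<N>".
std::vector<std::pair<std::string, std::string>>
findParts(const std::string& html, const std::string& code) {
    // Group 1: message ID. Group 2: subject, used as the file name.
    // [^<]+ keeps each match inside a single tag and its link text.
    const std::regex re("/ym/ShowLetter[^<]+MsgId=([0-9_\\-]+)[^<]+("
                        + code + "\\.part[0-9]+)");
    std::vector<std::pair<std::string, std::string>> found;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it) {
        found.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return found;
}
```

Each returned message ID can then be turned into a download URL in the same way as for the description file. Links whose subject belongs to a different download code are skipped, because the subject is part of the pattern.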

At this point, we have finally found the "treasure"!

Summary

This article described how to use regular expressions to collect information from web pages. The use of regular expressions here is quite basic, and only the most common features are used. The real difficulty lies in knowing what information is needed, where it is, and how to find it. Some topics were only touched on superficially and not fully explained, for example what a download code is and why there is a description file. You can find information about these in Bqyahoo's help files, and the next article will also cover them in detail.

When reposting, please cite the original address: https://www.9cbs.com/read-54531.html
