The Evolution of Perl Email Handling
Translation: CHUNZI, China Perl Association FPC (Foundation of Perlchina)
Original title: The Evolution of Perl Email Handling
Author: Simon Cozens
Original: http://www.perl.com/pub/a/2004/06/10/email.html
Published: June 10, 2004
PerlChina reminds you: please respect the author's copyright and the fruits of the author's labor.
I spend a large part of every day on email, whether keeping in touch with work through it or pursuing my interest in analyzing, indexing, reorganizing, and mining email content. Naturally, Perl helps me with all of this.
There are many ready-made modules on CPAN for processing email, and we will take a quick tour of the major ones. Along the way we will pay particular attention to the Perl Email Project, an effort involving Richard Clamp, Simon Wistow, and other collaborators, whose goal is to provide a range of simple, efficient, and accurate email-handling modules.

Message Handling
We will begin with the simpler modules: those that represent a single email, give you access to its headers and body, and usually let you modify them.
The granddaddy of all these modules is Mail::Internet, created by Graham Barr and now maintained by Mark Overmeer. Its constructor reads a message from either an array of lines or a filehandle and returns a Mail::Internet object representing that message. In the examples that follow, the variable $rfc2822 holds the message as a string.
    my $obj = Mail::Internet->new([ split /\n/, $rfc2822 ]);
Mail::Internet splits the message into a header object, of class Mail::Header, and a body. You can get or set individual headers through the header object:
    my $subject = $obj->head->get("Subject");
    $obj->head->replace("Subject", "New Subject");
To read or edit the message body, use the body method:
    my $old_body = $obj->body;
    $obj->body("Wasn't worth reading anyway.");
I have not said anything about MIME so far. Mail::Internet is certainly convenient for simple tasks, but it does not fully support MIME. Thankfully, MIME::Entity is a MIME-aware subclass of Mail::Internet that lets you read each individual part of a MIME message:
    my $num_parts = $obj->parts;
    for (0 .. $num_parts - 1) {    # parts are indexed from 0
        my $part = $obj->parts($_);
        ...
    }
If neither Mail::Internet nor MIME::Entity suits you, you can try Mark Overmeer's own Mail::Message module, part of the impressive Mail::Box suite. Mail::Message is extremely featureful and comprehensive, though that is not always meant as praise.
Mail::Message objects are usually constructed internally by Mail::Box when you read a mail folder, but you can also build one from a single message with the read method:
    $obj = Mail::Message->read($rfc2822);
As with Mail::Internet, the message is split into a header and a body; unlike Mail::Internet, the body is an object too. We read a header like this:
    $obj->head->get("Subject");
Or, for Subject and other common headers, simply:
    $obj->subject;
I could not find a way to set a header directly, so I ended up doing this:
    $obj->head->delete($header);
    $obj->head->add($header, $_) for @data;
Reading the body as a string is only slightly more trouble:
    $obj->decoded->string;
Setting the body, however, is an absolute nightmare: we have to build a new Mail::Message::Body object and replace the existing one with it.
    $obj->body(Mail::Message::Body->new(data => [ split /\n/, $body ]));
Mail::Message can be slow, and it is certainly hard to use. It is also very complex: the operations above involved 16 classes (Mail::Address, Mail::Box::Parser, Mail::Box::Parser::Perl, Mail::Message, Mail::Message::Body, Mail::Message::Body::File, Mail::Message::Body::Lines, Mail::Message::Body::Multipart, Mail::Message::Body::Nested, Mail::Message::Construct, Mail::Message::Field, Mail::Message::Field::Fast, Mail::Message::Head, Mail::Message::Head::Complete, Mail::Message::Part, and Mail::Reporter) and more than 4,400 lines of code. It does have a lot of features, but I still think parsing email ought to be simpler than that. So I sat down and wrote my own mail-handling library. The result is the Email::Simple module, whose interface looks like this:
    my $obj = Email::Simple->new($rfc2822);
    my $subject = $obj->header("Subject");
    $obj->header_set("Subject", "A New Subject");
    my $old_body = $obj->body;
    $obj->body_set("A New Body\n");
    print $obj->as_string;
It does not do much, but what it does, it does simply and efficiently. If you need MIME handling, you can use its subclass Email::MIME, which adds the parts method.
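To illustrate why a basic parser can stay small, here is a minimal sketch in core Perl of the underlying idea: the headers end at the first empty line, and folded header lines are joined back together. The helpers split_message and first_header are hypothetical names invented for this illustration; this is not Email::Simple's actual implementation, and it ignores many details of RFC 2822.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Email::Simple: split a raw message
# into a list of [name, value] header pairs and a body string.
sub split_message {
    my ($rfc2822) = @_;

    # The header block ends at the first empty line; the rest is the body.
    my ($head, $body) = split /\r?\n\r?\n/, $rfc2822, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    # Unfold continuation lines (a line break followed by whitespace).
    $head =~ s/\r?\n[ \t]+/ /g;

    my @headers;
    for my $line (split /\r?\n/, $head) {
        # Skip anything that does not look like "Name: value".
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        push @headers, [ $name, $value ];
    }
    return (\@headers, $body);
}

# Hypothetical helper: case-insensitive lookup of the first matching header.
sub first_header {
    my ($headers, $wanted) = @_;
    for my $pair (@$headers) {
        return $pair->[1] if lc($pair->[0]) eq lc($wanted);
    }
    return;
}

my $message = "From: a\@example.com\nSubject: Hello\n  world\n\nBody text\n";
my ($headers, $body) = split_message($message);
print first_header($headers, "subject"), "\n";
print $body;
```

Keeping the headers as an ordered list of pairs, rather than a hash, preserves repeated headers such as Received, which is one reason a real module needs a little more care than this sketch.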
Ideally, the choice of mail-handling library would be entirely up to you, the end user, but that is not always the case. Many helper modules work with mail at a higher level, and they may require you to supply messages in one specific representation. For example, until recently the Mail::ListDetector module (which we will look at later) required you to pass it Mail::Internet objects, because that gave it a known API to work with. I did not want to use Mail::Internet objects, but I did need Mail::ListDetector's features. What could I do?
To give users that choice back, I wrote an abstraction layer over the interfaces above, called Email::Abstract. Given an object of any of those types, we can say:
    my $subject = Email::Abstract->get_header($obj, "Subject");
    Email::Abstract->set_header($obj, "Subject", "My New Subject");
    my $body = Email::Abstract->get_body($obj);
    Email::Abstract->set_body($obj, "Hello\nTest Message\n");
    $rfc2822 = Email::Abstract->as_string($obj);
Email::Abstract knows how to perform these operations on all the major mail representations. It also abstracts the process of constructing a message, and lets you change a message object's interface with the cast class method:
    my $obj = Email::Abstract->cast($rfc2822, "Mail::Internet");
    my $mm = Email::Abstract->cast($obj, "Mail::Message");
This lets module authors write mail-handling libraries in an interface-agnostic way. I am very grateful to Michael Stevens for adopting Email::Abstract in Mail::ListDetector. Now I can pass Email::Simple objects to Mail::ListDetector, and it works very well.
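The general idea of such an abstraction layer can be sketched with a dispatch table keyed on the object's class. The sketch below is only an illustration under stated assumptions: the classes Toy::FlatStyle and Toy::HeadStyle and the function get_header are hypothetical stand-ins invented here, and nothing in it reflects how Email::Abstract is actually implemented.

```perl
use strict;
use warnings;

# Two hypothetical toy message classes with different interfaces.
package Toy::FlatStyle;
sub new    { my ($class, %h) = @_; my $self = { %h };           return bless $self, $class }
sub header { my ($self, $name) = @_; return $self->{$name} }

package Toy::HeadStyle;
sub new  { my ($class, %h) = @_; my $self = { head => { %h } }; return bless $self, $class }
sub head { return $_[0]{head} }

package main;

# One adapter per class: each knows how to fetch a header
# from its own kind of object.
my %adapter = (
    'Toy::FlatStyle' => sub { my ($obj, $name) = @_; return $obj->header($name) },
    'Toy::HeadStyle' => sub { my ($obj, $name) = @_; return $obj->head->{$name} },
);

# A single generic entry point that hides the differences.
sub get_header {
    my ($obj, $name) = @_;
    my $fetch = $adapter{ ref($obj) }
        or die "No adapter for " . (ref($obj) || 'plain scalar') . "\n";
    return $fetch->($obj, $name);
}

for my $obj (Toy::FlatStyle->new(Subject => "one"),
             Toy::HeadStyle->new(Subject => "two")) {
    print ref($obj), ": ", get_header($obj, "Subject"), "\n";
}
```

Code written against get_header never needs to know which representation it was handed, which is the property that lets a higher-level module accept whatever object its caller prefers.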
Email::Abstract also gives us a chance to benchmark all of these modules. Here is the test code I used:
    use Email::Abstract;
    # Slurp the sample message from the DATA section.
    my $message = do { local $/; <DATA> };
    my @classes =
        qw(Email::MIME Email::Simple MIME::Entity Mail::Internet Mail::Message);
    # Load every class under test, dying on the first failure.
    eval "require $_" or die $@ for @classes;
    use Benchmark;
    my %h;
    for my $class (@classes) {
        $h{$class} = sub {
            my $obj = Email::Abstract->cast($message, $class);
            Email::Abstract->get_header($obj, "Subject");
            Email::Abstract->get_body($obj);
            Email::Abstract->set_header($obj, "Subject", "New Subject");
            Email::Abstract->set_body($obj, "A Completely New Body");
            Email::Abstract->as_string($obj);
        };
    }
    timethese(1000, \%h);
    __DATA__
    ...
I put a short message in the DATA section and ran the same simple operations a thousand times: construct a new message object, read a header, read the body, set the header, set the body, and return the message as a string.
    Benchmark: timing 1000 iterations of Email::MIME, Email::Simple,
    MIME::Entity, Mail::Internet, Mail::Message...
    Email::MIME: 10 wallclock secs ( 7.97 usr +  0.24 sys =  8.21 CPU)
        @ 121.80/s (n=1000)
    Email::Simple:  9 wallclock secs ( 7.49 usr +  0.05 sys =  7.54 CPU)
        @ 132.63/s (n=1000)
    MIME::Entity: 33 wallclock secs (23.76 usr +  0.35 sys = 24.11 CPU)
        @ 41.48/s (n=1000)
    Mail::Internet: 24 wallclock secs (17.34 usr +  0.30 sys = 17.64 CPU)
        @ 56.69/s (n=1000)
    Mail::Message: 20 wallclock secs (17.12 usr +  0.27 sys = 17.39 CPU)
        @ 57.50/s (n=1000)
The Perl Email Project is clearly a success: Email::MIME and Email::Simple run roughly twice as fast as their nearest competitors. It must be stressed, however, that both are very low-level. If you need to do anything more complex than the operations shown here, you should consider one of the older Mail:: modules.

Mailbox Handling
So much for individual messages. Now let's look at how to handle collections of messages, that is, mail folders. We have already mentioned Mail::Box, which is truly the king of folder handling: it supports local and remote folders, editing folders, sorting, and much more. To use it, we first need a Mail::Box::Manager, a factory object for creating Mail::Box objects.
    use Mail::Box::Manager;
    my $mgr = Mail::Box::Manager->new;
Next, we open the folder through the manager:
    my $folder = $mgr->open(folder => $folder_file);
Now we can get at the individual messages as Mail::Message objects:
    for ($folder->messages) {
        print $_->subject, "\n";
    }
Similarly easy to use, though with far fewer features, is what used to be my favorite mailbox handler: the read_mbox function from Mail::Util. Pass it the path of a Unix mbox file and it returns a list of array references, one per message, each holding the lines of that message. That makes the result well suited to feeding into Mail::Internet->new or similar:
    use Mail::Util qw(read_mbox);
    for (read_mbox($folder_file)) {
        my $obj = Mail::Internet->new($_);
        print $obj->head->get("Subject"), "\n";
    }
Both approaches are good at what they do, but there seemed to be room for something between the simplicity of Mail::Util and the functionality of Mail::Box. So the Email Project stepped in again, this time concentrating on Email::Folder and Email::LocalDelivery.
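As background on the mbox format read above, the following core-Perl sketch shows the basic idea of splitting a mailbox on its "From " separator lines into one array of lines per message. The function split_mbox is a hypothetical helper written for this illustration; it is not the code of Mail::Util or Email::Folder, it drops the separator line, and it ignores the quoting rules and edge cases that real mbox variants have.

```perl
use strict;
use warnings;

# Hypothetical helper: split mbox-formatted text into messages,
# each returned as a reference to an array of lines.
sub split_mbox {
    my ($text) = @_;
    my @messages;
    for my $line (split /^/, $text) {    # split into lines, keeping newlines
        # A line beginning with "From " starts a new message.
        # (Simplification: real readers also check the preceding blank line.)
        if ($line =~ /^From /) {
            push @messages, [];
            next;                        # drop the separator line itself
        }
        next unless @messages;           # skip anything before the first separator
        push @{ $messages[-1] }, $line;
    }
    return @messages;
}

my $mbox = join '',
    "From alice\@example.com Thu Jun 10 00:00:00 2004\n",
    "Subject: first\n", "\n", "Hello\n", "\n",
    "From bob\@example.com Thu Jun 10 00:01:00 2004\n",
    "Subject: second\n", "\n", "Hi\n";

for my $lines (split_mbox($mbox)) {
    my ($subject) = grep { /^Subject:/ } @$lines;
    print $subject;
}
```

Each array reference returned here has the same general shape as what the text describes for read_mbox: the lines of one message, ready to hand to a message constructor.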
Email::Folder handles mail folders in mbox and maildir formats, with more formats planned, and it has a very simple interface:
    my $folder = Email::Folder->new($folder_file);
    for ($folder->messages) {
        print $_->header("Subject"), "\n";
    }
By default it returns an Email::Simple object for each message, but you can change that by subclassing. For example, if we want the raw RFC 2822 strings instead, we can do this:
    package Email::Folder::Raw;
    use base 'Email::Folder';
    sub bless_message { my ($self, $rfc2822) = @_; return $rfc2822; }
Perhaps in the future bless_message will use Email::Abstract->cast, so that changing the representation of messages becomes easier and no longer requires a subclass.
The other side of folder handling is writing to a folder, or local delivery. The Email::LocalDelivery module was written to assist Email::Filter. The problem is harder than it sounds, because it has to deal with locking, escaping mail bodies, and the differences between formats such as mbox and maildir. LocalDelivery hides all of this behind a simple interface:
    Email::LocalDelivery->deliver($rfc2822, @mailboxes);
Both Email::LocalDelivery and Email::Folder use the Email::FolderType helper module to determine the type of a mail folder from its file name.

Address Handling
Moving from the abstract level back down to low-level processing, there are plenty of modules for handling email addresses. The old favorite is Mail::Address. A mail address can be broken into several elements: the actual email address, a name phrase, and a comment. For example:
    Example User <...> (Not a real user)
Mail::Address parses such addresses, separating out the name phrase and comment so that you can get at the individual components:
    for (Mail::Address->parse($from_line)) {
        print $_->name, "\t", $_->address, "\n";
    }
Unfortunately, like other mail modules, it is not always as helpful as it tries to be.
    my ($addr) = Mail::Address->parse('"eBay, Inc." <...>');
    print $addr->name;    # Inc. eBay
That result is still hard to accept, even though it is better than the "Inc eBay" returned by older versions. So Casey West joined us and created the Email::Address module. It has the same interface as Mail::Address and runs about two to three times faster. (Translator's note: for the example above, Email::Address returns "eBay, Inc.". In the author's eyes, it seems, Mail::Address comes off second best.)
Another thing we often need to do is check that an email address is valid. For example, when a user registers on a web site, we need to check that the address they supply can actually receive mail. The Email::Valid module, the original inhabitant of the Email:: namespace before our project rudely moved in around it, does just this. In its simplest form, we can say:
    if (not Email::Valid->address('rt@example.com')) {
        die "Not a valid address";
    }
You can also turn on additional checks, such as making sure the domain has a valid MX record or correcting common mistakes in AOL and CompuServe addresses, as follows:
    if (not Email::Valid->address(-address => 'test@example.com',
                                  -mxcheck => 1)) {
        die "Not a valid address";
    }

Mail Munging
We have our messages; what shall we do with them next? Much of what I do is textual analysis of mail, and three modules are particularly helpful here.
The first is Text::Quoted. It takes the body text of a message (or any other text, really), works out which parts are quotations from other messages, and separates them into a nested data structure. For example, if we have
    $message = <<EOF;
    > foo
    > # Bar
    > baz

    quux
    EOF
then running extract($message) returns the following data structure:
    [
      [
        { text => 'foo', quoter => '>', raw => '> foo' },
        [
          { text => 'Bar', quoter => '> #', raw => '> # Bar' }
        ],
        { text => 'baz', quoter => '>', raw => '> baz' }
      ],
      { empty => 1 },
      { text => 'quux', quoter => '', raw => 'quux' }
    ];
This module is a great help when you want to display the quoted parts of a message in different colors. A similar idea is the Text::Original module, which looks for the start of the original, unquoted content in a message. It also knows how to recognize many kinds of attribution lines. Given:
    $message = <<EOF;
    > Why are there so many different mail modules?
    There's more than one way to do it! Different modules have different
    focuses, and operate at different levels; some lower, some higher.
    EOF
first_sentence($message) returns "There's more than one way to do it!". The Mariachi mailing list archiver uses this technique to give each message in a thread a short summary line.
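To make the idea of separating quoted from original text concrete, here is a small core-Perl sketch. The helpers split_quoter and original_lines are hypothetical names invented for this illustration, not functions from Text::Quoted or Text::Original. The sketch only recognizes ">" as a quote marker, so it is much cruder than those modules, which handle other quoting styles and attribution lines.

```perl
use strict;
use warnings;

# Hypothetical helper: separate the leading run of ">" quote markers
# (the "quoter") from the rest of the line.
sub split_quoter {
    my ($line) = @_;
    my ($quoter, $text) = $line =~ /^((?:\s*>)*)\s*(.*)$/;
    $quoter =~ s/\s+//g;                 # normalise "> >" to ">>"
    return +{ quoter => $quoter, depth => length($quoter), text => $text };
}

# Hypothetical helper: return the non-empty lines that are not quoted at all.
sub original_lines {
    my ($message) = @_;
    my @original;
    for my $line (split /\n/, $message) {
        my $info = split_quoter($line);
        next if $info->{depth} > 0;      # quoted material
        next if $info->{text} eq '';     # blank line
        push @original, $info->{text};
    }
    return @original;
}

my $message = "> foo\n> > bar\n> baz\n\nquux\n";
print "$_\n" for original_lines($message);
```

The depth value is what a display layer would use to pick a color for each level of quoting, and the first element returned by original_lines is a rough stand-in for the first original sentence.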
Speaking of threads, the Mail::Thread module implements Jamie Zawinski's mail threading algorithm, which was first used by Mozilla and has since been adopted by many other mail clients. Mariachi uses it too, and it has recently been updated to use Email::Abstract, so it can handle whatever mail representation you throw at it:
    my $threader = Mail::Thread->new(@mails);
    $threader->thread;            # compute the threads
    for ($threader->rootset) {    # the original mails of each thread
        dump_thread($_);
    }

Mail Filtering
The classic Perl mail filtering tool is Mail::Audit. I have written articles on how to use the Mail::Audit module (http://www.perl.com/pub/a/2001/07/17/mailfiltering.html) and on how to use it in combination with Mail::SpamAssassin (http://www.perl.com/pub/a/2002/03/06/spam.html).
We have mentioned the Mail::ListDetector module several times already. Combined with Mail::Audit, it takes care of a great deal of automatic mail filtering. The Mail::Audit::List plugin uses ListDetector to look for mailing list headers in a message, such as List-Id, X-Mailman-Version, and the like, which identify the message as having come through a mailing list. This means I can file all mailing list traffic into per-list folders, like this:
    my $list = Mail::ListDetector->new($obj);
    if ($list) {
        my $name = $list->listname;
        $item->accept("mail/$name.-$date");
    }
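The header check described above can be illustrated with a few lines of core Perl. The function list_id below is a hypothetical helper written for this sketch; it is not how Mail::ListDetector works internally, and it looks only at the List-Id header, whereas the real module recognizes several kinds of list software.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of the first List-Id header in a
# raw message, or undef when the message carries none.
sub list_id {
    my ($rfc2822) = @_;
    my ($head) = split /\r?\n\r?\n/, $rfc2822, 2;    # look at the headers only
    return unless defined $head;
    $head =~ s/\r\n/\n/g;                            # normalise line endings
    $head =~ s/\n[ \t]+/ /g;                         # unfold continuation lines
    return $1 if $head =~ /^List-Id:[ \t]*(.*?)[ \t]*$/mi;
    return;
}

my $message = "From: list\@example.com\n"
            . "List-Id: Example list\n <example.lists.example.com>\n"
            . "\nHello\n";
my $id = list_id($message);
print defined $id ? "list mail: $id\n" : "not list mail\n";
```

A filter could use the returned value to choose a destination folder, in the same spirit as the listname call in the example above.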
However, Mail::Audit is showing its age, so if you are setting up a new system, we encourage you to use the Email Project's alternative, Email::Filter. Most of its interface is the same, although its feature set is not identical. In pursuit of simplicity and speed, it uses the new Email::Simple as its message representation.

Mail Mining
Finally, the more advanced work I do is developing application frameworks that automatically categorize, organize, and index email in a database, and that try to analyze it and extract valuable information.
My first module toward that goal was Mail::Miner, which consists of three main parts. The first part takes an email, detaches any attachments, and stores everything in the database. The second part looks over the email and runs a series of recognizer modules, searching for addresses, phone numbers, keywords, phrases, and so on, and stores what it finds in a separate database table. The third part is a command-line tool for querying the database for messages and related information.
For example, if I need to find Tim O'Reilly's postal address, I use the query tool, mm, to look for addresses in the mail he has sent:
    % mm --from "Tim O" --address
    Address found in message 1835 from "Tim O'Reilly" <...>:
    Tim O'Reilly @ O'Reilly & Associates, Inc.
    1005 Gravenstein Highway North, Sebastopol, CA 95472
If I then want the complete message, I can say
    % mm --id 1835
If the message originally contained an attachment, we would see something like this:
    [ text/xml attachment something.xml detached - use
      mm --detach 208
    to recover ]
I paste the middle line, mm --detach 208, into the shell, and in no time something.xml is written to disk.
Mail::Miner is all very well, but it bundles three ideas tightly into one package (archiving mail, mining mail data, and a command-line interface for querying the database), which makes it hard to develop or extend any one of them. And, of course, it uses the old-school Mail:: namespace.
This brings us to the last stop on our tour of mail modules and the latest release: the Email::Store module. It is a framework based on Class::DBI for storing mail in a database and indexing it in a variety of ways:
    use Email::Store 'dbi:SQLite:mail.db';
    Email::Store->setup;
    Email::Store::Mail->store($rfc2822);
And later...
    my ($name) = Email::Store::Name->search(name => "Simon Cozens");
    @mails_from_simon = $name->addressings->mails;
It can be used to build a mailing list archiver like Mariachi, or for data mining in the style of Mail::Miner. It is still at an early stage of development, and it tries out some new ideas about extending a module with plugins. Once we have written the first mail archiving and search tool with Email::Store, I will describe it in detail. It is also intended to power the new Perl mailing list archives at perl.org.

Summary
We have looked at several of the major email-handling modules on CPAN, and of course there are many more. Obviously, I am biased toward the modules I wrote myself, in particular the Perl Email Project modules, which live in the Email::* namespace. We designed them specifically to be simple and efficient, but they are not always a good replacement for the more comprehensive Mail::* modules, Mail::Box in particular. I hope that, having read this article, you will feel well prepared the next time you use Perl to process email.