Common threads: Sed by example, Part 3

xiaoxiao2021-03-06 97

In this conclusion to the sed series, Daniel Robbins gives you a true taste of the power of sed. After introducing a handful of essential sed scripts, he demonstrates more advanced sed scripting by converting a Quicken .QIF file into a readable text format. The conversion script is not only practical, it also serves as an excellent example of what sed scripts can do.

Muscular sed

In my second sed article, I offered examples that demonstrated how sed works, but very few of them actually did anything particularly useful. In this final article of the series, it's time to change that and put sed to practical use. I'll show you several examples that not only demonstrate the power of sed, but also do some really neat (and handy) things. For example, in the second half of the article, I'll show you how I designed a sed script that converts a .QIF file from Intuit's Quicken financial program into a nicely formatted text file. Before doing that, we'll look at some less complicated yet useful sed scripts.

Text translation

Our first practical script converts UNIX-style text to DOS/Windows format. As you probably know, DOS/Windows text files have a CR (carriage return) and LF (line feed) at the end of each line, while UNIX text has only a line feed. There may be times when you need to move some UNIX text to a Windows system, and a one-line sed script can perform the necessary format conversion.
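The difference between the two conventions is easy to see by dumping the raw bytes of a line. The following is a minimal sketch and not part of the original article: unix_line and dos_line are hypothetical helper names, and od -c is used only to make the otherwise invisible CR and LF characters visible.

```shell
#!/bin/sh
# Hypothetical helpers (not from the article): each prints the word "hello"
# terminated the way the named platform terminates a line.
unix_line() { printf 'hello\n'; }     # UNIX: LF only
dos_line()  { printf 'hello\r\n'; }   # DOS/Windows: CR followed by LF

# od -c prints control characters as backslash escapes, so the extra \r
# at the end of the DOS line becomes visible.
unix_line | od -c
dos_line  | od -c
```

The DOS line is one byte longer than the UNIX line, and that extra byte is the carriage return that a conversion has to add or remove.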

$ sed -e 's/$/\r/' myunix.txt > mydos.txt

In this script, the '$' regular expression matches the end of the line, and the '\r' tells sed to insert a carriage return right before it. Insert a carriage return before the line feed and, presto, each line ends with CR/LF. Note that '\r' is replaced with a CR only when using GNU sed 3.02.80 or later. If you haven't installed GNU sed 3.02.80 yet, see my first sed article for instructions on how to do this.

I can't tell you how many times I've downloaded an example script or some C code, only to find that it's in DOS/Windows format. While many programs don't mind DOS/Windows CR/LF text files, several definitely do. The most notable is bash, which chokes as soon as it encounters a carriage return. The following sed invocation converts DOS/Windows text to trusty UNIX format:

$ sed -e 's/.$//' mydos.txt > myunix.txt

The way this script works is simple: the substitution regular expression matches the last character on the line, which happens to be a carriage return. We replace it with nothing, so it is deleted from the output entirely. If you use this script and notice that the last character of every line of the output has been deleted, you've specified a text file that's already in UNIX format. No need for that!

Here's another handy little script. It reverses the lines in a file, like the "tac" command included with most Linux distributions. The name "tac" may be a bit misleading, because "tac" doesn't reverse the position of characters on a line (left and right), but rather the position of lines in the file (up and down). Running "tac" on the following file:

foo
bar
oni

...produces the following output:

oni
bar
foo

We can do the same thing with the following sed script:

$ sed -e '1!G;h;$!d' forward.txt > backward.txt

You'll find this sed script useful if you're logged in to a FreeBSD system, which doesn't happen to have a "tac" command. Handy as it is, it's also a good idea to know why this script does what it does. Let's dissect it.

Reversal explained

First, this script contains three separate sed commands, separated by semicolons: '1!G', 'h' and '$!d'. Now it's time to get a good understanding of the addresses used for the first and third commands. If the first command were '1G', the 'G' command would be applied only to the first line. However, there is an additional '!' character. This '!' negates the address, meaning that the 'G' command applies to all lines except the first. The '$!d' command is similar. If the command were '$d', the 'd' command would be applied only to the last line of the file (the '$' address is a simple way of specifying the last line). With the '!', however, '$!d' applies the 'd' command to all lines except the last. Now all we need to do is understand what the commands themselves do.

When we run the line-reversal script on the text file above, the first command that gets executed is 'h'. This command tells sed to copy the contents of the pattern space (the buffer that holds the line currently being worked on) to the hold space (a temporary buffer). Then the 'd' command is executed, which deletes "foo" from the pattern space so that it isn't printed after all the commands have been executed for this line.

Now, line two. After "bar" is read into the pattern space, the 'G' command is executed, which appends the contents of the hold space ("foo\n") to the pattern space ("bar\n"), resulting in "bar\nfoo\n" in the pattern space. The 'h' command puts this back in the hold space for safekeeping, and 'd' deletes the line from the pattern space so that it isn't printed.

For the last line, "oni", the same steps are repeated, except that the contents of the pattern space aren't deleted (because of the '$!' before the 'd'), so the contents of the pattern space (three lines) are printed to standard output.

Now it's time to do some powerful data conversion with sed.

sed QIF magic

For the last few weeks, I've been thinking about buying a copy of Quicken to balance my bank accounts. Quicken is a very nice financial program, and it would certainly do the job with flying colors. But after thinking about it, I decided that I could easily write some software to balance my checkbook. After all, I reasoned, I'm a software developer!

I developed a nice little checkbook balancing program (using awk) that calculates my balance by parsing a text file containing all my transactions. After a bit of tweaking, I improved it so that I could keep track of different credit and debit categories, just like Quicken can. But there was one more feature I wanted to add. I recently switched my accounts to a bank with an online Web account interface. One day, I noticed that my bank's Web site allowed me to download my account information in Quicken's .QIF format.

In very little time, I decided that it would be really neat if I could convert this information into text format.

A tale of two formats

Before we look at the QIF format, here's what my checkbook.txt format looks like:

28 Aug 2000    food    -    -      Y    Supermarket    30.94
25 Aug 2000    watr    -    103    Y    Check 103      52.86

In my file, all fields are separated by one or more tabs, with one transaction per line. After the date, the next field lists the type of expense (or "-" if this is an income item). The third field lists the type of income (or "-" if this is an expense item). Then there's a check number field (again, "-" if empty), a transaction cleared field ("Y" or "N"), a comment, and a dollar amount. Now we're ready to look at the QIF format. When I viewed my downloaded QIF file in a text viewer, this is what I saw:

!Type:Bank
D08/28/2000
T-8.15
N
PCHECKCARD SUPERMARKET
^
D08/28/2000
T-8.25
N
PCHECKCARD PUNJAB RESTAURANT
^
D08/28/2000
T-17.17
N
PCHECKCARD SUPERMARKET

After scanning the file, it wasn't hard to figure out the format. Ignoring the first line, the format is as follows:

D<date>
T<transaction amount>
N<check number>
P<description>
^ (this is the field separator)

Starting the process

When you're tackling a significant sed project like this, don't get discouraged. Sed allows you to massage the data gradually into its final form. As you progress, you can keep refining your sed script until the output appears exactly as intended. You don't need to get it exactly right on the first try.

To start off, I first created a file called "qiftrans.sed" and started massaging the data:

1d
/^^/d
s/[[:cntrl:]]//g

The first command, '1d', deletes the first line, and the second command removes those pesky '^' separator lines from the output. The last line removes any control characters that may exist in the file. Since I'm dealing with a foreign file format, I want to eliminate the risk of running into control characters along the way. So far, so good. Now it's time to add some processing punch to this basic script:

1d
/^^/d
s/[[:cntrl:]]//g
/^D/ {
    s/^D\(.*\)/\1\tOUTY\tINNY\t/
    s/^01/Jan/
    s/^02/Feb/
    s/^03/Mar/
    s/^04/Apr/
    s/^05/May/
    s/^06/Jun/
    s/^07/Jul/
    s/^08/Aug/
    s/^09/Sep/
    s/^10/Oct/
    s/^11/Nov/
    s/^12/Dec/
    s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:
}

First, I add a '/^D/' address so that sed begins processing only when it encounters the first character of the QIF date field, 'D'. All of the commands in the curly braces execute in order as soon as sed reads such a line into its pattern space.

The first line in the curly braces transforms a line that looks like this:

D08/28/2000

into one that looks like this:

08/28/2000    OUTY    INNY

Of course, this format isn't perfect yet, but that's OK. We'll gradually refine the contents of the pattern space as we go. The next twelve lines have the net effect of converting the month number into a three-letter abbreviation, and the last line replaces the two slashes in the date with spaces and puts the day in front of the month. We end up with this line:

28 Aug 2000    OUTY    INNY

The OUTY and INNY fields serve as placeholders and will be replaced later. I can't fill them in just yet, because if the dollar amount is negative, I'll want to set OUTY and INNY to "misc" and "-", but if the dollar amount is positive, I'll want to change them to "-" and "inco" respectively. Since the dollar amount hasn't been read yet, I need to use placeholders for the time being.
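The final substitution inside the curly braces, the one that reorders the date, is worth trying on its own. It uses ':' rather than '/' as the delimiter of the 's' command, so the slashes in the date do not have to be escaped, and it swaps the first two captured groups. A minimal sketch, in which reorder_date is a hypothetical wrapper name introduced only for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper around the article's date-reordering substitution.
# Because ':' delimits the s command, '/' is an ordinary character here.
reorder_date() {
    sed -e 's:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:'
}

# Input has the month name first, as left behind by the month substitutions.
printf '%s\n' 'Aug/28/2000' | reorder_date
```

This prints "28 Aug 2000": the day (group 2) moves in front of the month (group 1), while the year and anything following it (group 3) stay where they were.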

Refinement

Now it's time for some further refinement:

1d
/^^/d
s/[[:cntrl:]]//g
/^D/ {
    s/^D\(.*\)/\1\tOUTY\tINNY\t/
    s/^01/Jan/
    s/^02/Feb/
    s/^03/Mar/
    s/^04/Apr/
    s/^05/May/
    s/^06/Jun/
    s/^07/Jul/
    s/^08/Aug/
    s/^09/Sep/
    s/^10/Oct/
    s/^11/Nov/
    s/^12/Dec/
    s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:
    N
    N
    N
    s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/
    s/NUMNUM/-/
    s/NUM\([0-9]*\)NUM/\1/
    s/\([0-9]\),/\1/
}

The last seven lines inside the braces are a bit complicated, so we'll cover them in detail. First, we have three 'N' commands in a row. The 'N' command tells sed to read the next line of input and append it to the current pattern space. The three 'N' commands cause the next three lines to be appended to the current pattern space buffer, and now our line looks like this:

28 Aug 2000    OUTY    INNY    \nT-8.15\nN\nPCHECKCARD SUPERMARKET

(Each '\n' here marks a newline embedded in the pattern space.) Sed's pattern space has gotten ugly. We need to remove the extra newlines and perform some additional formatting. To do this, we'll use the substitution command. The pattern we want to match is:

'\nT.*\nN.*\nP.*'

This matches a newline, followed by a 'T', followed by zero or more characters, followed by a newline, followed by an 'N', followed by any number of characters and a newline, followed by a 'P', followed by any number of characters. Phew! This regexp matches the entire contents of the three lines we just appended to the pattern space. But we want to reformat this region, not replace it entirely. The dollar amount, the check number (if any) and the description need to reappear in the replacement string. To do this, we surround those "interesting parts" with backslashed parentheses, so that we can refer to them in the replacement string (using '\1', '\2' and '\3' to tell sed where to insert them). Here is the final command:

s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/

This command transforms our line into:

28 Aug 2000    OUTY    INNY    NUMNUM        Y        CHECKCARD SUPERMARKET    AMT-8.15AMT

While this line is getting better, a few things look a bit... er... interesting at first glance. The first is that silly "NUMNUM" string. What purpose does it serve? You'll find out by looking at the next two lines of the sed script, which replace "NUMNUM" with "-" and replace "NUM"<number>"NUM" with <number>. As you can see, surrounding the check number with a silly tag lets us conveniently insert a "-" when the field is empty.

Finishing touches

The last line removes the first comma that follows a digit. This converts dollar amounts like "3,231.00" to "3231.00", which is the format I use.
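The tag trick and the comma removal described above can be checked in isolation. The following sketch feeds two hand-made lines through the same three substitutions. The name clean_fields is hypothetical, and the sample lines are invented for illustration rather than taken from a real QIF file.

```shell
#!/bin/sh
# Hypothetical wrapper: applies the article's three clean-up substitutions.
clean_fields() {
    sed -e 's/NUMNUM/-/' \
        -e 's/NUM\([0-9]*\)NUM/\1/' \
        -e 's/\([0-9]\),/\1/'
}

# First line: empty check number.
# Second line: check number 103 and an amount containing a comma.
printf '%s\n' 'NUMNUM Y AMT-8.15AMT' 'NUM103NUM Y AMT-3,231.00AMT' | clean_fields
```

This prints "- Y AMT-8.15AMT" followed by "103 Y AMT-3231.00AMT": an empty tag pair collapses to "-", a tagged number loses its tags, and the comma in the amount disappears.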

The final "QIF to text" script

Now it's time to take a look at the final, production script:

1d
/^^/d
s/[[:cntrl:]]//g
/^D/ {
    s/^D\(.*\)/\1\tOUTY\tINNY\t/
    s/^01/Jan/
    s/^02/Feb/
    s/^03/Mar/
    s/^04/Apr/
    s/^05/May/
    s/^06/Jun/
    s/^07/Jul/
    s/^08/Aug/
    s/^09/Sep/
    s/^10/Oct/
    s/^11/Nov/
    s/^12/Dec/
    s:^\(.*\)/\(.*\)/\(.*\):\2 \1 \3:
    N
    N
    N
    s/\nT\(.*\)\nN\(.*\)\nP\(.*\)/NUM\2NUM\t\tY\t\t\3\tAMT\1AMT/
    s/NUMNUM/-/
    s/NUM\([0-9]*\)NUM/\1/
    s/\([0-9]\),/\1/
    /AMT-[0-9]*.[0-9]*AMT/b fixnegs
    s/AMT\(.*\)AMT/\1/
    s/OUTY/-/
    s/INNY/inco/
    b done
:fixnegs
    s/AMT-\(.*\)AMT/\1/
    s/OUTY/misc/
    s/INNY/-/
:done
}

The ten additional lines use substitution and some branching to perfect the output. We'll take a look at this line first:

/AMT-[0-9]*.[0-9]*AMT/b fixnegs

This line contains a branch command of the form "/regexp/b label". If the pattern space matches the regexp, sed branches to the fixnegs label. You should be able to spot this label easily: it appears as ":fixnegs" in the code. If the regexp doesn't match, processing continues as normal with the next command.

Now that you understand how the command itself works, let's look at the branches. If you look at the branch regular expression, you'll see that it matches the string 'AMT', followed by a '-', followed by any number of digits, a '.', any number of digits, and 'AMT'. (Strictly speaking, the unescaped '.' matches any single character, which includes the decimal point.) As I'm sure you've figured out, this regexp deals specifically with negative dollar amounts. Earlier, we surrounded the dollar amount with 'AMT' strings so that we could find it easily later. Because the regexp matches only dollar amounts that begin with a '-', the branch happens only when we're dealing with a debit. If we are dealing with a debit, OUTY should be set to 'misc', INNY should be set to '-', and the negative sign in front of the debit amount should be removed. If you follow the flow of the code, you'll see that this is exactly what happens. If the branch isn't executed, OUTY is replaced with '-' and INNY is replaced with 'inco'. We're finished! Our output line is now perfect:

28 Aug 2000    misc    -    -    Y    CHECKCARD SUPERMARKET    8.15

Don't get confused

As you can see, converting data with sed isn't all that hard, as long as you approach the problem incrementally. Don't try to do everything with a single sed command, or all at once. Instead, work your way gradually toward the goal, and keep enhancing your sed script until the output looks just the way you want it to. Sed packs a lot of punch, and I hope that you've become very familiar with its inner workings and that you'll continue to grow in your sed mastery!
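As a final check of the branching logic described above, the branch can be exercised on its own with one debit and one credit. The sketch below reuses the article's commands on two hand-made lines. The name classify_amounts is hypothetical, the sample lines are invented for illustration, and the '.' in the address is escaped here so that it matches only a literal decimal point, which is a small tightening of the pattern used in the article.

```shell
#!/bin/sh
# Hypothetical wrapper around the article's branch logic. Input lines are
# expected to look like "OUTY INNY AMT<amount>AMT".
classify_amounts() {
    sed -e '
/AMT-[0-9]*\.[0-9]*AMT/b fixnegs
s/AMT\(.*\)AMT/\1/
s/OUTY/-/
s/INNY/inco/
b done
:fixnegs
s/AMT-\(.*\)AMT/\1/
s/OUTY/misc/
s/INNY/-/
:done
'
}

# One debit (negative amount) and one credit (positive amount).
printf '%s\n' 'OUTY INNY AMT-8.15AMT' 'OUTY INNY AMT250.00AMT' | classify_amounts
```

This prints "misc - 8.15" for the debit and "- inco 250.00" for the credit. The 'b done' command is what keeps the credit path from falling through into the commands under ':fixnegs'.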

Resources

Read Daniel's first two sed articles on developerWorks: Common threads: Sed by example, Part 1 and Part 2.
Check out Eric Pement's sed FAQ.
The sed 3.02 sources can be found at ftp.gnu.org.
The newer sed 3.02.80 can be found at alpha.gnu.org.
Eric Pement also has a handy list of sed one-liners that any aspiring sed guru should look at.
If you'd like a good old-fashioned book, O'Reilly's sed & awk, 2nd Edition would be an excellent choice.
You may want to read the 7th Edition UNIX sed man page (circa 1978!).
Take Felix von Leitner's short sed tutorial.
Brush up on regular expressions with the free, developerWorks-exclusive tutorial Using regular expressions.

When reposting, please cite the original address: https://www.9cbs.com/read-91296.html
