Common threads: Awk by example, Part 2


Daniel Robbins

President and CEO, Gentoo Technologies, Inc

In this article

In this sequel to his awk introduction, Daniel Robbins continues to explore awk, a great language with a strange name. Daniel shows you how to handle multi-line records, use looping constructs, and create and use awk arrays. By the end of this article, you will be familiar with a wide range of awk features and ready to write your own powerful awk scripts.

Multi-line records

Awk is an excellent tool for reading and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, a colon-delimited text file that contains a lot of important information, including all existing user accounts and user IDs. In my previous article, I showed how easily awk can parse this file. All we had to do was set the FS (field separator) variable to ":".

With FS set correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, setting FS is not enough if we want to parse records that span multiple lines. In those cases, we also need to modify the RS (record separator) variable. RS tells awk where the current record ends and the next one begins.

As an example, let's look at how we would process an address list of Federal Witness Protection Program participants:

Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to treat each three-line address as a single record rather than as three separate records. If awk treated the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as the third field ($3), our code would become very simple. The following code does just what we want:

BEGIN {
    FS="\n"
    RS=""
}

In the code above, setting FS to "\n" tells awk that each field occupies its own line. Setting RS to "" tells awk that records are separated by blank lines. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that parses this address list and prints each address record on a single line, separating the fields with commas.

address.awk

BEGIN {
    FS="\n"
    RS=""
}

{
    print $1 ", " $2 ", " $3
}

If this script is saved as address.awk and the address data is stored in a file called address.txt, you can run it by typing "awk -f address.awk address.txt". The code produces the following output:

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
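To try the blank-line record settings without creating any files, the same idea can be run inline from a shell. This is only a sketch: printf stands in for address.txt, the variable name out is purely illustrative, and the NR and NF annotations are added here to show how many records and fields awk actually sees.

```sh
# Feed two blank-line-separated addresses to awk on standard input.
# The empty '' argument to printf produces the blank line that RS="" keys on.
out=$(printf '%s\n' \
    'Jimmy the Weasel' '100 Pleasant Drive' 'San Francisco, CA 12345' \
    '' \
    'Big Tony' '200 Incognito Ave.' 'Suburbia, WA 67890' |
awk 'BEGIN { FS = "\n"; RS = "" }
{
    # NR is the record number, NF the number of fields in this record.
    print NR ": " $1 ", " $2 ", " $3 " (" NF " fields)"
}')
printf '%s\n' "$out"
```

Each three-line address should come out as one numbered record with three fields.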

OFS and ORS

In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other. We used this feature to insert a comma and a space (", ") between the three address fields on the line. While this method works, it is a bit ugly. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet:

print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:

Hello there Jim!

This shows that by default OFS is set to " ", a single space. However, we can easily redefine OFS so that awk inserts our favorite field separator. Here is a revised version of the original address.awk program that uses OFS to output those ", " strings:

Revised version of address.awk

BEGIN {
    FS="\n"
    RS=""
    OFS=", "
}

{
    print $1, $2, $3
}

Awk also has a special variable called ORS, the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that is automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
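A quick way to see both variables at work is to set them in a BEGIN block and print a few literal strings. This is a minimal sketch; the "-" separator and the shell variable out are arbitrary choices made for illustration.

```sh
# OFS goes between comma-separated print arguments; ORS ends each print.
out=$(awk 'BEGIN {
    OFS = "-"
    ORS = "\n\n"            # double-space the output
    print "Hello", "there", "Jim!"
    print "a", "b"
}')
printf '%s\n' "$out"
```

The two printed lines are joined with "-" and separated by a blank line. The shell's command substitution strips the trailing newlines from the captured text.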

Multi-line to tabbed

Let's say that we wrote a script that converted our address list to a one-record-per-line, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:

Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

To handle situations like this, our code should take the number of fields in each record into account and print each field in order. Right now, the code only prints the first three fields of the address. Here is some code that does what we want:

Version of address.awk that works with addresses of any number of fields

BEGIN {
    FS="\n"
    RS=""
    ORS=""
}

{
    x=1
    while ( x<NF ) {
        print $x "\t"
        x++
    }
    print $NF "\n"
}

First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which causes the print statement not to output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n". In the main code block, we create a variable called x that holds the number of the field currently being processed. Initially, it is set to 1. Then, we use a while loop (an awk looping construct identical to the while loop in the C language) to print each field followed by a tab character, for every field except the last. Finally, we print the last field and a literal newline; again, since ORS is set to "", print won't output newlines for us. Program output looks like this:

Our intended output. Not pretty, but tab-delimited for easy import into a spreadsheet

Jimmy the Weasel	100 Pleasant Drive	San Francisco, CA 12345
Big Tony	200 Incognito Ave.	Suburbia, WA 67890
Cousin Vinnie	Vinnie's Auto Shop	300 City Alley	Sosueme, OR 76543
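As an aside, the same tab-delimited result can be reached without an explicit loop. The sketch below is an alternative technique, not the article's script: it sets OFS to a tab and assigns a field to itself, which makes awk rebuild the record with OFS between the fields. The sample data is piped in with printf, and the variable name out is illustrative.

```sh
# Alternative sketch: let awk rejoin the fields with OFS instead of looping.
out=$(printf '%s\n' \
    'Cousin Vinnie' "Vinnie's Auto Shop" '300 City Alley' 'Sosueme, OR 76543' \
    '' \
    'Big Tony' '200 Incognito Ave.' 'Suburbia, WA 67890' |
awk 'BEGIN { FS = "\n"; RS = ""; OFS = "\t" }
{
    $1 = $1     # assigning to a field makes awk rebuild $0 using OFS
    print       # ORS is left at its default, so each record ends in a newline
}')
printf '%s\n' "$out"
```

The loop version in the article makes the mechanics visible, while this form is shorter once the role of OFS is understood.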

Looping constructs

We've already seen awk's while loop construct, which is identical to its C counterpart. Awk also has a "do...while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It is similar to the "repeat...until" loops found in other languages. Here is an example:

do...while example

{
    count=1
    do {
        print "I get printed at least once no matter what"
    } while ( count != 1 )
}

Because the condition is evaluated after the code block, a "do...while" loop, unlike a normal while loop, always executes at least once. A normal while loop, on the other hand, never executes if its condition is false when the loop is first encountered.
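The difference is easiest to see when the condition is false from the start. In this small sketch (the variable names and messages are made up for illustration), both loops test the same condition, but only the "do...while" body runs:

```sh
out=$(awk 'BEGIN {
    count = 5

    # Condition is checked first, so this body never runs.
    while (count < 5) {
        print "while body ran"
        count++
    }

    # Condition is checked last, so this body runs exactly once.
    do {
        print "do-while body ran"
        count++
    } while (count < 5)

    print "count is now", count
}')
printf '%s\n' "$out"
```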

For loops

Awk allows you to create for loops, which, like while loops, are identical to their C counterpart:

for ( initial assignment; comparison; increment ) {
    code block
}

The following is a short example:

for ( x = 1; x <= 4; x++ ) {
    print "Iteration", x
}

This code will print:

Iteration 1
Iteration 2
Iteration 3
Iteration 4
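A for loop is also a natural fit for walking over the fields of a record, because NF gives the upper bound. The following sketch, with invented sample input, prints the words of each line in reverse order:

```sh
out=$(printf '%s\n' 'one two three' 'a b' |
awk '{
    line = ""
    sep = ""
    # Count down from the last field (NF) to the first.
    for (i = NF; i >= 1; i--) {
        line = line sep $i
        sep = " "           # every field after the first gets a leading space
    }
    print line
}')
printf '%s\n' "$out"
```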

Break and continue

Again, just like C, awk provides break and continue statements. These statements give you better control over awk's various looping constructs. Here is a code snippet that desperately needs a break statement:

while infinite loop

while (1) {
    print "forever and ever..."
}

Because 1 is always true, this while loop runs forever. Here is a loop that executes only ten times:

break statement example

x=1
while (1) {
    print "Iteration", x
    if ( x == 10 ) {
        break
    }
    x++
}

Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to terminate immediately, and execution continues with the statement after the loop's code block.

The continue statement complements break, and works like this:

x=1
while (1) {
    if ( x == 4 ) {
        x++
        continue
    }
    print "Iteration", x
    if ( x > 20 ) {
        break
    }
    x++
}

This code prints "Iteration 1" through "Iteration 21", except for "Iteration 4". If x equals 4, x is incremented and the continue statement is called, which immediately causes awk to start the next loop iteration without executing the rest of the code block. Like break, the continue statement works in every kind of awk iterative loop. When used in the body of a for loop, continue causes the loop control variable to be incremented automatically. Here is an equivalent for loop:

for ( x = 1; x <= 21; x++ ) {
    if ( x == 4 ) {
        continue
    }
    print "Iteration", x
}

Unlike in our while loop, it isn't necessary to increment x just before calling continue, since the for loop increments x automatically.
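Both statements can be combined in one loop over the fields of a record. The sketch below is illustrative only: the "-" placeholder and the "END" marker are conventions invented for this example, not anything awk itself defines.

```sh
out=$(printf '%s\n' '3 - 4 END 100' '- - 7' |
awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
        if ($i == "-")   continue   # skip a placeholder; the for loop still runs i++
        if ($i == "END") break      # stop scanning the rest of this line
        total += $i
    }
    print "line " NR ": total " total
}')
printf '%s\n' "$out"
```

On the first line the loop adds 3 and 4, then stops at the marker, so the trailing 100 is never counted.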

Arrays

You'll be pleased to know that awk has arrays. However, in awk it is customary to start array indices at 1, rather than 0:

myarray[1]="jim"
myarray[2]=456

When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.

Array iteration

Once an array is defined, awk has a handy mechanism for iterating over its elements, as follows:

for ( x in myarray ) {
    print myarray[x]
}

This code prints every element in the array myarray. When you use this special "in" form of a for loop, awk assigns each existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback: when awk cycles through the array indices, it doesn't follow any particular order. That means there is no way for us to know whether the output of the above code will be:

jim
456

or

456
jim

To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates: you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll look at now.
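When the indices are known to be consecutive integers, one way to get a predictable order is to skip the "in" form and count through the indices with an ordinary for loop. This is a sketch under that assumption; the counter n is something the script has to maintain itself.

```sh
out=$(awk 'BEGIN {
    myarray[1] = "jim"
    myarray[2] = 456
    n = 2                       # element count, tracked by hand

    # Visiting indices 1..n explicitly fixes the output order.
    for (i = 1; i <= n; i++) {
        print i, myarray[i]
    }
}')
printf '%s\n' "$out"
```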

Array index stringiness

In my previous article, I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door to some odd-looking code:

a="1"
b="2"
c=a+b+3

After this code executes, c is equal to 6. Since awk is "stringy", adding the strings "1" and "2" is functionally no different from adding the numbers 1 and 2. In both cases, awk successfully performs the math. Awk's "stringy" nature is pretty intriguing: you may wonder what happens if we use string indices for arrays. For instance, take the following code:

myarr["1"]="Mr. Whipple"
print myarr["1"]

As you might expect, this code prints "Mr. Whipple". But what happens if we drop the quotes around the second "1" index?

myarr["1"]="Mr. Whipple"
print myarr[1]

Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk prints "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indices for its arrays all along!

After learning this odd fact, some of us may be tempted to run some wacky code like this:
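The equivalence can be checked directly with the "in" membership test. The sketch below also shows a caveat that follows from indices being strings: "01" is a different string from "1", so it names a different element. The messages and the counter n are illustrative.

```sh
out=$(awk 'BEGIN {
    myarr["1"] = "Mr. Whipple"

    # The number 1 is converted to the string "1" when used as an index.
    if (1 in myarr)        print "1 and \"1\" name the same element"
    # "01" is a different string, so it is a different index.
    if (!("01" in myarr))  print "\"01\" is a different subscript"

    n = 0
    for (k in myarr) n++
    print "elements:", n
}')
printf '%s\n' "$out"
```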

myarr["name"]="Mr. Whipple"
print myarr["name"]

Not only does this code not raise an error, but it is functionally identical to our previous example, and prints "Mr. Whipple" just as before! As you can see, awk doesn't limit us to pure integer indices; we can use string indices if we want to, without creating any problems. Whenever we use non-integer array indices such as myarr["name"], we are using associative arrays. Technically, awk isn't doing anything different behind the scenes when we use a string index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call them associative arrays: it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)
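A common use of string indices is tallying, where the thing being counted is itself the index. The sketch below counts word occurrences in some invented input. Because "for (word in count)" visits indices in no particular order, the output is piped through sort to make it predictable.

```sh
out=$(printf '%s\n' 'the cat and the hat' 'the end' |
awk '{
    # Use each word as a string index; missing elements start out as 0.
    for (i = 1; i <= NF; i++) count[$i]++
}
END {
    for (word in count) print word, count[word]
}' | LC_ALL=C sort)
printf '%s\n' "$out"
```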

Array tools

When it comes to arrays, awk gives us a lot of flexibility. We can use string indices, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000] but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.

First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:

delete fooarray[1]

And, if you want to see whether a particular array element exists, you can use the special "in" boolean operator as follows:

if ( 1 in fooarray ) {
    print "Ayep! It's there."
} else {
    print "Nope! Can't find it."
}
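The two tools work well together, with one subtlety worth knowing: merely reading an element that doesn't exist creates it with an empty value, whereas the "in" test does not. The following sketch demonstrates this; the helper function state and the element values are made up for the example.

```sh
out=$(awk '
# Hypothetical helper: report whether index k exists in arr.
function state(arr, k) { return (k in arr) ? "present" : "absent" }

BEGIN {
    fooarray[1] = "first"
    fooarray[2] = "second"

    delete fooarray[1]
    print 1, state(fooarray, 1)
    print 2, state(fooarray, 2)

    print 3, state(fooarray, 3)     # testing with "in" does not create it
    tmp = fooarray[3]               # but reading the element does
    print 3, state(fooarray, 3)
}')
printf '%s\n' "$out"
```

For that reason, "in" is the safer existence check when stray empty elements would matter.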

We've covered a lot of ground in this article. Next time, I'll round out your awk knowledge by showing you how to use awk's math and string functions and how to create your own functions. I'll also walk you through the creation of a checkbook balancing program. Until then, I encourage you to write some of your own awk programs and to check out the following resources.

Resources

Read Daniel's Awk by example, Part 1 on developerWorks.
If you'd like a good old-fashioned book, O'Reilly's sed & awk, 2nd Edition is a wonderful choice.
Be sure to check out the comp.lang.awk FAQ. It also contains lots of additional awk links.
Patrick Hartigan's awk tutorial is packed with handy awk scripts.
Thompson's TAWK Compiler compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
The GNU Awk User's Guide is available for online reference.

About the author

Residing in Albuquerque, New Mexico, Daniel Robbins is the President and CEO of Gentoo Technologies, Inc., and the creator of Gentoo Linux, an advanced Linux for the PC, and of the Portage system, a next-generation ports system for Linux. He has also served as a contributing author for the Macmillan books Caldera OpenLinux Unleashed, SuSE Linux Unleashed, and Samba Unleashed. Daniel has been involved with computers in some fashion since the second grade, when he was first exposed to the Logo programming language as well as a potentially dangerous dose of Pac-Man. This probably explains why he has since served as a Lead Graphic Artist at SONY Electronic Publishing/Psygnosis. Daniel enjoys spending time with his wife, Mary, and his new baby daughter, Hadassah. You can contact Daniel at drobbins@gentoo.org.
