Use AWK to handle binary data files

xiaoxiao, 2021-03-06

This article describes how to combine AWK with other Unix text-processing tools so that a tool designed only for text files can also handle the data in binary data files. AWK is a powerful text formatting and extraction tool on Unix. With it you can reorganize complex text files, extract all or part of the data, and display it however you need. Note, however, that AWK's power applies only to plain text files. Faced with a binary data file full of non-printable characters, AWK on its own is helpless, and we need help from other tools.

Unix provides a tool called od ("display files in octal format"), which can display any file in octal. With different options it can also display the file in hexadecimal. We also need another tool, sed, a traditional Unix text-processing tool; here we mainly use its text substitution feature. By combining the three tools we can get AWK to process binary data files.

The author has a data file, fxt, whose record structure is shown in Table 1.

Start  Length  Description
0      8       Account
8      7       Amount
15     3       Operator number

Table 1
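To experiment with the layout in Table 1 without the author's real file, a small sample can be built with printf. This is only a sketch: the file name and the helper variable are made up for illustration, the first record uses the bytes discussed in the article, and the second record's amount bytes are an assumption chosen to agree with the results shown at the end of the article.

```shell
# Hypothetical sample file (the name is illustrative, not from the article).
sample="${TMPDIR:-/tmp}/fxt_sample.$$"

# Record = 8-byte account + 7-byte packed amount + 3-byte operator + newline.
# Octal escapes: \000 = 00, \005 = 05, \014 = 0c, \040 = 20, \200 = 80, \015 = 0d.
printf '25856009\000\000\005\000\000\000\014101\n'  > "$sample"
printf '25800234\000\000\040\200\040\040\015102\n' >> "$sample"

# Two records of 19 bytes each are expected.
size=$(wc -c < "$sample" | tr -d ' ')
echo "file size: $size bytes"

# Hex dump of the sample, one byte per column.
od -An -v -tx1 "$sample"
```

The file is left in place so that it can be fed to the commands that follow in the article.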

If you open this data file with an ordinary text editor, you see a string of digits mixed with unintelligible characters, and there is no way to tell what the amount is. To help the reader understand the file, we use od to view it (see List 1).

# od -An -v -tx1 fxt

32 35 38 35 36 30 30 39 00 00 05 00 00 00 0c 31

30 31 0a 32 35 38 30 30 32 33 34 00 00 20 80 20

20 0d 31 30 32 0a ...

List 1
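As a quick illustration of the two display modes mentioned earlier, the sketch below dumps the same three bytes in octal and in hexadecimal. The variable names are illustrative, and the awk step only normalizes spacing, which differs between od implementations.

```shell
# Dump the bytes 'A', '1' and newline, one byte per column.
# -An drops the offset column, -v disables collapsing of repeated lines.
# awk '{ $1 = $1; print }' rebuilds each line with single spaces.
octal=$(printf 'A1\n' | od -An -v -to1 | awk '{ $1 = $1; print }')
hex=$(printf 'A1\n' | od -An -v -tx1 | awk '{ $1 = $1; print }')

echo "octal: $octal"
echo "hex:   $hex"
```

The hexadecimal form is the one used in the rest of the article, because two hex digits map exactly onto one byte.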

A brief explanation of the od options: -An suppresses the offset at the start of each line; -v prints every line, so repeated lines are not collapsed; -tx1 prints the data in hexadecimal, one byte at a time. From the data structure definition we can see that the first 8 bytes (32 35 38 35 36 30 30 39) are the account. The account consists of printable ASCII codes, which translate to 25856009. The next 7 bytes (00 00 05 00 00 00 0c) are the amount. The final c stands for credit, and the amount actually represented is 500,000.00. The 3 bytes that follow are the operator number, which also consists of printable ASCII codes. 0a is the ASCII code for a newline and marks the end of a record. Because the amount is made up of non-printable codes, the data in this file cannot be extracted in the usual way.

So how do we use the tools above to process such a file and generate a new plain text data file in a readable form? od already displays the whole data file clearly, but its output format does not meet our needs. For example, a single original record is now spread over several lines, and characters that were originally adjacent are now separated by spaces. This is still hard for AWK to handle. So, to make things easy for AWK, we must generate an intermediate file that AWK can process before we formally extract the data. As List 1 shows, od puts whitespace (a tab, in the author's output) at the start of each line, and the line breaks between the original records have become the corresponding ASCII code 0a. Our task is therefore to remove the leading whitespace and restore the line breaks between records. The following command does this.

# od -v -An -tx1 fxt | sed 's/^[[:space:]]*//' | sed 's/0a/,/g' | awk -f org.awk > fxt.a

The output of od is piped to sed. The first sed removes the leading whitespace, and the second replaces each 0a with a comma. The result is then handed to org.awk for formatting.
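To see what the od and sed stages hand to AWK, the sketch below runs them on a hypothetical two-record sample. The make_sample helper and its second record are assumptions for illustration; the sed expressions are one portable way to write the two substitutions, not necessarily the author's exact ones.

```shell
# Hypothetical sample: two 19-byte records in the Table 1 layout.
make_sample() {
    printf '25856009\000\000\005\000\000\000\014101\n'
    printf '25800234\000\000\040\200\040\040\015102\n'
}

# od prints lowercase hex pairs separated by whitespace. Each 0a becomes a
# comma that is still a separate whitespace-delimited token, so AWK can
# later recognise it as a field of its own.
marked=$(make_sample | od -An -v -tx1 |
         sed 's/^[[:space:]]*//' | sed 's/0a/,/g')

printf '%s\n' "$marked"
```

In the output, the comma appears as the third token of the second line and the last token of the third line, which is where the two records end.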

# cat org.awk
BEGIN {
    ORS = ""
}
{
    # od prints at most 16 bytes per line; a comma marks the end of a record.
    for (i = 1; i < 17; i++)
    {
        if ($i == ",")
            printf("\n");
        else
            printf("%s", $i);
    }
}

Setting ORS = "" (ORS is the Output Record Separator) ensures that no separator is added between the output fields. Strictly speaking, ORS affects only print; printf, which is used here, never appends it. The program then checks each field of a line for a comma and, where it finds one, outputs a newline. In this way the od output is transformed into the following.

# cat fxt.a

32353835363030390000050000000c313031

32353830303233340000208020200d313032
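The whole first step can be condensed into one function for experimentation. This is a sketch rather than the author's script: make_sample and to_hex_lines are hypothetical names, the sample's second record is an assumption, and the inline AWK program loops to NF instead of a fixed 16. The whitespace-stripping sed is omitted because AWK's default field splitting ignores leading whitespace.

```shell
# Hypothetical sample: two 19-byte records in the Table 1 layout.
make_sample() {
    printf '25856009\000\000\005\000\000\000\014101\n'
    printf '25800234\000\000\040\200\040\040\015102\n'
}

# Read binary data on stdin, write one line of hex digits per record.
to_hex_lines() {
    od -An -v -tx1 | sed 's/0a/,/g' |
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == ",")
                printf("\n");        # record boundary
            else
                printf("%s", $i);    # append the two hex digits
        }
    }'
}

hexlines=$(make_sample | to_hex_lines)
printf '%s\n' "$hexlines"
```

Each output line is 36 characters long: 16 for the account, 14 for the amount, and 6 for the operator number.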

How about that? This format is much easier on the eye. With the intermediate file generated, the next step is the actual data extraction. In practice, the data structure of the original file may not match the target structure we need to convert to. In that case, besides extracting the data, we also have to adjust the output format. For example, to output the data file above in the structure shown in Table 2, see the program ck1.awk.

Start  Length  Description
0      12      Account
12     1       ","
13     1       Debit/credit flag: 0 = credit, 1 = debit
14     1       ","
15     14      Amount
29     1       ","
30     3       Operator number

Table 2
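The layout in Table 2 corresponds to a single printf format string. The sketch below is illustrative only: it hard-codes the values of the first sample record and checks that the resulting line has the expected width of 33 characters (12 + 1 + 1 + 1 + 14 + 1 + 3).

```shell
# %-12s  : account, left-aligned and space-padded to 12 characters
# %s     : debit/credit flag (0 or 1)
# %014d  : amount, zero-padded to 14 digits
# %s     : operator number
line=$(awk 'BEGIN {
    printf("%-12s,%s,%014d,%s\n", "25856009", 0, "0000050000000", "101")
}')

printf '[%s]\n' "$line"
echo "width: ${#line}"
```

The brackets in the first output line only make the padding after the account visible.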

# cat ck1.awk
BEGIN {
    FS = "\n";      # treat each whole line as one field
}

# Turn a string of ASCII codes in hex ("323538...") back into digits.
function trans(s,    a, b, i)
{
    if ((a = substr(s, 1, 2) - 30) < 0)
        a = 0;
    for (i = 3; i < length(s); i += 2)
    {
        if ((b = substr(s, i, 2) - 30) < 0)
            b = 0;
        a = (a b);  # string concatenation
    }
    return a;
}

{
    actno = trans(substr($1, 1, 16));   # account
    amt = substr($1, 17, 13);           # amount
    cdflag = substr($1, 30, 1);         # debit/credit flag
    if (cdflag == "d")
        cdflag = 1;
    if (cdflag == "c")
        cdflag = 0;
    oper = trans(substr($1, 31, 6));    # operator number
    printf("%-12s,%s,%014d,%s\n", actno, cdflag, amt, oper);
}
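The idea behind the trans function in the listing above can be tried in isolation. The wrapper below is a sketch under a hypothetical name, decode_ascii_hex, and is not part of the author's program. It relies on the fact that the hex codes 30 to 39 contain no letters, so they can be read as decimal numbers before 30 is subtracted.

```shell
# $1: a string of hex digit pairs, e.g. 313031 for the characters "101".
decode_ascii_hex() {
    awk -v s="$1" 'BEGIN {
        out = ""
        for (i = 1; i < length(s); i += 2) {
            d = substr(s, i, 2) - 30    # "31" - 30 = 1
            if (d < 0)
                d = 0                   # anything below "30" is mapped to 0
            out = out d                 # string concatenation
        }
        print out
    }'
}

decode_ascii_hex 3235383536303039   # account field of the first record
decode_ascii_hex 313031             # operator field of the first record
```

The two calls print 25856009 and 101, the account and operator number of the first record.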

An AWK program is divided into three parts: a start part, a main processing part, and an end part. The start part is written BEGIN {}, the end part END {}, and the main part is simply enclosed in {}. The BEGIN part does any preparation needed before processing starts. The END part runs after all records have been processed and does any cleanup. As you can see, AWK's syntax is very similar to that of C, and you can define functions too. In fact AWK borrows from C in many places, so readers with a C background should be able to pick it up quickly.

First, in the start part, we set the field separator FS to \n, which means that each line is treated as a single field. Second, we define a function, trans. Its job is to restore digit characters represented by their ASCII codes to ordinary digits. The ASCII codes of the digits 0-9 are hexadecimal 30-39, so to convert an ASCII code into the corresponding digit we only need to subtract 30. For example, the ASCII code of 1 is 31, and 31 - 30 = 1 immediately gives the digit that 31 stands for. trans uses this principle to convert each ASCII code into a digit, then joins the digits into an account or operator number and returns it to the caller.

In the main processing part we use the substr function, a built-in that extracts part of a string. Remember that after the od conversion each byte is represented by 2 hex digits, so although the account is 8 bytes long, we extract 16 characters. Also, because the account and the operator number are now represented by ASCII codes, they must be translated by trans. The amount is already expressed as its hexadecimal digits, so it needs no further translation.

Output is done with printf, which works much like the printf function of standard C. To meet the output data structure, printf applies modifiers to the account and the amount: the account is left-aligned in a 12-character field and padded with spaces on the right, and the amount is padded with leading zeros. Running the program above gives the result we want.

# awk -f ck1.awk fxt.a

25856009    ,0,00000050000000,101

25800234    ,1,00000208020200,102
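The amount column in these results is still a scaled integer. The sketch below shows one way to render the 14 hex digits of the amount field as a readable value. It assumes two implied decimal places (inferred from the article's statement that the first amount is 500,000.00) and the sign letters c for credit and d for debit; format_amount is a hypothetical helper, not part of ck1.awk.

```shell
# $1: 14 hex digits, i.e. 13 amount digits followed by the sign letter.
format_amount() {
    awk -v f="$1" 'BEGIN {
        digits = substr(f, 1, 13)       # e.g. "0000050000000"
        sign = substr(f, 14, 1)         # "c" or "d"
        label = (sign == "d") ? "debit" : "credit"
        printf("%.2f %s\n", digits / 100, label)
    }'
}

format_amount 0000050000000c
format_amount 0000208020200d
```

Keeping the amount as an integer, as ck1.awk does, avoids floating-point rounding; a conversion like this one is better left to the final display step.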

