Like several other popular scripting languages, Python is an excellent tool for scanning and manipulating textual data. This article summarizes Python's text processing facilities for beginners. It explains some general concepts of regular expressions and offers advice on when to use (or not use) regular expressions while processing text.
What is Python?
Python is a freely available, very high-level, interpreted language developed by Guido van Rossum. It combines a clear, easily understood syntax with powerful (but flexible) object-oriented semantics. Python is widely available and highly portable.
Strings: immutable sequences
As in most higher-level programming languages, variable-length strings are a basic type in Python. Python allocates memory to hold strings (or other values) "behind the scenes", so programmers do not have to think about it. Python also has several string-handling features that some other high-level languages lack.
In Python, strings are "immutable sequences". Although strings (like tuples) cannot be modified "in place", a program can refer to elements or subsequences of a string, just as with any sequence. Python uses a flexible "slice" operation to refer to subsequences; the format of a slice resembles a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices
>>> s = "mary had a little lamb"
>>> s[0]          # index is zero-based
'm'
>>> s[3] = 'x'    # changing element in-place fails
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s[11:18]      # 'slice' a subsequence
'little '
>>> s[:4]         # empty slice-begin assumes zero
'mary'
>>> s[4]          # index 4 is not included in slice [:4]
' '
>>> s[5:-5]       # can use "from end" index with negatives
'had a little'
>>> s[:5]+s[5:]   # slice-begin & slice-end are complementary
'mary had a little lamb'
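The same slicing rules can be checked outside an interactive session. The following is a minimal sketch in present-day Python 3 syntax (an assumption on our part; the session above uses the Python 1.5 interpreter). The variable names are illustrative only. It also shows the usual workaround for immutability: build a new string from slices instead of assigning to a position.

```python
s = "mary had a little lamb"

first = s[0]              # indexing is zero-based
word = s[11:17]           # the end index of a slice is excluded
head = s[:4]              # an empty slice-begin means zero
middle = s[5:-5]          # negative indexes count from the end
rejoined = s[:5] + s[5:]  # the two halves always add up to the whole

# Strings are immutable, so item assignment raises TypeError.
try:
    s[3] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" a character, build a new string from slices.
changed = s[:3] + "x" + s[4:]
```

Because s[:n] + s[n:] always reproduces s, slices are a safe way to cut a string at any position without off-by-one bookkeeping.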
Another powerful string operation is the simple in keyword. It offers two intuitive and useful constructs:
The in keyword
>>> s = "mary had a little lamb"
>>> for c in s[11:18]: print c,  # print each char in slice
...
l i t t l e
>>> if 'x' in s: print 'got x'   # test for char occurrence
...
>>> if 'y' in s: print 'got y'   # test for char occurrence
...
got y
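For readers on a current interpreter, here is a small sketch of the same two constructs in Python 3 syntax (an assumption; the session above is Python 1.5). One difference worth knowing: in modern Python the in test also accepts whole substrings, not only single characters. The variable names are illustrative.

```python
s = "mary had a little lamb"

# Iterate over the characters of a slice and join them with spaces.
spaced = " ".join(c for c in s[11:17])

# Membership tests for single characters.
has_x = "x" in s
has_y = "y" in s

# In modern Python, "in" also tests for substrings.
has_lamb = "lamb" in s
```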
There are several ways to write string literals in Python. You can use single or double quotes, as long as the opening and closing quotes match, and there are other useful variants of quoting. If a string contains line breaks or embedded quotes, triple quotes make it easy to define, as in the following example:
Triple quotes
>>> s2 = """Mary had a little lamb
... its fleece was white as snow
... and everywhere that Mary went
... the lamb was sure to go"""
>>> print s2
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go
You can prefix a single-quoted or triple-quoted string with the letter "r" to tell Python not to interpret the special characters used in regular expressions (that is, backslash escapes). For example:
Using "r-strings"
>>> s3 = "this \n and \n that"
>>> print s3
this
 and
 that
>>> s4 = r"this \n and \n that"
>>> print s4
this \n and \n that
In "r-strings", a backslash that might otherwise form an escape sequence is treated as an ordinary backslash. This topic is explained further in the later discussion of regular expressions.
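To make the difference concrete, the sketch below compares an ordinary string and an r-string in Python 3 syntax (an assumption; the article's sessions are Python 1.5). The variable names and the sample text "lamb 7" are illustrative. The raw string is two characters longer because each \n stays as a backslash followed by the letter n, and that is exactly what you want when handing a pattern such as \d to the re module.

```python
import re

s3 = "this \n and \n that"    # "\n" becomes one newline character
s4 = r"this \n and \n that"   # "\n" stays as backslash + "n"

extra_chars = len(s4) - len(s3)

# A raw string saves doubling every backslash in a pattern.
same_pattern = (r"\d" == "\\d")
found_digit = re.search(r"\d", "lamb 7") is not None
```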
Files and string variables
When we speak of "text processing", we usually mean processing the contents of files. Python makes it easy to read the contents of a text file into string variables that can be manipulated. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each method accepts an argument that limits how much data is read at a time, but they are usually called without one. .read() reads the entire file at once and is typically used to place a file's contents in a single string variable. Although .read() yields the most direct string representation of a file's contents, it is inconvenient for sequential, line-oriented processing, and it is not possible at all if the file is larger than available memory.
.readline() and .readlines() are very similar. Both are used in constructs like the following:
Python .readlines() example
fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print line
The difference between .readline() and .readlines() is that the latter, like .read(), reads the entire file at once. .readlines() automatically parses the file's contents into a list of lines that can be processed by Python's for ... in ... construct. .readline(), on the other hand, reads just one line at a time and is generally much slower than .readlines(). Use .readline() only when there is not enough memory to read the entire file at once.
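The three read methods can be compared side by side. This is a minimal sketch in Python 3 syntax (an assumption; the article targets Python 1.5). It writes a small temporary file, with the hypothetical name rhyme.txt, so that it runs anywhere. The last part iterates over the file object directly, which is the usual memory-friendly idiom in current Python and is not something the original article covers.

```python
import os
import tempfile

text = "mary had a little lamb\nits fleece was white as snow\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rhyme.txt")  # hypothetical file name
    with open(path, "w") as fh:
        fh.write(text)

    with open(path) as fh:
        whole = fh.read()            # entire file as one string

    with open(path) as fh:
        lines = fh.readlines()       # entire file as a list of lines

    with open(path) as fh:
        first_line = fh.readline()   # just one line

    # Modern idiom (not from the article): iterate line by line
    # without loading the whole file into memory.
    with open(path) as fh:
        streamed = [line.rstrip("\n") for line in fh]
```

Note that .readline() and .readlines() keep the trailing newline on each line, so you usually strip it before further processing.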
If you are using a standard module that processes files, you can use the cStringIO module to turn a string into a "virtual file" (if you need to subclass the module, use the StringIO module instead; beginners are unlikely to need this). For example:
cStringIO module
>>> import cStringIO
>>> fh = cStringIO.StringIO()
>>> fh.write("mary had a little lamb")
>>> fh.getvalue()
'mary had a little lamb'
>>> fh.seek(5)
>>> fh.write('ATE')
>>> fh.getvalue()
'mary ATE a little lamb'
Remember, however, that unlike a real file, a cStringIO "virtual file" is not persistent. If you do not save it (for example, by writing it to a real file, or by using the shelve module or a database), it is gone when the program ends.
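In current Python 3 the same role is played by io.StringIO (cStringIO belongs to the Python 1.x and 2.x lines). The sketch below is an assumed modern equivalent of the session above, not code from the article. It writes, seeks, overwrites, and reads the value back.

```python
import io

fh = io.StringIO()                  # an in-memory "virtual file"
fh.write("mary had a little lamb")
before = fh.getvalue()

fh.seek(5)                          # move to the start of "had"
fh.write("ATE")                     # overwrite three characters in place
after = fh.getvalue()
```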
Standard module: string
The string module is probably the most generally useful module in the Python 1.5.* standard distribution. In fact, in Python 1.6 or later, the functionality of the string module will be available as built-in string methods (the details had not been released when this article was written). Certainly, any program that performs text processing should begin with the following line:
Using string
import string
The general rule of thumb is that if you can accomplish a task with the string module, that is the right way to do it. Compared with re (regular expressions), string functions are generally much faster, and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast ones written in C, are available for specialized tasks, but portability and familiarity still argue for sticking with string whenever you can. There are exceptions, but not as many as you might think if you are used to other languages.
The string module contains several kinds of things, such as functions, methods, and classes; it also contains strings of common constants. For example:
string example 1
>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
Although you could write these constants by hand, the string versions more or less ensure that the constants are correct for the national language and platform on which your Python script runs.
string also includes functions that transform strings in common ways (and that you can combine to form several less common transformations). For example:
string example 2
>>> import string
>>> s = "mary had a little lamb"
>>> string.capwords(s)
'Mary Had A Little Lamb'
>>> string.replace(s, 'little', 'ferocious')
'mary had a ferocious lamb'
There are many other transformations not specifically illustrated here; you can find the details in the Python manual.
You can also use string functions to report string attributes, such as the position or number of occurrences of a substring. For example:
string example 3
>>> import string
>>> s = "mary had a little lamb"
>>> string.find(s, 'had')
5
>>> string.count(s, 'a')
4
Finally, string provides a very Pythonic oddity. The pair .split() and .join() offers a quick way to convert between strings and sequences such as lists and tuples, which you will find remarkably useful. Usage is simple:
string example 4
>>> import string
>>> s = "mary had a little lamb"
>>> L = string.split(s)
>>> L
['mary', 'had', 'a', 'little', 'lamb']
>>> string.join(L, "-")
'mary-had-a-little-lamb'
Of course, in real programs you are likely to do something else with a list besides .join() it straight back together (probably something involving our familiar for ... in ... construct).
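As the article anticipates, later Python versions expose most of these operations as methods on the string itself. The sketch below shows the assumed Python 3 equivalents of the four string examples above; only capwords is still called through the string module here. The variable names are illustrative.

```python
import string

s = "mary had a little lamb"

capitalized = string.capwords(s)             # still a module function
replaced = s.replace("little", "ferocious")  # was string.replace(s, ...)
position = s.find("had")                     # was string.find(s, ...)
count_a = s.count("a")                       # was string.count(s, ...)

words = s.split()                            # was string.split(s)
joined = "-".join(words)                     # was string.join(L, "-")

# Do something with the list besides joining it right back.
lengths = [len(word) for word in words]
```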
Standard module: re
The re module obsoletes the regex and regsub modules used in older Python code. Although regex still has a few limited advantages, they are minor and not worth relying on in new code. The obsolete modules are likely to be dropped from future Python releases, and version 1.6 is likely to have an improved, interface-compatible re module. So stick with re for regular expressions.
Regular expressions are complicated. One could write a book on the topic, and in fact many people have! This article tries to capture the "gestalt" of regular expressions so that readers can get the hang of them.
A regular expression is a concise way of describing a pattern that might occur in text. Do certain characters occur? Do they occur in a particular order? Are sub-patterns repeated a given number of times? Are other sub-patterns excluded from a match? Conceptually, this is not unlike describing patterns intuitively in natural language. The trick is to encode such a description in the compact syntax of regular expressions.
When working with a regular expression, treat it as a programming problem in its own right, even if it involves only one or two lines of code; those lines effectively form a small program.
Start small. At its most basic, any regular expression involves matching particular "character classes". The simplest character class is a single character, which appears in the pattern simply as a literal. Frequently, you want to match a whole class of characters. You can indicate a class by surrounding it in square brackets; within the brackets, you can list a set of characters or a range of characters specified with a dash. You can also use a number of named character classes that are accurate for your platform and national language. Here are some examples:
Character classes
>>> import re
>>> s = "mary had a little lamb"
>>> if re.search("m", s): print "Match!"       # char literal
...
Match!
>>> if re.search("[@A-Z]", s): print "Match!"  # char class
...            # match either at-sign or capital letter
...
>>> if re.search("\d", s): print "Match!"      # digits class
...
You can think of character classes as the "atoms" of regular expressions, and you will usually want to combine those atoms into "molecules". You do this with a combination of grouping and repetition. Grouping is indicated by parentheses: any sub-expression enclosed in parentheses is treated as an atom for the purposes of further grouping or repetition. Repetition is indicated by one of the following operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, look at the following:
Sample regular expression
ABC([d-w]*\d\d?)+XYZ
For a string to match this expression, it must begin with "ABC" and end with "XYZ", but what must come in between? The middle sub-expression is ([d-w]*\d\d?), followed by the "one or more" operator. So the middle of the string must contain one or more things that match the parenthesized sub-expression. The string "ABCXYZ" does not match, because it has nothing of the required kind in the middle.
But what is this inner sub-expression? It starts with zero or more letters in the range d-w. Note that zero letters is a valid match, which may feel odd if you describe it with the English word "some". Next, the string must have exactly one digit, followed by zero or one additional digit. (The first digit character class has no repetition operator, so it occurs exactly once. The second digit character class has the "?" operator.) In short, this translates to "one or two digits". Here are some strings that match the regular expression:
Strings that match the sample expression
ABC1234567890XYZ
ABCd12e1f37g3XYZ
ABC1XYZ
And here are some strings that do not match the regular expression (think about why they do not):
Strings that do not match the sample expression
ABC123456789dXYZ
ABCdefghijklmnopqrstuvwXYZ
ABcd12e1f37g3XYZ
ABC12345%67890XYZ
ABCD12E1F37G3XYZ
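You can check your reasoning about these strings mechanically. The sketch below is in Python 3 syntax and assumes the sample pattern reads ABC([d-w]*\d\d?)+XYZ, with lowercase letters in the class and uppercase literals at both ends. It uses re.fullmatch, which requires the whole string to match; that is our choice for the test, not something the article prescribes. The list names are illustrative.

```python
import re

# Assumed form of the sample expression discussed above.
pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

should_match = [
    "ABC1234567890XYZ",
    "ABCd12e1f37g3XYZ",
    "ABC1XYZ",
]
should_fail = [
    "ABCXYZ",                      # nothing in the middle
    "ABC123456789dXYZ",            # trailing "d" has no digit after it
    "ABCdefghijklmnopqrstuvwXYZ",  # letters but no digit at all
    "ABcd12e1f37g3XYZ",            # does not start with "ABC"
    "ABC12345%67890XYZ",           # "%" is in neither class
    "ABCD12E1F37G3XYZ",            # uppercase letters are not in [d-w]
]

matched = [s for s in should_match if pattern.fullmatch(s)]
rejected = [s for s in should_fail if not pattern.fullmatch(s)]
```

Keeping a small table of positive and negative cases like this is a cheap way to treat a regular expression as the small program it is.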
It takes some practice to get used to creating and understanding regular expressions. Once you have mastered them, however, you have a great deal of expressive power at your disposal. That said, it is often tempting to reach for a regular expression to solve a problem that could actually be solved with simpler (and faster) tools, such as string.
Resources
Mastering Regular Expressions, by Jeffrey E. F. Friedl (O'Reilly, 1997), is a fairly standard and definitive reference on regular expressions.
For a good introduction to some early text processing tools that are still widely used and extremely useful, see Sed & Awk, by Dale Dougherty and Arnold Robbins (O'Reilly and Associates, 1997).
Read about mxTextTools, fast text manipulation tools for Python.
More on regular expressions: the Regular Expression HOWTO at python.org, and the Overview of Regular Expressions from the University of Kentucky.