Handling GBK-encoded strings in Perl

xiaoxiao  2021-03-06

Return the length of a string, counted in characters:

    use String::Multibyte;

    $gbk_str = "Last";    # any GBK-encoded string
    $gbk     = String::Multibyte->new('GBK');
    $gbk_len = $gbk->length($gbk_str);
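
To see why a character count differs from a byte count, here is a minimal core-Perl sketch that does not call String::Multibyte at all. It assumes that one GBK character is either an ASCII byte or a lead byte 0x81-0xFE followed by a trail byte 0x40-0x7E or 0x80-0xFE; the pattern `$gbk_char` and the sample bytes are illustrative and not taken from the module.

```perl
use strict;
use warnings;

# Assumption: one GBK character is an ASCII byte, or a lead byte 0x81-0xFE
# followed by a trail byte 0x40-0x7E or 0x80-0xFE.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# "中文" in GBK, written as raw bytes so no encoding layer is involved.
my $gbk_str = "\xD6\xD0\xCE\xC4";

my $byte_len = length $gbk_str;                   # CORE::length counts bytes
my $char_len = () = $gbk_str =~ /\G$gbk_char/g;   # count whole characters

print "bytes=$byte_len characters=$char_len\n";   # bytes=4 characters=2
```

The character count is what the length method described in this article is meant to report for a GBK string.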

Constructor

    $mbcs = String::Multibyte->new(CHARSET)
    $mbcs = String::Multibyte->new(CHARSET, VERBOSE)

CHARSET is the charset name; strictly speaking, the file name of the definition file without the .pm suffix. The constructor returns an instance that tells the methods in which charset the specified strings should be handled.

CHARSET may also be a hashref; this is how to define a charset without a .pm file.

    # see perlfaq6 :-)
    my $martian = String::Multibyte->new({
        charset => "martian",
        regexp  => '[A-Z][A-Z]|[^A-Z]',
    });

If a true value is specified as VERBOSE, the called method (except islegal) checks its arguments and carps if any of them is not legally encoded.

Otherwise no such check is carried out. This saves a bit of time but is unsafe, though you can still call the islegal method yourself when necessary.

Check whether the string is legal

Detects whether strings are legally encoded, for example as GBK.

    $mbcs->islegal(LIST)

Returns a boolean indicating whether all the strings in the arguments are legally encoded in the charset concerned. Returns false if even one element of LIST is illegal.
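
The idea behind such a legality check can be sketched in core Perl: a string is accepted only if it is a sequence of whole characters from start to end. The helper `all_legal` is hypothetical, and the GBK pattern is an assumption (ASCII byte, or lead byte 0x81-0xFE plus trail byte 0x40-0x7E or 0x80-0xFE), not the module's own definition.

```perl
use strict;
use warnings;

# Assumed pattern for one GBK character.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# Hypothetical helper: true only if every argument consists of whole characters.
sub all_legal {
    for my $str (@_) {
        return 0 unless $str =~ /\A(?:$gbk_char)*\z/;
    }
    return 1;
}

my $good = all_legal("abc", "\xD6\xD0\xCE\xC4");   # every string is well formed
my $bad  = all_legal("abc", "\xD6\xD0\xCE");       # second string ends in a lone lead byte

print "good=$good bad=$bad\n";   # good=1 bad=0
```

As in the description above, a single malformed element makes the whole call return false.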

Length

    $mbcs->length(STRING)

Returns the length in characters of the specified string.

Reverse

Reverses a string.

    $mbcs->strrev(STRING)

Returns the string reversed character by character.

Search

Searches for a substring.

    $mbcs->index(STRING, SUBSTR)
    $mbcs->index(STRING, SUBSTR, POSITION)

Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, the search starts from the beginning of the string.

If the substring is not found, returns -1.
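
A character-oriented search matters for GBK because the trail byte of one character and the lead byte of the next can look like a different character. The core-Perl sketch below shows the false match that CORE::index produces and how a search restricted to character boundaries avoids it. The helper `char_index` and the GBK pattern are assumptions made for illustration, not the module's implementation.

```perl
use strict;
use warnings;

# Assumed pattern for one GBK character.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# Hypothetical helper: search character by character and return a
# character position, or -1 when nothing matches on a boundary.
sub char_index {
    my ($string, $substr, $pos) = @_;
    $pos = 0 if !defined $pos || $pos < 0;
    my @chars = $string =~ /\G$gbk_char/g;
    for my $i ($pos .. $#chars) {
        my $tail = join '', @chars[$i .. $#chars];
        return $i if index($tail, $substr) == 0;   # match must start on a boundary
    }
    return -1;
}

my $str = "\xD6\xD0\xCE\xC4";                     # "中文" in GBK

my $byte_hit  = index($str, "\xD0\xCE");          # 1: false match across two characters
my $char_miss = char_index($str, "\xD0\xCE");     # -1: no character starts with these bytes
my $char_hit  = char_index($str, "\xCE\xC4");     # 1: "文" is the second character

print "$byte_hit $char_miss $char_hit\n";         # 1 -1 1
```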

Reverse search

    $mbcs->rindex(STRING, SUBSTR)
    $mbcs->rindex(STRING, SUBSTR, POSITION)

Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.

If the substring is not found, returns -1.

    $mbcs->strspn(STRING, SEARCHLIST)

Finds the first character of the first string that is not in the character set given by the second string.

Returns the position of the first occurrence of any character not contained in the search list.

    $mbcs->strspn("+0.12345*12", "+-.0123456789"); # returns 8.

If the specified string does not contain any character in the search list, returns 0.

If the string consists only of characters in the search list, the returned value equals the length of the string.

SEARCHLIST can be an arrayref. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also OK, since the character order of SEARCHLIST doesn't matter in strspn).

    $mbcs->strcspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character contained in the search list.

If the specified string does not contain any character in the search list, the returned value equals the length of the string.

SEARCHLIST can be an arrayref. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also OK, since the character order of SEARCHLIST doesn't matter in strcspn).
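
The two span functions mirror each other: one counts leading characters that are in the search list, the other counts leading characters that are not. A minimal core-Perl sketch of that counting follows. The helper `leading_run`, its third argument, and the GBK pattern are assumptions for illustration; arrayref search lists are not handled here.

```perl
use strict;
use warnings;

# Assumed pattern for one GBK character.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# Hypothetical helper: count leading characters that are in the list
# (or, with a true third argument, that are NOT in the list).
sub leading_run {
    my ($string, $list, $complement) = @_;
    my %in_list = map { $_ => 1 } $list =~ /\G$gbk_char/g;
    my $count = 0;
    for my $char ($string =~ /\G$gbk_char/g) {
        my $hit = exists $in_list{$char} ? 1 : 0;
        $hit = 1 - $hit if $complement;
        last unless $hit;
        $count++;
    }
    return $count;
}

my $spn  = leading_run("+0.12345*12", "+-.0123456789");             # strspn-like
my $cspn = leading_run("+0.12345*12", "*", 1);                      # strcspn-like
my $gbk  = leading_run("\xD6\xD0\xCE\xC4abc", "\xCE\xC4\xD6\xD0");  # two GBK characters

print "$spn $cspn $gbk\n";   # 8 8 2
```

The last call returns 2 although the matching prefix is 4 bytes long, because positions are counted in characters.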

Substring

    $mbcs->substr(STRING or SCALAR REF, OFFSET)
    $mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)
    $mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)

It works like CORE::substr, but uses the character semantics of the multibyte charset encoding.

If REPLACEMENT is specified as the fourth argument, replaces that part of SCALAR and returns what was there before.

If a reference to a scalar variable is used as the first argument, an lvalue reference is returned, and you can assign through it.

    ${ $mbcs->substr(\$str, $off, $len) } = $replace;

works like

    CORE::substr($str, $off, $len) = $replace;

The returned lvalue is byte-oriented, not multibyte character-oriented, so successive assignments may lead to odd results.
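
The difference between byte semantics and character semantics is easy to see on a mixed string. The following core-Perl sketch compares CORE::substr with a character-based extraction. The helper `char_substr` is hypothetical, accepts only non-negative arguments, and relies on an assumed GBK pattern; it does not reproduce the module's replacement or lvalue behaviour.

```perl
use strict;
use warnings;

# Assumed pattern for one GBK character.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# Hypothetical helper: substring by character offset and length
# (non-negative arguments only, to keep the sketch short).
sub char_substr {
    my ($string, $offset, $length) = @_;
    my @chars = $string =~ /\G$gbk_char/g;
    my $last  = defined $length ? $offset + $length - 1 : $#chars;
    $last = $#chars if $last > $#chars;
    return '' if $offset > $last;
    return join '', @chars[$offset .. $last];
}

my $str = "a\xD6\xD0\xCE\xC4b";             # "a中文b" in GBK: 4 characters, 6 bytes

my $by_bytes = substr($str, 1, 2);          # "\xD6\xD0": only the first of the two characters
my $by_chars = char_substr($str, 1, 2);     # "\xD6\xD0\xCE\xC4": both characters
my $rest     = char_substr($str, 2);        # "\xCE\xC4b"

printf "%d %d %d\n", length($by_bytes), length($by_chars), length($rest);   # 2 4 3
```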

Split

Splits a string on a separator.

    $mbcs->strsplit(SEPARATOR, STRING)
    $mbcs->strsplit(SEPARATOR, STRING, LIMIT)

This function emulates CORE::split, but splits on the SEPARATOR string, not by a pattern.

If not in list context, it only returns the number of fields found.

An empty SEPARATOR splits the string into characters:

    $bytes->strsplit('', 'This is perl.', 7);
    # ('T', 'h', 'i', 's', ' ', 'i', 's perl.')

Character Range

Returns the list of all characters within a given range of code values.

    $mbcs->mkrange(CHARLIST, ALLOW_REVERSE)

Returns the character list (when not in list context, as a concatenated string) obtained by parsing the specified character range.

The result depends on the character order for the charset concerned. For the character order of each charset, see its definition file.

If the character order is undefined in the definition file, returns a string identical to the specified string.

A character range is specified with a hyphen ('-'; strictly speaking, $obj->{hyphen}).

The backslashed combinations '\-' and '\\' (strictly speaking, "$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}") are used instead of the characters '-' and '\', respectively. A hyphen at the beginning is also evaluated as the hyphen itself.

For example, $mbcs->mkrange('+\-0-9a-f') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f') and scalar $mbcs->mkrange('A-P') returns 'ABCDEFGHIJKLMNOP'.

If a true value is specified as the second argument, reverse character ranges such as '9-0' and 'Z-A' are allowed.

    $bytes = String::Multibyte->new('Bytes');
    $bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqponml

Transliteration

Searches for characters and replaces them.

    $mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
    $mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Transliterates all occurrences of the characters found in the search list with the corresponding characters in the replacement list.

If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and leaves the specified string unaffected.

If the 'h' modifier is specified, returns a histogram hash in list context and a reference to a histogram hash in scalar context.

SEARCHLIST and REPLACEMENTLIST

Character ranges (internally utilizing mkrange()) are supported.

If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.

If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).

SEARCHLIST and REPLACEMENTLIST can be an arrayref. E.g. if a charset treats "\r\n" (CRLF) as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" should be given as ["\r", "\n"]. Of course "\n\r" is also OK, but the character order is different; cf. strtr($str, ["\r", "\n"], ["\n", "\r"]), which swaps "\n" and "\r". Each element of the arrayref can include character ranges. The modifiers R and r affect their evaluation as usual.

["a-c", "h-z"] is evaluated like "a-ch-z" if the charset does not include the grapheme "ch". The former prevents "c" and "h" from being evaluated as "ch" even if the charset includes the grapheme "ch".

Modifier

    c   Complement the SEARCHLIST.
    d   Delete found but unreplaced characters.
    s   Squash duplicate replaced characters.
    h   Return a hash (or a hashref) of histogram.
    R   No use of character ranges.
    r   Allows to use reverse character ranges.
    o   Caches the conversion table internally.

If the 'R' modifier is specified, '-' is not evaluated as a meta character but as the hyphen itself, as in tr'''. Compare:

    $mbcs->strtr("90 - 32 = 58", "0-9", "A-J");
    # output: "JA - DC = FI"

    $mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");
    # output: "JA - 32 = 58"
    # cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
    # '0' to 'A', '-' to '-', and '9' to 'J'.

If the 'r' modifier is specified, reverse character ranges are allowed. E.g. $mbcs->strtr($str, "0-9", "9-0", "r") is equivalent to $mbcs->strtr($str, "0123456789", "9876543210").

Caching the conversion table

If the 'o' modifier is specified, the conversion table is cached internally. E.g.

    foreach (@source_strings) {
        print $mbcs->strtr($_, $from_list, $to_list, 'o');
    }

will be almost as efficient as this:

    $trans = $mbcs->trclosure($from_list, $to_list);
    foreach (@source_strings) {
        print &$trans($_);
    }

You can use whichever you like.

Without 'o',

    foreach (@source_strings) {
        print $mbcs->strtr($_, $from_list, $to_list);
    }

will be very slow, since the conversion table is rebuilt every time the function is called.

Generation of the closure to transliterate

Returns a reference to a function that applies a transliteration rule.

    $mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)
    $mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Returns a closure to transliterate the specified string. The return value is a plain code reference, not a blessed object. By using this code ref, you save time because you need not specify the arguments every time.

    my $trans = $mbcs->trclosure($from_list, $to_list);
    print &$trans($string);  # ok to perl 5.003
    print $trans->($string); # perl 5.004 or better
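
Why a closure saves time can be sketched in core Perl: the lookup table is built once when the closure is created, and each later call only walks the string. The function `make_translator` below is a hypothetical stand-in, not the module's code. It assumes a GBK character pattern, pads a short replacement list with its final character, and ignores character ranges and modifiers.

```perl
use strict;
use warnings;

# Assumed pattern for one GBK character.
my $gbk_char = qr/[\x00-\x7F]|[\x81-\xFE][\x40-\x7E\x80-\xFE]/;

# Hypothetical stand-in for a transliteration closure: the lookup table is
# built once, and the returned code ref reuses it on every call.
sub make_translator {
    my ($from_list, $to_list) = @_;
    my @from = $from_list =~ /\G$gbk_char/g;
    my @to   = $to_list   =~ /\G$gbk_char/g;
    @to = @from unless @to;                   # empty replacement list: copy the search list
    push @to, $to[-1] while @to < @from;      # short list: repeat its final character
    my %table;
    for my $i (0 .. $#from) {
        $table{ $from[$i] } = $to[$i] unless exists $table{ $from[$i] };
    }
    return sub {
        my ($string) = @_;
        return join '', map { exists $table{$_} ? $table{$_} : $_ } $string =~ /\G$gbk_char/g;
    };
}

my $trans = make_translator("0123456789", "ABCDEFGHIJ");
my @out   = map { $trans->($_) } ("90 - 32 = 58", "2021");

print "$_\n" for @out;   # prints "JA - DC = FI", then "CACB"
```

Calling `$trans` inside a loop reuses `%table`, which is the saving that the text attributes to the 'o' modifier and to trclosure.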

..........................................

SEARCHLIST and REPLACEMENTLIST can be an arrayref, the same as in strtr().

Caveat: $[

This module supposes that $[ is always equal to 0, never 1.

Grapheme manipulation

Since v. 1.01, manipulation of sequence of graphemes is to be supported.

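In a grapheme-oriented manipulation, note that segmentation starts at the beginning of each given string, so every argument is divided into whole graphemes on its own.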

E.g. imagine a grapheme set where a grapheme comprises either a leading Latin capital letter followed by zero or more Latin small letters, or a single byte. Such a set can be defined as below.

    $gra = String::Multibyte->new({
        regexp => '[A-Z][a-z]*|[\x00-\xFF]',
    });

Think about $gra->index("Perl", "Pe"). As both "Perl" and "Pe" are single graphemes, they are not equal to each other. So the result must be -1 (meaning no match).
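
The segmentation behind this result can be checked with the same regular expression in core Perl, without the module. The sketch below only splits strings into units and compares them; the variable names are illustrative.

```perl
use strict;
use warnings;

# The grapheme definition from the text: a capital letter with its trailing
# small letters, or else any single byte.
my $grapheme = qr/[A-Z][a-z]*|[\x00-\xFF]/;

my @perl  = "Perl"       =~ /\G$grapheme/g;   # ("Perl"): one grapheme
my @pe    = "Pe"         =~ /\G$grapheme/g;   # ("Pe"):   one grapheme
my @words = "Hello Perl" =~ /\G$grapheme/g;   # ("Hello", " ", "Perl")

# A grapheme-level search compares whole units, so "Pe" is not found in "Perl".
my $found = grep { $_ eq $pe[0] } @perl;

print scalar(@perl), " ", scalar(@words), " $found\n";   # 1 3 0
```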

When reposting, please credit the original source: https://www.9cbs.com/read-103911.html
