Sending Chinese SMS messages with Unicode encoding
Author: Chen Yifei
Last updated: 2003-03-08
Keywords: SMS, PDU, Unicode, GB2312, Linux, encoding conversion
SMS is a specification established by ETSI (GSM 03.40 and GSM 03.38). There are two ways to send and receive SMS messages: text mode and PDU (Protocol Data Unit) mode. Text mode can only send plain ASCII characters; sending pictures, ringtones, or characters in other encodings (such as Chinese) requires PDU mode.
In PDU mode, the content can be encoded in one of three ways: 7-bit, 8-bit, or 16-bit encoding. 7-bit encoding is used for plain ASCII characters; 8-bit encoding is usually used for data messages such as pictures and ringtones; 16-bit encoding is used for Unicode characters. The maximum number of characters per message with these three encodings is 160, 140, and 70, respectively.
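The three limits follow from the 140-byte payload of a single message. A minimal sketch of the arithmetic in C, where the macro and function names are hypothetical:

```c
#include <stddef.h>

/* A single SMS carries at most 140 bytes (1120 bits) of user data. */
#define SMS_PAYLOAD_BYTES 140

/* Hypothetical helper: maximum number of characters per message for a
 * given character width in bits (7, 8, or 16). */
size_t sms_max_chars(unsigned bits_per_char)
{
    return (SMS_PAYLOAD_BYTES * 8) / bits_per_char;
}
```

Calling sms_max_chars() with 7, 8, and 16 gives the three limits quoted above.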
To send Chinese (or Japanese, etc.), the Unicode encoding method of the PDU mode must be used.
I was recently involved in a project that sends and receives text messages under Linux, and it had to handle Chinese. Since I had no experience with Chinese or Unicode encodings, I looked up some material and asked questions in a few forums. I am writing the results down here in the hope that they help with similar projects in the future. My description is brief; for the PDU specification, see http://www.ascend-tech.com.cn/sustain/sms_pdu-mode.pdf, or look on Wavecom's website.
1. Converting GB2312 to Unicode
On a RedHat 7.3 system, Chinese text (including mixed Chinese and English text) is saved in GB2312 by default, so the first step is to convert the GB2312-encoded string to a Unicode-encoded string. GB2312 is a multi-byte encoding: a Chinese character is represented by two bytes, and an English character by its one-byte ASCII code. (Note: I have not read the GB2312 standard; this understanding comes from practical development, and I cannot guarantee that it is correct.) The Unicode encoding used here is a double-byte encoding: every character takes 2 bytes. On Linux there are at least three ways to convert GB2312 to Unicode:
1) Use the mbstowcs() function, which converts a multi-byte string to wide characters. I tried it and it converted correctly, but this function may not be very reliable, since its result depends on the current locale setting.
2) Use a GB2312 → Unicode conversion table and look up each character by hand. Such tables are available on the Internet. Each GB2312 character has to be handled separately, depending on whether it is a Chinese character or an English character.
3) Use the iconv() function. This is probably the standard method on Linux; it can convert not only GB2312 to Unicode but between any two encodings (provided the Linux system supports them).
First, use iconv_open() to open a conversion handle, specifying the encoding to convert to and the encoding to convert from.
Then use iconv() to do the conversion. Finally, use iconv_close() to close the handle and release its resources.
#include <iconv.h>

char inbuf[BUFLEN];
char outbuf[BUFLEN];
char *pin = inbuf;
char *pout = outbuf;

/* ... open the file and read the GB2312 data into inbuf; the data length is len ... */

size_t inleft = len;
size_t outleft = BUFLEN;
iconv_t cd;

/* iconv_open() takes the target encoding first, then the source encoding */
if ((cd = iconv_open("UNICODE", "GB2312")) == (iconv_t)-1)
    return -1;
if (iconv(cd, &pin, &inleft, &pout, &outleft) == (size_t)-1) {
    iconv_close(cd);
    return -1;
}
iconv_close(cd);
When using iconv(), pay attention to its parameters: inleft is the length of the data in the input buffer, and outleft is the size of the output buffer (make sure the output buffer is large enough).
After the conversion, outleft is the amount of free space left in outbuf, so BUFLEN - outleft is the real length of the Unicode data.
Note: both GB2312 and Unicode data are just byte sequences in memory, so we can store them uniformly as char (or unsigned char). BUFLEN - outleft is therefore a number of bytes (char), not a number of Unicode characters.
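The three iconv calls can be wrapped in one function. This is only a sketch: the name convert_encoding and its return convention are hypothetical, and the encoding names passed in are assumed to be ones your system's iconv supports.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in` from encoding `from`
 * to encoding `to`, writing at most outsize bytes to `out`.
 * Returns the number of bytes written, or -1 on error. */
long convert_encoding(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);   /* target first, then source */
    if (cd == (iconv_t)-1)
        return -1;

    char *pin = (char *)in;              /* iconv() takes char **, not const char ** */
    char *pout = out;
    size_t inleft = inlen;
    size_t outleft = outsize;

    size_t rc = iconv(cd, &pin, &inleft, &pout, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    /* outleft is the free space remaining, so the difference is the
     * number of bytes actually produced. */
    return (long)(outsize - outleft);
}
```

For example, convert_encoding("UNICODE", "GB2312", inbuf, len, outbuf, BUFLEN) corresponds to the conversion described above. If your iconv also accepts a fixed byte order name such as "UCS-2BE" (an assumption about your system), the output is always high byte first.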
2. Converting Unicode to the 16-bit PDU encoding
After obtaining the Unicode data, it still has to be converted to the 16-bit encoding of the PDU before it can be sent correctly. Pay attention to three points during this conversion:
1) The 0xFEFF flag at the start of the Unicode data must be removed; the content after 0xFEFF is the real Unicode text. (This flag appears to be the Unicode byte order mark; if you know more about why it is there, please tell me.)
2) Each Unicode character occupies two bytes. My system is little-endian, so the low byte is stored first and the high byte second. For example, the Unicode encoding of "中" is 0x4E2D, which is stored in memory as 2D 4E. When converting to the 16-bit encoding, pay attention to this order. Of course, if your system is big-endian, no swap is needed.
3) To turn the Unicode value 0x4E2D into the 16-bit encoded text "4E2D", each byte can be converted with sprintf(buf, "%02X", (unsigned char)outbuf[i]). The cast matters when outbuf is a plain char array, because a byte above 0x7F would otherwise be sign-extended.
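The three points above can be sketched in C as follows. The function name unicode_to_pdu_hex is hypothetical, and instead of testing the host's byte order the sketch reads the order from the 0xFEFF flag itself, which assumes the data was produced with that flag in front.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: turn 2-byte Unicode data, as produced by iconv(),
 * into the hexadecimal text used in the PDU. A leading 0xFEFF flag is
 * dropped and decides the byte order; without a flag, the data is assumed
 * to be high byte first already. hex must hold 2 * len + 1 bytes.
 * Returns the number of hex digits written, or -1 if len is odd. */
int unicode_to_pdu_hex(const unsigned char *data, size_t len, char *hex)
{
    int little_endian = 0;
    size_t i = 0;
    int n = 0;

    if (len % 2 != 0)
        return -1;

    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        little_endian = 1;   /* 0xFEFF stored low byte first */
        i = 2;
    } else if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        i = 2;               /* 0xFEFF stored high byte first */
    }

    for (; i < len; i += 2) {
        unsigned char hi = little_endian ? data[i + 1] : data[i];
        unsigned char lo = little_endian ? data[i] : data[i + 1];
        n += sprintf(hex + n, "%02X%02X", hi, lo);
    }
    hex[n] = '\0';
    return n;
}
```

Taking unsigned char input avoids the sign-extension problem that "%02X" has with plain char values above 0x7F.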
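3. Calculating the length of the 16-bit encoded message body correctly
The PDU must include the length of the message body, so this value has to be calculated. For the 16-bit encoding the length is counted in bytes, not characters: each character contributes 2 bytes, which is 4 hexadecimal digits in the PDU text.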
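A minimal sketch of the length calculation, assuming the message body is already available as hexadecimal text and that, as GSM 03.40 specifies for 8-bit and 16-bit data, the length field counts bytes. The function name pdu_body_length is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the hex text of the 16-bit encoded message
 * body, write its length field as two hex digits into udl (3 bytes).
 * The length counts bytes, and two hex digits make one byte.
 * Returns 0 on success, -1 if the body is malformed or longer than
 * 140 bytes. */
int pdu_body_length(const char *body_hex, char *udl)
{
    size_t digits = strlen(body_hex);
    if (digits % 4 != 0)           /* each 16-bit character is 4 hex digits */
        return -1;

    size_t bytes = digits / 2;
    if (bytes > 140)               /* more than one message can carry */
        return -1;

    sprintf(udl, "%02X", (unsigned)bytes);
    return 0;
}
```

For a two-character body such as "4E2D6587", the length field comes out as "04".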
4. Setting first-octet, TP-MR, TP-PID, TP-DCS, and TP-VP correctly
In the PDU format, setting first-octet, TP-MR, TP-PID, TP-DCS, and TP-VP correctly is essential for sending Unicode. According to the protocol specification and my debugging results, the correct values are (all hexadecimal):
first-octet: 11
TP-MR: 00
TP-PID: 00
TP-DCS: 08 (encoding method, 16-bit)
TP-VP: A7
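To show where these values sit in a complete message, here is a rough sketch that assembles the PDU text. Only the five values listed above come from this article; the empty SMSC field ("00"), the international number type ("91"), the function name build_pdu, and the phone number in the usage note are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: assemble an SMS-SUBMIT PDU as hex text.
 * dest     - destination number, digits only, international format (assumption)
 * body_hex - message body already in 16-bit hex form, e.g. "4E2D"
 * pdu      - output buffer, assumed large enough
 * Returns the length of the PDU text, or -1 on bad input. */
int build_pdu(const char *dest, const char *body_hex, char *pdu)
{
    size_t ndigits = strlen(dest);
    size_t nbody = strlen(body_hex);
    int n = 0;

    if (ndigits == 0 || ndigits > 20 || nbody % 4 != 0 || nbody / 2 > 140)
        return -1;

    n += sprintf(pdu + n, "00");   /* assumption: empty SMSC field, use the stored one */
    n += sprintf(pdu + n, "11");   /* first-octet */
    n += sprintf(pdu + n, "00");   /* TP-MR */
    n += sprintf(pdu + n, "%02X", (unsigned)ndigits);  /* number of digits in dest */
    n += sprintf(pdu + n, "91");   /* assumption: international number type */

    /* Address digits are stored in swapped pairs; an odd count is padded with F. */
    for (size_t i = 0; i < ndigits; i += 2) {
        char second = (i + 1 < ndigits) ? dest[i + 1] : 'F';
        n += sprintf(pdu + n, "%c%c", second, dest[i]);
    }

    n += sprintf(pdu + n, "00");   /* TP-PID */
    n += sprintf(pdu + n, "08");   /* TP-DCS: 16-bit encoding */
    n += sprintf(pdu + n, "A7");   /* TP-VP */
    n += sprintf(pdu + n, "%02X", (unsigned)(nbody / 2));  /* body length in bytes */
    n += sprintf(pdu + n, "%s", body_hex);
    return n;
}
```

With the made-up number "8613912345678" and the body "4E2D", the result is "0011000D91683119325476F80008A7024E2D".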
After the steps above, you can send Chinese characters.
I hope this document is of some help to anyone preparing to develop SMS applications on Linux.
Reference:
★ An Introduction to the SMS in PDU Mode, GSM Recommendation Phase 2