Multi-language processing -> Unicode -> IBM's ICU class library
I recently wrote a C++ program that has to convert text between different language encodings, mainly GBK, BIG5, UTF-8 and UTF-16. The program is small, but the conversions are fairly complex. At first I wanted to solve the problem with C++ itself: the wchar_t type in the C++ standard, the locale functions in the C runtime library, the mbstowcs and wcstombs functions, and wstring, wfstream and so on in the standard C++ library. In practice, although these can do encoding conversion, they are cumbersome to use and lack object-oriented fluency. Most annoying of all, different C++ compilers support these facilities differently: VC is relatively good, while with the others it always takes a long time to get anything running, and sometimes I run into strange encoding errors.
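To make the standard-library route described above concrete, here is a minimal sketch of decoding and encoding through wchar_t with setlocale, mbstowcs and wcstombs. The helper names (ScopedCtype, decode_with_locale, encode_with_locale) are hypothetical, and the locale names a caller would pass (for example "zh_TW.BIG5" or "zh_CN.GBK") are assumptions: they differ between platforms and may not be installed at all, which is a large part of the portability trouble.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: switches LC_CTYPE to `name` for the lifetime of the
// object and restores the previous setting afterwards.
class ScopedCtype {
public:
    explicit ScopedCtype(const char* name) {
        const char* current = std::setlocale(LC_CTYPE, nullptr);
        saved_ = current ? current : "C";
        ok_ = std::setlocale(LC_CTYPE, name) != nullptr;
    }
    ~ScopedCtype() { std::setlocale(LC_CTYPE, saved_.c_str()); }
    bool ok() const { return ok_; }
private:
    std::string saved_;
    bool ok_;
};

// Decode multibyte text into wide characters using the named C locale.
// Returns false if the locale is unavailable or the bytes are invalid in it.
// Note: input is treated as NUL-terminated, so embedded NULs end the text.
bool decode_with_locale(const std::string& bytes, const char* locale_name,
                        std::wstring& out) {
    assert(locale_name != nullptr);
    ScopedCtype scope(locale_name);
    if (!scope.ok()) return false;
    // The result never has more wide characters than the input has bytes.
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(buf.data(), n);
    return true;
}

// Encode wide characters into the multibyte encoding of the named C locale.
bool encode_with_locale(const std::wstring& wide, const char* locale_name,
                        std::string& out) {
    assert(locale_name != nullptr);
    ScopedCtype scope(locale_name);
    if (!scope.ok()) return false;
    // MB_CUR_MAX is the longest multibyte sequence in the active locale.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(buf.data(), n);
    return true;
}
```

A BIG5 to GBK conversion would then be a decode with one locale followed by an encode with another, and it only works where both locales exist under the names you guessed. The global setlocale call is also not thread-safe, which adds to the awkwardness.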
Thinking about it now, storing strings in CString or TCHAR arrays in MFC and converting encodings with Win32's MultiByteToWideChar and WideCharToMultiByte functions is somewhat more convenient than the approach above, but it is even harder to port.
At times like this I feel particularly nostalgic for how Java handles similar problems. Java's char type is itself Unicode, and the API provides an object-oriented encoding conversion mechanism. So in Java, converting a file's encoding as follows is quite comfortable:
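// requires: import java.io.*;
InputStreamReader in = new InputStreamReader(new FileInputStream(old_name), "BIG5");
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new_name), "GBK");
int i;
// Read one decoded character at a time and write it out in the target encoding.
while ((i = in.read()) >= 0)
    out.write(i);
in.close();
out.close();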
More complex encoding conversion problems are also very simple to solve in Java. What is the essential reason for this? Is it that Java designed its built-in character type (perhaps "built-in" is not quite the right word) as Unicode? In other words, is this the benefit of Java's char type itself being a Unicode type? The languages of the .NET platform, such as C# and VB.NET, follow a design similar to Java's, so handling encoding conversion is also very convenient in them. In C/C++, internationalization was introduced later, and char and wchar_t are kept separate from each other. Does that make a lot of trouble inevitable?
A few days ago I wrote a Ruby script on Linux to manage and monitor the broadband connection in my own home (I am not familiar with Ruby and wrote it while consulting the Ruby manual), and found that Ruby's built-in character type is also single-byte. So to handle Unicode or convert encodings you have to use other libraries and convert back and forth, and when I wrote the control interface with Ruby/Tk, displaying Chinese on Linux gave me a headache. Doesn't this suggest that if a scripting language like Ruby changed its internal character type to Unicode, it would support internationalization better and be accepted by more people? Or should interpreted scripting languages (dynamic languages?) all use a mechanism similar to Java's?
Back to C++. Having found that C++ itself has limited internationalization support, I started looking for a class library. I first checked Boost and found little that was relevant. The only useful library there, Date-Time, is not very good: its support is very limited, covering only the Gregorian calendar system, which falls far short of the GregorianCalendar class in Java and the System.Globalization.Calendar class in .NET.
Next I thought of Apache's Xerces-C++. As a class library for handling XML, it surely ought to support encoding conversion and Unicode programming. But after reading through the documentation, I found that Xerces-C++ is a class library dedicated to XML processing; it has some internationalization support, but a lot is missing.
Then I found Mark Davis's article Forms of Unicode on the IBM website (http://www-900.ibm.com/developerworks/cn/Unicode/utfencodingforms/index_eng.shtml). It is excellent: it not only covers the basics of Unicode, it also led me to a class library called ICU. ICU now stands for International Components for Unicode, and its URL is
http://oss.software.ibm.com/icu/
Then I was very pleased to learn that the internationalization functionality in Java originally came from ICU (why didn't I find it before?). That is why the current ICU class libraries (both the C/C++ and Java versions) have interfaces very similar to the Java API. ICU is very comprehensive in its support for encoding conversion, and it also provides functionality such as regular expressions. The GregorianCalendar class in ICU provides essentially the same functionality as the GregorianCalendar class in Java (I discussed these at http://www.cbsblog.net/WangyongGang/Archive/2004/05/17/732.Spx).
IBM is really great, and so is ICU! The functionality I need is all here, so now I can stop the hard searching, study ICU, and write my program.