A Port of chmlib-0.31 That Works on Mac OS X
On October 16 I ported chmlib-0.31 to Mac OS X 10.2 using GCC 3.3.
Here are my steps:
1. In file chm_lib.c, line 145, append "| __PPC__" to the end.
2. In the Makefile, line 11, delete -DM_USE_IO64.
3. In the Makefile, line 15, change "LIBTOOL = libtool" to "LIBTOOL = ./libtool".
4. In the chmlib-0.31 directory, provide a script named libtool (wrapping the libtool command) of the kind normally generated by autoconf; my libtool script is here.
After these four steps, chmlib-0.31 works well under OS X.
Published by WLFJCK
12:35
|
Reply (7)
October 15, 2003
A method for localizing VC++ programs
The most common technique for handling resources in Visual C++ is to store them in DLLs.
This technique has problems: 1. Code and resources are not really separated; to support each additional language, a corresponding DLL project must be created. 2. Developers must take part in the resource localization process. In essence, using DLLs does not separate resources from code.
Instead, we can use script files to store all the resources related to localization, which solves the problems of storing resources in DLLs. Proceed as follows:
1. To localize menu resources, walk the menu, get each menu item's ID, and then fetch the corresponding localized string from the resource script. 2. To localize dialog resources, enumerate all child-control IDs of a given dialog ID, then fetch the corresponding localized strings from the resource script. 3. To localize status-bar resources, override GetMessageString() in CFrameWnd and fetch the localized string from the resource script according to the ID. 4. To localize tooltip resources, override OnToolTipText() in CFrameWnd and fetch the localized string from the resource script. 5. For custom controls and views, provide an ID-to-string lookup and use it wherever a string is needed. A minimal sketch of point 3 follows.
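As a rough illustration of point 3, the sketch below overrides CFrameWnd::GetMessageString so that status-bar prompts come from an externally loaded string table instead of the compiled-in resources. This is my own hypothetical example, not code from the original post; LoadLocalizedString and its script-file format are assumptions.

// Hypothetical sketch: status-bar prompts looked up in an external,
// localized string table instead of the EXE's resources.
#include <afxwin.h>
#include <map>

std::map<UINT, CString> g_localizedStrings;      // filled by parsing the script file

static CString LoadLocalizedString(UINT nID)     // assumed helper, not an MFC API
{
    std::map<UINT, CString>::const_iterator it = g_localizedStrings.find(nID);
    return it != g_localizedStrings.end() ? it->second : CString();
}

class CLocalizedFrame : public CFrameWnd
{
protected:
    // Called by MFC whenever the status-bar prompt for command nID is needed.
    virtual void GetMessageString(UINT nID, CString& rMessage) const
    {
        rMessage = LoadLocalizedString(nID);
        if (rMessage.IsEmpty())
            CFrameWnd::GetMessageString(nID, rMessage);  // fall back to resources
    }
};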
In fact, many applications already do this, FlashGet among them.
Published by WLFJCK
14:32
|
Reply (6)
Microsoft HTML HELP Format Research
http://www.speakeasy.org/~russotto/chm/
http://bonedaddy.net/pabs3/hhm/index.html
http://66.93.236.84/~jedwin/projects/chmlib/
Published by WLFJCK
13:44
|
Reply (9)
A comparative analysis of Windows dynamic libraries and Linux shared objects
Original source: http://www.ahcit.com/200306/3a.doc
Abstract: Dynamic link library technology is commonly used in program design and implementation. Both Windows and Linux provide dynamic libraries; using them reduces program size, saves space, improves efficiency, increases program extensibility, and makes modular management easier. However, because different operating systems use different dynamic library formats, a dynamic library must be ported when a program needs to run on a different operating system. This paper analyzes and compares the dynamic library technology of the two operating systems and presents methods and experience for porting dynamic libraries written with Visual C++ to Linux. Keywords: dynamic link library; Linux; programming; program porting
1 Introduction
Dynamic link library (DLL) technology is frequently used in programming. Its purpose is to reduce program size, save space, improve efficiency, and provide high flexibility; dynamic library technology also makes it easier to upgrade software versions. Unlike a static link library, the functions in a dynamic library are not part of the executable itself; they are loaded on demand as execution requires, and their code can be shared by multiple programs.
This approach is available in both Windows and Linux, but the calling conventions and programming methods differ. This paper first analyzes the dynamic library calling conventions and programming methods commonly used on these two operating systems, then analyzes the differences between them, and finally, based on actual porting experience, describes how to port a Windows dynamic library written with VC++ to Linux.
2 Dynamic library technology
2.1 Windows Dynamic Library Technology
A dynamic link library is an important technical means by which Windows applications share resources, save memory, and improve efficiency. A typical dynamic library contains exported functions and resources; some dynamic libraries contain only resources, such as Windows font resource files, and are called resource-only DLLs. Dynamic libraries usually use .dll, .drv, .fon and similar suffixes, while the corresponding Windows static libraries usually end in .lib. Windows itself implements some of its main system functionality in the form of dynamic library modules.
A Windows dynamic library is mapped into the virtual address space of the calling process at run time; it uses memory allocated from that process's virtual address space and becomes part of the calling process. The DLL can be accessed only by the threads of that process; its handles can be used by the calling process, and the calling process's handles can be used by the DLL. A DLL module contains various exported functions that serve the outside world. A DLL can have its own data segment but has no stack of its own; it uses the same stack as the application that calls it. A DLL has only one instance in memory; a DLL encapsulates its code; and writing a DLL is independent of any specific programming language or compiler, so mixed-language programming can be achieved through DLLs. Any object (including variables) created inside a DLL function belongs to the thread or process that uses it.
Depending on how it is invoked, a dynamic library can be called in static mode or in dynamic mode.
(1) Static calling, also known as implicit calling. The compiler and linker generate the code that loads and unloads the DLL (the Windows system keeps a reference count of DLL usage), so this mode is simple and usually sufficient. The usual approach is to add the .lib file generated together with the dynamic link library to the application project; to use a function in the DLL, you only need to declare it in the source file. The .lib file contains the symbolic name of each exported DLL function, an optional ordinal, and the DLL file name, but no actual code. This information is linked into the generated application, and the referenced DLL file is loaded into memory when the application is loaded.
(2) Dynamic calling: the programmer uses system APIs to load and unload the DLL as needed. This is more complicated but uses memory more efficiently, and it is an important way to build large applications. On Windows, the functions related to dynamic library calls are: 1) LoadLibrary (or MFC's AfxLoadLibrary), to load the dynamic library; 2) GetProcAddress, to obtain a function to be imported, converting a symbol name or ordinal into an address inside the DLL; 3) FreeLibrary (or MFC's AfxFreeLibrary), to release the dynamic link library. Creating a dynamic library on Windows is also very convenient and simple: in Visual C++ you can create a plain C DLL without MFC, or a DLL based on the MFC class library. Every DLL must have an entry point; in VC++, DllMain is the default entry function, responsible for initialization and cleanup. Exported functions of a dynamic library also follow two conventions, the calling convention and name decoration. The functions defined in a DLL program are divided into internal functions and exported functions; the exported functions are the ones called by other program modules. Functions can usually be exported in the following ways (a sketch of exporting and explicitly loading a DLL follows this list):
1) Use the EXPORTS section of a module definition file to specify the functions or variables to be exported.
2) Use the __declspec(dllexport) modifier provided by VC++.
3) Use the /EXPORT linker command-line option to export the relevant functions.
In a Windows dynamic library you sometimes also need to write a module definition file (.def), a text file whose module statements describe the properties of the DLL.
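To make the export and calling conventions concrete, here is a minimal sketch of my own (not from the paper): a DLL that exports one function with __declspec(dllexport), and a client that loads it explicitly with LoadLibrary/GetProcAddress/FreeLibrary. The file and function names are made up for illustration.

// mydll.cpp -- build as a DLL
#include <windows.h>

BOOL APIENTRY DllMain(HINSTANCE, DWORD, LPVOID) { return TRUE; }     // default entry point

extern "C" __declspec(dllexport) int Add(int a, int b) { return a + b; }

// client.cpp -- explicit (dynamic) call
#include <windows.h>
#include <cstdio>

typedef int (*AddFunc)(int, int);

int main()
{
    HINSTANCE h = LoadLibrary(TEXT("mydll.dll"));      // load the DLL on demand
    if (!h) return 1;
    AddFunc add = (AddFunc)GetProcAddress(h, "Add");   // resolve by symbol name
    if (add) printf("2 + 3 = %d\n", add(2, 3));
    FreeLibrary(h);                                    // release the DLL
    return 0;
}

For the static (implicit) mode, the client would instead link against the import library mydll.lib and simply declare the function; the loader maps the DLL automatically when the application starts.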
2.2 Linux Shared Object Technology
On Linux the corresponding facility is called a shared object rather than a dynamic library, although it corresponds to the dynamic library on Windows; shared object files use .so as the suffix. For convenience, this article does not distinguish the two terms. Many shared objects ending in .so live under /lib, /usr/lib, and the standard graphics-interface library directories. Likewise, Linux also has static libraries, with .a as the corresponding suffix. Linux adopts shared object technology to ease program development, reduce the space programs occupy, and increase program extensibility and flexibility. Linux also lets developers substitute modules from their own libraries for system modules through the LD_PRELOAD variable.
As on Windows, creating and using a dynamic library on Linux is relatively easy: add the -shared option when compiling the library source and the result is a dynamic link library, usually with a .so suffix. In designing a Linux dynamic library, the usual procedure is to write the user interface file (usually a .h file), write the actual implementation files (with .c or .cpp suffixes), and write a Makefile. For smaller dynamic library programs the Makefile is not strictly necessary, but this layout makes the program better organized.
After the dynamic library is compiled, it can be called from a program. On Linux there are several calling modes. Just as with the Windows system directories (../system32, etc.), you can copy the dynamic library file into the /lib directory, or create a symbolic link in /lib, so that all users can use it. The following are the functions Linux uses to call dynamic libraries; a source program using them must include the dlfcn.h header file, which defines the prototypes of the dynamic-loading functions. (1) Open a dynamic link library: dlopen, prototype void *dlopen(const char *filename, int flag); dlopen opens the dynamic link library of the specified file name and returns an operating handle.
(2) Get a function's address: dlsym, prototype void *dlsym(void *handle, char *symbol); dlsym returns the code address corresponding to the symbol, looked up through the dynamic library handle.
(3) Close the dynamic link library: dlclose, prototype int dlclose(void *handle); dlclose closes the dynamic library identified by the handle; only when the reference count of the library drops to 0 is it really unloaded from the system. (4) Dynamic library error function: dlerror, prototype const char *dlerror(void); when a dynamic-library operation fails, dlerror returns an error message; a NULL return value means the last operation succeeded.
After obtaining the address of a function, it can be called through a pointer declared according to the interface the dynamic library exposes. When writing the Makefile for a program that calls a dynamic library in this way, add the compile options -rdynamic and -ldl. A minimal sketch follows.
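Here is a minimal sketch of this explicit calling mode, assuming a hypothetical libmylib.so that exports an add function (the library and symbol names are mine, not the paper's); the caller is linked with -ldl, and -rdynamic is only needed when the library must resolve symbols back in the executable.

// main.cpp -- load a hypothetical libmylib.so and call its "add" symbol
#include <dlfcn.h>
#include <cstdio>

int main()
{
    void *handle = dlopen("./libmylib.so", RTLD_LAZY);   // (1) open the library
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    typedef int (*add_t)(int, int);
    add_t add = (add_t)dlsym(handle, "add");             // (2) look up the symbol
    const char *err = dlerror();                         // (4) check for an error
    if (err) { fprintf(stderr, "%s\n", err); dlclose(handle); return 1; }

    printf("2 + 3 = %d\n", add(2, 3));
    dlclose(handle);                                     // (3) close the library
    return 0;
}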
Besides writing and calling dynamic libraries in this way, the Linux operating system also provides a more convenient calling mode, which is also the way other programs call your library; it is similar to implicit linking on Windows. The naming scheme for such libraries is "lib*.so.*": the first * is the library name and the second * is usually the version number (it may be absent). In this mode you need to maintain the dynamic-link-library configuration file /etc/ld.so.conf so that the library can be found by the system; usually you append the directory containing the library to the configuration file. For example, a system with X Window installed has an /usr/X11R6/lib entry, which points to the dynamic link libraries of the X Window system. For the library to be shared by the system, run the dynamic library management command /sbin/ldconfig. When compiling a program that references such a library, you can pass it to GCC with a link option or reference it directly. On Linux, the ldd command can be used to check which shared libraries a program depends on.
3 Comparative analysis of the two systems' dynamic libraries
Windows and Linux adopt essentially the same dynamic link library technology, but because the operating systems differ, there are still differences in many respects, explained below.
(1) Dynamic library programming. Under Windows the executable file format is PE, and a dynamic library needs a DllMain function as its initialization entry; declarations of exported functions usually require the __declspec(dllexport) keyword. Executables compiled by GCC under Linux use the ELF format by default, need no initialization entry, and need no special declarations for exported functions, which makes them easier to write.
(2) Dynamic library compilation. Windows provides a convenient build and debug environment, and you usually do not need to write a Makefile yourself; under Linux you must write the Makefile yourself, so a certain amount of Makefile-writing skill is required. In addition, GCC's compilation rules are usually stricter. (3) Dynamic library calling. Dynamic libraries written under Windows and under Linux can both be called explicitly or implicitly, but the specific calling methods are not the same.
(4) Viewing the functions a dynamic library exports. Under Windows there are many tools and programs for viewing the functions exported from a DLL, for example the command-line tool DUMPBIN and the Depends program among the VC++ tools. Under Linux, nm is usually used to view the exported functions, and ldd can be used to view the shared objects a program implicitly links against.
(5) Dependence on the operating system. Both kinds of dynamic library depend on their own operating system and cannot be used across platforms. Therefore, for dynamic libraries implementing the same functionality, there must be different versions for the two operating systems.
4 Dynamic library migration method
If you want to write a dynamic link library that can be used on both systems, you will usually choose to do the initial development in the debugging environment provided by VC++ under Windows; after all, the graphical editing and debugging interface of VC++ is more convenient than vi plus GCC. After testing is complete, the dynamic library program is ported. Note that GCC's default compilation rules are stricter than VC++'s: even code that compiles without warnings under VC++ may produce many warnings when compiled with GCC; the -w option can be used to turn warnings off in GCC.
The rules and experience that porting needs to follow are described below.
(1) Try not to change the order of the original dynamic library's header files. In C/C++ the order of header files often matters. Also, although C/C++ is case-sensitive, under Linux the case used when including a header must match the file name exactly, because the ext2 file system is case-sensitive about file names; otherwise the code will not compile. Under Windows, header names compile correctly regardless of case.
(2) Headers unique to each system. Under Windows, the windows.h header is usually included, and the winsock.h header as well if the underlying communication functions are called. When porting to Linux, comment out these Windows-specific headers and the Windows-specific constant definitions, and add the headers that support the underlying communication on Linux.
(3) Data types. VC++ has many of its own data types, such as __int16, __int32, TRUE, SOCKET, which the GCC compiler does not support. The usual practice is to copy the relevant declarations from windows.h and basetypes.h into a header file and then include that header under Linux; for example, the SOCKET type is changed to int.
(4) Keywords. VC++ has many keywords that do not exist in standard C, such as BOOL, BYTE, DWORD, __asm. For portability, try not to use them; if they cannot be avoided, use #ifdef and #endif to provide separate versions for Linux and Windows.
(5) Changes to function prototypes. Usually functions have to be rewritten, respecting the differences between the two systems for system calls; for example, in a network communication dynamic library written for Linux, the close() function is used instead of the closesocket() function of the Windows operating system to close a socket. In addition, there are no file handles under Linux; to open files, the open and fopen functions are available, whose usage can be found in [2] (a small portability sketch follows this list). (6) Makefile writing. Under Windows the program is usually built and debugged with the VC++ compiler, while for GCC you have to write the Makefile yourself, or you can refer to a Makefile generated by VC++. For dynamic library porting, add the -shared option when compiling the dynamic library; for programs that use mathematical functions such as powers and logarithms, add -lm when linking.
(7) Other points to note. 1) Analysis of the program's structure: when porting a dynamic library written by someone else, analyzing the program structure is an essential step; since a dynamic library program usually contains no user interface, this is relatively easy. 2) Under Linux the permissions of a file or directory are divided into owner, group, and others, so when operating on a file pay attention to whether it is being read or written; when writing a file, remember to adjust the permissions of the file or directory, otherwise the write will fail. 3) Use of pointers: defining a pointer allocates only the four bytes of the pointer itself; if you want to assign to what the pointer points to, you must allocate memory for it with malloc or define it as a variable instead of a pointer. Linux is stricter about this than Windows. Likewise, a structure should not be passed around uninitialized; if structure members are to be filled in inside a function, the function should receive a pointer to an allocated structure. 4) Path separators: "/" under Linux and "\" under Windows; also pay attention to the different dynamic library search paths of Windows and Linux. 5) Programming and debugging skills: different debugging environments have different techniques, which are not described here.
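As a rough illustration of points (3), (4) and (5), a small portability header along the following lines (my own sketch, not from the paper) can absorb most of the type, keyword, and socket-API differences:

// port.h -- illustrative portability shim; all names here are the sketch's own
#ifndef PORT_H
#define PORT_H

#ifdef _WIN32
  #include <windows.h>
  #include <winsock.h>
  #define CLOSE_SOCKET(s) closesocket(s)
#else
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <unistd.h>
  typedef int            SOCKET;    // VC++ SOCKET mapped onto a plain int
  typedef int            BOOL;      // VC++-specific keywords re-created as typedefs
  typedef unsigned char  BYTE;
  typedef unsigned long  DWORD;
  typedef short          __int16;
  typedef int            __int32;
  #ifndef TRUE
    #define TRUE  1
    #define FALSE 0
  #endif
  #define CLOSE_SOCKET(s) close(s)  // close() replaces closesocket()
#endif

#endif /* PORT_H */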
5 Conclusion
This article has analyzed the implementation and use of Windows dynamic libraries and Linux shared objects from the angles of program writing, compilation, calling, and operating system dependence, and compared the two calling modes. Based on actual program porting, it gives a method for porting a Windows dynamic library written with VC++ to Linux, together with the problems that need attention and program examples. In a real porting effort, because of differences in system design, the work may be considerably more complex than described above; through this summary the paper aims to provide useful experience and techniques to those who intend to port programs between different operating systems.
References
[1] David J. Kruglinski. Inside Visual C++ 6.0 (5th edition). Translated by the Hope Books studio. Beijing: Beijing Hope Electronics Press, 1999.
[2] Zhongke Hongqi Software Technology Co., Ltd. Linux/UNIX Advanced Programming. Beijing: Tsinghua University Press, 2001.
[3] Sandra Loosemore et al. The GNU C Library Reference Manual. Machinery Industry Press, 2000.
Published by WLFJCK
09:06
|
Reply (8)
October 12, 2003
Porting the Mozilla character set detection code
Character set detection is very useful: when encoding information cannot be obtained, we need a mechanism to detect the encoding of text so that it can be handled correctly. In a Windows environment we can call the MLang interfaces to do charset detection; code-page auto-detection can be done by calling DetectCodepageInIStream or DetectInputCodepage on IMultiLanguage2. In other environments, however, such as UNIX, Linux, or Mac systems, we need our own code to provide character set detection. The Mozilla open source project provides such a set of code (see "A composite approach to language/encoding detection" posted on October 2). To peel this code out of the Mozilla project a little work is needed; you can refer to the article "How to build a Standalone Universal Charset Detector from Mozilla Source" on the Mozilla website.
Below I briefly introduce my porting work. The steps are as follows: 1. First get the Mozilla code; you can download it from www.mozilla.org or pull it from CVS. All the character set detection code lives in the directory mozilla/extensions/universalchardet/src. 2. Create two files, nscore.h and config.h, with the following content:
#ifndef nscore_h__
#define nscore_h__

#include <stdlib.h>

typedef int PRInt32;
typedef unsigned int PRUint32;
typedef bool PRBool;
typedef short PRInt16;

#define PR_FALSE false
#define PR_TRUE  true
#define nsnull   NULL

#define PR_MALLOC malloc
#define PR_FREEIF(x) do { if (x) free(x); } while (0)

#endif
The above is nscore.h; it replaces the Mozilla/NSPR types used by the original code.
#ifndef CONFIG_H
#define CONFIG_H

#if defined(WIN32) || defined(_WIN32) || defined(__WIN32__)
#define BUILD_WIN32
#endif

#if defined(BUILD_WIN32) && !defined(BUILD_NO_DLL)
#ifdef CLASS_IMPLEMENTATION
#define CLASS_EXPORT __declspec(dllexport)
#else
#define CLASS_EXPORT __declspec(dllimport)
#endif
#else
#define CLASS_EXPORT
#endif

#endif
The above is config.h, which sets the compilation options.
3. Modify nsUniversalDetector.h: add #include "config.h" at the top; then modify nsUniversalDetector.h and nsUniversalDetector.cpp to comment out everything related to the XPCOM classes. 4. Derive a subclass from the nsUniversalDetector class and provide a Report method (a rough sketch follows). 5. Write the Makefiles; I wrote two, one for compiling with VC6 and one for compiling with GCC. With VC6 the code can be compiled into either a DLL or a static library.
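Here is a rough sketch of step 4, assuming the stripped-down nsUniversalDetector still exposes HandleData/DataEnd and the protected virtual Report callback described in the Mozilla article; apart from nsUniversalDetector itself, the class and member names below are my own.

// Illustrative subclass: capture the charset name the detector reports.
#include <string>
#include "nsUniversalDetector.h"

class MyCharsetDetector : public nsUniversalDetector
{
public:
    MyCharsetDetector() : mDone(false) {}

    // Feed raw bytes, then signal end of data; the base class calls
    // Report() once it is confident about the encoding.
    bool Detect(const char* aBuf, PRUint32 aLen, std::string& oCharset)
    {
        HandleData(aBuf, aLen);
        DataEnd();
        oCharset = mCharset;
        return mDone;
    }

protected:
    virtual void Report(const char* aCharset)
    {
        mCharset = aCharset;    // remember the detected charset name
        mDone = true;
    }

private:
    std::string mCharset;
    bool mDone;
};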
Through the above steps we can pull the character set detection code out of the Mozilla project and use it in our own projects.
I have packed the ported code into detect.zip. After unpacking, it can be compiled with NMAKE, or with MAKE on any platform that provides GCC, e.g. nmake -f makefile.msvc or make -f makefile.gcc. When using the static library in VC, #define BUILD_NO_DLL is required. A UniversalStringDetector class is provided; to detect a character set, call its DoIt method, and the result is stored in oCharset (a hypothetical usage sketch follows).
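A hypothetical usage sketch of that wrapper; the header name and the DoIt signature below are my guesses from the description above, not the actual detect.zip interface.

// Hypothetical caller for the UniversalStringDetector wrapper described above.
#include <cstdio>
#include <cstring>
#include <string>
#include "UniversalStringDetector.h"   // assumed header name

int main()
{
    const char* text = "bytes read from a file or an HTTP response";

    UniversalStringDetector detector;
    std::string oCharset;
    detector.DoIt(text, strlen(text), oCharset);   // assumed signature

    printf("detected charset: %s\n", oCharset.c_str());
    return 0;
}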
Published by WLFJCK
19:45
|
Reply (8)
October 11, 2003
Links to Java virtual machine implementations and more
http://www.kaffe.org/links.shtml
Published by WLFJCK
09:45
|
Reply (10)
October 10, 2003
Compiling Anjuta
There are not many development environments under Linux; the most common are KDevelop and Anjuta. A typical Linux system installs KDevelop for you by default, while Anjuta is not installed.
Personally I think Anjuta is better than KDevelop. Download the Anjuta source code from http://anjuta.sourceforge.net; the current version is anjuta-1.1.97.
tar zxvf anjuta-1.1.97.tar.gz
cd anjuta-1.1.97
./configure
During configuration, because everyone's environment is different, it may not succeed, but that is nothing: ./configure will tell you at the end which packages are missing. All you need to do is fetch them from the Internet, compile them, and make install.
In my Red Hat 9 environment, the libgnomeprint-2.0, vte, and libzvt-2.0 packages were missing; I found them on the GNOME mirror site http://fto and fetched them, then installed these libraries one by one. One thing to note is that by default they are installed into /usr/local/lib, so PKG_CONFIG_PATH must be exported in .bash_profile:
PKG_CONFIG_PATH=/usr/lib/pkgconfig:/usr/local/lib/pkgconfig
export PKG_CONFIG_PATH
The rest is very simple:
./configure
make
make install
anjuta
Beautiful Anjuta appears!
Published by WLFJCK
15:14
|
Reply (7)
October 02, 2003
A composite approach to language/encoding detection
Translated from the Mozilla website. This paper discusses three different detection methods for implementing automatic character set detection.
A Composite Approach to Language/Encoding Detection
Shanjian Li (shanjian@netscape.com), Katsuhiko Momoi (momoi@netscape.com), Netscape Communications Corp.
[Note: this paper was originally presented at the 19th International Unicode Conference (San Jose). Since then our implementation has stood the test of time and of real applications, and we have made many improvements. A major change is that we now use positive sequences to detect single-byte charsets; see sections 4.7 and 4.7.1. This paper was written before the universal charset detection code was integrated into the Mozilla main code (see section 8); the charset detection code has since been merged into the code tree. For the latest implementation, please check the corresponding code in Mozilla's source tree. The authors - November 25, 2002.]
1. Summary:
This paper presents three methods of auto-detecting the encoding of a document that carries no explicit charset declaration. We discuss the advantages and disadvantages of each method, and propose a composite, more effective approach in which the three detection methods complement one another. We argue that automatic detection is useful in keeping browser users away from frequent use of the character encoding menu, while providing more reasonable fallback handling for the rare cases where the encoding menu is still needed. We assume that the conversion of the document into Unicode is transparent to the user: whether the character encoding is some Unicode encoding or a local encoding, the user only needs to know that the characters end up displayed correctly. Good automatic encoding detection can effectively handle most encoding matters without the user's manual participation.
2. Background:
Since the beginning of the computer age, people have created many encoding schemes to represent different text/character sets in computer data. With globalization and the development of the Internet, information exchange across languages and regions has become more and more important, but the existence of so many encoding schemes is a barrier. Unicode provides a universal encoding solution; however, for a variety of reasons it has not yet replaced the existing regional encoding schemes, even though the W3C and the IETF recommend UTF-8 as the default encoding, for example for XML, XHTML and RDF. Today's internationalized software therefore has to handle not only Unicode but also the various other encodings.
Our present work was carried out in the environment of developing an Internet browser. To handle the many languages that use different encodings on the Web, a lot of effort has been made. To obtain the correct display result, the browser needs the encoding information returned by the HTTP server or carried in the web page, or the end user can supply it by choosing from the encoding menu. However, most users have no ability to operate the encoding menu manually. Without encoding information, a web page is sometimes displayed as "garbage" characters and users cannot get the information they want; this eventually leads users to think that their browser is broken or buggy.
Since more and more Internet standard protocols designate Unicode as the default encoding, web pages will unquestionably move toward Unicode. Good universal automatic detection makes an important contribution to this shift, because it works so naturally that users do not need the encoding menu. In that case the transition will take place gradually in a way users hardly notice: to the user the pages simply always display correctly, and the encoding menu never has to be considered. This smooth transition will make encoding less and less visible to users. Automatic detection is critical to making such a scenario possible.
3. Problem scope:
3.1. A general model
Let us start from a general model. For most applications the following scheme represents the general framework of automatic detection:
Input data -> Auto detector -> Returned result
The application/program takes the result returned by the auto detector and uses the information for various purposes, for example to set the encoding of the data, to display the data as its original creator intended, to pass it on to other programs, and so on.
The automatic detection methods discussed in this paper use an Internet browser as the application environment; other applications can easily adapt them.
3.2. Browser and automatic detection
A browser can use certain detection algorithms to automatically detect the encoding of a web page. A given piece of text can potentially be interpreted under different encodings, but except in some extremely rare cases, only one interpretation is the one the author of the web page intended; this is why the page normally displays correctly only in the intended language/encoding.
To list the main factors in designing the automatic detection algorithm, we make the following assumptions about the input text and the procedure, taking web data as an example:
1. For some language, the input text is composed of words/sentences readable to its readers. (= The data is not gibberish.)
2. The input text comes from typical web pages on the Internet. (= The data is not from some dead or ancient language.)
3. The input text may include noise data unrelated to the encoding, e.g. HTML tags, unrelated words (for example, English words appearing in a Chinese document), spaces, and other format/control characters.
Covering all known languages and encodings in automatic detection is a nearly impossible task. In the current approach we have tried to cover all the frequently used encodings of the East Asian languages, and we also provide a generic model for single-byte detection. The Russian encodings were chosen as an example implementation of the latter and as the test basis for single-byte detection.
4. The target multi-byte encodings include UTF-8, Shift_JIS, EUC-JP, GB2312, Big5, EUC-TW, EUC-KR, ISO-2022-XX, and HZ.
5. A generic model is provided to handle single-byte encodings; the Russian encodings (KOI8-R, ISO8859-5, windows-1251, Mac-Cyrillic, IBM866, IBM855) serve as the implementation example and the test basis.
4. Three methods for automatic detection
4.1 Introduction
In this section we discuss three different methods of detecting the encoding of text data: 1) the coding scheme method, 2) character distribution, and 3) two-char sequence distribution. Each has its strengths and weaknesses when used alone, but if we use all three in a complementary manner, the result is ideal.
4.2 The coding scheme method
When detecting multi-byte encodings, this method is perhaps the most obvious and the one most readily exploited. In any multi-byte encoding scheme, not all possible code points are used; if an illegal byte or byte sequence is met while verifying a particular encoding, we can immediately conclude that this guess for the encoding is wrong. A small number of code points are also specific to one particular encoding, and this fact lets us make a positive judgment immediately. Frank Tang (Netscape Communications) developed a very efficient algorithm, based on the coding scheme, to detect charsets with a parallel state machine. His basic idea is:
For each encoding scheme, a state machine is implemented to verify byte sequences for that particular encoding. For each byte the detector receives, it feeds the byte to every available, active state machine, one byte at a time. A state machine changes its state based on its previous state and the byte it receives. The auto-detector is interested in three states of a state machine:
START state: this represents two situations, the initial state, or a legal byte sequence for the charset having just been verified. ME state: the state machine has verified a byte sequence that is specific to this charset and that no other possible charset contains; this leads the detector to an immediate positive answer. ERROR state: the state machine has met an illegal byte sequence for the charset; this leads immediately to a negative answer for this charset, and from now on the detector will no longer consider this encoding. In a typical example, only one state machine ever reaches a positive answer while all the others reach negative answers.
The PSM (Parallel State Machine) version used in our current work is a modification of Frank Tang's original work. Whenever a state machine reaches the START state, meaning that it has successfully verified a legal character sequence, we query the state machine for how many bytes that sequence contained. This information is used in two ways:
First, with UTF-8, if several multi-byte sequences pass verification, the input data is very unlikely to be anything other than UTF-8; so we count the multi-byte sequences seen by the UTF-8 state machine, and when the count reaches a certain threshold the conclusion is made. Second, for the other multi-byte encodings, this information is fed to the character distribution analyzer (see below) so that the analyzer can work on character data rather than on raw bytes.
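As an illustration of this idea, here is a much simplified, toy UTF-8 verifier of my own in the spirit of the parallel state machine; it is not the Mozilla implementation, and the threshold of 6 sequences is an arbitrary choice for the example.

// Toy "coding scheme" verifier for UTF-8: feed bytes one at a time and
// watch for START / ITS-ME / ERROR, as described above.
#include <cstdio>

enum State { eStart, eItsMe, eError };

class Utf8Verifier
{
public:
    Utf8Verifier() : mState(eStart), mPending(0), mMultiByteSeqs(0) {}

    State NextByte(unsigned char b)
    {
        if (mState == eError) return eError;
        if (mPending == 0) {                            // expecting a lead byte
            if (b < 0x80)                { /* ASCII */ }
            else if ((b & 0xE0) == 0xC0) mPending = 1;
            else if ((b & 0xF0) == 0xE0) mPending = 2;
            else if ((b & 0xF8) == 0xF0) mPending = 3;
            else return mState = eError;                // illegal lead byte
        } else {                                        // expecting a continuation byte
            if ((b & 0xC0) != 0x80) return mState = eError;
            if (--mPending == 0) ++mMultiByteSeqs;      // one full sequence verified
        }
        // report ITS-ME once enough multi-byte sequences have been seen
        return mState = (mMultiByteSeqs >= 6) ? eItsMe : eStart;
    }

private:
    State mState;
    int   mPending;         // continuation bytes still expected
    int   mMultiByteSeqs;   // count of verified multi-byte sequences
};

int main()
{
    const char text[] = "\xE4\xB8\xAD\xE6\x96\x87 UTF-8 \xE6\x96\x87\xE5\xAD\x97"
                        "\xE4\xB8\xAD\xE6\x96\x87\xE5\xAD\x97\xE7\xAC\xA6";
    Utf8Verifier v;
    State s = eStart;
    for (const char* p = text; *p; ++p)
        if ((s = v.NextByte((unsigned char)*p)) != eStart) break;
    printf("%s\n", s == eError ? "not UTF-8" : s == eItsMe ? "looks like UTF-8" : "undecided");
    return 0;
}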
4.3 Character distribution method:
In any given language, some characters are used more often than others. This fact can be used to build a data model for each language, and it is especially useful for languages with many characters, such as Chinese, Japanese and Korean. We often hear anecdotally about such distribution statistics, but we have not found many published results, so in the following discussion we rely mainly on data we collected ourselves.
4.3.1. Simplified Chinese:
Our study of the 6763 characters encoded in GB2312 produced the following distribution results:
Number of most frequent characters    Accumulated percentage
10      0.11723
64      0.31983
128     0.45298
256     0.61872
512     0.79135
1024    0.92260
2048    0.98505
4096    0.99929
6763    1.00000
Table 1. Simplified Chinese character distribution
4.3.2. Traditional Chinese:
A study of annual survey data from Taiwan's Mandarin Promotion Council, using Big5 encoding, showed similar results:
Number of most frequent characters    Accumulated percentage
10      0.11713
64      0.29612
128     0.42610
256     0.57851
512     0.74851
1024    0.89384
2048    0.97583
4096    0.99910
Table 2. Traditional Chinese Character Distribution Table
4.3.3. Japanese
For Japanese we collected our own data and wrote a utility to analyze it. The table below shows the results:
Number of most frequent characters    Accumulated percentage
10      0.27098
64      0.66722
128     0.77094
256     0.85710
512     0.92635
1024    0.97130
2048    0.99431
4096    0.99981
(all)   1.00000
Table 3. Japanese character distribution
4.3.4. Korean
Similarly, for Korean we collected our own data from the Internet and analyzed it with our own utility; the results are as follows:
Number of most frequent characters    Accumulated percentage
10      0.25620
64      0.64293
128     0.79290
256     0.92329
512     0.98653
1024    0.99999
2048    0.99999
4096    0.99999
Table 4. Korean character distribution
4.4. General features of the distribution results:
For all four languages, we find that a relatively small set of code points covers a large percentage of the characters used within our defined application scope. Moreover, closer examination of those code points shows that they are spread over a fairly wide coding range. This gives us a way to overcome a common problem of the coding scheme method, namely that the encodings of different countries may have overlapping code point ranges. Because the most frequently used character sets of these languages have the characteristics described above, the problem of different encodings overlapping each other in the coding scheme method becomes irrelevant in the distribution method.
4.5. Analysis Algorithm
To identify a language by its character frequency/distribution statistics, we need an algorithm that computes a value from a stream of text input. This value indicates how likely the text is to be in a certain charset encoding. A very intuitive approach would be to weight each character by its frequency, but our experience with different charsets shows that this is not necessary; it consumes too much CPU and too much memory. A much simpler version gives very satisfactory results, uses very few resources, and runs very fast.
In our current approach, all characters of a given encoding are classified into two categories, "frequently used" and "not frequently used". A character among the 512 most frequent characters of the frequency distribution table is classified as "frequently used". The number 512 was chosen because it covers a large percentage of the input text in all four languages while occupying only a small fraction of the code points. We count the characters of each category in the input text in batches, and then calculate a floating-point value that we call the distribution ratio.
The distribution ratio is defined as follows:
Distribution ratio = number of occurrences of the 512 most frequently used characters / number of occurrences of the remaining characters
Each of the multi-byte encodings we detect has its own characteristic distribution ratio. From the distribution ratio we can then compute the confidence that the raw input text is in a given encoding. The discussion of each encoding below will make this clearer.
4.6 Distribution ratio and confidence level:
Let us look at the distribution ratios of the four languages. Note that the term distribution ratio is used here with two meanings: the "ideal" distribution ratio is defined for the language/charset rather than for a specific encoding. If a language/charset is represented by several encodings, then for each encoding we compute the "actual" distribution ratio by classifying the input data into "frequently used" and "not frequently used" characters, and compare this value with the ideal distribution ratio of the language/charset. Based on the actual ratio obtained, we can compute the input data's confidence for each charset, as described below.
4.6.1. Simplified Chinese (GB2312)
GB2312 contains two levels of Chinese characters: level 1 contains 3755 characters and level 2 contains 3008 characters. Level 1 characters are used much more frequently than level 2 characters, and, not surprisingly, the 512 most frequent characters of GB2312 all fall within level 1. Because level 1 characters are ordered by pronunciation, those 512 characters are scattered almost evenly over the 3755 code points. Although they occupy only 13.64% of the level 1 code points, they account for 79.135% of the characters in a typical Chinese text. Under ideal conditions, a Chinese text containing enough characters gives us:
Distribution ratio = 0.79135 / (1 - 0.79135) = 3.79
For randomly generated text using the same encoding scheme, the ratio is roughly 512 / (3755 - 512) = 0.157, ignoring level 2 characters.
If we also take level 2 characters into account, and assume the probability of each level 1 character is p1 and that of each level 2 character is p2, the calculation becomes:
512 * p1 / (3755 * p1 + 3008 * p2 - 512 * p1) = 512 / (3755 + 3008 * (p2 / p1) - 512)
Obviously this value is even smaller. In the later analysis we use the worst case for comparison.
4.6.2. Big5:
Big5 and EUC-TW (i.e. the CNS character set) present a very similar situation. Big5 also encodes Chinese characters in two levels, and the 512 most frequently used characters are evenly distributed over the 5401 level 1 characters. The ideal distribution ratio obtained from Big5-encoded text is:
Distribution ratio = 0.74851 / (1 - 0.74851) = 2.98
and for randomly generated text the ratio is near
512 / (5401 - 512) = 0.105
Since the level 1 characters of Big5 are essentially the same as the level 1 characters of the CNS character set, the same analysis applies to EUC-TW.
4.6.3. Japanese SHIFT_JIS and EUC-JP:
For Japanese, hiragana and katakana are used more frequently than kanji. Because Shift_JIS and EUC-JP place the kana in different coding regions, we can still use the same method to distinguish between these two encodings. The kanji that appear among the 512 most frequent characters are still scattered evenly over the 2965 level 1 JIS kanji. The same analysis gives the following distribution ratio:
Distribution ratio = 0.92635 / (1 - 0.92635) = 12.58
For randomly generated Japanese text data, the ratio is at least
512 / (2965 + 63 + 83 + 86 - 512) = 0.191
The calculation above includes hankaku katakana (63), hiragana (83), and katakana (86).
4.6.4. Korean EUC-KR:
In EUC-KR encoding, the number of Chinese characters (hanja) actually used in typical Korean text is negligible. The 2350 Korean characters of this encoding are arranged by pronunciation. In the frequency table obtained after analyzing a large amount of Korean text data, the most frequently used characters are distributed evenly over the 2350 code points. With the same analysis, in the ideal case we get:
Distribution ratio = 0.98653 / (1 - 0.98653) = 73.24
For randomly generated Korean text, the ratio is
512 / (2350 - 512) = 0.279
4.6.5. Calculating the confidence level
Based on the preceding discussion of each language, we can define the confidence level for a data set as follows:
Confidence Detecting(InputText)
{
    for each multi-byte character in InputText
    {
        TotalCharacterCount++;
        if the character is among the 512 most frequent ones
            FrequentCharacterCount++;
    }
    Ratio = FrequentCharacterCount / (TotalCharacterCount - FrequentCharacterCount);
    Confidence = Ratio / CHARSET_RATIO;
    Return Confidence;
}
The confidence for each data set is defined as the distribution ratio of the input data divided by the ideal distribution ratio obtained from the analysis above.
4.7. The two-char sequence distribution method
For languages that use only a small set of characters, we need to go further than counting the occurrences of single characters; combinations of characters reveal more language/charset characteristics. We define a two-char sequence as two characters appearing next to each other in the input text, and in this case order matters. Because characters do not occur with equal frequency in a language, the two-char sequence distribution is heavily skewed toward the particular language/encoding. This characteristic can be used in language detection; it leads to better confidence in detecting a character encoding and is very useful for detecting single-byte encodings.
Take Russian as an example. We downloaded about 20 MB of Russian plain text and wrote a program to analyze it. The program found 21,199,528 two-char sequences in total. Some of the sequences found are irrelevant to our purpose, for example a space followed by a space; such sequences are considered noise, and their occurrences are not included in the analysis. In the data we use to detect Russian, 20,134,122 two-char sequences remain after this noise is removed, covering about 95% of the sequences found in the data. The sequences used to build our language model fall into 4096 different sequences, and 1961 of them appear fewer than 3 times in our 20,134,122 samples. We call these 1961 sequences the negative sequence set of this language. A sketch of how such a model might be built follows.
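As a rough illustration of how such a model might be built (my own sketch, not the Mozilla tooling): map each byte to its frequency order for the target charset, keep the 64 most frequent characters as the sample set, count ordered pairs in a 64 x 64 matrix, and mark the rarely seen pairs as the negative sequence set. The charToOrder table is an assumed input.

// Toy model builder for the two-char sequence method (illustrative only).
#include <string>

const int kSampleSize = 64;

struct SequenceModel {
    int  counts[kSampleSize][kSampleSize];    // ordered-pair counts
    bool negative[kSampleSize][kSampleSize];  // pairs seen fewer than 3 times
};

// charToOrder maps a raw byte to its frequency order for the target charset;
// orders >= kSampleSize fall outside the sample set and are skipped.
void BuildModel(const std::string& trainingText,
                const unsigned char charToOrder[256],
                SequenceModel& model)
{
    for (int i = 0; i < kSampleSize; ++i)
        for (int j = 0; j < kSampleSize; ++j)
            model.counts[i][j] = 0;

    int last = -1;                                     // previous in-sample character
    for (unsigned char byte : trainingText) {
        int order = charToOrder[byte];
        if (order >= kSampleSize) { last = -1; continue; }   // noise: reset the pair
        if (last >= 0) ++model.counts[last][order];    // count the ordered pair
        last = order;
    }
    for (int i = 0; i < kSampleSize; ++i)
        for (int j = 0; j < kSampleSize; ++j)
            model.negative[i][j] = model.counts[i][j] < 3;   // rare pairs are negative
}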
4.7.1. The confidence detection algorithm
For single-byte languages, we define the confidence level as follows:
Confidence Detecting(InputText)
{
    for each character in InputText
    {
        if the character is not a symbol or punctuation character
            TotalCharacters++;
        find its frequency order in the frequency table;
        if (frequency order < SampleSize)
        {
            FrequentCharCount++;
            if we do not have lastChar
            {
                lastChar = thisChar;
                continue;
            }
            if both lastChar and thisChar are within our sample range
            {
                TotalSequence++;
                if Sequence(lastChar, thisChar) belongs to NegativeSequenceSet
                    NegativeSequenceCount++;
            }
        }
    }
    Confidence = (TotalSequence - NegativeSequenceCount) / TotalSequence
                 * FrequentCharCount / TotalCharacters;
    Return Confidence;
}

Several points in this algorithm need explanation.

First, the sequence analysis is not performed on all characters. We could of course build a 256 x 256 matrix to cover all character sequences, but many of them are irrelevant to language/encoding analysis. Since most single-byte languages use only some 64 letters, the 64 most frequently used characters cover almost all the language-specific characters, so the matrix can be reduced to 64 x 64. We therefore use 64 as the sample size in this work. The 64 characters chosen to build our model are based on frequency statistics, with some adjustment; some characters, such as 0x0d and 0x0a, behave, from our point of view, very much like the space character (0x20) and were removed from the sample.

Second, among the sequences covered by the 64 x 64 matrix model, some are likewise irrelevant to detecting the language/encoding. Almost all single-byte language encodings include ASCII as a subset, and English words commonly appear in data of other languages, especially on web pages; the space-space sequence is obviously unrelated to any language encoding. All of these are considered noise data in our detection and are removed by filtering.

Finally, when computing the confidence, we also need to count the characters that fall inside and outside our sample range. If most of the characters of a small data sample fall within the sample range, the sequence distribution itself tends to return a high value, because negative sequences rarely occur in that case. After filtering, if the text really is in the expected encoding, most of the characters fed to the detector should fall within the sample range, so the confidence obtained by counting negative sequences needs to be adjusted by this proportion.

To summarize the preceding discussion:

Only a small subset of all the characters is used in verifying a charset. This keeps our model small, and it also improves detection accuracy by reducing noise sequences.
Each language model is generated by a script/tool, which handles Latin alphabet characters as follows:
If the Latin letters are not used in the language, letter-to-letter sequences are treated as noise sequences and removed from detection (for example, English words appearing in web pages of other languages).
If the Latin letters are used in the language, those sequences are kept for analysis.
The characters that fall into our sample range and those that do not are both counted, so that they can be used in computing the confidence.

5. Comparison of the three methods:

5.1. Coding scheme:

For many single-byte encodings, all possible code points are used fairly evenly; even for those encodings that do contain some unused code points, those unused code points are rarely used in other encodings either, which makes them unsuitable for encoding detection. For some multi-byte encodings, on the other hand, this method gives very good results and is very efficient. However, because some encodings, such as EUC-CN and EUC-KR, have almost identical code point ranges, it is very hard to distinguish between them with this method. Considering the fact that a browser usually does not have large amounts of text, we must use other methods to detect the encoding.

For 7-bit encodings such as ISO-2022-XX and HZ, which use easily recognizable escape or shift sequences, this method produces satisfactory results. To summarize the coding scheme method:

It is very well suited to 7-bit encodings such as ISO-2022-XX and HZ.
It is suited to some multi-byte encodings, such as Shift_JIS and EUC-JP, but not to others, such as EUC-CN and EUC-KR.
It is not very useful for single-byte encodings.
It can be applied to any kind of text.
It is fast and efficient.

5.2. Character distribution:

For multi-byte encodings, especially those that the coding scheme method cannot handle effectively, the character distribution method provides a lot of help while avoiding deep analysis of complex context. For single-byte encodings, because the amount of input data is usually small and there are so many possible encodings, this method is unlikely to give good results except in special cases; since the two-char sequence distribution method achieves good detection results in that situation, we do not rely on this method much for single-byte encodings. To summarize the character distribution method:

It is very well suited to multi-byte encodings.
It works only on typical text.
It is fast and efficient.

5.3. Two-char sequence distribution:

In the two-char sequence distribution method we can use more of the information in the data to detect the language/encoding, and good results are obtained even with very small data samples. But because sequences, rather than words (separated by spaces), are used, the matrix would become very large if the method were applied to multi-byte languages. Therefore this method:

Is very well suited to single-byte encodings.
Is not well suited to multi-byte encodings.
Gets good results even with small sample sizes.
Works only on typical text.

6. The composite approach:

6.1. Combining the three methods:

The languages/encodings handled by our charset auto-detector include both single-byte and multi-byte encodings. Given the definitions of the three methods above, none of them alone can produce satisfactory results, so we propose a composite approach to handle all of these encodings.

The two-char sequence distribution method is used to detect all the single-byte encodings.
The coding scheme method is used for the detection of UTF-8, ISO-2022-XX and HZ. For UTF-8, a small modification was made to the existing state machine: the UTF-8 detector declares success after several multi-byte sequences have been verified (see Martin Duerst (1997) for details).
The coding scheme method and the character distribution method work together on the major East Asian character encodings, such as GB2312, Big5, EUC-TW, EUC-KR, Shift_JIS and EUC-JP.
For the Japanese encodings Shift_JIS and EUC-JP, the two-char sequence distribution method can also be used, because they contain a large number of kana characters with distinctive features, and these kana behave much like the letters of a single-byte language; the two-char sequence distribution method gives accurate results even with very little text.

We tried both ways, one with the two-char distribution method and one without, and both achieved satisfactory results. Some websites contain many kanji and katakana characters but only a few hiragana; to obtain the best possible results, we use both the character distribution method and the two-char distribution method for Japanese encoding detection.

Here is an example of how these three detection methods work together. The top-level control module (for auto-detection) uses the following algorithm:

Charset AutoDetection(InputText)
{
    if (all characters in InputText are ASCII)
    {
        if InputText contains ESC or "~{"
        {
            call the ISO-2022 and HZ detectors with InputText;
            if one of them succeeds, return that charset, otherwise return ASCII;
        }
        else return ASCII;
    }
    else if (InputText starts with a BOM)
    {
        return UCS2;
    }
    else
    {
        call all multi-byte detectors and single-byte detectors;
        return the one with the best confidence;
    }
}

The sequence of steps in the code above can be summarized as follows:

A large proportion of websites still use ASCII. The top-level control algorithm starts with an ASCII verifier; if all characters are ASCII, there is no need to use any detector other than those for the ISO-2022-XX and HZ encodings.
The ISO-2022-XX and HZ detectors are invoked only after an ESC or "~{" is seen, and they are discarded immediately when an 8-bit byte is encountered.
When verifying the UCS2 encoding, we search for a BOM. We found some websites sending 0x00 inside the HTTP stream, so using that byte to identify UCS2 proved unreliable.
If any of the active detectors has received enough data and reaches high confidence, the whole auto-detection process terminates and that charset is returned as the result. We call this a shortcut.

6.2. Test results:

As a test of the approach recommended in this paper, we applied our detector to 100 popular websites that provide no charset information, either in the document or sent by the server. For all the encodings covered by our detector we obtained 100% accuracy. For example, when visiting a website that provides no charset information (for example, the server at http://www.yahoo.co.jp before it began sending charset information), our charset detector produced the following output:

[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence 0.950000
[GB2312] detector has confidence 0.150852
[EUCKR] is inactive
[Big5] detector has confidence 0.129412
[EUCTW] is inactive
[Windows-1251] detector has confidence 0.010000
[KOI8-R] detector has confidence 0.010000
[ISO-8859-5] detector has confidence 0.010000
[x-mac-cyrillic] detector has confidence 0.010000
[IBM866] detector has confidence 0.010000
[IBM855] detector has confidence 0.010000

EUC-JP is therefore the most likely encoding for this site.

7. Conclusion:

In our environment, detecting the language/encoding by using the coding scheme, character distribution, and two-char sequence distribution methods together has proven very effective. We covered Unicode encodings, multi-byte encodings and single-byte encodings; these are representative of the encodings used in digital text on today's Internet. We have reason to believe that, by extension, this approach can be made to cover the remaining encodings not included in this paper.

Although at present only the encoding information is what we need from the detection, in most cases the language can be identified as well; in fact, the character distribution and two-char sequence distribution methods both rely on the characteristic distribution patterns of the different languages' character sets.
Only in the case of UTF-16 and UTF-8 is the encoding identified while the language remains unknown; even in that case, our work can be extended later to cover the detection of language information as well.
The three detection methods described here have been implemented in Netscape 6.1 PR1 and in subsequent versions as the "Detect All" option. We hope our work can free users from the trouble of operating the character encoding menu. The character encoding menu (and other forms of encoding menu) differs from the other interface elements of an Internet client in that it exposes part of the internationalization backend to ordinary users; its very existence reflects how messy today's web pages are in their mix of languages and encodings.
We hope that providing good default encodings together with universal automatic detection can help users avoid most encoding problems in their online dealings. Web standards are moving toward Unicode, and in particular toward UTF-8, as the default encoding, and we expect its use on the Web to grow gradually. Because automatic detection is used, this transition can happen quietly, and more and more users will be freed from facing encoding issues when reading or sending messages. This is why we advocate that Internet clients use good automatic detection methods and good default encoding settings.
8. Future work:
Our automatic detection is designed to identify languages; determining the encoding is a by-product of that identification. In the current work we use Russian as the single-byte example. Since the detector identifies a language together with the encodings used for that language, the more language data we have, the higher the quality of the encoding detection. To add other single-byte languages/encodings, we need a large amount of sampled text data in each language as well as a certain depth of language knowledge/analysis. We currently use a script to generate a language model for all the encodings of one language.
At present our work has not yet appeared in the Mozilla source code, but we hope it will become open in the future. We hope people will contribute to this area. Because we have not tested many single-byte encodings, we expect that the models we offer will need tuning, modification or even redesign when other languages/encodings are handled.
9. References
Duerst, Martin. 1997. The Properties and Promises of UTF-8. 11th International Unicode Conference. http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/IUC11-UTF-8.pdf
Mandarin Promotion Council, Taiwan. Annual survey results of Traditional Chinese character usage. http://www.edu.tw/mandr/result/87news/index1.htm
Mozilla Internationalization Projects. http://www.mozilla.org/projects/intl
Mozilla.org. http://www.mozilla.org/
Mozilla source viewing. http://lxr.mozilla.org/
Copyright 1998-2003 The Mozilla Organization. Last modified November 26, 2002.
Published by WLFJCK
21:26
|
Reply (11)
October 01, 2003
A C language problem
First, let us look at the following program:
#include <stdio.h>

char *xyzstring = "hello, world!";

int main(void)
{
    xyzstring[0] = 'a';
    printf("%s\n", xyzstring);
    return 0;
}
Does this program have a problem? Yes: xyzstring points to string-literal data that should be treated as static, read-only data and must not be modified; the C standard says as much. Sure enough, let us run a test and see what actually happens with four compilers. After compilation with CodeWarrior 8.0, Visual C++ 6.0 and Borland C++ 5.6 there is no problem; all of them print "aello, world!". The program compiled with GCC 3.2 crashes when run.
Why does this happen? The answer is simple: GCC puts "hello, world!" into the .text segment, while the other three compilers put "hello, world!" into the .data segment. The .text segment is used to store code and therefore cannot be modified, so when we change its contents through xyzstring[0], the program crashes; the .data segment is a data segment and can be modified, so the program does not crash.
To verify this, we use GCC to compile the source into test.s by running the command gcc -S test.c:
	.file	"test.c"
.globl _xyzstring
	.text
LC0:
	.ascii "hello, world!\0"
	.data
	.align 4
_xyzstring:
	.long	LC0
	.def	___main;	.scl	2;	.type	32;	.endef
	.text
LC1:
	.ascii "%s\12\0"
	.align 2
.globl _main
	.def	_main;	.scl	2;	.type	32;	.endef
_main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	andl	$-16, %esp
	movl	$0, %eax
	movl	%eax, -4(%ebp)
	movl	-4(%ebp), %eax
	call	__alloca
	call	___main
	movl	_xyzstring, %eax
	movb	$97, (%eax)
	subl	$8, %esp
	pushl	_xyzstring
	pushl	$LC1
	call	_printf
	addl	$16, %esp
	movl	$0, %eax
	leave
	ret
	.def	_printf;	.scl	2;	.type	32;	.endef
The above is the output. Near the top of test.s we see the two lines
.globl _xyzstring
	.text
which show that the string constant (LC0) that _xyzstring points to is placed in the .text segment. Now change them to
.globl _xyzstring
	.data
and then build with gcc -o test.exe test.s and run test.exe. What happens? Ha, it prints "aello, world!" and there is no problem at all.
The above is a way to work around this kind of code, but for the sake of compatibility we should never write code like this in the first place.
Published by WLFJCK
14:34
|
Reply (26)