public Reader getReader(InputStream is, String encoding) throws IOException, UnsupportedEncodingException {
    PushbackInputStream pis = new PushbackInputStream(is, 1024);
    String bomEncoding = getBOMEncoding(pis);
    Reader input;
    if (bomEncoding == null) {
        // No BOM found: fall back to the encoding supplied by the caller.
        input = new BufferedReader(new InputStreamReader(pis, encoding));
    } else {
        input = new BufferedReader(new InputStreamReader(pis, bomEncoding));
    }
    return input;
}

// UTF_16BE, UTF_16LE and UTF_8 are assumed to be String constants defined
// elsewhere in the class, holding the charset names "UTF-16BE", "UTF-16LE" and "UTF-8".
protected String getBOMEncoding(PushbackInputStream is) throws IOException {
    String encoding = null;
    int[] bytes = new int[3];
    bytes[0] = is.read();
    bytes[1] = is.read();
    bytes[2] = is.read();
    if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
        encoding = UTF_16BE;
        // The third byte is not part of the two-byte BOM; push it back unless it was EOF.
        if (bytes[2] != -1) {
            is.unread(bytes[2]);
        }
    } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
        encoding = UTF_16LE;
        if (bytes[2] != -1) {
            is.unread(bytes[2]);
        }
    } else if (bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        encoding = UTF_8;
    } else {
        // No BOM: push everything back in reverse order, skipping EOF markers.
        for (int i = bytes.length - 1; i >= 0; i--) {
            if (bytes[i] != -1) {
                is.unread(bytes[i]);
            }
        }
    }
    return encoding;
}

Byte Order Mark (BOM) FAQ
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]
Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format. It can also serve as a hint indicating that the file is in Unicode, as opposed to a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used. [MD] & [AF]
Q: What does 'endian' mean?
A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged in the same byte order as it was in the memory of the originating system, it may appear to be in the wrong byte order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the data stream as UTF-8. [AF]
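To make the byte-order point concrete, here is a minimal Java sketch. The class name EndianDemo and its two helper methods are made up for illustration; only java.nio.ByteBuffer and java.nio.ByteOrder come from the JDK. It writes U+FEFF in big-endian order and reads it back in both orders. A receiver that assumes the wrong order sees 0xFFFE and can swap the bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** Serializes one 16-bit code unit in the given byte order. */
    static byte[] write(char unit, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(unit).array();
    }

    /** Reads one 16-bit code unit, assuming the given byte order. */
    static char read(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getChar();
    }

    public static void main(String[] args) {
        // The sender stores U+FEFF big-endian, i.e. as the bytes FE FF.
        byte[] sent = write('\uFEFF', ByteOrder.BIG_ENDIAN);

        char sameOrder = read(sent, ByteOrder.BIG_ENDIAN);
        char wrongOrder = read(sent, ByteOrder.LITTLE_ENDIAN);
        System.out.printf("same order:  U+%04X%n", (int) sameOrder);   // U+FEFF
        System.out.printf("wrong order: U+%04X%n", (int) wrongOrder);  // U+FFFE

        // Seeing the noncharacter U+FFFE tells the receiver to reverse the bytes.
        if (wrongOrder == '\uFFFE') {
            char fixed = Character.reverseBytes(wrongOrder);
            System.out.printf("after swap:  U+%04X%n", (int) fixed);   // U+FEFF
        }
    }
}
```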
Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:
Bytes          Encoding form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
[MD]
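The table above can be turned into a small lookup. The following is only a sketch: the class name BomTable and its methods are hypothetical, and it works on a byte array rather than a stream. One detail matters: the UTF-32 little-endian signature FF FE 00 00 begins with the UTF-16 little-endian signature FF FE, so the longer signatures are tested first. This means UTF-16LE text that starts with a BOM followed by U+0000 would be reported as UTF-32LE, which this sketch accepts as a trade-off.

```java
class BomTable {
    // Longest signatures first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
    private static final int[][] SIGNATURES = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    // Encoding form for the signature at the same index in SIGNATURES.
    private static final String[] FORMS = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    /** Returns the encoding form signalled by a leading BOM, or null if none matches. */
    static String detect(byte[] data) {
        for (int s = 0; s < SIGNATURES.length; s++) {
            if (startsWith(data, SIGNATURES[s])) {
                return FORMS[s];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00};
        byte[] plain = {0x68, 0x69};
        System.out.println(detect(utf8));    // UTF-8
        System.out.println(detect(utf16le)); // UTF-16LE
        System.out.println(detect(plain));   // null
    }
}
```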
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
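The "#!" problem is easy to reproduce. In the sketch below (the class ShebangCheck and its methods are hypothetical names), a script saved with a UTF-8 BOM no longer begins with the two ASCII bytes a Unix kernel looks for, and stripping the three signature bytes restores them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ShebangCheck {
    /** True if the content starts with the two ASCII bytes '#' and '!'. */
    static boolean startsWithShebang(byte[] content) {
        return content.length >= 2 && content[0] == '#' && content[1] == '!';
    }

    /** Returns the content without a leading UTF-8 BOM (EF BB BF), or the same array if there is none. */
    static byte[] stripUtf8Bom(byte[] content) {
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(content, 3, content.length);
        }
        return content;
    }

    public static void main(String[] args) {
        // A shell script whose text starts with U+FEFF, encoded as UTF-8.
        byte[] withBom = "\uFEFF#!/bin/sh\necho hi\n".getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWithShebang(withBom));               // false
        System.out.println(startsWithShebang(stripUtf8Bom(withBom))); // true
    }
}
```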
Q: What should I do with U+FEFF in the middle of a file?
A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of the file can be ignored, or treated as an error. [AF]
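For a protocol that restricts U+FEFF to the BOM role, the two options in the answer above (ignore or reject) might look like the following sketch. The class InteriorFeff and its method names are hypothetical. It works on already decoded text, where a surviving BOM appears as the character U+FEFF at index 0.

```java
class InteriorFeff {
    /** Removes a single leading U+FEFF, which is acting as a signature. */
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** Policy "ignore": silently drop every remaining U+FEFF. */
    static String ignoreInterior(String body) {
        return body.replace("\uFEFF", "");
    }

    /** Policy "error": refuse text that still contains U+FEFF. */
    static void rejectInterior(String body) {
        int at = body.indexOf('\uFEFF');
        if (at >= 0) {
            throw new IllegalArgumentException("U+FEFF at index " + at);
        }
    }

    public static void main(String[] args) {
        String decoded = "\uFEFFab\uFEFFcd";          // BOM, then one stray U+FEFF
        String body = stripLeadingBom(decoded);       // "ab\uFEFFcd"
        System.out.println(ignoreInterior(body));     // abcd
        try {
            rejectInterior(body);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());       // U+FEFF at index 2
        }
    }
}
```

Outside such a protocol, the interior character is content (ZWNBSP) and should be left alone.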
Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP?
A: Use U+2060 WORD JOINER instead. [MD]
Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16. [MD]
Q: Why wouldn't I always use a protocol that requires a BOM?
A: Where the data is typed, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.
Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM). [MD]
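Java's standard charsets follow the same tagging rules, which makes them a convenient way to see the difference. In this sketch (the class Utf16TagDemo is a made-up name; the charset behaviour is as documented for java.nio.charset.Charset), the same four bytes decode to one character under the UTF-16 tag and to two under UTF-16BE, where the leading U+FEFF is kept as content. Encoding shows the binary-equality problem: the same text yields different bytes depending on whether a BOM is written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16TagDemo {
    // FE FF (a big-endian BOM) followed by 00 41 ("A" in big-endian UTF-16).
    static final byte[] DATA = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};

    /** Tagged as plain UTF-16: the BOM selects the byte order and is consumed. */
    static String decodeAsUtf16() {
        return new String(DATA, StandardCharsets.UTF_16);
    }

    /** Tagged as UTF-16BE: the byte order is already known, so FE FF is content (ZWNBSP). */
    static String decodeAsUtf16BE() {
        return new String(DATA, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf16().length());   // 1
        System.out.println(decodeAsUtf16BE().length()); // 2

        // Same content, different bytes: UTF-16 writes a BOM, UTF-16BE does not.
        byte[] withBom = "A".getBytes(StandardCharsets.UTF_16);      // FE FF 00 41
        byte[] withoutBom = "A".getBytes(StandardCharsets.UTF_16BE); // 00 41
        System.out.println(Arrays.equals(withBom, withoutBom));      // false
    }
}
```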
Q: How should I deal with BOMs?
A: Here are some guidelines to follow:
- A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
- Some protocols allow optional BOMs in the case of untagged text. In those cases:
  - Where a text data stream is known to be plain text, but of unknown encoding, a BOM can be used as a signature. If there is no BOM, the encoding could be anything.
  - Where a text data stream is known to be plain Unicode text (but not which endian), then a BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
- Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
- Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. [AF] & [MD]
http://www.unicode.org/faq/utf_bom.html#22
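The guidelines above can be read as a small decision procedure. The sketch below is one possible reading, not a reference implementation: the class BomPolicy, the method resolve and the convention that a null tag means "untagged text" are all assumptions made for illustration, and only the UTF-8 and UTF-16 signatures are checked to keep it short. It requires Java 9 or later for Set.of.

```java
import java.util.Set;

class BomPolicy {
    // Tags that state the byte order outright; a BOM must not be used with these.
    private static final Set<String> EXPLICIT =
            Set.of("UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE");

    /**
     * Picks the encoding form for a stream given its declared tag (null if untagged)
     * and its first bytes. Returns null when nothing can be concluded.
     */
    static String resolve(String declared, byte[] head) {
        if (declared != null && EXPLICIT.contains(declared)) {
            // Precise type known: no BOM is expected, a leading U+FEFF is content.
            return declared;
        }
        String fromBom = sniff(head);
        if ("UTF-16".equals(declared)) {
            // Unicode text of unknown endianness: use the BOM, else assume big-endian.
            return (fromBom != null && fromBom.startsWith("UTF-16")) ? fromBom : "UTF-16BE";
        }
        if (declared == null) {
            // Untagged plain text: the BOM is the only signature; without it, anything goes.
            return fromBom;
        }
        return declared;
    }

    /** Recognizes the UTF-8 and UTF-16 signatures only. */
    private static String sniff(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] leBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] noBom = {0x00, 0x41};
        System.out.println(resolve("UTF-16", leBom));   // UTF-16LE
        System.out.println(resolve("UTF-16", noBom));   // UTF-16BE
        System.out.println(resolve("UTF-16BE", leBom)); // UTF-16BE
        System.out.println(resolve(null, noBom));       // null
    }
}
```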