
Reading and Writing UTF-8 Text with a BOM in Java

 

What is a BOM

 

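A BOM (byte-order mark) is a special marker inserted at the start of a Unicode file encoded as UTF-8, UTF-16, or UTF-32, and it is used to identify the file's encoding. For UTF-8 the BOM is not required: UTF-8 is read byte by byte and has no byte-order ambiguity, whereas the BOM's main purpose is to mark the encoding and the byte order (big-endian or little-endian) of files whose code units span several bytes.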

 

BOM byte sequences at the start of a file:

   00 00 FE FF    = UTF-32, big-endian

   FF FE 00 00    = UTF-32, little-endian

   EF BB BF       = UTF-8

   FE FF          = UTF-16, big-endian

   FF FE          = UTF-16, little-endian
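
The table above can be turned into a small lookup routine. The following is a minimal sketch, not part of the original article: the class name `BomDetector` and its method are hypothetical, and the returned strings are simply the charset names that correspond to each row. Note that the four-byte UTF-32 marks have to be tested before the two-byte UTF-16 marks, because `FF FE 00 00` also begins with `FF FE`.

```java
final class BomDetector {
    // Returns the charset name implied by a leading BOM, or null if no BOM is found.
    static String detect(byte[] head) {
        // 4-byte marks first: FF FE 00 00 would otherwise match the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data with prefix, treating bytes as unsigned.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One caveat of this ordering: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32 little-endian. That case is rare in text files, but the sketch does not try to resolve it.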

 

 

The following example handles the BOM of a UTF-8 file:

 

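// Read the whole file as UTF-8; the BOM survives decoding as a leading U+FEFF
String xmla = StringFileToolkit.file2String(new File("D:\\projects\\mailpost\\src\\a.xml"), "UTF-8");

// Re-encode as UTF-8; the BOM becomes the three bytes EF BB BF again
byte[] b = xmla.getBytes("UTF-8");

// Drop the first three bytes (this assumes the file really starts with a BOM)
String xml = new String(b, 3, b.length - 3, "UTF-8");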

 

..............

 

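The idea: read the file as UTF-8, skip the first three bytes (the encoded BOM), and build a new string from the rest. Dom4j can then parse the result without reporting an error.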
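
A variation on this idea avoids the round trip through bytes. Java's UTF-8 decoder keeps the BOM, so after decoding it shows up as the single character U+FEFF at the start of the string. The sketch below (the class name `BomStripper` is hypothetical, not from the original article) removes that character only when it is actually present, so a file without a BOM is left untouched.

```java
final class BomStripper {
    // Removes one leading U+FEFF, if present; otherwise returns the text unchanged.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }
}
```

With the variables from the example above, this would be called as `String xml = BomStripper.stripBom(xmla);`.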

 

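Other encodings can be handled along the same lines. One could in fact write a general-purpose tool that detects the BOM automatically, strips it, and returns the string.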
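
Such a general-purpose tool might look roughly like the following. This is only a sketch under stated assumptions: the class `BomAwareDecoder` and its `decode` method are hypothetical names, the whole file is assumed to fit in memory as a byte array, and the UTF-32 charset names are assumed to be available in the running JVM (they are not among the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;

final class BomAwareDecoder {
    // Known BOMs; the 4-byte UTF-32 marks come before the 2-byte UTF-16 marks
    // because FF FE 00 00 shares its prefix with FF FE.
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE",
    };

    // Decodes data with the charset named by its BOM and drops the BOM bytes;
    // falls back to the given charset when no BOM is present.
    static String decode(byte[] data, Charset fallback) {
        for (int i = 0; i < BOMS.length; i++) {
            if (hasPrefix(data, BOMS[i])) {
                int skip = BOMS[i].length;
                return new String(data, skip, data.length - skip, Charset.forName(NAMES[i]));
            }
        }
        return new String(data, fallback);
    }

    private static boolean hasPrefix(byte[] data, int[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A typical call would be `BomAwareDecoder.decode(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)`, where `path` is whatever file is being read.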

 

That said, someone has already solved this problem: http://koti.mbnet.fi/akini/java/unicodereader/

 

Java code

Example code using the UnicodeReader class

Here is an example method that reads a text file. It recognizes the BOM and skips it while reading.

 

//import http://koti.mbnet.fi/akini/java/unicodereader/UnicodeReader.java.txt

   public static char[] loadFile(String file) throws IOException {
      // Read a text file; detect the encoding from the BOM, or use the
      // system default (defaultEnc == null) if no BOM is found.
      char[] buffer = new char[16 * 1024];   // 16k buffer
      int read;

      // try-with-resources closes the reader chain and the writer even if
      // reading fails, without swallowing exceptions.
      try (FileInputStream in = new FileInputStream(file);
           BufferedReader reader = new BufferedReader(new UnicodeReader(in, null));
           CharArrayWriter writer = new CharArrayWriter()) {
         while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
         }
         return writer.toCharArray();
      }
   }

 

Java code

Example code to write UTF-8 with a BOM

Write the BOM bytes at the start of an empty file, and text editors that honor the BOM will pick the correct charset when reading it. Java's OutputStreamWriter does not write the UTF-8 BOM bytes itself.

 

 

   public static void saveFile(String file, String data, boolean append) throws IOException {
      File f = new File(file);

      // try-with-resources flushes and closes both streams; a failure while
      // flushing is reported to the caller instead of being silently dropped.
      try (FileOutputStream fos = new FileOutputStream(f, append);
           BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"))) {
         // write the UTF-8 BOM if the file is empty (nothing has gone
         // through bw yet, so the BOM is the first thing in the file)
         if (f.length() < 1) {
            final byte[] bom = new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF };
            fos.write(bom);
         }

         if (data != null) bw.write(data);
      }
   }
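
The statement above that OutputStreamWriter does not emit a UTF-8 BOM can be checked in memory, without touching the file system. The following is a small sketch for that purpose only; the class name `BomWriteCheck` and its `encode` method are hypothetical and not part of the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class BomWriteCheck {
    // Encodes text as UTF-8 into a byte array, optionally preceded by the BOM bytes.
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (withBom) {
                // The BOM has to be written by hand, before any text.
                out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            }
            try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                w.write(text);
            }
        } catch (IOException ex) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity.
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }
}
```

Calling `BomWriteCheck.encode("a", false)` yields a single byte, while `BomWriteCheck.encode("a", true)` yields four bytes, the first three being the BOM.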

  

 

 

Practical application:

Java code

package com.dayo.gerber; 

 

import java.io.BufferedReader; 

import java.io.BufferedWriter; 

import java.io.File; 

import java.io.FileInputStream; 

import java.io.FileOutputStream; 

import java.io.IOException; 

import java.io.InputStream; 

import java.io.InputStreamReader; 

import java.io.OutputStreamWriter; 

import java.io.Reader; 

import java.util.Properties; 

 

/**

 * 

 * @author 刘飞(liufei)

 * 

 */ 

public class Generate4YYQTPScript { 

 

    private static final String ENCODING = "UTF-8"; 

    private static final String GERBER_CONFIG = "config/gerber4yy.properties"; 

 

    private static Properties GERBER_CONFIG_PROPS = null; 

    private static final String GERBER_FORMAT_DIALOG_TITLE_SCRIPT = "{#GERBER_FORMAT_DIALOG_TITLE}"; 

    private static String GERBER_FORMAT_DIALOG_TITLE = ""; 

 

    /* gerber properties parameter keys config */

    private static final String QTP_SCRIPT_IN = "script.in"; 

 

    private static final String QTP_SCRIPT_OUT = "script.out"; 

 

    private static final String QTP_SYSTEM_PATH = "QTP.system.path"; 

    private static final String QTP_SYSTEM_PATH_S
