Why does Maven give me different UTF-8 characters than Eclipse (test runs in Eclipse, fails in Maven)?

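My current project is about parsing natural language. One test reads text from a file, removes certain characters, and tokenizes the text into single words. The test then compares the number of unique words. In Eclipse this test is green; in Maven I get more words than expected. Comparing the word lists, I see the following additional words: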

acquirer⊙s
card⊙s
institution⊙s
issuer⊙s
provider⊙s
PSAM ⊙s
⊜from⊝
⊜slot⊝
⊜to⊝

Looking at the source text, it contains the following characters, which I want to filter out: ‘’’

This works in Eclipse, but not in Maven. I am using UTF-8. The file appears to be correctly encoded, and in the Maven POM I specify the following:

<properties> 
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> 
     <org.apache.lucene.version>3.6.0</org.apache.lucene.version> 
</properties>

Edit: Below is the code that reads the file (which, according to Eclipse, is encoded as UTF-8).

 BufferedReader reader = new BufferedReader(
       new FileReader(this.file)); 
     String line = ""; 
     while ((line = reader.readLine()) != null) { 
      // the csv contains a text and a classification 
      String[] reqCatType = line.split(";"); 
      String reqText = reqCatType[0].trim(); 
      String reqCategory = reqCatType[1].trim(); 
      // the tokenizer also removes unwanted characters: 
      String[] sentence = this.filter.filterStopWords(this.tokenizer 
        .tokenize(reqText)); 
      // we use this data to train a machine learning algorithm 
      this.dataSet.learn(sentence, reqCategory); 
     } 
     reader.close();

Edit: The following information may be useful for analyzing the problem:

mvn -v 
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800) 
Maven home: /usr/share/maven 
Java version: 1.6.0_33, vendor: Apple Inc. 
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home 
Default locale: en_US, platform encoding: MacRoman 
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"

2012-09-04 oerich

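Show the code where you read the file. – jtahlborn

Maybe http://maven.apache.org/plugins/maven-resources-plugin/examples/encoding.html will help? – afk5min

Thanks for the suggestion, @afk5min, but if I applied it correctly, it does not solve the problem. I added the maven-resources-plugin and the configuration from the example, but nothing changed. As before, mvn install prints, among other messages, the following: "[INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] Copying 10 resources". Why do you think this would help? – oerich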

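So, your data file is UTF-8. The Eclipse setting on that file has no bearing here, because it is the running Java program that decides how the bytes are interpreted.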

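FileReader always uses the platform default encoding, which is generally a bad idea. Eclipse is probably setting the "platform default" for you, while Maven is not. Your mvn -v output reports MacRoman as the platform encoding, so under Maven the UTF-8 bytes are decoded as MacRoman.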
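To check which default applies in each environment, a small diagnostic along these lines could be run once from Eclipse and once from Maven. This is only a sketch: the class name `DefaultCharsetCheck` and its helper method are hypothetical, and ISO-8859-1 stands in for a non-UTF-8 platform encoding such as MacRoman because it is guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Decodes the UTF-8 bytes of `text` with `charset`, which is what a reader
    // does when it applies the wrong encoding to a UTF-8 file.
    public static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        // These two values are what FileReader implicitly relies on.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // "card" + RIGHT SINGLE QUOTATION MARK (U+2019) + "s": 6 characters, 8 UTF-8 bytes.
        String word = "card\u2019s";
        String misread = decodeUtf8BytesAs(word, Charset.forName("ISO-8859-1"));

        // A single-byte charset turns the 3-byte quote into 3 separate characters,
        // so the tokenizer no longer recognises it as the character to strip.
        System.out.println("original length: " + word.length());
        System.out.println("misread length:  " + misread.length());
    }
}
```

If the two runs print different default charsets, that difference alone explains the extra "words" in the Maven build.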

Fix your code to specify the encoding.

See the Javadoc:

To specify these values yourself, construct an InputStreamReader on a FileInputStream.
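As a minimal sketch of what the Javadoc suggests, the reading loop could look roughly like this. The class name `Utf8LineReader` and the method `readLines` are hypothetical and not part of the question's code; the only point is that the charset is passed explicitly, so the result no longer depends on the platform default.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class Utf8LineReader {

    // Reads all lines of `file`, always decoding the bytes as UTF-8.
    public static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), Charset.forName("UTF-8")));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Closing the outermost reader also closes the underlying stream.
            reader.close();
        }
    }
}
```

With this in place, the test should see the same characters whether it is started from Eclipse or from Maven, since neither tool's default encoding is consulted.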

2012-09-05 05:41:47

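Thanks, that was the solution. Of course, I also had to change the part that reads the unwanted characters. The BufferedReader is now created as: `BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));` For input files, I should implement automatic encoding detection as described here: [link](http://docs.oracle.com/javase/tutorial/essential/io/file.html). I hate being fooled by smart tools. – oerich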
