Tokens are not being built correctly

I am trying to tokenize a text file using the code below:

String fileContent = "";
String[] fileContentTokens;
try {
    // Read the whole file into a single string.
    fileContent = new Scanner(new File(fname)).useDelimiter("\\Z").next();
} catch (Exception ex) {
    System.out.println(ex.getMessage());
}

// Surround each punctuation mark with spaces, then split on spaces.
fileContent = fileContent.replaceAll("\\s*([,.?!\"'()-:*;])\\s*", " $1 ");
//System.out.println(fileContent);
fileContentTokens = fileContent.split(" ");

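The problem is that the tokens are not formed correctly: some words still have quotation marks attached to them, and some still have apostrophes. The code above should put whitespace around every punctuation mark so that it is no longer attached to the word. For example, "this is cool" should become " this is cool ". For some reason that is not what happens. It works for some words but not for all of them.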

2012-12-09 Haseeb

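I am not printing it out in the code as posted; the print statement is there, commented out. That is how I know there is a problem. – Haseeb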

Your regex works for me. Please give more context and examples. – SergeyS

Here is the change I made, but the same thing happens: fileContent = fileContent.replaceAll("\\s*([,.?!\'()-:*;])\\s*", " $1 "); fileContent = Matcher.quoteReplacement(fileContent); System.out.println(fileContent); – Haseeb

Your string contains a different type of apostrophe in the places where it goes wrong.

Karachi’s Manghopir area , DawnNews reported on Saturday . The

In that string you have ’ but in your regex you have '

These are different characters. Add the former apostrophe to your regex as well and it works:

fileContent = fileContent.replaceAll("\\s*([,.?!\"'’()-:*;])\\s*", " $1 ");
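As a minimal sketch of the same idea, the class below lists both apostrophes in the character class and splits on runs of whitespace. The class and method names are hypothetical and not from the original post. One further assumption is labeled in the comments: the hyphen is moved to the end of the character class so that it is a literal hyphen, because `)-:` in the middle of a class is read as a range that also covers `+`, `/` and the digits 0 to 9.

```java
import java.util.Arrays;

public class PunctuationSpacing {
    // Hypothetical helper, not from the original post.
    // The class contains both the ASCII apostrophe (U+0027) and the
    // typographic apostrophe (U+2019). The hyphen is placed last so it is
    // a literal '-' instead of forming the range ")-:".
    static String[] tokenize(String text) {
        String spaced = text.replaceAll("\\s*([,.?!\"'\u2019():*;-])\\s*", " $1 ");
        // trim() and "\\s+" avoid empty tokens at the edges and between marks.
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "Karachi\u2019s Manghopir area , DawnNews reported on Saturday .";
        System.out.println(Arrays.toString(tokenize(sample)));
    }
}
```

With this pattern the typographic apostrophe in the sample becomes a token of its own, between "Karachi" and "s".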

2012-12-09 14:12:34 SergeyS

Thanks. By the way, how did you get the other apostrophe? I can't seem to find it :/ – Haseeb

I just copied it from the string you provided :) Its Unicode value is 8217; see more information here: http://www.fileformat.info/info/unicode/char/2019/index.htm – SergeyS
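If copying the character is not convenient, its code point can also be checked in Java directly. The following is a small sketch with a hypothetical helper name; it only uses `String.codePointAt` and `String.format`.

```java
public class ApostropheCodePoints {
    // Hypothetical helper: reports the first code point of a string
    // in decimal and in U+XXXX notation.
    static String describe(String ch) {
        int cp = ch.codePointAt(0);
        return cp + " (U+" + String.format("%04X", cp) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("'"));       // 39 (U+0027)
        System.out.println(describe("\u2019"));  // 8217 (U+2019)
    }
}
```

Writing the character as `\u2019` in the source also avoids depending on the encoding of the source file.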

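From the Java API: Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.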
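To illustrate what the quoted passage describes, here is a minimal sketch in which the replacement text is meant literally. The class name, method name and sample strings are made up for illustration. Note that `Matcher.quoteReplacement` is applied to the replacement argument, not to the result of `replaceAll`; in the question's call, `" $1 "` is deliberately a group reference, so quoting it there would change its meaning.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Hypothetical helper: replaces every "PRICE" with a literal string.
    // Without quoteReplacement, "$1" in the replacement would be read as a
    // reference to capture group 1, which this pattern does not have.
    static String fillPrice(String template, String literal) {
        return template.replaceAll("PRICE", Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: PRICE", "$1.50"));  // Total: $1.50
    }
}
```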

2012-12-09 13:49:49 OmniOwl

Here is the change I made, but the same thing happens: fileContent = fileContent.replaceAll("\\s*([,.?!\'('-):*;])\\s*", " $1 "); fileContent = Matcher.quoteReplacement(fileContent); System.out.println(fileContent); – Haseeb
