Java: Reading a UTF-8 file page by page using FileInputStream

I need some code that lets me read a UTF-8 file one page at a time.

This is the code I have used so far:

File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
in.close();

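Once I wrap it in a try/catch block it runs, but it prints the whole file! Is there a way to modify this code so that it shows only one page of text at a time? The file is in UTF-8 format, and when I view it in Notepad++ I can see that it contains FF characters to mark the start of the next page.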


2014-07-02 Steve

Use a 'Scanner' and set the 'delimiter' to '\u000C'.

Thanks, Boris. How do I get Scanner to read a UTF-8 file? I thought the only way was to use an InputStreamReader? – Steve
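On the question of making Scanner read UTF-8: Scanner has constructors that take a charset name as a second argument, so no separate InputStreamReader is needed. The following is a minimal sketch under that assumption; the class name `ScannerPages` and the method `readPages` are made up for illustration and do not come from this thread.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerPages {
    // Decodes the stream as UTF-8 and splits it into pages at each form feed (U+000C).
    static List<String> readPages(InputStream source) {
        List<String> pages = new ArrayList<>();
        // The second constructor argument names the charset used for decoding.
        try (Scanner scanner = new Scanner(source, "UTF-8")) {
            scanner.useDelimiter("\f");
            while (scanner.hasNext()) {
                pages.add(scanner.next());
            }
        }
        return pages;
    }
}
```

It could be called as `ScannerPages.readPages(new FileInputStream(fileDir))`. `new Scanner(new File("a.txt"), "UTF-8")` accepts the charset name in the same way if you would rather pass a File.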

You need to look for the form feed character by comparing each character against 0x0C.

For example:

int c = in.read(); // read() returns an int so that -1 can signal end of stream
while (c != -1) {
    if (c == 0x0C) {
        // form feed
    } else {
        // handle displayable character
    }

    c = in.read();
}
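To make the loop above concrete, here is a hedged sketch that collects the text between form feeds into a list of pages. `FormFeedPager` and `readPages` are illustrative names, and keeping every page in memory is an assumption that suits small files only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FormFeedPager {
    // Decodes the stream as UTF-8 and starts a new page at each form feed (0x0C).
    static List<String> readPages(InputStream source) throws IOException {
        List<String> pages = new ArrayList<>();
        StringBuilder page = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            int c; // int, not char, so that -1 (end of stream) stays distinguishable
            while ((c = in.read()) != -1) {
                if (c == 0x0C) {
                    pages.add(page.toString());
                    page.setLength(0);
                } else {
                    page.append((char) c);
                }
            }
        }
        pages.add(page.toString()); // text after the last form feed
        return pages;
    }
}
```

With the question's setup this would be called as `FormFeedPager.readPages(new FileInputStream(fileDir))`. Note that a file ending in a form feed yields an empty final page.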

EDIT: added an example using Scanner, as suggested by Boris

// Name the charset explicitly; otherwise Scanner uses the platform default
Scanner s = new Scanner(new File("a.txt"), "UTF-8").useDelimiter("\u000C");
while (s.hasNext()) {
    String str = s.next();

    System.out.println(str);
}
s.close();


2014-07-02 15:33:41

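That doesn't sound like much fun. What's wrong with Scanner?

Fair point. I was showing the natural extension of the approach already in use. Scanner is the more natural fit, so I have added an example of that too.

Cheers! Scanner worked a treat – Steve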

I would suggest using a regular expression to detect the form feed (page break) character. Try something like this:

File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
Regex pageBreak = new Regex("(^.*)(\f)(.*$)");
while ((str = in.readLine()) != null) {
    Match match = pageBreak.Match(str);
    bool pageBreakFound = match.Success;
    if (pageBreakFound) {
        String textBeforeLineBreak = match.Groups[1].Value;
        // Groups[2] will contain the form feed character
        // Groups[3] will contain the text after the form feed character
        // Do whatever logic you want now that you know you hit a page boundary
    }
    System.out.println(str);
}

in.close();

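The parentheses around parts of the regular expression denote capture groups, which are recorded in the Match object. \f matches the form feed character.

Edit: apologies, for some reason I read this as C# rather than Java, but the core concept is the same. Here is the Java regex documentation: http://docs.oracle.com/javase/tutorial/essential/regex/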
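For readers who want the same idea with Java's own regex classes, the following is a rough equivalent using `java.util.regex.Pattern` and `Matcher`. The class `PageBreakRegex` and the method `splitAtPageBreak` are hypothetical names chosen for this sketch. It assumes the input is a single line, such as one returned by `readLine()`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageBreakRegex {
    // Group 1: text before the last form feed on the line; group 2: text after it.
    // The first ".*" is greedy, so with several form feeds it extends to the last one.
    private static final Pattern PAGE_BREAK = Pattern.compile("^(.*)\\f(.*)$");

    // Returns {before, after} if the line contains a form feed, otherwise null.
    static String[] splitAtPageBreak(String line) {
        Matcher m = PAGE_BREAK.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Inside the `readLine()` loop, a non-null result from `PageBreakRegex.splitAtPageBreak(str)` would mark a page boundary.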


2014-07-02 15:56:29 neumann1990

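If the file is valid UTF-8, that is, the pages are separated by U+00FF, a.k.a. (char) 0xFF, a.k.a. "\u00ff", 'ÿ', then a BufferedReader will do. If the separator is a raw byte 0xFF, there is a problem, because valid UTF-8 never contains the byte 0xFF.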

int soughtPageno = ...; // Counted from 0
int currentPageno = 0;
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), StandardCharsets.UTF_8))) {
    String str;
    while ((str = in.readLine()) != null && currentPageno <= soughtPageno) {
        int pos;
        // Re-search after each cut, since str shrinks as separators are consumed
        while ((pos = str.indexOf('\u00FF')) >= 0) {
            if (currentPageno == soughtPageno) {
                System.out.println(str.substring(0, pos));
                ++currentPageno;
                break;
            }
            ++currentPageno;
            str = str.substring(pos + 1);
        }
        if (currentPageno == soughtPageno) {
            System.out.println(str);
        }
    }
}
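The difference between the character U+00FF and the raw byte 0xFF can be checked directly. This small sketch (the class name `Utf8ByteCheck` is only illustrative) prints the UTF-8 bytes of 'ÿ' and tests whether an encoded string contains the byte 0xFF.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCheck {
    // Returns true if the UTF-8 encoding of the text contains the raw byte 0xFF.
    static boolean containsByteFF(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0xFF) == 0xFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+00FF is encoded as two bytes in UTF-8, neither of which is 0xFF.
        for (byte b : "\u00FF".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X ", b & 0xFF); // prints 0xC3 0xBF
        }
        System.out.println();
        System.out.println(containsByteFF("\u00FF")); // prints false
    }
}
```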

For a raw byte 0xFF (incorrect, hacked UTF-8), put a wrapping InputStream between the FileInputStream and the reader:

class PageInputStream extends InputStream {
    InputStream in;
    int pageno = 0;
    boolean eof = false;

    PageInputStream(InputStream in, int pageno) {
        this.in = in;
        this.pageno = pageno;
    }

    @Override
    public int read() throws IOException {
        if (eof) {
            return -1;
        }
        // Skip the pages before the sought one
        while (pageno > 0) {
            int c = in.read();
            if (c == 0xFF) {
                --pageno;
            } else if (c == -1) {
                eof = true;
                in.close();
                return -1;
            }
        }
        int c = in.read();
        if (c == 0xFF) {
            // End of the sought page
            c = -1;
            eof = true;
            in.close();
        }
        return c;
    }
}

Take this as an example; there is more work to be done.
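One way such a wrapper might be wired between the raw stream and the reader is sketched below. To keep the example self-contained it uses a compact stand-in, `SinglePageStream`, rather than the class above. That name, `SinglePageDemo`, and `readPage` are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class SinglePageDemo {
    // Compact stand-in for the wrapper above: exposes only the bytes of one page.
    static class SinglePageStream extends InputStream {
        private final InputStream in;
        private int pagesToSkip;
        private boolean done = false;

        SinglePageStream(InputStream in, int pageno) {
            this.in = in;
            this.pagesToSkip = pageno;
        }

        @Override
        public int read() throws IOException {
            while (!done && pagesToSkip > 0) { // discard earlier pages
                int c = in.read();
                if (c == 0xFF) {
                    --pagesToSkip;
                } else if (c == -1) {
                    done = true;
                }
            }
            if (done) {
                return -1;
            }
            int c = in.read();
            if (c == 0xFF || c == -1) { // end of the sought page
                done = true;
                return -1;
            }
            return c;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    // Decodes one page (counted from 0) of a stream whose pages are split by byte 0xFF.
    static String readPage(InputStream raw, int pageno) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new SinglePageStream(raw, pageno), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }
}
```

For a file this would be called as `SinglePageDemo.readPage(new FileInputStream(fileDir), soughtPageno)`. Because the separator byte is removed before decoding, the reader only ever sees the valid UTF-8 of a single page.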


2014-07-02 15:59:06
