I would suggest reading the file line by line and then calling split on word boundaries to get the number of words.
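import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public static void main(String[] args) throws IOException {
    final File file = new File("myFile");
    // Decode explicitly as UTF-8 instead of relying on the platform default.
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Note: splitting on \b also returns the whitespace and punctuation between words.
            final String[] words = line.split("\\b");
            System.out.println(words.length + " words in line \"" + line + "\".");
        }
    }
}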
This avoids having to call grep from your program.
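One caveat with splitting on `\b`: the resulting array also contains the separators (spaces, punctuation) between the words, so its length overstates the word count. A minimal sketch of counting by matching instead, assuming a "word" is a run of letters or digits (that definition and the names `WordCountSketch` and `countWords` are my own, not something from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {

    // Assumed definition of a word: one or more Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Counts the runs of letters/digits in a single line.
    static int countWords(final String line) {
        final Matcher matcher = WORD.matcher(line);
        int count = 0;
        while (matcher.find()) {
            ++count;
        }
        return count;
    }

    public static void main(String[] args) {
        final String line = "foo bar, baz";
        // The split result includes the separators as array elements.
        System.out.println(line.split("\\b").length + " tokens from split");
        System.out.println(countWords(line) + " words from matching");
    }
}
```

If your idea of a word differs (hyphens, apostrophes and so on), adjust the character class accordingly.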
The strange characters you are getting are most likely caused by using the wrong encoding. Are you sure your file is in "UTF-8"?
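If you are unsure about the encoding, one way to check is to decode the bytes strictly and see whether the decoder complains. This is only a sketch; the names `Utf8CheckSketch` and `isValidUtf8` are made up for illustration. Keep in mind that passing the check does not prove the file is UTF-8 (plain ASCII passes too), but failing it does show the file is something else.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8CheckSketch {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(final byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text encoded two ways: only the first is valid UTF-8.
        final byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        final byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

For a real file you could pass `Files.readAllBytes(file.toPath())` to the method, provided the file fits in memory.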
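EDIT
The OP wants to read one file line by line and then search another file for occurrences of each line read.
This can still be done more easily in Java. Depending on how large your other file is, you can either read it into memory first and search it there, or search it line by line as well.
A simple example that reads the file into memory: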
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
    final File corpusFile = new File("corpus");
    final String corpusFileContent = readFileToString(corpusFile);
    final File file = new File("myEngramFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final int matches = countOccurencesOf(line, corpusFileContent);
            System.out.println(matches + " matches for \"" + line + "\".");
        }
    }
}

// Reads the whole file and decodes it as UTF-8 in one step, so a multi-byte
// character can never be split across two buffer reads.
private static String readFileToString(final File file) throws IOException {
    return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
}

private static int countOccurencesOf(final String countMatchesOf, final String inString) {
    // Pattern.quote makes the searched line match as literal text, not as a regex.
    final Matcher matcher =
            Pattern.compile("\\b" + Pattern.quote(countMatchesOf) + "\\b").matcher(inString);
    int count = 0;
    while (matcher.find()) {
        ++count;
    }
    return count;
}
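This should work fine if your "corpus" file is smaller than a hundred megabytes or so. For anything larger you will want to change the "countOccurencesOf" method to something like this: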
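import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
    // Compile once, then reuse the pattern for every line; Pattern.quote keeps the search literal.
    final Pattern pattern = Pattern.compile("\\b" + Pattern.quote(countMatchesOf) + "\\b");
    int count = 0;
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inFile), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                ++count;
            }
        }
    }
    return count;
}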
Now you simply pass your "File" object into the method instead of the stringified file.
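One more thing to watch out for: the line you search for ends up inside a regular expression, so characters like `.` or `+` in it would be interpreted as regex syntax unless they are quoted. A small sketch of the difference, using hypothetical helper names (`countLiteral`, `countAsRegex`) that are not part of the code above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSearchSketch {

    // Counts whole-word occurrences of `literal`, treating it as plain text.
    static int countLiteral(final String literal, final String text) {
        return count(Pattern.compile("\\b" + Pattern.quote(literal) + "\\b"), text);
    }

    // Same search, but the input is interpreted as a regular expression.
    static int countAsRegex(final String regex, final String text) {
        return count(Pattern.compile("\\b" + regex + "\\b"), text);
    }

    private static int count(final Pattern pattern, final String text) {
        final Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            ++count;
        }
        return count;
    }

    public static void main(String[] args) {
        final String text = "abc a.c axc";
        System.out.println(countAsRegex("a.c", text)); // 3, because '.' matches any character
        System.out.println(countLiteral("a.c", text)); // 1, only the literal "a.c"
    }
}
```

Note that `\b` only helps when the searched text starts and ends with word characters, so lines that begin or end with punctuation may need a different boundary check.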
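Note that the streaming approach reads the file line by line and therefore discards the line breaks. If your Pattern depends on them, you need to add them back before parsing the String.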
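As a rough sketch of adding the line breaks back, assuming a pattern that needs to see the end of each line (the class and method names here are illustrative only):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineAwareCountSketch {

    // Counts matches line by line, re-appending the '\n' that readLine() strips.
    // Caveats: "\r\n" is normalized to "\n", and the last line gets a '\n'
    // even if the input did not end with one.
    static int countPerLine(final Pattern pattern, final BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            final Matcher matcher = pattern.matcher(line + "\n");
            while (matcher.find()) {
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Matches "bar" only when it is directly followed by a line break.
        final Pattern barAtLineEnd = Pattern.compile("bar\\n");
        try (BufferedReader reader =
                new BufferedReader(new StringReader("foo bar\nbar baz\nbar"))) {
            System.out.println(countPerLine(barAtLineEnd, reader)); // 2
        }
    }
}
```

Without the re-appended `"\n"` this pattern would never match, because no line returned by `readLine()` contains a line break.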
Why not use the Java regex engine? – 2013-04-07 11:38:32
Are you sure your file is encoded in UTF-8? It is more likely ISO-8859-1, ISO-8859-15 or something similar. – 2013-04-07 11:38:41