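findWithinHorizon fails to match

I want to count the occurrences of the pattern "$$$$" in a text file. The method below works for some files but not for all of them. For example, it does not work for the following file (http://www.hmdb.ca/downloads/structures.zip - a zipped text file with an .sdf extension), and I cannot figure out why. I also tried escaping whitespace, with no luck. It returns 11 when the file contains more than 35000 "$$$$" patterns. Note that speed is critical, so I cannot use any slower method.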

public static void countMoleculesInSDF(String fileName)
{
    int tot = 0;
    Scanner scan = null;
    Pattern pat = Pattern.compile("\\$\\$\\$\\$");

    try {
        File file = new File(fileName);
        scan = new Scanner(file);
        long start = System.nanoTime();
        while (scan.findWithinHorizon(pat, 0) != null) {
            tot++;
        }
        long dur = (System.nanoTime() - start) / 1000000;
        System.out.println("Results found: " + tot + " in " + dur + " msecs");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // scan is still null if the Scanner constructor threw
        if (scan != null) {
            scan.close();
        }
    }
}


2013-10-08 lochi

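With the linked file and the code you posted, I consistently get a total of 218 matches. That is certainly not correct: verifying with the count function of Notepad++, the file should contain 41498 matches. So something had to be going wrong inside the Scanner (I thought), and I started debugging it at the point where the last match is found, i.e. when the Scanner reports that there are no more matches. Doing so, I came across an exception in its private method readInput() that is not thrown directly but is instead stored in a variable (lastException).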

try { 
    n = source.read(buf); 
} catch (IOException ioe) { 
    lastException = ioe; 
    n = -1; 
}

This exception can be retrieved with the method Scanner#ioException():

IOException ioException = scanner.ioException(); 
if (ioException != null) { 
    ioException.printStackTrace(); 
}

Printing this exception then shows that some input could not be decoded:

java.nio.charset.UnmappableCharacterException: Input length = 1 
    at java.nio.charset.CoderResult.throwException(CoderResult.java:278) 
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338) 
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) 
    at java.io.Reader.read(Reader.java:100) 
    at java.util.Scanner.readInput(Scanner.java:849)

So I simply tried passing a charset to the Scanner constructor:

scan = new Scanner(file, "utf-8");

And that made it work!

Results found: 41498 in 2431 msecs

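So the problem was that the Scanner used the platform's default charset, which is not able to decode your file completely.

The moral of the story:

Always pass a charset explicitly when working with text.
Check for an IOException when using Scanner.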
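Putting the two rules above together, a minimal sketch could look like the following. The class name SdfCounter and the method name countMolecules are made up for illustration, and the right charset for a given file is an assumption the caller has to supply. The sketch rethrows whatever Scanner#ioException() reports, so a decoding failure can no longer pass for a low match count.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class SdfCounter {
    private static final Pattern END_OF_MOLECULE = Pattern.compile(Pattern.quote("$$$$"));

    // Counts "$$$$" occurrences, decoding the file with the given charset name.
    static int countMolecules(File file, String charsetName) throws IOException {
        int total = 0;
        try (Scanner scan = new Scanner(file, charsetName)) {
            // Horizon 0 means the search is not bounded.
            while (scan.findWithinHorizon(END_OF_MOLECULE, 0) != null) {
                total++;
            }
            // Scanner stores read/decoding errors instead of throwing them.
            IOException swallowed = scan.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
        }
        return total;
    }
}
```

A call such as SdfCounter.countMolecules(new File(fileName), "utf-8") then either returns the count or fails loudly.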

PS: there are a couple of handy ways to quote a string so that it is treated literally in a regular expression:

Pattern pat = Pattern.compile("\\Q$$$$\\E");

or

Pattern pat = Pattern.compile(Pattern.quote("$$$$"));
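As a quick check that the quoted forms behave like the manually escaped pattern from the question, here is a small sketch. The class QuoteDemo and its method names are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three ways of writing the pattern.
final class QuoteDemo {
    static final Pattern ESCAPED = Pattern.compile("\\$\\$\\$\\$");
    static final Pattern QUOTED_INLINE = Pattern.compile("\\Q$$$$\\E");
    static final Pattern QUOTED = Pattern.compile(Pattern.quote("$$$$"));

    // Counts non-overlapping matches of p in text.
    static int count(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True if all three patterns find the same number of matches in text.
    static boolean allAgree(CharSequence text) {
        int expected = count(ESCAPED, text);
        return count(QUOTED_INLINE, text) == expected
            && count(QUOTED, text) == expected;
    }
}
```

Without any quoting, each `$` in "$$$$" would be read as an end-of-input anchor rather than a literal dollar sign, which is why one of these forms is needed.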


2013-10-08 19:31:13 A4L

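Thank you very much for your effort. Excellent answer. – lochi

@lochi You're welcome! – A4L

Here is what I ended up implementing (before you posted your answer). This approach seems to be faster than the Scanner. Which implementation would you recommend, the Scanner or memory mapping? Will memory mapping fail for large files? I'm not sure.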

private static final Charset CHARSET = Charset.forName("ISO-8859-15");
private static final CharsetDecoder DECODER = CHARSET.newDecoder();

public static int getNoOfMoleculesInSDF(String fileName)
{
    int total = 0;
    Pattern endOfMoleculePattern = Pattern.compile("\\$\\$\\$\\$");
    // try-with-resources closes the stream and its channel, even on failure
    try (FileInputStream fis = new FileInputStream(fileName);
         FileChannel fc = fis.getChannel()) {
        // map() takes a long size; it rejects sizes above Integer.MAX_VALUE
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        // Decodes the whole file into a single CharBuffer on the heap
        CharBuffer cb = DECODER.decode(mbb);
        Matcher matcher = endOfMoleculePattern.matcher(cb);
        while (matcher.find()) {
            total++;
        }
    } catch (Exception e) {
        // Pass the exception along so the cause is not lost
        LOGGER.error("An error occurred while counting molecules in the SD file", e);
    }
    return total;
}


2013-10-08 21:02:24 lochi

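This approach looks good, but unfortunately it does not work for large files such as the one you linked (~250 MB). It crashes with 'OutOfMemoryError: Java heap space' because 'DECODER.decode(mbb)' tries to allocate a char buffer as large as the file itself, and even increasing the JVM heap space with the '-Xmx' option does not avoid it. What I had tried before was using a buffered reader and applying the pattern to each line; that worked fine but took 4 times as long as the Scanner. I think the Scanner approach is the best choice for avoiding an OOME at runtime. The Scanner's buffer is only 1024! – A4L

See the answers to this question (http://stackoverflow.com/questions/7298455/huge-arrays-throws-out-of-memory-despite-enough-memory-available) for why an OOME can still occur despite the '-Xmx' setting. – A4L

This method works with -Xms2000m. It is much faster: 600 msecs versus 1900 msecs for the same file. However, limited memory would become a problem. I'm going with the Scanner. – lochi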
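For reference, the buffered-reader alternative mentioned in the comments above could be sketched roughly as follows. This is not the code A4L actually ran: the class LineCounter and its method are hypothetical, and a plain indexOf search stands in for applying the regex to each line. Only one line is held in memory at a time, so the heap requirement does not grow with the file size. No timings were measured for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class illustrating a line-by-line count.
final class LineCounter {
    private static final String TERMINATOR = "$$$$";

    // Counts non-overlapping occurrences of "$$$$", reading one line at a time.
    static int countTerminators(Path file, Charset charset) throws IOException {
        int total = 0;
        // The charset is passed explicitly instead of relying on the platform default.
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int from = 0;
                while ((from = line.indexOf(TERMINATOR, from)) >= 0) {
                    total++;
                    from += TERMINATOR.length(); // skip past this match
                }
            }
        }
        return total;
    }
}
```

Because the search runs per line, a match can never span a line break, which is fine for a terminator that contains no line break itself.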
