Finding space-separated names using Apache OpenNLP

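I am using the NER of Apache OpenNLP and have successfully trained it on my custom data. When using the name finder, I split the given string on spaces and pass the resulting string array, as shown below.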

NameFinderME nameFinder = new NameFinderME(model); 
String []sentence = input.split(" "); //eg:- input = Give me list of test case in project X 
Span nameSpans[] = nameFinder.find(sentence);

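Because I split on spaces, "test" and "case" are passed as separate values, and the name finder never detects them. How can I overcome this? Is there a way to pass the complete string (without splitting it into an array) so that "test case" is treated as a whole?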


2017-01-30 Hari Ram

You can do this with a regular expression. Try replacing the split line with this:

String []sentence = input.split("\\s(?<!(\\stest\\s(?=case\\s)))");

There may be a better way to write the expression, but this works for me. The output is:

Give 
me 
list 
of 
test case 
in 
project 
X

Edit: If you are interested, you can check in detail where the split happens here: https://regex101.com/r/6HLBnL/1

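Edit 2: If you have many words that should not be separated, I wrote a method that generates the regex for you. This is what the regex should look like in this case (if you also do not want "in project" to be separated, in addition to "test case"):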

\s(?<!(\stest\s(?=case\s))|(\sin\s(?=project\s)))

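Below is a simple program that demonstrates it. In this example, you only need to put the words that should not be separated into the array unseparated.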

class NoSeparation { 

private static String[][] unseparated = {{"test", "case"}, {"in", "project"}}; 

private static String getRegex() { 
    String regex = "\\s(?<!"; 

    for (int i = 0; i < unseparated.length; i++) 
     regex += "(\\s" + unseparated[i][0] + "\\s(?=" + unseparated[i][1] + "\\s))|"; 

    // Remove the last | 
    regex = regex.substring(0, regex.length() - 1); 

    return (regex + ")"); 
} 

public static void main(String[] args) { 
    String input = "Give me list of test case in project X"; 
    String []sentence = input.split(getRegex()); 

    for (String i: sentence) 
     System.out.println(i); 
} 
}

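Edit 3: The following is a very dirty way of handling phrases of more than two words. It works, but I am fairly sure it can be done more efficiently. It is fine for short inputs but may be slow for longer ones.

You have to put the words that should not be split into a 2D array, such as unseparated. You should also choose a different separator if for some reason you do not want to use %% (for example, if there is a chance your input contains it).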

class NoSeparation { 

private static final String SEPARATOR = "%%"; 
private static String[][] unseparated = {{"of", "test", "case"}, {"in", "project"}}; 

private static String[] splitString(String in) { 
    String[] splitted; 

    for (int i = 0; i < unseparated.length; i++) { 
     String toReplace = ""; 
     String replaceWith = ""; 
     for (int j = 0; j < unseparated[i].length; j++) { 
      toReplace += unseparated[i][j] + ((j < unseparated[i].length - 1)? " " : ""); 
      replaceWith += unseparated[i][j] + ((j < unseparated[i].length - 1)? SEPARATOR : ""); 
     } 

     in = in.replaceAll(toReplace, replaceWith); 
    } 

    splitted = in.split(" "); 

    for (int i = 0; i < splitted.length; i++) 
     splitted[i] = splitted[i].replaceAll(SEPARATOR, " "); 

    return splitted; 
} 

public static void main(String[] args) { 
    String input = "Give me list of test case in project X"; 
    // Uncomment this if there is a chance to have multiple spaces/tabs 
    // input = input.replaceAll("[\\s\\t]+", " "); 

    for (String str: splitString(input)) 
     System.out.println(str); 
} 
}
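
As a possibly cleaner alternative to the placeholder trick above, here is a minimal sketch that splits on whitespace first and then merges known phrases back into single tokens. The class name PhraseMerger, the method tokenize and the contents of PHRASES are hypothetical and only for illustration; this is not part of OpenNLP. When several phrases match at the same position, the longest one wins.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseMerger {

    // Hypothetical phrase list: each inner array is one multi-word name to keep together.
    private static final String[][] PHRASES = {
        {"defect", "detected", "in", "loop"},
        {"test", "case"}
    };

    // Returns true if the words of `phrase` appear in `tokens` starting at index `start`.
    private static boolean matchesAt(String[] tokens, int start, String[] phrase) {
        if (start + phrase.length > tokens.length) {
            return false;
        }
        for (int k = 0; k < phrase.length; k++) {
            if (!tokens[start + k].equals(phrase[k])) {
                return false;
            }
        }
        return true;
    }

    // Splits on runs of whitespace, then joins any listed phrase back into one token.
    static String[] tokenize(String input) {
        String[] tokens = input.trim().split("\\s+");
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // Pick the longest phrase that starts at position i, if any.
            String[] longest = null;
            for (String[] phrase : PHRASES) {
                if (matchesAt(tokens, i, phrase)
                        && (longest == null || phrase.length > longest.length)) {
                    longest = phrase;
                }
            }
            if (longest != null) {
                merged.add(String.join(" ", longest));
                i += longest.length;
            } else {
                merged.add(tokens[i]);
                i++;
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("Give me list of test case in project X")) {
            System.out.println(token);
        }
    }
}
```

Because the input is split on `\\s+`, multiple spaces or tabs between words need no separate cleanup step, and no separator string has to be reserved. The resulting array can be passed to nameFinder.find in the same way as the output of the original split.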


2017-01-30 13:29:33 jackgu1988

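OK, but what if I have many space-separated words (somewhere in the range of 15-20)? How would I use the `split()` function in that case? Would this approach still be effective?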

@HariRam Please check my second edit. I added some code. – jackgu1988

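Thanks, dude! But the thing is, I may also have names made up of 3 or 4 space-separated words (for example, "defect detected in loop"). What should the regex look like for 3 or 4 space-separated words? I don't mind writing a function that generates the regex; I just need the format of the regex string for that case.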
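
For phrases of three or four words, one possible regex format is to match tokens instead of splitting: an alternation of the form `phrase(?!\S)|...|\S+`, where the words inside each phrase are joined by `\s+`. The sketch below generates such a pattern; the class name PhraseTokenizer and its methods are hypothetical, and it assumes phrases are listed so that a longer phrase comes before any shorter phrase that is a prefix of it, since alternatives are tried left to right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseTokenizer {

    // Builds a pattern of the form  w1\s+w2...(?!\S)|...|\S+
    // Each word is quoted so regex metacharacters in it are taken literally.
    static Pattern buildPattern(String[][] phrases) {
        StringBuilder sb = new StringBuilder();
        for (String[] phrase : phrases) {
            for (int j = 0; j < phrase.length; j++) {
                if (j > 0) {
                    sb.append("\\s+");
                }
                sb.append(Pattern.quote(phrase[j]));
            }
            // (?!\S) requires the phrase to end at whitespace or end of input.
            sb.append("(?!\\S)|");
        }
        // Fallback: any other run of non-whitespace characters is one token.
        sb.append("\\S+");
        return Pattern.compile(sb.toString());
    }

    // Collects every match as a token, collapsing inner whitespace to one space.
    static String[] tokenize(String input, String[][] phrases) {
        List<String> tokens = new ArrayList<>();
        Matcher m = buildPattern(phrases).matcher(input);
        while (m.find()) {
            tokens.add(m.group().replaceAll("\\s+", " "));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[][] phrases = {
            {"defect", "detected", "in", "loop"},
            {"test", "case"}
        };
        for (String t : tokenize("Report defect  detected in loop for test case 7", phrases)) {
            System.out.println(t);
        }
    }
}
```

This avoids lookbehind entirely, so the number of words in a phrase does not change the shape of the expression, and extra spaces between the words of a phrase are tolerated.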
