Iterating over two files containing named entity maps and computing precision and recall

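I have two files that I need to iterate over in order to compute the precision and recall of my named entity tagger. One file is the gold set and the other is my system's output. I just want to understand how to iterate over the sentences in both files and count the number of full and partial matches. I only want to count matches for organizations, persons, and locations. Pseudocode, or just an idea to get me started, would work fine.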

File 1: Gold set

Sentence 1: 
{ORGANIZATION=[Fulton County Grand Jury]} 
Sentence 2: 
{ORGANIZATION=[City Executive Committee]} 
{LOCATION=[City of Atlanta]} 
Sentence 3: 
{LOCATION=[Fulton]} 
{PERSON=[Superior Court Judge Durwood Pye]} 
{PERSON=[Mayor-nominate Ivan Allen Jr.]} 
Sentence 4: 
Sentence 5: 
Sentence 6: 
{LOCATION=[Fulton]} 
Sentence 7: 
{LOCATION=[Fulton County]} 
Sentence 8: 
Sentence 9: 
{ORGANIZATION=[City Purchasing Department]} 
Sentence 10: 
Sentence 11: 
Sentence 12: 
{ORGANIZATION=[State Welfare Department]} 
Sentence 13: 
{LOCATION=[Fulton County]} 
{ORGANIZATION=[State Welfare Department]} 
{LOCATION=[Fulton County]}

File 2: My output

Sentence 1: 
{ORGANIZATION=[Fulton County Grand Jury], DATE=[Friday], LOCATION=[Atlanta]} 
Sentence 2: 
{ORGANIZATION=[City Executive Committee], LOCATION=[Atlanta]} 
Sentence 3: 
{ORGANIZATION=[Fulton Superior Court Judge Durwood Pye], DATE=[September October], PERSON=[Ivan Allen Jr.]} 
Sentence 4: 
Sentence 5: 
{LOCATION=[Georgia]} 
Sentence 6: 
Sentence 7: 
{LOCATION=[Atlanta, Fulton County]} 
Sentence 8: 
Sentence 9: 
{ORGANIZATION=[City Purchasing Department]} 
Sentence 10: 
{LOCATION=[Georgia]} 
Sentence 11: 
Sentence 12: 
{ORGANIZATION=[State Welfare Department]} 
Sentence 13: 
{ORGANIZATION=[State Welfare Department], LOCATION=[Fulton County, Fulton County]}


2017-02-07 serendipity

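You can start by parsing the files as shown here and collecting the data you need. The following extracts every line that begins with an ORGANIZATION entry.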

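import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Collect every line whose first entry is an ORGANIZATION mapping.
List<String> orgLines = new ArrayList<>();
try (Scanner scanner = new Scanner(new File("path-to-file"))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        if (line.startsWith("{ORGANIZATION")) {
            orgLines.add(line);
        }
    }
}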

Once you have the results from both files, you can use retainAll to find the exact matches.

orgLines.retainAll(orgLines2);
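Note that retainAll modifies the list it is called on, so working on a copy is safer if the original lines are needed later. A minimal sketch of that idea follows; the class name ExactMatchDemo and the method exactMatches are made up for illustration, and the sample lines are taken from the two files in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExactMatchDemo {
    // Returns the gold lines that also appear verbatim in the system output.
    static List<String> exactMatches(List<String> goldLines, List<String> systemLines) {
        // Copy first: retainAll removes elements from the list it is called on.
        List<String> matches = new ArrayList<>(goldLines);
        matches.retainAll(systemLines);
        return matches;
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList(
            "{ORGANIZATION=[Fulton County Grand Jury]}",
            "{ORGANIZATION=[City Purchasing Department]}");
        List<String> system = Arrays.asList(
            "{ORGANIZATION=[Fulton County Grand Jury], DATE=[Friday], LOCATION=[Atlanta]}",
            "{ORGANIZATION=[City Purchasing Department]}");
        System.out.println(exactMatches(gold, system));
    }
}
```

Here only the City Purchasing Department line survives, because the Fulton County Grand Jury line in the system output carries extra DATE and LOCATION entries and is therefore a different string. Lines read from the files may also need trim() before comparison, since trailing whitespace would break string equality.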

For partial matches, you need to iterate over all the entries and count them according to your own matching logic.
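One possible shape for that logic, as a sketch only: the class NerScorer, its Entity type, and the rule that a partial match means "same type and one name contains the other" are all assumptions made for illustration, not something the question specifies. Precision is computed as matches divided by system entities, and recall as matches divided by gold entities; the lenient variants also count partial matches as hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NerScorer {
    // Hypothetical value type: one tagged entity, e.g. LOCATION / "Atlanta".
    static final class Entity {
        final String type, name;
        Entity(String type, String name) { this.type = type; this.name = name; }
    }

    int goldTotal, systemTotal, exact, partial;

    // Scores one sentence and adds the result to the running counts.
    void addSentence(List<Entity> gold, List<Entity> system) {
        goldTotal += gold.size();
        systemTotal += system.size();
        List<Entity> unmatched = new ArrayList<>(system);
        List<Entity> leftoverGold = new ArrayList<>();
        // First pass: exact matches, each system entity used at most once.
        for (Entity g : gold) {
            int i = find(unmatched, g, true);
            if (i >= 0) { unmatched.remove(i); exact++; } else { leftoverGold.add(g); }
        }
        // Second pass: partial matches among whatever is left.
        for (Entity g : leftoverGold) {
            int i = find(unmatched, g, false);
            if (i >= 0) { unmatched.remove(i); partial++; }
        }
    }

    // Index of the first candidate matching g, or -1 if there is none.
    static int find(List<Entity> candidates, Entity g, boolean exactOnly) {
        for (int i = 0; i < candidates.size(); i++) {
            Entity s = candidates.get(i);
            if (!s.type.equals(g.type)) continue;
            boolean same = s.name.equals(g.name);
            // Assumed partial rule: one name is a substring of the other.
            boolean overlap = s.name.contains(g.name) || g.name.contains(s.name);
            if (exactOnly ? same : overlap) return i;
        }
        return -1;
    }

    double precision(boolean lenient) {
        int hits = exact + (lenient ? partial : 0);
        return systemTotal == 0 ? 0.0 : (double) hits / systemTotal;
    }

    double recall(boolean lenient) {
        int hits = exact + (lenient ? partial : 0);
        return goldTotal == 0 ? 0.0 : (double) hits / goldTotal;
    }

    public static void main(String[] args) {
        NerScorer scorer = new NerScorer();
        // Sentence 2 from the question: gold entities vs. system entities.
        scorer.addSentence(
            Arrays.asList(new Entity("ORGANIZATION", "City Executive Committee"),
                          new Entity("LOCATION", "City of Atlanta")),
            Arrays.asList(new Entity("ORGANIZATION", "City Executive Committee"),
                          new Entity("LOCATION", "Atlanta")));
        System.out.println("exact=" + scorer.exact + " partial=" + scorer.partial);
        System.out.println("strict precision=" + scorer.precision(false)
            + " lenient recall=" + scorer.recall(true));
    }
}
```

Calling addSentence once per sentence number, with the entities from each file for that sentence, keeps a spurious entity in one sentence from matching a gold entity in another. Under this rule, an entity tagged with the wrong type (such as the judge tagged as ORGANIZATION in sentence 3) counts as a miss; whether that is the desired behavior is a design choice.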


2017-02-07 09:36:42

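Don't I need to iterate over the map values to extract the organization values? See the second file. My lines may not always start with the ORGANIZATION key. – serendipity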

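The current exact match only matches organizations whose other fields have identical values too, that is, lines that are exactly the same, for example "{ORGANIZATION=[State Welfare Department]}". If you would rather match on the names while ignoring DATE and so on, you need to build custom logic. –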
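One way such custom logic could begin is by parsing each line into individual entities and dropping the types that are not scored. This is a sketch under assumptions: the class EntityLineParser and its "TYPE, tab, name" output encoding are invented for illustration, and it assumes entity names contain neither a comma nor a closing bracket, since the file format uses both as delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityLineParser {
    // Only these entity types are kept; DATE and anything else is ignored.
    static final Set<String> TYPES =
        new HashSet<>(Arrays.asList("ORGANIZATION", "PERSON", "LOCATION"));

    // Matches one KEY=[value, value, ...] entry anywhere in a line.
    static final Pattern ENTRY = Pattern.compile("(\\w+)=\\[([^\\]]*)\\]");

    // Returns the entities on one line as "TYPE<tab>name" strings.
    static List<String> parse(String line) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTRY.matcher(line);
        while (m.find()) {
            String type = m.group(1);
            if (!TYPES.contains(type)) continue;
            // A bracketed list may hold several names separated by commas.
            for (String name : m.group(2).split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) entities.add(type + "\t" + trimmed);
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String line = "{ORGANIZATION=[Fulton Superior Court Judge Durwood Pye], "
            + "DATE=[September October], PERSON=[Ivan Allen Jr.]}";
        for (String entity : parse(line)) {
            System.out.println(entity);
        }
    }
}
```

Because the pattern searches the whole line, it does not matter whether ORGANIZATION is the first key. Lines such as "Sentence 4:" contain no entries and yield an empty list, so the same method can be applied to every line, with the "Sentence N:" headers used separately to group entities by sentence.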

If you are using Stanford NER, why not use the built-in command to test the classifier?

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier path/to/ner-model.ser.gz -testFile gold-annotated-text.tsv

You will have to convert your gold set into the format that this command expects.

Reference: http://nlp.stanford.edu/software/crf-faq.html#a


2017-02-17 18:52:05 Simon
