Detecting employee designations from text using NER/NLP

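I am very new to NLP. I am interested in detecting the position/designation/role of people, along with their names, email addresses, phone numbers and so on. I tried using Stanford NLP to detect names in text. Parsing email addresses and phone numbers looks pretty straightforward, but I am unable to detect the designation in a given text.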

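For example, here is some text:

1) Some sample example of Dean of the Hospital, Dr. A.B. Ahmed, example1@example.com
Name: A.B. Ahmed, Email: example1@example.com

2) Vice Dean Academics, Prof. S. Antony, [email protected]
Name: S. Antony, Email: [email protected]

3) Dean Academics & PG-Cell & Surg. Discipline Resident Training. Programme, Mr. Sandeep
Name: Mr. Sandeep, Email: none

4) Director, Networks, Robert Adams, example3@example.com, 9900131213
Name: Robert Adams, Email: example3@example.com, Phone: 9900131213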

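I am not interested in any regex-matching approach, because the text is non-deterministic in nature. What I want to know is how to extract the designations shown above from the text. Any solution is fine, even one beyond Stanford NLP, such as nltk or lingpipe. If I use Stanford NLP, how do I train a model with a different entity type such as "POSITION" or "DESIGNATION", and how do I include that model alongside my other models? (I am running Stanford NLP in server mode.)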

Source

2013-10-17 Venkatesan Vaidhyanathan

You need to train your own NER model, introducing your own label 'DESIGNATION' in your training set. Have a look at their documentation: http://nlp.stanford.edu/software/crf-faq.shtml#a – meghamind

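To train the Stanford tools for 'DESIGNATION' you need a lot of training data. You will have to collect a larger dataset, because a small amount of data may not give you correct results.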
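As a rough illustration of what the comments above describe, here is a minimal sketch of preparing training input for a custom 'DESIGNATION' label. It assumes the layout described in the linked FAQ: one token per line, a tab, then the label, with `O` for tokens outside any entity, plus a properties file naming `trainFile`, `serializeTo` and `map`. The class name, the file names and the sample sentence are hypothetical, and the feature flags the FAQ lists are left out for brevity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch only: writes token-per-line training data and a minimal training properties file. */
class DesignationTrainingData {

    /** Joins {token, label} pairs into "token<TAB>label" lines, one token per line. */
    static String toTsv(String[][] labeledTokens) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : labeledTokens) {
            sb.append(pair[0]).append('\t').append(pair[1]).append('\n');
        }
        return sb.toString();
    }

    /** Minimal training properties; the FAQ's feature flags (useWord, useNGrams, ...) are omitted. */
    static String trainingProperties(String trainFile, String modelOut) {
        return "trainFile = " + trainFile + "\n"
             + "serializeTo = " + modelOut + "\n"
             + "map = word=0,answer=1\n";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical hand-labelled sentence; real training needs many such sentences.
        String[][] sentence = {
            {"Robert", "PERSON"}, {"Adams", "PERSON"}, {"is", "O"},
            {"the", "O"}, {"Director", "DESIGNATION"}, {".", "O"}
        };
        Files.write(Paths.get("designation-train.tsv"),
                toTsv(sentence).getBytes(StandardCharsets.UTF_8));
        Files.write(Paths.get("designation.prop"),
                trainingProperties("designation-train.tsv", "designation-model.ser.gz")
                        .getBytes(StandardCharsets.UTF_8));
        // Training is then run from the command line, as in the FAQ:
        //   java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop designation.prop
    }
}
```

On including the resulting model with other models: CoreNLP's `ner.model` property is documented as taking a comma-separated list of model paths, so the custom model could probably be listed next to the default ones. Treat that as something to check against the CoreNLP version in use.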

Try using the following rules file (designation.rules.txt):

ENV.defaultStringPatternFlags = 2 

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" } 
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" } 

$Designation = (
    /CFO/| 
    /Director/| 
    /CEO/| 
    /Chief/| 
    /Executive/| 
    /Officer/| 
    /Vice/| 
    /President/| 
    /Senior/| 
    /Financial/ 
) 

ENV.defaults["ruleType"] = "tokens" 
ENV.defaults["stage"] = 1 
{ 
    pattern: ($Designation), 
    action: (Annotate($0, ner, "DESIGNATION")) 
} 

ENV.defaults["stage"] = 2 
{ 
    ruleType: "tokens", 
    pattern: (([ { ner:PERSON } ]) /has/ ([ { ner:DESIGNATION } ]+)), 
    result: Format("hasDesignation(%s,%s)",$1.word, Join(" ",$2.word)) 
}

Then run it with the following Java file:

package org.itcookies.nlpdemo; 

import java.io.IOException; 
import java.io.PrintWriter; 
import java.util.List; 
import java.util.Properties; 

import edu.stanford.nlp.io.IOUtils; 
import edu.stanford.nlp.ling.CoreAnnotations; 
import edu.stanford.nlp.ling.CoreLabel; 
import edu.stanford.nlp.pipeline.Annotation; 
import edu.stanford.nlp.pipeline.StanfordCoreNLP; 
import edu.stanford.nlp.util.CoreMap; 

/** 
* Demo illustrating how to use TokensRegexAnnotator 
*/ 
public class TokensRegexAnnotatorDemo { 

    public static void main(String[] args) throws IOException {
        // Optional arguments: args[0] = rules file, args[1] = input text file, args[2] = output file
        PrintWriter out;

        String rules;
        if (args.length > 0) {
            rules = args[0];
        } else {
            rules = "org/itcookies/nlp/rules/designation.rules.txt";
        }
        if (args.length > 2) {
            out = new PrintWriter(args[2]);
        } else {
            out = new PrintWriter(System.out);
        }

        // Run the standard annotators first, then the TokensRegex rules as a custom annotator
        Properties properties = new Properties();
        properties.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregexdemo");
        properties.setProperty("customAnnotatorClass.tokensregexdemo", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        properties.setProperty("tokensregexdemo.rules", rules);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
        Annotation annotation;
        if (args.length > 1) {
            annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[1]));
        } else {
            annotation = new Annotation("John is CEO of ITCookies");
        }

        pipeline.annotate(annotation);

        // An Annotation is a Map and you can get and use the various analyses individually.
        out.println();
        // The toString() method on an Annotation just prints the text of the Annotation
        // But you can see what is in it with other methods like toShorterString()
        out.println("The top level annotation");
        out.println(annotation.toShorterString());
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // NOTE: Depending on what tokensregex rules are specified, there are other annotations
            //  that are of interest other than just the tokens and what we print out here
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Print out words, lemma, ne, and normalized ne
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                String normalized = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                // Constant-first equals avoids a NullPointerException if a token has no NER tag
                if ("DESIGNATION".equals(ne)) {
                    out.println("token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne + ", normalized=" + normalized);
                }
            }
        }
        out.flush();
    }

}

This produces the following output:

The top level annotation 
[Text=John is CEO of ITCookies Tokens=[John-1, is-2, CEO-3, of-4, ITCookies-5] Sentences=[John is CEO of ITCookies]] 
token: word=CEO, lemma=CEO, pos=NNP, ne=DESIGNATION, normalized=null
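The demo prints one line per tagged token, so a multi-word title such as "Chief Executive Officer" would come out as three separate tokens. A minimal sketch of merging consecutive DESIGNATION tokens into a single phrase follows. It is plain Java with no CoreNLP dependency; the class name, method name and hand-written tags are illustrative assumptions, not output from the pipeline.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: merges runs of adjacent tokens that share a given NER label. */
class DesignationSpans {

    /** Returns each maximal run of consecutive tokens tagged with the label, joined by spaces. */
    static List<String> mergeSpans(String[] words, String[] nerTags, String label) {
        List<String> spans = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (label.equals(nerTags[i])) {          // null-safe: nerTags[i] may be null
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words[i]);
            } else if (current.length() > 0) {       // run ended, store it
                spans.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {                  // run reaching the end of the sentence
            spans.add(current.toString());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tags written by hand for illustration; in the demo they would come from each CoreLabel.
        String[] words = {"John", "is", "Chief", "Executive", "Officer", "of", "ITCookies"};
        String[] tags = {"PERSON", "O", "DESIGNATION", "DESIGNATION", "DESIGNATION", "O", "O"};
        System.out.println(mergeSpans(words, tags, "DESIGNATION")); // prints [Chief Executive Officer]
    }
}
```

In the demo above, the two arrays could be filled inside the token loop from `word` and `ne` before calling `mergeSpans` once per sentence.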

Source

2017-04-04 06:52:06
