Gender identification in natural language processing

I wrote the code below using the Stanford NLP package.

GenderAnnotator myGenderAnnotation = new GenderAnnotator(); 
myGenderAnnotation.annotate(annotation);

However, for the sentence "Annie goes to school", it fails to determine Annie's gender.

The output of the application is:

[Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON]
[Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O]
[Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O]
[Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O]
[Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]

What is the correct way to get the gender?


2013-05-01 quartz

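If your named entity recognizer outputs PERSON for a token, you can use a gender classifier based on first names (or build one if you don't have one). For an example, see the Gender Identification section of the NLTK library tutorial pages. They use the following features: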

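The last letter of the name.
The first letter of the name.
The length of the name (number of characters).
Character unigram presence (a boolean for whether each character occurs in the name).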
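
As a rough illustration of the four features listed above, here is a minimal sketch in plain Python. The function name `gender_features` and the feature keys are my own choices rather than the tutorial's exact code, and it assumes a non-empty name made of ASCII letters.

```python
import string


def gender_features(name):
    """Build the four feature groups listed above for one first name.

    Hypothetical helper: the key names are illustrative, not NLTK's.
    Assumes `name` is a non-empty string.
    """
    lowered = name.lower()
    features = {
        "last_letter": lowered[-1],
        "first_letter": lowered[0],
        "length": len(lowered),
    }
    # Character unigram presence: one boolean per letter of the alphabet.
    for letter in string.ascii_lowercase:
        features["has({})".format(letter)] = letter in lowered
    return features


if __name__ == "__main__":
    print(gender_features("Annie"))
```

A dictionary like this can be passed to an NLTK classifier as the feature set for one name, paired with its label.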

That said, I have a hunch that using character n-gram frequencies, possibly up to trigrams, will give you pretty good results.
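
To make the n-gram idea concrete, here is one possible sketch that counts character n-grams up to trigrams. The function name `char_ngram_features` and the `^`/`$` boundary markers are assumptions of mine, added so that name prefixes and suffixes show up as distinct n-grams; I have not measured how well these features perform.

```python
from collections import Counter


def char_ngram_features(name, max_n=3):
    """Count character n-grams of length 1..max_n in a lowercased name.

    Hypothetical helper. The "^" and "$" boundary markers are an
    assumption, used so that prefixes and suffixes become distinct n-grams.
    """
    padded = "^" + name.lower() + "$"
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the padded name.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(char_ngram_features("Anna"))
```

Note that a classifier such as NLTK's NaiveBayesClassifier treats each feature value as a category, so the counts here act as labels ("appears twice") rather than as numeric weights.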


2013-05-02 00:37:47

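The gender annotator does not add its information to the text output, but you can still access it through code, as in the following snippet: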

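import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.structure.MachineReadingAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Build a pipeline that includes the gender annotator.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("Annie goes to school");

pipeline.annotate(document);

// Read the gender annotation from each token; it is null when no gender was assigned.
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
    }
}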

Output:

Annie, Gender: FEMALE 
goes, Gender: null 
to, Gender: null 
school, Gender: null


2015-05-19 20:17:23

There are many approaches; one of them is outlined in the NLTK cookbook.

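Basically, you build a classifier that extracts some features from a name (the first letter, the last letter, the first two letters, the last two letters, and so on) and makes predictions based on those features.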

import nltk
import random

# The names corpus must be available locally, e.g. via nltk.download('names').


def extract_features(name):
    """Return prefix and suffix features for a single first name."""
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2]  # first two letters (name[:1] would duplicate 'first')
    }


f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

# Label every name, then shuffle so both sets mix the two genders.
all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

# Train on the first 500 names and test on the rest.
test_set = all_names[500:]
train_set = all_names[:500]

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat = [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))

This basic test gives you roughly 77% accuracy.


2015-09-18 07:43:37

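I put a '#' in front of each of the five features, e.g. "# 'last_char': name[-1],", so no features should be extracted at all, and running the code gives an accuracy of 62-63%. Why does predicting with no features do better than a coin toss (50%)? – KubiK888 2015-09-22 02:48:43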

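@KubiK888 The reason is probably that the dataset is imbalanced (about 63% of the names in the corpus are female), and after learning this, NaiveBayes decides that the best approach is to always pick the majority class. – 2015-09-22 04:25:15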
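
To illustrate the point about imbalance: a classifier with no features can only fall back on the class prior, so its accuracy ends up near the share of the most common label. Below is a minimal sketch of that baseline; `majority_baseline` and the toy label list are hypothetical, and the 63/37 split is made up to mirror the figure in the comment, not computed from the real corpus.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical toy labels with a 63/37 split, not the real names corpus.
toy_labels = ["x"] * 63 + ["y"] * 37

if __name__ == "__main__":
    print(majority_baseline(toy_labels))
```

Comparing a model against this baseline rather than against 50% gives a fairer picture of how much the features actually contribute.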
