Asked 2014-03-01 · 93 views · question score 10
from __future__ import division 
import urllib 
import json 
from math import log 


def hits(word1, word2=""): 
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s" 
    if word2 == "": 
        results = urllib.urlopen(query % urllib.quote(word1)) 
    else: 
        # URL-encode the whole query, including the AROUND(10) proximity operator 
        results = urllib.urlopen(query % urllib.quote(word1 + " AROUND(10) " + word2)) 
    json_res = json.loads(results.read()) 
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount']) 
    return google_hits 


def so(phrase): 
    num = hits(phrase,"excellent") 
    #print num 
    den = hits(phrase,"poor") 
    #print den 
    ratio = num/den 
    #print ratio 
    sop = log(ratio) 
    return sop 

print so("ugly product") 

I need this code to compute pointwise mutual information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.

Python - sentiment analysis using pointwise mutual information

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with "excellent".

The code above computes the SO of a phrase. I use Google to get the hit counts and compute the SO (since AltaVista no longer exists).
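For reference, the full formula in Turney (2002) also divides by the standalone hit counts of the two anchor words, not just the ratio of the co-occurrence counts. A minimal self-contained sketch of that version, with a hypothetical counts table standing in for search-engine hits (all numbers below are made up for illustration):

```python
from math import log

# Hypothetical hit counts standing in for search-engine results.
counts = {
    ("great food", "excellent"): 120,  # hits("great food" AROUND(10) "excellent")
    ("great food", "poor"): 10,        # hits("great food" AROUND(10) "poor")
    ("excellent",): 16000,             # hits("excellent")
    ("poor",): 12000,                  # hits("poor")
}

def hits(*words):
    return counts[words]

def so(phrase):
    # Turney (2002):
    # SO(phrase) = log2( hits(phrase NEAR "excellent") * hits("poor")
    #                    / (hits(phrase NEAR "poor") * hits("excellent")) )
    return log((hits(phrase, "excellent") * hits("poor")) /
               float(hits(phrase, "poor") * hits("excellent")), 2)

print(so("great food"))  # positive: the phrase co-occurs more with "excellent"
```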

The computed values are very erratic and do not follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there a problem with the code? Is there an easier way to compute the SO of a phrase (using PMI) with some Python library, say NLTK? I tried NLTK but could not find an explicit method that computes PMI.


Any suggestions? – keshr3106


Ah, I have code for PMI, give me a minute. I'll upload it in a bit. – alvas

Answers

Answer (score 12)

Generally speaking, computing PMI is tricky because the formula changes depending on the size of the ngram you want to take into account:

Mathematically, for bigrams, you can simply consider:

log(p(a,b)/(p(a) * p(b))) 

Programmatically, let's say you have already computed all the unigram and bigram frequencies in your corpus; then you do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq): 
    # Marginal probabilities of the two words.
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values())) 
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values())) 
    # Joint probability of the bigram "word1 word2".
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values())) 
    # PMI in base 2.
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2) 
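A quick way to try the function is to build the frequency tables with collections.Counter; the tiny corpus below is purely illustrative (the pmi function is repeated so the snippet is self-contained):

```python
import math
from collections import Counter

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)

# Toy corpus: count unigrams and adjacent-word bigrams.
corpus = "this is a foo bar sentence and this is another foo bar sentence".split()
unigram_freq = Counter(corpus)
bigram_freq = Counter(" ".join(pair) for pair in zip(corpus, corpus[1:]))

print(pmi("foo", "bar", unigram_freq, bigram_freq))  # well above 0: "foo bar" sticks together
```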

This is a snippet of code from the MWE library, but it is at a pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Note that it does parallel MWE extraction, so here is how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py 
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt 
$ printf "" > trg.txt 
$ python 
>>> import codecs 
>>> from mwe import load_ngramfreq, extract_mwe 

>>> # Calculates the unigrams and bigrams counts. 
>>> # More superfluously, "Training a bigram 'language model'." 
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt') 

>>> sent = "This is another foo bar sentence not in the training corpus ." 

>>> for threshold in range(-2, 4): 
...  print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)] 

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .'] 
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .'] 
0 ['this is', 'foo bar', 'bar sentence'] 
1 ['this is', 'foo bar', 'bar sentence'] 
2 ['this is', 'foo bar', 'bar sentence'] 
3 ['foo bar', 'bar sentence'] 
4 [] 

For further details, I find this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ
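In case the mwe.py link goes stale, the same threshold trick can be sketched with the standard library alone: train unigram/bigram counts on one text, then keep only those bigrams of a new sentence whose PMI clears the threshold. The toy training text below is illustrative:

```python
from __future__ import division
import math
from collections import Counter

def train(text):
    """Count unigrams and adjacent-word bigrams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def pmi(bigram, uni, bi):
    p_xy = bi[bigram] / sum(bi.values())
    p_x = uni[bigram[0]] / sum(uni.values())
    p_y = uni[bigram[1]] / sum(uni.values())
    return math.log(p_xy / (p_x * p_y), 2)

def extract_mwe(sentence, uni, bi, threshold):
    """Keep bigrams of `sentence` already seen in training whose PMI >= threshold."""
    tokens = sentence.lower().split()
    return [" ".join(b) for b in zip(tokens, tokens[1:])
            if bi[b] > 0 and pmi(b, uni, bi) >= threshold]

uni, bi = train("this is a foo bar sentence . "
                "more foo bar text keeps the foo bar bigram frequent .")
print(extract_mwe("another foo bar sentence .", uni, bi, threshold=2))
```

As in the mwe.py demo above, raising the threshold prunes more candidate bigrams.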


Is this method useful for anything other than long texts? Say, Facebook comments, or other short texts? – Haseeb


It all depends on how PMI reacts to the text, and PMI seems to be very sensitive to a high denominator / low numerator, which allows false positives. – alvas

Answer (score 3)

To answer why your results are erratic, it is important to know that Google Search is not a reliable source of word frequencies. The frequencies returned by the engine are mere estimates, and those estimates are particularly inaccurate, and possibly contradictory, when querying multiple words. This is not to bash Google, but it is simply not built for frequency counts. Hence, your implementation may be fine, but the results built on top of it can still be nonsensical.

For a more in-depth discussion of this matter, read "Googleology is bad science" by Adam Kilgarriff.

Answer (score 4)

The Python library DISSECT contains a few methods to compute Pointwise Mutual Information on co-occurrence matrices.

Example:

#ex03.py 
#------- 
from composes.utils import io_utils 
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting 

#create a space from co-occurrence counts in sparse format 
my_space = io_utils.load("./data/out/ex01.pkl") 

#print the co-occurrence matrix of the space 
print my_space.cooccurrence_matrix 

#apply ppmi weighting 
my_space = my_space.apply(PpmiWeighting()) 

#print the co-occurrence matrix of the transformed space 
print my_space.cooccurrence_matrix 

The code for the PMI methods is on GitHub.
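If installing DISSECT is not an option, the PPMI weighting it applies can be approximated in a few lines of numpy: PPMI(i, j) = max(0, log2(p(i, j) / (p(i) * p(j)))). A sketch, with a made-up 2x2 co-occurrence matrix:

```python
from __future__ import division
import numpy as np

def ppmi(counts):
    """Positive PMI weighting of a co-occurrence count matrix."""
    total = counts.sum()
    p_xy = counts / total                             # joint probabilities
    p_x = counts.sum(axis=1, keepdims=True) / total   # row marginals
    p_y = counts.sum(axis=0, keepdims=True) / total   # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts -> 0
    return np.maximum(pmi, 0.0)                       # keep only positive PMI

counts = np.array([[4.0, 0.0],
                   [0.0, 4.0]])
print(ppmi(counts))
```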

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.

Related: Calculating pointwise mutual information between two strings