Asked 2014-03-01 · 93 views · question score 10
from __future__ import division 
import urllib 
import json 
from math import log 


def hits(word1, word2=""): 
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s" 
    if word2 == "": 
        results = urllib.urlopen(query % urllib.quote(word1)) 
    else: 
        # URL-encode the whole query, including the AROUND(10) proximity operator 
        results = urllib.urlopen(query % urllib.quote(word1 + " AROUND(10) " + word2)) 
    json_res = json.loads(results.read()) 
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount']) 
    return google_hits 


def so(phrase): 
    num = hits(phrase,"excellent") 
    #print num 
    den = hits(phrase,"poor") 
    #print den 
    ratio = num/den 
    #print ratio 
    sop = log(ratio) 
    return sop 

print so("ugly product") 

I need this code to compute pointwise mutual information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.

Python - sentiment analysis using pointwise mutual information

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with "excellent".

The code above computes the SO of a phrase. I use Google to get the hit counts and compute the SO (since AltaVista no longer exists).
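For reference, the full formula in Turney (2002) also divides by the standalone hit counts of the two anchor words, not just the ratio of the co-occurrence counts. A minimal self-contained sketch of that version, with a hypothetical counts table standing in for search-engine hits (all numbers below are made up for illustration):

```python
from math import log

# Hypothetical hit counts standing in for search-engine results.
counts = {
    ("great food", "excellent"): 120,  # hits("great food" AROUND(10) "excellent")
    ("great food", "poor"): 10,        # hits("great food" AROUND(10) "poor")
    ("excellent",): 16000,             # hits("excellent")
    ("poor",): 12000,                  # hits("poor")
}

def hits(*words):
    return counts[words]

def so(phrase):
    # Turney (2002):
    # SO(phrase) = log2( hits(phrase NEAR "excellent") * hits("poor")
    #                    / (hits(phrase NEAR "poor") * hits("excellent")) )
    return log((hits(phrase, "excellent") * hits("poor")) /
               float(hits(phrase, "poor") * hits("excellent")), 2)

print(so("great food"))  # positive: the phrase co-occurs more with "excellent"
```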

The computed values are very erratic and do not follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there a problem with the code? Is there an easier way to compute the SO of a phrase (using PMI) with some Python library, say NLTK? I tried NLTK but could not find an explicit method that computes PMI.


Any suggestions? – keshr3106


Ah, I have code for PMI, give me a minute. I'll upload it in a bit. – alvas

Answers

Answer (score 12)

Generally speaking, computing PMI is tricky because the formula changes depending on the size of the ngram you want to take into account:

Mathematically, for bigrams, you can simply consider:

log(p(a,b)/(p(a) * p(b))) 

Programmatically, let's say you have already computed all the unigram and bigram frequencies in your corpus; then you do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq): 
    # Marginal probabilities of the two words.
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values())) 
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values())) 
    # Joint probability of the bigram "word1 word2".
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values())) 
    # PMI in base 2.
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2) 
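A quick way to try the function is to build the frequency tables with collections.Counter; the tiny corpus below is purely illustrative (the pmi function is repeated so the snippet is self-contained):

```python
import math
from collections import Counter

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)

# Toy corpus: count unigrams and adjacent-word bigrams.
corpus = "this is a foo bar sentence and this is another foo bar sentence".split()
unigram_freq = Counter(corpus)
bigram_freq = Counter(" ".join(pair) for pair in zip(corpus, corpus[1:]))

print(pmi("foo", "bar", unigram_freq, bigram_freq))  # well above 0: "foo bar" sticks together
```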

This is a snippet of code from the MWE library, but it is at a pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Note that it does parallel MWE extraction, so here is how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py 
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt 
$ printf "" > trg.txt 
$ python 
>>> import codecs 
>>> from mwe import load_ngramfreq, extract_mwe 

>>> # Calculates the unigrams and bigrams counts. 
>>> # More superfluously, "Training a bigram 'language model'." 
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt') 

>>> sent = "This is another foo bar sentence not in the training corpus ." 

>>> for threshold in range(-2, 4): 
...  print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)] 

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .'] 
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .'] 
0 ['this is', 'foo bar', 'bar sentence'] 
1 ['this is', 'foo bar', 'bar sentence'] 
2 ['this is', 'foo bar', 'bar sentence'] 
3 ['foo bar', 'bar sentence'] 
4 [] 

For further details, I find this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ
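In case the mwe.py link goes stale, the same threshold trick can be sketched with the standard library alone: train unigram/bigram counts on one text, then keep only those bigrams of a new sentence whose PMI clears the threshold. The toy training text below is illustrative:

```python
from __future__ import division
import math
from collections import Counter

def train(text):
    """Count unigrams and adjacent-word bigrams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def pmi(bigram, uni, bi):
    p_xy = bi[bigram] / sum(bi.values())
    p_x = uni[bigram[0]] / sum(uni.values())
    p_y = uni[bigram[1]] / sum(uni.values())
    return math.log(p_xy / (p_x * p_y), 2)

def extract_mwe(sentence, uni, bi, threshold):
    """Keep bigrams of `sentence` already seen in training whose PMI >= threshold."""
    tokens = sentence.lower().split()
    return [" ".join(b) for b in zip(tokens, tokens[1:])
            if bi[b] > 0 and pmi(b, uni, bi) >= threshold]

uni, bi = train("this is a foo bar sentence . "
                "more foo bar text keeps the foo bar bigram frequent .")
print(extract_mwe("another foo bar sentence .", uni, bi, threshold=2))
```

As in the mwe.py demo above, raising the threshold prunes more candidate bigrams.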


Is this method useful for anything other than long texts? Say, Facebook comments, or other short texts? – Haseeb


It all depends on how PMI reacts to the text, and PMI seems to be very sensitive to a high denominator / low numerator, which allows false positives. – alvas

Answer (score 3)

To answer why your results are erratic, it is important to know that Google Search is not a reliable source of word frequencies. The frequencies returned by the engine are mere estimates, and those estimates are particularly inaccurate, and possibly contradictory, when querying multiple words. This is not to bash Google, but it is simply not built for frequency counts. Hence, your implementation may be fine, but the results built on top of it can still be nonsensical.

For a more in-depth discussion of this matter, read "Googleology is bad science" by Adam Kilgarriff.

Answer (score 4)

The Python library DISSECT contains a few methods to compute Pointwise Mutual Information on co-occurrence matrices.

Example:

#ex03.py 
#------- 
from composes.utils import io_utils 
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting 

#create a space from co-occurrence counts in sparse format 
my_space = io_utils.load("./data/out/ex01.pkl") 

#print the co-occurrence matrix of the space 
print my_space.cooccurrence_matrix 

#apply ppmi weighting 
my_space = my_space.apply(PpmiWeighting()) 

#print the co-occurrence matrix of the transformed space 
print my_space.cooccurrence_matrix 

The code for the PMI methods is on GitHub.
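If installing DISSECT is not an option, the PPMI weighting it applies can be approximated in a few lines of numpy: PPMI(i, j) = max(0, log2(p(i, j) / (p(i) * p(j)))). A sketch, with a made-up 2x2 co-occurrence matrix:

```python
from __future__ import division
import numpy as np

def ppmi(counts):
    """Positive PMI weighting of a co-occurrence count matrix."""
    total = counts.sum()
    p_xy = counts / total                             # joint probabilities
    p_x = counts.sum(axis=1, keepdims=True) / total   # row marginals
    p_y = counts.sum(axis=0, keepdims=True) / total   # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts -> 0
    return np.maximum(pmi, 0.0)                       # keep only positive PMI

counts = np.array([[4.0, 0.0],
                   [0.0, 4.0]])
print(ppmi(counts))
```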

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.

Related: Calculating pointwise mutual information between two strings