2014-02-18

Forming bigrams of words from a list of sentences with Python

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do so, I get pairs of sentences instead of words. Here is what I did:

text2 = [[word for word in line.split()] for line in text] 
bigrams = nltk.bigrams(text2) 
print(bigrams) 

which produces:

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

That is, 'cant railway station' and 'citadel hotel' form one bigram pair. What I want instead is:

[([cant],[railway]),([railway],[station]),([citadel,hotel]), and so on... 

The last word of the first sentence should not merge with the first word of the second sentence. What do I need to do to make this work?


Answers

Answer (score: 23)

Use a list comprehension and zip:

>>> text = ["this is a sentence", "so is this one"] 
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])] 
>>> print(bigrams) 
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
Answer (score: 7)

Rather than turning your text into a list of lists of strings, start with each sentence as its own string. I also strip punctuation and stopwords; just remove those parts if they are not relevant to you:

import nltk 
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer 
from nltk.tokenize import WordPunctTokenizer 
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures 

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, do something like this:

for line in sentence: 
    features = get_bigrams(line) 
    # train set here 

Note that this goes a little further and actually scores the bigrams statistically (which comes in handy when training a model).
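For intuition about what the collocation-scoring step buys you, the core idea can be sketched in plain Python with `collections.Counter`. This is just raw pair frequency, not the chi-squared association measure `BigramAssocMeasures.chi_sq` actually computes, and `rank_bigrams` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

def rank_bigrams(tokens, top_n=5):
    # Count adjacent token pairs and return the most frequent ones --
    # roughly what BigramCollocationFinder does before applying a
    # statistical association measure to the counts.
    counts = Counter(zip(tokens, tokens[1:]))
    return counts.most_common(top_n)

tokens = "the cat sat on the mat the cat slept".split()
print(rank_bigrams(tokens, top_n=2))
# top entry: (('the', 'cat'), 2)
```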

The stemmer changes 'apple' to 'appl', so I end up with things like ['appl basket']. – dashesy

Yes, the Porter stemmer has some limitations. – Dan

Answer (score: 3)

Without NLTK:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr) - 1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]] 
Are those bigrams by default? Because I would need them to be spelled correctly. –

@劍 As you can see, it generates only two bigrams from the last line (right before the print). Play with it, try different sentences and see for yourself ;) – alfasin

Answer (score: 0)

>>> text = ['cant railway station','citadel hotel',' police stn'] 
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i,ele in enumerate(tex.split()) if i < len(tex.split())-1] 
>>> bigrams 
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')] 

This uses the enumerate and split functions.

Answer (score: 1)

Just fixing up Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Answer (score: 0)

import nltk
from nltk.util import ngrams

text = ['cant railway station','citadel hotel',' police stn']
bigrams = []
for line in text:
    tokens = nltk.word_tokenize(line)
    # the 2 means bigrams...you can change it for as many as you want
    bigrams.extend(ngrams(tokens, 2))
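The sliding window that `ngrams(tokens, n)` produces can also be reproduced in plain Python with `zip`, shown here for trigrams (n=3). This is a sketch of the same idea, not NLTK's implementation, and `my_ngrams` is a hypothetical name:

```python
def my_ngrams(tokens, n):
    # Slide a window of width n over the token list:
    # zip(tokens, tokens[1:], tokens[2:], ...) stops at the shortest slice,
    # so no window ever runs past the end of the sentence.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "cant railway station hotel".split()
print(my_ngrams(tokens, 3))
# [('cant', 'railway', 'station'), ('railway', 'station', 'hotel')]
```

With n=2 this reduces to the `zip(l, l[1:])` bigram trick from the top answer.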