2014-02-18

Forming bigrams of words from a list of sentences with Python

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do so, I get pairs of sentences instead of words. Here is what I did:

text2 = [[word for word in line.split()] for line in text] 
bigrams = nltk.bigrams(text2) 
print(bigrams) 

which produces:

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

That is, 'cant railway station' and 'citadel hotel' form one bigram pair. What I want instead is:

[([cant],[railway]),([railway],[station]),([citadel,hotel]), and so on... 

The last word of the first sentence should not merge with the first word of the second sentence. What do I need to do to make this work?


Answers

Answer (score: 23)

Use a list comprehension and zip:

>>> text = ["this is a sentence", "so is this one"] 
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])] 
>>> print(bigrams) 
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
Answer (score: 7)

Rather than turning your text into a list of lists of strings, start with each sentence as its own string. I also strip punctuation and stopwords; just remove those parts if they are not relevant to you:

import nltk 
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer 
from nltk.tokenize import WordPunctTokenizer 
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures 

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, do something like this:

for line in sentence: 
    features = get_bigrams(line) 
    # train set here 

Note that this goes a little further and actually scores the bigrams statistically (which comes in handy when training a model).
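For intuition about what the collocation-scoring step buys you, the core idea can be sketched in plain Python with `collections.Counter`. This is just raw pair frequency, not the chi-squared association measure `BigramAssocMeasures.chi_sq` actually computes, and `rank_bigrams` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

def rank_bigrams(tokens, top_n=5):
    # Count adjacent token pairs and return the most frequent ones --
    # roughly what BigramCollocationFinder does before applying a
    # statistical association measure to the counts.
    counts = Counter(zip(tokens, tokens[1:]))
    return counts.most_common(top_n)

tokens = "the cat sat on the mat the cat slept".split()
print(rank_bigrams(tokens, top_n=2))
# top entry: (('the', 'cat'), 2)
```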

The stemmer changes 'apple' to 'appl', so I end up with things like ['appl basket']. – dashesy

Yes, the Porter stemmer has some limitations. – Dan

Answer (score: 3)

Without NLTK:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr) - 1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]] 
Are those bigrams by default? Because I would need them to be spelled correctly. –

@劍 As you can see, it generates only two bigrams from the last line (right before the print). Play with it, try different sentences and see for yourself ;) – alfasin

Answer (score: 0)

>>> text = ['cant railway station','citadel hotel',' police stn'] 
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i,ele in enumerate(tex.split()) if i < len(tex.split())-1] 
>>> bigrams 
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')] 

This uses the enumerate and split functions.

Answer (score: 1)

Just fixing up Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Answer (score: 0)

import nltk
from nltk.util import ngrams

text = ['cant railway station','citadel hotel',' police stn']
bigrams = []
for line in text:
    tokens = nltk.word_tokenize(line)
    # the 2 means bigrams...you can change it for as many as you want
    bigrams.extend(ngrams(tokens, 2))
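The sliding window that `ngrams(tokens, n)` produces can also be reproduced in plain Python with `zip`, shown here for trigrams (n=3). This is a sketch of the same idea, not NLTK's implementation, and `my_ngrams` is a hypothetical name:

```python
def my_ngrams(tokens, n):
    # Slide a window of width n over the token list:
    # zip(tokens, tokens[1:], tokens[2:], ...) stops at the shortest slice,
    # so no window ever runs past the end of the sentence.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "cant railway station hotel".split()
print(my_ngrams(tokens, 3))
# [('cant', 'railway', 'station'), ('railway', 'station', 'hotel')]
```

With n=2 this reduces to the `zip(l, l[1:])` bigram trick from the top answer.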