Writing filtered ngrams into outfile - list of lists

I am extracting trigrams that follow a specific pattern from a bunch of HTML files. When I print them, I get a list (one trigram per line). I want to write them out for further text analysis, but when I try, only the first trigram is printed. How can I write all of the trigrams to the outfile (a list of trigram lists)? Ideally, I would like to merge all of the trigrams into a single list rather than many lists each holding one trigram. Your help would be much appreciated.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize 
from nltk import ngrams 
from bs4 import BeautifulSoup 
from string import punctuation 
import glob 
import sys 
punctuation_set = set(punctuation) 

# Open and read file 
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*') 
for filename in text: 
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f: 
        mytext = f.read() 

# Extract text from HTML using BeautifulSoup 
soup = BeautifulSoup(mytext, "lxml") 
extracted_text = soup.getText() 
extracted_text = extracted_text.replace('\n', '') 

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text) 

# Create list of tokens with their POS tags (after pre-processing: punctuation removal, tokenization, POS tagging) 
all_tokens = [] 

for sent in sentences: 
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary) 
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization) 
    all_tokens.extend(tokenized_sent) # add tagged tokens to list 

n=3 
threegrams = ngrams(all_tokens, n) 


# Find ngrams with specific pattern 
for (first, second, third) in threegrams: 
    if first == "a": 
        if second.endswith("bb") and second.startswith("leg"): 
            print(first, second, third) 

Answer

First of all, punctuation removal would have been simpler; see Removing a list of characters in string:

>>> from string import punctuation 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> text.translate(None, punctuation) 
'The lazy birds flew over the rainbow Well not have known' 
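
(A side note: str.translate(None, punctuation) is Python 2 syntax. On Python 3, a roughly equivalent sketch would build a deletion table with str.maketrans:)

>>> from string import punctuation 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> text.translate(str.maketrans('', '', punctuation))  # maps every punctuation char to None 
'The lazy birds flew over the rainbow Well not have known' 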

But it isn't really correct to remove punctuation before you tokenize; you will see that We'll -> Well, and I think that's not what we want.

Possibly, this is a better approach:

>>> from nltk import sent_tokenize, word_tokenize 
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)] 
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']] 

But do note that the idiom above doesn't handle multi-character punctuation.

E.g., we see that word_tokenize() changes " into ``, and the idiom above doesn't remove it:

>>> sent = 'He said, "There is no room for room"' 
>>> word_tokenize(sent) 
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"] 
>>> [word for word in word_tokenize(sent) if word not in punctuation] 
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"] 

To handle that, explicitly make punctuation into a list and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"' 
>>> punctuation 
'!"#$%&\'()*+,-./:;<=>[email protected][\\]^_`{|}~' 
>>> list(punctuation) 
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'] 
>>> list(punctuation) + ['...', '``', "''"] 
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"] 
>>> p = list(punctuation) + ['...', '``', "''"] 
>>> [word for word in word_tokenize(sent) if word not in p] 
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room'] 

As for getting the document stream (what you call all_tokens), here's a neater way to get it:

>>> from collections import Counter 
>>> from nltk import sent_tokenize, word_tokenize 
>>> from string import punctuation 
>>> p = list(punctuation) + ['...', '``', "''"] 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)] 
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']] 
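
And if you still want a single flat all_tokens list to feed into ngrams(), as in your code, one possible sketch (continuing the session above) is to flatten the per-sentence lists with itertools.chain.from_iterable:

>>> from itertools import chain 
>>> tokens_per_sent = [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)] 
>>> all_tokens = list(chain.from_iterable(tokens_per_sent))  # flatten the list of lists 
>>> all_tokens 
['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known'] 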

Now to the actual part of your question.

What you really need isn't to check the strings inside the ngrams; instead, you should consider regex pattern matching.

The pattern you're looking for is \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4:

>>> import re 
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha") 
['a legobatmanbb cave'] 
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha") 
[] 

To write a string to a file, you can use this idiom; see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function:

with open('filename.txt', 'w') as fout: 
    print('Hello World', end='\n', file=fout) 

In fact, if you're only interested in the ngrams and not the tokens, there's no need to filter or tokenize the text ;P

You can simplify your code to this:

import re 
from bs4 import BeautifulSoup 

# mytext is the raw HTML read from the file, as in the question's code 
soup = BeautifulSoup(mytext, "lxml") 
extracted_text = soup.getText() 
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b" 

with open('filename.txt', 'w') as fout: 
    for interesting_ngram in re.findall(pattern, extracted_text): 
        print(interesting_ngram, end='\n', file=fout) 
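
And if you want to merge the matches from all of your HTML files into one list and write them out together (which is what the question asks for), a minimal sketch that reuses the glob loop from the question might look like this (the output name filename.txt is just the placeholder from above):

import glob 
import re 
from bs4 import BeautifulSoup 

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b" 
all_matches = []  # one flat list of matches across every file 

for filename in glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*'): 
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f: 
        soup = BeautifulSoup(f.read(), "lxml") 
    all_matches.extend(re.findall(pattern, soup.getText())) 

with open('filename.txt', 'w') as fout: 
    for match in all_matches: 
        print(match, file=fout) 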

Thank you very much. I got the regex pattern matching working, but I still can't print to the text file. Done the same way, it unfortunately only prints the first line. As you may suspect, I'm completely new to this, so I'm probably doing something wrong. – Lee

Did you use the code snippet from the answer, or are you still using yours? What is the input file, can you share it? And what is the expected output? – alvas

Check your indentation. Remove all the code, then start again from the file iteration and check step by step, e.g. by printing out the contents line by line. The NLP part of the code should be fine. – alvas