Writing filtered ngrams into outfile - list of lists

I am extracting trigrams that follow a specific pattern from a bunch of HTML files. When I print them, I get a list (one trigram per line). I want to write them out for further text analysis, but when I try, only the first trigram is printed. How can I write all of the trigrams to the outfile (a list of trigram lists)? Ideally, I would like to merge all of the trigrams into a single list rather than many lists each holding one trigram. Your help would be much appreciated.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize 
from nltk import ngrams 
from bs4 import BeautifulSoup 
from string import punctuation 
import glob 
import sys 
punctuation_set = set(punctuation) 

# Open and read file 
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*') 
for filename in text: 
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f: 
        mytext = f.read() 

# Extract text from HTML using BeautifulSoup 
soup = BeautifulSoup(mytext, "lxml") 
extracted_text = soup.getText() 
extracted_text = extracted_text.replace('\n', '') 

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text) 

# Create list of tokens with their POS tags (after pre-processing: punctuation removal, tokenization, POS tagging) 
all_tokens = [] 

for sent in sentences: 
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary) 
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization) 
    all_tokens.extend(tokenized_sent) # add tagged tokens to list 

n=3 
threegrams = ngrams(all_tokens, n) 


# Find ngrams with specific pattern 
for (first, second, third) in threegrams: 
    if first == "a": 
        if second.endswith("bb") and second.startswith("leg"): 
            print(first, second, third) 

Answer

First of all, punctuation removal would have been simpler; see Removing a list of characters in string:

>>> from string import punctuation 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> text.translate(None, punctuation) 
'The lazy birds flew over the rainbow Well not have known' 
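
(A side note: str.translate(None, punctuation) is Python 2 syntax. On Python 3, a roughly equivalent sketch would build a deletion table with str.maketrans:)

>>> from string import punctuation 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> text.translate(str.maketrans('', '', punctuation))  # maps every punctuation char to None 
'The lazy birds flew over the rainbow Well not have known' 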

But it isn't really correct to remove punctuation before you tokenize; you will see that We'll -> Well, and I think that's not what we want.

Possibly, this is a better approach:

>>> from nltk import sent_tokenize, word_tokenize 
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)] 
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']] 

But do note that the idiom above doesn't handle multi-character punctuation.

E.g., we see that word_tokenize() changes " into ``, and the idiom above doesn't remove it:

>>> sent = 'He said, "There is no room for room"' 
>>> word_tokenize(sent) 
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"] 
>>> [word for word in word_tokenize(sent) if word not in punctuation] 
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"] 

To handle that, explicitly make punctuation into a list and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"' 
>>> punctuation 
'!"#$%&\'()*+,-./:;<=>[email protected][\\]^_`{|}~' 
>>> list(punctuation) 
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'] 
>>> list(punctuation) + ['...', '``', "''"] 
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"] 
>>> p = list(punctuation) + ['...', '``', "''"] 
>>> [word for word in word_tokenize(sent) if word not in p] 
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room'] 

As for getting the document stream (what you call all_tokens), here's a neater way to get it:

>>> from collections import Counter 
>>> from nltk import sent_tokenize, word_tokenize 
>>> from string import punctuation 
>>> p = list(punctuation) + ['...', '``', "''"] 
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known." 
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)] 
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']] 
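
And if you still want a single flat all_tokens list to feed into ngrams(), as in your code, one possible sketch (continuing the session above) is to flatten the per-sentence lists with itertools.chain.from_iterable:

>>> from itertools import chain 
>>> tokens_per_sent = [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)] 
>>> all_tokens = list(chain.from_iterable(tokens_per_sent))  # flatten the list of lists 
>>> all_tokens 
['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known'] 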

Now to the actual part of your question.

What you really need isn't to check the strings inside the ngrams; instead, you should consider regex pattern matching.

The pattern you're looking for is \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4:

>>> import re 
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha") 
['a legobatmanbb cave'] 
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha") 
[] 

To write a string to a file, you can use this idiom; see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function:

with open('filename.txt', 'w') as fout: 
    print('Hello World', end='\n', file=fout) 

In fact, if you're only interested in the ngrams and not the tokens, there's no need to filter or tokenize the text ;P

You can simplify your code to this:

import re 
from bs4 import BeautifulSoup 

# mytext is the raw HTML read from the file, as in the question's code 
soup = BeautifulSoup(mytext, "lxml") 
extracted_text = soup.getText() 
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b" 

with open('filename.txt', 'w') as fout: 
    for interesting_ngram in re.findall(pattern, extracted_text): 
        print(interesting_ngram, end='\n', file=fout) 
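
And if you want to merge the matches from all of your HTML files into one list and write them out together (which is what the question asks for), a minimal sketch that reuses the glob loop from the question might look like this (the output name filename.txt is just the placeholder from above):

import glob 
import re 
from bs4 import BeautifulSoup 

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b" 
all_matches = []  # one flat list of matches across every file 

for filename in glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*'): 
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f: 
        soup = BeautifulSoup(f.read(), "lxml") 
    all_matches.extend(re.findall(pattern, soup.getText())) 

with open('filename.txt', 'w') as fout: 
    for match in all_matches: 
        print(match, file=fout) 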

Thank you very much. I got the regex pattern matching working, but I still can't print to the text file. Done the same way, it unfortunately only prints the first line. As you may suspect, I'm completely new to this, so I'm probably doing something wrong. – Lee

Did you use the code snippet from the answer, or are you still using yours? What is the input file, can you share it? And what is the expected output? – alvas

Check your indentation. Remove all the code, then start again from the file iteration and check step by step, e.g. by printing out the contents line by line. The NLP part of the code should be fine. – alvas