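Writing filtered ngrams to outfile - list of lists

I extract trigrams from a set of HTML files according to a specific pattern. When I print them, I get a list of lists (each line is one trigram). I would like to write them to an outfile for further text analysis, but when I try, only the first trigram is written. How can I write all the trigrams to the outfile (the list of trigram lists)? Ideally, I would like to merge all the trigrams into one list rather than have multiple lists with one trigram each. Your help would be greatly appreciated.
My code looks like this so far: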
from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation)
# Open and read file
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()
    # Extract text from HTML using BeautifulSoup
    soup = BeautifulSoup(mytext, "lxml")
    extracted_text = soup.getText()
    extracted_text = extracted_text.replace('\n', '')
    # Split the text in sentences (using the NLTK sentence splitter)
    sentences = sent_tokenize(extracted_text)
    # Create a flat list of tokens (after pre-processing: punctuation removal, tokenization)
    all_tokens = []
    for sent in sentences:
        sent = "".join([char for char in sent if char not in punctuation_set])  # remove punctuation from sentence (optional; comment out if necessary)
        tokenized_sent = word_tokenize(sent)  # split sentence into tokens (using NLTK word tokenization)
        all_tokens.extend(tokenized_sent)  # add tokens to list
    n = 3
    threegrams = ngrams(all_tokens, n)
    # Find ngrams with specific pattern
    for (first, second, third) in threegrams:
        if first == "a":
            if second.endswith("bb") and second.startswith("leg"):
                print(first, second, third)
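For reference, here is a minimal sketch of one way to gather every matching trigram into a single list and write that list to a file in one go. It is not the asker's code: `matching_trigrams`, `write_trigrams`, the sample `docs` token lists and the output file name are all hypothetical, and plain `zip` stands in for `nltk.ngrams` so the snippet runs without NLTK. The point it illustrates is that the result list is created once, before the loop over documents, and the outfile is opened once, after the loop.

```python
import os
import tempfile


def matching_trigrams(tokens):
    """Return trigrams whose first token is "a" and whose second token
    starts with "leg" and ends with "bb" (same filter as the question)."""
    # zip over shifted copies of the list; a stand-in for nltk.ngrams(tokens, 3)
    return [
        (first, second, third)
        for first, second, third in zip(tokens, tokens[1:], tokens[2:])
        if first == "a" and second.startswith("leg") and second.endswith("bb")
    ]


def write_trigrams(trigram_list, path):
    """Write one space-separated trigram per line."""
    with open(path, "w", encoding="utf-8") as out:
        for trigram in trigram_list:
            out.write(" ".join(trigram) + "\n")


# Hypothetical token lists, one per document, standing in for the HTML files.
docs = [
    ["a", "legbb", "x", "a", "legabb", "y"],
    ["a", "leg", "z", "a", "legzbb", "w"],
]

# One list for the whole corpus, created BEFORE the loop over documents.
all_matches = []
for tokens in docs:
    all_matches.extend(matching_trigrams(tokens))

# Open the outfile once, AFTER all documents have been processed.
out_path = os.path.join(tempfile.mkdtemp(), "trigrams_out.txt")
write_trigrams(all_matches, out_path)
```

If the file were instead opened in `"w"` mode inside the per-file loop, each iteration would truncate it, which is one plausible way to end up with only a single line of output.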
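Thanks a lot. I got the regex pattern matching to work, but I still can't write to a text file. Unfortunately, it still only prints the first line in the same way. As you may suspect, I am brand new to this, so I am probably doing something wrong. – Lee
Did you use the code snippet from the answer, or are you still using your own? What is the input file, and can you share it? What is the expected output? – alvas
Check your indentation. Strip out all the code, then check it step by step starting from the file iteration, e.g. by printing the content of each file. The NLP part of the code should be fine. – alvas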
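The check suggested in the last comment could look roughly like the sketch below: strip the script down to the file loop and confirm that every file is actually visited before adding the NLP steps back. The temporary folder and the two tiny HTML files are made up for the demonstration; in the real script the `glob` pattern would point at the TEXTS folder from the question.

```python
import glob
import os
import tempfile

# Hypothetical demo folder with two small HTML files (not the asker's data).
demo_dir = tempfile.mkdtemp()
for name, body in [("one.html", "<p>first</p>"), ("two.html", "<p>second file</p>")]:
    with open(os.path.join(demo_dir, name), "w", encoding="ISO-8859-1") as f:
        f.write(body)

seen = []
for filename in sorted(glob.glob(os.path.join(demo_dir, "*"))):
    with open(filename, encoding="ISO-8859-1", errors="ignore") as f:
        mytext = f.read()
    # Anything that should run once per file must stay indented inside this loop.
    seen.append((os.path.basename(filename), len(mytext)))
    print(os.path.basename(filename), len(mytext))
```

If only one file shows up here, the problem is in the loop or the path pattern rather than in the tokenizing or filtering code.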