Asked 2017-04-18 · 23 views

How to join multiple lists in Python - BeautifulSoup/NLTK analysis

Python newbie here, working on my first web-scraping and word-frequency-analysis project using BeautifulSoup and NLTK.

I am scraping offenders' last statements from the Texas Department of Criminal Justice archive.

I have gotten to the point where I can extract the text I want to analyze from each offender's page and tokenize the words of every paragraph, but it returns one list of tokenized words per paragraph. I would like those lists combined so that a single list of tokenized words is returned for each offender for analysis.

I initially thought .join would solve my problem, but it still returns one list per paragraph. I also tried itertools, with no luck.
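For reference, `str.join` concatenates strings rather than merging lists; flattening several token lists into one is usually done with `list.extend` or `itertools.chain`. A minimal sketch with made-up tokens (stand-ins for `word_tokenize` output, not the scraped data):

```python
import itertools

# Hypothetical per-paragraph token lists
paragraph_tokens = [['I', 'am', 'sorry'], ['God', 'bless'], ['I', 'love', 'you']]

# Option 1: accumulate into one list with extend
all_tokens = []
for tokens in paragraph_tokens:
    all_tokens.extend(tokens)

# Option 2: flatten in one step with itertools.chain
chained = list(itertools.chain.from_iterable(paragraph_tokens))

print(all_tokens == chained)  # True: both yield one flat token list
```

Either form gives one list per offender, which can then be passed to `FreqDist` once instead of once per paragraph.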

Below is all the code for finding the most common words in an offender's statement, but it returns the most common word of each paragraph. Any help would be greatly appreciated!

from bs4 import BeautifulSoup 
import urllib.request 
import re 
import nltk 
from nltk import FreqDist 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords 

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html") 
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset')) 

for link in soup.find_all('a', href=re.compile('last'))[1:2]: 
    lastlist = 'https://www.tdcj.state.tx.us/death_row/'+link['href'] 
    resp2 = urllib.request.urlopen(lastlist) 
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset')) 
    body = soup2.body 

    for paragraph in body.find_all('p')[4:5]: 
     name = paragraph.text 
     print(name) 

    for paragraph in body.find_all('p')[6:]: 
     tokens = word_tokenize(paragraph.text) 
     addWords = ['I',',','Yes','.','\'m','n\'t','?',':','None','To','would','y\'all',')','Last','\'s'] 
     stopWords = set(stopwords.words('english')+addWords) 
     wordsFiltered = [] 

     for w in tokens: 
      if w not in stopWords: 
       wordsFiltered.append(w) 

     fdist1 = FreqDist(wordsFiltered) 
     common = fdist1.most_common(1) 
     print(common) 

Answer

from bs4 import BeautifulSoup 
import urllib.request 
import re 
import nltk 
from nltk import FreqDist 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords 

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html") 
soup = BeautifulSoup(resp,"lxml", from_encoding=resp.info().get_param('charset')) 
wordsFiltered = [] 
for link in soup.find_all('a', href=re.compile('last'))[1:2]: 
    lastlist = 'https://www.tdcj.state.tx.us/death_row/'+link['href'] 
    resp2 = urllib.request.urlopen(lastlist) 
    soup2 = BeautifulSoup(resp2,"lxml", from_encoding=resp2.info().get_param('charset'))  
    body = soup2.body 

    for paragraph in body.find_all('p')[4:5]: 
     name = paragraph.text 
     print(name) 


    for paragraph in body.find_all('p')[6:]: 
     tokens = word_tokenize(paragraph.text) 
     addWords = ['I',',','Yes','.','\'m','n\'t','?',':','None','To','would','y\'all',')','Last','\'s'] 
     stopWords = set(stopwords.words('english')+addWords) 


     for w in tokens: 
      if w not in stopWords: 
       wordsFiltered.append(w) 

fdist1 = FreqDist(wordsFiltered) 
common = fdist1.most_common(1) 
print(common) 

I have edited the code to get the most common word per statement. If there is something you don't understand, feel free to comment. Also, always remember not to declare a list inside a loop if you are appending to it on every iteration.
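The point about list placement can be shown with a toy loop (hypothetical data, not the scraper itself):

```python
# Declared inside the loop: the list is reset on every iteration,
# so only the items from the last batch survive.
for batch in [[1, 2], [3, 4]]:
    inner = []
    for x in batch:
        inner.append(x)
print(inner)  # [3, 4] - earlier items were discarded

# Declared once outside the loop: items accumulate across iterations.
outer = []
for batch in [[1, 2], [3, 4]]:
    for x in batch:
        outer.append(x)
print(outer)  # [1, 2, 3, 4]
```

This is exactly why moving `wordsFiltered = []` above the `for link ...` loop makes the frequency count cover the whole statement instead of a single paragraph.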

Yes - that did it! Thanks Devaraj! – MHolmer

But why do you rebuild the `stopWords` set for every single paragraph? That's a huge waste of time. – alexis

@alexis is correct. Building the stopword set should be done outside the main loop. –
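A sketch of that refactor, using a stand-in stopword set so it runs without the NLTK data download (the real code would use `stopwords.words('english')` here):

```python
# Build the stopword set once, before iterating over paragraphs.
add_words = ['I', ',', '.', 'Yes']
stop_words = set(['the', 'a', 'and'] + add_words)  # stand-in for stopwords.words('english')

# Hypothetical pre-tokenized paragraphs
paragraphs = [['I', 'love', 'the', 'world', '.'], ['Yes', 'and', 'goodbye']]

words_filtered = []
for tokens in paragraphs:
    # Set membership is O(1) per token; the set itself is never rebuilt.
    words_filtered.extend(w for w in tokens if w not in stop_words)

print(words_filtered)  # ['love', 'world', 'goodbye']
```

Since `stopwords.words('english')` and the `addWords` list never change between paragraphs, hoisting the `set(...)` construction out of both loops removes redundant work on every iteration.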