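How to scrape multiple websites to find common words (BeautifulSoup, Requests, Python3)

I would like to know how to use Beautiful Soup / Requests to scrape multiple different websites without having to repeat my code over and over.

Here is my code right now: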

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

# Fetch one page and parse it (naming the parser avoids a bs4 warning)
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content, "html.parser")
texts = soup.find_all(string=True)  # every text node on the page
# Count lowercased, whitespace-separated words
a = Counter([x.lower() for y in texts for x in y.split()])
b = a.most_common()
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)

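Ideally, I want to crawl 5 different websites, find every individual word on those sites, get the frequency of each word on each site, add up the frequencies for each specific word across all sites, and then combine all of this data into one dataframe that can be output with Pandas.

The desired output should look like this: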

Word  Frequency 
the  200 
man  300 
is  400 
tired  300

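My code can currently do this for only one website at a time, and I am trying to avoid repeating my code.

Right now, I could do this by repeating my code over and over, scraping each individual website and then concatenating the resulting dataframes, but that seems very unpythonic. I was wondering if anyone has a faster way or any advice? Thank you!

Source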

2014-08-28 user3682157

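Why not just turn your code into a function that takes the URL as input? Then you don't need to repeat the code. – joris 2014-08-28 21:39:15

...and maybe add a [loop](https://docs.python.org/2/tutorial/controlflow.html#for-statements) somewhere – 2014-08-28 21:48:30

Just loop and update a main Counter dict: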

main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards",
        "http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # pass the Counter itself so counts are summed per word
make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)

Unlike the regular dict.update, the Counter update method adds to the existing values; it does not replace them.
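To make the difference concrete, here is a minimal sketch using two made-up counters (`site_a` and `site_b` are hypothetical stand-ins for per-site word counts, not data from the question). It also shows what seems to be an easy pitfall: handing `update` a list of `(word, count)` pairs counts each pair as a key instead of summing the counts.

```python
from collections import Counter

# Hypothetical per-site counts, for illustration only.
site_a = Counter({"the": 2, "man": 1})
site_b = Counter({"the": 3, "tired": 1})

# dict.update replaces the value stored under an existing key.
plain = dict(site_a)
plain.update(site_b)

# Counter.update adds the incoming counts to the existing ones.
total = Counter()
total.update(site_a)
total.update(site_b)

# Pitfall: a list of (word, count) pairs is treated as an iterable of
# items to count, so each pair becomes a tuple key with a count of 1.
pairs = Counter()
pairs.update(site_a.most_common())

print(plain["the"], total["the"], pairs[("the", 2)])
```

Passing the Counter (or any mapping) itself is what produces per-word sums.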

On a side note, use lowercase variable names with underscores, as in make_a_frame.

Try:

comm = [[k, v] for k, v in main_c.items()]  # .items() yields (word, count) pairs
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))

Source

2014-08-28 21:52:24

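Hi Padraic, while this code lets me scrape multiple websites, it doesn't combine the data quite the way I need. What I'm hoping for is to sum the frequencies from all the websites and create two columns: column A containing the words and column B containing the total frequencies. I edited my original post to hopefully make it clearer. Thanks for your time! – user3682157 2014-08-28 23:04:27

@user3682157, try the edit – 2014-08-28 23:24:55

Unfortunately, still the same problem – user3682157 2014-08-30 16:44:01

Make a function: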

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()  # running total across all sites

def GetData(url):
    Website1 = requests.get(url)
    soup = BeautifulSoup(Website1.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    cnt.update(a)  # add this page's counts to the running total

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)

Source

2014-08-28 21:59:05 Vizjerei

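This will create a separate dataframe for each call; the OP wants them merged into one – 2014-08-28 22:01:55

@PadraicCunningham edited it – Vizjerei 2014-08-28 22:26:29

Hi Vizjerei, while this code lets me scrape multiple websites, it doesn't combine the data quite the way I need. What I'm hoping for is to sum the frequencies from all the websites and create two columns: column A containing the words and column B containing the total frequencies. I edited my original post to hopefully make it clearer. Thanks for your time! – user3682157 2014-08-28 23:04:01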
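The requirement restated in these comments (one total frequency per word across all sites) can be sketched offline as follows. This is only an illustration under stated assumptions: `count_words`, `total_frequencies`, and the `fetch_text` hook are hypothetical names, and `fake_pages` is made-up text standing in for downloaded pages.

```python
from collections import Counter
from typing import Callable, Iterable


def count_words(text: str) -> Counter:
    """Count lowercased, whitespace-separated words in one page's text."""
    return Counter(word.lower() for word in text.split())


def total_frequencies(urls: Iterable[str],
                      fetch_text: Callable[[str], str]) -> Counter:
    """Sum word counts over every URL.

    fetch_text is a hypothetical hook that returns the visible text
    of the page at the given URL.
    """
    total = Counter()
    for url in urls:
        # Passing a Counter to update() adds its counts word by word.
        total.update(count_words(fetch_text(url)))
    return total


# Stand-in pages so the sketch runs without network access (made-up text).
fake_pages = {
    "site-1": "The man is tired",
    "site-2": "the man is the man",
}
totals = total_frequencies(fake_pages, lambda url: fake_pages[url])
rows = totals.most_common()  # (word, frequency) pairs, highest count first
print(rows)
```

With real pages, `fetch_text` could wrap `requests.get` and BeautifulSoup's `get_text()`, and `pd.DataFrame(rows, columns=['Words', 'Frequency'])` should give the two-column frame described in the question.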
