2016-12-30 42 views
2

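Pythonic BeautifulSoup4: how to get the remaining titles from the "next page" link of a Wikipedia category

I successfully wrote the following code to get the titles of a Wikipedia category. The category contains more than 404 titles, but my output file only gives 200 titles/pages. How can I extend my code to follow the category's "next page" link (and any further ones) so that I get all the titles?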

Command: python3 getCATpages.py

Code of getCATpages.py:

from bs4 import BeautifulSoup 
import requests 
import csv 

#getting all the contents of a url 
url = 'https://en.wikipedia.org/wiki/Category:Free software' 
content = requests.get(url).content 
soup = BeautifulSoup(content,'lxml') 

#showing the category-pages Summary 
catPageSummaryTag = soup.find(id='mw-pages') 
catPageSummary = catPageSummaryTag.find('p') 
print(catPageSummary.text) 

#showing the category-pages only 
catPageSummaryTag = soup.find(id='mw-pages') 
tag = soup.find(id='mw-pages') 
links = tag.findAll('a') 

# giving serial numbers to the output print and limiting the print into three 
counter = 1 
for link in links[:3]: 
    print ('''  '''+str(counter) + " " + link.text) 
    counter = counter + 1 

#getting the category pages 
catpages = soup.find(id='mw-pages') 
whatlinksherelist = catpages.find_all('li') 
things_to_write = [] 
for titles in whatlinksherelist: 
    things_to_write.append(titles.find('a').get('title')) 

#writing the category pages as a output file 
with open('001-catPages.csv', 'a') as csvfile: 
    writer = csv.writer(csvfile,delimiter="\n") 
    writer.writerow(things_to_write) 

Answers

2

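The idea is to keep following the "next page" link until the page no longer has one. Because we make several requests while collecting the link titles into a list, we keep a single web-scraping session open: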

from pprint import pprint 
from urllib.parse import urljoin 

from bs4 import BeautifulSoup 
import requests 


base_url = 'https://en.wikipedia.org/wiki/Category:Free software' 


def get_next_link(soup): 
    return soup.find("a", text="next page") 

def extract_links(soup): 
    return [a['title'] for a in soup.select("#mw-pages li a")] 


with requests.Session() as session: 
    content = session.get(base_url).content 
    soup = BeautifulSoup(content, 'lxml') 

    links = extract_links(soup) 
    next_link = get_next_link(soup) 
    while next_link is not None:  # while there is a "next page" link
        url = urljoin(base_url, next_link['href'])
        content = session.get(url).content
        soup = BeautifulSoup(content, 'lxml')

        links += extract_links(soup)

        next_link = get_next_link(soup)

pprint(links) 

Prints:

['Free software', 
'Open-source model', 
'Outline of free software', 
'Adoption of free and open-source software by public institutions', 
... 
'ZK Spreadsheet', 
'Zulip', 
'Portal:Free and open-source software'] 

The CSV-writing part is not relevant here and has been omitted.
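For completeness, here is a minimal sketch of how the collected titles could be written out. The function name `write_titles` and the choice of one title per row are assumptions for illustration, not part of the original answer. Unlike the question's code, it opens the file in 'w' mode, so running the script twice does not append duplicate rows.

```python
import csv


def write_titles(titles, path):
    """Write one title per row to a CSV file, replacing any existing file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        for title in titles:
            writer.writerow([title])
```

With the answer's code, this would be called as write_titles(links, '001-catPages.csv') after the session block.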

+0

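Some maintenance categories contain several hundred thousand pages, for example [https://en.wikipedia.org/wiki/Category:Commons_category_with_local_link_same_as_on_Wikidata 288,935 pages]. To avoid loading the server, is it possible to put a 60-second interval between next-page requests? –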

+0

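@info-farmer You would need to adjust the code to work through the next pages in chunks. And yes, adding a time delay so that you do not hit Wikipedia too often is a good idea. Good thought, thanks. Also, see whether Scrapy would make this easier, since it simplifies navigating to next pages. – alecxe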

+0

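Sorry! I am still practicing in order to understand English, typing, and programming well. I contribute to the Tamil wiki rather than the English wiki. The code above is very useful to us. If possible, please recode it with the time interval. –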
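One way the requested delay might look is sketched below. This is not alecxe's code: `collect_with_delay`, the `fetch_page` callable, and its (titles, next_url) return shape are all assumptions made for this illustration. The only library call that matters is time.sleep, which pauses between consecutive requests and is skipped after the last page.

```python
import time


def collect_with_delay(start_url, fetch_page, delay_seconds=60, sleep=time.sleep):
    """Follow "next page" links, pausing between consecutive requests.

    fetch_page(url) must return (titles, next_url), with next_url set to
    None on the last page. Both the callable and its return shape are
    assumptions of this sketch.
    """
    titles = []
    url = start_url
    while url is not None:
        page_titles, url = fetch_page(url)
        titles.extend(page_titles)
        if url is not None:
            sleep(delay_seconds)  # wait only when another request will follow
    return titles
```

To plug this into the answer above, fetch_page would fetch the URL through the session, build the soup, and return extract_links(soup) together with the urljoin-ed href of get_next_link(soup), or None when there is no such link. The sleep parameter exists only so the pause can be replaced in tests.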

1

The MediaWiki API provides a generator for this. Here is some code, adapted from the example given in the MediaWiki documentation, that makes use of it.

import requests 

def query(request):
    # Fixed parameters: use the categorymembers generator, subcategories only.
    request['action'] = 'query'
    request['format'] = 'json'
    request['generator'] = 'categorymembers'
    request['gcmtype'] = 'subcat'
    previousContinue = {}
    while True:
        # Merge the continuation token from the previous batch, if any.
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        if 'continue' in result:
            previousContinue = {'gcmcontinue': result['continue']['gcmcontinue']}
        else:
            break

for result in query({'gcmtitle': 'Category:Free_software' }): 
    print (result) 

I felt justified in adapting a snippet shown elsewhere, because I could not find complete MediaWiki documentation for this.

Here is the output of this script.

{'pages': {'42113821': {'pageid': 42113821, 'ns': 14, 'title': 'Category:Free software by type'}, '6702554': {'pageid': 6702554, 'ns': 14, 'title': 'Category:Free application software'}, '12180074': {'pageid': 12180074, 'ns': 14, 'title': 'Category:Free software by programming language'}, '6962224': {'pageid': 6962224, 'ns': 14, 'title': 'Category:Free software lists and comparisons'}, '39563179': {'pageid': 39563179, 'ns': 14, 'title': 'Category:Bitcoin'}, '34482991': {'pageid': 34482991, 'ns': 14, 'title': 'Category:Free-software awards'}, '30945256': {'pageid': 30945256, 'ns': 14, 'title': 'Category:Single-platform free software'}, '49967344': {'pageid': 49967344, 'ns': 14, 'title': 'Category:Free software by license'}, '6721544': {'pageid': 6721544, 'ns': 14, 'title': 'Category:Free system software'}, '34313543': {'pageid': 34313543, 'ns': 14, 'title': 'Category:Cross-platform free software'}}} 
{'pages': {'39630972': {'pageid': 39630972, 'ns': 14, 'title': 'Category:Free and open-source Android software'}, '33751817': {'pageid': 33751817, 'ns': 14, 'title': 'Category:Copyleft'}, '40888749': {'pageid': 40888749, 'ns': 14, 'title': 'Category:Free and open-source software'}, '25128034': {'pageid': 25128034, 'ns': 14, 'title': 'Category:Open data'}, '5446650': {'pageid': 5446650, 'ns': 14, 'title': 'Category:Free software culture and documents'}, '7298930': {'pageid': 7298930, 'ns': 14, 'title': 'Category:Creative Commons'}, '21140817': {'pageid': 21140817, 'ns': 14, 'title': 'Category:Free communication software'}, '7457597': {'pageid': 7457597, 'ns': 14, 'title': 'Category:Software forks'}, '34474935': {'pageid': 34474935, 'ns': 14, 'title': 'Category:Free software distributions'}, '34482997': {'pageid': 34482997, 'ns': 14, 'title': 'Category:Free-software events'}}} 
{'pages': {'34348162': {'pageid': 34348162, 'ns': 14, 'title': 'Category:Free and open-source software licenses'}, '703116': {'pageid': 703116, 'ns': 14, 'title': 'Category:Free software projects'}, '39630965': {'pageid': 39630965, 'ns': 14, 'title': 'Category:History of free and open-source software'}, '1358456': {'pageid': 1358456, 'ns': 14, 'title': 'Category:GNU Project software'}, '34313891': {'pageid': 34313891, 'ns': 14, 'title': 'Category:Free mobile software'}, '6687643': {'pageid': 6687643, 'ns': 14, 'title': 'Category:Free computer programming tools'}, '39401957': {'pageid': 39401957, 'ns': 14, 'title': 'Category:Open-source software hosting facilities'}, '38962158': {'pageid': 38962158, 'ns': 14, 'title': 'Category:Open-source robots'}, '21840815': {'pageid': 21840815, 'ns': 14, 'title': 'Category:Free multilingual software'}, '52773626': {'pageid': 52773626, 'ns': 14, 'title': 'Category:Open source artificial intelligence'}}} 
{'pages': {'35912174': {'pageid': 35912174, 'ns': 14, 'title': 'Category:Free technical analysis software'}, '4530452': {'pageid': 4530452, 'ns': 14, 'title': 'Category:Free software stubs'}, '40516443': {'pageid': 40516443, 'ns': 14, 'title': 'Category:Works about free software'}, '49310608': {'pageid': 49310608, 'ns': 14, 'title': 'Category:Public-domain software with source code'}, '952642': {'pageid': 952642, 'ns': 14, 'title': 'Category:Public-domain software'}, '1819021': {'pageid': 1819021, 'ns': 14, 'title': 'Category:Free software websites'}, '46441720': {'pageid': 46441720, 'ns': 14, 'title': 'Category:Free software webmail'}, '36794168': {'pageid': 36794168, 'ns': 14, 'title': 'Category:Free speech synthesis software'}, '6643120': {'pageid': 6643120, 'ns': 14, 'title': 'Category:Free screen readers'}, '34403011': {'pageid': 34403011, 'ns': 14, 'title': 'Category:Open science'}}} 
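Each printed batch is a dict keyed by page ID, and every entry has ns 14 because the script sets gcmtype to 'subcat', so it lists subcategories rather than articles. To get the article titles the question asks about, changing that value to 'page' should work, though that is an untested assumption here. The helper below is a hypothetical sketch (the name `titles_from_batches` is not from the answer) that flattens such batches into a plain list of titles.

```python
def titles_from_batches(batches):
    """Flatten query() result batches into a sorted list of titles."""
    titles = []
    for batch in batches:
        # Each batch looks like {'pages': {page_id: {'title': ...}, ...}}.
        for page in batch.get('pages', {}).values():
            titles.append(page['title'])
    return sorted(titles)
```

With the script above it could be called as titles_from_batches(query({'gcmtitle': 'Category:Free_software'})).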