Scraping a page for link titles and URLs with BeautifulSoup

I have a web page of top articles, and for each article I want to scrape the hyperlink to the referenced page and the title it displays.

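The desired output of my script is a CSV file that lists each title and the article content in one row. So if there are 50 articles on this web page, I want one file with 50 rows and 100 data points.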
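For illustration, that layout could be produced with the standard csv module. This is only a sketch: the column names, the write_articles_csv helper, and the sample rows are assumptions made up for the example, not part of the original script.

```python
import csv
import io


def write_articles_csv(rows, out):
    """Write (title, content) pairs to the file-like object `out`, one article per row."""
    writer = csv.writer(out)
    writer.writerow(["title", "content"])  # header row; column names are assumed
    writer.writerows(rows)


# Hypothetical sample data standing in for the scraped results.
sample = [
    ("The Ideal Nap Length", "Short naps help, long naps can leave you groggy."),
    ("Why Doctors Die Differently", "An essay on end-of-life choices."),
]

# An in-memory buffer keeps the demo free of side effects.
buffer = io.StringIO()
write_articles_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass open('top_articles.csv', 'w', newline='', encoding='utf-8') as out. The csv module quotes any field that contains a comma, so article text does not break the two-column layout.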

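My problem is that the article titles and their hyperlinks are contained in an SVG container, which is throwing me off. I have used BeautifulSoup for web scraping before, but I am not sure how to select each article's title and hyperlink. Any and all help is much appreciated.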

import requests
from bs4 import BeautifulSoup
import re

res = requests.get('http://fundersandfounders.com/what-internet-thinks-based-on-media/')
res.raise_for_status()

# Save the page to disk in chunks; the with block closes the file before it is read back
with open('top_articles.html', 'wb') as playFile:
    for chunk in res.iter_content(100000):
        playFile.write(chunk)

# Parse the saved copy once, after the download has finished
with open('top_articles.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

links = soup.select('p')  # I know this is where I'm messing up, but I'm not sure which selector to actually use, so the paragraph selector is a placeholder
print(links)

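I know this is really a two-step project: the current version of my script does not yet iterate through the list of hyperlinks whose actual content I am going to scrape. That second step is something I can easily do myself, but if anyone wants to write it too, kudos to you.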
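A rough sketch of that second step, assuming the hyperlinks are already collected in a list. The names page_text and scrape_articles are hypothetical, and the fetch function is passed in so the loop can be tried without touching the network; here a dict of made-up pages stands in for the real sites.

```python
from bs4 import BeautifulSoup


def page_text(html):
    """Return the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents are not mixed into the text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def scrape_articles(urls, fetch):
    """Call fetch(url) for every URL and pair each URL with its page text.

    A URL whose fetch raises is recorded with None instead of stopping the run.
    """
    results = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception:
            results.append((url, None))
            continue
        results.append((url, page_text(html)))
    return results


# Offline demo: a dict stands in for the network (hypothetical URLs and markup).
fake_pages = {
    "http://example.com/a": "<html><body><h1>Hello</h1><p>World <b>now</b></p>"
                            "<script>x = 1</script></body></html>",
}


def fake_fetch(url):
    return fake_pages[url]  # raises KeyError for unknown URLs


results = scrape_articles(
    ["http://example.com/a", "http://example.com/missing"], fake_fetch
)
print(results)
```

With requests, fetch could be something like lambda url: session.get(url, timeout=10).text inside a requests.Session block. Taking the whole page text is a crude stand-in for "article content"; each site would probably need its own selector.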


2017-01-09 dataelephant

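You should do this in two steps:

1. Parse the HTML and extract the link to the SVG.
2. Download the SVG page, parse it with BeautifulSoup, and extract the "bubbles".

Implementation: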

from urllib.parse import urljoin  # Python 3

import requests
from bs4 import BeautifulSoup


base_url = 'http://fundersandfounders.com/what-internet-thinks-based-on-media/'

with requests.Session() as session:
    # extract the link to the svg
    res = session.get(base_url)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])

    # download and parse the svg
    res = session.get(svg_link)
    soup = BeautifulSoup(res.content, 'html.parser')
    for article in soup.select("#bubbles .bgroup"):
        title, resource = [item.get_text(strip=True, separator=" ") for item in article.select("a text")]
        print("Title: '%s'; Resource: '%s'." % (title, resource))

This prints the article titles and resources:

Title: 'CNET'; Resource: 'Android Apps That Extend Battery Life'. 
Title: '5-Years-Old Shoots Sister'; Resource: 'CNN'. 
Title: 'Samsung Galaxy Note II'; Resource: 'Engaget'. 
... 
Title: 'Predicting If a Couple Stays Together'; Resource: 'The Atlantic Magazine'. 
Title: 'Why Doctors Die Differently'; Resource: 'The Wall Street Journal'. 
Title: 'The Ideal Nap Length'; Resource: 'Lifehacker'.
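
The question also asks for each article's hyperlink, which the loop above does not print. A minimal sketch of how it might be read, under two assumptions that were not checked against the real file: every bubble's a element carries its target in an xlink:href (or plain href) attribute, and the markup looks roughly like the made-up SAMPLE_SVG below. The extract_links helper is hypothetical as well.

```python
from bs4 import BeautifulSoup

# Hypothetical SVG fragment shaped to match the selectors used in the answer;
# the real file's markup may differ.
SAMPLE_SVG = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <g id="bubbles">
    <g class="bgroup">
      <a xlink:href="http://example.com/nap"><text><tspan>The Ideal</tspan><tspan>Nap Length</tspan></text></a>
      <a xlink:href="http://example.com/lifehacker"><text>Lifehacker</text></a>
    </g>
  </g>
</svg>
"""


def extract_links(svg_markup):
    """Return (label, url) pairs for every link inside a bubble group."""
    soup = BeautifulSoup(svg_markup, "html.parser")
    rows = []
    for group in soup.select("#bubbles .bgroup"):
        for anchor in group.select("a"):
            label = anchor.get_text(strip=True, separator=" ")
            # Assumption: the target lives in xlink:href, with plain href as a fallback.
            url = anchor.get("xlink:href") or anchor.get("href")
            rows.append((label, url))
    return rows


links = extract_links(SAMPLE_SVG)
for label, url in links:
    print("%s -> %s" % (label, url))
```

If the real links are relative, urljoin(svg_link, url) would resolve them the same way the answer resolves the SVG address.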


2017-01-09 21:16:33 alecxe

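Thanks for the quick response. Which module do I need to install (for Python 3) to use urllib.parse and urljoin? I can't seem to find it. – dataelephant

@Harelephant 'urllib' is built in, there is no need to install it. – alecxe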
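
As a quick check that nothing needs installing, urljoin can be imported straight from the standard library. The sketch below shows how it resolves three kinds of value against the page address; the chart.svg paths are made up for illustration and are not the real data attribute.

```python
from urllib.parse import urljoin  # standard library, Python 3

base_url = "http://fundersandfounders.com/what-internet-thinks-based-on-media/"

# Hypothetical values for the object's data attribute.
root_relative = urljoin(base_url, "/images/chart.svg")          # starts at the site root
page_relative = urljoin(base_url, "chart.svg")                  # appended to the page's path
already_absolute = urljoin(base_url, "http://example.com/chart.svg")  # returned unchanged

print(root_relative)
print(page_relative)
print(already_absolute)
```

This is why the answer can pass svg["data"] to urljoin without first checking whether the value is relative or absolute.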
