2017-10-16 73 views
1

Creating a list of URLs from a specific website

This is my first time trying to use programming to get something useful done, so please bear with me. Constructive feedback is very much appreciated :)

I am building a database of all the press releases from the European Parliament. So far I have built a scraper that can retrieve the data I want from one specific URL. However, after reading several tutorials, I still cannot figure out how to create a list of URLs for all the press releases on this particular site.

Maybe it has to do with how the site is built, or I am (probably) just missing something obvious that an experienced programmer would spot right away, but I really do not know how to proceed from here.

This is the start URL: http://www.europarl.europa.eu/news/en/press-room

This is my code:

import re
import time
from random import randint
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

links = [] # Until now I have just manually pasted a few links
           # into this list, but I need it to contain all the URLs to scrape

# Function for removing html tags from text 
TAG_RE = re.compile(r'<[^>]+>') 
def remove_tags(text): 
    return TAG_RE.sub('', text) 

# Regex to match dates with pattern DD-MM-YYYY 
date_match = re.compile(r'\d\d-\d\d-\d\d\d\d') 

# For-loop to scrape variables from site 
for link in links: 

    # Opening up connection and grabbing page 
    uClient = uReq(link) 

    # Saves content of page in new variable (still in HTML!!) 
    page_html = uClient.read() 

    # Close connection 
    uClient.close() 

    # Parsing page with soup 
    page_soup = soup(page_html, "html.parser") 

    # Grabs page 
    pr_container = page_soup.findAll("div",{"id":"website"}) 

    # Scrape date 
    date_container = pr_container[0].time 
    date = date_container.text 
    date = date_match.search(date) 
    date = date.group() 

    # Scrape title 
    title = page_soup.h1.text 
    title_clean = title.replace("\n", " ") 
    title_clean = title_clean.replace("\xa0", "") 
    title_clean = ' '.join(title_clean.split()) 
    title = title_clean 

    # Scrape institutions involved 
    type_of_question_container = pr_container[0].findAll("div", {"class":"ep_subtitle"}) 
    text = type_of_question_container[0].text 
    question_clean = text.replace("\n", " ") 
    question_clean = question_clean.replace("\xa0", " ") 
    question_clean = re.sub(r"\d+", "", question_clean) # Redundant? 
    question_clean = question_clean.replace("-", "") 
    question_clean = question_clean.replace(":", "") 
    question_clean = question_clean.replace("Press Releases"," ") 
    question_clean = ' '.join(question_clean.split()) 
    institutions_mentioned = question_clean 

    # Scrape text 
    text_container = pr_container[0].findAll("div", {"class":"ep-a_text"}) 
    text_with_tags = str(text_container) 
    text_clean = remove_tags(text_with_tags) 
    text_clean = text_clean.replace("\n", " ") 
    text_clean = text_clean.replace(",", " ") # Removing commas to avoid trouble with .csv-format later on 
    text_clean = text_clean.replace("\xa0", " ") 
    text_clean = ' '.join(text_clean.split()) 

    # Calculate word count 
    word_count = len(text_clean.split()) 
    word_count = str(word_count) 

    print("Finished scraping: " + link) 

    time.sleep(randint(1,5)) 

    # Write one CSV row per press release (this assumes f is a file object
    # opened for writing earlier in the script)
    f.write(date + "," + title + "," + institutions_mentioned + "," + word_count + "," + text_clean + "\n") 

f.close() 
+0

In HTML there are standard attributes for putting URLs in the markup: src, href for all links, and action. src belongs to tags like ('script', 'img', 'source', 'video'), href to ('a', 'link', 'area', 'base') and action to ('form', 'if', 'input'). First you need to extract those tags, then extract the src, href or action sub_tag of each one (no need to parse anything else or remove dirty strings). With this method you can extract all standard HTML URLs, and you can do it with the beautifulsoup module and two FORs! – DRPK
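A rough sketch of what that comment describes, assuming a simple attribute-to-tag mapping (the tag lists and the starting URL are illustrative, not exhaustive):

import requests 
from bs4 import BeautifulSoup 

# Which tags carry URLs in which attribute (illustrative subset) 
url_attrs = { 
    "src": ["script", "img", "source", "video"], 
    "href": ["a", "link", "area", "base"], 
    "action": ["form"], 
} 

page = requests.get("http://www.europarl.europa.eu/news/en/press-room") 
page_soup = BeautifulSoup(page.text, "html.parser") 

all_urls = [] 
for attr, tags in url_attrs.items():        # first loop: attribute groups 
    for tag in page_soup.find_all(tags):    # second loop: matching tags 
        if tag.get(attr): 
            all_urls.append(tag[attr]) 
print(all_urls) 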

Answers

1

Here is a simple way to get the required list of links with python-requests and lxml:

from lxml import html 
import requests 

url = "http://www.europarl.europa.eu/news/en/press-room/page/" 
list_of_links = [] 
for page in range(10):                      # first 10 listing pages 
    r = requests.get(url + str(page)) 
    source = r.content 
    page_source = html.fromstring(source) 
    # Every press release on a listing page is linked via a "Read more" anchor 
    list_of_links.extend(page_source.xpath('//a[@title="Read more"]/@href')) 
print(list_of_links) 
+0

Thank you very much for the feedback. I was wondering if you could clarify how I can tell whether a website is dynamic? Your approach works for the first 15 links on the initial URL, but do I need the Selenium module to "click" the Load More button? –

+1

If the content is present in the page source, it is static content; if it is generated by JavaScript, it is dynamic content. Simply put, you can view the page source by right-clicking the web page in your browser: if you can find the content you need there, it is static; if not, it is dynamic. – Andersson
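As a quick check from Python instead of the browser, a hedged sketch (the 'Read more' title attribute is simply what the current listing markup uses):

import requests 

source = requests.get("http://www.europarl.europa.eu/news/en/press-room").text 
# If the link markup is already in the raw HTML, the listing is static enough to scrape without a browser 
print('title="Read more"' in source) 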

+1

@DanielHansen, you can check the updated answer, which works for the first 10 pages (150 links). You can set a larger range or replace the for loop with while. – Andersson
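For instance, a sketch of that while variant, stopping once a listing page yields no more links (same URL pattern as in the answer above):

from lxml import html 
import requests 

url = "http://www.europarl.europa.eu/news/en/press-room/page/" 
list_of_links = [] 
page = 0 
while True: 
    r = requests.get(url + str(page)) 
    links = html.fromstring(r.content).xpath('//a[@title="Read more"]/@href') 
    if not links:       # no press releases on this page, assume we have reached the end 
        break 
    list_of_links.extend(links) 
    page += 1 
print(len(list_of_links)) 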

0

EDIT: The first 15 URLs can be obtained without using the Selenium module.


You cannot use urllib.request (which I assume is what you are using) to get the press release URLs, because the content of this site is loaded dynamically.

You could try using the Selenium module.

from bs4 import BeautifulSoup 
from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.common.by import By 

driver = webdriver.Firefox() 
driver.get('http://www.europarl.europa.eu/news/en/press-room') 

# Click "Load More", repeat these as you like 
WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "continuesLoading_button"))) 
driver.find_element_by_id("continuesLoading_button").click() 

# Get urls 
soup = BeautifulSoup(driver.page_source, "html.parser") 
urls = [a["href"] for a in soup.select(".ep_gridrow-content .ep_title a")] 
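If you want to keep clicking instead of repeating those two lines by hand, here is a minimal sketch that builds on the code above; it assumes the button keeps the id continuesLoading_button and simply stops once it no longer becomes clickable:

from selenium.common.exceptions import TimeoutException 

while True: 
    try: 
        WebDriverWait(driver, 10).until( 
            EC.element_to_be_clickable((By.ID, "continuesLoading_button"))) 
        driver.find_element_by_id("continuesLoading_button").click() 
    except TimeoutException: 
        break  # the button never became clickable again, assume everything is loaded 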
+0

No. This content is not dynamic. – Andersson

0

You can read the official BeautifulSoup documentation to get better at scraping. You should also check out Scrapy.

Below is a simple snippet that grabs the links you need from that page.
In the following example I use the Requests library. Let me know if you have any other questions.

This script will not click "Load More" and load the additional releases, though.
I will leave that to you ;) (Hint: use Selenium or Scrapy.)

import requests 
from bs4 import BeautifulSoup 

def scrape_press(url): 
    page = requests.get(url) 

    if page.status_code == 200: 
        urls = list() 
        soup = BeautifulSoup(page.content, "html.parser") 
        body = soup.find_all("h3", {"class": ["ep-a_heading", "ep-layout_level2"]}) 
        for b in body: 
            links = b.find_all("a", {"title": "Read more"}) 
            if len(links) == 1: 
                link = links[0]["href"] 
                urls.append(link) 

        # Printing the scraped links 
        for _ in urls: 
            print(_) 
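For example, calling it on the start URL from the question (as the answer intends) prints each press-release link:

scrape_press("http://www.europarl.europa.eu/news/en/press-room") 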

Note: you should scrape data from a website if and only if it is legal to do so.

1

You can grab the links with requests and BeautifulSoup in only six lines of code. Although the script is almost identical to Andersson's, the libraries applied here and their usage are slightly different.

import requests ; from bs4 import BeautifulSoup 

base_url = "http://www.europarl.europa.eu/news/en/press-room/page/{}" 
for url in [base_url.format(page) for page in range(10)]: 
    soup = BeautifulSoup(requests.get(url).text,"lxml") 
    for link in soup.select('[title="Read more"]'): 
        print(link['href'])