Creating a list of URLs from a specific website

This is my first time trying to use programming for something useful, so please bear with me. Constructive feedback is very much appreciated :)
I am building a database of all press releases from the European Parliament. So far, I have built a scraper that can retrieve the data I want from one specific URL. However, after reading several tutorials, I still cannot figure out how to create a list of the URLs of all the press releases on this particular site.

Maybe it has to do with how the website is built, or I am (probably) just missing something obvious that an experienced programmer would spot right away, but I really don't know how to proceed from here.
This is the starting URL: http://www.europarl.europa.eu/news/en/press-room
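From that starting page, collecting the individual press-release links might look something like the sketch below. This is only a sketch under assumptions I have not verified against the live site: that the listing page is static HTML, and that each press release is linked with an href containing "/news/en/press-room/". If the listing is paginated or loaded via JavaScript (e.g. a "Load more" button), this would only pick up the first batch, and the filter would need adjusting.

from urllib.request import urlopen
from bs4 import BeautifulSoup

start_url = "http://www.europarl.europa.eu/news/en/press-room"

client = urlopen(start_url)
listing_html = client.read()
client.close()

listing_soup = BeautifulSoup(listing_html, "html.parser")

links = []
for a in listing_soup.find_all("a", href=True):
    href = a["href"]
    # Keep only links that look like individual press releases (assumed pattern)
    if "/news/en/press-room/" in href:
        # Turn site-relative hrefs into absolute URLs
        if href.startswith("/"):
            href = "http://www.europarl.europa.eu" + href
        if href not in links:
            links.append(href)

print(len(links), "press-release URLs collected")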
This is my code:
import re
import time
from random import randint
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

links = []  # Until now I have just manually pasted a few links
# into this list, but I need it to contain all the URLs to scrape

# Function for removing HTML tags from text
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Regex to match dates with pattern DD-MM-YYYY
date_match = re.compile(r'\d\d-\d\d-\d\d\d\d')

# Output file for the scraped rows (filename is illustrative)
f = open("press_releases.csv", "w", encoding="utf-8")

# For-loop to scrape variables from each press release
for link in links:
    # Opening up connection and grabbing page
    uClient = uReq(link)
    # Saves content of page in new variable (still in HTML!!)
    page_html = uClient.read()
    # Close connection
    uClient.close()

    # Parsing page with soup
    page_soup = soup(page_html, "html.parser")

    # Grabs the container that holds the press release
    pr_container = page_soup.findAll("div", {"id": "website"})

    # Scrape date
    date_container = pr_container[0].time
    date = date_container.text
    date = date_match.search(date).group()

    # Scrape title
    title = page_soup.h1.text
    title_clean = title.replace("\n", " ")
    title_clean = title_clean.replace("\xa0", "")
    title = ' '.join(title_clean.split())

    # Scrape institutions involved
    type_of_question_container = pr_container[0].findAll("div", {"class": "ep_subtitle"})
    text = type_of_question_container[0].text
    question_clean = text.replace("\n", " ")
    question_clean = question_clean.replace("\xa0", " ")
    question_clean = re.sub(r"\d+", "", question_clean)  # Redundant?
    question_clean = question_clean.replace("-", "")
    question_clean = question_clean.replace(":", "")
    question_clean = question_clean.replace("Press Releases", " ")
    institutions_mentioned = ' '.join(question_clean.split())

    # Scrape text
    text_container = pr_container[0].findAll("div", {"class": "ep-a_text"})
    text_with_tags = str(text_container)
    text_clean = remove_tags(text_with_tags)
    text_clean = text_clean.replace("\n", " ")
    text_clean = text_clean.replace(",", " ")  # Removing commas to avoid trouble with .csv-format later on
    text_clean = text_clean.replace("\xa0", " ")
    text_clean = ' '.join(text_clean.split())

    # Calculate word count
    word_count = str(len(text_clean.split()))

    print("Finished scraping: " + link)
    time.sleep(randint(1, 5))
    f.write(date + "," + title + "," + institutions_mentioned + "," + word_count + "," + text_clean + "\n")

f.close()
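Since the code strips commas from the text only to keep the .csv columns intact, Python's standard csv module might be a safer choice: csv.writer quotes fields that contain commas, so the text can be stored verbatim. A minimal sketch (the filename, column names and example values are illustrative):

import csv

# Example row; in the scraper these values come from the loop above
row = ["04-05-2018", "Some title", "Plenary session", "250", "Body text, commas intact"]

with open("press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "title", "institutions_mentioned", "word_count", "text"])
    writer.writerow(row)  # fields containing commas are quoted automatically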
HTML has standard ways of carrying URLs: in HTML we have src, href and action for all links. For src => ('script', 'img', 'source', 'video'), for href => ('a', 'link', 'area', 'base'), and for action => ('form', 'iframe', 'input'). First you need to extract these tags, then extract each one's src, href or action attribute (no need to parse anything or clean up dirty strings). With this method you can extract all the standard HTML URLs, and you can do it with the beautifulsoup module and two FORs! – DRPK
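A sketch of what that two-loop approach could look like with BeautifulSoup. The tag/attribute groups mirror the comment above; note that the result includes script, image and form URLs too, so the press-release links would still need filtering afterwards:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page_html = urlopen("http://www.europarl.europa.eu/news/en/press-room").read()
page_soup = BeautifulSoup(page_html, "html.parser")

# Tag groups per URL-bearing attribute, as listed in the comment above
url_attrs = {
    "src": ["script", "img", "source", "video"],
    "href": ["a", "link", "area", "base"],
    "action": ["form", "iframe", "input"],
}

urls = []
for attr, tags in url_attrs.items():        # first FOR: one pass per attribute
    for tag in page_soup.find_all(tags):    # second FOR: every matching tag
        value = tag.get(attr)               # None if the attribute is absent
        if value:
            urls.append(value)

print(len(urls), "URLs extracted")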