Extracting HTML information from a URL

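I am trying to write a Python program that reads all the data from a web page and appends the contents of any heading tags, <h1> through <h6>, to a list. So far I am only trying to fetch the site's content in the first place, and that is proving difficult.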

Edit: This is for a class. Unfortunately, we are not allowed to use libraries that do not come pre-installed with Python.

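Edit 2: Thanks for all the tips. The program now successfully reads the HTML of a given website. Does anyone have suggestions for searching the page for specific strings (i.e. the <h> tags)?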

import urllib.request

#example URL that includes an <h> tag: http://www.hobo-web.co.uk/headers/ 
userAddress = input("Enter a website URL: ") 

webPage = urllib.request.urlopen(userAddress) 

print (webPage.read()) 

webPage.close()

Source

2015-12-13 Cameron

http://docs.python-requests.org/en/latest/ and http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – pvg

I assume you are using Python 3 to fetch the web page. You can fetch it with the code below:

import urllib.request

address = "http://www.hobo-web.co.uk/headers/" 
webPage = urllib.request.urlopen(address) 

print (webPage.read())

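To pull information out of the web page, you can use BeautifulSoup. It is a very capable tool for extracting information from web pages: you can use it to extract tables, lists, and paragraphs, and you can also apply filters to select the elements you want.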

Install it from here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Source

2015-12-13 22:22:24 perfectus

I suggest using the requests library.

import requests 

r = requests.get('http://www.hobo-web.co.uk/') 
print(r.text)

See the quickstart at http://docs.python-requests.org/en/latest/user/quickstart/

Source

2015-12-13 22:10:46 zsoobhan

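Check out the documentation for the BeautifulSoup library. It is an API for parsing the DOM tree. You can do things like soup.find_all('h1'), which returns a list of all h1 elements.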
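Since the question rules out third-party packages, a similar result can be sketched with the standard library's html.parser module. This is only a minimal sketch: the names HeadingCollector and extract_headings are hypothetical helpers made up for illustration, and it assumes the page has already been decoded to a str and that headings are not nested inside one another.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headings = []     # finished heading texts, in document order
        self._open_tag = None  # heading tag currently being read, if any
        self._parts = []       # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> both match.
        if tag in HEADING_TAGS and self._open_tag is None:
            self._open_tag = tag
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a heading.
        if self._open_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the heading only when its own end tag arrives.
        if tag == self._open_tag:
            self.headings.append("".join(self._parts).strip())
            self._open_tag = None


def extract_headings(html_text):
    """Return a list with the text of each heading found in html_text."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headings
```

With the HTML already read into a string, extract_headings(html_text) gives back the list the question asks for. Text inside inline tags such as <em> within a heading is kept, because only the heading's own end tag closes it.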

Source

2015-12-13 22:13:45

It is better to use a with block so that the connection is closed automatically. Here is an example:

import urllib.request 
address = "http://www.hobo-web.co.uk/headers/" 
with urllib.request.urlopen(address) as response: 
    html = response.read() 
    print(html)
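Note that response.read() returns bytes rather than str, so the result usually needs decoding before any string searching. The following is a rough sketch of one way to do that; decode_body and fetch_html are hypothetical names, and falling back to UTF-8 when the server declares no charset is an assumption that will not suit every site.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Turn the bytes from response.read() into str.

    Falls back to UTF-8 (an assumption) when no charset is given, and
    replaces undecodable bytes instead of raising.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Fetch url and return its body as text."""
    # urlopen works as a context manager, so the connection is closed on exit.
    with urlopen(url) as response:
        # get_content_charset() returns None if the Content-Type header
        # does not declare a charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling fetch_html("http://www.hobo-web.co.uk/headers/") should then return the page as a str, ready for searching or parsing.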

Source

2015-12-13 22:25:43 heinst

Your webPage variable is a network response object. To get the actual HTML content, use

content = webPage.read()

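To get the contents of the heading tags, you can use the BeautifulSoup library

from bs4 import BeautifulSoup

# webPage is the response object returned by urlopen()
htmlContent = webPage.read()
charset = webPage.headers.get_content_charset()  # None if the server declares no charset
soup = BeautifulSoup(htmlContent, 'html.parser', from_encoding=charset)
# find_all accepts a list of tag names; collect the text of each heading
heads = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]

Now heads is a list with the contents of every heading tag (<h1> to <h6>) that occurs in the page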

To read more about the BeautifulSoup library, go to: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Source

2015-12-13 22:25:43 tffu
