How do I extract text from an HTML page?

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the name of each company along with its address and website. I tried the following to convert the HTML to text:

import nltk 
from urllib import urlopen 

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()  
raw = nltk.clean_html(html) 
print(raw)

But it returns this error:

ImportError: cannot import name 'urlopen'

Source

2015-11-06 Nique

You are using the Python 3 [`urllib`](https://docs.python.org/3/library/urllib.html), which is different from the Python 2 [`urllib`](https://docs.python.org/2/library/urllib.html).

Pretty sure you will be disappointed once you get that working: [`clean_html`](http://www.nltk.org/_modules/nltk/util.html#clean_html) is not implemented. Have a look at [this question](http://stackoverflow.com/questions/26002076/python-nltk-clean-html-not-implemented).

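醒木 has already answered your question (link): in Python 3, `urlopen` lives in `urllib.request`, so the import has to change.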

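import urllib.request

# Same URL as in the question, with the "?" before the query string
url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

uf = urllib.request.urlopen(url)
html = uf.read()  # bytes on Python 3; decode before treating it as text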
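One thing to watch for with the snippet above: on Python 3, `read()` returns `bytes`, not `str`. Here is a minimal sketch of decoding the response, assuming Python 3 only. The helper name `fetch_html` and the UTF-8 fallback are my own assumptions, not part of the original answer.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the body at `url` as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes on Python 3
        # Prefer the charset declared in the response headers;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

You would call it as `html = fetch_html(url)` and then hand `html` to whichever parser you choose.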

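However, if you want to extract data (such as the company name, address and website), you will need to fetch the HTML source and parse it with an HTML parser.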
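To give an idea of what "parsing" means here, and since `nltk.clean_html` is no longer available, below is a rough sketch of plain text extraction using only the standard library's `html.parser`. The class and function names (`TextExtractor`, `html_to_text`) and the sample markup are made up for illustration. It only collects visible text and skips `<script>` and `<style>` contents, so it is not a full replacement for a real scraping library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Hypothetical sample markup, not taken from the real page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Practices</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

This gives you raw text only. To pick out specific fields such as a name or an address, you need to select elements by tag and class.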

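I suggest using requests to fetch the HTML source, and BeautifulSoup to parse the resulting HTML and extract the text you need.

Here is a small snippet that should give you a good start.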

import requests 
from bs4 import BeautifulSoup 

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50" 

html = requests.get(link).text 

# If you do not want to use requests, you can fetch the page with urllib
# instead (the snippet above); it should not cause any issues.
# "lxml" requires the lxml package; "html.parser" is the built-in alternative.
soup = BeautifulSoup(html, "lxml")
res = soup.find_all("article", {"class": "listingItem"})
for r in res: 
    print("Company Name: " + r.find('a').text) 
    print("Address: " + r.find("div", {'class': 'address'}).text) 
    print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)
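The loop above raises an exception as soon as one listing lacks an address or has fewer than four `pageMeta-item` blocks. Here is a more defensive sketch, run against a small inline sample so that it works offline. The sample markup, the helper names and the `"n/a"` default are all assumptions. Only the class names come from the snippet above, and the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the class names used above.
SAMPLE_HTML = """
<article class="listingItem">
  <a href="/practice/1">Example Architects Ltd</a>
  <div class="address">1 Example Street, London</div>
  <div class="pageMeta-item">Tel</div>
  <div class="pageMeta-item">Fax</div>
  <div class="pageMeta-item">Email</div>
  <div class="pageMeta-item">www.example.com</div>
</article>
<article class="listingItem">
  <a href="/practice/2">Sample Studio</a>
</article>
"""


def text_or_default(tag, default="n/a"):
    """Return the stripped text of `tag`, or `default` if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def extract_listings(html):
    """Return one dict per listing with name, address and website."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    listings = []
    for item in soup.find_all("article", class_="listingItem"):
        meta = item.find_all("div", class_="pageMeta-item")
        # The website is assumed to be the fourth pageMeta-item, as above.
        website_tag = meta[3] if len(meta) > 3 else None
        listings.append({
            "name": text_or_default(item.find("a")),
            "address": text_or_default(item.find("div", class_="address")),
            "website": text_or_default(website_tag),
        })
    return listings


for listing in extract_listings(SAMPLE_HTML):
    print(listing)
```

Returning dictionaries rather than printing directly makes it easy to write the results to CSV or JSON later.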

Source

2015-11-06 12:34:28 JRodDynamite

This doesn't help them understand the error.

@PeterWood - I have updated my answer. Hope it helps. – JRodDynamite
