Using BeautifulSoup to extract links from an HTML table with unclean source code

I am trying to scrape articles from a Chinese newspaper database. Here is some of the page source (pasting an excerpt, since the full site is access-restricted):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%@ page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概覽頁面</title>
...
</head>
...
</html>
</html>
When I try to do some simple scraping of the links in the table, like so:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
Python doesn't find anything in the HTML at all and returns an empty list. I think this is because the source starts with the base href tag, and the parser only recognizes two tags in the document: base href and html.
Any ideas on how to scrape the links in this case? Thanks so much!!
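One workaround that needs no third-party parser: Python's standard-library `html.parser` is tolerant of markup like the excerpt above (a `<base>` tag before `<html>`, a stray JSP comment, a duplicated `</html>`). A minimal sketch; the `messy` string below is a stand-in for the real page, which requires a login to fetch:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> start tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

# Stand-in string mirroring the broken structure of the excerpt above.
messy = (
    '<base href="http://example.com/web" /><html>'
    '<! -- <%@ page contentType="text/html;charset=GBK" %>'
    '<body><table><tr><td>'
    '<a href="detail?id=1">article one</a>'
    '</td></tr></table></body>'
    '</html></html>'
)

parser = LinkExtractor()
parser.feed(messy)
print(parser.links)  # ['detail?id=1']
```

Since this parser just streams through start tags, it never needs the document to be well-formed, which sidesteps the `<base>`-before-`<html>` problem entirely.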
Note that BeautifulSoup 3 is now deprecated and BS4 is the actively maintained version. – Amanda
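Following up on that comment: under the maintained `bs4` package the same lookup tolerates the stray `<base>` tag and the duplicated `</html>`. A minimal sketch, assuming `bs4` is installed; the `messy` string is a placeholder for the real (login-protected) page source:

```python
from bs4 import BeautifulSoup

# Stand-in for the real page source; structure mirrors the excerpt above.
messy = (
    '<base href="http://example.com/web" /><html>'
    '<body><table><tr><td>'
    '<a href="detail?id=1">article one</a>'
    '</td></tr></table></body>'
    '</html></html>'
)

# The stdlib 'html.parser' backend is lenient; 'lxml' or 'html5lib'
# are even more forgiving if they happen to be available.
soup = BeautifulSoup(messy, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]
print(links)  # ['detail?id=1']
```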