2013-07-23 118 views
0

Parsing HTML to edit links

I am trying to parse an HTML file (demo.html) and make all relative links absolute. Here is what I tried in a Python script:

from bs4 import BeautifulSoup

BASE_URL = "http://www.esplanade.com.sg"

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup(html_text)

# Prefix the href of every <a> tag that has one.
for a in soup.findAll('a', href=True):
    a['href'] = BASE_URL + a['href']

# Prefix the href of every <link> tag that has one.
for a in soup.findAll('link', href=True):
    a['href'] = BASE_URL + a['href']

# Prefix the src of every <script> tag that has one.
for a in soup.findAll('script', src=True):
    a['src'] = BASE_URL + a['src']

# Write the encoded bytes in binary mode and close the file.
f = open("demo_result.html", "wb")
f.write(soup.prettify().encode("utf-8"))
f.close()

However, the output file demo_result.html contains many unexpected changes. For example,

<script type="text/javascript" src="/scripts/ddtabmenu.js" /> 
/*********************************************** 
* DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com) 
* + Drop Down/ Overlapping Content- 
* This notice MUST stay intact for legal use 
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code 
***********************************************/ 
</script> 

changes to

<script src="http://www.esplanade.com.sg/scripts/ddtabmenu.js" type="text/javascript"> 
</script> 
</head> 
<body> 
<p> 
    /*********************************************** 
    * DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com) 
    * + Drop Down/ Overlapping Content- 
    * This notice MUST stay intact for legal use 
    * Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code 
    ***********************************************/ 

Can someone tell me where I am going wrong?

Thanks and warmest regards.

+0

It works fine at my end. – duck

+0

@user1471175 - What do you mean? Does it convert only the links, without changing other parts of the HTML as I described in my question? –

+1

Sorry, I was looking for the wrong error :) – duck

Answers

1

It seems Beautiful Soup 4 is causing the problem. Just downgrade to Beautiful Soup version 3, as shown here, and your problem will be solved.

import BeautifulSoup  # This is version 3, not version 4 (Python 2 only)

BASE_URL = "http://www.esplanade.com.sg"

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup.BeautifulSoup(html_text)
print soup.contents  # debug output (Python 2 print statement)

# In version 3, tag.attrs is a list of (name, value) pairs, so test for
# the attribute with has_key() instead of comparing each attrs entry.
for a in soup.findAll('a'):
    if a.has_key('href'):
        a['href'] = BASE_URL + a['href']
for a in soup.findAll('link'):
    if a.has_key('href'):
        a['href'] = BASE_URL + a['href']
for a in soup.findAll('script'):
    if a.has_key('src'):
        a['src'] = BASE_URL + a['src']

# prettify() already returns a UTF-8 encoded byte string in version 3,
# so it is written as-is.
f = open("demo_result.html", "w")
f.write(soup.prettify())
f.close()

+0

Thanks @user1471175 :D That did it. I edited the code a bit, though, to get rid of encoding/decoding errors :) –
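
A side note on the prefixing step itself: concatenating the site root onto every href also rewrites links that are already absolute. A minimal sketch of an alternative, assuming Python 3 and the standard library's urljoin (the helper name make_absolute and the sample links are made up for illustration, they are not from the question):

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

# Site root taken from the question; the trailing slash makes it a directory base.
BASE_URL = "http://www.esplanade.com.sg/"


def make_absolute(link, base=BASE_URL):
    """Resolve a link against the base URL (hypothetical helper)."""
    # urljoin leaves absolute http:// URLs and javascript: links untouched
    # and resolves relative paths against the base.
    return urljoin(base, link)


# Hypothetical sample links, for illustration only.
for link in ["/scripts/ddtabmenu.js", "images/logo.png",
             "http://www.dynamicdrive.com/", "javascript:void(0)"]:
    print(link, "->", make_absolute(link))
```

Inside the loops this would replace the string concatenation, e.g. a['href'] = make_absolute(a['href']).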

0

Your HTML is messy. You have closed the script tag and then closed it again:

<script type="text/javascript" src="/scripts/ddtabmenu.js" /></script> 

That breaks the DOM. Just remove the trailing / from <script type="text/javascript" src="/scripts/ddtabmenu.js" />

+2

Honest question: isn't Beautiful Soup supposed to handle broken HTML? – HolgerSchurig

+0

Really, this is a problem with Beautiful Soup version 4; version 3 works fine. – duck

+0

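@twil I agree it's messy. However, since I am scraping web pages, I have no control over the HTML. Of course, I could pre-process the HTML to clean it up, but I would prefer not to have to :) –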
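
For anyone who does want to pre-process: a rough sketch of the "remove the /" suggestion applied to the raw text before parsing, assuming Python 3. The function name open_self_closed_scripts is made up, and the regex assumes no ">" appears inside an attribute value, so treat it as a stopgap rather than a general HTML fixer. Whether it is enough for a given parser is something to check against your own pages.

```python
import re

# Matches a self-closed opening tag such as <script src="x.js" />.
# Assumption: attribute values contain no ">" character.
SELF_CLOSED_SCRIPT = re.compile(r"<script\b([^>]*?)\s*/>", re.IGNORECASE)


def open_self_closed_scripts(html_text):
    """Rewrite <script ... /> as <script ...> (hypothetical helper)."""
    return SELF_CLOSED_SCRIPT.sub(r"<script\1>", html_text)


snippet = '<script type="text/javascript" src="/scripts/ddtabmenu.js" />'
print(open_self_closed_scripts(snippet))
```

The cleaned string would then be passed to BeautifulSoup in place of the raw file contents.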

0

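Reverting to BeautifulSoup 3 does get rid of the problem. Also, building URLs this way will cause problems with HTML anchors and javascript references, so I changed the code: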

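import re
import BeautifulSoup  # version 3 (Python 2 only)

with open("demo.html", "r") as file_h:
    soup = BeautifulSoup.BeautifulSoup(file_h.read())

url = "http://www.esplanade.com.sg/"

def health_check(link):
    # True if the link does not start with "javascript:" or "http://"
    # and begins with a forward slash or a word character.
    return bool(re.search(r"^(?!javascript:|http://)[/\w]", link))

def replacer(link):
    # Replace an optional leading url and an optional leading slash with url.
    # re.escape keeps the dots in url from acting as regex wildcards.
    return re.sub(r"^(%s)?/?" % re.escape(url), url, link)

for soup_tag in soup.findAll(lambda x: x.name in ["a", "img", "link", "script"]):

    if soup_tag.has_key("href") and health_check(soup_tag["href"]):
        soup_tag["href"] = replacer(soup_tag["href"])

    if soup_tag.has_key("src") and health_check(soup_tag["src"]):
        soup_tag["src"] = replacer(soup_tag["src"])

with open("demo_result.html", "w") as file_h:
    # prettify() already returns a UTF-8 encoded byte string in version 3.
    file_h.write(soup.prettify())

+0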

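Hi, thanks for trying to improve my code. I'm not very familiar with regular expressions, though. If you could also explain in detail what patterns those regexes are meant to capture, I would be very grateful. Many thanks :) –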

+0

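Sure. health_check is a function that takes an href string and checks that: a) it does not start with "javascript:" or "http://"; and b) it starts with a forward slash or an alphanumeric character (or underscore). replacer is a function that performs a regex substitution on an href or src string; starting at the beginning of the string, it replaces a) the url (if present) and b) a forward slash (if present) with the url itself. – dilbert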
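
To make that explanation concrete, here is a small standalone sketch, assuming Python 3, that restates the two helpers and runs them on a few made-up sample links. The only deliberate change is wrapping url in re.escape, a small hardening so the dots in the host name are matched literally; the sample links are hypothetical.

```python
import re

url = "http://www.esplanade.com.sg/"


def health_check(link):
    # Negative lookahead rejects "javascript:" and "http://" prefixes;
    # the character class then requires a leading "/" or word character.
    return bool(re.search(r"^(?!javascript:|http://)[/\w]", link))


def replacer(link):
    # Swap an optional leading url and an optional leading "/" for url.
    return re.sub(r"^(%s)?/?" % re.escape(url), url, link)


# Hypothetical sample links, for illustration only.
samples = ["/scripts/ddtabmenu.js", "images/logo.png", "#top",
           "javascript:void(0)", "http://www.dynamicdrive.com/"]

for link in samples:
    if health_check(link):
        print(link, "->", replacer(link))
    else:
        print(link, "-> left unchanged")
```

One thing to watch for: the lookahead only rules out "http://", so a link starting with "https://" passes health_check and gets the base URL prepended. Adding "https://" to the lookahead alternatives would be one way to cover that case.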