Parsing HTML to edit links

I am trying to parse an HTML file (demo.html) to make all relative links absolute. Here is how I tried to do this in a Python script:

from bs4 import BeautifulSoup

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup(html_text)
for a in soup.findAll('a'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('link'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('script'):
    for x in a.attrs:
        if x == 'src':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
f = open("demo_result.html", "w")
f.write(soup.prettify().encode("utf-8"))
f.close()

However, the output file demo_result.html contains many unexpected changes. For example,

<script type="text/javascript" src="/scripts/ddtabmenu.js" /> 
/*********************************************** 
* DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com) 
* + Drop Down/ Overlapping Content- 
* This notice MUST stay intact for legal use 
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code 
***********************************************/ 
</script>

changes to

<script src="http://www.esplanade.com.sg/scripts/ddtabmenu.js" type="text/javascript"> 
</script> 
</head> 
<body> 
<p> 
    /*********************************************** 
    * DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com) 
    * + Drop Down/ Overlapping Content- 
    * This notice MUST stay intact for legal use 
    * Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code 
    ***********************************************/

Can someone tell me where I am going wrong?

Thanks and warmest regards.

Source

2013-07-23 Shubham Goyal

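It works fine on my end. – duck

@user1471175 - What do you mean? Does it only convert the links, without changing other parts of the HTML as I described in my question? –

Sorry, I was looking at the wrong error :) – duck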

It seems Beautiful Soup 4 is causing the problem. Just downgrade to Beautiful Soup version 3 and, as mentioned, your problem will be solved.

import BeautifulSoup  # This is version 3, not version 4

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup.BeautifulSoup(html_text)
print soup.contents
# In version 3, tag.attrs is a list of (name, value) pairs, so test for the
# attribute with has_key() instead of comparing each entry to a string.
for a in soup.findAll('a'):
    if a.has_key('href'):
        a['href'] = "http://www.esplanade.com.sg" + a['href']
for a in soup.findAll('link'):
    if a.has_key('href'):
        a['href'] = "http://www.esplanade.com.sg" + a['href']
for a in soup.findAll('script'):
    if a.has_key('src'):
        a['src'] = "http://www.esplanade.com.sg" + a['src']
f = open("demo_result.html", "w")
# prettify() in version 3 already returns a UTF-8 encoded byte string.
f.write(soup.prettify())
f.close()

Source

2013-07-23 10:00:34 duck

Thanks @user1471175 :D That did it. I edited the code a bit, though, to get rid of encoding/decoding errors :) –
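The three loops in the scripts above differ only in the tag and attribute names, and they prepend the site address even to links that are already absolute. As a rough sketch (not something proposed in this thread), the standard library's `urljoin` can resolve each value against a base URL instead. The names `BASE_URL`, `LINK_ATTRS`, `make_absolute` and `absolutize_links` are made up for illustration, and the import shown is the Python 3 one; on Python 2 `urljoin` lives in the `urlparse` module.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

BASE_URL = "http://www.esplanade.com.sg/"  # site address used in the question

# (tag name, attribute) pairs rewritten by the three loops in the question
LINK_ATTRS = [("a", "href"), ("link", "href"), ("script", "src")]


def make_absolute(value, base=BASE_URL):
    """Resolve value against base; values that are already absolute come back unchanged."""
    return urljoin(base, value)


def absolutize_links(soup, base=BASE_URL):
    """Rewrite link attributes in place on a parsed soup (hypothetical helper)."""
    for tag_name, attr in LINK_ATTRS:
        for tag in soup.findAll(tag_name):
            if tag.get(attr):  # skip tags without the attribute
                tag[attr] = make_absolute(tag[attr], base)


print(make_absolute("/scripts/ddtabmenu.js"))
print(make_absolute("http://www.dynamicdrive.com/"))
print(make_absolute("javascript:void(0)"))
```

With this approach, `javascript:` and `mailto:` values are returned as they are, while a bare fragment such as `#top` is resolved against the base URL, so it may still be worth skipping fragments explicitly.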

Your HTML code is messed up. You have closed the script tag and then closed it again:

<script type="text/javascript" src="/scripts/ddtabmenu.js" /></script>

This breaks the DOM. Just remove the / from the end of <script type="text/javascript" src="/scripts/ddtabmenu.js" />

Source

2013-07-23 09:22:26 twil

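Honest question: isn't Beautiful Soup supposed to handle broken HTML? – HolgerSchurig

Really, this is a problem with Beautiful Soup version 4; version 3 works fine. – duck

@twil I agree it is messy. However, since I am trying to scrape the web page, I have no control over the HTML. Of course, I could preprocess the HTML to clean it up, but I would prefer not to if I don't have to :) –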
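For anyone who does want to clean the markup before parsing, here is a minimal sketch of the fix suggested in this answer, applied with a regular expression. It assumes every self-closed `<script ... />` in the page is followed by a matching `</script>`, as in the question, and that no attribute value contains a `>` character. The names `SELF_CLOSED_SCRIPT` and `open_self_closed_scripts` are made up for illustration.

```python
import re

# A <script ...> start tag written in self-closing form, e.g. <script src="x.js" />
SELF_CLOSED_SCRIPT = re.compile(r"<script\b([^>]*?)\s*/>", re.IGNORECASE)


def open_self_closed_scripts(html_text):
    """Rewrite <script ... /> as <script ...> so the later </script> pairs with it."""
    return SELF_CLOSED_SCRIPT.sub(r"<script\1>", html_text)


broken = '<script type="text/javascript" src="/scripts/ddtabmenu.js" />\n/* notice */\n</script>'
print(open_self_closed_scripts(broken))
```

The cleaned string can then be passed to the parser in place of the raw file contents.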

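Reverting to BeautifulSoup 3 gets rid of the problem. Also, prepending the URL like this will cause problems with HTML anchors and javascript references, so I changed the code: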

import re
import BeautifulSoup  # version 3

with open("demo.html", "r") as file_h:
    soup = BeautifulSoup.BeautifulSoup(file_h.read())

url = "http://www.esplanade.com.sg/"
# True if the value does not start with "javascript:" or "http://" and its
# first character is a forward slash or a word character.
health_check = lambda x: bool(re.search(r"^(?!javascript:|http://)[/\w]", x))
# Replace an optional leading url and an optional leading slash with url.
replacer = lambda x: re.sub(r"^(%s)?/?" % re.escape(url), url, x)

for soup_tag in soup.findAll(lambda x: x.name in ["a", "img", "link", "script"]):

    if soup_tag.has_key("href") and health_check(soup_tag["href"]):
        soup_tag["href"] = replacer(soup_tag["href"])

    if soup_tag.has_key("src") and health_check(soup_tag["src"]):
        soup_tag["src"] = replacer(soup_tag["src"])

with open("demo_result.html", "w") as file_h:
    # prettify() in version 3 already returns a UTF-8 encoded byte string.
    file_h.write(soup.prettify())

Source

2013-07-23 12:51:53 dilbert

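Hi, thanks for trying to improve my code. But I am not very familiar with regular expressions. So if you could also explain in detail what kinds of patterns those regular expressions are expected to capture, I would be very grateful. Thanks a lot :) –

Sure. health_check is a function that takes an href string and checks that: a) it does not start with "javascript:" or "http://"; and b) it starts with a forward slash or an alphanumeric character. replacer is a function that performs a regex substitution on an href or src string; starting at the beginning of the string, it replaces a) the url (if present) and b) a forward slash (if present) with the url itself. – dilbert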
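To see the two patterns in action, here is a small standalone sketch that needs only the `re` module. It rewrites the lambdas as plain functions and escapes the URL with `re.escape`; the sample inputs are made up for illustration.

```python
import re

URL = "http://www.esplanade.com.sg/"


def health_check(value):
    # The negative lookahead rejects "javascript:" and "http://" prefixes;
    # [/\w] then requires the first character to be "/" or a word character.
    return bool(re.search(r"^(?!javascript:|http://)[/\w]", value))


def replacer(value):
    # Optionally match the URL and then one "/" at the start, and put URL there.
    return re.sub(r"^(%s)?/?" % re.escape(URL), URL, value)


for sample in ["/scripts/ddtabmenu.js", "index.html", "#top",
               "javascript:void(0)", "http://www.dynamicdrive.com/"]:
    if health_check(sample):
        print(sample, "->", replacer(sample))
    else:
        print(sample, "-> left unchanged")
```

Only the first two samples are rewritten; the anchor, the javascript reference and the absolute link are left alone. Note that with these patterns a value starting with `https://` or `mailto:` still passes health_check and gets the URL prepended, so the lookahead may need more alternatives for real pages.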
