BS HTML parsing - &amp; is ignored when printing a URL string

Consider the following example.

from bs4 import BeautifulSoup

htmlist = ['<div class="portal" role="navigation" id="p-coll-print_export">',
      '<h3>Print/export</h3>',
      '<div class="body">',
      '<ul>',
      '<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Main+Page">Create a book</a></li>',
      '<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Main+Page&amp;oldid=560327612&amp;writer=rl">Download as PDF</a></li>',
      '<li id="t-print"><a href="/w/index.php?title=Main_Page&amp;printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>',
      '</ul>',
      '</div>',
      '</div>',
      ]

soup = BeautifulSoup("".join(htmlist), "html.parser")

for x in soup("a"): 
    print(x) 
    print(x.attrs) 
    print(soup.a.get_text())

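I expected this short script to print each a tag (x), followed by a dictionary of x's attributes (each attribute name as a key and its content as the value), and finally the text of the link.

Instead, the output is: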

<a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Main+Page">Create a book</a> 
{'href': '/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page'} 
Create a book 
<a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Main+Page&amp;oldid=560327612&amp;writer=rl">Download as PDF</a> 
{'href': '/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl'} 
Create a book 
<a accesskey="p" href="/w/index.php?title=Main_Page&amp;printable=yes" title="Printable version of this page [p]">Printable version</a> 
{'href': '/w/index.php?title=Main_Page&printable=yes', 'title': 'Printable version of this page [p]', 'accesskey': ['p']} 
Create a book

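The problems I see with this output are:

print(soup.a.get_text()) always prints the text of the first tag.
In the dictionary printed by print(x.attrs), the value of the "href" key has lost its &amp; (it shows a bare & instead).

What am I missing here, and how can I get the desired output?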

Source

2017-08-27 Git Gud

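Why not use 'x.get_text()'? '&amp;' is the HTML-encoded version of '&'; I wouldn't worry about it.

@t.m.adam Of course, I should get the text from 'x', thanks. However, I still need the '&amp;' part. It is part of the challenge, and I need the output to match.

@t.m.adam Quick question. As you can see, I added a solution that replaces & with &amp;, but I just realized this may be incorrect, because a link might contain legitimate ampersands. My question is: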

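You can use the cgi.escape or html.escape function to HTML-encode the & character. (cgi.escape was removed in Python 3.8, so on current Python versions use html.escape.)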

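import html

for x in soup("a"):
    print(x)
    # Re-escape only the href value; quote=False leaves " and ' untouched.
    print({k: html.escape(v, quote=False) if k == 'href' else v for k, v in x.attrs.items()})
    # Use the loop variable so each link prints its own text.
    print(x.get_text())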
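
As a rough, self-contained sketch of the same idea: the parser stores attribute values in decoded form (so &amp; in the source becomes a bare & in x.attrs) and only re-encodes them when a tag is printed. Escaping the decoded value again should therefore give back &amp; without double-escaping, which seems to cover the concern about legitimate ampersands raised in the comments, at least for well-formed input. The SAMPLE string and the escaped_attrs helper below are illustrative names, not part of the question or of Beautiful Soup.

```python
import html

from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the HTML in the question.
SAMPLE = (
    '<ul>'
    '<li><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator">'
    'Create a book</a></li>'
    '<li><a href="/w/index.php?title=Main_Page&amp;printable=yes" accesskey="p">'
    'Printable version</a></li>'
    '</ul>'
)


def escaped_attrs(tag):
    """Return a copy of tag.attrs with the href value HTML-escaped again.

    Illustrative helper (an assumption, not a Beautiful Soup API).
    """
    return {
        key: html.escape(value, quote=False) if key == "href" else value
        for key, value in tag.attrs.items()
    }


soup = BeautifulSoup(SAMPLE, "html.parser")

for link in soup("a"):
    print(link["href"])         # decoded value, contains a bare &
    print(escaped_attrs(link))  # href re-escaped, contains &amp;
    print(link.get_text())      # text of this link, not of soup.a
```

Unescaping the re-escaped value with html.unescape returns the decoded string again, so the round trip loses nothing; whether the escaped or the decoded form is wanted depends on where the URL ends up (inside HTML markup, or passed to an HTTP client).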

Source

2017-08-27 16:00:42
