Get all links from HTML using lxml

I want to find all the URLs and their names in an HTML page using lxml.

I could parse the page and pick the URLs out myself, but is there a simple way to find all the URL links using lxml?

2012-04-30 sam

Note that HTML is not XML; if parsing fails because of missing end tags or unquoted attribute values, [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) can help or may be a better fit.

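from lxml.html import parse
# Parse the page straight from its URL and take the root <html> element.
dom = parse('http://www.google.com/').getroot()
# Select every <a> element (requires the cssselect package).
links = dom.cssselect('a')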
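
The question also asks for each link's name. A minimal sketch of pairing every href with its visible text, assuming a hypothetical `SAMPLE` string in place of a downloaded page and a hypothetical helper name `links_with_names`:

```python
from lxml import html

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE = """
<html><body>
  <a href="/docs">Documentation</a>
  <a href="http://example.com/about"><b>About</b> us</a>
  <a name="top">No href here</a>
</body></html>
"""

def links_with_names(markup):
    """Return (href, text) pairs for every <a> element that has an href."""
    doc = html.fromstring(markup)
    pairs = []
    for a in doc.iter('a'):
        href = a.get('href')
        if href is None:
            continue  # skip anchors such as <a name="...">
        # text_content() joins the text of the element and its descendants.
        pairs.append((href, a.text_content().strip()))
    return pairs

for href, name in links_with_names(SAMPLE):
    print(href, name)
```

Iterating with `iter('a')` avoids the cssselect dependency; the same loop should work on the elements returned by `cssselect('a')`.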

2012-04-30 12:08:44 kev

Nice answer; I only had to run 'pip install cssselect' to get it working. – taystack

from lxml import cssselect, html

# Read the HTML file into a string.
with open("/you/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)

# Compile a CSS selector for <a> elements and collect each href.
select = cssselect.CSSSelector("a")
links = [el.get('href') for el in select(dochtml)]

for n, l in enumerate(links):
    print(n, l)
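
If links outside `<a>` tags matter as well (image sources, stylesheets and so on), `lxml.html` elements also provide `iterlinks()` and `make_links_absolute()`. A rough sketch, where the `SAMPLE` markup, `BASE_URL`, and the helper name `all_links` are made up for illustration:

```python
from lxml import html

# Hypothetical markup and base URL, used only for illustration.
SAMPLE = '<div><a href="page.html">Page</a><img src="/logo.png"></div>'
BASE_URL = "http://example.com/dir/"

def all_links(markup, base_url):
    """Return (tag, attribute, absolute_url) for every link in the markup."""
    doc = html.fromstring(markup)
    # Rewrite relative URLs in place so they become absolute.
    doc.make_links_absolute(base_url)
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]

for tag, attr, url in all_links(SAMPLE, BASE_URL):
    print(tag, attr, url)
```

Filtering the result on `tag == 'a'` narrows it back down to ordinary hyperlinks.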

2014-01-23 19:06:18 lmokto

Note that cssselect is now a standalone project and is no longer bundled with lxml. Install it with 'pip install cssselect'. See [here](https://pythonhosted.org/cssselect/) for more information. – jheyse
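
Because cssselect is a separate install, the hrefs can also be gathered with lxml's built-in XPath support, which needs no extra package. A minimal sketch, assuming a hypothetical `SAMPLE` string and helper name `hrefs_via_xpath`:

```python
from lxml import html

# Hypothetical sample markup; the last <a> has no href on purpose.
SAMPLE = '<p><a href="/one">One</a> and <a href="/two">Two</a> <a>none</a></p>'

def hrefs_via_xpath(markup):
    """Return the href of every <a> element that has one, in document order."""
    doc = html.fromstring(markup)
    # './/a/@href' selects the href attribute of each descendant <a>.
    return [str(h) for h in doc.xpath('.//a/@href')]

print(hrefs_via_xpath(SAMPLE))
```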
