How can I reliably extract the attributes and content from meta tags?

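I have, for example, the following lines of HTML. I need to extract the property (for example og:image) and content attributes into a list. The problem is that a simple string.split() does not give the same result for both lines, because the content value on the second line contains many spaces.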

How can I reliably process lines like these and get a list such as ['og:image', 'http....whatever.jpg'], and the same for the second line?

<meta property="og:image" content="http://google.com/example.jpg"/> 
<meta property="og:title" content="Fant over 300 falske personer i skattelistene"/>

Edit: this is how I am parsing it right now:

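from lxml import etree

tree = etree.HTML(xml)  # xml holds the HTML source as a string
m = tree.xpath("//meta[@property]")  # every <meta> element that has a property attribute
for i in m:
    og = etree.tostring(i, encoding="unicode")  # serialize the element back to text
    print(og)  # <meta property="og:image" content="http://google.com/example.jpg"/>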

Maybe there is a way to get the content/attributes straight into a list using XPath?

2013-02-25 knutole

Can't you use a proper HTML parser? – 2013-02-25 15:36:38

I have added the parser I am using... – knutole 2013-02-25 15:41:22

Instead of casting your elements back to strings, just go through each element's attrib mapping and grab the attributes:

for i in m:
    # attrib is a dict-like mapping of the element's attributes
    print(i.attrib['property'], i.attrib['content'])
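
For completeness, here is a minimal self-contained sketch of the same idea that builds the list of pairs the question asks for. The names html, extract_meta, pairs and contents are made up for illustration, and the sample input is just the two lines from the question. It assumes lxml is installed. Requiring @content in the XPath is an extra safeguard that is not part of the original loop: it skips tags without a content attribute rather than raising a KeyError.

```python
from lxml import etree

# Sample input: the two lines from the question (variable name is illustrative).
html = '''
<meta property="og:image" content="http://google.com/example.jpg"/>
<meta property="og:title" content="Fant over 300 falske personer i skattelistene"/>
'''


def extract_meta(source):
    """Return [property, content] pairs for every <meta> that has both attributes."""
    tree = etree.HTML(source)
    # Only match tags that carry both attributes, so get() never returns None.
    metas = tree.xpath("//meta[@property and @content]")
    return [[m.get("property"), m.get("content")] for m in metas]


pairs = extract_meta(html)
for pair in pairs:
    print(pair)

# XPath can also select attribute values directly; the result is a list of strings.
contents = etree.HTML(html).xpath("//meta[@property]/@content")
```

Because the parser reads each attribute value as a whole, spaces inside content no longer matter, which is the problem string.split() ran into. The last line also shows that XPath can return attribute values directly, although pairing each property with its content is simpler through the element's attributes.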

2013-02-25 15:43:56

Thanks, brilliant =) – knutole 2013-02-25 15:46:05
