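Grabbing the <title> tag with lxml's iterparse
I'm having trouble using lxml's iterparse on my HTML. I'm trying to get the text of the <title> element, but this simple function doesn't work on complete web pages: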
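from io import BytesIO
from lxml import etree

def get_title(html):
    # html is expected to be bytes here, since BytesIO wraps a bytes object
    title_iter = etree.iterparse(BytesIO(html), tag="title")
    try:
        for event, element in title_iter:
            return element.text
        # print("Script goes here when it doesn't work")
    except etree.XMLSyntaxError:
        return None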
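This function works fine on simple input such as '<title>test</title>', but when I give it a complete page it fails to extract the title.
Update: this is the HTML I'm working with: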
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="it" xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="icon" href="http://www.tricommerce.it/tricommerce.ico" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Tricommerce - Informazioni sulla privacy</title>
<meta name="description" content="Info sulla privacy" />
<meta name="keywords" content="Accessori notebook Alimentatori Case Cavi e replicatori Controllo ventole Lettori e masterizzatori Modding Pannelli & display Dissipatori Tastiere e mouse Ventole Griglie e filtri Hardware Accessori vari Box esterni Casse e cuffie Sistemi a liquido Paste termiche vendita modding thermaltake vantec vantecusa sunmbeam sunbeamtech overclock thermalright xmod aerocool arctic cooling arctic silver zalman colorsit colors-it sharkoon mitron acmecom Info sulla privacy" />
<meta name="robots" content="index, follow" />
<link rel="stylesheet" href="http://www.tricommerce.it/css/tricommerce.css" />
<link rel="stylesheet" href="css/static.css" />
<script type="text/javascript" src="http://www.tricommerce.it/javascript/vertical_scroll.js"></script>
<script type="text/javascript">
//<![CDATA[
function MM_preloadImages() { //v3.0
var d=document; if(d.images){ if(!d.MM_p) d.MM_p=new Array();
var i,j=d.MM_p.length,a=MM_preloadImages.arguments; for(i=0; i<a.length; i++)
if (a[i].indexOf("#")!=0){ d.MM_p[j]=new Image; d.MM_p[j++].src=a[i];}}
}
//]]>
</script>
<link rel="stylesheet" type="text/css" href="http://www.tricommerce.it/css/chromestyle.css" />
<script type="text/javascript" src="http://www.tricommerce.it/javascript/chrome.js">
/***********************************************
* AnyLink CSS Menu script- ? Dynamic Drive DHTML code library (www.dynamicdrive.com)
* This notice MUST stay intact for legal use
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
***********************************************/
</script>
</head>
</html>
As for why I'm using iterparse: I don't want to load the whole DOM just to grab a single tag that appears early in the document.
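Is there a way to do this using iterparse? I'm trying not to load the whole document by parsing it. PS: I've added the sample HTML – babonk 2012-04-24 18:48:00
Uh... sure. Just specify the title tag with its namespace prefix. I've updated the answer with an example. – larsks 2012-04-24 19:59:20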
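For readers following the thread, here is a minimal sketch of what the comment above suggests, not necessarily the answer's exact code. It assumes the page is available as bytes and that the document declares the XHTML namespace, as the sample HTML does; the constant `XHTML_TITLE` and the sample `page` are made up for illustration.

```python
from io import BytesIO
from lxml import etree

# Namespace-qualified tag name in Clark notation: {namespace-uri}localname
XHTML_TITLE = "{http://www.w3.org/1999/xhtml}title"

def get_title(markup):
    """Return the text of the first XHTML <title>, or None if there is none."""
    try:
        # iterparse yields "end" events by default; tag= restricts them to <title>
        for _event, element in etree.iterparse(BytesIO(markup), tag=XHTML_TITLE):
            return element.text  # stop as soon as the first title is complete
    except etree.XMLSyntaxError:
        return None
    return None

# Hypothetical, shortened stand-in for the page in the question
page = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>Tricommerce - Informazioni sulla privacy</title></head>"
    b"<body><p>rest of the page</p></body></html>"
)
print(get_title(page))
```

Because the filter is namespace-qualified, an un-namespaced `<title>` no longer matches in this version.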
Gotcha. If I'm crawling through a lot of pages, do I need to specify the correct namespace prefix each time, or will it generally work with tag='{http://www.w3.org/1999/xhtml}title'? – babonk 2012-04-24 20:03:06
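One possible way to sidestep the question when crawling mixed pages is to drop the `tag=` filter and compare only the local part of each element's name. This is a sketch under that assumption rather than an approach the thread confirms; the helper name `get_title_any_ns` is hypothetical. It still stops at the first title, so the rest of the document is never parsed.

```python
from io import BytesIO
from lxml import etree

def get_title_any_ns(markup):
    """Return the text of the first <title>, whatever its namespace, or None."""
    try:
        # No tag filter: every element's "end" event is inspected in document order
        for _event, element in etree.iterparse(BytesIO(markup)):
            # QName splits "{namespace}title" into namespace and localname
            if etree.QName(element).localname == "title":
                return element.text
    except etree.XMLSyntaxError:
        return None
    return None

# Works for both a namespaced XHTML page and a bare snippet
xhtml_page = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>Namespaced</title></head></html>"
)
print(get_title_any_ns(xhtml_page))
print(get_title_any_ns(b"<title>test</title>"))
```

The trade-off is that every element before the title is handed to Python, which is usually negligible when the title sits near the top of the document.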