Python: find <title>

I have this:

response = urllib2.urlopen(url) 
html  = response.read() 

begin = html.find('<title>') 
end = html.find('</title>',begin) 
title = html[begin+len('<title>'):end].strip()

If url = http://www.google.com, the title comes out fine: "Google".

But if url = "http://www.britishcouncil.org/learning-english-gateway", the title becomes

"<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<HTML> 
<HEAD> 
<base href="http://www.britishcouncil.org/" /> 
<META http-equiv="Content-Type" Content="text/html;charset=utf-8"> 
<meta name="WT.sp" content="Learning;Home Page Smart View" /> 
<meta name="WT.cg_n" content="Learn English Gateway" /> 
<META NAME="DCS.dcsuri" CONTENT="/learning-english-gateway.htm">..."

What exactly is happening, and why can't I get just the title back?

2010-05-20 Peter

That URL returns a document that uses <TITLE>...</TITLE>, and find is case-sensitive. I strongly recommend using an HTML parser, such as Beautiful Soup.
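To illustrate the parser route without any third-party dependency, here is a minimal sketch that uses the standard library's html.parser in place of Beautiful Soup (a deliberate swap, not what the answer names). It targets Python 3, and the names TitleParser and extract_title are made up for this example. The parser lowercases tag names, so <TITLE> and <title> are treated alike.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as 'title'.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Only keep text that appears between the title tags.
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the stripped title text, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = ''.join(parser.parts).strip()
    return title or None
```

Calling extract_title(html) on the decoded page text should give the title regardless of how the tag is capitalised.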

2010-05-20 10:05:01

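Let's analyze why you got that result. If you open the site and view the source, you'll notice that it contains no <title>...</title>. Instead, it has <TITLE>...</TITLE>. So what happens to the two find calls? Both return -1!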

begin = html.find('<title>') # Result: -1 
end = html.find('</title>') # Result: -1

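Then begin+len('<title>') is -1 + 7 = 6, so your last line extracts html[6:-1]. It turns out that negative indices are perfectly legal in Python (for good reason): they count from the end. Here, -1 refers to the last character of html. So what you get is the substring from index 6 (inclusive) up to, but not including, the last character.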
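The arithmetic can be checked on a small stand-in page. This is only a sketch: the html string below is invented for the example and, like the real page, contains only an uppercase <TITLE>.

```python
# A tiny stand-in page that, like the real one, only has an uppercase <TITLE>.
html = '<HTML><HEAD><TITLE>Hi</TITLE></HEAD></HTML>'

begin = html.find('<title>')          # -1: no lowercase match
end = html.find('</title>', begin)    # -1 as well

start = begin + len('<title>')        # -1 + 7 == 6
wrong_title = html[start:end]         # same as html[6:-1]

# The slice runs from index 6 up to, but not including, the last character.
assert wrong_title == html[6:len(html) - 1]
print(wrong_title)
```

The printed "title" is nearly the whole document, which matches what the question observed.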

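So what can we do? For one, you could use a case-insensitive regular expression, or a proper HTML parser. If this is a one-off and space/performance is not much of a concern, the quickest fix is probably to make a lowercased copy of the whole html string and search that: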

def get_title(html):
    # Search a lowercased copy so <TITLE> and <title> both match.
    html_lowered = html.lower()
    begin = html_lowered.find('<title>')
    end = html_lowered.find('</title>', begin)
    if begin == -1 or end == -1:
        return None
    # Slice the original html so the title keeps its original case.
    return html[begin + len('<title>'):end].strip()

2010-05-20 10:27:47

A working solution with lxml and urllib, using Python 3:

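import urllib.request

import lxml.etree

def documenttitle(url):
    # Assumes the page is UTF-8; adjust the encoding for other pages.
    parser = lxml.etree.HTMLParser(encoding="utf-8")
    with urllib.request.urlopen(url) as conn:
        tree = lxml.etree.parse(conn, parser=parser)
    # findtext returns the title's text (or None) rather than the element.
    return tree.findtext('.//title')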

2014-08-23 22:59:06
