Error with Beautiful Soup

I need to extract the text inside the title tag from this source:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html dir="ltr" lang="en"> 
<head> 
    <title>Microsoft to acquire Nokia’s devices &amp; services business, license Nokia’s patents and mapping services</title> 
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" /> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" /> 
    </title>

I use this to extract the text:

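# Imports were not shown in the question; bs4 is assumed from the asker's later comment.
import urllib2
from bs4 import BeautifulSoup

# Send a browser-like User-agent header with the request.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# Despite its name, ourUrl holds the downloaded HTML, not the URL.
ourUrl = opener.open("http://www.thehindubusinessline.com/industry-and-economy/info-tech/nokia-cannot-license-brand-nokia-post-microsoft-deal/article5156470.ece").read()

soup = BeautifulSoup(ourUrl)
print soup
dem = soup.findAll('p')      # every <p> element
hea = soup.findAll('title')  # every <title> element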

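This code extracts the p elements correctly but fails when I try to extract the title. Thanks. I have only included part of the code; don't worry, the rest works fine.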

Source

2013-09-23 user2784753

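I can't reproduce your problem. The HTML on that page is broken, but BeautifulSoup 3 and all three parser plugins for BeautifulSoup 4 give me correct output, and I can extract the title just fine. –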

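Which version of BeautifulSoup are you using? The 4.0 series had some problems. Also, certain lxml + libxml2 combinations have trouble with some HTML input. If you are using BeautifulSoup 4, do you have lxml installed? –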
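The version and parser questions above can be checked with a short script. This is only a sketch, written for Python 3 with bs4 installed: BROKEN_HEAD is a copy of the snippet from the question, and titles_by_parser is a hypothetical helper name, not part of BeautifulSoup. Parsers that are not installed are reported as None instead of raising.

```python
import bs4
from bs4 import BeautifulSoup, FeatureNotFound

# Copy of the malformed head from the question (note the stray </title>).
BROKEN_HEAD = """<html dir="ltr" lang="en">
<head>
    <title>Microsoft to acquire Nokia’s devices &amp; services business, license Nokia’s patents and mapping services</title>
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
    </title>
"""


def titles_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: map each parser name to the title text it finds.

    A value of None means that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            results[name] = None
            continue
        tag = soup.find("title")
        results[name] = tag.get_text(strip=True) if tag is not None else ""
    return results


print("bs4 version:", bs4.__version__)
for parser_name, title in titles_by_parser(BROKEN_HEAD).items():
    print(parser_name, "->", title)
```

If one parser returns an empty title while the others do not, that points at the parser (or its underlying library) rather than at the extraction code.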

Hmm, what error do you get? Or do you get an empty list? Because I tried your code (on this page too) and got **no** error! – JadedTuna

Your HTML has an error! You have two </title> end tags:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html dir="ltr" lang="en"> 
<head> 
    <title>Microsoft to acquire Nokia’s devices &amp; services business, license Nokia’s patents and mapping services</title> 
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" /> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" /> 
    </title> <!-- duplicate end tag: <title> was already closed above -->

So the fixed code should look like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html dir="ltr" lang="en"> 
<head> 
    <title>Microsoft to acquire Nokia’s devices &amp; services business, license Nokia’s patents and mapping services</title> 
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" /> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />

Source

2013-09-23 08:53:58 JadedTuna

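The HTML is loaded from an external URL, and I doubt the OP can correct the page. The goal is to extract the title from the HTML source without fixing it. –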
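For reference, the title can also be read from the unrepaired markup with the Python 3 standard library alone. This is not BeautifulSoup and not what anyone in this thread used; it is a fallback sketch, and TitleExtractor and extract_title are hypothetical names. It records the text of the first title element and ignores any stray </title> that comes later.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element; tolerates a stray </title>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        # A </title> seen while no title is open is simply ignored.
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    """Return the first title's text, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title


sample = "<head><title>Nokia deal</title><meta name='x' /></title>"
print(extract_title(sample))
```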

Yes, I added that at the end because I couldn't copy the whole source. Sorry about that, but why doesn't find work? By the way, it's BS4. – user2784753

You know what, try 'soup.find("something")' – JadedTuna
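To make the find suggestion concrete: in bs4, find returns the first matching tag or None, while find_all (findAll is the older spelling) returns a list of every match. A minimal Python 3 sketch follows, assuming bs4 is installed and using the built-in "html.parser" so that lxml is not required. first_title is a hypothetical helper name and BROKEN_HEAD is a shortened copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the malformed head from the question (stray </title> kept).
BROKEN_HEAD = """<head>
    <title>Microsoft to acquire Nokia’s devices &amp; services business, license Nokia’s patents and mapping services</title>
    <meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
    </title>
"""


def first_title(html):
    """Hypothetical helper: text of the first <title>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")  # a single Tag, or None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(BROKEN_HEAD, "html.parser")
print(len(soup.find_all("title")))  # how many <title> elements were parsed
print(first_title(BROKEN_HEAD))
```

Checking for None before calling get_text avoids an AttributeError on pages that really have no title.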
