HTML scraping using lxml

I am scraping HTML with lxml.

This is one post:

<article id="post-4855" class="post-4855 post type-post status-publish format-standard hentry category-uncategorized"> 


<header class="entry-header"> 
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1> 
      <div class="entry-meta"> 
     <span class="byline"> Posted by <span class="author vcard"><a class="url fn n" href="http://aitplacements.com/author/tpoait/">TPO</a></span></span><span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span><span class="comments-link"> with <a href="http://aitplacements.com/uncategorized/cybage/#respond">0 Comment</a></span>  </div><!-- .entry-meta --> 
     </header><!-- .entry-header --> 

<div class="entry-content"> 
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p> 
     </div><!-- .entry-content -->

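The above is the inspected element for one post. For every such post, I want to extract the title, the post content, and the posting time.

For example, for the post above, the details would be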

{title: "Cybage..",
 post: "cybage placement details shared",
 datetime: "2017-09-13T11:02:32+00:00"
}

What I have achieved so far: the website requires a login, and I am logging in successfully.

For extracting the information:

import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/42.0.2311.90'}
url = 'http://aitplacements.com/news/'
page = requests.get(url, headers=headers)
doc = html.fromstring(page.content)
# print(doc)  # prints <Element html at 0x7f59c38d2260>
raw_title = doc.xpath('//h1[@class="entry-title"]/a/@href/text()')
print(raw_title)

raw_title comes back as an empty list, [].

What am I doing wrong?

Source

2017-09-13 Mandeep Singh

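You should take a look at [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It fits your needs well. Or, if you need something more advanced (e.g. a spider), you can also use scrapy. – floatingpurr

I was getting the empty value because I was being logged out; that problem is fixed now. –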

@href refers to the value of the href attribute:

In [14]: doc.xpath('//h1[@class="entry-title"]/a/@href') 
Out[14]: ['http://aitplacements.com/uncategorized/cybage/']

You want the text of the <a> element instead:

In [16]: doc.xpath('//h1[@class="entry-title"]/a/text()') 
Out[16]: ['Cybage..']

So use

raw_title = doc.xpath('//h1[@class="entry-title"]/a/text()') 
if len(raw_title) > 0: 
    raw_title = raw_title[0] 
else: 
    # handle the case of missing title 
    raise ValueError('Missing title')
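The same idea extends to the other two fields the question asks for. The following is only a sketch run against a trimmed copy of the markup from the question. The `first` and `extract_posts` helpers and the `SAMPLE` string are names invented here for illustration, and the XPath for the post body assumes the text sits directly inside the first `<p>` of `entry-content`, as it does in the sample.

```python
from lxml import html

# Trimmed copy of the post markup shown in the question (illustration only).
SAMPLE = """
<article id="post-4855" class="post-4855 post type-post hentry">
  <header class="entry-header">
    <h1 class="entry-title"><a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark">Cybage..</a></h1>
    <div class="entry-meta">
      <span class="posted-on"> on <a href="http://aitplacements.com/uncategorized/cybage/" rel="bookmark"><time class="entry-date published updated" datetime="2017-09-13T11:02:32+00:00">September 13, 2017</time></a></span>
    </div>
  </header>
  <div class="entry-content">
    <p>cybage placement details shared <a href="http://aitplacements.com/uncategorized/cybage/" class="read-more">READ MORE</a></p>
  </div>
</article>
"""


def first(values, default=None):
    """Return the first XPath result with whitespace stripped, or a default."""
    return values[0].strip() if values else default


def extract_posts(doc):
    """Collect title, post text and datetime for every <article> in doc."""
    posts = []
    for article in doc.xpath('//article'):
        posts.append({
            # Text of the <a> inside the title heading, not its href.
            'title': first(article.xpath('.//h1[@class="entry-title"]/a/text()')),
            # Text placed directly in the <p>, which excludes the READ MORE link.
            'post': first(article.xpath('.//div[@class="entry-content"]/p/text()')),
            # Here the attribute value itself is what we want.
            'datetime': first(article.xpath('.//time/@datetime')),
        })
    return posts


posts = extract_posts(html.fromstring(SAMPLE))
print(posts)
```

The relative paths (starting with `.//`) keep each lookup inside its own `<article>`, so the fields of different posts do not get mixed when the page lists several of them.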

Source

2017-09-13 13:26:15 unutbu

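Why do I get an empty 'raw_title' when doc does fetch the page? –

If you are not sure what 'doc' parsed, print out 'LH.tostring(doc, pretty_print=True)' (or write it to a file and inspect it there). The reason you get an empty 'raw_title' is that 'a/@href/text()' is looking for text attached to the 'href' attribute. There is none. The text is attached to the '<a>' element. – unutbu

The problem was that I was being logged out again; that is solved now. –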
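A small sketch of that check, assuming 'LH' stands for 'import lxml.html as LH' (the comment does not show the import) and using a one-line fragment in place of the real page:

```python
import lxml.html as LH  # assumed meaning of the LH alias used in the comment

# Stand-in for the real page: just the title heading from the question.
doc = LH.fromstring(
    '<h1 class="entry-title">'
    '<a href="http://aitplacements.com/uncategorized/cybage/">Cybage..</a>'
    '</h1>'
)

# Dump what lxml actually parsed; tostring returns bytes by default.
print(LH.tostring(doc, pretty_print=True).decode('utf-8'))

# An attribute node has no child text nodes, so this selects nothing.
via_attr = doc.xpath('//h1[@class="entry-title"]/a/@href/text()')

# The text node belongs to the <a> element itself.
via_elem = doc.xpath('//h1[@class="entry-title"]/a/text()')

print(via_attr)
print(via_elem)
```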
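Since a logged-out response also ends in an empty result, it can help to fail loudly before running the field queries. This is only a sketch: 'parse_news' is a hypothetical helper, and it assumes the page served to a logged-out visitor contains no <article> elements, which has not been verified against the real site.

```python
from lxml import html


def parse_news(page_bytes):
    """Parse the news page and return its <article> elements.

    Assumption: a logged-out response has no <article> elements, so an
    empty match is treated as a session problem, not as "no posts".
    """
    doc = html.fromstring(page_bytes)
    articles = doc.xpath('//article')
    if not articles:
        raise RuntimeError(
            'No <article> elements found; the session may have been logged out'
        )
    return articles


# Usage with the question's variables would be: parse_news(page.content)
```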
