2017-04-01 38 views
0

Extracting text from links in Python

I have a Python 2.7 script that scrapes the table on this page: http://www.the-numbers.com/movie/budgets/all

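I want to extract every column. The problem is that my code does not pick up the columns whose cells contain links (the 2nd and 3rd columns).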

budgeturl = "http://www.the-numbers.com/movie/budgets/all" 
s = urllib.urlopen(budgeturl).read() 
htmlpage = etree.HTML(s) 
htmltable = htmlpage.xpath("//td[@class='data']/text()") 

With this code, htmltable[0] is the rank, htmltable[1] is the production budget, and so on. For the columns I am missing, I need the text, not the link.

+0

Can you just grab the text without specifying class='data'? It looks like the other td elements have no class. –

+0

I don't know how to do that – MovieBall

Answers

1

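You need to change your XPath, because not all td elements have class="data": the cells that wrap their content in a link carry no class attribute. The double slash before text() also matches text nested inside child elements such as <a>. Try this XPath expression: //td//text()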

import urllib 
from lxml import etree 

budgeturl = "http://www.the-numbers.com/movie/budgets/all" 
s = urllib.urlopen(budgeturl).read() 
htmlpage = etree.HTML(s) 
htmltable = htmlpage.xpath("//td//text()") 
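A flat list from //td//text() loses the row boundaries, so it may help to group the cells per row. The following is only a minimal sketch, assuming lxml is installed; the `sample` string is a hand-written snippet modeled on one row of the table (not fetched from the live page), and `rows_as_text` is a hypothetical helper name. It is written for Python 3 but uses only lxml calls that also exist on Python 2.7.

```python
from lxml import etree

# Hand-written sample modeled on one row of the table (an assumption;
# the live page may be structured differently).
sample = """
<table>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
</tr>
</table>
"""

page = etree.HTML(sample)

# The original query: only direct text children of class="data" cells,
# so the two linked cells (date and title) are skipped.
data_only = page.xpath("//td[@class='data']/text()")


def rows_as_text(root):
    """Return one list of cell strings per table row (hypothetical helper)."""
    rows = []
    for tr in root.xpath("//tr"):
        # string() concatenates all text inside the cell, including text
        # nested in <a> or <b>, without returning the href itself.
        cells = [td.xpath("string()").strip() for td in tr.xpath("./td")]
        if cells:  # skip rows that hold no <td> cells (e.g. header rows)
            rows.append(cells)
    return rows


print(data_only)
print(rows_as_text(page))
```

Each inner list then holds all six columns of a row, so the date and the movie title sit at indexes 1 and 2.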


1
import urllib 

budgeturl = "http://www.the-numbers.com/movie/budgets/all" 
s = urllib.urlopen(budgeturl).read() 

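def find_between(s, first, last): 
    # Return the part of s between the first occurrence of `first` and the
    # next occurrence of `last`, or "" if either marker is missing.
    try: 
        start = s.index(first) + len(first) 
        end = s.index(last, start) 
        return s[start:end] 
    except ValueError: 
        return "" 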

s = find_between(s, '<table>', '</table>') 

print s[:500] 
print '.............................................................' 
print s[-250:] 

Find string between two substrings

Output:

>>> 
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr> 
<tr><td class="data">1</td> 
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td> 
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td> 
<td class="data">$425,000,000</td> 
<td class="data">$760,507,625</td> 
<td class="data">$2,783,918,982</td> 
<tr> 
<tr><td class="data">2</td> 
<td><a href="/box-office-chart/daily/2015/12/18">12/18/2015</a></td> 
............................................................. 
</td> 
<td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td> 
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></td> 
<td class="data">$1,100</td> 
<td class="data">$181,041</td> 
<td class="data">$181,041</td> 
<tr> 


"I need the text, not the link."

Converted with http://www.convertcsv.com/html-table-to-csv.htm:

Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross 
1,12/18/2009,Avatar,"$425,000,000","$760,507,625","$2,783,918,982" 
8/5/2005,My Date With Drew,"$1,100","$181,041","$181,041" 
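If you would rather not paste the HTML into a website, a similar conversion can be done locally. This is only a sketch using the standard library, written for Python 3 (on Python 2.7 the parser lives in the HTMLParser module and io.StringIO expects unicode). `TableTextParser` and `table_to_csv` are hypothetical names, and `sample` is copied from the first lines of the output above, including the stray <tr> and the unclosed <b>.

```python
import csv
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def flush(self):
        """Close the current row; rows without cells are dropped."""
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.flush()  # the sample ends rows with <tr>, not </tr>
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:
                self._row = []
            self._end_cell()
            self._cell = []
        # Other tags (<a>, <b>) are ignored, so only their text is kept.

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self.flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a last row that was never closed
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = (
    '<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th>'
    '<th>Production Budget</th><th>Domestic Gross</th>'
    '<th>Worldwide Gross</th></tr>\n'
    '<tr><td class="data">1</td>\n'
    '<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>\n'
    '<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>\n'
    '<td class="data">$425,000,000</td>\n'
    '<td class="data">$760,507,625</td>\n'
    '<td class="data">$2,783,918,982</td>\n'
    '<tr>\n'
)

print(table_to_csv(sample))
```

To run it on the real page, pass in the string returned by find_between instead of `sample`.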

You can do the same with BeautifulSoup, see:

beautifulSoup html csv