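Extract text from a link in Python

I am scraping the table on this page with a Python 2.7 script: http://www.the-numbers.com/movie/budgets/all

I want to extract every column. The problem is that my code does not pick up the columns that contain links (the second and third columns).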

import urllib
from lxml import etree

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//td[@class='data']/text()")

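With this code, htmltable[0] is the rank, htmltable[1] is the production budget, and so on from there. What am I missing? I need the link text, not the link itself.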

Source

2017-04-01 MovieBall

Can you just grab the text without specifying class='data'? It looks like the other td elements have no class.

I'm not sure how to do that. – MovieBall

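You need to change your XPath, because not all td elements have class="data". The date and movie cells have no class attribute, and their text sits inside an <a> element rather than directly inside the td. Try this XPath expression: //td//text().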

import urllib 
from lxml import etree 

budgeturl = "http://www.the-numbers.com/movie/budgets/all" 
s = urllib.urlopen(budgeturl).read() 
htmlpage = etree.HTML(s) 
htmltable = htmlpage.xpath("//td//text()")

Output:
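The expression returns one flat list of strings, cell after cell in document order, so the row boundaries are lost. If rows matter, one option is to iterate over the tr elements and take the string value of each td. The following is a minimal sketch of that idea, not the answerer's code: the function name table_rows and the inline SAMPLE markup are assumptions for illustration (the rows are copied from the page excerpt quoted in this thread and tidied so the tags are closed), and it avoids the network request so it also runs on Python 3.

```python
from lxml import etree

# Illustrative sample, kept inline so the sketch needs no network access.
# The second row's rank cell is not shown in the thread's excerpt, so it is
# left out here rather than guessed.
SAMPLE = """
<table>
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td></tr>
<tr><td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td>
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></b></td>
<td class="data">$1,100</td>
<td class="data">$181,041</td>
<td class="data">$181,041</td></tr>
</table>
"""


def table_rows(html):
    """Return one list of cell strings per <tr> that contains <td> cells."""
    root = etree.HTML(html)
    rows = []
    # tr[td] skips the header row, which only has <th> cells.
    for tr in root.xpath("//tr[td]"):
        # string(.) concatenates all descendant text of a cell, so text
        # nested inside <a> or <b> is included, with or without a class.
        rows.append([td.xpath("string(.)").strip() for td in tr.xpath("./td")])
    return rows


rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```

On the live page the same function would be called with the downloaded HTML string instead of SAMPLE.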

Source

2017-04-01 23:35:38 vold

import urllib 

budgeturl = "http://www.the-numbers.com/movie/budgets/all" 
s = urllib.urlopen(budgeturl).read() 

def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

s = find_between(s, '<table>', '</table>') 

print s[:500] 
print '.............................................................' 
print s[-250:]

Find string between two substrings

Returns:

>>> 
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr> 
<tr><td class="data">1</td> 
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td> 
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td> 
<td class="data">$425,000,000</td> 
<td class="data">$760,507,625</td> 
<td class="data">$2,783,918,982</td> 
<tr> 
<tr><td class="data">2</td> 
<td><a href="/box-office-chart/daily/2015/12/18">12/18/2015</a></td> 
............................................................. 
</td> 
<td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td> 
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></td> 
<td class="data">$1,100</td> 
<td class="data">$181,041</td> 
<td class="data">$181,041</td> 
<tr>

.........................................

"I need the text, not the link."

Via http://www.convertcsv.com/html-table-to-csv.htm

Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross 
1,12/18/2009,Avatar,"$425,000,000","$760,507,625","$2,783,918,982" 
8/5/2005,My Date With Drew,"$1,100","$181,041","$181,041"

You can do the same with BeautifulSoup; see:

beautifulSoup html csv
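For comparison, a similar table-to-CSV conversion can be sketched with only the standard library's html.parser and csv modules, used here in place of BeautifulSoup. This is an illustration under assumptions, not the linked answer's code: the names TableTextParser and table_to_csv are made up for the example, the SAMPLE markup is the first part of the excerpt printed above (including its unclosed <b> and its stray <tr> where a </tr> would be expected), and it is written for Python 3.

```python
import csv
import io
from html.parser import HTMLParser

# Illustrative sample: header row and first data row from the excerpt above.
SAMPLE = """<table>
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
<tr>
</table>
"""


class TableTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> also ends the previous row, which copes with
            # markup that never closes its rows properly.
            self._flush_row()
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <b> still arrives here,
        # so link text is captured while the href is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:  # skip empty rows produced by stray <tr> tags
            self.rows.append(self._row)
        self._row = None


def table_to_csv(html):
    """Return the table's cell text as CSV, one line per table row."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


print(table_to_csv(SAMPLE))
```

The csv module quotes the dollar amounts automatically because they contain commas.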

Source

2017-04-01 17:52:09 litepresence
