import urllib

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()  # Python 2 API; in Python 3 this is urllib.request.urlopen

def find_between(s, first, last):
    # Return the text between the first `first` and the next `last`, or "" if either is missing.
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

s = find_between(s, '<table>', '</table>')
print s[:500]
print '.............................................................'
print s[-250:]
Find string between two substrings
Output:
>>>
<tr><th> </th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
<tr>
<tr><td class="data">2</td>
<td><a href="/box-office-chart/daily/2015/12/18">12/18/2015</a></td>
.............................................................
</td>
<td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td>
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></td>
<td class="data">$1,100</td>
<td class="data">$181,041</td>
<td class="data">$181,041</td>
<tr>
.........................................
I need the text, not the links.
Running the table through http://www.convertcsv.com/html-table-to-csv.htm gives:
Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
1,12/18/2009,Avatar,"$425,000,000","$760,507,625","$2,783,918,982"
8/5/2005,My Date With Drew,"$1,100","$181,041","$181,041"
You can do the same with BeautifulSoup; see:
beautifulSoup html csv
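As a rough illustration of the BeautifulSoup route, here is a minimal sketch written for Python 3. It assumes the third-party beautifulsoup4 package is installed, and it runs on a small fragment copied from the output above rather than on the live page. The names SAMPLE_HTML, table_to_rows and rows_to_csv are made up for this example. Because the page leaves a stray <tr> where </tr> should be, the sketch only looks at the direct child cells of each row and skips rows that have none; how the real page parses may differ.

```python
import csv
import io

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Fragment copied from the printed output, including the stray <tr>.
SAMPLE_HTML = """<table>
<tr><th> </th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
<tr>
</table>"""


def table_to_rows(html):
    """Return one list of cell texts per non-empty table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Direct children only, so cells of a nested (unclosed) row
        # are not counted twice.
        cells = tr.find_all(["td", "th"], recursive=False)
        if cells:
            # get_text() drops the <a> and <b> tags and keeps their text.
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; values containing commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


csv_text = rows_to_csv(table_to_rows(SAMPLE_HTML))
print(csv_text)
```

Note that no class attribute is mentioned anywhere: cells are selected by tag name only, so the linked date and title cells are picked up along with the class="data" ones.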
Can you just grab the text without specifying class='data'? It looks like the other TDs have no class. –
Not sure how to do that – MovieBall
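One possible way to grab every cell regardless of its class, sketched with only the Python 3 standard library: a small html.parser.HTMLParser subclass that reacts to the tag name and ignores attributes entirely. The class name CellText, the helper cells_by_row and the TAIL_HTML sample (taken from the tail of the output above) are assumptions made for this illustration, not part of the original code.

```python
from html.parser import HTMLParser

# Tail of the printed output: a row whose opening <tr> was cut off.
TAIL_HTML = """<td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td>
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></td>
<td class="data">$1,100</td>
<td class="data">$181,041</td>
<td class="data">$181,041</td>
<tr>
"""


class CellText(HTMLParser):
    """Collect the text of every <td>/<th>, with or without a class."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = []     # cells of the row currently being read
        self._cell = None  # text pieces of the open cell, or None

    def flush_row(self):
        if self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_starttag(self, tag, attrs):
        # attrs (including class) is deliberately ignored.
        if tag == "tr":
            self.flush_row()
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            self.flush_row()

    def handle_data(self, data):
        # Text inside <a> or <b> still arrives here, so links reduce to text.
        if self._cell is not None:
            self._cell.append(data)


def cells_by_row(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    parser.flush_row()  # keep a final row that was never closed
    return parser.rows


print(cells_by_row(TAIL_HTML))
```

A new <tr> is treated as the end of the previous row, which is what lets this cope with the page using <tr> where </tr> was meant.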