Using Python 2.5, I'm reading an HTML page for three different pieces of information. The way I find each piece is to locate a match with a regular expression*, then count a specific number of lines down from the matching line to reach the actual information I'm after. The problem is that I have to re-open the site three times (once for each piece of information I look up). I think this is inefficient and would like to look up all three things from a single opening of the site. Does anyone have a better method or suggestion? Is there a better approach than readlines?

*I will learn a better way, such as BeautifulSoup, but for now I need a quick fix.

Code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass

Thanks,

I found a working solution! I deleted the two redundant urlopen and readlines calls, leaving just one of each for all three loops (earlier I had deleted only the urlopen calls but left the readlines in). Here is my corrected code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker,LastDiv,AnnualDiv,LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass

Learn BeautifulSoup, it will save you a lot of time! And you shouldn't be using regular expressions on HTML... – Paco


_"The problem is that I have to re-open the site three times"_ Why is that? Doesn't lines still contain all the data you need after the first time you use it? It doesn't look like its contents get erased or anything. – Kevin
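
(For illustration, a minimal sketch of the point Kevin is making: the list returned by readlines() stays in memory, so it can be scanned any number of times; the ticker 'GOOG' here is just a placeholder:)

    import urllib2

    # One download; readlines() returns an ordinary Python list that
    # persists in memory, so it can be scanned repeatedly.
    f = urllib2.urlopen('http://dividata.com/stock/GOOG')
    lines = f.readlines()
    f.close()

    annual_hits = [l for l in lines if "Annual Dividend:" in l]
    last_hits = [l for l in lines if "Last Dividend:" in l]  # no second urlopen needed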


Actually, Kevin, your question got me thinking about a solution... – teachamantofish

Answers

def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        # one fetch, one pass: check all three labels on each line
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass

I'd use 'for i, line in enumerate(lines):' instead of looping over a range and indexing. – Blckknght
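
(A minimal sketch of that pattern, with stand-in data:)

    lines = ['<td>Annual Dividend:</td>\n', '<td>$1.23</td>\n']  # stand-in data
    for i, line in enumerate(lines):   # i is the index, line is lines[i]
        if "Annual Dividend:" in line:
            value_line = lines[i + 1]  # the value sits on the following line
            print value_line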


Sorry, the above was the original quick-and-dirty hack, not a full refactor. –


Note that lines will already contain all the lines you need, so there is no need to call f.readlines() again. Simply reuse lines.

Small hint: you can use for i, line in enumerate(lines) to loop over the lines directly while keeping the index:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass

A BeautifulSoup example for reference (Python 2 from memory: I only have Python 3 here, so some of the syntax may be a bit off):

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
f = urlopen(yoursite)   # urllib2 responses are not context managers in Python 2
soup = BeautifulSoup(f)
f.close()

for node in soup.findAll('td', attrs={'class': 'descrip'}):
    print node.text
    print node.nextSibling.nextSibling.text   # skip the whitespace node to reach the value <td>

Output (for the sample ticker 'GOOG'):

Last Close: 
$910.68 
Annual Dividend: 
N/A 
Pay Date: 
N/A 
Dividend Yield: 
N/A 
Ex-Dividend Date: 
N/A 
Years Paying: 
N/A 
52 Week Dividend: 
$0.00 
etc. 

BeautifulSoup makes it easy to work with sites that follow a predictable pattern.
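
For instance, a minimal sketch along the same lines (the fetch_dividata helper and dict layout are illustrative, not from the thread; it reuses the td class="descrip" markup and .text access shown in the answer above):

    from BeautifulSoup import BeautifulSoup
    from urllib2 import urlopen

    def fetch_dividata(ticker):
        # Fetch the page once and pair every 'descrip' label cell with
        # the value cell that follows it.
        f = urlopen('http://dividata.com/stock/%s' % ticker)
        soup = BeautifulSoup(f)
        f.close()
        data = {}
        for node in soup.findAll('td', attrs={'class': 'descrip'}):
            value = node.findNextSibling('td')  # BeautifulSoup 3 spelling
            if value is not None:
                data[node.text.strip()] = value.text.strip()
        return data

    # Hypothetical usage:
    # info = fetch_dividata('GOOG')
    # print info.get('Annual Dividend:'), info.get('Last Ex-Dividend Date:')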