2011-02-09

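AttributeError when creating strings from lists

I'm writing a data-scraping script for a class. The data I have to scrape requires me to use while loops to sort the values into separate arrays, i.e. one for the states, one for the SAT averages, and so on.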

However, once I set up the while loops, the regex that strips most of the HTML tags from the data broke, and all I get is this error:

AttributeError: 'NoneType' object has no attribute 'groups'

My code is:

import re, util 
from BeautifulSoup import BeautifulStoneSoup 

# create a comma-delineated file 
delim = ", " 

#base url for sat data 
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm" 

#get webpage object for site 
soup = util.mysoupopen(base) 

#get column headings 
colCols = soup.findAll("td", {"class":"vaTextBold"}) 

#get data 
dataCols = soup.findAll("td", {"class":"vaText"}) 

#append data to cols 
for i in range(len(dataCols)): 
    colCols.append(dataCols[i]) 

#open a csv file to write the data to 
fob=open("sat.csv", 'a') 

#initiate the 5 arrays 
states = [] 
participate = [] 
math = [] 
read = [] 
write = [] 

#split into 5 lists for each row 
for i in range(len(colCols)): 
    if i%5 == 0: 
        states.append(colCols[i]) 

i=1 
while i<=250: 
    participate.append(colCols[i]) 
    i = i+5 

i=2 
while i<=250: 
    math.append(colCols[i]) 
    i = i+5 

i=3 
while i<=250: 
    read.append(colCols[i]) 
    i = i+5 

i=4 
while i<=250: 
    write.append(colCols[i]) 
    i = i+5 

#write data to the file 
for i in range(len(states)): 
    states = str(states[i]) 
    participate = str(participate[i]) 
    math = str(math[i]) 
    read = str(read[i]) 
    write = str(write[i]) 

    #regex to remove html from data scraped 

    #remove <td> tags 
    line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<",  participate).groups()[0]+ delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0] 

    #append data point to the file 
    fob.write(line) 

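Any ideas why this error is suddenly popping up? The regex worked fine until I tried to split the data into separate lists. I have tried printing the various strings inside the final "for" loop to see whether any of them is None for the first value of i (0), but they are all the strings they are supposed to be.

Any help would be greatly appreciated!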

Answer

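It looks like the regex search is failing on one of the strings, so re.search() returns None instead of a MatchObject.

Try the following in place of the very long line under the #remove <td> tags comment: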
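To illustrate how a failed search leads to that error, here is a minimal sketch using the same pattern as the question. The two input strings are made up for illustration and are not taken from the scraped page.

```python
import re

pattern = ">(.*)<"

# A string shaped like a table cell contains both ">" and "<",
# so re.search() returns a match object.
matched = re.search(pattern, '<td class="vaText">512</td>')
print(matched.group(1))

# A string with no ">" or "<" cannot match, so re.search() returns None.
# Calling .groups() on that None raises the AttributeError from the question.
unmatched = re.search(pattern, "t")
print(unmatched is None)
```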

import sys  # needed for sys.exit() below

out_list = []
for item in (states, participate, math, read, write):
    try:
        out_list.append(re.search(">(.*)<", item).groups()[0])
    except AttributeError:
        # re.search() returned None, so this string did not match
        print "Regex match failed on", item
        sys.exit()
line = delim.join(out_list)

That way you can find out which string your regex is failing on.

Also, I'd recommend using .group(1) instead of .groups()[0]. The former is more explicit.
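One possible culprit, judging only from the posted loop: a line such as `states = str(states[i])` rebinds the list name to a string, so on the second pass `states[1]` would be a single character rather than a table cell. That would fit the observation that everything looks right for i = 0. The sketch below reproduces the effect with made-up cells and hypothetical function names; it is not the original data or the asker's exact code.

```python
import re

# Made-up stand-ins for the scraped cells; the real values come from the page.
states = ['<td class="vaText">Alabama</td>', '<td class="vaText">Alaska</td>']


def strip_tags_buggy(cells):
    """Mimic the posted loop: the list name is rebound to a string."""
    results = []
    for i in range(len(cells)):
        cells = str(cells[i])  # cells is now a str, no longer the list
        match = re.search(">(.*)<", cells)
        results.append(match.group(1) if match else None)
    return results


def strip_tags_fixed(cells):
    """Keep the list intact by storing each string under a separate name."""
    results = []
    for cell in cells:
        text = str(cell)
        match = re.search(">(.*)<", text)
        results.append(match.group(1) if match else None)
    return results


print(strip_tags_buggy(states))
print(strip_tags_fixed(states))
```

If this is what is happening, using distinct names for the per-row strings (for example `state` for the string and `states` for the list) should be enough to keep the regex working on every row.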
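As a small illustration of that last point, both calls return the same captured text when the pattern has one group. The input string here is a made-up example.

```python
import re

match = re.search(">(.*)<", "<td>Alaska</td>")

# groups() returns a tuple of every captured group; [0] picks the first one.
all_groups = match.groups()
first_via_groups = all_groups[0]

# group(1) asks for the first captured group directly.
first_via_group = match.group(1)

print(first_via_groups == first_via_group)
```

With group(1) the group number appears in the call itself, which arguably reads more clearly once a pattern grows to several groups.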