2011-09-06

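Python and Beautiful Soup - search for tag a, return the following tag b's until tag A is found

I have 2 variables, one for the last volume and the other for the last issue.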

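The HTML I am working with contains a list of all volumes and issues, most recent first.

I need to return the href links for every volume and issue that is newer than what I have on file.

So using the example below, say my last volume is 13 and my last issue is 1: I need to return the hrefs for Volume 13, Issue 2 and for Volume 14, Issue 1.

I am having a hard time with this because each volume sits in an element of its own, separate from the issue links...

This is what I have so far: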

HTML:

<ul class="bobby"> 
<li><strong>Volume 14</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>   
</li> 
<li><strong>Volume 13</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a> 
</li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a> 
</li> 
</ul> 

Script snippet:

results = soup.find('ul', attrs={'class' : 'bobby'}) 

#temp until I get it reading from file 
lastVol = '13' 
#find the last volume 
findlastVol = results.findNext('strong', text= re.compile('Volume ' + lastVol)) 

#temp until I get it reading from file 
lastIss = '2' 
#find the last issue 
findlastIss = findlastVol.findNext('a', text= re.compile('Issue ' + lastIss)) 

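So I am able to get to the tags for the last volume and issue on file, but I have had several failed attempts at traversing back up and stopping at the first issue...

Or at starting from the top and traversing down until the volume and issue condition is met...

Can someone give me some help? Thanks.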

Answers


I think you are looking for findPrevious, which you can use like this:

import BeautifulSoup 
import re 

content=''' 
<ul class="bobby"> 
<li><strong>Volume 14</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>   
</li> 
<li><strong>Volume 13</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a> 
</li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a> 
</li> 
</ul> 
''' 

last_volume=13 
last_issue=1 

soup=BeautifulSoup.BeautifulSoup(content) 
results = soup.find('ul', attrs={'class' : 'bobby'}) 
for a_string in results.findAll('a', text=re.compile('Issue')): 
    volume=a_string.findPrevious(text=re.compile('Volume')) 
    volume=int(re.search(r'(\d+)',volume).group(1)) 
    issue=int(re.search(r'(\d+)',a_string).group(1)) 
    href=a_string.parent['href'] 
    if (volume>last_volume) or (volume==last_volume and issue>last_issue):
        print(volume,issue,href)

which yields

(14, 1, u'/content/ben/cchts/2011/00000014/00000001') 
(13, 2, u'/content/ben/cchts/2010/00000013/00000002') 
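The condition in the loop above amounts to ordering (volume, issue) pairs. As a small sketch in plain Python, tuple comparison expresses the same check. The helper name `is_newer` is made up for illustration and is not part of the answer's code, and the sample pairs are typed in by hand from the HTML rather than parsed:

```python
def is_newer(volume, issue, last_volume, last_issue):
    """Return True if (volume, issue) comes after the archived pair."""
    # Tuples compare element by element, so the volume decides first
    # and the issue number only breaks ties within the same volume.
    return (volume, issue) > (last_volume, last_issue)


# (volume, issue) pairs in page order, copied by hand from the sample HTML.
entries = [(14, 1), (13, 2), (13, 1)]
newer = [pair for pair in entries if is_newer(pair[0], pair[1], 13, 1)]
print(newer)  # [(14, 1), (13, 2)]
```

Keeping the check in one place makes it easier to change later, for example if issue labels stop being plain integers.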

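Yes, I can work with that! Thank you very much. I welcome other solutions just to gain more knowledge, but this is exactly what I was hoping to get. Actually, I may need to modify it slightly, because sometimes the volume and issue can contain extra characters, for example Issue 1 Supplement 1. – Brad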
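One possible way to handle labels such as "Issue 1 Supplement 1" is to turn each label into a (number, suffix) tuple, so that the comparison still works. This is only a sketch: the function name `issue_key` is hypothetical, and the exact label format (a leading number, an optional suffix, then a comma and a date) is an assumption, since the thread shows no supplement markup.

```python
import re


def issue_key(label):
    """Turn an issue label into a (number, suffix) tuple for comparison."""
    # Assumed format: "Issue <n>[ <suffix>], <date>"; drop the date part.
    head = label.split(',')[0]
    match = re.search(r'(\d+)\s*(.*)', head)
    if match is None:
        raise ValueError('no issue number in %r' % label)
    # An empty suffix sorts before any non-empty one, so a plain
    # "Issue 1" comes before "Issue 1 Supplement 1".
    return int(match.group(1)), match.group(2).strip().lower()


print(issue_key('Issue 1, September 2011'))
print(issue_key('Issue 1 Supplement 1, March 2011'))
```

Suffixes are compared as plain strings here, so "supplement 10" would sort before "supplement 2"; if that matters, the suffix number would need to be parsed as well.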

from BeautifulSoup import BeautifulSoup 
content = '''<ul class="bobby"> 
<li><strong>Volume 14</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September  2011">Issue 1, September 2011</a>   
</li> 
<li><strong>Volume 13</strong></li> 
<li class=""> 
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a> 
</li> 
<li class=""> 
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a> 
</li> 
</ul> 
''' 
soup = BeautifulSoup(content) 
soup.prettify() 
last_vol = 13 
last_issue = 1 

res = soup.find('ul',{"class":"bobby"}) 
lis = res.findAll('li') 
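for j in lis: 
    # an <li> that holds a <strong> is a volume heading; [7:] skips 'Volume '
    if j.find('strong') is not None: 
        vol = int(j.contents[0].string[7:]) 
    # otherwise it is an issue entry; [33:] is the issue number ending the href
    elif (vol > last_vol) or (vol == last_vol and int(j.contents[1]['href'][33:]) > last_issue): 
        print "Volume\t:%d" % vol 
        print j.contents[1].string 
        print "href\t:%s" % j.contents[1]['href'] 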

which gives

 
Volume :14 
Issue 1, September 2011 
href :/content/ben/cchts/2011/00000014/00000001 
Volume :13 
Issue 2, December 2010 
href :/content/ben/cchts/2010/00000013/00000002
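The other idea raised in the question, starting at the top and walking down until the archived volume and issue is reached, could be sketched as follows. To keep the example runnable on Python 3 without third-party packages, it swaps Beautiful Soup for the standard library's `html.parser`. The class name `NewIssueCollector` is made up for illustration, and the sketch assumes the page lists entries newest first and contains no other `<strong>` or `<a>` elements.

```python
import re
from html.parser import HTMLParser


class NewIssueCollector(HTMLParser):
    """Collect (volume, issue, href) top-down until the archived issue."""

    def __init__(self, last_volume, last_issue):
        super().__init__()
        self.last = (last_volume, last_issue)
        self.volume = None      # volume from the most recent <strong> heading
        self.in_strong = False
        self.href = None        # href of the <a> currently open, if any
        self.done = False       # set once the archived issue is reached
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.in_strong = True
        elif tag == 'a':
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'strong':
            self.in_strong = False
        elif tag == 'a':
            self.href = None

    def handle_data(self, data):
        number = re.search(r'\d+', data)
        if self.done or number is None:
            return
        if self.in_strong:
            self.volume = int(number.group())
        elif self.href is not None and self.volume is not None:
            key = (self.volume, int(number.group()))
            if key <= self.last:
                self.done = True   # everything below is already archived
            else:
                self.found.append(key + (self.href,))


SAMPLE = '''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class=""><a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, September 2011</a></li>
<li><strong>Volume 13</strong></li>
<li class=""><a href="/content/ben/cchts/2010/00000013/00000002">Issue 2, December 2010</a></li>
<li class=""><a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, November 2011</a></li>
</ul>
'''

collector = NewIssueCollector(last_volume=13, last_issue=1)
collector.feed(SAMPLE)
collector.close()
for volume, issue, href in collector.found:
    print(volume, issue, href)
```

The parser still reads the rest of the page after the archived issue is found; the `done` flag only stops it from collecting further entries.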