2013-09-21 68 views

Find a specific part of a link (href) in BeautifulSoup

I have a string similar to this:

[<tr><td><big>Motion Picture Sound Editors, USA</big></td></tr>, <tr><th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th></tr>, <tr><td align="center" rowspan="2" valign="middle"><a href="/Sections/Awards/Motion_Picture_Sound_Editors_USA/2010">2010 </a></td><td align="center" rowspan="2" valign="middle"><b>Nominated</b></td><td align="center" rowspan="2" valign="middle">Golden Reel Award</td><td valign="top">Best Sound Editing - Dialogue and ADR in a Feature Film<a href="/name/nm0613398/">Piero Mura</a> (supervising sound editor)<a href="/name/nm0919527/">Christopher T. Welch</a> (supervising dialogue/adr editor)<a href="/name/nm0270704/">Julie Feiner</a> (dialogue editor)<a href="/name/nm0827953/">Beth Sterner</a> (dialogue editor)<a href="/name/nm2628443/">Judah Getz</a> (adr mixer)</td></tr>, <tr><td valign="top">Best Sound Editing - Music in a Feature Film<a href="/name/nm1084134/">Jen Monnar</a> (supervising music editor)</td></tr>, <tr><td colspan="4"> </td></tr>, <tr><td align="center" bgcolor="#ffffdb" colspan="4" valign="top"></td></tr>] 

From it I pull this information:

[[u'2010 '], [u'Nominated'], [u'Golden Reel Award'], [u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Piero Mura', u' (supervising sound editor)', u'Christopher T. Welch', u' (supervising dialogue/adr editor)', u'Julie Feiner', u' (dialogue editor)', u'Beth Sterner', u' (dialogue editor)', u'Judah Getz', u' (adr mixer)']] 

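For each name, I would like to get only the nm####### part of the link. Any ideas on how I can do this while keeping things so that I can associate the name with its nm#? (i.e. Piero Mura is associated with nm0613398)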

I currently have it pulled out like this:

(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Piero Mura') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Christopher T. Welch') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Julie Feiner') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Beth Sterner') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Dialogue and ADR in a Feature Film', u'Judah Getz') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Music in a Feature Film', u'Jen Monnar') 
(u'Motion Picture Sound Editors, USA', u'2010 ', u'Nominated', u'Golden Reel Award', u'Best Sound Editing - Music in a Feature Film', u' (supervising music editor)') 

using this code:

award_rows = award_soup.findAll("tr")
award_data = [[td.findChildren(text=True) for td in tr.findAll("td")] for tr in award_rows]
for data in award_data:
    categ = []
    # The first row holds the name of the award show
    if data == award_data[0]:
        award_show = ''.join(data[0])
    # Rows with four cells: year, result, award, category/recipients
    if len(data) == 4 and data != award_data[0]:
        categ = data[3]
        for cat in categ:
            if cat == '&nbsp;':
                cat = ''
            # Skip the category text itself and the " (role)" fragments
            if cat != categ[0] and len(categ) != 1 and cat[0:2] != ' (':
                award_shows.append(award_show)
                years.append(''.join(data[0]))
                results.append(''.join(data[1]))
                awards.append(''.join(data[2]))
                categories.append(''.join(categ[0].replace('&nbsp;','')))
                recipients.append(cat)
                print(data)
            elif cat != categ[0] and len(categ) == 1:
                award_shows.append(award_show)
                years.append(''.join(data[0]))
                results.append(''.join(data[1]))
                awards.append(''.join(data[2]))
                categories.append(''.join(categ[0].replace('&nbsp;','')))
                recipients.append('')

Answer


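You can search for all <a> links whose href contains the substring nm followed by digits. Extract that part and save it in a hash (a dict), keyed by the link text: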

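from bs4 import BeautifulSoup
import re

# 'xmlfile' is the answerer's file name; the 'xml' parser needs lxml installed
with open('xmlfile', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

data = []
# Match only <a> tags whose href contains "nm" followed by digits
for a in soup.find_all('a', attrs={"href": re.compile(r"nm\d+")}):
    s = re.search(r'nm\d+', a['href']).group(0)
    data.append({a.text: s})

print(data)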

It yields:

[{'Piero Mura': 'nm0613398'}, 
{'Christopher T. Welch': 'nm0919527'}, 
{'Julie Feiner': 'nm0270704'}, 
{'Beth Sterner': 'nm0827953'}, 
{'Judah Getz': 'nm2628443'}, 
{'Jen Monnar': 'nm1084134'}] 
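
If each ID also needs to stay tied to its award category, as in the rows the question builds, one option is to walk the table cells one at a time. The following is only a minimal sketch and not part of the original answer: the helper `recipients_by_category`, the names `HTML` and `NM_ID`, and the inlined markup (a trimmed copy of two rows from the question) are illustrative assumptions. It also assumes that the category text comes before the first link in each cell, and it uses the built-in `html.parser` so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of two rows from the question, inlined so the sketch runs on its own.
HTML = """
<table>
<tr><td valign="top">Best Sound Editing - Dialogue and ADR in a Feature Film<a href="/name/nm0613398/">Piero Mura</a> (supervising sound editor)<a href="/name/nm0919527/">Christopher T. Welch</a> (supervising dialogue/adr editor)</td></tr>
<tr><td valign="top">Best Sound Editing - Music in a Feature Film<a href="/name/nm1084134/">Jen Monnar</a> (supervising music editor)</td></tr>
</table>
"""

# Hypothetical constant: the "nm" + digits pattern used in the answer above.
NM_ID = re.compile(r"nm\d+")


def recipients_by_category(html):
    """Return (category, name, nm_id) tuples for every name link in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td"):
        links = td.find_all("a", href=NM_ID)
        if not links:
            continue  # this cell has no /name/nm.../ links
        # Assumption: the category is the first piece of text in the cell.
        category = td.find(string=True).strip()
        for a in links:
            nm_id = NM_ID.search(a["href"]).group(0)
            rows.append((category, a.get_text(strip=True), nm_id))
    return rows


for row in recipients_by_category(HTML):
    print(row)
```

Each tuple can then be appended to the `categories` and `recipients` lists from the question, with the ID going into a third list of its own.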

Works beautifully :-) Thanks! – rjbogz