2013-07-16 154 views
2

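Python: extracting IDs

I have successfully extracted the href URIs from a page's source using BeautifulSoup, but now I want to extract the uid value from each of the multiple instances in the example below: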

<a href="test.html?uid=5444974"> 
<a href="test.html?uid=5444972"> 
<a href="test.html?uid=54444972"> 

Any help would be greatly appreciated!

+0

If you can already extract the 'href' attribute, then [urlparse](http://docs.python.org/2/library/urlparse.html) will help you

+0

See http://stackoverflow.com/a/11281019/594589, as @DanLecocq suggested – dm03514

Answers

1
>>> import re
>>> from bs4 import BeautifulSoup
>>> html
'<a href="test.html?uid=5444974">\n<a href="test.html?uid=5444972">\n<a href="test.html?uid=54444972">'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> anchors = soup.find_all('a')
>>> r = re.compile(r'uid=(\d+)')  # raw string so \d reaches the regex engine intact
>>> uids = []
>>> for a in anchors:
...     uids.append(r.search(a['href']).group(1))
...
>>> uids
['5444974', '5444972', '54444972']
>>>
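One caveat with the loop above: `r.search(...)` returns `None` for any link whose `href` has no `uid=`, so `.group(1)` would raise `AttributeError` on such a page. Here is a minimal sketch of a guarded version, assuming the hrefs have already been pulled out as plain strings. The helper name `extract_uid`, the stricter pattern, and the `about.html` entry are made up for illustration and are not part of the answer.

```python
import re

# Anchor on ? or & so that e.g. "guid=7" is not mistaken for "uid=7".
UID_RE = re.compile(r'[?&]uid=(\d+)')

def extract_uid(href):
    """Return the uid digits from href, or None if there is no uid parameter."""
    match = UID_RE.search(href)
    return match.group(1) if match else None

hrefs = [
    "test.html?uid=5444974",
    "test.html?uid=5444972",
    "about.html",              # hypothetical link with no uid
    "test.html?uid=54444972",
]

# Keep only the links that actually carried a uid.
uids = [uid for uid in map(extract_uid, hrefs) if uid is not None]
print(uids)  # ['5444974', '5444972', '54444972']
```

With BeautifulSoup in the picture, the `hrefs` list could be built from the `href` attribute of each tag returned by `find_all('a', href=True)`.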
1

Use urlparse and parse_qs:

html = """<a href="test.html?uid=5444974"> 
<a href="test.html?uid=5444972"> 
<a href="test.html?uid=54444972"> 
""" 

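from bs4 import BeautifulSoup as BS
# Python 3 location; on Python 2 this was: from urlparse import urlparse, parse_qs
from urllib.parse import urlparse, parse_qs

soup = BS(html, 'html.parser')
# soup('a', href=True) is shorthand for soup.find_all('a', href=True)
for a in soup('a', href=True):
    # parse_qs maps each parameter to a list of values, so take the first one
    print(parse_qs(urlparse(a['href']).query)['uid'][0])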

Output:

5444974 
5444972 
54444972
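
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library alone on Python 3. Note that this swaps BeautifulSoup for `html.parser.HTMLParser`, which is not what the answer above uses. The class name `UidCollector` and the extra `about.html` link are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class UidCollector(HTMLParser):
    """Collect the uid query parameter from every <a href=...> start tag."""

    def __init__(self):
        super().__init__()
        self.uids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs
        href = dict(attrs).get("href")
        if not href:
            return
        # parse_qs maps each parameter name to a list of values
        values = parse_qs(urlparse(href).query).get("uid")
        if values:
            self.uids.append(values[0])

html = """<a href="test.html?uid=5444974">
<a href="test.html?uid=5444972">
<a href="test.html?uid=54444972">
<a href="about.html">
"""

collector = UidCollector()
collector.feed(html)
print(collector.uids)  # ['5444974', '5444972', '54444972']
```

Using `.get("uid")` instead of indexing with `['uid']` means a link without that parameter is skipped rather than raising `KeyError`.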