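I have successfully extracted the href URIs from a page's source code using BeautifulSoup, but I now want to extract the uid values from multiple instances like the ones in the example below: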
<a href="test.html?uid=5444974">
<a href="test.html?uid=5444972">
<a href="test.html?uid=54444972">
Any help would be greatly appreciated!
>>> import re
>>> from bs4 import BeautifulSoup
>>> html
'<a href="test.html?uid=5444974">\n<a href="test.html?uid=5444972">\n<a href="test.html?uid=54444972">'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> anchors = soup.find_all('a')
>>> r = re.compile(r'uid=(\d+)')  # raw string so \d reaches the regex engine intact
>>> uids = []
>>> for a in anchors:
...     uids.append(r.search(a['href']).group(1))
...
>>> uids
['5444974', '5444972', '54444972']
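One caveat with the regex approach: `search` returns None for an href that has no uid, and a bare `uid=` pattern would also match inside a parameter such as `guid=`. Here is a minimal sketch that guards against both, assuming the href strings have already been pulled out of the page; `UID_RE` and `find_uids` are names made up for this illustration.

```python
import re

# Require '?' or '&' right before 'uid=' so that 'guid=' does not match.
UID_RE = re.compile(r'[?&]uid=(\d+)')

def find_uids(hrefs):
    """Return the uid digits from each href that has a uid parameter (hypothetical helper)."""
    matches = (UID_RE.search(href) for href in hrefs)
    # search() returns None when there is no match, so filter those out.
    return [m.group(1) for m in matches if m]

hrefs = [
    'test.html?uid=5444974',
    'page.html?guid=12',      # not a uid parameter, skipped
    'list.html?a=1&uid=7',    # uid as a later parameter still matches
    'about.html',             # no query string, skipped
]
print(find_uids(hrefs))  # ['5444974', '7']
```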
Use urlparse and parse_qs:
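from bs4 import BeautifulSoup as BS
from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both live in the urlparse module

html = """<a href="test.html?uid=5444974">
<a href="test.html?uid=5444972">
<a href="test.html?uid=54444972">
"""

soup = BS(html, 'html.parser')
# href=True keeps only anchors that actually have an href attribute
for a in soup('a', href=True):
    # parse_qs maps each query key to a list of values, hence the [0]
    print(parse_qs(urlparse(a['href']).query)['uid'][0])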
Output:
5444974
5444972
54444972
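Since the question says the href values are already extracted, the same idea can be applied to plain href strings without touching the soup again. This is only a sketch under that assumption: `extract_uids` is a hypothetical helper name, and it skips links that carry no uid instead of raising KeyError.

```python
from urllib.parse import urlparse, parse_qs

def extract_uids(hrefs):
    """Return the first uid value of each href that has one (hypothetical helper)."""
    uids = []
    for href in hrefs:
        params = parse_qs(urlparse(href).query)
        # parse_qs maps each key to a list of values; skip hrefs without a uid.
        if 'uid' in params:
            uids.append(params['uid'][0])
    return uids

hrefs = [
    'test.html?uid=5444974',
    'test.html?uid=5444972',
    'about.html',              # no query string, so it is skipped
    'test.html?uid=54444972',
]
print(extract_uids(hrefs))  # ['5444974', '5444972', '54444972']
```

The values come back as strings; wrap them in int() if numeric IDs are needed.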
If you can extract the 'href' attribute, then urlparse (http://docs.python.org/2/library/urlparse.html) will help you –
http://stackoverflow.com/a/11281019/594589, as @DanLecocq suggested – dm03514