Capturing ids with XPath in Python from a URL source

Imagine I have content like:

cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>.... 
"""

What I want:

id1='test1' 
id2='test2' 
idn='testn'

Can you correct me?

if '<a id=' in cont: 
    ....?

Do I have to use a regular expression in Python, or is there a way to capture them with XPath?

Note: I only want the ids that appear in <a> tags.

Source

2014-11-06 MLSC

Why not use something like BeautifulSoup or lxml? – 2014-11-06 08:11:35

BeautifulSoup does seem to be a simple way to do this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – 2014-11-06 08:12:43

@Vincent Beltman If you know a reliable method, it would be welcome... – MLSC 2014-11-06 08:12:45

To get all the ids, download BS4 here: http://www.crummy.com/software/BeautifulSoup/

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

This should work:

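from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(cont, "html.parser")
for a in soup.select('a'):  # Or soup.find_all('a') if you prefer
    if a.get('id') is not None:  # Skip anchors that have no id
        print(a.get('id'))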

Or use a list comprehension to collect them:

ids = [a.get('id') for a in BeautifulSoup(cont, "html.parser").select('a') if a.get('id') is not None]
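Since the question also asks about XPath, here is a minimal sketch of that route using only the standard library. It assumes the markup is well-formed enough to parse as XML once wrapped in a made-up <root> element, which real-world HTML often is not (a library such as lxml is the more usual choice there). The names root and xpath_ids, and the extra anchor without an id, are illustrative and not from the original post.

```python
import xml.etree.ElementTree as ET

# Sample content from the question, plus one hypothetical anchor without an id
cont = (
    '<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....'
    '<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....'
    '<a href="WWWW">NO ID</a>'
)

# ElementTree needs a single root element, so wrap the fragment (assumption:
# the fragment is well-formed XML; this will raise ParseError on messy HTML)
root = ET.fromstring("<root>" + cont + "</root>")

# ".//a[@id]" selects every <a> element that carries an id attribute
xpath_ids = [a.get("id") for a in root.findall(".//a[@id]")]
print(xpath_ids)
```

ElementTree supports only a subset of XPath, but the attribute predicate used here is part of that subset, so anchors without an id are filtered out by the expression itself.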

Source

2014-11-06 08:15:42

'html' should be changed to 'cont'. I did: 'soup = BeautifulSoup(cont); for ids in soup.find_all('a'): print(ids.get('id'))' and it works well – MLSC 2014-11-06 08:18:15

@MortezaLSC, but it only shows the values 'test1', 'test2', not 'id1='test1'' – 2014-11-06 08:19:56

@Avinash Raj, thank you... no problem, I think I should put them in a list and use them from there – MLSC 2014-11-06 08:22:26
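If the id1='test1' style labels from the question are still wanted, one possible sketch is to number the collected values and store them in a dictionary rather than creating separate variables. The list below is hard-coded to the two values from the sample content, and the name labelled is only illustrative.

```python
# The ids as collected from the sample content in the question
ids = ['test1', 'test2']

# Build labels id1, id2, ... idn by numbering the values from 1
labelled = {"id{}".format(n): value for n, value in enumerate(ids, start=1)}

for label, value in labelled.items():
    print("{}={!r}".format(label, value))
```

A dictionary keeps the number of labels open-ended, which matches the idn='testn' case in the question better than a fixed set of variable names would.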

Using a list comprehension and BeautifulSoup:

>>> from bs4 import BeautifulSoup 
>>> cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>.... 
""" 
>>> soup = BeautifulSoup(cont, "html.parser") 
>>> [i.get('id') for i in soup.find_all('a') if i.get('id') is not None] 
['test1', 'test2'] 
>>> [i['id'] for i in soup.find_all('a') if i.has_attr('id')] 
['test1', 'test2']

Source

2014-11-06 08:26:03

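But there is a problem...! How can I exclude the None-type ids...? I just want to print test1 and test2. My result is now: '['test1', 'None', 'test2', 'None']' – MLSC 2014-11-06 08:57:17

What if you try this: '[i['id'] for i in soup.findAll('a') if i['id'] != 'None']' – 2014-11-06 09:01:25

It returns an error. So I changed it to: 'print [i.get('id') for i in soup.findAll('a') if i.get['id'] != 'None']', which also gives an error – MLSC 2014-11-06 09:04:04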
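Two things likely go wrong in the attempts above: in BeautifulSoup, indexing a tag with i['id'] raises a KeyError when the attribute is missing, and comparing against the string 'None' never matches the None object that .get('id') returns. The same "skip anchors without an id" filter can be sketched with the standard library alone, in case installing a parser library is not an option. The class name AnchorIdCollector and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class AnchorIdCollector(HTMLParser):
    """Collect the id attribute of every <a> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # only anchors are of interest
        anchor_id = dict(attrs).get("id")  # None when the attribute is absent
        if anchor_id is not None:  # compare with the None object, not 'None'
            self.ids.append(anchor_id)


# Hypothetical sample: two anchors with ids, one without, one non-anchor tag
sample = (
    '<a id="test1" href="AAAA">EXAMPLE1</a>'
    '<a href="BBBB">no id here</a>'
    '<a id="test2" href="VVVV">EXAMPLE2</a>'
    '<p id="para1">not an anchor</p>'
)

collector = AnchorIdCollector()
collector.feed(sample)
collector.close()
collected_ids = collector.ids
print(collected_ids)
```

The parser is event-driven, so the filtering happens while the markup is read and no None entries ever reach the list.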
