2016-10-06 128 views
2

Parsing HTML with Beautiful Soup to get href

This is the result of a find_all('a') (it is very long):

</a>, <a class="btn text-default text-dark clear_filters pull-right group-ib" href="#" id="export_dialog_close" title="Cancel"><span class="glyphicon glyphicon-remove"></span><span>Cancel</span></a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:SHIPNAME/direction:asc">Vessel Name</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:TIMESTAMP_UTC/direction:asc">Timestamp</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:PORT_NAME/direction:asc">Port</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:MOVE_TYPE_NAME/direction:asc">Port Call type</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:ELAPSED/direction:asc">Time Elapsed</a>, <a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9" title="View details for: SIDER LUCK">SIDER LUCK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/163/port_name:MILAZZO/_:3525d580eade08cfdb72083b248185a9" title="View details for: MILAZZO">MILAZZO</a>, <a 
href="/en/ais/details/ships/shipid:288753/imo:9389693/mmsi:249474000/vessel:OOCL%20ISTANBUL/_:3525d580eade08cfdb72083b248185a9" title="View details for: OOCL ISTANBUL">OOCL ISTANBUL</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/17436/port_name:AMBARLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: AMBARLI">AMBARLI</a>, <a href="/en/ais/details/ships/shipid:754480/imo:9045613/mmsi:636014098/vessel:TK%20ROTTERDAM/_:3525d580eade08cfdb72083b248185a9" title="View details for: TK ROTTERDAM">TK ROTTERDAM</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/3504/port_name:DILISKELESI/_:3525d580eade08cfdb72083b248185a9" title="View details for: DILISKELESI">DILISKELESI</a>, <a href="/en/ais/details/ships/shipid:412277/imo:9039585/mmsi:353430000/vessel:SEA%20AEOLIS/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEA AEOLIS">SEA AEOLIS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/1/port_name:PIRAEUS/_:3525d580eade08cfdb72083b248185a9" title="View details for: PIRAEUS">PIRAEUS</a>, <a href="/en/ais/details/ships/shipid:346713/imo:7614599/mmsi:273327300/vessel:SOLIDAT/_:3525d580eade08cfdb72083b248185a9" title="View details for: SOLIDAT">SOLIDAT</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/883/port_name:SEVASTOPOL/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEVASTOPOL">SEVASTOPOL</a>, <a 
href="/en/ais/details/ships/shipid:752974/imo:9195298/mmsi:636011072/vessel:OCEANPRINCESS/_:3525d580eade08cfdb72083b248185a9" title="View details for: OCEANPRINCESS">OCEANPRINCESS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/21780/port_name:EREGLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: EREGLI">EREGLI</a>, <a href="/en/ais/details/ships/shipid:201260/imo:9385075/mmsi:235102768/vessel:EMERALD%20BAY/_:3525d580eade08cfdb72083b248185a9" title="View details for: EMERALD BAY">EMERALD BAY</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ships/shipid:418956/imo:9102746/mmsi:356579000/vessel:MSC%20DON%20GIOVANNI/_:3525d580eade08cfdb72083b248185a9" title="View details for: MSC DON GIOVANNI">MSC DON GIOVANNI</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/67/port_name:CONSTANTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: CONSTANTA">CONSTANTA</a>, <a href="/en/ais/details/ships/shipid:748395/imo:9460734/mmsi:622121422/vessel:WADI%20SAFAGA/_:3525d580eade08cfdb72083b248185a9" title="View details for: WADI SAFAGA">WADI SAFAGA</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/997/port_name:DAMIETTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: DAMIETTA">DAMIETTA</a> 

I want to pull out only the hrefs that start with the string /en/ais/details/ships/shipid:, for example:

<a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9" 

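I was able to replicate these examples (Find specific link w/ beautifulsoup, How to get Beautiful Soup to get link from href and class?), but I would rather not use regular expressions.

So far I have:

for i in ase:  # ase is where the find_all('a') result is stored
    print(i.get('href'))  # prints every single href

In short, my question is: how do I keep only the hrefs that contain the string I am interested in, without using a regular expression?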

Answers

3

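@elethan's answer is not the best. It finds all the links first and only then filters them. Why not get just the desired links directly, with no extra filtering? BeautifulSoup is quite capable of that: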

prefix = "/en/ais/details/ships/shipid" 
[a["href"] for a in soup("a", href=lambda x: x and x.startswith(prefix))] 

Or, instead of a function, you can pass a regular expression pattern that checks whether a string starts with the desired substring:

import re
pattern = re.compile(r"^/en/ais/details/ships/shipid") 
[a["href"] for a in soup("a", href=pattern)] 

Here ^ denotes the start of the string.

Or we can even use a CSS selector:

[a["href"] for a in soup.select('a[href^="/en/ais/details/ships/shipid"]')] 

^= is the "starts with" selector.
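To compare the three variants side by side, here is a minimal self-contained sketch. The markup is a shortened, hypothetical stand-in for the page in the question (the hrefs are abbreviated and the last tag has no href on purpose), and the stdlib "html.parser" backend is assumed:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, abbreviated from the question's output.
html = """
<a href="#" id="export_dialog_close">Cancel</a>
<a href="/en/ais/details/ships/shipid:465271/vessel:SIDER%20LUCK">SIDER LUCK</a>
<a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK">NOVOROSSIYSK</a>
<a href="/en/ais/details/ships/shipid:288753/vessel:OOCL%20ISTANBUL">OOCL ISTANBUL</a>
<a name="no-href">anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")
prefix = "/en/ais/details/ships/shipid"

# 1. Function filter: x is None for a tag without href, hence the "x and" guard.
by_function = [a["href"] for a in soup("a", href=lambda x: x and x.startswith(prefix))]

# 2. Compiled regex, anchored with ^ so it only matches at the start of href.
by_regex = [a["href"] for a in soup("a", href=re.compile("^" + re.escape(prefix)))]

# 3. CSS attribute selector with the ^= ("starts with") operator.
by_css = [a["href"] for a in soup.select('a[href^="' + prefix + '"]')]

print(by_function)
print(by_function == by_regex == by_css)
```

On this sample all three lists should contain the same two ship links; the port link, the "#" link and the tag without href are skipped.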

3

Try the following list comprehension:

[h.get('href') for h in ase if 'string' in h.get('href', '')] 

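This will give you a list of only those links that contain the substring 'string'.

Update:

As @PadraicCunningham pointed out in the comments, 'string' in h.get('href') (which was part of my original answer) will raise a TypeError if h has no 'href' key. That is unlikely, since h represents an <a> tag, but it is certainly a non-trivial possibility. To account for it, just pass .get() a default argument of '' so that it returns '' instead of None when the key is absent.

Also, I make no claim that my solution is the best; it may not be particularly efficient or elegant. But from my understanding of the OP's question, this solution will work, is small, and is easy to understand.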
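For anyone who wants to see the TypeError for themselves, a small sketch follows. The markup and the search string 'shipid' are made up for illustration; the only point is that the first <a> has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> has no href attribute at all.
html = '<a name="top">top</a><a href="/en/ais/details/ships/shipid:1">ship</a>'
ase = BeautifulSoup(html, "html.parser").find_all("a")

# Without a default, .get("href") returns None for the first tag,
# and testing "'shipid' in None" raises TypeError.
try:
    [h.get("href") for h in ase if "shipid" in h.get("href")]
    raised = False
except TypeError:
    raised = True

# With "" as the default, tags that lack href are simply skipped.
links = [h.get("href") for h in ase if "shipid" in h.get("href", "")]

print(raised)
print(links)
```

Note that `in` is a substring test, so it also accepts hrefs that contain the string somewhere other than at the start.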

+0

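Wait, you can actually do `element.get('attr')` instead of `element.attrs.get('attr')`?! That looks so much better! e: checked the docs, and it is mentioned there. Not sure how I missed it for so long. – n1c9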

+1

@n1c9 Nice, isn't it? There is a reason it's called "Beautiful" Soup, haha. – elethan

+1

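`'string' in None` -> error. You only need .get if you expect that href may not exist on the nodes you call it on. Given that you can set `href=True`, use a CSS selector, a regex and so on, there is never a reason to need .get, let alone call it twice. Also, a substring within a string is not the same as a string that starts with the substring. –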

0

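I would still suggest using a regular expression, since it is more concise and saves you another loop.

import re 
# soup is the parsed BeautifulSoup document; prefix the pattern with ^ to match only at the start of href
soup.find_all('a', href=re.compile("/en/ais/details/ships/shipid:")) 

You can find a similar solution in the documentation.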
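One caveat worth illustrating: as far as I know, Beautiful Soup tests a compiled pattern with its search() method, so a pattern without ^ matches anywhere in the href. The sketch below uses only the re module and two made-up href strings (the second one is hypothetical, just to show a match in the middle):

```python
import re

unanchored = re.compile("/en/ais/details/ships/shipid:")
anchored = re.compile("^/en/ais/details/ships/shipid:")

# Hypothetical hrefs: one starts with the prefix, one merely contains it.
ship = "/en/ais/details/ships/shipid:465271/imo:9495595"
nested = "/redirect?to=/en/ais/details/ships/shipid:465271"
hrefs = [ship, nested]

# search() finds the pattern anywhere, so the unanchored pattern keeps both.
matched_anywhere = [h for h in hrefs if unanchored.search(h)]

# With ^ the pattern only matches at the start, so only the real ship link is kept.
matched_at_start = [h for h in hrefs if anchored.search(h)]

print(matched_anywhere)
print(matched_at_start)
```

For the page in the question the two patterns probably give the same result, but the anchored one is the literal equivalent of "starts with".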