Removing unwanted HTML from href tags in Python

I want to scrape a list of links. Because of how the HTML is structured, I can't get them directly with BeautifulSoup.

start_list = soup.find_all(href=re.compile('id=')) 

print(start_list) 

[<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>, 
<a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>]

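I only want to pull out the href information. I was thinking of some kind of filter: put all the bold (<b>) markup into one list, then filter it out of the other list that contains the output above.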

start_list = soup.find_all('a', href=re.compile('id=')) 

start_list_soup = BeautifulSoup(str(start_list), 'html.parser') 

things_to_remove = start_list_soup.find_all('b')

The idea is to loop over things_to_remove and delete every occurrence of its contents from start_list.


2017-01-02 Chace Mcguyer

Please post your desired output.

start_list = soup.find_all(href=re.compile('id=')) 

href_list = [i['href'] for i in start_list]

href is an attribute of the tag. If you use find_all to get a bunch of tags, just iterate over them and use tag['href'] to access that attribute.
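As a minimal end-to-end sketch of the approach above: the HTML string below reuses the two links from the question, plus a hypothetical third link (/about.htm, not in the original page) added only to show that the 'id=' filter skips it. It assumes the bs4 package is installed.

```python
import re

from bs4 import BeautifulSoup

# Sample markup: the first two links come from the question;
# the third is a made-up link without 'id=' to show the filter at work.
html = '''
<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>
<a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>
<a href="/about.htm">About</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> tags whose href contains 'id='.
start_list = soup.find_all('a', href=re.compile('id='))

# Read the href attribute of each tag; the nested <b> markup is never touched.
href_list = [tag['href'] for tag in start_list]

print(href_list)
# ['/movies/?id=actofvalor.htm', '/movies/?id=actionjackson.htm']
```

Because only the attribute value is read, there is no need to strip the <b> elements first.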

To understand why [] is used, you should know that a tag's attributes are stored in a dictionary. From the documentation:

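A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
tag['class'] 
# u'boldest' 
You can access that dictionary directly as .attrs:
tag.attrs 
# {u'class': u'boldest'} 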
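The same dictionary-style access applies to the href attribute from the question. The sketch below is illustrative only and assumes bs4 is installed; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# One link taken from the question's output.
soup = BeautifulSoup(
    '<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>',
    'html.parser',
)
tag = soup.a

# Dictionary-style lookup on the tag itself.
href_by_key = tag['href']

# The same value, read from the underlying attribute dictionary.
href_from_attrs = tag.attrs['href']

# .get() returns None instead of raising KeyError when an attribute is absent.
missing_title = tag.get('title')

# The visible link text, with the <b> markup stripped.
link_text = tag.get_text()

print(href_by_key, href_from_attrs, missing_title, link_text)
```

Using tag.get('href') instead of tag['href'] may be the safer choice when some matched tags might lack the attribute.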

List comprehensions are simple; you can refer to the PEP on them. In this case, the same thing can be done with a for loop:

href_list = [] 
for i in start_list: 
    href_list.append(i['href'])
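To make the equivalence concrete without needing a parsed page, the sketch below uses plain dictionaries as stand-ins for tags (an assumption for illustration; real Tag objects support the same ['href'] lookup). In i['href'], i is the loop variable and the brackets are an ordinary key lookup.

```python
# Plain dicts standing in for BeautifulSoup tags (illustrative only).
start_list = [
    {'href': '/movies/?id=actofvalor.htm'},
    {'href': '/movies/?id=actionjackson.htm'},
]

# List comprehension: evaluate i['href'] for each item and collect the results.
comprehension_result = [i['href'] for i in start_list]

# Equivalent explicit loop.
loop_result = []
for i in start_list:
    loop_result.append(i['href'])

print(comprehension_result == loop_result)
# True
```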


2017-01-02 02:53:40

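This is exactly what I needed. Could you explain the list comprehension to me?

Specifically: the i['href'] part. Why is it in brackets?

@Chace Mcguyer Please accept this answer to close the question.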
