Python - BeautifulSoup, getting a tag within a tag


I have a td tag here:

<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>

I want the value of the href attribute inside it, that is, the .htm link:

<a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a>

The table rows look like this:

<tr> 
<td scope="row">1</td> 
<td scope="row">10-K</td> 
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td> 
<td scope="row">10-K</td> 
<td scope="row">2724989</td> 
</tr> 
<tr class="blueRow"> 
<td scope="row">2</td> 
<td scope="row">EXHIBIT 21.1</td> 
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm">exhibit211q42016.htm</a></td> 
<td scope="row">EX-21.1</td> 
<td scope="row">21455</td> 
</tr> 
<tr> 
<td scope="row">3</td> 
<td scope="row">EXHIBIT 23.1</td> 
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit231q42016.htm">exhibit231q42016.htm</a></td> 
<td scope="row">EX-23.1</td> 
<td scope="row">4354</td> 
</tr>

The code I use to fetch and parse the page:

import requests
from bs4 import BeautifulSoup

base_url = ("https://www.sec.gov/Archives/edgar/data/1085621/000108562117000004/"
            "0001085621-17-000004-index.htm")
response = requests.get(base_url)
base_data = response.content
base_soup = BeautifulSoup(base_data, "html.parser")


2017-07-06 Theo

You can use find_all to get all the td tags first, then search each of those tags for anchors:

links = []
for tag in base_soup.find_all('td', {'scope': 'row'}):
    for anchor in tag.find_all('a'):
        links.append(anchor['href'])

print(links)

Output:

['/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm', 
'/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm', 
... 
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_lab.xml', 
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml']

If you want, you can write a small filter to drop the links that do not end in .htm:

filtered_links = list(filter(lambda x: x.endswith('.htm'), links))
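The hrefs collected above are site-relative, so they cannot be passed to requests.get directly. A minimal sketch of filtering and resolving them against the page URL, assuming the goal is to download the documents afterwards; the shortened `links` list and the name `absolute_htm_links` are only for illustration:

```python
from urllib.parse import urljoin

base_url = ("https://www.sec.gov/Archives/edgar/data/1085621/000108562117000004/"
            "0001085621-17-000004-index.htm")

# Shortened sample of the relative hrefs collected by the loop above.
links = [
    "/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm",
    "/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml",
]

# Keep only the .htm links and resolve each one against the page URL.
absolute_htm_links = [
    urljoin(base_url, href) for href in links if href.endswith(".htm")
]
print(absolute_htm_links)
```

The list comprehension does the same job as the filter/lambda version; which one to use is a matter of taste.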

To get only the first link, here is a slightly different version that fits your use case.

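link = None
for tag in base_soup.find_all('td', {'scope': 'row'}):
    children = tag.findChildren()
    if len(children) > 0:
        try:
            # Take the href of the first child and stop at the first match.
            link = children[0]['href']
            break
        except KeyError:
            # The first child has no href attribute; try the next cell.
            continue

print(link)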

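Because the loop breaks on the first match, this prints the first link in the table, which for the rows shown in the question is '/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm'.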
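As an alternative to the nested loops, BeautifulSoup's CSS selector methods can match "an anchor inside a td with scope=row" in a single expression. The following is only a sketch: it runs against a shortened copy of the rows from the question rather than the live page, and the names `html`, `soup`, and `first_link` are illustrative.

```python
from bs4 import BeautifulSoup

# A shortened copy of the rows from the question, so the sketch runs offline.
html = """
<table>
<tr>
<td scope="row">1</td>
<td scope="row">10-K</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>
</tr>
<tr class="blueRow">
<td scope="row">2</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm">exhibit211q42016.htm</a></td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One pass: every <a> with an href that sits inside a <td scope="row">.
links = [a["href"] for a in soup.select('td[scope="row"] a[href]')]

# First .htm link only; select_one returns None when nothing matches.
first = soup.select_one('td[scope="row"] a[href$=".htm"]')
first_link = first["href"] if first is not None else None

print(links)
print(first_link)
```

With the real page, the same selectors could be applied to base_soup. The `[href$=".htm"]` part filters on the link suffix inside the selector, so no separate filter step is needed.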


2017-07-06 20:52:31

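This is a very nice solution, thank you. Is there any way to avoid looping twice? Can it be reduced to a single for loop, for example something like base_soup.find_all('td', {'scope': 'row'{a}})? – Theo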

I only want the first htm link, '/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm' – Theo

@Theo Give me a few minutes, I will update the answer. –
