Getting data from dt/dd tags when the data contains links

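I want to get data from the website "https://www.crunchbase.com/organization/ani-technologies#/entity", where the data I need sits inside dt and dd tags. Because bots are not allowed on the site, I saved the page and applied BeautifulSoup to the saved copy as shown here, although the code further down mentions the actual URL.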

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r"C:\Users\acer\Desktop\pythonbooks\tam.html").read(), "html.parser")

import requests
from bs4 import BeautifulSoup

ctr = 0
file = requests.get("https://www.crunchbase.com/organization/ani-technologies#/entity")
soup = BeautifulSoup(file.text, "html.parser")
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print(ctr, dlitem.string)
    ctr += 1

Actual output:

0 3 Acquisitions 
1 None 
2 Bengaluru, Karnataka 
3 Ola is a mobile app for cab booking in India. 
4 None 
5 None 
6 olacab link 
7 None 
8 December 3, 2010 
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
10 [email protected] 
11 None

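In several places I get None because the content there contains hyperlinks. For example, on the page "https://www.crunchbase.com/organization/ani-technologies#/entity" the Categories entry has 5 category names: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is wrapped in a hyperlink, so I cannot get the text I want, i.e. those 5 categories.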

What I want as output:

0 3 Acquisitions 
1 (All that text (though not important to me)) 
2 Bengaluru, Karnataka 
3 Ola is a mobile app for cab booking in India. 
4 (all that text(though not important to me)) 
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile)(Extremely important) 
6 olacab link 
7 (all that text(though not important to me)) 
8 December 3, 2010 
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
10 [email protected] 
11 (all that text(though not important to me))

It would be most helpful if I could get a dictionary like this:

{"Headquarters":["Bengaluru,Karnataka"], 
"Description":["Ola is a mobile app for cab booking in India."], 
"Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}

Source

2017-07-05 Nimish Bansal

No, actually I get None in places where text really exists, but because of some nested tags (caused by the hyperlinks) I cannot extract that text.

Question: ... I cannot get the text I want ... if I could get a dictionary ...

For every <dd><a href=...>text</a></dd>, get the text or the href and collect the results in a dict, for example:

from collections import OrderedDict

# Assumes `soup` is the BeautifulSoup object built from the saved page.
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # placeholder until the first <dt> is seen
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character of the label (intended to strip a trailing colon).
                key = child.text[:-1]
            if child.name == 'dd':
                if child.select('a[href]'):
                    a_list = child.find_all("a")
                    if key in ['Social:']:
                        # Social entries: keep the link targets, not the link text.
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        os_dict[key] = a_list[0].text
                    else:
                        os_dict[key] = [a.text for a in a_list]
                else:
                    os_dict[key] = child.text

for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))

Output:

1:   Acquisition: 3 Acquisitions 
2: Total Equity Fundin: ['11 Rounds', '24 Investors'] 
3:   Headquarters: Bengaluru, Karnataka 
4:   Description: Ola is a mobile app for cab booking in India. 
5:    Founders: ['Bhavish Aggarwal', 'Ankit Bhati'] 
6:   Categories: ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile'] 
7:    Website: http://www.olacabs.com 
8:    Social:: ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com'] 
9:    Founded: December 3, 2010 
10:    Aliases: ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
11:    Contact: [email protected] 
12:   Employees: 8 in Crunchbase
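
The same idea can be tried without the saved page. The following is only a minimal sketch on a small, made-up HTML fragment: the markup in SAMPLE_HTML and the helper name dl_to_dict are assumptions for illustration, not the real Crunchbase page structure. It pairs each dt with the dd that follows it and always stores a list, which is the dictionary shape asked for in the question.

```python
from collections import OrderedDict
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the dt/dd layout described in the question.
SAMPLE_HTML = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Categories:</dt>
  <dd><a href="#">E-Commerce</a>, <a href="#">Internet</a></dd>
  <dt>Website:</dt>
  <dd><a href="http://www.olacabs.com">http://www.olacabs.com</a></dd>
</dl>
"""


def dl_to_dict(soup):
    """Map each <dt> label to a list of values from the next <dd> sibling."""
    result = OrderedDict()
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        key = dt.get_text(strip=True).rstrip(":")
        links = dd.find_all("a")
        if links:
            # One entry per hyperlink inside the <dd>.
            result[key] = [a.get_text(strip=True) for a in links]
        else:
            result[key] = [dd.get_text(strip=True)]
    return result


info = dl_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for key, values in info.items():
    print(key, values)
```

Unlike the code above, this sketch uses rstrip(":") for the label, so a label without a trailing colon keeps its last letter.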

Beautiful Soup documentation: find-all
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

dl_data = soup.find_all("dd")
for n, dlitem in enumerate(dl_data, 1):
    if dlitem.select('a[href]'):
        a_text = [a.text for a in dlitem.find_all("a")]
        print('{}: {}'.format(n, a_text))
    else:
        print('{}: {}'.format(n, dlitem.text))

Tested with Python: 3.4.2 - BS4: 4.6.0
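
As for why the original loop printed None: a tag's .string is None as soon as the tag has more than one child, which is the case for a dd holding several links. Below is a small sketch on invented markup (the two dd elements are assumptions, not taken from the page) that contrasts .string with get_text().

```python
from bs4 import BeautifulSoup

# Made-up fragment: one <dd> with plain text, one with two links.
html = '<dd>Plain text</dd><dd><a href="#">Apps</a>, <a href="#">Mobile</a></dd>'
soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find_all("dd")

plain_string = plain.string      # single text child, so the text is returned
linked_string = linked.string    # several children (a, text, a), so this is None
linked_text = linked.get_text()  # concatenates all nested text

print(repr(plain_string), repr(linked_string), repr(linked_text))
```

This is also why a dd containing exactly one link still printed its text in the question's output: with a single child tag, .string descends into that child.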

Source

2017-07-06 19:50:41 stovfl
