2017-07-05 26 views

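I want to extract data from the page "https://www.crunchbase.com/organization/ani-technologies#/entity", where the data I need sits inside dt and dd tags. Because the site does not allow bots, I saved the page and applied the BeautifulSoup module to the saved copy as shown here, although the code further down mentions the actual URL. Some of the dd tags hold their data inside links.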

with open(r"C:\Users\acer\Desktop\pythonbooks\tam.html") as f:
    soup = BeautifulSoup(f, "html.parser")

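import requests
from bs4 import BeautifulSoup

ctr = 0
response = requests.get("https://www.crunchbase.com/organization/ani-technologies#/entity")
# Parse the response body; a requests Response object has no .read()
soup = BeautifulSoup(response.text, "html.parser")
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print(ctr, dlitem.string)
    ctr += 1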

Actual output:

0 3 Acquisitions 
1 None 
2 Bengaluru, Karnataka 
3 Ola is a mobile app for cab booking in India. 
4 None 
5 None 
6 olacab link 
7 None 
8 December 3, 2010 
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
10 [email protected] 
11 None 

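In several places I get None because the content there is hyperlinked. For example, on the page "https://www.crunchbase.com/organization/ani-technologies#/entity" the Categories entry has 5 category names: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is wrapped in a hyperlink, so I cannot get the text I want, i.e. these 5 categories.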

What I want as output:

0 3 Acquisitions 
1 (All that text (though not important to me)) 
2 Bengaluru, Karnataka 
3 Ola is a mobile app for cab booking in India. 
4 (all that text(though not important to me)) 
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile)(Extremely important) 
6 olacab link 
7 (all that text(though not important to me)) 
8 December 3, 2010 
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
10 [email protected] 
11 (all that text(though not important to me)) 

It would be most helpful if I could get a dictionary like this:

{"Headquarters":["Bengaluru,Karnataka"], 
"Description":["Ola is a mobile app for cab booking in India."], 
"Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]} 

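No, I actually get None in places where text really is present; because of some nested tags (due to the hyperlinks) I am unable to extract that text.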

Answer


Question: ... I cannot get the text I want ... if I could get a dictionary ...

For every <dd><a href=...>text</a></dd>, get the text/href and collect the results into a dict, for example:

from collections import OrderedDict

# `soup` is the BeautifulSoup object built from the page in the question
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character of the label (usually the trailing ':')
                key = child.text[:-1]
            if child.name == 'dd':
                if child.select('a[href]'):
                    a_list = child.find_all("a")
                    if key in ['Social:']:
                        # Social entries: keep the link targets, not the link text
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        os_dict[key] = a_list[0].text
                    else:
                        os_dict[key] = [a.text for a in a_list]
                else:
                    os_dict[key] = child.text

for n, key in enumerate(os_dict, 1): 
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key])) 

Output:

1:   Acquisition: 3 Acquisitions 
2: Total Equity Fundin: ['11 Rounds', '24 Investors'] 
3:   Headquarters: Bengaluru, Karnataka 
4:   Description: Ola is a mobile app for cab booking in India. 
5:    Founders: ['Bhavish Aggarwal', 'Ankit Bhati'] 
6:   Categories: ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile'] 
7:    Website: http://www.olacabs.com 
8:    Social:: ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com'] 
9:    Founded: December 3, 2010 
10:    Aliases: ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs 
11:    Contact: [email protected] 
12:   Employees: 8 in Crunchbase 
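
The question asks for a dictionary whose values are always lists. One possible sketch of that shape pairs each dt with the dd that follows it. The markup in SAMPLE_HTML and the helper name definition_list_to_dict are hypothetical and only imitate the dt/dd layout described in the question; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the dt/dd layout described in the question;
# the real page's markup may differ.
SAMPLE_HTML = """
<dl>
  <dt>Headquarters:</dt>
  <dd><a href="/location/bengaluru">Bengaluru, Karnataka</a></dd>
  <dt>Description:</dt>
  <dd>Ola is a mobile app for cab booking in India.</dd>
  <dt>Categories:</dt>
  <dd><a href="/category/e-commerce">E-Commerce</a>, <a href="/category/internet">Internet</a>, <a href="/category/mobile">Mobile</a></dd>
</dl>
"""


def definition_list_to_dict(soup):
    """Map each <dt> label to a list of strings from the <dd> that follows it."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # a label without a value is skipped
        # Remove surrounding whitespace and a trailing colon from the label
        key = dt.get_text(strip=True).rstrip(":")
        links = dd.find_all("a")
        if links:
            # Linked values: one list entry per link text
            result[key] = [a.get_text(strip=True) for a in links]
        else:
            # Plain values: a single-entry list
            result[key] = [dd.get_text(strip=True)]
    return result


data = definition_list_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(data)
```

Using rstrip(":") instead of slicing off the last character leaves labels without a trailing colon intact.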

Beautiful Soup documentation: find-all
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
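
The parameters in that signature can be tried on a small fragment. This is only a sketch: the markup below is hypothetical and merely resembles the definition list from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not the real page
html = """
<dl class="details">
  <dt>Founders:</dt>
  <dd><a href="/person/a">Bhavish Aggarwal</a>, <a href="/person/b">Ankit Bhati</a></dd>
  <dt>Founded:</dt>
  <dd>December 3, 2010</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

all_dd = soup.find_all("dd")                                 # name: match by tag name
first_dd = soup.find_all("dd", limit=1)                      # limit: stop after one match
linked = soup.find_all("a", href=True)                       # **kwargs: tags that have an href
by_class = soup.find_all("dl", attrs={"class": "details"})   # attrs: attribute filter as a dict
exact = soup.find_all("dd", string="December 3, 2010")       # string: match on the tag's .string
top_level = soup.find_all("dd", recursive=False)             # recursive=False: direct children only

print(len(all_dd), len(first_dd), [a.get_text() for a in linked])
print(len(by_class), len(exact), top_level)
```
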

dl_data = soup.find_all("dd")
for n, dlitem in enumerate(dl_data, 1):
    if dlitem.select('a[href]'):
        a_text = [a.text for a in dlitem.find_all("a")]
        print('{}: {}'.format(n, a_text))
    else:
        print('{}: {}'.format(n, dlitem.text))

Tested with Python: 3.4.2 - BS4: 4.6.0
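
As for why the loop in the question prints None: a tag's .string is only defined when the tag has exactly one child, so a dd holding several links plus the separators between them yields None, while .text / get_text() joins all nested strings. A small sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical <dd> elements: plain text, one link, and several links
html = (
    "<dl>"
    "<dd>December 3, 2010</dd>"
    '<dd><a href="/x">Bengaluru, Karnataka</a></dd>'
    '<dd><a href="/a">E-Commerce</a>, <a href="/b">Internet</a></dd>'
    "</dl>"
)
soup = BeautifulSoup(html, "html.parser")
plain, single_link, many_links = soup.find_all("dd")

print(plain.string)           # December 3, 2010
print(single_link.string)     # Bengaluru, Karnataka (single child, so .string descends into it)
print(many_links.string)      # None (this <dd> has three children)
print(many_links.get_text())  # E-Commerce, Internet
```
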
