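I want to extract data from the page "https://www.crunchbase.com/organization/ani-technologies#/entity", where the data I need sits inside dt and dd tags. Because the site does not allow bots, I saved the page locally and ran BeautifulSoup on the saved copy, as in the code here, although the code also mentions the actual URL. Some of the dd tags contain links rather than plain text.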
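from bs4 import BeautifulSoup

# Parse the locally saved copy of the page (the site blocks bots)
soup = BeautifulSoup(open(r"C:\Users\acer\Desktop\pythonbooks\tam.html").read(), "html.parser")

# Live-URL version, shown for reference only:
# import requests
# file = requests.get("https://www.crunchbase.com/organization/ani-technologies#/entity")
# soup = BeautifulSoup(file.text, "html.parser")

ctr = 0
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    # .string is None when a tag has more than one child (e.g. nested <a> links)
    print(ctr, dlitem.string)
    ctr += 1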
Actual output:
0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 [email protected]
11 None
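In several places I get None because the content there is wrapped in hyperlinks. For example, on the page "https://www.crunchbase.com/organization/ani-technologies#/entity" the Categories field has 5 category names: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is a hyperlink, so I cannot get the text I want, namely those 5 categories.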
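A minimal sketch of the behaviour described above, assuming hypothetical markup in which a dd holds several a links (the real Crunchbase HTML may differ). In BeautifulSoup, `.string` is None when a tag has more than one child, whereas `get_text()` concatenates the text of all descendants:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure is an assumption.
html = """
<dl>
  <dt>Headquarters:</dt><dd>Bengaluru, Karnataka</dd>
  <dt>Categories:</dt><dd><a href="#">E-Commerce</a>, <a href="#">Internet</a></dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for ctr, dd in enumerate(soup.find_all("dd")):
    plain = dd.string               # None if the tag has several children
    full = dd.get_text().strip()    # text of all nested tags, joined together
    results.append((plain, full))
    print(ctr, plain, "|", full)
```

The first dd has a single text child, so both values agree. The second has three children (a link, a comma, a link), so only `get_text()` returns the category names.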
What I want as output:
0 3 Acquisitions
1 (all that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text (though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile) (extremely important)
6 olacab link
7 (all that text (though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 [email protected]
11 (all that text (though not important to me))
It would be most helpful if I could get a dictionary like this:
{"Headquarters":["Bengaluru,Karnataka"],
"Description":["Ola is a mobile app for cab booking in India."],
"Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}
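One possible way to build such a dictionary is sketched here. It assumes each dt label is followed by its dd value as a sibling inside the same dl, and the markup and the helper name `dl_to_dict` are hypothetical, so the selectors may need adjusting for the real saved page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure is an assumption.
html = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Description:</dt>
  <dd>Ola is a mobile app for cab booking in India.</dd>
  <dt>Category:</dt>
  <dd><a href="#">E-Commerce</a>, <a href="#">Internet</a>,
      <a href="#">Transportation</a>, <a href="#">Apps</a>, <a href="#">Mobile</a></dd>
</dl>
"""


def dl_to_dict(soup):
    """Map each <dt> label to a list of values taken from the next <dd>."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        key = dt.get_text().strip().rstrip(":")
        links = dd.find_all("a")
        if links:
            # One entry per hyperlink, e.g. the five categories
            values = [a.get_text().strip() for a in links]
        else:
            values = [dd.get_text().strip()]
        result[key] = values
    return result


data = dl_to_dict(BeautifulSoup(html, "html.parser"))
print(data)
```

Note that in this sketch a dd mixing links with plain text keeps only the link text. If the surrounding text matters, `dd.get_text()` could be used for those fields instead.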
No, I actually get None in places where text really exists; because of the nested tags (from the hyperlinks), I cannot extract that text.