
New to Python and BeautifulSoup. Any help is highly appreciated. How would I go about getting the information from a list of links and then dumping it into a JSON object?

I have an idea of how to build a list of company information, but only after clicking into a link.

import requests 
from bs4 import BeautifulSoup 

url = "http://data-interview.enigmalabs.org/companies/" 
r = requests.get(url) 

soup = BeautifulSoup(r.content, "html.parser") 

# Collect every anchor on the listing page 
links = soup.find_all("a") 

for link in links: 
    print link.get("href"), link.text 

# The company listing sits inside this responsive table wrapper 
g_data = soup.find_all("div", {"class": "table-responsive"}) 

# list.append() returns None, so append first rather than printing the call 
link_list = [] 
for link in links: 
    link_list.append(link.get("href")) 
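To build on this, here is a minimal sketch of following one collected link and reading the detail table. It is hypothetical, in that it assumes the collected hrefs are relative paths and that each company page renders its details as a two-column table, as in the attached images:

import requests 
from bs4 import BeautifulSoup 

# link_list[0] is a relative href collected above (an assumption) 
detail_url = "http://data-interview.enigmalabs.org" + link_list[0] 
detail_soup = BeautifulSoup(requests.get(detail_url).content, "html.parser") 

# Each table row pairs a field name with its value 
for row in detail_soup.find("table").find_all("tr"): 
    field, value = row.find_all("td") 
    print field.text, value.text 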

Could anyone give me an idea of how to first scrape the links and then build a JSON of the company data for every listing on the site?

I have attached sample images for better visualization.

How can I scrape the site and build a JSON like the example below without having to click through each individual link?

Example expected output:

all_listing = [{"Dickens-Tillman": {'Company Detail': 
    {'Company Name': 'Dickens-Tillman', 
     'Address Line 1 ': '7147 Guilford Turnpike Suit816', 
     'Address Line 2 ': 'Suite 708', 
     'City': 'Connfurt', 
     'State': 'Iowa', 
     'Zipcode ': '22598', 
     'Phone': '00866539483', 
     'Company Website ': 'lockman.com', 
     'Company Description': 'enable robust paradigms'}}}, 
 {"Klein-Powlowski": {'Company Detail': 
    {'Company Name': 'Klein-Powlowski', 
     'Address Line 1 ': '32746 Gaylord Harbors', 
     'Address Line 2 ': 'Suite 866', 
     'City': 'Lake Mario', 
     'State': 'Kentucky', 
     'Zipcode ': '45517', 
     'Phone': '1-299-479-5649', 
     'Company Website ': 'marquardt.biz', 
     'Company Description': 'monetize scalable paradigms'}}}] 

print all_listing
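For reference, the standard json module can serialize a nested structure like this directly — a minimal sketch, assuming all_listing is built as above:

import json 

# Serialize the nested list/dict structure to a JSON string 
print json.dumps(all_listing, indent=4) 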


Hmm... would you provide the actual URL for us? –

@cᴏʟᴅsᴘᴇᴇᴅ No problem, the actual URL is [link](http://data-interview.enigmalabs.org/companies/) – Vash

Er, this looks like a job for selenium + bs4. –

Answer


Here is the final solution I came up with for my own question.

import bs4, urlparse, json, requests 
from os.path import basename as bn 

links = [] 
data = {} 
base = 'http://data-interview.enigmalabs.org/' 

# Approach: 
# 1. Visit each listing page and collect the company links 
# 2. Iterate over every collected link 
# 3. Check that the collected links are correct before moving on; if not, revisit steps 1-2 
# 4. Push the collected data to a JSON file 


def bs(r): 
    # Fetch a page relative to the base URL and return its <table> element 
    return bs4.BeautifulSoup(requests.get(urlparse.urljoin(base, r).encode()).content, 'html.parser').find('table') 

# Search every "a" in the listing table on each of the 10 pages 
for i in range(1, 11): 
    print 'Collecting page %d' % i 
    links += [a['href'] for a in bs('companies?page=%d' % i).findAll('a')] 

# Now that all links are collected into a list, iterate over each link. 
# All the company info sits inside an HTML table, so collect every row into data. 
for link in links: 
    print 'Processing %s' % link 
    name = bn(link) 
    data[name] = {} 
    for row in bs(link).findAll('tr'): 
        desc, cont = row.findAll('td') 
        data[name][desc.text.encode()] = cont.text.encode() 

print json.dumps(data) 

# Final step: write the formatted data to a JSON file 
json_data = json.dumps(data, indent=4) 
with open("solution.json", "w") as f: 
    f.write(json_data) 
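One caveat: the snippet above is Python 2 (print statements, the urlparse module). Below is a minimal sketch of the same approach ported to Python 3, where urljoin lives in urllib.parse and the .encode() calls are no longer needed — it assumes the site structure is unchanged (10 listing pages, two-column detail tables):

import json 
import requests 
import bs4 
from os.path import basename 
from urllib.parse import urljoin 

base = 'http://data-interview.enigmalabs.org/' 

def get_table(path): 
    # Fetch a page relative to the base URL and return its <table> element 
    html = requests.get(urljoin(base, path)).content 
    return bs4.BeautifulSoup(html, 'html.parser').find('table') 

# Collect the company links from the 10 listing pages 
links = [] 
for i in range(1, 11): 
    links += [a['href'] for a in get_table('companies?page=%d' % i).find_all('a')] 

# Each detail page is a two-column table of field name / value 
data = {} 
for link in links: 
    rows = get_table(link).find_all('tr') 
    data[basename(link)] = {desc.text: cont.text 
                            for desc, cont in (row.find_all('td') for row in rows)} 

with open('solution.json', 'w') as f: 
    json.dump(data, f, indent=4) 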