2016-10-13

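Python append() and removing HTML tags

I need some help. My output looks wrong. How can I append the values of dept, job_title, and job_location correctly? Also, the dept values contain HTML tags. How do I remove those tags?

My code: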

import requests
from bs4 import BeautifulSoup
from pprint import pprint

response = requests.get("http://hortonworks.com/careers/open-positions/")
soup = BeautifulSoup(response.text, "html.parser")

jobs = []

div_main = soup.select("div#careers_list")

for div in div_main:
    # find_all() returns Tag objects, so dept still carries the HTML markup
    dept = div.find_all("h4", class_="department_title")
    div_career = div.find_all("div", class_="career")
    title = []
    location = []
    for dv in div_career:
        job_title = dv.find("div", class_="title").get_text().strip()
        title.append(job_title)
        job_location = dv.find("div", class_="location").get_text().strip()
        location.append(job_location)

    # Only one dict is built per container, holding every title and location at once
    job = {
        "job_location": location,
        "job_title": title,
        "job_dept": dept
    }
    jobs.append(job)
pprint(jobs)

It should look like this:

{'job_dept': 'Consulting',
 'job_location': 'Chicago, IL',
 'job_title': 'Sr. Consultant - Central'}

One value for each variable.

+1

Please show your output... –

+0

The output shows job_dept: all departments, job_location: all locations, job_title: all titles –

Answer

0

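The HTML structure is sequential rather than hierarchical: the department headings and the job entries are siblings inside the same container. So you have to iterate through the container's children in order and update the department title as you go: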

import requests
from bs4 import BeautifulSoup, Tag
from pprint import pprint

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20130331 Firefox/21.0'}
response = requests.get("http://hortonworks.com/careers/open-positions/", headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

jobs = []

div_main = soup.select("div#careers_list")

for div in div_main:
    # Holds the most recent department heading seen so far
    department_title = ""
    for element in div:
        # Skip text nodes and tags that have no class attribute
        if isinstance(element, Tag) and "class" in element.attrs:
            if "department_title" in element.attrs["class"]:
                # get_text() returns only the text, without the HTML tags
                department_title = element.get_text().strip()
            elif "career" in element.attrs["class"]:
                location = element.select("div.location")[0].get_text().strip()
                title = element.select("div.title")[0].get_text().strip()
                # One dict per job, tagged with the current department
                job = {
                    "job_location": location,
                    "job_title": title,
                    "job_dept": department_title
                }
                jobs.append(job)

pprint(jobs)
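
The same idea can be tried offline, without fetching the page. The sketch below is a minimal version that assumes a flat layout like the one described above; the sample markup and the helper name group_jobs_by_department are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: department headings and job entries are siblings.
SAMPLE_HTML = """
<div id="careers_list">
  <h4 class="department_title"><span>Consulting</span></h4>
  <div class="career">
    <div class="title">Sr. Consultant - Central</div>
    <div class="location">Chicago, IL</div>
  </div>
  <div class="career">
    <div class="title">Sr. Consultant - West</div>
    <div class="location">Santa Clara, CA</div>
  </div>
  <h4 class="department_title">Engineering</h4>
  <div class="career">
    <div class="title">Software Engineer</div>
    <div class="location">Remote</div>
  </div>
</div>
"""


def group_jobs_by_department(html):
    """Return one dict per job, tagged with the heading that precedes it."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for container in soup.select("div#careers_list"):
        department = ""  # most recent heading seen so far
        for element in container.children:
            if not isinstance(element, Tag):
                continue  # skip whitespace and other text nodes
            classes = element.get("class", [])
            if "department_title" in classes:
                # get_text() drops nested tags such as <span>
                department = element.get_text().strip()
            elif "career" in classes:
                jobs.append({
                    "job_dept": department,
                    "job_title": element.select_one("div.title").get_text().strip(),
                    "job_location": element.select_one("div.location").get_text().strip(),
                })
    return jobs


jobs = group_jobs_by_department(SAMPLE_HTML)
for job in jobs:
    print(job)
```

Each job ends up with a single department, title, and location, and the department is plain text because it comes from get_text() rather than from the Tag object itself.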
+0

I get this error when I run it: if isinstance(element, Tag) and element.attrs.has_key("class"): AttributeError: 'dict' object has no attribute 'has_key' –

+0

I updated my answer so it works with Python 3. – nullop
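
For context: dict.has_key() was removed in Python 3, and the in operator is the replacement that works on both Python 2 and Python 3. A minimal sketch, with a plain dict standing in for element.attrs (the values are made up):

```python
# A plain dict standing in for element.attrs (hypothetical values).
attrs = {"class": ["career"], "id": "job-1"}

# attrs.has_key("class") only exists on Python 2; on Python 3 it raises
# AttributeError. The membership test below works on both.
has_class = "class" in attrs

# dict.get() folds the membership test and the lookup into one call
# by returning a default when the key is absent.
classes = attrs.get("class", [])
is_career = "career" in classes
missing = attrs.get("href")  # None, because there is no "href" key
```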

+0

Wow, amazing. It works well and the output is correct. I am using PyCharm. In the part "job_dept": department_title, department_title is highlighted. It says: Name 'department_title' can be not defined –
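
A warning like this is typically reported when a name is only assigned inside a branch or loop body, so that some path could reach its use before any assignment. A minimal sketch, assuming that is the cause here (the function label_rows and its input are hypothetical): giving the name a default before the loop means every path defines it.

```python
def label_rows(rows):
    """Pair each career entry with the department heading that precedes it."""
    # rows is a hypothetical list of (kind, text) pairs.
    department_title = ""  # default, so the name is defined on every path
    labelled = []
    for kind, text in rows:
        if kind == "department":
            department_title = text
        elif kind == "career":
            # Safe even when no department row has been seen yet
            labelled.append((department_title, text))
    return labelled


rows = [
    ("career", "Orphan job"),
    ("department", "Consulting"),
    ("career", "Sr. Consultant - Central"),
]
result = label_rows(rows)
```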