2016-11-30 55 views
2

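Combining two Python data processing scripts into one workflow

I'm currently working on a data processing task. I have two Python scripts, each implementing a separate function, but they operate on the same data. I think they could be combined into a single workflow, but I can't work out the most sensible way to do it.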

The data file is here. It is JSON, but it has two distinct components.

The first part is handled by this script:

import json 
from collections import defaultdict 
from pprint import pprint 

with open('data-science.txt') as data_file: 
    data = json.load(data_file) 

locations = defaultdict(int) 

for item in data['data']: 
    location = item['relationships']['location']['data']['id'] 
    locations[location] += 1 

pprint(locations) 

It renders data in this form:

1: 6,
2: 20,
3: 2673,
4: 126,
5: 459,
6: 346,
8: 11,
9: 68,
10: 82,

The part of the JSON that this script reads looks like this:

{ 
    "links": { 
     "self": "http://localhost:2510/api/v2/jobs?skills=data%20science" 
    }, 
    "data": [ 
     { 
      "id": 121, 
      "type": "job", 
      "attributes": { 
       "title": "Data Scientist", 
       "date": "2014-01-22T15:25:00.000Z", 
       "description": "Data scientists are in increasingly high demand amongst tech companies in London. Generally a combination of business acumen and technical skills are sought. Big data experience ..." 
      }, 
      "relationships": { 
       "location": { 
        "links": { 
         "self": "http://localhost:2510/api/v2/jobs/121/location" 
        }, 
        "data": { 
         "type": "location", 
         "id": 3 
        } 
       }, 
       "country": { 
        "links": { 
         "self": "http://localhost:2510/api/v2/jobs/121/country" 
        }, 
        "data": { 
         "type": "country", 
         "id": 1 
        } 
       }, 

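That is the part the first Python script works on. In its output, the keys are the location "id"s and the values are the number of records assigned to each location.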

The other part of the JSON object looks like this:

"included": [ 
    { 
     "id": 3, 
     "type": "location", 
     "attributes": { 
      "name": "Victoria", 
      "coord": [ 
       51.503378, 
       -0.139134 
      ] 
     } 
    }, 

and it is processed by this Python file:

import json 
from collections import defaultdict 
from pprint import pprint 

with open('data-science.txt') as data_file: 
    data = json.load(data_file) 

locations = defaultdict(int) 

for record in data['included']: 
    id = record.get('id', None) 
    name = record.get('attributes', {}).get('name', None) 
    coord = record.get('attributes', {}).get('coord', None) 
    print(id, name, coord) 

It outputs data in this format:

3 Victoria [51.503378, -0.139134] 
1 United Kingdom None 
71 data science None 
32 None None 
3 Victoria [51.503378, -0.139134] 
1 United Kingdom None 
1 data mining None 
22 data analysis None 
33 sdlc None 
38 artificial intelligence None 
39 machine learning None 
40 software development None 
71 data science None 
93 devops None 
63 None None 
52 Cubitt Town [51.505199, -0.018848] 

What I would really like is for the final output to look like this:

3, Victoria, [51.503378, -0.139134], 2673 

where 2673 is the job count from the first script.

If a record doesn't have coordinates such as [51.503378, -0.139134], I can throw it away.

I'm sure it is possible to combine these scripts and get that output, but I'm not much of a holistic thinker and I can't figure out how to do it.

All the real project files live here.

Answers

1

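Using functions is one way to combine the two scripts; after all, they operate on the same data. So you should write one function for each block of processing logic, then merge the results at the end: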

import json 
from collections import defaultdict 
from pprint import pprint 

def process_locations_data(data):
    # processes the 'data' block: counts jobs per location id
    locations = defaultdict(int)
    for item in data['data']:
        location = item['relationships']['location']['data']['id']
        locations[location] += 1
    return locations

def process_locations_included(data):
    # processes the 'included' block
    return_list = []
    for record in data['included']:
        id = record.get('id', None)
        name = record.get('attributes', {}).get('name', None)
        coord = record.get('attributes', {}).get('coord', None)
        return_list.append((id, name, coord))
    return return_list  # return list of tuples

# load the data from file once 
with open('data-science.txt') as data_file: 
    data = json.load(data_file) 

# use the two functions on same data 
locations = process_locations_data(data) 
records = process_locations_included(data) 

# combine the data for printing 
for record in records:
    id, name, coord = record
    references = locations[id]  # look up the job count for this id in the dict
    print(id, name, coord, references)

The functions could have better names, but this should achieve the unification you are looking for.
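
One caveat about the merge step: the 'included' list mixes locations with other record types whose ids can collide with location ids (the sample output shows id 1 for both "United Kingdom" and "data mining"), and the same location can appear more than once. Below is a minimal sketch of a merge that keeps only locations with coordinates. It assumes every location record carries "type": "location", as in the sample above, and that other records carry a different type. The function name merge_locations and the inline sample data are hypothetical, made up for illustration.

```python
from collections import Counter


def merge_locations(data):
    """Return (id, name, coord, job_count) rows for locations that have coordinates."""
    # Count jobs per location id from the 'data' block.
    counts = Counter(
        item['relationships']['location']['data']['id']
        for item in data['data']
    )
    rows = []
    seen = set()
    for record in data['included']:
        # Skip non-location records (assumed to have a different 'type');
        # their ids can collide with location ids.
        if record.get('type') != 'location':
            continue
        loc_id = record.get('id')
        coord = record.get('attributes', {}).get('coord')
        # Drop records without coordinates and repeats of an already emitted location.
        if coord is None or loc_id in seen:
            continue
        seen.add(loc_id)
        name = record.get('attributes', {}).get('name')
        rows.append((loc_id, name, coord, counts[loc_id]))
    return rows


# Tiny made-up sample in the same shape as the real file (an assumption).
sample = {
    'data': [
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
    ],
    'included': [
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
        {'id': 1, 'type': 'country', 'attributes': {'name': 'United Kingdom'}},
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
    ],
}

for loc_id, name, coord, count in merge_locations(sample):
    print('{}, {}, {}, {}'.format(loc_id, name, coord, count))
```

Filtering on the record type before looking up the count is a design choice: it avoids attaching a location's job count to a country or skill that happens to share the same id.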

+1

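This script works, but when you try to redirect its output to an output file you get the error UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128) – CMorales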

+0

That has to do with the input data. It can be handled; see for example http://stackoverflow.com/questions/5760936/handle-wrongly-encoded-character-in-python-unicode-string. But you can also save to a file instead of using 'print' in the last loop. – sal
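
Writing to a file with an explicit encoding sidesteps the ASCII default that triggers the error when output is redirected. Here is a minimal sketch, assuming UTF-8 is acceptable for the output. The helper write_rows, the file name locations.csv and the sample row are hypothetical, made up for illustration.

```python
import io
import os
import tempfile


def write_rows(rows, path):
    """Write (id, name, coord, count) rows to `path` as UTF-8 text, one per line."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as out_file:
        for loc_id, name, coord, count in rows:
            out_file.write(u'{}, {}, {}, {}\n'.format(loc_id, name, coord, count))


# Made-up row whose name contains a non-ASCII character (u'\xfc' is "ü").
rows = [(7, u'Z\xfcrich', [47.3769, 8.5417], 5)]
out_path = os.path.join(tempfile.gettempdir(), 'locations.csv')
write_rows(rows, out_path)
```

In the answer's script, the final loop could collect its tuples into a list and pass them to a helper like this instead of calling print.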

+0

If this answers the original question, please mark it as the accepted solution. – sal