2015-10-07 70 views
1

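Python 3 - Getting a variable into a dictionary

I am trying (so far without success) to get the output of the print command below into a dictionary, so that I can then export it as CSV.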

How can I get parseddata (the output of the print below) into a dictionary?

Sample input file:

<html> 
<body> 
<p>{ success:true ,results:3,rows:[{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}</p>
</body> 
</html> 

My code:

import requests 
import re 
from bs4 import BeautifulSoup 
url = requests.get("http://. . .") 
soup = BeautifulSoup(url.text, "lxml") 
parseddata = soup.string.split(':[', 1)[1].lstrip(']') 
print(parseddata) 

The output of print(parseddata) is:

{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]} 
+0

But what does 'parseddata' look like? – yurib

+0

yurib, I have edited the post to show what parseddata looks like. Thanks –

+0

@zs_python: Can you provide a sample input file to work with, so that people can run test cases? –

Answers

0

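This looks like a key-value mapping, with ISIN as a key and "INE134E01011" as its value. But it is not JSON, because the keys are not quoted, and it is not YAML either, because a plain scalar key (i.e. a string without quotes) has to be followed by colon + space.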
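To see the JSON point concretely, here is a minimal sketch using only the standard library. The sample string is a shortened row from the question, and the regular expression is an illustrative workaround rather than anything proposed in this thread: it assumes no value contains a comma or brace followed by a word and a colon, which holds for this data but not in general.

```python
import json
import re

# Shortened row in the same shape as the question's data.
raw = '{ISIN:"INE134E01011",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"}'

try:
    json.loads(raw)
    is_json = True
except json.JSONDecodeError:
    # JSON requires property names to be wrapped in double quotes.
    is_json = False

# Illustrative workaround: quote every bare key that follows '{' or ','.
# The time "15:39" is untouched because it is not preceded by '{' or ','.
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
record = json.loads(quoted)

print(is_json)
print(record["FilingDate"])
```

Once the keys are quoted, the text is ordinary JSON and each row comes back as a plain dict.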

If you break up the output string¹:

test_str = (
    '{ISIN:"INE134E01011",Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"14-Aug-2015 15:39",' 
    'SeqNumber:"1001577"},' 
    '{ISIN:"INE134E01011",' # new mapping starts 
    'Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"30-May-2015 14:37",' 
    'SeqNumber:"129901"},' 
    '{ISIN:"INE134E01011",' # new mapping starts 
    'Ind:"-",' 
    'Audited:"Un-Audited",' 
    'Cumulative:"Non-cumulative",' 
    'Consolidated:"Non-Consolidated",' 
    'FilingDate:"17-Feb-2015 14:57",' 
    'SeqNumber:"126171"}]}' 
) 

and test that it equals your input:

test_org = '{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"14-Aug-2015 15:39",SeqNumber:"1001577"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"30-May-2015 14:37",SeqNumber:"129901"},{ISIN:"INE134E01011",Ind:"-",Audited:"Un-Audited",Cumulative:"Non-cumulative",Consolidated:"Non-Consolidated",FilingDate:"17-Feb-2015 14:57",SeqNumber:"126171"}]}' 
assert test_str == test_org 

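Splitting it up like this makes it clear that there are actually 3 mappings, followed by a trailing ]}. The ] indicates a list, which is consistent with having 3 mappings separated by commas. The matching [ is missing because it is part of the ':[' separator you split on. The lstrip(']') that follows has no effect, since the remaining string does not start with ].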
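The effect of the split and the strip can be checked on a shortened stand-in for the page text. This is a sketch: the sample string is abbreviated from the question's input and the variable names are only illustrative.

```python
# Abbreviated stand-in for the text inside the <p> element.
text = '{ success:true ,results:1,rows:[{ISIN:"INE134E01011",SeqNumber:"1001577"}]}'

# split() drops the ':[' separator, so the opening '[' is already gone here.
tail = text.split(':[', 1)[1]

# lstrip(']') only removes ']' characters from the left end; the string
# starts with '{', so nothing changes.
assert tail.lstrip(']') == tail

# Dropping the trailing ']}' leaves only the comma-separated mappings.
rows_only = tail[:-2] if tail.endswith(']}') else tail

# Alternatively, restore the '[' and drop only the final '}' to get the
# text of a complete list.
as_list_text = '[' + tail[:-1]

print(rows_only)
print(as_list_text)
```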

You can easily manipulate the string so that YAML can parse it, but the result is a list²:

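from ruamel.yaml import YAML
# Re-add the leading '[', put a space after each key's colon, drop the final '}'
test_str = '[' + test_str.replace(':"', ': "').rstrip('}')

# The module-level ruamel.yaml.load() is deprecated in newer releases,
# so load through a YAML instance instead
yaml = YAML(typ='safe')
data = yaml.load(test_str)
print(type(data))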

which prints:

<class 'list'> 

And since the list consists of mappings that share the same keys, you cannot simply combine them into one dictionary without losing information.
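A small sketch of what "losing information" means here, using two abbreviated rows with values taken from the question: merging dictionaries that share keys keeps only the last value seen for each key.

```python
# Two abbreviated rows shaped like the parsed data.
data = [
    {"ISIN": "INE134E01011", "FilingDate": "14-Aug-2015 15:39", "SeqNumber": "1001577"},
    {"ISIN": "INE134E01011", "FilingDate": "30-May-2015 14:37", "SeqNumber": "129901"},
]

merged = {}
for row in data:
    # Later rows overwrite the values stored for keys they share with earlier rows.
    merged.update(row)

print(merged["SeqNumber"])
print(len(merged))
```

After the loop, merged holds three keys with the second row's values; the first row's FilingDate and SeqNumber are gone.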

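You can either map this list to some key (there is a colon in the split, and the output has a trailing }, which indicates there was one in the original input), or you can take a key with unique values (SeqNumber) and promote its value to a key in a dictionary that replaces the list: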

ddata = {} 
for elem in data: 
    k = elem.pop('SeqNumber') 
    ddata[k] = elem 

But I don't see a reason to go from the list to a dictionary if your final goal is a CSV file. If you take the output of the YAML parser, you can do:

import csv 
with open('output.csv', 'w', newline='') as fp: 
    csvwriter = csv.writer(fp) 
    csvwriter.writerow(data[0].keys()) # header of common dict keys 
    for elem in data: 
        csvwriter.writerow(elem.values()) # values

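This gives a CSV file with the following content (SeqNumber is missing because the loop above popped it from each row, and the column order reflects the dict ordering of the Python version used):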

ISIN,Ind,Consolidated,Cumulative,Audited,FilingDate 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,14-Aug-2015 15:39 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,30-May-2015 14:37 
INE134E01011,-,Non-Consolidated,Non-cumulative,Un-Audited,17-Feb-2015 14:57 
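If a fixed column order matters, csv.DictWriter with an explicit list of field names is one option. The following is a sketch under assumptions: the rows are abbreviated from the question's data, the chosen column order is arbitrary, and an in-memory buffer stands in for the output.csv file so the example runs without touching the disk.

```python
import csv
import io

# Abbreviated rows shaped like the parsed data.
data = [
    {"ISIN": "INE134E01011", "Audited": "Un-Audited",
     "FilingDate": "14-Aug-2015 15:39", "SeqNumber": "1001577"},
    {"ISIN": "INE134E01011", "Audited": "Un-Audited",
     "FilingDate": "30-May-2015 14:37", "SeqNumber": "129901"},
]

# Column order picked for illustration; every name must be a key of each row.
fieldnames = ["SeqNumber", "ISIN", "Audited", "FilingDate"]

# Stand-in for open('output.csv', 'w', newline='').
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # header row in the order given by fieldnames
writer.writerows(data)  # one CSV line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer looks up each value by key, the columns come out in the same order regardless of how the dictionaries were built.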

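¹ Instead of escaping the newlines with \, I use parentheses to turn the multi-line definition into one string, which makes it easier to comment on individual lines.
² Instead of re-adding the '[', you should of course not strip it off in the first place.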

+0

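Thank you Anthon, this is perfect and did just the job for me! I really appreciate all the effort you put in, and the explanation as well. Thank you @ShadowRanger, your efforts have added to my Python learning and were very helpful too. This noob is overwhelmed by the effort you guys put in to help me learn. Thank you! –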

+0

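@zs_python If this solved your problem, please consider accepting the answer (by clicking the check mark next to the top of this answer). That shows others that your problem is solved (they might not read all the way down to your comment), and marks it as such in the database. – Anthon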

+0

Thanks @anthon for the hand-holding, I have accepted the answer as instructed. See you guys soon :) –

2

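Aside from the stray closing bracket/brace, this is valid YAML, not valid JSON as I assumed in my initial answer: JavaScript objects can be declared with unquoted property names, but the portable JSON format does not allow that, whereas YAML does.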

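Parse the data with PyYAML, following its documentation. The manual split-ing and lstrip are hurting you and making this harder than it needs to be. Just get the text, then parse it with yaml (a third-party module that must be installed separately):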

import requests 
import yaml 
from bs4 import BeautifulSoup 

url = requests.get("http://. . .") 
soup = BeautifulSoup(url.text, "lxml") 
# Use safe_load over load to avoid opening security holes; YAML can do 
# a lot of unsafe things if the input isn't trusted, but handling JS 
# object literals can be done safely with safe_load 
response_object = yaml.safe_load(soup.string.strip()) 
data_rows = response_object['rows'] 

for row in data_rows: 
    print(row)  # placeholder: do stuff with each returned row

You can read more in the PyYAML tutorial.

+0

Thanks ShadowRanger, I guess the "stray closing brace/bracket at the end" is the problem. How can I get rid of it? –

+1

@zs_python: Anticipated that and added an example before you asked. :-) – ShadowRanger

+2

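Odds are that the original data is valid 'json', and the object you are interested in is simply the sole entry in an array property of an object with only one property (containing a one-element array). You could probably just 'json.loads' the whole thing, then access and assign 'data_as_dict = whole_thing_as_dict['name_of_singleton_key'][0]' and avoid the explicit 'split' and 'lstrip'. – ShadowRanger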