2016-10-09 46 views
0

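Python: How to create a nested dictionary from lists with specified values

I apologize in advance for the long post, but I am confident it is easy to follow and clear.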

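My question is this:

How can I create a nested dictionary from lists, where every inner dictionary has the same specified keys?

Here is an example of what I want, using data for a fictional news article: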

{'http://www.SomeNewsWebsite.com/Article12345': 
{'Title': 'Trump Does Another Ridiculous Thing', 
    'Source': 'Some News Website', 
    'Thumbnail': 'SomeNewsWebsite.com/image12345'}} 

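Reading similar posts, I have seen people do comparable things, but I have struggled to port those ideas into my own work.

That is the end of my question. Below, I have posted the code and the sample lists it generates, which are what I am using to build the nested dictionary. The code is also available on my GitHub.

So far, I can use the code below to fetch the data, cut out the important bits, and build two lists: one of URLs and one of headlines. It then uses zip to combine them into a tidy dictionary.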

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.reuters.com"

source = "Reuters" 

thumbnail = "http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png" 


def soup(): 
    """ Fetches HTML from site and turns it into a bs4 object. """ 
    get_html = requests.get(url) 
    html = get_html.text 
    make_soup = BeautifulSoup(html, 'html.parser') 
    return make_soup 


# Tell bs4 where to find the important information (headlines, URLs) 
important_data = (soup().select(".story-content > .story-title > a")) 


# Turn that important data into a string so it may be parsed using RegEx 
stringed_data = ' || '.join(str(v) for v in important_data) 


def get_headline(): 
    """ Uses Regular Expressions to find headlines. Returns a list. """ 
    headline = re.findall(r'(?<=">)(.*?)(?=</a>)', stringed_data) 
    return headline 


def get_link(): 
    """ Uses Regular Expressions to find links. Returns a list. """ 
    link = re.findall(r'(?<=<a href=")(.*?)(?=")', stringed_data) 
    return link 

def build_dict(): 
    """ Combine everything into a tidy dictionary. """ 
    full_urls = [i if i.startswith('http') else url + i for i in get_link()] 
    reuters_dictionary = dict(zip(get_headline(), full_urls)) 
    return full_urls 

get_link() 
get_headline() 
soup() 
build_dict() 

When run, this code creates two lists and then a dictionary. Sample data is shown below:

List of titles:(29 items long) 
['Trump strikes defiant tone ahead of debate', 'Matthew swamps North Carolina, still dangerous as it heads out to sea', "Tesla's Musk says will not have to raise funds in fourth-quarter", 'Suspect arrested in fatal shooting of two California police officers', 'Russia says U.S. actions threaten its national security', 'Western-backed coalition under pressure over Yemen raid', "Fed's Fischer says job gains solid, expects growth to pick up", "Thai king's condition unstable after hemodialysis treatment: palace", 'Pope names new group of cardinals, adding to potential successors', 'Palestinian kills two people in Jerusalem, then shot dead: police', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'", 'Earnings season begins as White House race heats up', 'Russia expects OPEC to ask non members to consider joining output curb', 'Banks ponder the meaning of life as Deutsche agonizes', 'IMF says still engaged with Greece, no decision yet on bailout role', 'Pound slump exacerbates Brexit impact for German exporters: DIHK', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources', 'Ukraine military postpones withdrawal from town, cites rebel shelling', 'German police make new raid in hunt for refugee planning bomb attack', "South African President Zuma's rape accuser dies: family", 'Xi says China must speed up plans for domestic network technology', 'UberEats to expand to Berlin in 2017: Tagesspiegel', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services', 'Pressure on Trump likely to be intense at second debate with Clinton', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.", 'Evangelical leaders stick with Trump, focus on defeating Clinton', 'Citi sells its Argentinian consumer business to Banco Santander', "Itaú to pay $220 million for Citigroup's Brazil assets", 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico'] 


List of URLs: (29 items long) 
['/article/us-usa-election-idUSKCN1290JZ', '/article/us-storm-matthew-idUSKCN129063', '/article/us-tesla-equity-solarcity-idUSKCN1290QW', '/article/us-california-police-shooting-idUSKCN1280YH', '/article/us-russia-usa-idUSKCN1290DP', '/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', '/article/us-usa-fed-fischer-idUSKCN1290JB', '/article/us-thailand-king-idUSKCN1290R8', '/article/us-pope-cardinals-idUSKCN1290C9', '/article/us-israel-palestinians-violence-idUSKCN129070', '/article/us-society-entertainment-film-idUSKCN127229', '/article/us-usa-stocks-weekahead-idUSKCN1272HS', '/article/us-oil-opec-russia-idUSKCN1290KD', '/article/us-imf-g20-banks-idUSKCN1290DX', '/article/us-imf-g20-greece-idUSKCN1290R6', '/article/us-britain-eu-germany-idUSKCN1290TZ', '/article/us-oil-opec-istanbul-idUSKCN1290N2', '/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', '/article/us-germany-bomb-idUSKCN1290D2', '/article/us-safrica-zuma-idUSKCN1290SX', '/article/us-china-internet-security-idUSKCN1290LA', '/article/us-uber-germany-eats-idUSKCN1290OB', '/article/us-china-regulations-ride-hailing-idUSKCN1280EL', '/article/us-usa-election-debate-idUSKCN1290AS', '/article/us-usa-election-clinton-idUSKCN1280Z9', '/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', '/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', '/article/us-citibank-brasil-m-a-itau-unibco-hldg-idUSKCN1280HM', '/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU'] 

Dictionary of titles and URLs: (29 items long) 
{'Banks ponder the meaning of life as Deutsche agonizes': 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX', 'German police make new raid in hunt for refugee planning bomb attack': 'http://www.reuters.com/article/us-germany-bomb-idUSKCN1290D2', 'Suspect arrested in fatal shooting of two California police officers': 'http://www.reuters.com/article/us-california-police-shooting-idUSKCN1280YH', 'Evangelical leaders stick with Trump, focus on defeating Clinton': 'http://www.reuters.com/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', 'Xi says China must speed up plans for domestic network technology': 'http://www.reuters.com/article/us-china-internet-security-idUSKCN1290LA', "Australia's Rinehart and China's Shanghai CRED agree on deal for Kidman cattle empire": 'http://www.reuters.com/article/us-australia-china-landsale-dakang-p-f-idUSKCN12908O', 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico': 'http://www.reuters.com/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU', 'Citi sells Argentinian consumer unit a day after Brazil sale': 'http://www.reuters.com/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services': 'http://www.reuters.com/article/us-china-regulations-ride-hailing-idUSKCN1280EL', 'Pope names new group of cardinals, adding to potential successors': 'http://www.reuters.com/article/us-pope-cardinals-idUSKCN1290C9', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'": 'http://www.reuters.com/article/us-society-entertainment-film-idUSKCN127229', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources': 'http://www.reuters.com/article/us-oil-opec-istanbul-idUSKCN1290N2', "South African President Zuma's rape accuser dies: family": 'http://www.reuters.com/article/us-safrica-zuma-idUSKCN1290SX', 'Palestinian kills two people in Jerusalem, then shot dead: police': 
'http://www.reuters.com/article/us-israel-palestinians-violence-idUSKCN129070', 'Matthew swamps North Carolina, still dangerous as it heads out to sea': 'http://www.reuters.com/article/us-storm-matthew-idUSKCN129063', 'Western-backed coalition under pressure over Yemen raid': 'http://www.reuters.com/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', 'Trump strikes defiant tone ahead of debate': 'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ', 'Russia says U.S. actions threaten its national security': 'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP', 'Pressure on Trump likely to be intense at second debate with Clinton': 'http://www.reuters.com/article/us-usa-election-debate-idUSKCN1290AS', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.": 'http://www.reuters.com/article/us-usa-election-clinton-idUSKCN1280Z9', "Tesla's Musk says will not have to raise funds in fourth-quarter": 'http://www.reuters.com/article/us-tesla-equity-solarcity-idUSKCN1290QW', "Fed's Fischer says job gains solid, expects growth to pick up": 'http://www.reuters.com/article/us-usa-fed-fischer-idUSKCN1290JB', 'Ukraine military postpones withdrawal from town, cites rebel shelling': 'http://www.reuters.com/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', "Thai king's condition unstable after hemodialysis treatment: palace": 'http://www.reuters.com/article/us-thailand-king-idUSKCN1290R8', 'Earnings season begins as White House race heats up': 'http://www.reuters.com/article/us-usa-stocks-weekahead-idUSKCN1272HS', 'IMF says still engaged with Greece, no decision yet on bailout role': 'http://www.reuters.com/article/us-imf-g20-greece-idUSKCN1290R6', 'Pound slump exacerbates Brexit impact for German exporters: DIHK': 'http://www.reuters.com/article/us-britain-eu-germany-idUSKCN1290TZ', 'Russia expects OPEC to ask non members to consider joining output curb': 'http://www.reuters.com/article/us-oil-opec-russia-idUSKCN1290KD', 'UberEats to 
expand to Berlin in 2017: Tagesspiegel': 'http://www.reuters.com/article/us-uber-germany-eats-idUSKCN1290OB'} 

To be clear, I want to use this data to create one dictionary for each title/URL pair, as follows:

{'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX': 
{'Title': 'Banks ponder the meaning of life as Deutsche agonizes', 
    'Source': 'Reuters', 
    'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}} 

Thanks a ton for taking the time to read this, and thanks in advance for your help.

+0

@Jaiden_Dechon Take a look at this: [Make dictionary with duplicate keys in Python](https://stackoverflow.com/questions/10664856/make-dictionary-with-duplicate-keys-in-python) – user2728397

Answers

1

Consider a dictionary comprehension:

newsdict = {v: {'Title': k, 
       'Source': 'Reuters', 
       'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'} 
      for k, v in reuters_dictionary.items()} 
+0

Just a small clarification: does a dictionary allow duplicate keys? – user2728397

+0

No. The new value will overwrite the old one. – wonce
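
To illustrate the overwrite behaviour, here is a minimal sketch. The URLs and headlines are made-up sample values, not data from the question. It also shows that the same inner key names ('Title' here) can appear under as many different outer keys as you like, since each inner dictionary is a separate object.

```python
articles = {}

# Assigning to a key that already exists replaces the old value.
articles["http://example.com/a1"] = {"Title": "First headline"}
articles["http://example.com/a1"] = {"Title": "Second headline"}

# A different outer key gets its own inner dict with the same key names.
articles["http://example.com/a2"] = {"Title": "Another headline"}

print(len(articles))
print(articles["http://example.com/a1"]["Title"])
```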

+0

Aren't the URLs unique? From a cursory read of your post, I don't see any duplicates. Do some pages have multiple headlines? – Parfait

0

This should give you the result you want:

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = {}
    # The loop variable is named full_url so it does not shadow the global url used above
    for headline, full_url in zip(get_headline(), full_urls):
        reuters_dictionary[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail
        }
    return full_urls # <- I think you want to do "return reuters_dictionary" here(?)

However, nothing here involves duplicate keys. Why do you think you need duplicate keys?

You should also refactor to eliminate those global variables.
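
One possible shape for that refactor, as a sketch only: the function receives everything it needs as parameters instead of reading module-level names. The parameter names are my own, and `urljoin` from the standard library is swapped in for the `startswith('http')` check in the original code. Python 3 is assumed.

```python
from urllib.parse import urljoin


def build_dict(headlines, links, base_url, source, thumbnail):
    """Map each absolute article URL to a dict of its metadata."""
    articles = {}
    for headline, link in zip(headlines, links):
        # urljoin resolves relative links against base_url and
        # returns links that are already absolute unchanged.
        full_url = urljoin(base_url, link)
        articles[full_url] = {
            "Title": headline,
            "Source": source,
            "Thumbnail": thumbnail,
        }
    return articles


news = build_dict(
    headlines=["Trump strikes defiant tone ahead of debate"],
    links=["/article/us-usa-election-idUSKCN1290JZ"],
    base_url="http://www.reuters.com",
    source="Reuters",
    thumbnail="http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png",
)
print(news)
```

Because nothing is read from global state, the same function could be reused for another site by passing a different base URL, source name and thumbnail.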

Finally, if you are already using BeautifulSoup, why fall back to regular expressions afterwards? I think using BeautifulSoup throughout would be more robust.
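
A rough sketch of what that could look like. The HTML fragment below is invented to match the selector from the question, since the real Reuters markup is not shown in the post, so treat the tag structure as an assumption. Each element returned by `select` is a tag object, so the link text and the `href` attribute can be read directly without converting anything to a string.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like the markup the question's selector targets.
html = """
<div class="story-content">
  <h3 class="story-title"><a href="/article/us-usa-election-idUSKCN1290JZ">Trump strikes defiant tone ahead of debate</a></h3>
</div>
<div class="story-content">
  <h3 class="story-title"><a href="/article/us-storm-matthew-idUSKCN129063">Matthew swamps North Carolina, still dangerous as it heads out to sea</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.select(".story-content > .story-title > a")

# Read the text and the href attribute straight from each <a> tag.
headlines = [a.get_text(strip=True) for a in anchors]
links = [a["href"] for a in anchors]

print(headlines)
print(links)
```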

+0

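By 'duplicate keys', I mean that each dictionary will contain the same keys (Title, Source, and Thumbnail), and I was asking how to make sure those keys exist in every dictionary. I have thought about refactoring everything into classes, but I was so happy to figure out the logic behind it all that it was hard not to focus on that part first! As for using bs4 for everything, you can use 'soup().select('#selectors')' to find a specific tag, but that tag still contains metadata, which can easily be parsed when it is short. –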