2017-09-16 35 views
1

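Loading data into pandas

I'm trying to extract license information for pip packages from PyPI and load it into a pandas DataFrame. I did a similar example before that loaded a list comprehension into pandas, but I can't figure this one out.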

Here is what I have written so far.

from requests import get 

import pandas as pd 

import pip 

url = 'https://pypi.python.org/pypi' 

# packages_list = ['numpy','twisted'] 

installed_packages = pip.get_installed_distributions() 
installed_packages_list = sorted(["%s==%s" % (i.key, i.version) 
    for i in installed_packages]) 

packages = [] 
licenses = [] 
summarys = [] 

for index, package in enumerate(installed_packages_list): 
    package = package.split("==")[0] 
    full_url = url+'/'+ package +'/json' 
    #print 'url is ' + full_url 
    page = get(url+'/'+package+'/json').json() 


    #print 'Package: ' + package + ', license is:' + page['info']['license'] + '. ' + page['info']['summary'] 
    packages.append(package) 
    licenses.append(page['info']['license']) 
    summarys.append(page['info']['summary']) 


print packages 


pd_packages = pd.DataFrame(
    { 
    "packages":[packages], 
    "licenses":[licenses], 
    "summarys":[summarys] 
    }) 

print pd_packages 
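A note for current environments: `pip.get_installed_distributions()` was removed from pip's public interface in pip 10, so the listing step above fails on newer installs. The following is a minimal sketch of a swapped-in alternative (not the asker's method), assuming Python 3.8+ and using the standard-library `importlib.metadata` to collect the installed package names:

```python
from importlib.metadata import distributions

# Collect the declared name of every installed distribution.
# .get() returns None for a distribution with broken metadata, which is skipped.
names = sorted({
    dist.metadata.get("Name")
    for dist in distributions()
    if dist.metadata.get("Name")
})

print(len(names), "installed distributions found")
```

The resulting `names` list can stand in for the package names that the loop above derives by splitting each `name==version` string.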
+1

So what is the problem? –

+0

It shows something like: licenses 0 [MIT, , MPL-2.0, LGPL, UNKNOWN, BSD-like, BSD, ... packages \ 0 [beautifulsoup4, bs4, certifi, chardet, get, i ... summarys 0 [Screen-scraping library, Dummy package ... – vkk07

+0

I want to get this data in tabular form and dump it to CSV using pandas – vkk07

Answers

2

Try this:

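import pandas as pd
import requests

def get_pkg_info(pkg, url_pat='https://pypi.python.org/pypi/{}/json'):
    # Return [name, license, summary]; the last two are None when the lookup fails.
    r = requests.get(url_pat.format(pkg))
    if r.status_code != requests.codes.ok:
        return [pkg, None, None]
    d = r.json()
    if d and 'info' in d:
        return [pkg, d['info'].get('license'), d['info'].get('summary')]
    else:
        return [pkg, None, None]

# installed_packages_list is the "name==version" list built in the question.
data = [get_pkg_info(x.split('==')[0]) for x in installed_packages_list]

# One row per package: a list of [name, license, summary] lists maps to three columns.
df = pd.DataFrame(data, columns=['package','license','summary'])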
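The field extraction in `get_pkg_info` can be checked without touching the network. Below is a minimal sketch, assuming a hypothetical helper `extract_info` (not part of the answer) that takes an already-decoded JSON payload; the sample payloads are made up for illustration:

```python
def extract_info(pkg, payload):
    """Return [name, license, summary] from a decoded PyPI-style JSON payload.

    Hypothetical helper: mirrors the .get() lookups in get_pkg_info,
    so missing keys yield None instead of raising KeyError.
    """
    info = (payload or {}).get("info") or {}
    return [pkg, info.get("license"), info.get("summary")]


# Illustrative payloads, not real PyPI responses.
complete = {"info": {"license": "MIT", "summary": "An example package"}}
no_license = {"info": {"summary": "No license field here"}}

rows = [
    extract_info("example", complete),
    extract_info("partial", no_license),
    extract_info("missing", None),
]
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP call makes the "missing field" and "missing payload" cases easy to test in isolation.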

Demo:

In [166]: pd.options.display.max_rows = 15 

In [167]: df = pd.DataFrame(data, columns=['package','license','summary']) 

In [168]: df 
Out[168]: 
                package       license                                            summary
0             alabaster          None        A configurable sidebar-enabled Sphinx theme
1       anaconda-client       UNKNOWN         Anaconda Cloud command line client library
2    anaconda-navigator   Proprietary
3      anaconda-project          None                                               None
4            asn1crypto           MIT  Fast ASN.1 parser and serializer with definiti...
5               astroid          LGPL  A abstract syntax tree for Python with inferen...
6               astropy           BSD         Community-developed python astronomy tools
..                  ...           ...                                                ...
216              xarray        Apache          N-D labeled arrays and datasets in Python
217                xlrd           BSD  Library for developers to extract data from Mi...
218          xlsxwriter           BSD     A Python module for creating Excel XLSX files.
219             xlwings  BSD 3-clause  Make Excel fly: Interact with Excel from Pytho...
220                xlwt           BSD  Library to create spreadsheet files compatible...
221           xmltodict           MIT  Makes working with XML feel like you are worki...
222               yapsy           BSD                          Yet another plugin system

[223 rows x 3 columns] 
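The asker's comment also asks for a CSV dump. Here is a minimal sketch, assuming a frame shaped like `df` above; the three rows are copied from the demo output rather than fetched from PyPI, and the file name `licenses.csv` is hypothetical:

```python
import pandas as pd

# Sample rows copied from the demo output (not fetched from PyPI).
data = [
    ["alabaster", None, "A configurable sidebar-enabled Sphinx theme"],
    ["astropy", "BSD", "Community-developed python astronomy tools"],
    ["yapsy", "BSD", "Yet another plugin system"],
]
df = pd.DataFrame(data, columns=["package", "license", "summary"])

# With no path, to_csv returns the CSV text; index=False omits the row labels.
# Missing values (None) are written as empty fields by default.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead, pass a path (hypothetical name):
# df.to_csv("licenses.csv", index=False)
```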
0

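I think the problem stems from how you create your DataFrame (pd_packages). packages, licenses, and summarys are already lists, so [packages] turns each one into a list of lists. The frame then has a single row whose cells each hold an entire list, which explains the output you posted in your comment.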

So instead of this

pd_packages = pd.DataFrame(
    { 
    "packages":[packages], 
    "licenses":[licenses], 
    "summarys":[summarys] 
    }) 

try this

pd.DataFrame(
    { 
    "packages":packages, 
    "licenses":licenses, 
    "summarys":summarys 
    }) 
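The difference is easy to see from the shapes of the two frames. Here is a minimal sketch, reusing the two package names from the question's commented-out `packages_list`; the license values are placeholders for illustration:

```python
import pandas as pd

packages = ["numpy", "twisted"]
licenses = ["BSD", "MIT"]  # placeholder values for illustration

# Wrapping each list in another list gives one row whose cells are whole lists.
wrapped = pd.DataFrame({"packages": [packages], "licenses": [licenses]})

# Passing the lists directly gives one row per package.
flat = pd.DataFrame({"packages": packages, "licenses": licenses})

print(wrapped.shape)  # (1, 2)
print(flat.shape)     # (2, 2)
```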
+0

Thanks, Bob. That is what I did before adding [] around the names... I got the error "If using all scalar values, you must pass an index". That's why I added the [] – vkk07

+0

That's strange. I wouldn't expect that error even if the lists were empty –
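For reference, pandas raises that message when every value in the dict is a scalar (for example a single string) rather than a list. Whether that is what happened in the asker's run cannot be determined from the thread; the sketch below only reproduces the error under that assumption, with placeholder values:

```python
import pandas as pd

message = None
try:
    # Every value is a scalar string, so pandas cannot infer the number of rows.
    pd.DataFrame({"packages": "numpy", "licenses": "BSD"})
except ValueError as exc:
    message = str(exc)
    print(message)

# Supplying an index tells pandas how many rows to build from the scalars.
one_row = pd.DataFrame({"packages": "numpy", "licenses": "BSD"}, index=[0])

# Empty lists are not scalars: this builds an empty frame without an error.
empty = pd.DataFrame({"packages": [], "licenses": []})
print(one_row.shape, empty.shape)
```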
