2014-10-12 144 views
8

Converting a Pandas DataFrame to an Orange Table

I noticed that this is already an issue on GitHub. Does anyone have code to convert a pandas DataFrame to an Orange Table?

To be explicit, I have the following table.

   user  hotel  star_rating  user  home_continent  gender
0     1     39          4.0     1               2  female
1     1     44          3.0     1               2  female
2     2     63          4.5     2               3  female
3     2      2          2.0     2               3  female
4     3     26          4.0     3               1    male
5     3     37          5.0     3               1    male
6     3     63          4.5     3               1    male
+0

The Orange format doesn't look hard to write out: http://docs.orange.biolab.si/reference/rst/Orange.data.formats.html. Orange also supports importing CSV files and guessing the data types. Have you tried anything? – EdChum 2014-10-12 08:54:48

+0

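So I can understand how the data is saved into *.tab files, but specifically, is there a function or series of calls that lets you convert a pandas DataFrame to an Orange Table? (Side comment: it's interesting that this page talks about how data is stored in external files but not about how to save to or load from a file. I personally think Orange is not well documented.) – hlin117 2014-10-12 13:19:04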

+0

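Would a workflow like this do: save the table from pandas as a file, then import that file in Orange? Or is that too much? I'd guess the field data types might not carry over well. – BKay 2014-10-16 19:01:00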
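A minimal sketch of the file-based route suggested in the comments above, using only the standard library. It assumes the three-line header (names, types, flags) of Orange's tab-delimited format; the type codes "c", "d" and "string" are my reading of the format page linked by EdChum and should be checked against it. `write_orange_tab` is a hypothetical helper name, not part of Orange or pandas.

```python
import csv
import io


def write_orange_tab(names, types, rows, out):
    """Write rows in a tab-delimited layout with a three-line header.

    names: column names
    types: one type code per column; "c" (continuous), "d" (discrete) and
           "string" are ASSUMED to be what Orange's .tab reader expects
    rows:  iterable of row tuples
    out:   any text file-like object
    """
    if len(names) != len(types):
        raise ValueError("names and types must have the same length")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(names)              # header line 1: column names
    writer.writerow(types)              # header line 2: column types
    writer.writerow([""] * len(names))  # header line 3: flags (none set here)
    for row in rows:
        writer.writerow(row)


buf = io.StringIO()
write_orange_tab(
    ["hotel", "star_rating", "gender"],
    ["d", "c", "d"],
    [(39, 4.0, "female"), (26, 4.0, "male")],
    buf,
)
tab_text = buf.getvalue()
```

With pandas, `list(df.columns)` and `df.itertuples(index=False)` would supply `names` and `rows`. Declaring the types explicitly is what addresses the worry that data types might not carry over well.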

Answers

17

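The documentation of the Orange package doesn't cover all the details. According to lib_kernel.cpp, Table.__init__(Domain, numpy.ndarray) works only for int and float.

They really should provide a C-level interface for pandas.DataFrames, or at least support numpy.dtype("str").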
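Because of that int/float restriction, a converter first has to decide which columns can take the NumPy route. A rough sketch of that split, assuming only pandas; `split_by_kind` is a hypothetical helper and not part of Orange. It works on column positions rather than names, since the table in the question has two columns called `user`.

```python
import pandas as pd


def split_by_kind(df):
    """Return (numeric, other) lists of column positions.

    Integer and float columns could go through a NumPy-array constructor;
    everything else needs a row-by-row fallback.
    """
    numeric, other = [], []
    for pos in range(len(df.columns)):
        # dtype.kind is "i"/"u" for integers and "f" for floats
        if df.iloc[:, pos].dtype.kind in "iuf":
            numeric.append(pos)
        else:
            other.append(pos)
    return numeric, other


df = pd.DataFrame({
    "hotel": [39, 44, 63],
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "female", "male"],
})
numeric_cols, other_cols = split_by_kind(df)
```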

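Update: added table2df; df2table now uses numpy for int and float columns, which greatly improves performance.

Keep this script in your collection of Orange Python scripts, and you are equipped with pandas in your Orange environment.

Usage: a_pandas_dataframe = table2df(a_orange_table), a_orange_table = df2table(a_pandas_dataframe)

Note: this script works only in Python 2.x; refer to @DustinTang's answer for a Python 3.x compatible script.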

import pandas as pd 
import numpy as np 
import Orange 

#### For those who are familiar with pandas 
#### Correspondence: 
#### value <-> Orange.data.Value 
####  NaN <-> ["?", "~", "."] # Don't know, Don't care, Other 
#### dtype <-> Orange.feature.Descriptor 
####  category, int <-> Orange.feature.Discrete # category: > pandas 0.15 
####  int, float <-> Orange.feature.Continuous # Continuous = core.FloatVariable 
####             # refer to feature/__init__.py 
####  str <-> Orange.feature.String 
####  object <-> Orange.feature.Python 
#### DataFrame.dtypes <-> Orange.data.Domain 
#### DataFrame.DataFrame <-> Orange.data.Table = Orange.orange.ExampleTable 
####        # You will need this if you are reading sources 

def series2descriptor(d, discrete=False):
    # Compare dtypes with ==; identity checks on dtype objects are fragile
    if d.dtype == np.dtype("float"):
        return Orange.feature.Continuous(str(d.name))
    elif d.dtype == np.dtype("int"):
        return Orange.feature.Continuous(str(d.name), number_of_decimals=0)
    else:
        t = d.unique()
        # Discrete when forced, or when fewer than half of the values are
        # distinct; otherwise keep the column as a string feature
        if discrete or len(t) < len(d)/2:
            t.sort()
            return Orange.feature.Discrete(str(d.name), values=list(t.astype("str")))
        else:
            return Orange.feature.String(str(d.name))


def df2domain(df): 
    featurelist = [series2descriptor(df.iloc[:, col]) for col in xrange(len(df.columns))]  # iloc replaces the deprecated icol
    return Orange.data.Domain(featurelist) 


def df2table(df): 
    # It seems they are using native python object/lists internally for Orange.data types (?) 
    # And I didn't find a constructor suitable for pandas.DataFrame since it may carry 
    # multiple dtypes 
    # --> the best approximate is Orange.data.Table.__init__(domain, numpy.ndarray), 
    # --> but the dtype of numpy array can only be "int" and "float" 
    # --> * refer to src/orange/lib_kernel.cpp 3059: 
    # --> * if (((*vi)->varType != TValue::INTVAR) && ((*vi)->varType != TValue::FLOATVAR)) 
    # --> Documents never mentioned >_< 
    # So we use numpy constructor for those int/float columns, python list constructor for other 

    tdomain = df2domain(df) 
    ttables = [series2table(df.iloc[:, i], tdomain[i]) for i in xrange(len(df.columns))]  # iloc replaces the deprecated icol
    return Orange.data.Table(ttables) 

    # For performance concerns, here are my results 
    # dtndarray = np.random.rand(100000, 100) 
    # dtlist = list(dtndarray) 
    # tdomain = Orange.data.Domain([Orange.feature.Continuous("var" + str(i)) for i in xrange(100)]) 
    # tinsts = [Orange.data.Instance(tdomain, list(dtlist[i]))for i in xrange(len(dtlist))] 
    # t = Orange.data.Table(tdomain, tinsts) 
    # 
    # timeit list(dtndarray) # 45.6ms 
    # timeit [Orange.data.Instance(tdomain, list(dtlist[i])) for i in xrange(len(dtlist))] # 3.28s 
    # timeit Orange.data.Table(tdomain, tinsts) # 280ms 

    # timeit Orange.data.Table(tdomain, dtndarray) # 380ms 
    # 
    # As illustrated above, utilizing constructor with ndarray can greatly improve performance 
    # So one may conceive better converter based on these results 


def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        # Use numpy
        # Table.__init__(Domain, numpy.ndarray)
        return Orange.data.Table(Orange.data.Domain(variable), series.values[:, np.newaxis])
    else:
        # Build instance list
        # Table.__init__(Domain, list_of_instances)
        tdomain = Orange.data.Domain(variable)
        tinsts = [Orange.data.Instance(tdomain, [i]) for i in series]
        return Orange.data.Table(tdomain, tinsts)
        # 5x performance


def column2df(col): 
    if type(col.domain[0]) is Orange.feature.Continuous: 
     return (col.domain[0].name, pd.Series(col.to_numpy()[0].flatten())) 
    else: 
     tmp = pd.Series(np.array(list(col)).flatten()) # type(tmp) -> np.array(dtype=list (Orange.data.Value)) 
     tmp = tmp.apply(lambda x: str(x[0])) 
     return (col.domain[0].name, tmp) 

def table2df(tab): 
    # Orange.data.Table().to_numpy() cannot handle strings 
    # So we must build the array column by column, 
    # When it comes to strings, python list is used 
    series = [column2df(tab.select(i)) for i in xrange(len(tab.domain))] 
    series_name = [i[0] for i in series] # To keep the order of variables unchanged 
    series_data = dict(series)
    return pd.DataFrame(series_data, columns=series_name) 
+0

It looks like you've provided a very thorough answer, thank you! Do these functions work for every Orange Table / pandas DataFrame? – hlin117 2014-10-19 16:15:59

+0

Hopefully yes. I tested them on my own datasets, but more testing may be needed. – TurtleIzzy 2014-10-19 16:20:04

+0

This did not work for me with Python 3 and Orange3. But thank you! – 2016-07-06 01:26:53

1

Something like this?

table = Orange.data.Table(df.as_matrix()) 

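The columns in Orange will get generic names (a1, a2, ...). If you want to copy the names and types from the data frame, construct an Orange.data.Domain object (http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data.Domain.init) from the data frame and pass it as the first argument above.

See the constructors in http://docs.orange.biolab.si/reference/rst/Orange.data.table.html.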

+0

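I get a domain error when I try this: "TypeError: invalid arguments for constructor (domain or examples or both expected)". Can you provide some code for adding a domain? – hlin117 2014-10-17 18:48:07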

+1

Suppose you have `df = DataFrame({"A": [1, 2, 3, 4], "B": [8, 7, 6, 5]})`. Build a domain with `domain = Orange.data.Domain([Orange.feature.Continuous(name) for name in df.columns])` and then create the table with `table = Orange.data.Table(domain, df.as_matrix())`. – JanezD 2014-10-18 14:56:50

+0

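Oh, and if it doesn't work: what does your data frame look like? If `df.as_matrix().dtype` is `object`, Orange won't accept it. You'll have to convert the categorical data to indices. – JanezD 2014-10-18 15:04:33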
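One way to do the "convert categorical data to indices" step from the comment above is `pd.factorize`. This is only a sketch on made-up data, not something JanezD posted. It uses `.values` in place of `as_matrix()`, which newer pandas releases no longer provide.

```python
import pandas as pd

df = pd.DataFrame({
    "star_rating": [4.0, 3.0, 4.5, 5.0],
    "gender": ["female", "female", "male", "male"],
})

# sort=True returns the labels in sorted order, so the codes are reproducible
codes, labels = pd.factorize(df["gender"], sort=True)

numeric = df.copy()
numeric["gender"] = codes               # integer code per row
matrix = numeric.values.astype(float)   # purely numeric, no longer dtype=object
value_names = [str(v) for v in labels]  # candidate `values` list for a discrete feature
```

`value_names` keeps the mapping from each index back to its label, which is what a discrete feature would need as its list of values.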

2

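To convert a pandas DataFrame to an Orange Table, you need to construct a domain, which specifies the column types.

For continuous variables you only need to provide the name of the variable, but for discrete variables you also need to provide a list of all possible values.

The following code constructs a domain for your DataFrame and converts it to an Orange Table: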

import numpy as np 
from Orange.feature import Discrete, Continuous 
from Orange.data import Domain, Table 
domain = Domain([ 
    Discrete('user', values=[str(v) for v in np.unique(df.user)]), 
    Discrete('hotel', values=[str(v) for v in np.unique(df.hotel)]), 
    Continuous('star_rating'), 
    Discrete('user', values=[str(v) for v in np.unique(df.user)]), 
    Discrete('home_continent', values=[str(v) for v in np.unique(df.home_continent)]), 
    Discrete('gender', values=['male', 'female'])], False) 
table = Table(domain, [map(str, row) for row in df.as_matrix()]) 

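The map(str, row) step is needed so that Orange knows the data contains the values of the discrete features (and not the indices of those values in the values list).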
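To see why the distinction matters, here is a plain-Python illustration with no Orange involved. It assumes, as stated above, that an integer given for a discrete feature is read as a position in the values list; `hotel_values` mirrors what `np.unique(df.hotel)` would give for the table in the question.

```python
# Declared values of the discrete "hotel" feature, as strings
hotel_values = ["2", "26", "37", "39", "44", "63"]

raw = 2  # the number stored in the DataFrame cell

# Read as an index, 2 selects the third declared value
as_index = hotel_values[raw]

# Read as a value (after str()), it selects the hotel actually labelled "2"
as_value = hotel_values[hotel_values.index(str(raw))]
```

The two readings disagree: `as_index` is "37" while `as_value` is "2", so passing the raw integers would silently relabel the hotels.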

+0

This works great! I tested it, and it seems I can sort the table by gender, so I'd assume most of the other table functions work too. – hlin117 2014-10-18 18:02:18

+0

If you want to describe a feature as an ID, isn't there another data type for that? (For example, a user ID.) – hlin117 2014-10-19 16:17:46

2

This code is modified from @TurtleIzzy's answer to work with Python 3.

import numpy as np 
from Orange.data import Table, Domain, ContinuousVariable, DiscreteVariable 


def series2descriptor(d):
    # Numeric columns become continuous variables; everything else is discrete
    if d.dtype == np.dtype("float") or d.dtype == np.dtype("int"):
        return ContinuousVariable(str(d.name))
    else:
        t = d.unique()
        t.sort()
        return DiscreteVariable(str(d.name), list(t.astype("str")))

def df2domain(df):
    featurelist = [series2descriptor(df.iloc[:, col]) for col in range(len(df.columns))]
    return Domain(featurelist)

def df2table(df):
    tdomain = df2domain(df)
    ttables = [series2table(df.iloc[:, i], tdomain[i]) for i in range(len(df.columns))]
    # Stack the per-column tables into one (rows x columns) array
    ttables = np.array(ttables).reshape((len(df.columns), -1)).transpose()
    return Table(tdomain, ttables)

def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        series = series.values[:, np.newaxis]
        return Table(series)
    else:
        # Go through .values: a pandas Series has no reshape() in current releases
        series = series.astype('category').cat.codes.values.reshape((-1, 1))
        return Table(series)
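The reshape/transpose step in df2table amounts to assembling one float matrix with a column per DataFrame column. A sketch of just that part using NumPy and pandas only, so it can be checked without Orange installed; `df_to_float_matrix` is a hypothetical name and not part of the answer above.

```python
import numpy as np
import pandas as pd


def df_to_float_matrix(df):
    """Return a 2-D float array with one column per DataFrame column."""
    columns = []
    for pos in range(len(df.columns)):
        series = df.iloc[:, pos]
        if series.dtype.kind in "iuf":
            # Integer and float columns are used as they are
            columns.append(np.asarray(series, dtype=float))
        else:
            # Category codes follow sorted category order; missing values get -1
            codes = series.astype("category").cat.codes
            columns.append(np.asarray(codes, dtype=float))
    return np.column_stack(columns)


df = pd.DataFrame({
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "male", "female"],
})
X = df_to_float_matrix(df)
```

In Orange3 an array like `X` would presumably be handed to the table constructor together with the domain, as df2table does; that part is untested here.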