2017-08-29 116 views
1

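Pandas DataFrame strings replaced with NaN when using pd.concat

I have a pandas DataFrame made up of strings, i.e. 'P1', 'P2', 'P3', ..., null.

When I try to concatenate this DataFrame with another one, all of the strings are replaced with NaN.

See my code below: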

import operator

import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
# each short_desc entry is a list of dicts; keep the 'what' value of the first one
descriptions['desc'] = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f1 = pd.DataFrame(descriptions['desc'])

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
bugPrior['priority'] = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f2 = pd.DataFrame(bugPrior['priority'])

df = pd.concat([f1, f2])
print(df.head())

The output is as follows:

   desc          priority 
0 Usability issue with external editors (1GE6IRL)  NaN 
1    API - VCM event notification (1G8G6RR)  NaN 
2 Would like a way to take a write lock on a tea...  NaN 
3 getter/setter code generation drops "F" in ".....  NaN 
4 Create Help Index Fails with seemingly incorre...  NaN 

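Any ideas on how I can stop this from happening?

Ultimately, my goal is to get everything into a single DataFrame so that I can drop all rows with "null" values. That would also help with the code that comes later.

Thanks.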

Answers

2

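Assuming you want to concatenate these columns horizontally, you need to pass axis=1 to pd.concat, because by default the concatenation is vertical: the rows of f2 are stacked below those of f1, and each row gets NaN in the column that exists only in the other frame.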

df = pd.concat([f1,f2], axis=1) 
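
To see the difference between the two axes, here is a minimal sketch using two small made-up frames. The values are hypothetical and not taken from the Eclipse dataset; only the column names mirror the question.

```python
import pandas as pd

# Hypothetical stand-ins for f1 and f2 from the question
f1 = pd.DataFrame({'desc': ['first bug', 'second bug']})
f2 = pd.DataFrame({'priority': ['P3', 'P1']})

# Default axis=0: rows are stacked and the columns are unioned,
# so every row is missing the column that belongs to the other frame
stacked = pd.concat([f1, f2])

# axis=1: rows are aligned on the index and the columns sit side by side
side_by_side = pd.concat([f1, f2], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

The stacked result has twice as many rows, which is why df.head() in the question only shows the f1 rows with NaN under priority.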

To drop the rows that contain NaN, you should be able to use df.dropna. Call df.reset_index afterwards so that the index is consecutive again.

df = pd.concat([f1, f2], axis=1)
df = df.dropna().reset_index(drop=True) 
print(df.head(10)) 
               desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2 
5 CCE in RenameResourceAction while renaming ele...  P3 
6 Option to prevent cursor from moving off end o...  P3 
7  Tasks section in the user doc is very stale  P3 
8 Importing existing project with different case...  P3 
9 Workspace in use --> choose new workspace but ...  P3 
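
The dropna and reset_index steps can be sketched on a small made-up frame. The data below is hypothetical. The replace call is only needed under the assumption that some missing values arrive as the literal string 'null' rather than as real missing values; whether that applies to this dataset is not confirmed by the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one real missing value and one literal 'null' string
df = pd.DataFrame({
    'desc': ['first bug', None, 'third bug', 'fourth bug'],
    'priority': ['P3', 'P2', 'null', 'P1'],
})

# Assumption: turn literal 'null' strings into real missing values,
# because dropna only removes NaN/None, not the string 'null'
df = df.replace('null', np.nan)

# Drop every row that has a missing value in any column
cleaned = df.dropna()
print(cleaned.index.tolist())  # gaps remain where rows were dropped

# Renumber the rows from 0 and discard the old index
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Passing drop=True keeps the old index from being added back as a column.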

Printing df.priority.unique(), we see that there are 5 unique priorities:

print(df.priority.unique()) 
array(['P3', 'P2', 'P4', 'P1', 'P5'], dtype=object) 
+0

Thanks for your help. This dataset has been driving me nuts, and this is only the data import! – JohnWayne360

2

I think it is better not to create DataFrames from the columns and to work with the Series directly:

import operator

import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)

# extract the description text as a Series (f1)
f1 = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f1.head())

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)

# extract the priority label as a Series (f2)
f2 = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f2.head())

Then use the same solution as in cᴏʟᴅsᴘᴇᴇᴅ's answer:

df = pd.concat([f1,f2], axis=1).dropna().reset_index(drop=True) 
print (df.head()) 
              short_desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2 
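
A short sketch of two details behind this approach, using hypothetical Series rather than the real dataset: when Series are concatenated with axis=1, their names become the column names (which is why the column above is short_desc rather than desc), and the rows are matched by index label, not by position.

```python
import pandas as pd

# Hypothetical named Series standing in for f1 and f2
s1 = pd.Series(['first bug', 'second bug'], name='short_desc')
s2 = pd.Series(['P3', 'P1'], name='priority')

# The Series names are used as the column names of the result
df = pd.concat([s1, s2], axis=1)
print(df.columns.tolist())

# Rows are matched by index label: shifting the labels of one Series
# produces NaN wherever a label exists on only one side
s3 = pd.Series(['P3', 'P1'], index=[1, 2], name='priority')
misaligned = pd.concat([s1, s3], axis=1)
print(misaligned)
```

This is why both inputs are given a fresh index with reset_index(drop=True) before they are concatenated.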
+0

That is exactly my answer. :) –

+0

That's fine. You don't have to edit, but thanks, I appreciate it. –

+1

@jezrael Thanks for your answer. I think I may apply your suggestion and create the columns that way. – JohnWayne360
