2012-10-20 49 views
0

How do I create a two-dimensional array in Python?

I want to create a 2-D array in Python like this:

      n1   n2   n3   n4   n5
w1     1    4    0    1   10
w2     3    0    7    0    3
w3     0   12    9    5    4
w4     9    0    0    9    7

Here w1, w2, ... are different words and n1, n2, n3, ... are different blogs.
How can I do this?

+0

What is it for? It looks more like you want a dict that maps 2-tuples (word, blog) to frequencies ... –

+0

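I'm working on a data mining project, and at some point I want to collect data like this. I know the count of each word per blog [n1 n2 ..]. I just created a spreadsheet and I already have these values. I want to know how to create this in Python. – Target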

Answers

1

Assuming the text of each blog is available as a string, and you have a list of those strings in blogs, this is how you can create the matrix.

import re

# Sample input for the following code.
blogs = ["This is a blog.", "This is another blog.", "Cats? Cats are awesome."]
# This is a list that will contain dictionaries holding the word counts for each blog.
wordcount = []
# This is a list of all unique words in all blogs.
wordlist = []
# Consider each blog sequentially.
for blog in blogs:
    # Remove all the non-alphanumeric, non-whitespace characters,
    # convert to lowercase, and then split the string on whitespace.
    # eg: "That's not mine." -> "Thats not mine" -> ["thats","not","mine"]
    # split() without arguments also drops empty strings caused by
    # leading, trailing, or repeated whitespace.
    words = re.sub(r"[^\w\s]", "", blog).lower().split()
    # Add a new dictionary to the list. As it is at the end,
    # it can be referred to by wordcount[-1].
    wordcount.append({})
    # Consider each word in the list generated above.
    for word in words:
        # If that word has been encountered before, increment the count.
        if word in wordcount[-1]:
            wordcount[-1][word] += 1
        # Else, create a new entry in the dictionary.
        else:
            wordcount[-1][word] = 1
        # If it is not already in the list of unique words, add it.
        if word not in wordlist:
            wordlist.append(word)

# We now have wordlist, which has a unique list of all words in all blogs,
# and wordcount, which contains len(blogs) dictionaries, containing word counts.
# Matrix is the table that you need of word counts. The number of rows will be
# equal to the number of unique words, and the number of columns = no. of blogs.
matrix = []
# Consider each word in the unique list of words (corresponding to each row).
for word in wordlist:
    # Add as many columns as there are blogs, all initialized to zero.
    matrix.append([0] * len(wordcount))
    # Consider each blog one by one.
    for i in range(len(wordcount)):
        # Check if the currently selected word appears in that blog.
        if word in wordcount[i]:
            # If yes, add its count to the cell for that blog/column.
            matrix[-1][i] += wordcount[i][word]

# For printing matrix, first generate the column headings.
temp = "\t"
for i in range(len(blogs)):
    temp += "Blog " + str(i + 1) + "\t"
print(temp)

# Then generate each row, with the word at the start, and tabs between numbers.
for i in range(len(matrix)):
    temp = wordlist[i] + "\t"
    for j in matrix[i]:
        temp += str(j) + "\t"
    print(temp)

Now matrix[i][j] will contain the number of times the word wordlist[i] appears in the blog blogs[j].
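For comparison, here is a shorter sketch of the same word-by-blog matrix using collections.Counter from the standard library. It is only an illustration, not part of the answer above. The names tokens_per_blog and counts are made up for this example, and the tokenizing is slightly different: re.findall(r"\w+", ...) splits "That's" into "that" and "s" instead of producing "thats". For the sample input both approaches give the same words.

```python
import re
from collections import Counter

# Same sample input as in the answer above.
blogs = ["This is a blog.", "This is another blog.", "Cats? Cats are awesome."]

# Tokenize each blog: lowercase it and take runs of word characters.
# (Assumption: this tokenizer is a stand-in, not the answer's exact rule.)
tokens_per_blog = [re.findall(r"\w+", blog.lower()) for blog in blogs]

# One Counter per blog, mapping word -> number of occurrences.
counts = [Counter(tokens) for tokens in tokens_per_blog]

# Unique words in order of first appearance across all blogs.
wordlist = []
seen = set()
for tokens in tokens_per_blog:
    for word in tokens:
        if word not in seen:
            seen.add(word)
            wordlist.append(word)

# One row per word, one column per blog. A Counter returns 0 for
# missing keys, so no membership check is needed.
matrix = [[c[word] for c in counts] for word in wordlist]

for word, row in zip(wordlist, matrix):
    print(word + "\t" + "\t".join(str(n) for n in row))
```

As before, matrix[i][j] is the count of wordlist[i] in blogs[j].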

+0

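Hello Kaustubh, could you please add comment lines so that I can understand it better? This is the answer I'm after, so please comment the code. One more thing: if I want to print it, how do I print it? – Target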

+0

Done, I know :P Anyway... copy and paste the whole thing into an interactive console and see it in action. –

+0

Thanks Kaustubh...! – Target

0

If a list or a dict of tuples won't do, consider using pandas:

from pandas import DataFrame
print(DataFrame({'n1':[1,3,0,9], 'n2':[4,0,12,0], 'n3':[0,7,9,0], 'n4':[1,0,5,9], 'n5':[10,3,4,7]}, index=['w1','w2','w3','w4']))
    n1  n2  n3  n4  n5
w1   1   4   0   1  10
w2   3   0   7   0   3
w3   0  12   9   5   4
w4   9   0   0   9   7
+0

And without using pandas? – Target

+0

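See @Jon Clement's comment. But the answer depends on what you want to do with the data. Note, though, that while there are several ways to hold the data, they may not look the same as your example. – root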

0

I wouldn't create any lists, nor a 2-D array. Instead, I'd create a dict keyed by your x and y headers, as a tuple. As in:

data["w1", "n1"] = 1 

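This can be thought of as a kind of "sparse matrix" representation. Depending on what operations you want to perform on the data, you may need a dict of dicts, where the keys of the outer dict are the xheader or the yheader, and the inner keys are the other one.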
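To make the dict-of-dicts idea concrete, here is a minimal sketch that regroups a tuple-keyed dict both ways. The names by_word and by_blog are just for illustration, and the sample uses only a few cells from the table in the question, with keys in (word, blog) order.

```python
from collections import defaultdict

# Tuple-keyed data: (word, blog) -> count. Only a few cells, for brevity.
data = {("w1", "n1"): 1, ("w1", "n2"): 4, ("w2", "n1"): 3, ("w3", "n2"): 12}

# Outer key = word, inner key = blog.
by_word = defaultdict(dict)
# Outer key = blog, inner key = word.
by_blog = defaultdict(dict)

for (word, blog), count in data.items():
    by_word[word][blog] = count
    by_blog[blog][word] = count

print(dict(by_word))
print(dict(by_blog))
```

Which grouping is more useful depends on whether you mostly ask "what are all the counts for this word?" or "what are all the counts for this blog?".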

Assuming the tuple-as-key representation, and taking your data table as the input:

text = """\ 
    n1 n2 n3 n4 n5 

w1 1 4 0 1 10 

w2 3 0 7 0 3 

w3 0 12 9 5 4 

w4 9 0 0 9 7 
""" 

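data = {}
lines = text.splitlines()
# The first line holds the column (x) headers.
xheaders = lines.pop(0).split()
for line in lines:
    # Skip blank lines between the rows.
    if not line.strip():
        continue
    elems = line.split()
    # The first element of each row is the row (y) header.
    yheader = elems[0]
    for (xheader, datum) in zip(xheaders, elems[1:]):
        data[xheader, yheader] = int(datum)
print(data)
print(sorted(data.items()))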

The prints produce (the order of the entries in the first line depends on the Python version):

{('n3', 'w4'): 0, ('n4', 'w2'): 0, ('n2', 'w2'): 0, ('n1', 'w4'): 9, ('n3', 'w3'): 9, ('n2', 'w3'): 12, ('n3', 'w2'): 7, ('n2', 'w4'): 0, ('n5', 'w3'): 4, ('n2', 'w1'): 4, ('n4', 'w1'): 1, ('n5', 'w2'): 3, ('n5', 'w1'): 10, ('n4', 'w3'): 5, ('n4', 'w4'): 9, ('n1', 'w3'): 0, ('n1', 'w2'): 3, ('n5', 'w4'): 7, ('n1', 'w1'): 1, ('n3', 'w1'): 0} 
[(('n1', 'w1'), 1), (('n1', 'w2'), 3), (('n1', 'w3'), 0), (('n1', 'w4'), 9), (('n2', 'w1'), 4), (('n2', 'w2'), 0), (('n2', 'w3'), 12), (('n2', 'w4'), 0), (('n3', 'w1'), 0), (('n3', 'w2'), 7), (('n3', 'w3'), 9), (('n3', 'w4'), 0), (('n4', 'w1'), 1), (('n4', 'w2'), 0), (('n4', 'w3'), 5), (('n4', 'w4'), 9), (('n5', 'w1'), 10), (('n5', 'w2'), 3), (('n5', 'w3'), 4), (('n5', 'w4'), 7)] 
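Going the other way, a tuple-keyed dict like this can be turned back into a row-and-column table. The following is only a sketch: it assumes the (xheader, yheader) key order used by the code above, works on a small made-up subset of the data, and treats any missing cell as 0.

```python
# A few entries in (xheader, yheader) order, i.e. (blog, word).
data = {("n1", "w1"): 1, ("n2", "w1"): 4, ("n1", "w2"): 3, ("n2", "w2"): 0}

# Recover the sorted column and row labels from the keys.
xheaders = sorted({x for x, _ in data})
yheaders = sorted({y for _, y in data})

# One row per y header; cells missing from the dict default to 0.
rows = [[data.get((x, y), 0) for x in xheaders] for y in yheaders]

# Render as tab-separated text with a header line.
lines = ["\t" + "\t".join(xheaders)]
for y, row in zip(yheaders, rows):
    lines.append(y + "\t" + "\t".join(str(v) for v in row))
table = "\n".join(lines)
print(table)
```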
0

One way is to use numpy:

>>> from numpy import array 
>>> array([ (1,4,0,1,10), (3,0,7,0,3), (0,12,9,5,4), (9,0,0,9,7) ]) 
array([[ 1,  4,  0,  1, 10],
       [ 3,  0,  7,  0,  3],
       [ 0, 12,  9,  5,  4],
       [ 9,  0,  0,  9,  7]])
0

If you just want a two-dimensional array without any parsing, you can write it like this:

a = [ 
    [1, 4, 0, 1, 10], 
    [3, 0, 7, 0, 3], 
    [0, 12, 9, 5, 4], 
    [9, 0, 0, 9, 7] 
]
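A plain nested list has no row or column labels, so if you want to look cells up by word and blog you could keep the labels in two separate lists. This is a rough sketch; the names words, blogs and count are made up for the example.

```python
# Row and column labels, kept alongside the nested list.
words = ["w1", "w2", "w3", "w4"]
blogs = ["n1", "n2", "n3", "n4", "n5"]

a = [
    [1, 4, 0, 1, 10],
    [3, 0, 7, 0, 3],
    [0, 12, 9, 5, 4],
    [9, 0, 0, 9, 7],
]

def count(word, blog):
    """Return the cell for the given word (row) and blog (column)."""
    return a[words.index(word)][blogs.index(blog)]

# zip(*a) transposes the table, giving one list per blog (column).
columns = [list(col) for col in zip(*a)]

print(count("w3", "n2"))
print(columns[blogs.index("n5")])
```

Note that list.index is a linear search, so for large tables a dict from label to position would be a better fit.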