Preprocessing a file in Python (converting a column from a discrete range of values to a contiguous range of values)

I have a dataset of the following form:

user_id::item_id1::rating::timestamp 
user_id::item_id2::rating::timestamp 
user_id::item_id3::rating::timestamp 
user_id::item_id4::rating::timestamp

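I need the item_ids to be contiguous from 1 to n, whereas they currently range from 1 to k. There are n distinct item ids, and the rows are in sorted order by item id: a later row may have the same item id as the previous one or a different one, but the order is guaranteed to be sorted.

Here k >> n.

I have the code below, but it is not quite correct and I have been at it for a few hours, so I would really appreciate any help with this. If there is a simpler way to do this in Python, I would appreciate guidance on that too.

I currently have the following code: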

def reOrderItemIds(inputFile,outputFile):
    #This is a list in the range of 1 to 10681.
    itemIdsRange = set(range(1,10682))
    #currKey = 1
    currKey = itemIdsRange.pop()
    lastContiguousKey=1
    #currKey+1
    contiguousKey=itemIdsRange.pop()
    f = open(inputFile)
    g = open(outputFile,"w")
    oldKeyToNewKeyMap = dict()
    for line in f:
        if int(line.split(":")[1]) == currKey and int(line.split(":")[1])==lastContiguousKey:
            g.write(line)
        elif int(line.split(":")[1])!=currKey and int(line.split(":")[1])!=contiguousKey:
            oldKeyToNewKeyMap[line.split(":")[1]]=contiguousKey
            lastContiguousKey=contiguousKey
            #update current key to the value of the current key.
            currKey=int(line.split(":")[1])
            contiguousKey=itemIdsRange.pop()
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
        elif int(line.split(":")[1])==currKey and int(line.split(":")[1])!=contiguousKey:
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
        elif int(line.split(":")[1])!=currKey and int(line.split(":")[1])==contiguousKey:
            currKey = int(line.split(":")[1])
            lastContiguousKey=contiguousKey
            oldKeyToNewKeyMap[line.split(":")[1]] = lastContiguousKey
            contiguousKey=itemIdsRange.pop()
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
    f.close()
    g.close()

Example:

1::1::3::100 
10::1::5::104 
20::2::3::110 
1::5::2::104

The output I need is of the following form:

1::1::3::100 
10::1::5::104 
20::2::3::110 
1::3::2::104

So only the item_ids column changes; everything else stays the same.

Any help would be much appreciated!


2014-04-13 anonuser0428

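Is your example an example of your function's output, or an example of the dataset? – wwii

Is there a typo in the third line of your expected output? Should it be "20::2::3::110"? – wwii

With apologies for badly misreading your question the first time: assuming data is a file containing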

1::1::3::100 
10::1::5::104 
20::2::3::110 
30::5::3::121 
40::9::7::118 
50::10::2::104

(This may need to be modified if not all of your data can be converted to integers.)

>>> with open('data', 'r') as datafile:
...     dataset = datafile.read().splitlines()
...
>>> ids = {}  # maps each original item_id to its new contiguous id
>>> for i, line in enumerate(dataset):
...     data = list(map(int, line.split('::')))
...     # setdefault assigns the next unused id the first time an item_id is seen,
...     # so repeated item_ids keep the id they were given earlier
...     data[1] = ids.setdefault(data[1], len(ids) + 1)
...     dataset[i] = '::'.join(str(d) for d in data)
...
>>> print('\n'.join(dataset))
1::1::3::100 
10::1::5::104 
20::2::3::110 
30::3::3::121 
40::4::7::118 
50::5::2::104

Again, if your dataset is large, a faster solution could be devised.
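One possible direction for a large file, offered only as a sketch: process the rows one at a time instead of reading the whole file into a list, and skip the integer conversion. The name remap_item_ids and the new_ids dictionary are hypothetical, not part of the original answer. Ids are numbered in order of first appearance, which gives 1 to n for input sorted by item id.

```python
def remap_item_ids(lines):
    """Yield each line with the item_id column (index 1) renumbered 1..n.

    Hypothetical helper: ids are assigned in order of first appearance.
    """
    new_ids = {}  # original item_id (as text) -> contiguous id
    for line in lines:
        fields = line.rstrip('\n').split('::')
        # setdefault stores the next unused id only when the key is new
        fields[1] = str(new_ids.setdefault(fields[1], len(new_ids) + 1))
        yield '::'.join(fields) + '\n'


# Sample rows taken from the question.
sample = ['1::1::3::100\n', '10::1::5::104\n', '20::2::3::110\n', '1::5::2::104\n']
result = list(remap_item_ids(sample))
print(''.join(result), end='')
```

Because the helper is a generator, it should also work directly on open files, for example outfile.writelines(remap_item_ids(infile)), so only one line is held in memory at a time.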


2014-04-13 00:47:03

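Thanks for the reply, but I think all this does is sort the file by item_id. In my case the rows are already sorted by item id, and I need to make the discrete values contiguous, as in the example I gave above. – anonuser0428

Apologies for misunderstanding your problem statement. I have restored my answer with code that applies (assuming I understand it correctly now). –

Since your data is already sorted by item_id, you can use itertools.groupby(), which makes the solution straightforward.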

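from operator import itemgetter
from itertools import groupby

# Key function: the item_id is the second field of each split row.
item_id = itemgetter(1)

def reOrderItemIds(inputFile, outputFile):
    n = 1  # next contiguous id to assign
    with open(inputFile) as infile, open(outputFile, "w") as outfile:
        # Lazily split each line; the last field keeps its trailing newline.
        dataset = (line.split('::') for line in infile)
        # groupby() yields one group per run of consecutive equal item_ids.
        for key, group in groupby(dataset, item_id):
            for line in group:
                line[1] = str(n)
                outfile.write('::'.join(line))
            n += 1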
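To see what the grouping does to repeated ids, here is a small in-memory sketch run on the sample rows from the question. The function name renumber is hypothetical and used only for illustration, and the sketch assumes that rows with the same item_id are adjacent.

```python
from itertools import groupby
from operator import itemgetter


def renumber(rows):
    """Renumber column 1 of already-grouped rows as 1, 2, 3, ...

    Hypothetical helper; assumes rows with equal item_id are adjacent.
    """
    out = []
    for n, (_, group) in enumerate(groupby(rows, key=itemgetter(1)), start=1):
        for row in group:
            # every row in the same group receives the same new id
            out.append([row[0], str(n)] + row[2:])
    return out


# Sample rows taken from the question, split into fields.
rows = [line.split('::') for line in
        ['1::1::3::100', '10::1::5::104', '20::2::3::110', '1::5::2::104']]
renumbered = ['::'.join(r) for r in renumber(rows)]
print('\n'.join(renumbered))
```

The two rows that share item id 1 stay at 1, and the row with item id 5 becomes 3, so the second column reads 1, 1, 2, 3.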


2014-04-13 02:10:04 wwii

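I think this will change the ids to be contiguous, but it won't leave duplicate ids alone. Note that the second column in the expected output is "1, 1, 2, 3". –

"won't leave duplicate ids alone" - hmm, I don't understand, did you try it? The OP specified that *only the item_ids column changes, everything else stays the same*. My output matches the OP's expected output. groupby() groups by the second item in each list/row. – wwii

Ah, my apologies. I see now that you are iterating over each line in the group. –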
