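Python file preprocessing (converting a column from a discrete range of values to a contiguous range of values)
I have a dataset of the following form: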
user_id::item_id1::rating::timestamp
user_id::item_id2::rating::timestamp
user_id::item_id3::rating::timestamp
user_id::item_id4::rating::timestamp
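I need the item_ids to be contiguous from 1 to n, where n is the number of distinct item IDs. The file is sorted by item ID: a later line may have the same item ID as the previous one or a different one, but the order is guaranteed to be sorted. The IDs currently range from 1 to k,
for k >> n.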
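To make the requirement concrete: each item ID is replaced by its 1-based rank among the distinct IDs. The following is only a minimal sketch of that mapping, assuming the IDs fit in memory; `build_rank_map` is a hypothetical helper name, and the sample IDs are the ones from the example further down in this post.

```python
def build_rank_map(item_ids):
    """Map each distinct item ID to its 1-based rank in sorted order."""
    distinct = sorted(set(item_ids))
    return {old: new for new, old in enumerate(distinct, start=1)}


# Item IDs from the example in this post: they span 1..5, but only 3 are used.
sample_ids = [1, 1, 2, 5]
rank_map = build_rank_map(sample_ids)
remapped = [rank_map[i] for i in sample_ids]
print(remapped)
```

Because this builds the map from the set of IDs, it does not rely on the file being sorted, at the cost of reading the IDs once before rewriting.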
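I have the code below, but it is not quite correct, and I have been at it for a few hours. I would really appreciate any help with this, or, if there is a simpler way to do this in Python, any guidance on that.
This is the code I currently have: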
def reOrderItemIds(inputFile,outputFile):
    #This is a list in the range of 1 to 10681.
    itemIdsRange = set(range(1,10682))
    #currKey = 1
    currKey = itemIdsRange.pop()
    lastContiguousKey=1
    #currKey+1
    contiguousKey=itemIdsRange.pop()
    f = open(inputFile)
    g = open(outputFile,"w")
    oldKeyToNewKeyMap = dict()
    for line in f:
        if int(line.split(":")[1]) == currKey and int(line.split(":")[1])==lastContiguousKey:
            g.write(line)
        elif int(line.split(":")[1])!=currKey and int(line.split(":")[1])!=contiguousKey:
            oldKeyToNewKeyMap[line.split(":")[1]]=contiguousKey
            lastContiguousKey=contiguousKey
            #update current key to the value of the current key.
            currKey=int(line.split(":")[1])
            contiguousKey=itemIdsRange.pop()
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
        elif int(line.split(":")[1])==currKey and int(line.split(":")[1])!=contiguousKey:
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
        elif int(line.split(":")[1])!=currKey and int(line.split(":")[1])==contiguousKey:
            currKey = int(line.split(":")[1])
            lastContiguousKey=contiguousKey
            oldKeyToNewKeyMap[line.split(":")[1]] = lastContiguousKey
            contiguousKey=itemIdsRange.pop()
            g.write(line.split(":")[0]+":"+str(lastContiguousKey)+":"+line.split(":")[2]+":"+line.split(":")[3])
    f.close()
    g.close()
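Two things in this code may be worth checking, offered as observations rather than a full diagnosis. First, the fields are separated by `::`, so splitting on a single `:` leaves empty strings between the fields, and index 1 is not the item ID. Second, `set.pop()` removes an arbitrary element, so it is not guaranteed to hand out 1, 2, 3, ... in order. A quick check of the first point, using one line from the example data:

```python
line = "1::5::2::104\n"

# Splitting on a single colon leaves empty strings between the fields.
single = line.strip().split(":")
print(single)   # ['1', '', '5', '', '2', '', '104']

# Splitting on the actual two-colon delimiter gives the four fields.
fields = line.strip().split("::")
print(fields)   # ['1', '5', '2', '104']
```

With the single-colon split, `int(line.split(":")[1])` is `int("")`, which raises a `ValueError`.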
Example:
1::1::3::100
10::1::5::104
20::2::3::110
1::5::2::104
The output I need has the following form:
1::1::3::100
10::1::5::104
20::2::3::110
1::3::2::104
So only the item_ids column changes; everything else stays the same.
Any help would be greatly appreciated!
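For reference, here is a minimal single-pass sketch of the transformation described in this post. It is one possible approach, not a confirmed solution: it assumes `::`-separated fields with the item ID in the second position and no blank lines in the file. The names `remap_item_ids` and `reorder_item_ids` are hypothetical. Each item ID gets the next unused number the first time it is seen; since the input is sorted by item ID, that order matches the sorted order.

```python
def remap_item_ids(lines):
    """Yield lines with item IDs renumbered 1..n in order of first appearance.

    Assumes '::'-separated fields, item ID in the second field, no blank lines.
    """
    old_to_new = {}
    for line in lines:
        fields = line.rstrip("\n").split("::")
        old_id = fields[1]
        # Assign the next unused contiguous ID the first time an ID is seen;
        # setdefault returns the existing value if old_id is already mapped.
        new_id = old_to_new.setdefault(old_id, len(old_to_new) + 1)
        fields[1] = str(new_id)
        yield "::".join(fields) + "\n"


def reorder_item_ids(input_path, output_path):
    """Stream input_path to output_path with remapped item IDs."""
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.writelines(remap_item_ids(src))


# The example input from this post, run through the line-level helper.
sample = ["1::1::3::100\n", "10::1::5::104\n", "20::2::3::110\n", "1::5::2::104\n"]
result = list(remap_item_ids(sample))
print("".join(result), end="")
```

The file is processed line by line, so only the old-to-new mapping is held in memory, not the whole dataset.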
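Is your example an example of your function's output, or an example of the dataset? – wwii
Is there a typo in the third line of your expected output? Should it be "20::2::3::110"? – wwii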