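Working with CSVs: reading and writing data in the correct order

I have two .csv files related to Twitter data. One has the tweet text, and the other has the IDs of those tweets. The file with the IDs is the population from which the tweets in the other file were sampled. I am trying to write a script that reads the text, looks up the corresponding ID in the other file, and then writes a new .csv file containing both the ID and the text of the tweets in the smaller sample.

This is what I have so far: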
import csv

# creates empty dictionary in which to store tweetIDs and tweet text
originals_data = {}
# declares an empty list to hold tweet text from coded datafile
# will be used to compare against the dictionary created earlier
coded_data = []
coded_all = []  # for all, not just text
# list to hold the IDs belonging to coded tweets for the round
tweet_IDs_for_coded = []

with open('first20.csv', 'rt') as round_in, open('gg_originals.csv', 'rt') as original_in:
    # reader object for gg_originals
    readOrigin = csv.reader(original_in, delimiter=',')
    # adds values from .csv file into the dictionary
    for row in readOrigin:
        originals_data[row[0]] = row[1]
    # reader object for round_x data
    readRound = csv.reader(round_in, delimiter=",")
    # appends the tweet text to a list
    for row in readRound:
        coded_data.append(row[0])
    # iterates over id:text dictionary
    for tweet_id in originals_data:
        # iterates over coded_data
        for tweet in coded_data:
            # When tweet in list matches text in dict, sends key to list
            if tweet == originals_data[tweet_id]:
                tweet_IDs_for_coded.append(tweet_id)

with open('first20.csv', 'rt') as round_in, open('test2.csv', 'wt') as output:
    # reader object for round_x data
    readRound = csv.reader(round_in, delimiter=",")
    # creates writer object to write new csv file with IDs
    writeNew = csv.writer(output, delimiter=",")
    # list that holds everything that's going into the csv file
    everything = []
    # sets row to equal a single row from round data
    row = next(readRound)
    row.insert(0, 'ID')
    # appends ID and then all existing data to list of rows
    everything.append(row)
    for i, row in enumerate(readRound):
        everything.append([str(tweet_IDs_for_coded[i])] + row)
    writeNew.writerows(everything)

The data in the population file (gg_originals.csv) looks like this:
tweet_id_str,text
534974890168700930,abcd
534267820071084033,abce
539572102441877504,abcf
539973576108294145,abcg
529278820876943361,abch
529583601244176384,abci
535172191743397888,abcj
532195210059874304,abck
537812033895669760,abcl
,
,
The text-only file, which is a subset of the population, looks like this:
text
abcl
abci
abcd
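What I have so far runs, seems to find the correct IDs, and even writes them into a new column in the new .csv file. However, the IDs in the new file are not in the right rows: they show up next to text that they do not actually correspond to, which is bad!

The new file should look like this: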
ID,text
537812033895669760,abcl
529583601244176384,abci
534974890168700930,abcd
Instead, it ends up like this:
ID,text
529583601244176384,abcl
537812033895669760,abci
534974890168700930,abcd
The correct IDs are found, but they are written to the wrong rows.
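
To make the symptom easier to see, here is a minimal in-memory sketch of the same nested loop, with no files involved. The dictionary and list below are stand-ins built from three rows of the sample data, and the sketch assumes Python 3.7+, where dictionaries iterate in insertion order (on older versions the order is arbitrary, so the exact wrong rows may differ from what is shown in the question).

```python
# In-memory stand-ins for the two files (rows taken from the sample data).
originals_data = {
    "534974890168700930": "abcd",
    "529583601244176384": "abci",
    "537812033895669760": "abcl",
}
coded_data = ["abcl", "abci", "abcd"]

# Same nested loop as in the script: the OUTER loop walks the dictionary,
# so IDs are appended in dictionary order, not in coded_data order.
tweet_IDs_for_coded = []
for tweet_id in originals_data:
    for tweet in coded_data:
        if tweet == originals_data[tweet_id]:
            tweet_IDs_for_coded.append(tweet_id)

# Pairing the two lists by position puts some IDs next to the wrong text.
for tweet_id, tweet in zip(tweet_IDs_for_coded, coded_data):
    status = "OK" if originals_data[tweet_id] == tweet else "MISMATCH"
    print(tweet_id, tweet, status)
```

The IDs themselves are all valid; only their position in `tweet_IDs_for_coded` is decoupled from the order of `coded_data`, which would explain why indexing that list with `enumerate` over the coded rows misaligns them.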
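Please post a sample data set (from both files). – Saleem
Including sample input, what you actually get as output, and what you expect to get would be helpful. See [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). Welcome to StackOverflow! –
I think you are seriously misusing your dictionary, but a data sample would help. You could just iterate over `coded_data` and, on each iteration, do `tweet_IDs_for_coded.append(coded_data[tweet])` (perhaps handling the exception in some way if the tweet isn't found in the dictionary). But I think you would need the tweet itself as the dictionary key, rather than the ID? This would need sample data for further help. – roganjosh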
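
A rough sketch of what the last comment seems to suggest: key the dictionary on the tweet text, then walk the coded rows in their own order and look each one up. The function name `attach_ids` and the `io.StringIO` stand-ins for the files are made up for illustration, and the sketch assumes tweet texts are unique in the population file (duplicates would overwrite each other in the dictionary).

```python
import csv
import io


def attach_ids(originals_file, coded_file, output_file):
    """Write ID + original columns to output_file, in coded_file order."""
    originals = csv.reader(originals_file)
    next(originals, None)  # skip the "tweet_id_str,text" header row
    # Map tweet text -> tweet ID; skip blank or incomplete rows.
    id_by_text = {row[1]: row[0] for row in originals if len(row) >= 2 and row[0]}

    coded = csv.reader(coded_file)
    header = next(coded)
    writer = csv.writer(output_file, lineterminator="\n")
    writer.writerow(["ID"] + header)
    for row in coded:
        if not row:
            continue
        # A text missing from the population gets an empty ID, not a KeyError.
        writer.writerow([id_by_text.get(row[0], "")] + row)


# Hypothetical in-memory versions of gg_originals.csv and first20.csv.
originals_csv = io.StringIO(
    "tweet_id_str,text\n"
    "534974890168700930,abcd\n"
    "529583601244176384,abci\n"
    "537812033895669760,abcl\n"
)
coded_csv = io.StringIO("text\nabcl\nabci\nabcd\n")
result = io.StringIO()
attach_ids(originals_csv, coded_csv, result)
print(result.getvalue())
```

Because each ID is looked up while its own row is being written, there is no separate ID list that can drift out of step with the coded rows. With real files, the same function could be called with handles from `open(..., newline='')`.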