Python: ValueError: invalid literal for int() with base 10: ''

I have a text file that contains entries like:

70154::308933::3 
UserId::ProductId::Score

I wrote this program to read the text file (sorry, the indentation is a bit messed up here):

def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}

    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        #del innerDict[0:len(innerDict)]

        for line in myFile:
            c += 1
            #line = str(line)
            n = len(line)
            #print 'n: ',n
            if n is not 1:
                # if c%100 ==0: print "%d: "%c, " entries read so far"
                # words = line.replace(' ','_')
                words = line.replace('::',' ')

                words = words.strip().split()

                #print 'userid: ', words[0]
                userId = int(words[0]) # i get error here
                movieId = int (words[1])
                rating =float(words[2])
                print "userId: ", userId, " productId: ", movieId," :rating: ", rating
                #print words
                #words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId,{})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno,strerror):
        print "I/O error({0}) :{1} ".format(errno,strerror)

    finally:
        myFile.close()
    print "total ratings read from file",fileName," :%d " %c
    return dataDict

But I get the error:

ValueError: invalid literal for int() with base 10: ''

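Interestingly, it works just fine when reading data in the same format from another file. Actually, while posting this question I noticed something weird: in the entry 70154::308933::3 there is a space between every character, like 7 space 0 space 1 space 5 space 4 space :: space 3 ... The text file itself looks fine :( Only copy-pasting shows this. Anyway, any clue what is going on? Thanks.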

Source

2011-09-16 Fraz

How are you reading the text file? Post your code. – jozzas

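The "spaces" you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save it as "ANSI", not "Unicode" and not "Unicode big endian". If, however, you need to process it as is, you will need to know or detect what the encoding is. To find out which one it is, do this: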

print repr(open("yourfile.txt", "rb").read(20))

and compare the start of the output with the following:

>>> ucode = u"70154:" 
>>> for sfx in ["", "LE", "BE"]: 
...  enc = "UTF-16" + sfx 
...  print enc, repr(ucode.encode(enc)) 
... 
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00' 
UTF-16LE '7\x000\x001\x005\x004\x00:\x00' 
UTF-16BE '\x007\x000\x001\x005\x004\x00:' 
>>>

You can detect the encoding well enough for your purposes by inspecting the first 2 bytes:

[pseudocode]
if f2b in "\xff\xfe\xff": UTF-16
elif f2b[1] == "\x00": UTF-16LE
elif f2b[0] == "\x00": UTF-16BE
else: cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods.

You can avoid hard-coding the fallback encoding:

>>> import locale 
>>> locale.getpreferredencoding() 
'cp1252'
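The pseudocode and the locale fallback above could be combined into a runnable helper along the following lines. This is only a sketch written for Python 3: the name `detect_encoding` is taken from the reading code in this answer, but its body is an assumption, and the two-byte check is a heuristic that only makes sense when the file starts with an ASCII character.

```python
import locale


def detect_encoding(first_two_bytes):
    """Guess a text encoding from the first two bytes of a file (heuristic only)."""
    f2b = bytes(first_two_bytes[:2])
    if f2b in (b"\xff\xfe", b"\xfe\xff"):
        # A byte order mark is present; the "utf-16" codec consumes it
        # and selects the matching byte order.
        return "utf-16"
    if len(f2b) == 2 and f2b[1:2] == b"\x00":
        # ASCII character followed by NUL: little-endian without a BOM.
        return "utf-16-le"
    if len(f2b) == 2 and f2b[0:1] == b"\x00":
        # NUL followed by an ASCII character: big-endian without a BOM.
        return "utf-16-be"
    # Nothing recognisable: fall back to the platform default (a guess, not a detection).
    return locale.getpreferredencoding()


print(detect_encoding(u"70154:".encode("utf-16-le")[:2]))
print(detect_encoding(u"70154:".encode("utf-16-be")[:2]))
```

Anything that does not look like UTF-16 simply gets the locale default, so a UTF-8 file on a cp1252 machine would still be mislabelled.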

Your line-reading code would then look like this:

rawbytes = open(myFile, "rb").read() 
enc = detect_encoding(rawbytes[:2]) 
for line in rawbytes.decode(enc).splitlines(): 
    # whatever

Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
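For illustration, here is a minimal end-to-end sketch of reading such a file once the encoding is known. It uses `io.open` with an explicit `encoding` rather than decoding the raw bytes by hand, which is a swapped-in alternative to the loop above. The function name `parse_ratings` and the sample data are hypothetical, and the sketch assumes the file has no header line. The decoded lines are `unicode` in Python 2 and `str` in Python 3; `int()` and `float()` accept both.

```python
import io
import os
import tempfile


def parse_ratings(path, encoding):
    """Read UserId::ProductId::Score lines into {user_id: {product_id: score}}."""
    ratings = {}
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            fields = line.strip().split("::")
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            user_id, product_id, score = fields
            ratings.setdefault(int(user_id), {})[int(product_id)] = float(score)
    return ratings


# Build a small UTF-16 sample file so the sketch can be run on its own.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(u"70154::308933::3\n70154::12::4.5\n".encode("utf-16"))
    sample_path = tmp.name

result = parse_ratings(sample_path, "utf-16")
os.remove(sample_path)
print(result)
```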

Source

2011-09-16 03:40:43

Ah man!!! Great.. this solved the problem.. thanks a lot! – Fraz

@Fraz: Glad I could help; which part helped - the Notepad trick or the Python code? –

Hmm.. I actually just wanted to get this working.. so instead I did something like words = words.replace('\x00', '') :) – Fraz
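A short sketch of how that workaround compares with decoding, written for Python 3 with a made-up sample line. Stripping NUL bytes recovers the digits only while the content is pure ASCII, and it leaves a byte order mark in place if the data starts with one, whereas decoding removes both.

```python
# Sample line encoded with a BOM, as Notepad's "Unicode" option would write it.
raw_line = u"70154::308933::3".encode("utf-16")

# The workaround: drop every NUL byte. The two BOM bytes are still there.
stripped = raw_line.replace(b"\x00", b"")
print(repr(stripped))

# Decoding handles the BOM and the two-byte characters in one step.
decoded = raw_line.decode("utf-16")
print(repr(decoded))
```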

Debugging 101: simply change the line:

words = words.strip().split()

to:

words = words.strip().split() 
print words

and see what comes out.

I'll mention a few things. If the file contains the literal text UserId::... and you try to process that line, converting it to an integer will fail.
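One way to tolerate such a header line is to catch the `ValueError` per line. The helper below is a hypothetical sketch (the name `parse_line` is not from the question) that returns `None` for anything that is not three numeric fields separated by `::`.

```python
def parse_line(line):
    """Return (user_id, product_id, score), or None if the line is not numeric data."""
    fields = line.strip().split("::")
    if len(fields) != 3:
        return None
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        # Raised for a header such as "UserId::ProductId::Score"
        # or any other non-numeric text.
        return None


print(parse_line("UserId::ProductId::Score"))  # None
print(parse_line("70154::308933::3"))          # (70154, 308933, 3.0)
```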

And ... the unusual line:

if n is not 1:

I would probably write as:

if n != 1:
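The difference between the two spellings can be sketched as follows: `!=` compares values, while `is not` compares object identity. Whether two equal integers are the same object is an implementation detail of the interpreter, so the identity result is printed here without a claimed value. Recent Python 3 releases also emit a `SyntaxWarning` for `is` or `is not` with a literal.

```python
# Two equal integers built at runtime, outside the range CPython caches.
n = int("1000")
m = int("1000")

print(n == m)   # value comparison: True
print(n is m)   # identity comparison: implementation-dependent, do not rely on it

# The length check from the question, written with a value comparison.
line = "70154::308933::3\n"
has_content = len(line) != 1
print(has_content)   # True
```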

If, as you indicate in your comment, you end up seeing:

['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']

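then I would check your input file for binary (non-text) data. If you are just reading text and trimming/splitting it, you should not end up with that binary information.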

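And since you state that there appear to be spaces between the digits, you should do a hex dump of the file to find out what is really in there. It could be a UTF-16 Unicode string, for example.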
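If no hex dump tool is at hand, a few lines of Python can do the job. The helper below is a hypothetical sketch (`hex_dump` is not a standard library function), and the sample bytes are made up to mimic a UTF-16LE line; for a real file you would pass something like `open("yourfile.txt", "rb").read(64)`.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values and printable ASCII, `width` bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = bytearray(data[offset:offset + width])
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        # Non-printable bytes (including NUL) are shown as dots.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append("{:08x}  {:<{w}}  {}".format(offset, hex_part, text_part, w=width * 3 - 1))
    return "\n".join(rows)


sample = u"70154::308933::3\n".encode("utf-16-le")
print(hex_dump(sample))
```

In a UTF-16 file every other byte of ASCII text shows up as `00`, which is exactly the "space" between the digits.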

Source

2011-09-16 00:45:52 paxdiablo

I get the following output: ['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3'] Actually, as you can see, this is a nested dictionary.. so I suspect it is giving me the address of the key? – Fraz

@Fraz, if 'line' comes from a _text_ file, you should not end up with that in 'words'. It should be textual. You need to check your input file. – paxdiablo

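@Fraz I agree with paxdiablo, the problem is probably the encoding of your input data. @pax - you are right to use '!=', but 'is not' will work on CPython because 0-255 are interned, so all '1's have the same identity. – agf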
