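I can't slice a UTF-8 encoded file correctly. After I open it with codecs, the byte order mark (BOM) character at the start shifts everything by one position, which makes slicing parts of the text difficult. How can I slice a UTF-8 encoded file reliably?
See the details of my attempt below.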
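import codecs

def readfiles(filepaf):
    # Decode as UTF-8; a leading BOM survives as u'\ufeff'.
    with codecs.open(filepaf, 'r', 'utf-8') as f:
        g = f.read()
    # Collapse runs of whitespace into single spaces.
    q = ' '.join(g.split())
    return q

q = readfiles('c:xxx')
# q now holds: Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...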
>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'
>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'
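The only accurate result comes from printing the sliced part directly. My program uses the sliced parts rather than printing them, though, and most of the time the slices are inaccurate because of the shift at the start.
Desired output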
Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin
Any suggestions on how to do this so that there is no BOM character at the start?
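
One possible approach, sketched under the assumption that the file really is UTF-8 with an optional leading BOM: Python's `utf-8-sig` codec decodes like `utf-8` but skips a BOM at the start of the input. The helper name `read_text_without_bom` and the temporary demo file are made up for illustration, and `io.open` is used only so the sketch runs on both Python 2 and Python 3.

```python
import codecs
import io
import os
import tempfile


def read_text_without_bom(filepath):
    """Read a UTF-8 file, skipping a leading BOM if one is present."""
    # 'utf-8-sig' behaves like 'utf-8' but drops a leading U+FEFF.
    with io.open(filepath, 'r', encoding='utf-8-sig') as f:
        text = f.read()
    # Collapse runs of whitespace into single spaces, as in the question.
    return ' '.join(text.split())


# Demo with a throwaway file that starts with a BOM (hypothetical sample data).
sample = u'Katharine opened her lips\nand drew in her breath'
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(codecs.BOM_UTF8 + sample.encode('utf-8'))

q = read_text_without_bom(demo_path)
os.remove(demo_path)

# The slice now starts at the first real character of the text.
print(repr(q[0:9]))
```

With the BOM dropped at decode time, slice offsets such as `q[0:100]` count from the first real character, so no later adjustment should be needed.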
Note: the proper name for the "unwanted character at the start" is the [BOM](http://en.wikipedia.org/wiki/Byte_order_mark).
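
If the text has already been decoded and still carries the BOM, a minimal sketch of removing it by hand looks like this. It assumes the BOM shows up as a single leading U+FEFF, as in the `q[0:100]` output in the question; `strip_bom` is a hypothetical helper name, not a library function.

```python
BOM = u'\ufeff'  # U+FEFF, the character a UTF-8 BOM decodes to


def strip_bom(text):
    """Return text without a single leading U+FEFF, if present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


# The first characters of the slice shown in the question.
q = u'\ufeffKatharine opened her lips'
clean = strip_bom(q)

print(repr(q[0:9]))      # the BOM occupies index 0, so the slice is shifted
print(repr(clean[0:9]))  # the slice starts at the first real character
```

Checking with `startswith` rather than always dropping index 0 keeps the helper safe for files that have no BOM.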