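Python web scraping - UnicodeEncodeError
I have some code that collects links from a website. There are a few problems. First, the code only produces the correct output on the second run; the first run produces no output. Second, I get an error.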
Error message:
Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 326, in RunScript
    exec(codeObject, __main__.__dict__)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\ld.py", line 25, in <module>
    wr.writerows([tag.text])
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x91' in position 410: character maps to <undefined>
Code:
import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.dhr.gov.in/"
urls = [url]
visited = [url]
resultFile = open('t5.csv','w', newline ='')
wr = csv.writer(resultFile, delimiter = ' ')
while len(urls)>0:
    try:
        htmltext = urllib.request.urlopen(urls[0]).read()
    except:
        pass
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    for tag in soup.findAll('a', href = True):
        tag['href'] = urllib.parse.urljoin(url,tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])
            print([tag['href']])
            print([tag.text])
            wr.writerows([tag['href']])
            wr.writerows([tag.text])
resultFile.close()
I am using Python 3.5. I would also like to know how to set the search depth.
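On the search-depth question, one common approach is to store each URL in the queue together with its depth and to stop expanding pages once a limit is reached. The following is only a minimal sketch of that idea, not a drop-in replacement for the code above. The names `crawl`, `get_links`, `max_depth`, and `fake_site` are hypothetical; in the real script, `get_links` would presumably wrap the `urllib.request.urlopen` and `BeautifulSoup` calls. It also uses `startswith` instead of the original `url in tag['href']` check, which is an assumption about what "same site" should mean.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Breadth-first crawl that does not follow links beyond max_depth.

    get_links is a hypothetical callable: url -> iterable of absolute URLs.
    Returns the visited pages as (url, depth) tuples in visiting order.
    """
    visited = {start_url}                 # set gives fast membership tests
    queue = deque([(start_url, 0)])       # each entry carries its own depth
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue                      # at the limit: record, do not expand
        for link in get_links(url):
            # Assumption: only links under the start URL are followed.
            if link.startswith(start_url) and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Stand-in link graph so the sketch runs without network access.
fake_site = {
    "http://example.test/": ["http://example.test/a", "http://example.test/b"],
    "http://example.test/a": ["http://example.test/a/deep"],
    "http://example.test/b": [],
}
result = crawl("http://example.test/", lambda u: fake_site.get(u, []), max_depth=1)
```

With `max_depth=1`, the start page and the pages it links to directly are visited, but `http://example.test/a/deep` is not, because pages at depth 1 are never expanded.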
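The error is resolved. It was caused by Hindi characters, and I added an exception handler. The problem that the results are only populated on the second run still remains.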
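As a possible alternative to catching the exception, the traceback suggests the CSV file was being written with the Windows default `cp1252` codec, so opening the file with an explicit encoding may avoid the error altogether. The sketch below assumes UTF-8 is acceptable for the output file; `write_links`, `sample_rows`, and the temporary file name are hypothetical. It also uses `writerow` with a list of fields, since `writerows([some_string])` treats the string itself as a row and writes each character as a separate field.

```python
import csv
import os
import tempfile


def write_links(path, rows):
    """Write (href, text) pairs to a CSV file encoded as UTF-8."""
    # Assumption: UTF-8 output is acceptable. Without encoding=..., Python
    # falls back to the locale default, which was cp1252 in the traceback.
    # The with-block closes the file even if an exception is raised.
    with open(path, "w", newline="", encoding="utf-8") as result_file:
        writer = csv.writer(result_file)
        for href, text in rows:
            writer.writerow([href, text])   # one row with two fields


# Hypothetical sample containing characters that cp1252 cannot encode.
sample_rows = [("http://www.dhr.gov.in/", "\x91quoted\x92 हिन्दी")]
out_path = os.path.join(tempfile.gettempdir(), "t5_utf8_demo.csv")
write_links(out_path, sample_rows)

# Read the file back with the same encoding to check the round trip.
with open(out_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
```

When such a file is opened in a spreadsheet program, the encoding may need to be selected on import; that depends on the program and is outside what this sketch covers.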
Another problem is that the file opens in read-only mode. Why is that? I am sorry for asking silly questions. This is my first time coding in Python.