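I need to parse several hundred HTML files archived on a server. The files are accessed via UNC, and I use pathlib's as_uri() method to convert the UNC path to a URI. With Python 3.6.3, urlopen drops the server name from the URI for HTML files stored on a remote server.
Full UNC path for the example below: \\dmsupportfs\~images\sandbox\test.html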
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, pathlib
source_path = os.path.normpath('//dmsupportfs/~images/sandbox/') + os.sep
filename = 'test.html'
full_path = source_path + filename
url = pathlib.Path(full_path).as_uri()
print('URL -> ' + url)
url_html = urlopen(url).read()
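So the URI (URL) I am passing to urlopen is: file://dmsupportfs/%7Eimages/sandbox/test.html
I can paste this into any web browser and the page comes back. However, when urlopen goes to read the page, it ignores/drops the server name (dmsupportfs) from the URI, so the read fails because the file cannot be found. I assume this has to do with how urlopen handles the URI, but I am stumped (probably a quick and easy fix... sorry, I'm fairly new to Python). If I map the UNC location to a drive letter and use the mapped drive letter instead of the UNC path, this works without any problem. I would like to not have to rely on a mapped drive to do this. Any suggestions?
Here is the error output from the code above: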
Traceback (most recent call last):
  File "C:\Anaconda3\lib\urllib\request.py", line 1474, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "url_test.py", line 10, in <module>
    url_html = urlopen(url).read()
  File "C:\Anaconda3\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Anaconda3\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "C:\Anaconda3\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "C:\Anaconda3\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Anaconda3\lib\urllib\request.py", line 1452, in file_open
    return self.open_local_file(req)
  File "C:\Anaconda3\lib\urllib\request.py", line 1491, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'>
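The path in the error is missing the server name, and that seems to come from how the URI is split before the file handler sees it. A minimal sketch of that split, using the example URI from above (the variable names here are only for illustration):

```python
from urllib.request import Request

# The URI produced by pathlib's as_uri() in the example above.
url = 'file://dmsupportfs/%7Eimages/sandbox/test.html'

# urlopen wraps the URL in a Request, which separates host and selector.
req = Request(url)
host = req.host          # the server name ends up here
selector = req.selector  # only this part is later turned into a local path

print('host     -> ' + host)      # host     -> dmsupportfs
print('selector -> ' + selector)  # selector -> /%7Eimages/sandbox/test.html
```

Judging by the traceback, open_local_file builds the path for os.stat from the selector alone, which would explain why '\\~images\\sandbox\\test.html' has no server in front of it.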
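UPDATE: Well, digging through the traceback above and into the actual methods, I found this, which basically tells me that what I am trying to do with a file:// URI will not work for remote servers.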
def file_open(self, req):
    url = req.selector
    if url[:2] == '//' and url[2:3] != '/' and (req.host and
            req.host != 'localhost'):
        if not req.host in self.get_names():
            raise URLError("file:// scheme is supported only on localhost")
Any ideas, then, on how to get this working without mapping a drive?
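One direction I am considering, as a rough sketch only: skip urlopen and the file:// URI entirely and read the file straight from the UNC path. This assumes the account running the script can already reach the share without a mapped drive; read_html_bytes is a hypothetical helper name, not something from urllib or bs4.

```python
from pathlib import Path


def read_html_bytes(path):
    """Return the raw bytes of an HTML file read directly from the filesystem.

    Hypothetical helper: it bypasses urlopen, so no file:// URI is involved.
    """
    return Path(path).read_bytes()


# Intended use on Windows with the UNC path from the question (not run here):
# html = read_html_bytes(r'\\dmsupportfs\~images\sandbox\test.html')
# soup = BeautifulSoup(html, 'html.parser')
```

The bytes returned here would take the place of the urlopen(url).read() result in the script above.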