Subclassing the BeautifulSoup HTML parser, getting a TypeError

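I wrote a small wrapper around BeautifulSoup, the great HTML parser.

Recently I tried to improve the code by making all the BeautifulSoup methods available directly on the wrapper class (rather than through a class attribute), and I thought subclassing the BeautifulSoup parser would be the best way to achieve that.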

Here is the class:

import re
import urllib2

from BeautifulSoup import BeautifulSoup


class ScrapeInputError(Exception):
    pass


class Scrape(BeautifulSoup):
    """Base class to be subclassed.

    Basically a subclassed BeautifulSoup wrapper that provides
    basic URL fetching with urllib2,
    basic HTML parsing with BeautifulSoup,
    and some basic cleaning of head, scripts, etc.
    """

    def __init__(self, file):
        self._file = file
        # very basic input validation
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")

        self._page = urllib2.urlopen(self._file)  # fetching the page
        BeautifulSoup.__init__(self, self._page)  # calling the HTML parser
        # self._soup = BeautifulSoup(self._page)  # the earlier, non-subclassed version

This way I can instantiate the class with

x = Scrape("http://someurl.com")

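and traverse the tree with x.elem or x.find.

This works wonderfully with some BeautifulSoup methods (see above) but fails with others: those that use iteration, such as "for e in x:".

Error message: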

Traceback (most recent call last): 
    File "<pyshell#86>", line 2, in <module> 
    print e 
    File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__ 
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs) 
    File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall 
    seq = self.asynccall(oid, methodname, args, kwargs) 
    File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall 
    self.putmessage((seq, request)) 
    File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage 
    s = pickle.dumps(message) 
    File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex 
    raise TypeError("a class that defines __slots__ without " 
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

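I researched the error message but couldn't find anything I could work with, because I don't want to tinker with BeautifulSoup's internals (and honestly I don't know or understand __slots__ or __getstate__). I just want to use the functionality.

Instead of subclassing, I also tried returning a BeautifulSoup object from the class's __init__, but __init__ always returns None.

I'd be glad for any help here.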

Asked 2011-10-07 by alonisser

Side note: don't use 're' to test whether a string starts with a given substring; that's overkill. Use 'str.startswith()' instead ('if not file.startswith("http://"):').

Thanks, Ferdinand! – alonisser
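The 'startswith' suggestion above can be sketched as follows. This is only an illustration, not the asker's code: both helper names are hypothetical, and the tuple form is included just to show that 'str.startswith' also accepts several candidate prefixes at once.

```python
def has_http_prefix(url):
    """Return True if url begins with the literal prefix "http://"."""
    return url.startswith("http://")


def has_allowed_prefix(url, prefixes=("http://", "https://")):
    """str.startswith also accepts a tuple of prefixes to try."""
    return url.startswith(prefixes)
```

No regular expression is compiled or matched here; the check is a plain prefix comparison.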

Another side note: do you really want to disallow 'https://'? (Or 'ftp://', or 'file://'?) You may want to rely on 'urlopen''s own validation; it raises 'urllib2.URLError' on an invalid URL.
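A rough sketch of relying on 'urlopen' for validation, as the comment above suggests. The helper name 'open_page' is hypothetical, and the import fallback is only there so the block also runs on Python 3, where 'urllib2' was split into 'urllib.request' and 'urllib.error'. As far as I know, a string with no scheme at all raises 'ValueError' rather than 'URLError', so this sketch catches both; treat that detail as an assumption to verify on your version.

```python
try:  # Python 2, as used in the question
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 locations of the same names
    from urllib.request import urlopen
    from urllib.error import URLError


class ScrapeInputError(Exception):
    """Raised when the given URL cannot be opened."""


def open_page(url):
    """Open url, turning urlopen's own failures into ScrapeInputError."""
    try:
        return urlopen(url)
    except (URLError, ValueError) as exc:
        # URLError: unknown scheme, unreachable host, HTTP error status, ...
        # ValueError: a string without any scheme, e.g. "someurl.com"
        raise ScrapeInputError("cannot open %r: %s" % (url, exc))
```

With this, 'https://' and other schemes that 'urlopen' supports are accepted without a hand-written prefix check.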

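The error does not occur in BeautifulSoup code. Rather, your IDLE is unable to retrieve and print the object. Try print str(e) instead.

In any case, subclassing BeautifulSoup is probably not the best idea in your situation. Do you really want to inherit all the parsing methods (such as convert_charref, handle_pi or error)? Worse, if you override something BeautifulSoup itself uses, it may break in ways that are hard to track down.

I don't know your situation, but I suggest preferring composition over inheritance (i.e. holding a BeautifulSoup object in an attribute). You can easily (if in a slightly hacky way) expose specific methods like this: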

class Scrape(object):
    def __init__(self, ...):
        self.soup = ...
        ...
        self.find = self.soup.find
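A slightly fuller, runnable sketch of the same composition idea, for illustration only. So that the block runs without BeautifulSoup installed, 'Scrape' takes the parser class as a hypothetical 'parser' argument and 'FakeSoup' is a made-up stand-in; in real use you would presumably pass the 'BeautifulSoup' class and the page object from 'urllib2.urlopen'.

```python
class FakeSoup(object):
    """Hypothetical stand-in for BeautifulSoup, used only in this demo."""

    def __init__(self, markup):
        self.words = markup.split()

    def find(self, word):
        # Return the first matching word, or None if there is no match.
        for candidate in self.words:
            if candidate == word:
                return candidate
        return None


class Scrape(object):
    """Holds a parser object in an attribute instead of inheriting from it."""

    def __init__(self, markup, parser=FakeSoup):
        self.soup = parser(markup)   # the wrapped object
        self.find = self.soup.find   # expose one specific bound method


x = Scrape("alpha beta gamma")
found = x.find("beta")
missing = x.find("delta")
```

Only the methods you explicitly expose become part of the wrapper's interface, so nothing from the parser's internals can be overridden by accident.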

Answered 2011-10-07 09:23:40

Thanks, petr viktorin! I'll try the composition approach! – alonisser

Does this approach also work for the __iter__ and __key__ methods? – alonisser

[No](http://docs.python.org/reference/datamodel.html#special-method-lookup-for-new-style-classes), but you can still do 'def __iter__(self): return iter(self.soup)'.
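The difference described in the comment above can be seen in a small sketch. All class names here are hypothetical and 'FakeSoup' merely stands in for a parsed document. For new-style classes, special methods such as __iter__ are looked up on the class, so assigning one on the instance has no effect on 'for e in x:'.

```python
class FakeSoup(object):
    """Hypothetical stand-in for a parsed document; iterating yields its items."""

    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)


class InstanceAttrScrape(object):
    """Sets __iter__ on the instance: iteration does not pick this up."""

    def __init__(self, items):
        self.soup = FakeSoup(items)
        self.__iter__ = self.soup.__iter__


class Scrape(object):
    """Defines __iter__ on the class: iteration is delegated to the soup."""

    def __init__(self, items):
        self.soup = FakeSoup(items)

    def __iter__(self):
        return iter(self.soup)


def is_iterable(obj):
    """Return True if iter(obj) succeeds, False if it raises TypeError."""
    try:
        iter(obj)
        return True
    except TypeError:
        return False
```

The same reasoning applies to other special methods (for example __len__ or __getitem__): each one you need has to be defined on the wrapper class and forwarded explicitly.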
