Web scraping with nested frames and JavaScript

I want to get answers from an online chatbot: http://talkingbox.dyndns.org:49495/braintalk? (the trailing ? is part of the link)

To send a question, you only need to send a simple request:

http://talkingbox.dyndns.org:49495/in?id=3B9054BC032E53EF691A9A1803040F1C&msg=[Here the question]

The page source looks like this:

<frameset cols="*,185" frameborder="no" border="0" framespacing="0"> 
<frameset rows="100,*,82" frameborder="no" border="0" framespacing="0"> 
    <frame src="http://thebot.de/bt_banner.html" marginwidth="0" name="frtop" scrolling="no" marginheight="0" frameborder="no"> 
    <frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout" marginwidth="0" marginheight="0"> 
    <frameset rows="100%,*" border="0" framespacing="0" frameborder="no"> 
     <frame src="bt_in?id=3B9054BC032E53EF691A9A1803040F1C" name="frin" scrolling="no" marginwidth="0" marginheight="0" noresize> 
     <frame src="" name="frempty" marginwidth="0" marginheight="0" scrolling="auto" frameborder="no" > 
    </frameset> 
</frameset> 
<frameset frameborder="no" border="0" framespacing="0" rows="82,*"> 
    <frame src="stats?" name="fr1" scrolling="no" marginwidth="0" marginheight="0" frameborder="no"> 
    <frame src="http://thebot.de/bt_rechts.html" name="fr2" scrolling="auto" marginwidth="0" marginheight="0" frameborder="no" > 
</frameset> 
</frameset>

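I have been using mechanize and BeautifulSoup for web scraping, but I suspect mechanize does not support dynamic web pages.

How can I get the answer in this case?

I am also looking for a solution that works well on both Windows and Linux.

Source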

2014-01-15 user3175993

You could try Selenium, which is good at browser automation, together with its bindings for PhantomJS (a headless WebKit with a JS API; WebKit is the rendering engine). http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/#.UtYORpDtn4w
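A rough sketch of what the Selenium route might look like, in case it helps. It assumes Selenium 4 with a headless Firefox instead of PhantomJS (which current Selenium releases no longer drive), and it assumes the bot's reply is rendered inside the frame named frout, which I could not verify. The function name read_frame_text and its parameters are made up for illustration.

```python
def read_frame_text(url, frame_name="frout", timeout=10):
    """Open url in a headless browser and return the visible text of one frame.

    Hypothetical helper: the frame name and the timeout are assumptions.
    """
    # Imported lazily so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until the frame exists, then switch the driver into it.
        WebDriverWait(driver, timeout).until(
            EC.frame_to_be_available_and_switch_to_it(frame_name)
        )
        # Inside the frame, the body text is whatever the browser rendered,
        # including anything produced by JavaScript.
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()
```

Calling read_frame_text("http://talkingbox.dyndns.org:49495/braintalk?") would then return the rendered text of that frame. The same script should run on Windows and Linux as long as Firefox and its driver are installed.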

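What do you mean by a dynamic web page? These frameworks only know about HTTP requests, and the link you shared is not accessible.

The link is dead.

I would use Requests for a task like this.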

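import requests

your_question = "Hello"  # placeholder; put your question here
r = requests.get(
    "http://talkingbox.dyndns.org:49495/in",
    # params percent-encodes the question (spaces, '&', '?') for you
    params={"id": "3B9054BC032E53EF691A9A1803040F1C", "msg": your_question},
)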

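For pages that do not contain dynamic content, r.text is what you want.

Since you have not given more information about the dynamic page, there is not much more I can say.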
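One thing that may help: a frameset page carries no content of its own, and each frame is a separate document with its own URL. A minimal sketch of pulling those URLs out with the standard library is below. It assumes the frameset quoted in the question is what the server returns for braintalk?, and that the reply ends up in the frout frame; neither can be checked while the link is dead. FrameCollector and frame_urls are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect name -> src for every <frame>, however deeply the framesets nest."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attrs = dict(attrs)
            # Skip frames without a name or with an empty src (e.g. "frempty").
            if attrs.get("name") and attrs.get("src"):
                self.frames[attrs["name"]] = attrs["src"]


def frame_urls(html, base_url):
    """Return {frame name: absolute URL}, resolving relative src values."""
    collector = FrameCollector()
    collector.feed(html)
    return {name: urljoin(base_url, src) for name, src in collector.frames.items()}


# Shortened copy of the frameset from the question.
FRAMESET = """
<frameset cols="*,185">
<frameset rows="100,*,82">
    <frame src="http://thebot.de/bt_banner.html" name="frtop" scrolling="no">
    <frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout">
    <frameset rows="100%,*">
     <frame src="bt_in?id=3B9054BC032E53EF691A9A1803040F1C" name="frin" noresize>
     <frame src="" name="frempty">
    </frameset>
</frameset>
</frameset>
"""

urls = frame_urls(FRAMESET, "http://talkingbox.dyndns.org:49495/braintalk?")
print(urls["frout"])
```

With the absolute URL in hand, requests.get(urls["frout"]).text would fetch that frame's document like any other page.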

Source

2014-01-15 04:43:33 laike9m

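Whether you use BeautifulSoup, mechanize, Requests, or even Scrapy, loading that dynamic page has to happen in a separate step that you write yourself.

For example, with Scrapy it might look like this: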

import scrapy


class TheBotSpider(scrapy.Spider):
    name = 'thebot'
    allowed_domains = ['thebot.de', 'talkingbox.dyndns.org']

    def __init__(self, *a, **kw):
        # Arguments passed with -a (here: question) become attributes in super().
        super(TheBotSpider, self).__init__(*a, **kw)
        self.domain = 'http://talkingbox.dyndns.org:49495/'
        self.start_urls = [self.domain +
                           'in?id=3B9054BC032E53EF691A9A1803040F1C&msg=' +
                           self.question]

    def parse(self, response):
        # The frame's src is relative, so resolve it against the page URL.
        src = response.xpath('//frame[@name="frout"]/@src').extract()[0]
        yield scrapy.Request(url=response.urljoin(src),
                             callback=self.dynamic_page)

    def dynamic_page(self, response):
        # .... xpath to scrape answer
        pass

To run it, pass the question as an argument:

scrapy crawl thebot -a question=[Here the question]

For details on how to use Scrapy, see the Scrapy tutorial.
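Note that the spider above concatenates the question into the URL as-is, so a question containing spaces, & or ? would likely produce a malformed request. A small sketch of percent-encoding it first, using only the standard library; question_url is a hypothetical helper, and the id value is simply the one from the question.

```python
from urllib.parse import quote

BASE = "http://talkingbox.dyndns.org:49495/in"
SESSION_ID = "3B9054BC032E53EF691A9A1803040F1C"  # id taken from the question


def question_url(question):
    """Build the request URL with the question percent-encoded."""
    # safe="" makes quote() encode every reserved character, including "/".
    return BASE + "?id=" + SESSION_ID + "&msg=" + quote(question, safe="")


print(question_url("How are you?"))
```

In the spider, start_urls could then be built from question_url(self.question). On the command line, a question with spaces also needs shell quoting, for example -a question="How are you?".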

Source

2014-01-15 05:32:59
