2012-07-11 133 views

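Scrapy with SQL or SQLite: can't get the desired output

I'm trying to extract the input fields from a website's pages, together with the URL of each page that contains them, and store both in a database. The code runs fine with no errors, but it doesn't produce the output I want.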

Spider code:

# Import paths below assume the 2012-era Scrapy API used in this post.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# IsaItem is the project's own Item class; import it from wherever the project defines it.


class MySpider(CrawlSpider):

    name = 'isa_spider'
    allowed_domains = ['testaspnet.vulnweb.com']
    start_urls = ['http://testaspnet.vulnweb.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/*',)), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsaItem()
        item['response_fld'] = response.url

        # For each input type, keep the first matching @id, or None if no field is found.
        res = hxs.select("//input[(@id or @name) and (@type = 'text')]/@id").extract()
        item['text_input'] = res[0] if res else None

        res = hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
        item['pass_input'] = res[0] if res else None

        res = hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
        item['file_input'] = res[0] if res else None

        return item
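One detail of the spider above that can affect the stored output: the XPath predicate accepts inputs that have either `@id` or `@name`, but the expression only returns `@id`, and `res[0]` keeps only the first match per type. The following is a minimal sketch of collecting every text, password and file input with a fallback to `@name`. It deliberately uses the standard library's `html.parser` instead of Scrapy's selectors so that it runs on its own; `InputCollector`, `collect_inputs` and the sample markup are hypothetical names and data introduced for illustration, not part of the original project.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect (type, identifier) pairs for text, password and file inputs."""

    WANTED_TYPES = {"text", "password", "file"}

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        # Fall back to the name attribute when the input has no id.
        identifier = attributes.get("id") or attributes.get("name")
        if input_type in self.WANTED_TYPES and identifier:
            self.inputs.append((input_type, identifier))


def collect_inputs(html):
    """Return every wanted input found in the given HTML string."""
    parser = InputCollector()
    parser.feed(html)
    return parser.inputs


# Hypothetical sample markup, not taken from the crawled site.
sample_html = (
    '<form>'
    '<input type="text" id="user">'
    '<input type="password" name="pw"/>'
    '<input type="submit" id="go">'
    '<input type="file" id="upload">'
    '</form>'
)
found_inputs = collect_inputs(sample_html)
```

In a Scrapy project the same idea would more likely be expressed by iterating over the selected `input` nodes and reading both attributes, rather than by swapping parsers.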

Pipeline code:

import sqlite3


class SQLiteStorePipeline(object):

    def __init__(self):
        self.conn = sqlite3.connect('./project.db')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Each input name goes into its own row; nothing ties it to the link row.
        self.cur.execute("insert into inputs (input_name) values(?)", (item['text_input'],))
        self.cur.execute("insert into inputs (input_name) values(?)", (item['pass_input'],))
        self.cur.execute("insert into inputs (input_name) values(?)", (item['file_input'],))
        self.cur.execute("insert into links (link) values(?)", (item['response_fld'],))
        self.conn.commit()
        return item

Database schema: picture

Desired output: picture
(Sorry, I can't insert the images directly because my reputation is below 10.)


@warwaruk 'Current result': [picture](https://docs.google.com/drawings/d/10ewTKAE1ryuf0-aGqysQMp2E2BkQZBi9YpxhWkaHyaA/edit) – 2012-07-11 12:19:04

Answers


Not tested:

import sqlite3


class SQLiteStorePipeline(object):

    def __init__(self):
        self.conn = sqlite3.connect('./project.db')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        cursor = self.cur

        # Placeholder: the original answer left this as "?"; determine the target id here.
        target_id = None
        cursor.execute("insert into links (target, link) values(?, ?)", (target_id, item['response_fld']))
        link_id = cursor.lastrowid  # id of the link row just inserted

        # input_type codes: 1 = text, 2 = password, 3 = file
        cursor.execute("insert into inputs (link_id, input_name, input_type) values(?, ?, ?)", (link_id, item['text_input'], 1))
        cursor.execute("insert into inputs (link_id, input_name, input_type) values(?, ?, ?)", (link_id, item['pass_input'], 2))
        cursor.execute("insert into inputs (link_id, input_name, input_type) values(?, ?, ?)", (link_id, item['file_input'], 3))

        self.conn.commit()
        return item  # pass the item on to any later pipeline
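The idea in the answer, inserting the link first and reusing `cursor.lastrowid` as the foreign key for its inputs, can be tried outside Scrapy with an in-memory database. This is only a sketch under stated assumptions: the real schema was shown as a picture, so the `CREATE TABLE` statements below are a guess built from the column names used in the answer; `store_item`, `INPUT_TYPES` and the sample item are hypothetical; and skipping fields whose value is `None` is an added design choice, not something the answer does.

```python
import sqlite3

# Assumed schema: only the column names used in the answer are known.
SCHEMA = """
CREATE TABLE links (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    target INTEGER,
    link TEXT
);
CREATE TABLE inputs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    link_id INTEGER REFERENCES links(id),
    input_name TEXT,
    input_type INTEGER
);
"""

# Item field -> input_type code, following the 1/2/3 codes in the answer.
INPUT_TYPES = {"text_input": 1, "pass_input": 2, "file_input": 3}


def store_item(conn, item, target_id=None):
    """Insert one link row plus one inputs row per non-empty input field."""
    cursor = conn.cursor()
    cursor.execute(
        "insert into links (target, link) values (?, ?)",
        (target_id, item["response_fld"]),
    )
    link_id = cursor.lastrowid  # primary key of the link row just inserted
    for field, type_code in INPUT_TYPES.items():
        name = item.get(field)
        if name is None:
            continue  # assumption: do not store rows for missing inputs
        cursor.execute(
            "insert into inputs (link_id, input_name, input_type) values (?, ?, ?)",
            (link_id, name, type_code),
        )
    conn.commit()
    return link_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical item, shaped like the one the spider returns.
sample_item = {
    "response_fld": "http://example.com/login",
    "text_input": "username",
    "pass_input": "password",
    "file_input": None,
}
store_item(conn, sample_item)

# Each input can now be joined back to the page it was found on.
rows = conn.execute(
    "select l.link, i.input_name, i.input_type "
    "from links l join inputs i on i.link_id = l.id order by i.id"
).fetchall()
```

Because every `inputs` row carries the `link_id` of its page, a join like the one at the end returns each input next to the URL it came from, which the original pipeline could not do.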

Thanks a lot, and sorry for the late reply; it was out of my hands because some assignments were due. I'll check your suggestion today... Thanks – 2012-07-15 20:44:07