I'm using a Scrapy crawl spider and trying to parse the fetched pages to pick out a few attributes of each input tag (type, id, name). Each input type should be captured into an item so it can later be stored in a database, along these lines (the goal is to extract the right values from the input tags; see the table below):
Database Table_1
╔════════════════╗
║      text      ║
╠══════╤═════════╣
║  id  │  name   ║
╟──────┼─────────╢
║      │         ║
╟──────┼─────────╢
║      │         ║
╚══════╧═════════╝
The same would apply to password and file inputs. The problem I'm facing is that the XPath extracts the whole tag!
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isa.items import IsaItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['testaspnet.vulnweb.com']
    start_urls = ['http://testaspnet.vulnweb.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/*')), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('%s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = IsaItem()
        # These select the <input> element nodes themselves, not their attributes
        text_input = hxs.select("//input[(@id or @name) and @type = 'text']").extract()
        pass_input = hxs.select("//input[(@id or @name) and @type = 'password']").extract()
        file_input = hxs.select("//input[(@id or @name) and @type = 'file']").extract()
        print text_input, pass_input, file_input
        return item
Output
[email protected]:~/isa/isa$ scrapy crawl example.com -L INFO -o file_nfffame.csv -t csv
2012-07-02 12:42:02+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: isa)
2012-07-02 12:42:02+0200 [example.com] INFO: Spider opened
2012-07-02 12:42:02+0200 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[] [] []
[] [] []
[] [] []
[u'<input name="tbUsername" type="text" id="tbUsername" class="Login">'] [u'<input name="tbPassword" type="password" id="tbPassword" class="Login">'] []
[] [] []
[u'<input name="tbUsername" type="text" id="tbUsername" class="Login">'] [u'<input name="tbPassword" type="password" id="tbPassword" class="Login">'] []
[] [] []
2012-07-02 12:42:08+0200 [example.com] INFO: Closing spider (finished)
What should the correct output look like? –
@stav For type text >> [id, name], and for password inputs [id, name]; concretely ["tbUsername", "tbUsername"] and ["tbPassword", "tbPassword"]. I know the values are duplicated, but that's because on this form id = name. –
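The selectors in the spider return the entire `<input>` tag because the XPath expression stops at the element node; appending `/@id` or `/@name` selects just the attribute value, e.g. `hxs.select("//input[(@id or @name) and @type = 'text']/@id").extract()`. Below is a minimal sketch of that idea using only the standard-library `xml.etree.ElementTree`, run against a hand-written fragment modelled on the tags visible in the crawl output above (the form markup here is an assumption, not the live page):

```python
import xml.etree.ElementTree as ET

# Assumed sample fragment, modelled on the tags seen in the crawl output.
html = """
<form>
  <input name="tbUsername" type="text" id="tbUsername" class="Login"/>
  <input name="tbPassword" type="password" id="tbPassword" class="Login"/>
</form>
"""

root = ET.fromstring(html)

def input_attrs(root, input_type):
    """Return (id, name) pairs for <input> tags of the given type,
    keeping only tags that carry at least one of the two attributes."""
    return [(el.get('id'), el.get('name'))
            for el in root.findall(".//input[@type='%s']" % input_type)
            if el.get('id') or el.get('name')]

print(input_attrs(root, 'text'))      # [('tbUsername', 'tbUsername')]
print(input_attrs(root, 'password'))  # [('tbPassword', 'tbPassword')]
print(input_attrs(root, 'file'))      # []
```

In the spider itself the same change is just two extra selects per input type (`.../@id` and `.../@name`), whose extracted lists can then be zipped into the item fields instead of storing the raw tag markup.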