2011-08-05 44 views
2

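PowerShell: screen scraping HTTP and returning specific lines as variables

I'm new to PowerShell and have reached the limit of my knowledge. I'm writing a script to scrape backup data from an internal web page, extract information from the scraped data to work with, and then display it in Excel.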

$Yesterday = [DateTime]::Now.AddDays(-1) 
$datestr = $Yesterday.ToString("dd-MMM-yyyy") 
$WebClient = New-Object System.Net.WebClient 
$Results = $WebClient.DownloadString("http://fakeurl") 

This produces a large amount of output containing the HTML markup along with the data I'm interested in, all bunched together. I then do this:

[StringSplitOptions]$option = "None" 
[string[]]$separator = "</td>" 
$SPL = $Results.Split($separator, $option) 

This splits the data into a more readable format. Below is a small section of the part of $SPL I'm interested in.

<tr><td headers="HOST_NAME" class="t13dataalt">server01 
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01 
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23 
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51 
<td headers="BACKUP_TYPE" class="t13dataalt">differential 
<td headers="SCHEDULE" class="t13dataalt">daily 
<td align="right" headers="SIZE_MB" class="t13dataalt">  2,091.18 
<td headers="IMAGES" class="t13dataalt">1 
<td headers="EXIT_STATUS" class="t13dataalt">0 
</tr><tr><td headers="HOST_NAME" class="t13data">server02 
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02 
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22 
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41 
<td headers="BACKUP_TYPE" class="t13data">differential 
<td headers="SCHEDULE" class="t13data">daily 
<td align="right" headers="SIZE_MB" class="t13data">  2,496.31 
<td headers="IMAGES" class="t13data">1 
<td headers="EXIT_STATUS" class="t13data">0 

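From this I need to extract the start and end times, work out the elapsed time, and also return the EXIT_STATUS of the most recent backup. I've tried the following, but I think I may be barking up the wrong tree: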

$Position = select-string -inputobject $SPL -pattern $datestr 

$Position.matches gives:

PS C:\Scripts> $Position.matches 

Groups : {03-Aug-2011} 
Success : True 
Captures : {03-Aug-2011} 
Index : 12056 
Length : 11 
Value : 03-Aug-2011 

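My theory was to add Index to Length and use a substring to extract the time value that follows the date, but I'm not sure how to do that. I also think this is a bit heavy-handed. There must be an easier way to return the information I need as variables, without having to pinpoint the spot and then rip out the rest?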
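The Index-plus-Length idea can be sketched as follows. This is a minimal illustration only: the sample string is made up from the excerpt above, it assumes the time always follows the date after a single space in HH:mm form, and it uses [regex]::Match on one known string rather than Select-String so that the index refers to that string.

```powershell
# Sample text standing in for part of the downloaded page (assumed, not real output)
$text    = '<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23</td>'
$datestr = '01-Aug-2011'

# Locate the date; Escape guards against regex metacharacters in the search text
$m = [regex]::Match($text, [regex]::Escape($datestr))

if ($m.Success) {
    # Index + Length is the position just past the date.
    # Skip the single space, then take the five characters of "HH:mm".
    $time = $text.Substring($m.Index + $m.Length + 1, 5)
    $time
}
```

This works for one occurrence, but it says nothing about which column the date came from, which is why parsing the headers attribute is usually the more robust route.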


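OK, since I'm not sure how to add a section like this at the bottom of the page, I'll add it here.

This is my current script. It runs without any errors, but it doesn't return any results.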

# Get yesterday's date and convert it to the required search format
$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")

# Scrape the webpage
$url = "http://fake-url"
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString($url)

# Determine if the previous day is listed in the backups
$IsDateThere = $Results.Contains($datestr)
If ($IsDateThere){
    # Split the data into rows
    [StringSplitOptions]$option = "None"
    [string[]]$separator = "</td>"
    $SPL = $Results.Split($separator, $option)

    # Strip the data into a hash table
    $SPL |
        Foreach-Object {
            where {$_ -match 'headers="(.*)" class.*>(.*)'} |
                ForEach-Object {
                    @{
                        $matches[1] = ($matches[2]).trim()
                    }
                }
        }
}
Else{
    Write-Host "Yesterday's date not found"
}

Any ideas? I'm not sure what to do next to get the start time and end time of the most recent backup, along with its exit code, as variables.
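One likely reason the script above prints nothing: where is called inside the Foreach-Object script block, where it starts a new pipeline and receives no input, so nothing ever reaches the inner ForEach-Object. A minimal sketch of a flat pipeline is shown below. The three sample cells are assumed data copied from the excerpt, standing in for the real $SPL.

```powershell
# Sample cells standing in for $SPL (assumed data, copied from the excerpt above)
$SPL = @(
    '<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23 ',
    '<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51 ',
    '<td headers="EXIT_STATUS" class="t13dataalt">0 '
)

# Filter and transform in one flat pipeline so every cell reaches Where-Object
$cells = $SPL |
    Where-Object { $_ -match 'headers="(.*)" class.*>(.*)' } |
    ForEach-Object { @{ $matches[1] = $matches[2].Trim() } }

# Each element is a single-entry hash table keyed by the column name
$cells.Count
$cells[0].START_TIME
```

With the sample data this yields three hash tables, the first holding the START_TIME value with its trailing whitespace trimmed.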

Answers

3

I'd approach it like this:

$html = @" 
<tr><td headers="HOST_NAME" class="t13dataalt">server01 
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01 
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23 
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51 
<td headers="BACKUP_TYPE" class="t13dataalt">differential 
<td headers="SCHEDULE" class="t13dataalt">daily 
<td align="right" headers="SIZE_MB" class="t13dataalt">  2,091.18 
<td headers="IMAGES" class="t13dataalt">1 
<td headers="EXIT_STATUS" class="t13dataalt">0 
</tr><tr><td headers="HOST_NAME" class="t13data">server02 
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02 
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22 
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41 
<td headers="BACKUP_TYPE" class="t13data">differential 
<td headers="SCHEDULE" class="t13data">daily 
<td align="right" headers="SIZE_MB" class="t13data">  2,496.31 
<td headers="IMAGES" class="t13data">1 
<td headers="EXIT_STATUS" class="t13data">0 
"@ 

# Split on CRLF or LF so this works whichever line endings the here-string has
$html -split '\r?\n' | where {$_ -match 'start_time|end_time'} |
    ForEach { 
     $pos = $_.IndexOf("headers") 
     $begin = $pos+9 
     $end = $_.IndexOf('"', $begin) 

     new-object PSObject -Property @{ 
      Key = $_.SubString($begin, $end-$begin) 
      Value = Get-Date($_.SubString($_.IndexOf(">")+1)) 
     } 
    } 

Result

Key        Value
---        -----
START_TIME 8/1/2011 9:23:00 PM
END_TIME   8/1/2011 9:51:00 PM
START_TIME 7/31/2011 9:22:00 PM
END_TIME   7/31/2011 9:41:00 PM
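Once the start and end values are DateTime objects, the elapsed time the question asks about is a plain subtraction. The sketch below assumes the dd-MMM-yyyy HH:mm format seen in the sample data and parses with the invariant culture so that the month abbreviation does not depend on the machine's locale. The variable names are illustrative.

```powershell
$format  = 'dd-MMM-yyyy HH:mm'
$culture = [System.Globalization.CultureInfo]::InvariantCulture

# Values taken from the first row of the sample data
$start = [datetime]::ParseExact('01-Aug-2011 21:23', $format, $culture)
$end   = [datetime]::ParseExact('01-Aug-2011 21:51', $format, $culture)

# Subtracting two DateTime values gives a TimeSpan
$elapsed = $end - $start
$elapsed.TotalMinutes
```

For the first sample row the difference is 28 minutes. ParseExact is used here instead of Get-Date because it fails loudly if the page ever changes its date format.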
1

This isn't an original answer, just an alternative version of Doug's that uses a regex to capture all of the data:

$html -split "`n" | where {$_ -match 'headers="(.*)" class.*>(.*)'} | 
    % { 
     @{ 
       $matches[1] = ($matches[2]).trim() 
      } 
    } 
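Because the pipeline above emits one single-entry hash table per cell, the cells still have to be grouped back into backup rows before the most recent backup can be picked. One possible sketch follows. It assumes every row begins with a HOST_NAME cell and that dates use the dd-MMM-yyyy HH:mm form seen in the sample. The names $cells, $rows and $latest are illustrative, and the sample cells are trimmed from the question's excerpt.

```powershell
# Sample cells for two backup rows (assumed data, trimmed from the question's excerpt)
$cells = @(
    '<tr><td headers="HOST_NAME" class="t13dataalt">server01',
    '<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23',
    '<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51',
    '<td headers="EXIT_STATUS" class="t13dataalt">0',
    '</tr><tr><td headers="HOST_NAME" class="t13data">server02',
    '<td headers="START_TIME" class="t13data">31-Jul-2011 21:22',
    '<td headers="END_TIME" class="t13data">31-Jul-2011 21:41',
    '<td headers="EXIT_STATUS" class="t13data">0'
)

$format  = 'dd-MMM-yyyy HH:mm'
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$rows = @()
$row  = $null

foreach ($cell in $cells) {
    if ($cell -notmatch 'headers="(.*)" class.*>(.*)') { continue }
    $key   = $matches[1]
    $value = $matches[2].Trim()

    # HOST_NAME is assumed to be the first cell of each row, so it starts a new record
    if ($key -eq 'HOST_NAME') {
        $row = @{}
        $rows += $row
    }
    if ($null -ne $row) { $row[$key] = $value }
}

# Pick the row with the latest start time
$latest = $rows |
    Sort-Object { [datetime]::ParseExact($_.START_TIME, $format, $culture) } -Descending |
    Select-Object -First 1

# The values the question asks for, as plain variables
$startTime  = [datetime]::ParseExact($latest.START_TIME, $format, $culture)
$endTime    = [datetime]::ParseExact($latest.END_TIME, $format, $culture)
$exitStatus = [int]$latest.EXIT_STATUS
```

Sorting on the parsed date rather than the raw string matters here: as text, "31-Jul-2011" would sort after "01-Aug-2011".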

Edit: using the code from the question:

$Yesterday = [DateTime]::Now.AddDays(-1) 
$datestr = $Yesterday.ToString("dd-MMM-yyyy") 
$WebClient = New-Object System.Net.WebClient 
$Results = $WebClient.DownloadString("http://fakeurl") 

[StringSplitOptions]$option = "None" 
[string[]]$separator = "</td>" 
$SPL = $Results.Split($separator, $option) 

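# Filter first, then build one hash table per matching cell.
# (where must receive the pipeline directly; nested inside Foreach-Object it gets no input.)
$SPL |
    where {$_ -match 'headers="(.*)" class.*>(.*)'} |
    % {
        @{
            $matches[1] = ($matches[2]).trim()
        }
    }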

Edit 2:

$SPL |
    where {$_ -match 'headers="(.*)" class.*>(.*)'} |
    % {
        # The cell holds a date followed by a time, so compare only the leading date
        if (($matches[2]).trim().StartsWith($datestr)) { "$($matches[1]) is yesterday's back up" }
    }
+0

Thanks for all the help. I'll test this today and let you know how I get on. Can I pass the $SPL variable into the hash table instead of the string above? @Doug Finke – jok5r

+0

Yes, that should work (I'll change the above to show what I believe will work) – Matt

+0

How can I expand on your answer underneath my original question? – jok5r
