2014-12-25 16 views

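In PowerShell, I need to extract the item name, item manufacturer, and item actual (the price) from the outerHTML below by parsing the relevant tags in the HTML.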

<DIV class=row> 
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A> 
    <DIV class=text-small>2 ml</DIV> 
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV> 
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV> 
    <DIV class="col-sm-2 col-xs-4 text-right"> 
    <DIV class=item-actual>Rs. 6</DIV> 
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI> 
    <LI class="list-item item js-drug"> 
    <DIV class=row> 
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A> 
    <DIV class=text-small>28 Tablets</DIV> 
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV> 
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV> 
    <DIV class="col-sm-2 col-xs-4 text-right"> 
    <DIV class=item-actual>Rs. 5.72</DIV> 
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI> 
    <LI class="list-item item js-drug"> 

The rendered output looks like this:

Spasmonil (20mg) - Cipla Limited - Rs. 6 
Sprintas (75mg) - Intas Laboratories Pvt - Rs. 5.72 

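I am doing this in a rather inefficient way: I write my four outputs (drugsname, drugsquan, drugspric, drugsmanu) to separate txt files and then merge them manually. Can someone help me do this in an elegant way?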

$regex1 = 'item-name.*?>(.*?)</A>' 
$regex2 = 'text-small>(.*?)</DIV>' 
$regex3 ='"item-manufacturer visible-xs">(.*?)</DIV>' 
$regex4 ='item-actual>(.*?)</DIV>' 

$drugsname = $ie.Document.body.outerHTML -split "`r`n" | 
    ForEach-Object{ 
    If($_ -match $regex1){ 
     $matches[1]  
    } 
    } 

$drugsquan = $ie.Document.body.outerHTML -split "`r`n" | 
    ForEach-Object{ 
    If($_ -match $regex2){ 
     $matches[1]  
    } 
    } 

$drugsmanu = $ie.Document.body.outerHTML -split "`r`n" | 
    ForEach-Object{ 
    If($_ -match $regex3){ 
     $matches[1]  
    } 
    } 

$drugspric = $ie.Document.body.outerHTML -split "`r`n" | 
    ForEach-Object{ 
    If($_ -match $regex4){ 
     $matches[1]  
    } 
    } 

$drugsname > "d:\users\desktop\HKD\($control)drugsname.txt" 
$drugsquan > "d:\users\desktop\HKD\($control)drugsquan.txt" 
$drugsmanu > "d:\users\desktop\HKD\($control)drugsmanu.txt" 
$drugspric > "d:\users\desktop\HKD\($control)drugspric.txt" 

Thanks for pointing that out. If we shouldn't use regular expressions to parse HTML, what should be used instead? – Yogesh

You can interpret the input as XML: [xml]$data = $contentFromWeb – TGlatzer
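
To illustrate the `[xml]` suggestion, here is a minimal sketch. It assumes the markup is well-formed XML (lowercase tags, quoted attributes, every tag closed). The IE outerHTML in the question is not (`class=row` is unquoted), so it would have to be cleaned up before the cast succeeds. The sample in `$contentFromWeb` is a hand-cleaned, hypothetical version of the question's data, and `$doc` and `$drugs` are illustrative names.

```powershell
# Hypothetical, hand-cleaned sample: lowercase tags, quoted attributes, every tag closed.
$contentFromWeb = @'
<ul>
  <li class="list-item item js-drug">
    <a class="item-name" href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</a>
    <div class="text-small">2 ml</div>
    <span class="item-manufacturer">Cipla Limited</span>
    <div class="item-actual">Rs. 6</div>
  </li>
  <li class="list-item item js-drug">
    <a class="item-name" href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</a>
    <div class="text-small">28 Tablets</div>
    <span class="item-manufacturer">Intas Laboratories Pvt Ltd</span>
    <div class="item-actual">Rs. 5.72</div>
  </li>
</ul>
'@

# The [xml] cast throws if the markup is not well-formed XML.
[xml]$doc = $contentFromWeb

# One object per <li>; XPath element names are case-sensitive.
$drugs = foreach ($li in $doc.SelectNodes('//li')) {
    [PSCustomObject]@{
        Name         = $li.SelectSingleNode('.//a[@class="item-name"]').InnerText
        Quantity     = $li.SelectSingleNode('.//div[@class="text-small"]').InnerText
        Manufacturer = $li.SelectSingleNode('.//span[@class="item-manufacturer"]').InnerText
        Price        = $li.SelectSingleNode('.//div[@class="item-actual"]').InnerText
    }
}

$drugs | Format-Table -AutoSize
```

The XPath lookups key on the class attributes rather than on line layout, so they keep working if whitespace in the page changes.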

Answer

Use a multi-line/single-line regex against the here-string below:

$data = 
@' 
<DIV class=row> 
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A> 
    <DIV class=text-small>2 ml</DIV> 
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV> 
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV> 
    <DIV class="col-sm-2 col-xs-4 text-right"> 
    <DIV class=item-actual>Rs. 6</DIV> 
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI> 
    <LI class="list-item item js-drug"> 
    <DIV class=row> 
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A> 
    <DIV class=text-small>28 Tablets</DIV> 
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV> 
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV> 
    <DIV class="col-sm-2 col-xs-4 text-right"> 
    <DIV class=item-actual>Rs. 5.72</DIV> 
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI> 
    <LI class="list-item item js-drug"> 
'@ 

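# Line breaks inside this here-string are part of the pattern.
# (?s) lets "." match newlines, so each lazy ".*?" / ".+?" can span lines.
[regex]$regex =
@'
(?ms).*?<DIV class=row>.*?
.+?item-name href=".+?>(.+?)</A>.*?
.+?text-small>(.+?)</DIV>.*?
.+?item-manufacturer.+?>(.+?)</DIV></DIV>.*?
.+?item-actual>(.+?)</DIV>
'@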

# Turn each match into one object with a property per capture group.
$regex.Matches($data) |
    ForEach-Object {
        [PSCustomObject]@{
            Name         = $_.Groups[1].Value
            Quantity     = $_.Groups[2].Value
            Manufacturer = $_.Groups[3].Value
            Price        = $_.Groups[4].Value
        }
    }

Name             Quantity   Manufacturer               Price
----             --------   ------------               -----
Spasmonil (20mg) 2 ml       Cipla Limited              Rs. 6
Sprintas (75mg)  28 Tablets Intas Laboratories Pvt Ltd Rs. 5.72

Now you have a collection of objects that you can sort, filter, format, and export to suit your needs.
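
As a rough sketch of what that can look like, the snippet below sorts, filters, formats, and exports such a collection. The two objects are rebuilt by hand so the example runs on its own. The variable names and the temp-folder CSV path are assumptions for illustration, not part of the original answer.

```powershell
# Sample objects shaped like the ones the regex pipeline emits.
$drugs = @(
    [PSCustomObject]@{ Name = 'Spasmonil (20mg)'; Quantity = '2 ml';       Manufacturer = 'Cipla Limited';              Price = 'Rs. 6' }
    [PSCustomObject]@{ Name = 'Sprintas (75mg)';  Quantity = '28 Tablets'; Manufacturer = 'Intas Laboratories Pvt Ltd'; Price = 'Rs. 5.72' }
)

# Sort by the numeric part of the price (strip the leading "Rs. " first).
$sorted = $drugs | Sort-Object { [decimal]($_.Price -replace '^Rs\.\s*', '') }

# Filter on a property.
$cipla = @($drugs | Where-Object Manufacturer -like 'Cipla*')

# Format as the "Name - Manufacturer - Price" lines asked for in the question.
$lines = $sorted | ForEach-Object { '{0} - {1} - {2}' -f $_.Name, $_.Manufacturer, $_.Price }
$lines

# Export everything to a single CSV instead of four separate text files.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'drugs.csv'
$sorted | Export-Csv -Path $csvPath -NoTypeInformation
```

A single CSV keeps each drug's fields on one row, which removes the manual merge step described in the question.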

That's really clever, so I deleted my answer. You've got my vote. –

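Thanks. Sorry for being naive, but I need something like this: $data = @'get-content $ie.Document.body.outerHTML'. All of the HTML data above reaches my program through $ie.Document.body.outerHTML. Can you tell me how to change the "$data =" part? – Yogesh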

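@MickyBalladelli: Thanks! Yogesh: the regex processes all of the data (multiple lines) as one single multi-line string. It should consume $ie.Document.body.outerHTML without modification; just use that in place of $data. If the HTML is in a file, use Get-Content -Raw to read the file as a single multi-line string. – mjolinor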
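
The following sketch shows the file-based route from the comment above. It writes a shortened copy of the sample HTML to a temp file (a hypothetical path), reads it back with `Get-Content -Raw`, and runs a regex over the resulting single string. The pattern in `$pattern` is a simplified variant written for this illustration, not the answer's exact regex: it does not depend on where the line breaks fall. When the HTML comes from the browser, the `Get-Content` line would presumably become `$data = $ie.Document.body.outerHTML`.

```powershell
# Hypothetical temp file standing in for HTML saved to disk.
$samplePath = Join-Path ([System.IO.Path]::GetTempPath()) 'drugs-sample.html'
@'
<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
    <DIV class=text-small>2 ml</DIV>
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
    <DIV class=item-actual>Rs. 6</DIV>
<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
    <DIV class=text-small>28 Tablets</DIV>
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
    <DIV class=item-actual>Rs. 5.72</DIV>
'@ | Set-Content -Path $samplePath

# -Raw returns the whole file as ONE string instead of an array of lines.
$data = Get-Content -Path $samplePath -Raw

# (?s) lets "." match newlines; each lazy ".*?" skips ahead to the next field.
$pattern = '(?s)class=item-name[^>]*>(.+?)</A>.*?text-small>(.+?)</DIV>.*?item-manufacturer[^>]*>(.+?)</DIV>.*?item-actual>(.+?)</DIV>'

$drugs = foreach ($m in [regex]::Matches($data, $pattern)) {
    [PSCustomObject]@{
        Name         = $m.Groups[1].Value
        Quantity     = $m.Groups[2].Value
        Manufacturer = $m.Groups[3].Value
        Price        = $m.Groups[4].Value
    }
}

$drugs | Format-Table -AutoSize
```

Without `-Raw`, `Get-Content` yields one string per line, and a pattern that spans several lines would never match.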