2014-12-13 52 views
-2

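Match a pattern, save it in a variable, and append it to the end of the line using sed/awk/grep

This is a problem I have been trying to solve for the past 4 days. I have read tutorials found through Google and on SO, but none of them helped. I am posting it here as a question so that others can try to help me solve it. I have already solved it with a crude approach, but I wonder whether there is a smarter way. I have a file that lists ball bearings and their properties. It looks like this: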

<li class="odd first"> 
    <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003030&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=1">33030</a> 
    |<strong>Product category: </strong> <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003030&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=1&amp;isTableView=true" class="product-table-link">Tapered roller bearings single row</a> 

     |<strong>Width: </strong> 59 mm 
     |<strong>Bore diameter: </strong> 150 mm 
     |<strong>Outside diameter: </strong> 225 mm 
     |<strong>Source: </strong> - 


     |<strong>Limiting speed: </strong> 2600 r/min 
     |<strong>Reference speed: </strong> 2000 r/min 


</li> 
<li class="even "> 
    <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310000230&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=2">30230</a> 
    |<strong>Product category: </strong> <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310000230&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=2&amp;isTableView=true" class="product-table-link">Tapered roller bearings single row</a> 

     |<strong>Width: </strong> 49 mm 
     |<strong>Bore diameter: </strong> 150 mm 
     |<strong>Outside diameter: </strong> 270 mm 
     |<strong>Source: </strong> - 


     |<strong>Limiting speed: </strong> 2400 r/min 
     |<strong>Reference speed: </strong> 1800 r/min 


</li> 
<li class="odd "> 
    <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003024&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=3">33024</a> 
    |<strong>Product category: </strong> <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003024&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=3&amp;isTableView=true" class="product-table-link">Tapered roller bearings single row</a> 

     |<strong>Width: </strong> 48 mm 
     |<strong>Bore diameter: </strong> 120 mm 
     |<strong>Outside diameter: </strong> 180 mm 
     |<strong>Source: </strong> - 


     |<strong>Limiting speed: </strong> 3400 r/min 
     |<strong>Reference speed: </strong> 2600 r/min 


</li> 
<li class="even "> 
    <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003022&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=4">33022</a> 
    |<strong>Product category: </strong> <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003022&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=4&amp;isTableView=true" class="product-table-link">Tapered roller bearings single row</a> 

     |<strong>Width: </strong> 47 mm 
     |<strong>Bore diameter: </strong> 110 mm 
     |<strong>Outside diameter: </strong> 170 mm 
     |<strong>Source: </strong> - 


     |<strong>Limiting speed: </strong> 3600 r/min 
     |<strong>Reference speed: </strong> 2600 r/min 


</li> 
<li class="odd "> 
    <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003220&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=5">33220</a> 
    |<strong>Product category: </strong> <a href="/productcatalogue/prodlink.html?lang=en&amp;imperial=false&amp;prodid=1310003220&amp;pubid=21&amp;WT.oss=&amp;WT.z_oss_boost=0&amp;WT.z_oss_ref=ProductSearch&amp;WT.z_oss_rank=5&amp;isTableView=true" class="product-table-link">Tapered roller bearings single row</a> 

     |<strong>Width: </strong> 63 mm 
     |<strong>Bore diameter: </strong> 100 mm 
     |<strong>Outside diameter: </strong> 180 mm 
     |<strong>Source: </strong> - 


     |<strong>Limiting speed: </strong> 3600 r/min 
     |<strong>Reference speed: </strong> 2400 r/min    
</li> 

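Now look at how this HTML is rendered (not at the HTML source itself). I want to parse it and extract a parameter from the href link. In the first entry, for example, the href link contains the prodid parameter, prodid=1310003030. If possible, I would also like to append the whole link at the end of each line.

I want to extract that value and append it to the end of each line, so that the entries look like this: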

33030 |Product category: Tapered roller bearings single row |Width: 59 mm |Bore diameter: 150 mm |Outside diameter: 225 mm |Source: - |Limiting speed: 2600 r/min |Reference speed: 2000 r/min | 1310003030 
30230 |Product category: Tapered roller bearings single row |Width: 49 mm |Bore diameter: 150 mm |Outside diameter: 270 mm |Source: - |Limiting speed: 2400 r/min |Reference speed: 1800 r/min | 1310000230 
33024 |Product category: Tapered roller bearings single row |Width: 48 mm |Bore diameter: 120 mm |Outside diameter: 180 mm |Source: - |Limiting speed: 3400 r/min |Reference speed: 2600 r/min | 1310003024 
33022 |Product category: Tapered roller bearings single row |Width: 47 mm |Bore diameter: 110 mm |Outside diameter: 170 mm |Source: - |Limiting speed: 3600 r/min |Reference speed: 2600 r/min | 1310003022 
+0

Scraping HTML into machine-readable output is an exercise in futility. See if you can get in touch with whoever is producing the HTML. – tripleee 2014-12-13 13:22:42

+0

Well, I want to scrape some data from a website for my analysis, so I have no control over the source. – 2014-12-13 14:27:34

+1

Save yourself time by using the right tool: try `xmllint` for this job. – BMW 2014-12-13 21:55:50

Answers

0

Here is a sed version. I have to admit that swapping the order of words that sit on different lines is not easy with sed:

sed -nre ' 
/^ *<a/{ 
    h;s/^.*prodid=([0-9]+).*$/ |\1/;x;s_^.*>([0-9]+)</a.*$_\1_ 
    :back 
    N 
    s/\n.*(Product category:).*\">(.*)<.*$/ |\1 \2/ 
    s_\n.*strong>(.*)</strong>(.*)$_ |\1 \2_ 
    /<\/li> *$/ !bback
    /<\/li> *$/ {
     s/<\/li> *$//;G;s/\n//g;s/ +/ /g;p
    } 
} 
' file 
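
The trick this script relies on is the hold space: the prodid is parked there with `h` when the anchor line is read, and glued back on with `G` once the rest of the entry has been collected. A minimal sketch of just that idea, on made-up two-line input (the `id=`/`name=` format and the helper name `append_id` are assumptions for illustration, not part of the original data):

```shell
# Hypothetical helper: reads "id=..." followed by "name=..." and prints "name |id",
# i.e. the value seen first ends up last on the output line.
append_id() {
    sed -n '
    /^id=/ {
        # strip the key, park the bare value in the hold space, start the next cycle
        s/^id=//
        h
        d
    }
    /^name=/ {
        # strip the key, append newline + hold space, then turn the newline into " |"
        s/^name=//
        G
        s/\n/ |/
        p
    }'
}

printf '%s\n' 'id=1310003030' 'name=33030' | append_id
```

With the two input lines shown, this prints `33030 |1310003030`.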
0

The UNIX tool for general-purpose text manipulation is awk:

$ cat tst.awk 
BEGIN { 
    FS = "[[:space:]]*<[^>]+>[[:space:]]*" 
    OFS = " |" 
} 

/^[[:space:]]*<a href/{ 
    split($0,a,/.*prodid=|&.*/) 
    prodid = a[2] 
    prodnr = $(NF-1) 
} 

/<strong>/ { 
    name = $2 
    value = ($NF == "" ? $(NF-1) : $NF) 
    sub(/[[:space:]]+$/,"",value) 
    n2v[name] = value 
    if (!seen[name]++) { 
     names[++numNames] = name 
    } 
} 

/<\/li>/ { 
    printf "%s%s", prodnr, OFS 
    for (nameNr=1; nameNr<=numNames; nameNr++) { 
     name = names[nameNr] 
     value = n2v[name] 
     printf "%s %s%s", name, value, OFS 
    } 
    print " " prodid 
} 

$ awk -f tst.awk file 
33030 |Product category: Tapered roller bearings single row |Width: 59 mm |Bore diameter: 150 mm |Outside diameter: 225 mm |Source: - |Limiting speed: 2600 r/min |Reference speed: 2000 r/min | 1310003030 
30230 |Product category: Tapered roller bearings single row |Width: 49 mm |Bore diameter: 150 mm |Outside diameter: 270 mm |Source: - |Limiting speed: 2400 r/min |Reference speed: 1800 r/min | 1310000230 
33024 |Product category: Tapered roller bearings single row |Width: 48 mm |Bore diameter: 120 mm |Outside diameter: 180 mm |Source: - |Limiting speed: 3400 r/min |Reference speed: 2600 r/min | 1310003024 
33022 |Product category: Tapered roller bearings single row |Width: 47 mm |Bore diameter: 110 mm |Outside diameter: 170 mm |Source: - |Limiting speed: 3600 r/min |Reference speed: 2600 r/min | 1310003022 
33220 |Product category: Tapered roller bearings single row |Width: 63 mm |Bore diameter: 100 mm |Outside diameter: 180 mm |Source: - |Limiting speed: 3600 r/min |Reference speed: 2400 r/min | 1310003220 
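
The question also asks whether the whole link could be appended instead of only the prodid. A possible sketch of that part, assuming the link is the first double-quoted string on the anchor line and that `&amp;` should be decoded to `&` (the helper name `extract_links` and the shortened query string in the sample are assumptions for illustration):

```shell
# Hypothetical helper: prints the href of every product anchor line, one per line.
extract_links() {
    awk '
    /^[[:space:]]*<a href/ {
        link = $0
        sub(/^[^"]*"/, "", link)    # drop everything up to the opening quote
        sub(/".*$/, "", link)       # drop the closing quote and the rest
        gsub(/&amp;/, "\\&", link)  # decode &amp; to a literal ampersand
        print link
    }'
}

extract_links <<'EOF'
<li class="odd first">
    <a href="/productcatalogue/prodlink.html?lang=en&amp;prodid=1310003030&amp;pubid=21">33030</a>
    |<strong>Width: </strong> 59 mm
</li>
EOF
```

For this sample it prints `/productcatalogue/prodlink.html?lang=en&prodid=1310003030&pubid=21`. The same three `sub`/`gsub` lines could be placed in the `<a href` block of tst.awk, with the final `print` emitting `link` in place of `prodid`.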