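Scraping and extracting data from a table with XPath

I have crawled Wiki pages for cities and need to extract the country each city belongs to. I try to find the <th> that contains the word "country", then go back up to the <tr> and read the value from its <td>, but several cases cause problems.

(My code, which works for the first case:)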

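# Select the infobox table(s) on the page
a = doc.xpath("//table[contains(@class, 'infobox')]")
# Find <th> cells whose own text contains "Country" (the leading // searches the whole document, not only a[0])
b = a[0].xpath("//table//th[contains(text(),'Country') or contains(text(),'country')]")
# Go up to the <tr>, then take the first link text inside its <td>
country = b[0].xpath("./../td//a//text()")[0].replace(" ", "_")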

I know why it does not work for the other cases, but I don't know how to fix it.

The keyword "Country" is directly in the <th>:

<tr class="mergedtoprow">
    <th scope="row">Country</th>
    <td>
        <a href="/wiki/Poland" title="Poland">Poland</a>
    </td>
</tr>

The keyword "country" is in an <a> inside a <span> inside the <th>:


<tr class="mergedrow">
    <th scope="row">
        <span class="nowrap">
            <a href="/wiki/Countries_of_the_United_Kingdom" title="Countries of the United Kingdom">Constituent country
            </a>
        </span>
    </th>
    <td>
        <span class="flagicon"><img alt="" src="SRC (never mind)" width="23" height="14" class="thumbborder" srcset="SRC (never mind)" />&#160;
        </span>
        <a href="/wiki/England" title="England">England</a>
    </td>
</tr>

The keyword "Country" is in an <a> inside the <th>:

<tr class="mergedrow">
    <th scope="row">
        <a href="/wiki/Countries_of_the_United_Kingdom" title="Countries of the United Kingdom">Country
        </a>
    </th>
    <td>England</td>
</tr>

Source

2017-05-27 Paz

"Wiki pages"? If you mean Wikipedia, why don't you use Wikidata? –

It's a university assignment – Paz

Sure, I thought it was a bad question :) – Paz

You can use the XPath below to match the required th element in all of the cases mentioned:

//th[matches(normalize-space(), "country", "i")]

Note that the "i" flag makes the search case-insensitive, so both "Country" and "country" will match.
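The `doc.xpath(...)` calls in the question suggest lxml, which implements XPath 1.0 only, so `matches()` is not available there. A minimal sketch, assuming lxml, of a comparable case-insensitive test using the EXSLT regular-expressions extension that lxml exposes; the sample markup and the `EXSLT_RE` name are made up for illustration:

```python
from lxml import html

# Namespace URI of the EXSLT regular-expressions extension supported by lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample markup, not taken from a real page.
doc = html.fromstring(
    "<html><body><table class='infobox'>"
    "<tr><th>Constituent country</th><td>England</td></tr>"
    "<tr><th>COUNTRY</th><td>Poland</td></tr>"
    "<tr><th>Population</th><td>1</td></tr>"
    "</table></body></html>"
)

# re:test(string, pattern, flags) plays the role of XPath 2.0 matches();
# the 'i' flag makes the pattern case-insensitive.
headers = doc.xpath(
    "//th[re:test(normalize-space(.), 'country', 'i')]",
    namespaces=EXSLT_RE,
)
labels = [th.text_content() for th in headers]
print(labels)
```

The "Population" row is skipped, while both spellings of the country label are selected.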

If your tool supports only XPath 1.0, you can use

//th[contains(.,'Country') or contains(.,'country')]
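The XPath 1.0 expression above works where `contains(text(), ...)` does not because `.` is the string value of the `<th>`, which includes the text of nested `<span>` and `<a>` elements. A minimal sketch, assuming lxml, that applies it to the three rows from the question and reads the whole sibling `<td>` instead of requiring an `<a>` inside it; `extract_country`, `make_doc` and `CASES` are hypothetical names, and the rows are wrapped in a made-up infobox table:

```python
from lxml import html

# "." is the string value of <th>: the text of the cell and all its descendants.
TH_XPATH = ".//th[contains(., 'Country') or contains(., 'country')]"


def extract_country(doc):
    """Return the country from the first infobox table, or None if not found."""
    infoboxes = doc.xpath("//table[contains(@class, 'infobox')]")
    if not infoboxes:
        return None
    headers = infoboxes[0].xpath(TH_XPATH)
    if not headers:
        return None
    # Full text of the next <td>, whether or not the value is wrapped in <a>.
    text = headers[0].xpath("string(following-sibling::td[1])")
    # str.split() also drops the non-breaking space left by the flag icon.
    value = " ".join(text.split())
    return value.replace(" ", "_") or None


def make_doc(row):
    """Wrap one <tr> in a minimal page with an infobox table (illustrative)."""
    page = "<html><body><table class='infobox'>" + row + "</table></body></html>"
    return html.fromstring(page)


CASES = {
    "text in th": '<tr><th scope="row">Country</th>'
                  '<td><a href="/wiki/Poland" title="Poland">Poland</a></td></tr>',
    "a in span in th": '<tr><th scope="row"><span class="nowrap">'
                       '<a href="/wiki/Countries_of_the_United_Kingdom">'
                       'Constituent country\n</a></span></th>'
                       '<td><span class="flagicon"><img alt="" '
                       'src="SRC (never mind)" />&#160;</span> '
                       '<a href="/wiki/England" title="England">England</a></td></tr>',
    "a in th": '<tr><th scope="row">'
               '<a href="/wiki/Countries_of_the_United_Kingdom">Country\n</a></th>'
               '<td>England</td></tr>',
}

for name, row in CASES.items():
    print(name, "->", extract_country(make_doc(row)))
```

Taking the string value of the `<td>` is a design choice for this sketch: it covers the third case, where the cell holds plain text, at the cost of also picking up any extra text a real infobox cell might contain.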

Source

2017-05-27 10:44:42 Andersson
