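I am trying to parse a website to extract people's names and countries. How can I get both following-sibling::text() and following-sibling::b?
Sometimes the page looks like this: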
<th>Inventors:</th>
<td align="left" width="90%">
<b>Harvey; John Christopher</b> (New York, NY)<b>, Cuddihy; James William</b> (New York, NY)
</td>
I can get the countries using:
//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::text()
[(New York, NY), (New York, NY)]
And sometimes the page looks like this (with <b> added around the country name):
<th>Inventors:</th>
<td align="left" width="90%">
<b>Harvey; John Christopher</b> (New York, <b>NY</b>)<b>, Cuddihy; James William</b> (New York, <b>NY</b>)
</td>
In that case I can get the countries using:
//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::b
[NY, NY]
Now I would like a single expression that gets the countries in both cases.
I tried:
//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::*[self::text() or self::b]
But then I only get the <b> elements, not the text nodes.
I also tried:
//.../following-sibling::text() | //.../following-sibling::b
But again I only get the <b> elements.
Any idea why this does not work as expected? Is there a way to get both kinds of entries?
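
For reference, here is a minimal sketch of the result I am after, written with Python's standard library instead of XPath. One thing that may be relevant: in XPath 1.0 `*` matches element nodes only, so `self::text()` can never be true under `following-sibling::*`, whereas `following-sibling::node()[self::text() or self::b]` would consider text nodes too. The sketch assumes the `<td>` fragment is well-formed XML (real pages may not be), and `inventor_locations`, `PLAIN` and `NESTED` are names I made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two layouts from the question, reduced to the <td> cell.
PLAIN = (
    '<td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, NY)'
    '<b>, Cuddihy; James William</b> (New York, NY)'
    '</td>'
)
NESTED = (
    '<td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, <b>NY</b>)'
    '<b>, Cuddihy; James William</b> (New York, <b>NY</b>)'
    '</td>'
)


def inventor_locations(td_markup):
    """Return the location text that follows each inventor <b> element."""
    td = ET.fromstring(td_markup)
    locations = []
    for child in td:
        text = child.text or ""
        tail = child.tail or ""  # text node that follows this element
        if child.tag == "b" and ";" in text:
            # An inventor name: start a new location with the text after it.
            locations.append(tail)
        elif locations:
            # Any other element (e.g. <b>NY</b>): fold its text and the
            # text after it into the current location.
            locations[-1] += text + tail
    return [loc.strip() for loc in locations]


print(inventor_locations(PLAIN))
print(inventor_locations(NESTED))
```

Both layouts should come out the same way, with each location reassembled from the text nodes and the inner `<b>` elements that follow an inventor name.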