Retrieving text from XML by tag with ElementTree

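I currently have some code that uses Biopython and NCBI's "Entrez" API to fetch XML strings from PubMed Central. I am trying to parse the XML with ElementTree to get the text from the page. I have BeautifulSoup code that does exactly this when I scrape the lxml data from the website itself, but I am switching to the NCBI API because scraping is apparently not an option. Now that I am working with the XML from the NCBI API, I find ElementTree very unintuitive and could really use some help getting it to work. I have of course looked at other posts, but most of them deal with namespaces, whereas in my case I just want to use the XML tags to get the information. Even the ElementTree documentation doesn't cover this (as far as I can tell). Can anyone help me figure out the syntax for getting the information inside certain tags, rather than within a particular namespace?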

Here is an example. Note: I am using Python 3.4.

A small snippet of the XML:

 <sec sec-type="materials|methods" id="s5"> 
     <title>Materials and Methods</title> 
     <sec id="s5a"> 
     <title>Overgo design</title> 
     <p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50&#x2013;60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p> 
    <table-wrap id="pone-0114631-t002" orientation="portrait" position="float"> 
     <object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id> 
     <label>Table 2</label> 
     <caption> 
     <title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title> 
     </caption> 
     <alternatives> 
     <graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/> 
     <table frame="hsides" rules="groups"> 
      <colgroup span="1"> 
      <col align="left" span="1"/> 
      <col align="center" span="1"/> 
      </colgroup>

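For my project, I want all of the text inside the "p" tags (not just for this snippet of XML, but for the entire XML string).

I already know that I can convert the whole XML string into an ElementTree object: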

>>> import xml.etree.ElementTree as ET 
>>> tree = ET.ElementTree(ET.fromstring(xml_string)) 
>>> root = ET.fromstring(xml_string)

If I try to get the text by tag like this:

>>> text = root.find('p') 
>>> print("".join(text.itertext()))

or

>>> text = root.get('p').text

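I can't extract the text I want. From what I've read, this is because I'm passing the tag "p" as the argument rather than a namespace.

Getting all the text inside the "p" tags of an XML file feels like it should be very simple, but I currently can't manage it. Please let me know what I'm missing and how I can fix it. Thanks!

--- Edit ---

So now I know that I should use this code to get the content of the "p" tags: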

>>> text = root.find('.//p') 
>>> print("".join(text.itertext()))

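Even though I'm using itertext(), it only returns the content of the first "p" tag and doesn't look at any of the others. Does itertext() only iterate within a single tag? The documentation seems to suggest that it iterates over all tags, so I'm not sure why it returns only one line rather than all the text under all the "p" tags.

--- Final edit ---

I figured it out: itertext() only works within a single tag, and find() only returns the first matching item. To get everything I want, I have to use findall():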

>>> all_text = root.findall('.//p') 
>>> for texts in all_text: 
    print("".join(texts.itertext()))
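
For anyone who wants to try this without a real PMC response, here is a minimal self-contained sketch. The `sample_xml` string is made up for illustration and is not actual NCBI output; the variable names are also just placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the PMC XML; not the real API response.
sample_xml = (
    '<article>'
    '<sec><title>Overgo design</title>'
    '<p>Four overgo pairs were designed (<xref rid="t2">Table 2</xref>).</p>'
    '<sec><p>Nested <italic>second</italic> paragraph.</p></sec>'
    '</sec>'
    '</article>'
)

root = ET.fromstring(sample_xml)

# findall('.//p') returns every <p> descendant of root. itertext() then walks
# the text of one <p>, including text inside child tags such as <xref>.
paragraphs = ["".join(p.itertext()) for p in root.findall('.//p')]

for paragraph in paragraphs:
    print(paragraph)
```

Collecting the joined strings in a list first, rather than printing inside the loop, makes it easier to pass the paragraph text on to whatever processing comes next.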

Source

2016-05-31 SnarkShark

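Using root.get() is the wrong approach entirely, because it retrieves an attribute of the root tag, not a child tag. root.find() is correct, because it finds the first matching child tag (or you can use root.findall() to get all matching child tags).

If you want to find not only direct children but also indirect descendants (as in your example), the expression passed to root.find/root.findall has to use the supported subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':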
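
A small sketch of the difference between the two calls, using a made-up element (the XML string and variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

# Made-up XML for illustration only.
root = ET.fromstring(
    '<sec id="s5"><title>Methods</title><p>First</p><p>Second</p></sec>'
)

# get() looks up an attribute on the element itself.
sec_id = root.get('id')
missing = root.get('p')   # None: 'p' is a child tag, not an attribute

# find() returns the first matching direct child; findall() returns all of them.
first_p = root.find('p').text
all_p = [p.text for p in root.findall('p')]

print(sec_id, missing, first_p, all_p)
```

Because get() returns None for a missing attribute, the original `root.get('p').text` fails with an AttributeError rather than returning any paragraph text.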

text = root.find('.//p') 
print("".join(text.itertext()))
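
To see why the plain 'p' path finds nothing in a nested document, here is a short sketch with a made-up XML string in which the paragraph is a grandchild of the root:

```python
import xml.etree.ElementTree as ET

# Made-up XML: the <p> is a grandchild of the root, not a direct child.
root = ET.fromstring('<article><sec><p>Deep text</p></sec></article>')

direct = root.find('p')      # None, because 'p' only matches direct children
nested = root.find('.//p')   # matches <p> at any depth below root

print(direct)
print("".join(nested.itertext()))
```

Note that find() still returns only the first match; root.findall('.//p') is the call to use when every paragraph is needed.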

Source

2016-05-31 20:59:56 mrh1997

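Good to know, thanks! I'm not familiar enough with these aspects of XML, so your feedback is really helpful. When I run your code, my terminal only prints one line of text from the "p" tags. From what I've gathered, itertext() should prevent that. Any idea what's going on? – SnarkShark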
