Python Beautiful Soup: most efficient way to find tags

I am parsing many large XML files with Python and BeautifulSoup. I frequently run into structures like the following:

<Section1> 
    <Report> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
    </Report> 
</Section1>

I want to collect all of the Matrix elements and iterate over them. I use code like this:

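from urllib.request import urlopen  # Python 3 import; assumed, not shown in the original
from bs4 import BeautifulSoup

res = urlopen(url)  # url is defined elsewhere
html = res.read()
soup = BeautifulSoup(html, 'xml')
matrices = soup.find("Section1").find_all("Matrix")
# Then I handle each matrix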

Why can't I use a selector like this instead?

matrices = soup.find("Section1 Matrix")

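Is there a faster way to do this? Sometimes I am accessing nodes nested more deeply in the XML, and I need to make sure they are descendants, but not necessarily direct children, of several other nodes. The example above is a simplification. Any help would be appreciated.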

Source

2015-09-14 klib

Have you tried lxml? It should improve performance a lot. – giaosudau
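A minimal sketch of what the comment above points at, assuming lxml is installed: parse the XML directly and collect descendants with the path `.//Matrix`. The sample string and the helper name `find_descendants` are hypothetical, and the fallback import is only there so the sketch also runs on the standard library, which exposes a compatible subset of the same API. No timing was measured here, so any speed-up is the commenter's claim, not a result.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard
# library, which supports the same calls used in this sketch.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample mirroring the structure in the question.
XML_DOC = """<Section1>
    <Report>
        <Matrix>a</Matrix>
        <Matrix>b</Matrix>
    </Report>
</Section1>
"""


def find_descendants(root, tag):
    """Return every element named `tag` at any depth below `root`."""
    # ".//tag" matches descendants, not only direct children.
    return root.findall(".//" + tag)


root = etree.fromstring(XML_DOC)  # root is the <Section1> element
matrices = find_descendants(root, "Matrix")
for matrix in matrices:
    print(matrix.tag, matrix.text)
```

Because `root` is already the Section1 element, every match is guaranteed to be a descendant of it, however deeply it is nested.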

BeautifulSoup "supports CSS selectors"; you need to pass your selector to the .select method:

In [1]: from bs4 import BeautifulSoup as BS 

In [2]: soup = BS("""<Section1> 
    ...:  <Report> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:  </Report> 
    ...: </Section1>""", "xml") 

In [3]: soup.select("Section1 Matrix") 
Out[3]: 
[<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>]
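To make the difference between find and select concrete, here is a small sketch. It assumes bs4 with lxml installed (the "xml" parser needs it), and it uses a hypothetical variation of the question's document in which one Matrix sits directly under Section1, so that the descendant and child combinators give different results.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two Matrix elements nested under Report,
# one Matrix directly under Section1.
XML_DOC = """<Section1>
    <Report>
        <Matrix>a</Matrix>
        <Matrix>b</Matrix>
    </Report>
    <Matrix>c</Matrix>
</Section1>"""

soup = BeautifulSoup(XML_DOC, "xml")  # the "xml" parser requires lxml

# find() treats its argument as a tag name. No tag is literally named
# "Section1 Matrix", so nothing matches and the result is None.
not_found = soup.find("Section1 Matrix")

# Descendant combinator (space): Matrix at any depth under Section1.
descendants = soup.select("Section1 Matrix")

# Child combinator (>): only Matrix elements directly under Section1.
direct_children = soup.select("Section1 > Matrix")

# The chained find()/find_all() from the question selects the same
# elements as the descendant selector.
chained = soup.find("Section1").find_all("Matrix")

print(not_found)
print([m.get_text() for m in descendants])
print([m.get_text() for m in direct_children])
```

In this sample only the last Matrix is a direct child of Section1, so the child selector returns a single element while the descendant selector returns all three.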

If all you want is every Matrix node in your document, you can use the CSSSelector class from lxml.cssselect.

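In [1]: from lxml.cssselect import CSSSelector 

In [2]: sel = CSSSelector('Matrix')  # assumed definition; the session as posted never defines sel

In [3]: from lxml.etree import fromstring 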

In [4]: xml_doc = '''<Section1> 
    ...:  <Report> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:  </Report> 
    ...: </Section1>''' 

In [5]: tree = fromstring(xml_doc) 

In [6]: matrix = [el for el in sel(tree)] 

In [7]: matrix 
Out[7]: 
[<Element Matrix at 0x7f84b5b8f388>, 
<Element Matrix at 0x7f84b5b8fc48>, 
<Element Matrix at 0x7f84b5b8fd88>, 
<Element Matrix at 0x7f84b5b8fdc8>]

If cssselect is not already installed, you need to install it with pip: pip install cssselect

Source

2015-09-14 07:01:39 styvane
