2015-09-14 285 views

Python BeautifulSoup: most efficient way to find tags

I use Python and BeautifulSoup to parse many large XML files. I often run into the following task:

<Section1> 
    <Report> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
     <Matrix>...</Matrix> 
    </Report> 
</Section1> 

I want to collect and iterate over all the Matrix elements. I use code like this:

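from urllib.request import urlopen
from bs4 import BeautifulSoup

res = urlopen(url)  # url is assumed to point at the XML document
html = res.read()
soup = BeautifulSoup(html, 'xml')
matrices = soup.find("Section1").find_all("Matrix")
# Then I handle each matrix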

Why can't I use a selector like this?

matrices = soup.find("Section1 Matrix") 

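Is there a faster way to do this? Sometimes I need to reach nodes nested more deeply in the XML, and I have to make sure they are descendants, though not necessarily direct children, of several other nodes. The example given is a simplification. Any help would be appreciated.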

Have you tried lxml? It will improve performance a lot. – giaosudau

1 Answer

BeautifulSoup "supports CSS selectors", but you need to pass your selector to the .select method rather than to .find:

In [1]: from bs4 import BeautifulSoup as BS 

In [2]: soup = BS("""<Section1> 
    ...:  <Report> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:  </Report> 
    ...: </Section1>""", "xml") 

In [3]: soup.select("Section1 Matrix") 
Out[3]: 
[<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>, 
<Matrix>...</Matrix>] 
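
The following is a minimal sketch, not part of the original answer, of why the find call in the question matches nothing and how the descendant and child combinators differ in select. It assumes bs4 and the lxml parser are installed; the sample document is a shortened version of the one in the question, and the element contents "a" and "b" are placeholders added for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample document; "a" and "b" are illustrative placeholders
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
</Section1>"""

soup = BeautifulSoup(xml_doc, "xml")

# find() treats its argument as a literal tag name, not as a selector,
# so it looks for a tag called "Section1 Matrix" and finds none
no_match = soup.find("Section1 Matrix")

# A space is the descendant combinator: Matrix at any depth below Section1
descendants = soup.select("Section1 Matrix")

# ">" is the child combinator: only direct children of Section1
direct_children = soup.select("Section1 > Matrix")

print(no_match, len(descendants), len(direct_children))  # None 2 0
```

The descendant combinator is the one that fits the requirement in the question (descendants, but not necessarily direct children).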

If all you want is to get every Matrix node in your document, you can use CSSSelector from lxml.cssselect:

In [1]: from lxml.cssselect import CSSSelector 

In [2]: sel = CSSSelector("Matrix") 

In [3]: from lxml.etree import fromstring 

In [4]: xml_doc = '''<Section1> 
    ...:  <Report> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:   <Matrix>...</Matrix> 
    ...:  </Report> 
    ...: </Section1>''' 

In [5]: tree = fromstring(xml_doc) 

In [6]: matrix = [el for el in sel(tree)] 

In [7]: matrix 
Out[7]: 
[<Element Matrix at 0x7f84b5b8f388>, 
<Element Matrix at 0x7f84b5b8fc48>, 
<Element Matrix at 0x7f84b5b8fd88>, 
<Element Matrix at 0x7f84b5b8fdc8>] 

If cssselect is not already installed, you need to install it with pip: pip install cssselect
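
If installing cssselect is not an option, the same descendant lookup can probably be done with the element API alone. The sketch below is an alternative that is not part of the original answer: it uses iter() and findall() with a path expression. The fallback import of the standard library's xml.etree.ElementTree is an assumption added so the snippet also runs without lxml, and the element contents "a" and "b" are placeholders.

```python
try:
    from lxml import etree  # C-based implementation, if installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible API

# Shortened sample document; "a" and "b" are illustrative placeholders
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
</Section1>"""

root = etree.fromstring(xml_doc)  # root is the Section1 element

# iter() walks root and all of its descendants, keeping those tagged Matrix
by_iter = list(root.iter("Matrix"))

# ".//Matrix" selects Matrix descendants at any depth below root
by_path = root.findall(".//Matrix")

# "Matrix" alone only considers direct children of root, so nothing matches
direct_children = root.findall("Matrix")

print(len(by_iter), len(by_path), len(direct_children))  # 2 2 0
```

To restrict the search to a particular ancestor in a larger document, the same calls can be made on that ancestor element instead of on the root.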