2012-11-05 115 views
9

Python / BeautifulSoup: find all <a href> with specific anchor text

I am trying to use Beautiful Soup to parse HTML and find every href whose link has a specific anchor text.

<a href="http://example.com">TEXT</a> 
<a href="http://example.com/link">TEXT</a> 
<a href="http://example.com/page">TEXT</a> 

All the links I am looking for have exactly the same anchor text, in this case TEXT. I am NOT searching for the word TEXT itself; I want to use it to find all the different href values.

Edit:

For clarification, I am looking for something similar to using the class to select the links:

<a href="http://example.com" class="visible">TEXT</a> 
<a href="http://example.com/link" class="visible">TEXT</a> 
<a href="http://example.com/page" class="visible">TEXT</a> 

and then using

findAll('a', 'visible') 

except that the HTML I am parsing doesn't have a class on the links, only the same anchor text every time.

Answers

24

Would something like this work?

In [39]: from bs4 import BeautifulSoup

In [40]: s = """\
    ....: <a href="http://example.com">TEXT</a>
    ....: <a href="http://example.com/link">TEXT</a>
    ....: <a href="http://example.com/page">TEXT</a>
    ....: <a href="http://dontmatchme.com/page">WRONGTEXT</a>"""

In [41]: soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly

In [42]: for link in soup.findAll('a', href=True, text='TEXT'):
    ....:     print(link['href'])  # only <a> tags with an href and the text TEXT
    ....:
    ....:
http://example.com
http://example.com/link
http://example.com/page
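
For readers on Python 3, the same filter can be wrapped in a small function. This is only a sketch: the helper name `hrefs_with_anchor_text` and the `SAMPLE_HTML` constant are made up for illustration, and it assumes Beautiful Soup 4.4 or later, where `string=` is the newer spelling of the `text=` argument used above.

```python
from bs4 import BeautifulSoup


def hrefs_with_anchor_text(html, anchor_text):
    """Return the href of every <a> whose text is exactly anchor_text.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute;
    # string=... compares each tag's .string with anchor_text exactly.
    links = soup.find_all("a", href=True, string=anchor_text)
    return [link["href"] for link in links]


# Same sample markup as in the session above.
SAMPLE_HTML = """\
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>
<a href="http://dontmatchme.com/page">WRONGTEXT</a>"""

for href in hrefs_with_anchor_text(SAMPLE_HTML, "TEXT"):
    print(href)
```

Because the match is an exact string comparison, the WRONGTEXT link is skipped even though it contains the word TEXT.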

I am trying to find a quicker way. To me this takes a little longer to process, since it finds ALL hrefs and then compares each one to the text to find a match. Preferably I could parse directly for the links required, something like how you can do findAll('a', 'className') when the link has a class. – cwal


@cwal Oh, gotcha (my bad, long day :)). Try the updated version, which builds the check into the filter. Does that do what you want? This will load them as a generator as opposed to loading all of them, so I believe this is the fastest you will get (there has to be some way up front for BS to check whether a link fits your criteria). Happy to help think through another way if this doesn't work. – RocketDonkey
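
One caveat on the speed discussion, offered as a sketch and not as a benchmark: in bs4, `find_all` returns a `ResultSet`, which is a `list` subclass, so the matches are collected eagerly and not produced lazily. If parsing cost is the concern, one option that Beautiful Soup does provide is `SoupStrainer` with `parse_only`, which builds a tree containing only the `<a>` elements. The sample markup and the extra `<p>` element below are assumptions added for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Illustrative markup: the <p> stands in for unrelated page content.
SAMPLE_HTML = """\
<p>Unrelated paragraph</p>
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>
<a href="http://dontmatchme.com/page">WRONGTEXT</a>"""

# Ask the parser to keep only <a> elements while building the tree.
only_links = SoupStrainer("a")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_links)

# Same filter as in the answer, using the newer string= spelling of text=.
matches = soup.find_all("a", href=True, string="TEXT")
hrefs = [link["href"] for link in matches]

# ResultSet subclasses list, so the matches are already materialized here.
print(isinstance(matches, list))
print(hrefs)
```

Whether this is noticeably faster depends on the document and the parser, so it is worth timing on real pages before relying on it.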


This certainly looks like it works! I had tried this, but without href=True, and it didn't seem to work. Unfortunately I don't have time right now to check whether it works for me, but I will as soon as possible and post back my results. Thank you! – cwal