BeautifulSoup can't find text that exists in the HTML file

I have an HTML file that looks like this:

<form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post"> 
<div> 
<a href="/2811457/follow?page=2&amp;gsid=3_5bce9b871484d3af90c89f37">next_page</a> 
&nbsp;<input name="mp" type="hidden" value="3" /> 
<input type="text" name="page" size="2" style='-wap-input-format: "*N"' /> 
<input type="submit" value="jump" />&nbsp;1/3 
</div> 
</form>

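How can I extract the "1/3" from the file?

This is only part of the HTML; I trimmed it to keep the question clear. I'm new to BeautifulSoup and have read the documentation, but I'm still confused.

One thing I tried:

total_urls_num = re.findall('\d+/\d+',response)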

The code I'm working with:

from BeautifulSoup import BeautifulSoup 
import re 

with open("html.txt","r") as f: 
    response = f.read() 
    print response 
    soup = BeautifulSoup(response) 
    delete_urls = soup.findAll('a', href=re.compile(r'follow\?page'))  # works; the "?" must be escaped
    print delete_urls 
    #total_urls_num = re.findall('\d+/\d+',response) 
    total_urls_num = soup.find('input',type='submit') 
    print total_urls_num

2012-06-17 young001

Nit: (.*\d/\d*) should use '\d', not '/d' – JBernardo

But when I change it, it still doesn't work; it returns None – young001

How about soup.find('input', value='jump').next? –

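I think the problem is that the text you're searching for isn't inside an attribute of some tag; it comes after the tag. You can reach it with .next: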

In [144]: soup.find("input", type="submit") 
Out[144]: <input type="submit" value="jump" /> 

In [145]: soup.find("input", type="submit").next 
Out[145]: u'&nbsp;1/3\n'

and then you can pull the 1/3 out of it however you like:

In [146]: re.findall('\d+/\d+', _) 
Out[146]: [u'1/3']

Or simply like this:

In [153]: soup.findAll("input", type="submit", text=re.compile("\d+/\d+")) 
Out[153]: [u'&nbsp;1/3\n']
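
For readers on Python 3, the same idea can be sketched with BeautifulSoup 4. This is only a rough equivalent of the session above, not the code from the question: it assumes the bs4 package is installed, uses the built-in "html.parser" backend, and uses next_sibling where the BeautifulSoup 3 session used .next. The sample HTML is copied from the question.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (the bs4 package) is installed

html = '''<form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post">
<div>
<a href="/2811457/follow?page=2&amp;gsid=3_5bce9b871484d3af90c89f37">next_page</a>
&nbsp;<input name="mp" type="hidden" value="3" />
<input type="text" name="page" size="2" style='-wap-input-format: "*N"' />
<input type="submit" value="jump" />&nbsp;1/3
</div>
</form>'''

soup = BeautifulSoup(html, "html.parser")

# "1/3" is a text node that follows the submit button; it is not an attribute.
submit = soup.find("input", type="submit")
trailing_text = submit.next_sibling  # the text node right after the <input> tag

# Pull the "digits/digits" part out of the surrounding whitespace.
match = re.search(r"\d+/\d+", trailing_text)
page_info = match.group() if match else None
print(page_info)
```

For the sample above, page_info should end up holding the string 1/3; if the markup differs (for example, another tag sits between the button and the text), next_sibling may return something else and the match would be None.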

2012-06-17 03:08:07 DSM

That's exactly what I wanted, thx DSM. I should read more of the BeautifulSoup documentation. – young001

Read this document.

Instead of

total_urls_num = soup.find('input',style='submit') #can't work

you should use type rather than style:

>>> temp = soup.find('input', type='submit').next
>>> temp
u'&nbsp;1/3\n'
>>> re.findall('\d+/\d+', temp)
[u'1/3']
>>> re.findall('\d+/\d+', temp)[0]
u'1/3'
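
If the two numbers are needed separately, a small helper can split them. This is a sketch using only the standard library; parse_page_counter is a hypothetical name, and reading "1/3" as "current page / total pages" is an assumption based on the next_page link and the jump button in the question.

```python
import re


def parse_page_counter(raw_text):
    """Return (current, total) as ints from text containing 'N/M', or None if absent."""
    match = re.search(r"(\d+)/(\d+)", raw_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The string below is the text node shown in the session above.
current_page, total_pages = parse_page_counter("&nbsp;1/3\n")
print(current_page, total_pages)
```

Returning None when nothing matches keeps the caller from indexing into an empty findall result.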

2012-06-17 03:12:51 shihongzhi

Yes, thx, I made a silly mistake – young001
