Web scraper is grabbing the text and the <span>text</span>. The span text is not wanted

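Basically, I am trying to scrape a table in Python with BeautifulSoup.

I have managed to scrape all the data from the other list items, but for some reason, when I add .text, it prints both the text and the text inside the span tag. The span text is not wanted.

I have tried .string and .text.text, but neither seems to work.

Can anyone spot the problem here?

Here is my code: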

import urllib2
import MySQLdb
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.livefootballontv.com/').read())

# Open one connection for the whole run instead of one per row.
db = MySQLdb.connect("127.0.0.1", "root", "", "footballapp")
cursor = db.cursor()
sql = "INSERT INTO TVGuide(DATE, FIXTURE, COMPETITION, KICKOFF, CHANNELS) VALUES (%s,%s,%s,%s,%s)"

for row in soup('div', {'id': 'tv-guide'})[0]('ul'):
    tds = row('li')
    # tds[1].text is the problem: it returns the fixture AND the nested <span> text.
    print tds[0].string, tds[1].text, tds[1].span.string, tds[2].string, tds[3].img['alt'], '\n'
    results = (str(tds[0].string), str(tds[1].text), str(tds[1].span.string), str(tds[2].string), str(tds[3].img['alt']))
    try:
        cursor.execute(sql, results)
        db.commit()
    except MySQLdb.Error:
        # Roll back only when the insert fails.
        db.rollback()

db.close()

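Then I get:

Sunday 22nd June 2014 USA vs PortugalBrasil 2014 World Cup Group G Brasil 2014 World Cup Group G 11:00pm BBC1

Tuesday 24th June 2014 Costa Rica vs EnglandBrasil 2014 World Cup Group D Brasil 2014 World Cup Group D 5:00pm ITV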


2014-02-11 Thomas

Possible duplicate of [Only extracting text from this element, not its children](http://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children)

Use `contents` and access the entry you want.

Example:

from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('http://www.livefootballontv.com/').read())

for row in soup('div', {'id': 'tv-guide'})[0]('ul'):
    tds = row('li')
    # contents[0] is the first child node: the fixture text that precedes the <span>.
    print tds[1].contents[0]

Output:

SV Hamburg vs Bayern Munich 
Arsenal vs Manchester United 
Napoli vs Roma 
... 
USA vs Portugal 
Costa Rica vs England
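To see why `.text` and `.string` behave as described in the question while `contents[0]` does not, here is a minimal offline sketch. It is written for Python 3 and uses a hand-written HTML snippet that only imitates one row of the guide; the real page's markup is an assumption and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one row of the TV guide (not fetched from the site).
html = """
<div id="tv-guide">
  <ul>
    <li>Sunday 22nd June 2014</li>
    <li>USA vs Portugal<span>Brasil 2014 World Cup Group G</span></li>
    <li>11:00pm</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
fixture_li = soup.find("div", id="tv-guide").find("ul").find_all("li")[1]

# .text concatenates the text of the <li> and of every descendant tag.
full_text = fixture_li.text

# .string is None here because the <li> has more than one child node.
only_string = fixture_li.string

# contents is the list of direct children; index 0 is the bare text node.
fixture = str(fixture_li.contents[0]).strip()

# The nested <span> can still be read on its own.
competition = fixture_li.span.string

print(repr(full_text))    # 'USA vs PortugalBrasil 2014 World Cup Group G'
print(repr(only_string))  # None
print(repr(fixture))      # 'USA vs Portugal'
print(repr(competition))  # 'Brasil 2014 World Cup Group G'
```

Note that `contents[0]` relies on the fixture text being the first child of the `<li>`. If the markup ever puts the `<span>` first, the index would have to change.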


2014-02-12 01:03:52

I found a [duplicate question](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children), by the way. You can also use `find(text=True, recursive=False)` –

The first one works perfectly, thank you very much :) Top geezer – Thomas
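The `find(text=True, recursive=False)` alternative mentioned in the comments can be sketched as follows. This is a Python 3 illustration on a hand-written snippet, not the live page. It assumes Beautiful Soup 4.4.0 or later, where the argument is spelled `string` (older versions call it `text`). The helper name `own_text` is hypothetical and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


# Hypothetical markup imitating the fixture cell of one row.
html = "<li>Costa Rica vs England<span>Brasil 2014 World Cup Group D</span></li>"
li = BeautifulSoup(html, "html.parser").li

# recursive=False restricts the search to direct children, so the text
# inside the <span> is never considered.
first_text = li.find(string=True, recursive=False)

print(first_text)    # Costa Rica vs England
print(own_text(li))  # Costa Rica vs England
```

Unlike `contents[0]`, this does not depend on the text node sitting at a particular index, which may make it a little more tolerant of markup changes.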
