2015-03-08 36 views
1

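Trying to isolate one column with Beautiful Soup

I want to isolate the Location column and eventually write it out to a database file. My code is below: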

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts" 
response = urllib2.urlopen(url) 
html = response.read() 
soup = BeautifulSoup(html) 

trs = soup.find_all('td') 

for tr in trs: 
    for link in tr.find_all('a'): 
        fulllink = link.get('href')

tds = tr.find_all("tr") 
location = str(tds[3].get_text()) 



print location 

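But I always get one of two results: either a "list index out of range" error, or the script exits with code 0 and prints nothing. I'm not confident with BeautifulSoup yet, since I'm still learning it, so any help is appreciated. Thanks!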

Answers

2

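There is a simpler way to find the Location column. Use the table.wikitable tr CSS selector, find all the td elements in each row, and take the 4th td by index.

Also, if a single cell contains multiple locations, you need to handle each one separately: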

import urllib2 
from bs4 import BeautifulSoup 


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts" 
soup = BeautifulSoup(urllib2.urlopen(url)) 

for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    if cells:
        for text in cells[3].find_all(text=True):
            text = text.strip()
            if text:
                print text

Prints:

Afghanistan 
Nigeria 
Cameroon 
Niger 
Chad 
... 
Iran 
Nigeria 
Mozambique 
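
As a self-contained illustration of the same idea, here is a minimal sketch that runs against a small inline table instead of the live page. It is written for Python 3 (an assumption; the code above targets Python 2), and the sample markup, the extract_locations helper, and the column parameter are hypothetical names introduced only for this example. It also skips rows that have fewer cells than expected, which is one way a "list index out of range" error can be avoided.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Start</th><th>Conflict</th><th>Continent</th><th>Location</th></tr>
  <tr><td>1978</td><td>War A</td><td>Asia</td>
      <td><a href="/wiki/Afghanistan">Afghanistan</a></td></tr>
  <tr><td>2009</td><td>War B</td><td>Africa</td>
      <td><a href="/wiki/Nigeria">Nigeria</a><br/><a href="/wiki/Cameroon">Cameroon</a></td></tr>
  <tr><td>row with too few cells</td></tr>
</table>
"""


def extract_locations(html, column=3):
    """Return every non-empty text fragment found in the given td column."""
    soup = BeautifulSoup(html, "html.parser")
    locations = []
    for row in soup.select("table.wikitable tr"):
        cells = row.find_all("td")
        # Header rows contain only th elements, and some rows may be short,
        # so skip any row that does not reach the requested column.
        if len(cells) <= column:
            continue
        # stripped_strings yields each text fragment with surrounding
        # whitespace removed, so multiple locations in one cell come out
        # as separate items.
        locations.extend(cells[column].stripped_strings)
    return locations


print(extract_locations(SAMPLE_HTML))
```

Returning a list rather than printing inside the loop should make it easier to pass the values on to whatever storage step comes next.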
0

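You just need to swap the td and tr tags in your code. Also be careful with the str() function: your web page can contain unicode strings, and those cannot be converted with a plain ASCII string conversion. Your code should be: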

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts" 
response = urllib2.urlopen(url) 
html = response.read() 
soup = BeautifulSoup(html) 

trs = soup.find_all('tr')  # 'tr' instead of 'td'

for tr in trs:
    for link in tr.find_all('a'):
        fulllink = link.get('href')
        tds = tr.find_all("td")  # 'td' instead of 'tr'
        location = tds[3].get_text()  # str() call removed
        print location

and voilà!!
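
To illustrate the str() point: in Python 2, calling str() on a unicode string implicitly encodes it with the ASCII codec, which fails on any non-ASCII character. The sketch below reproduces that failure explicitly in Python 3 (an assumption; the code above is Python 2) using only the standard library. The sample text and the to_ascii_or_none helper are hypothetical and exist only for this example.

```python
# Hypothetical cell text containing a non-ASCII character.
location = "Côte d'Ivoire"


def to_ascii_or_none(text):
    """Attempt the ASCII conversion that Python 2's str() performs implicitly."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # This is the error Python 2 raises for str(u"Côte d'Ivoire").
        return None


# Plain ASCII text converts without trouble.
print(to_ascii_or_none("Chad"))

# Non-ASCII text cannot be represented in ASCII, so the helper returns None.
print(to_ascii_or_none(location))

# Encoding explicitly as UTF-8 keeps every character intact.
utf8_bytes = location.encode("utf-8")
print(utf8_bytes.decode("utf-8"))
```

Keeping the text as unicode and encoding it explicitly only when writing it out is generally the safer choice.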