Trying to isolate one column with Beautiful Soup

I want to isolate the Location column and eventually write it out to a database file. My code is below:

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts" 
response = urllib2.urlopen(url) 
html = response.read() 
soup = BeautifulSoup(html) 

trs = soup.find_all('td') 

for tr in trs: 
    for link in tr.find_all('a'): 
        fulllink = link.get('href')

tds = tr.find_all("tr") 
location = str(tds[3].get_text()) 



print location

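But I always get one of two results: either a "list index out of range" error or exit code 0. I'm not sure about BeautifulSoup, since I'm still learning it, so any help is appreciated. Thanks!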


2015-03-08 Gaddi

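There is an easier way to find the Location column. Use the table.wikitable tr CSS selector, find all the td elements in each row, and take the 4th td by index.

Also, if a single cell contains multiple locations, you need to handle each of them separately: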

import urllib2
from bs4 import BeautifulSoup


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))

for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    # Header rows contain only <th> cells, so skip rows without any <td>
    if cells:
        # A cell may list several locations; print each text node separately
        for text in cells[3].find_all(text=True):
            text = text.strip()
            if text:
                print text

Prints:

Afghanistan 
Nigeria 
Cameroon 
Niger 
Chad 
... 
Iran 
Nigeria 
Mozambique
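
For readers on Python 3, the same row-by-row approach can be tried offline. The following is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the Wikipedia table, not its real content, and extract_locations is a hypothetical helper name. It also shows where the "list index out of range" error comes from: header rows contain no td cells, so indexing them without a check fails.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Start</th><th>Conflict</th><th>Continent</th><th>Location</th></tr>
  <tr><td>1978</td><td>War A</td><td>Asia</td>
      <td><a href="/wiki/Afghanistan">Afghanistan</a></td></tr>
  <tr><td>2009</td><td>War B</td><td>Africa</td>
      <td><a href="/wiki/Nigeria">Nigeria</a><br/><a href="/wiki/Cameroon">Cameroon</a></td></tr>
</table>
"""


def extract_locations(html, column=3):
    """Return the text fragments found in the given column of every data row."""
    soup = BeautifulSoup(html, "html.parser")
    locations = []
    for row in soup.select("table.wikitable tr"):
        cells = row.find_all("td")
        # Header rows hold only <th> cells, so `cells` is empty there.
        # Indexing without this check raises "list index out of range".
        if len(cells) <= column:
            continue
        # stripped_strings yields each non-empty text node, whitespace-trimmed,
        # so a cell with several locations produces several entries.
        locations.extend(cells[column].stripped_strings)
    return locations


print(extract_locations(SAMPLE_HTML))
```

With the sample table above this prints ['Afghanistan', 'Nigeria', 'Cameroon']. Passing the HTML in as a string keeps the parsing logic separate from the download step, so it can be checked without network access.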


2015-03-08 23:03:23 alecxe

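You just need to swap the td and tr tags in your code. Also be careful with the str() function: your web page may contain unicode strings, which cannot be converted with a plain ASCII string conversion. Your code should be: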

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts" 
response = urllib2.urlopen(url) 
html = response.read() 
soup = BeautifulSoup(html) 

trs = soup.find_all('tr')  # 'tr' instead of 'td'

for tr in trs:
    for link in tr.find_all('a'):
        fulllink = link.get('href')
        tds = tr.find_all("td")  # 'td' instead of 'tr'
        location = tds[3].get_text()  # str() call removed
        print location

and voilà!!
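
Since the question's end goal is a database file, here is a rough Python 3 sketch of that last step. It assumes SQLite through the standard sqlite3 module, because the question does not say which database is meant. The locations table, the name column, and the save_locations helper are all hypothetical names. Text returned by get_text() can be passed in as-is, without any str() conversion, since SQLite stores unicode text directly.

```python
import sqlite3


def save_locations(locations, db_path="locations.db"):
    """Insert location names into a SQLite database and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # `with conn` commits on success and rolls back if an insert fails.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS locations (name TEXT NOT NULL)"
            )
            # Parameterized inserts keep non-ASCII names intact and avoid
            # building SQL strings by hand.
            conn.executemany(
                "INSERT INTO locations (name) VALUES (?)",
                [(name,) for name in locations],
            )
        return conn.execute("SELECT COUNT(*) FROM locations").fetchone()[0]
    finally:
        conn.close()


# ":memory:" keeps this demo from writing a file; use a real path to persist.
print(save_locations(["Afghanistan", "Côte d'Ivoire"], ":memory:"))
```

The demo call stores two names and prints 2. Replacing ":memory:" with a file path such as the default locations.db produces an actual database file on disk.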


2015-03-08 22:59:22 MajorTom
