2013-02-26 80 views

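How should I parse this with Beautiful Soup 4 when it does not recognize td.font.unwrap()? Should I replace the tag, or unwrap it? How do I get the text inside <font> using Beautiful Soup 4?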

<td align="CENTER">
    <font size="+2">横</font>(F
    <font size="+2">橫</font>)
</td>

I just want to get the string 横(F橫). What I get right now is: 横(F)

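I can reach the <td> cell just fine, but I can't get the last character. This is how I read it right now: y = cols[1].text

cols holds the <td> cells of a <tr> row, and this cell is the second one in the row.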
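For reference, here is a minimal sketch of reading that cell when the markup is already a Unicode string. It assumes Python 3 with bs4 installed and uses the standard-library html.parser backend; the names html, cell_text and unwrapped_text are only illustrative. It suggests that .text / get_text() already collects the text inside nested <font> tags, so unwrap() should not be needed just to read the string.

```python
from bs4 import BeautifulSoup

# The cell from the question, written with escapes so the source encoding
# of this file does not matter: \u6a2a is 横, \u6a6b is 橫.
html = (
    '<td align="CENTER">\n'
    '    <font size="+2">\u6a2a</font>(F\n'
    '    <font size="+2">\u6a6b</font>)\n'
    '</td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.td

# get_text() joins the text of all descendants, including the <font> tags.
# Splitting on whitespace and re-joining drops the newlines and indentation.
cell_text = "".join(td.get_text().split())
print(cell_text)

# unwrap() is a method of a single tag, so it is called per <font> element.
# It removes the tag but keeps its contents, so the text stays the same.
for font in td.find_all("font"):
    font.unwrap()
unwrapped_text = "".join(td.get_text().split())
```

If both characters come through here but not from the live page, the difference is probably in how the downloaded bytes are decoded rather than in the tag handling.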

Full code below:

# coding: utf8 
from pysqlite2 import dbapi2 as sqlite3 
import urllib2 
from bs4 import BeautifulSoup 
from string import * 


conn = sqlite3.connect(':memory:') 
cursor = conn.cursor() 

# # create a table 
def createTable(): 
    cursor.execute("""CREATE TABLE characters 
         (rank INTEGER, word TEXT, definition TEXT) 
        """) 


def insertChar(rank,word,definition): 
    cursor.execute("""INSERT INTO characters (rank,word,definition) 
         VALUES (?,?,?)""",(rank,word,definition)) 


def main(): 
    createTable() 

    # u = unicode("辣", "utf-8") 

    # insertChar(1,u,"123123123") 

    # content = "\n".join(response.readlines()[1:]) 
    soup = BeautifulSoup(urllib2.urlopen('http://www.zein.se/patrick/3000char.html').read()) 

    # print (html_doc.prettify()) 

    tables = soup.blockquote.table 

    # print tables 



    rows = tables.find_all('tr')[1:]
    result = []
    for tr in rows:
        # print tr
        cols = tr.find_all('td')
        character = []
        # col = cols.fonts.unwrap()
        # x = int (cols[0].string)
        x = 0
        y = cols[1].text
        # chars = y.find_all('font')

        z = "11"
        print y
        # y = cols[1].string
        # z = cols[2].string

        # xx = unicode(x, "utf-8")
        # yy = unicode(y , "utf-8")
        # zz = unicode(z , "utf-8")
        insertChar(x, y, z)

    conn.commit() 

main() 

I appreciate your help! Thanks.


Please post more code. I get u'\n\u6a2a(F\n\u6a6b)\n' from td.text just fine. – 2013-02-26 00:17:57


@Pavel Anossov Thank you for your help :) – user805981 2013-02-26 00:19:22

Answers


The site claims to be encoded in gb2312, but it isn't. This should fix it:

import urllib2
import bs4

url = 'http://www.zein.se/patrick/3000char.html'
soup = bs4.BeautifulSoup(urllib2.urlopen(url).read(), from_encoding='gb18030')

or just

soup = bs4.BeautifulSoup(urllib2.urlopen(url).read(), from_encoding='gbk') 

Your browser figured it out, but BeautifulSoup needs a hint.
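To illustrate why the declared encoding matters, here is a small standard-library sketch. It assumes Python 3; the helper encodable and the variable names are made up for the example. The traditional character 橫 (U+6A6B) is outside the gb2312 repertoire but is covered by gbk and gb18030, so bytes containing it cannot be decoded strictly as gb2312.

```python
simplified = "\u6a2a"   # 横, part of GB2312
traditional = "\u6a6b"  # 橫, not part of GB2312


def encodable(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Compare the three codecs mentioned in this thread.
for codec in ("gb2312", "gbk", "gb18030"):
    print(codec, encodable(simplified, codec), encodable(traditional, codec))

# Bytes as a GBK-encoded page would contain them.
raw = "\u6a2a(F\u6a6b)".encode("gbk")

# A strict gb2312 decode rejects the bytes of the traditional character.
try:
    raw.decode("gb2312")
    strict_gb2312_ok = True
except UnicodeDecodeError:
    strict_gb2312_ok = False

# Decoding with gbk recovers the full string.
decoded = raw.decode("gbk")
```

How a parser reacts to such undecodable bytes (dropping or replacing them) depends on the parser and version, which is presumably why the last character went missing until from_encoding was given.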


gbk worked beautifully :) You, sir, are a genius! Cheers! – user805981 2013-02-26 00:29:22