2015-06-22 97 views

Cleaning text with Beautiful Soup

OK, so I am processing an HTML file with Beautiful Soup, and I have done the following:

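from urllib.request import urlopen  # Python 3 import; Python 2 would use urllib2
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/" + "Category:American_football"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")
# Grab the whole subcategory block and flatten it to text
pages = soup.find("div", {"id": "mw-subcategories"})
cleaned = pages.get_text()
cleaned = cleaned.encode("utf-8")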

And my output looks like this:

"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n" 

I am trying to figure out how to strip out everything except the actual text names, i.e. remove fragments such as:

\xe2\x80\x8e (6 C, 218 P)\n\n\n 

Are there any tricks for getting rid of these with the Beautiful Soup library, or how else should I clean up the text further?

What do you mean by "the actual text names"? What is the desired output?

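I just want a list of the subcategory names, like this: American football by city, American football by continent, American football by country, etc., without the formatting characters. – jdv12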

Then you should navigate to them. Select each 'a' with 'class="CategoryTreeLabel"', then get the text of each one.

Answer

Navigate to the 'a' elements you want.

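import bs4
soup = bs4.BeautifulSoup(raw, "html.parser")  # raw: the page HTML fetched in the question
# Each subcategory name is the text of an <a class="CategoryTreeLabel"> link
for cat in soup.find_all("a", {"class": "CategoryTreeLabel"}):
    print(cat.text)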

Output:

American football by city 
American football by continent 
American football by country 
American football-related lists 
American football occupations 
American football competitions 
American football equipment 
History of American football 
American football incidents 
American football media 
American football organisations 
American football people 
American football plays 
American football positions 
American football records and statistics 
Seasons in American football 
Semi-professional American football 
American football strategy 
American football teams 
American football terminology 
American football trophies and awards 
Variations of American football 
American football venues 
Women's American football 
American football logos 
American football stubs
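
Since the asker wanted the names as a list rather than printed lines, the same idea can be sketched as a small function. This is only an illustration under some assumptions: `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the category markup (not copied from the live page), and `subcategory_names` is a helper name made up for this example. It runs offline, so it can be tried without fetching Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration only).
# \u25ba is the arrow and \u200e the left-to-right mark seen in the question's output.
SAMPLE_HTML = """
<div id="mw-subcategories">
  <h2>Subcategories</h2>
  <ul>
    <li>\u25ba <a class="CategoryTreeLabel" href="#">American football by city</a>\u200e (5 C)</li>
    <li>\u25ba <a class="CategoryTreeLabel" href="#">American football by continent</a>\u200e (6 C)</li>
  </ul>
</div>
"""


def subcategory_names(html):
    """Return the text of each <a class="CategoryTreeLabel"> in the subcategory block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", {"id": "mw-subcategories"})
    if block is None:  # no subcategory section in this document
        return []
    links = block.find_all("a", class_="CategoryTreeLabel")
    # Only the link text is read, so arrows, marks, and counts are never picked up.
    return [link.get_text(strip=True) for link in links]


names = subcategory_names(SAMPLE_HTML)
print(", ".join(names))
```

Because only the link text is read, the arrows, the invisible marks, and the "(5 C)" counts never enter the result, so no separate cleanup pass is needed. To use it on the real page, pass the `raw` bytes from the question instead of `SAMPLE_HTML`, assuming the page still marks its links with the `CategoryTreeLabel` class.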