Cleaning text with Beautiful Soup

OK, so I am working on an HTML document with Beautiful Soup, and I have done the following:
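from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/" + "Category:American_football"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")  # name the parser explicitly
pages = soup.find("div", {"id": "mw-subcategories"})
cleaned = pages.get_text()
cleaned = cleaned.encode("utf-8")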
and my output looks like this:
"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n"
I am trying to figure out how to strip out everything but the actual text names, i.e. remove pieces like:
\xe2\x80\x8e (6 C, 218 P)\n\n\n
Is there a trick for getting rid of this when using the Beautiful Soup library, or how should I go about refining the text further?
What do you mean by "the actual text name"? What is the desired output?
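I just want a list of the subcategory names, like this: American football by city, American football by continent, American football by country, etc., without the formatting characters – jdv12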
Then you should navigate to them directly. Select each 'a' with 'class="CategoryTreeLabel"', then get the text of each one.
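
The suggestion in the last comment could look roughly like the sketch below. It runs against a small inline HTML sample rather than the live page, so the markup in SAMPLE_HTML and the helper name subcategory_names are assumptions made for illustration; the real Wikipedia markup may differ or change over time. The idea is that selecting the link elements themselves avoids the arrows, counts, and invisible marks that get_text() on the whole div pulls in.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the "mw-subcategories" block.
# The live page's markup is an assumption here and may not match exactly.
SAMPLE_HTML = """
<div id="mw-subcategories">
  <h2>Subcategories</h2>
  <ul>
    <li><span>&#9658;</span>
        <a class="CategoryTreeLabel CategoryTreeLabelNs14"
           href="/wiki/Category:American_football_by_city">American football by city</a>&lrm;
        <span>(5 C)</span></li>
    <li><span>&#9658;</span>
        <a class="CategoryTreeLabel CategoryTreeLabelNs14"
           href="/wiki/Category:American_football_by_continent">American football by continent</a>&lrm;
        <span>(6 C)</span></li>
  </ul>
</div>
"""


def subcategory_names(html):
    """Return the text of every <a class="CategoryTreeLabel"> in the subcategory div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "mw-subcategories"})
    if container is None:
        # The page has no subcategory block, so there is nothing to list.
        return []
    # class_ matches any element whose class list contains this value.
    links = container.find_all("a", class_="CategoryTreeLabel")
    return [link.get_text(strip=True) for link in links]


names = subcategory_names(SAMPLE_HTML)
print(names)
```

With the page fetched as in the question, the same function could be called as subcategory_names(raw). Because only the link text is collected, there should be no need to encode the result and strip byte sequences afterwards.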