Recognizing Greek words when reading a URL with Python

I am new to Python programming. I wrote a simple script that does the following:

Asks the user for a URL
Reads the URL (urlopen(url).read())
Tokenizes the result of the above command

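I write the tokenization results to two files. One holds the words in Latin characters (English, Spanish, etc.) and the other holds everything else (Greek words, etc.).

The problem is that when I open a Greek URL, I do get the Greek text, but I get it as a sequence of characters rather than as words (which is what happens with the Latin text).

I expect to get a list of words (μαρια, γιωργος, παιδι), that is, 3 items, but what I get is ('μ','α','ρ','ι','α', ...), with as many items as there are letters.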

What should I do? (The encoding is UTF-8.)

The code follows:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

#Importing useful libraries 
#NOTE: Nltk should be installed first!!! 
import nltk 
import urllib #it could also be urllib 
import re 
import lxml.html.clean 
import unicodedata 
from urllib import urlopen 

http = "http://" 
www = "www." 
#pattern = r'[^\a-z0-9]' 

#Demand url from the user 
url=str(raw_input("Please, give a url and then press ENTER: \n")) 


#Construct a valid url syntax 
if (url.startswith("http://"))==False: 
    if(url.startswith("www"))==False: 
     msg=str(raw_input("Does it need 'www'? Y/N \n")) 
     if (msg=='Y') | (msg=='y'): 
      url=http+www+url 
     elif (msg=='N') | (msg=='n'): 
      url=http+url 
     else: 
      print "You should type 'y' or 'n'" 
    else: 
     url=http+url 

latin_file = open("Latin_words.txt", "w") 
greek_file = open("Other_chars.txt", "w") 
latin_file.write(url + '\n') 
latin_file.write("The latin words of the above url are the following:" + '\n') 
greek_file.write("Οι ελληνικές λέξεις καθώς και απροσδιόριστοι χαρακτήρες") 

#Reading the given url 

raw=urllib.urlopen(url).read() 

#Retrieve the html body from the url. Clean it from html special characters 
pure = nltk.clean_html(raw) 
text = pure 

#Retrieve the words (tokens) of the html body in a list 
tokens = nltk.word_tokenize(text) 

counter=0 
greeks=0 
for i in tokens: 
    if re.search('[^a-zA-Z]', i): 
     #greeks+=1 
     greek_file.write(i) 
    else: 
     if len(i)>=4: 
      print i 
      counter+=1 
      latin_file.write(i + '\n') 
     else: 
      del i 


#Print the number of words that I shall take as a result 
print "The number of latin tokens is: %d" %counter 

latin_file.write("The number of latin tokens is: %d and the number of other characters is: %d" %(counter, greeks)) 
latin_file.close() 
greek_file.close()

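I have checked this in many ways and, as far as I can tell, the program recognizes Greek characters but not Greek words. In other words, it does not recognize the space with which we separate the words!

If I type a Greek sentence with spaces in the terminal, it is displayed correctly. The problem occurs when I read something (for example, the body of an HTML page).

Also, regarding text_file.write(i) for the Greek i: if I write text_file.write(i + '\n') instead, the result is unidentified characters. In other words, I lose my encoding!

Any ideas about the above?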


2012-09-27 user1702506

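Please add your code to the question. – 2012-09-27 07:36:10

What do you mean by 'a sequence of characters'? Like this: ['a', 'b', 'c'], when you expected 'abc'? Post some code so that we can skip this back and forth :) –

Hint: 'print tokens' should show you what you are getting back; then you can adjust your for loop accordingly. –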
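The symptom in the question is consistent with tokenizing undecoded bytes: on Python 2, urlopen(url).read() returns a byte string, and in UTF-8 each Greek letter occupies two bytes. The following is a minimal Python 3 sketch of that idea, not the asker's script. The name raw is a stand-in for the downloaded bytes, and it assumes the page really is UTF-8.

```python
# Python 3 sketch. `raw` is a hypothetical stand-in for the bytes that
# urlopen(url).read() would return for a UTF-8 encoded page.
raw = "μαρια γιωργος παιδι".encode("utf-8")

# Undecoded, the data is a sequence of bytes, and each Greek letter
# takes two bytes in UTF-8, so byte-level processing cannot see words.
byte_count = len(raw)

# Decoding first yields text whose units are characters, so splitting
# on whitespace gives whole words.
text = raw.decode("utf-8")
words = text.split()

print(byte_count, len(text))
print(words)
```

Decoding once, right after the download, and working with text from that point on keeps the tokenizer and the file writes consistent.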

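I think that with if re.search('[^a-zA-Z]', i) you are searching for a substring rather than testing the whole string. You can get the list of words by looping over the tokens list.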


2012-09-27 07:45:48 adaniluk

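I see what you mean, but how should I tell Latin apart from the other characters? Don't I need some regular expression? – user1702506

if re.search('[^greek letters]', i) – adaniluk

What I mean is: try checking whether the word contains Greek letters; if it does, it must be a Greek word – adaniluk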
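The check suggested in the comment above (does the word contain Greek letters?) can be sketched with the standard unicodedata module. This is a Python 3 illustration only; the helper name contains_greek and the sample tokens are made up for the example, and the test relies on the Unicode character names of Greek letters starting with "GREEK".

```python
import unicodedata


def contains_greek(word):
    """Return True if any character's Unicode name starts with 'GREEK'."""
    # unicodedata.name(ch, "") returns "" for characters without a name.
    return any(unicodedata.name(ch, "").startswith("GREEK") for ch in word)


# Hypothetical tokens, already decoded to text.
tokens = ["cat", "μαρια", "niño", "παιδι", "2012"]

greek_words = [t for t in tokens if contains_greek(t)]
other_words = [t for t in tokens if not contains_greek(t)]

print(greek_words)
print(other_words)
```

A word mixing both scripts lands in greek_words here; requiring all() instead of any() would be the stricter variant.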

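Python's re module is notoriously weak in its Unicode support. For serious Unicode work, consider the alternative regex module, which fully supports Unicode scripts and properties. For example: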

text = u""" 
Some latin words, for example: cat niño määh fuß 
Οι ελληνικές λέξεις καθώς και απροσδιόριστοι χαρακτήρες 
""" 

import regex 

latin_words = regex.findall(ur'\p{Latin}+', text) 
greek_words = regex.findall(ur'\p{Greek}+', text)
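If installing the third-party regex module is not an option, a rougher approximation is possible with the standard re module by spelling out the Greek code point ranges. The sketch below is Python 3 and assumes that the two blocks listed in the comment cover the text at hand; it is not equivalent to \p{Greek}, which is defined by script rather than by block.

```python
import re
import unicodedata

text = """
Some latin words, for example: cat niño määh fuß
Οι ελληνικές λέξεις καθώς και απροσδιόριστοι χαρακτήρες
"""

# Normalize to NFC so that accented letters are single precomposed
# code points inside the ranges below.
text = unicodedata.normalize("NFC", text)

# U+0370-U+03FF is the "Greek and Coptic" block and U+1F00-U+1FFF is
# "Greek Extended". This is an approximation of the Greek script.
GREEK_WORD = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")

greek_words = GREEK_WORD.findall(text)
print(greek_words)
```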


2012-09-27 08:07:20 georg

Thanks for the help – user1702506

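Here is a simplified version of your code. It uses the excellent requests library to fetch the URL, the with statement to close the files automatically, and io to help with UTF-8.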

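import io 
import string 

import nltk 
import requests 

url = raw_input("Please, give a url and then press ENTER: \n") 
if not url.startswith('http://'): 
    url = 'http://' + url 
page_text = requests.get(url).text  # already decoded to unicode 
tokens = nltk.word_tokenize(page_text) 

def is_latin(word): 
    # unicode.isalpha() is also True for Greek letters, so test for 
    # ASCII letters explicitly (same idea as the [a-zA-Z] regex in the question) 
    return all(c in string.ascii_letters for c in word) 

latin_words = [w for w in tokens if is_latin(w)] 
greek_words = [w for w in tokens if not is_latin(w)] 

print 'The number of latin tokens is {0}'.format(len(latin_words)) 

# Python 2.7 accepts several context managers in one with statement, 
# but not wrapped in parentheses, so continue the line with a backslash 
with io.open('latin_words.txt', 'w', encoding='utf8') as latin_file, \
     io.open('greek_words.txt', 'w', encoding='utf8') as greek_file: 

    # files opened with io.open accept only unicode; write one word per line 
    greek_file.writelines(w + u'\n' for w in greek_words) 
    latin_file.writelines(w + u'\n' for w in latin_words) 

    latin_file.write(u'The number of latin words is {0} and the number of others {1}\n'.format(len(latin_words), len(greek_words))) 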

I simplified the part that checks the URL; this way, an invalid URL simply cannot be read.
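The point about io and UTF-8 also bears on the "lost encoding" the question reports when writing i + '\n'. As a rough Python 3 sketch, assuming the tokens are already decoded text: open the output file with an explicit encoding and write text rather than bytes. The helper write_words and the temporary directory are hypothetical and serve only to keep the example self-contained.

```python
import os
import tempfile


def write_words(path, words):
    """Write one word per line to `path`, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(word + "\n" for word in words)


# Write to a temporary directory so the sketch leaves nothing behind
# in the working directory.
out_dir = tempfile.mkdtemp()
greek_path = os.path.join(out_dir, "greek_words.txt")
write_words(greek_path, ["μαρια", "γιωργος", "παιδι"])

# Reading the file back with the same encoding returns the same words.
with open(greek_path, encoding="utf-8") as handle:
    read_back = handle.read().split()

print(read_back)
```

In Python 3 the built-in open is the same function as io.open, so the explicit encoding argument plays the role that io.open plays in the Python 2 code above.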


2012-09-27 08:08:21

Thanks for the help – user1702506
