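I'm trying to solve this problem but keep juggling between approaches. How can I collect the unique words and their frequencies from a collection of 1M files?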
import java.io.*;
import java.util.*;

public class file {
    public static void main(String[] args) throws Exception {
        int count = 0; // total number of tokens read across all files
        File folder = new File("<folder path>"); // the collection of files
        File[] listOfFiles = folder.listFiles();
        if (listOfFiles == null) {
            System.out.println("Not a readable directory: " + folder);
            return;
        }
        HashMap<String, Integer> words_fre = new HashMap<String, Integer>();
        FileWriter fw = new FileWriter("abc.txt");
        for (File file : listOfFiles) {
            if (file.isFile()) {
                // try-with-resources closes the Scanner after each file,
                // which matters when there are very many files
                try (Scanner sc = new Scanner(file)) {
                    //sc.useDelimiter("\\W");
                    while (sc.hasNext()) {
                        String s = sc.next();
                        s = s.replaceAll("\\<.*?>", ""); // strip <...> tags
                        count++; // words count
                        if (words_fre.containsKey(s)) {
                            int a = words_fre.get(s);
                            words_fre.put(s, a + 1);
                        } else {
                            words_fre.put(s, 1); // first time this word is seen
                        }
                    }
                } catch (IOException e) {
                    System.out.println(e);
                }
            }
        }
        // Write the sorted words once, after all files have been read
        // (writing inside the file loop would repeat the whole map per file)
        Object[] key = words_fre.keySet().toArray();
        Arrays.sort(key);
        for (int i = 0; i < key.length; i++) {
            fw.write(key[i] + " : " + words_fre.get(key[i]) + "\n");
        }
        fw.write("Total Words = " + count + "\n");
        fw.write("Unique Words = " + words_fre.size());
        fw.close();
    }
}
So basically my output looks something like this, e.g.-
: 3
16800 : 1
23-12-2010 : 1
7 : 1
6 : 2
8वीं : 2
अंशु : 1
अधिकतर : 2
अन्य : 1
अपने : 1
हो। : 1
॥ : 1
: 3
I also need to remove the bracket entries: the first item [3], the second-to-last [||:1], and the last [3].
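One possible way to drop such entries, sketched here under assumptions rather than as a confirmed fix: strip Unicode punctuation and symbols from both ends of each token and skip whatever becomes empty. The class name `TokenFilter` and the methods `clean` and `count` are hypothetical names made up for this illustration, and the sample tokens are written as Unicode escapes so the source compiles regardless of file encoding. It assumes that punctuation inside a token (such as the hyphens in a date) should be kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the original program.
final class TokenFilter {
    // Runs of Unicode punctuation (\p{P}) or symbols (\p{S}) at the start or end of a token.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^[\\p{P}\\p{S}]+|[\\p{P}\\p{S}]+$");

    // Strips punctuation from both ends; a pure-punctuation token becomes "".
    static String clean(String token) {
        return EDGE_PUNCT.matcher(token).replaceAll("");
    }

    // Counts cleaned tokens; TreeMap keeps the keys sorted, so no separate sort is needed.
    static Map<String, Integer> count(Iterable<String> tokens) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String t : tokens) {
            String w = clean(t);
            if (!w.isEmpty()) {
                freq.merge(w, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // "(", the word "\u0939\u094B" followed by a danda (\u0964), the same word alone,
        // a double danda (\u0965), a date, and ")"
        List<String> sample = Arrays.asList(
                "(", "\u0939\u094B\u0964", "\u0939\u094B", "\u0965", "23-12-2010", ")");
        count(sample).forEach((w, n) -> System.out.println(w + " : " + n));
    }
}
```

In the loop above, this would amount to calling something like `clean(s)` after the `replaceAll` and skipping empty results. Note that it also merges a word with a trailing danda into the bare word, which may or may not be what is wanted.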
After pulling the repo, when you edit and send it again, what error message are you getting? – Girish