Counting occurrences of words in a text file

How can I keep track of the number of times a word appears in a text file? I would like to do this for every word.

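For example, if the input is something like:

"the man said hi to the boy"

each of "man said hi to boy" would have an occurrence count of 1,

and "the" would have an occurrence count of 2.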

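I was thinking of keeping a dictionary of word/occurrence pairs, but I'm not sure how to implement this in C. Links to any similar or related problems with solutions would be great.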

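EDIT: To avoid rolling my own hash table, I decided to learn how to use glib. Along the way I found an excellent tutorial that walks through a similar problem: http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html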

I am amazed at the number of different approaches, and in particular at the simplicity and elegance of the Ruby implementation.

2008-12-25 vinc456

When did this thread turn into "share your solution in your language of choice"? – 2012-05-06 07:37:23

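Yes, a dictionary of word/occurrence pairs would work fine, and the usual way to implement such a dictionary is a hash table (or, sometimes, a binary search tree).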

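You could also use a trie (or its compressed version, the "Patricia trie"/radix trie), whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.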

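[I really think that whether hash tables or tries are better depends on the distribution of words in your input. For example, a hash table needs to store each word in its hash bucket (to guard against collisions), while in a trie, if you have a lot of words with common prefixes, those common prefixes are shared and each is stored only once, but there is still the overhead of all the pointers... If you do happen to try both, I'm curious to know how they compare.]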

2008-12-25 22:49:59 ShreevatsaR

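You can use a hash table and have every entry in the table point to a structure containing the word and the number of times it has been found so far.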

2008-12-25 22:37:30 Jared

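Could hashing two different words to the same entry cause collisions? Would I need to do some kind of checking at the entry? Or is there a perfect hash function? I'm a bit rusty, but I'll do my research. Thanks – vinc456 2008-12-25 22:41:59

This is the usual way to do it. You need to make sure you handle collisions, typically by making each hash bucket a linked list of word-count structures. +1 – dmckee 2008-12-25 22:49:52

Wasn't this at +1 when I last saw it? Why would someone downvote a correct answer? :P +1 from me. – ShreevatsaR 2008-12-25 23:12:57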

Warning, untested code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node per distinct word. */
struct LLNode
{
    struct LLNode* Next;
    char* Word;
    int Count;
};

/* Increment the count for word, appending a new node if it is not in the list yet. */
void PushWord(struct LLNode** list, const char* word)
{
    struct LLNode** link = list;
    struct LLNode* node;
    size_t len;

    /* link always points at the pointer that leads to the next node. */
    for (; *link != NULL; link = &(*link)->Next)
    {
        if (!strcmp((*link)->Word, word))
        {
            (*link)->Count++;
            return;
        }
    }

    /* Not found: append a new node at the tail. */
    len = strlen(word);
    node = malloc(sizeof *node);
    if (node == NULL)
        return;
    node->Word = malloc(len + 1);
    if (node->Word == NULL)
    {
        free(node);
        return;
    }
    memcpy(node->Word, word, len + 1);
    node->Count = 1;
    node->Next = NULL;
    *link = node;
}

void GetCounts(const struct LLNode* list)
{
    for (; list != NULL; list = list->Next)
    {
        printf("Word: %s, Count: %i\n", list->Word, list->Count);
    }
}

/* Split words on spaces and push each word onto the list. */
void PushWords(struct LLNode** list, const char* words)
{
    size_t len = strlen(words);
    char* buff = malloc(len + 1); /* a word can never be longer than the whole input */
    size_t index = 0;
    size_t i;

    if (buff == NULL)
        return;
    for (i = 0; i <= len; i++) /* i == len treats the terminating '\0' as a separator */
    {
        char ch = words[i];
        if (ch == ' ' || ch == '\0')
        {
            if (index > 0)
            {
                buff[index] = '\0';
                PushWord(list, buff);
                index = 0;
            }
        }
        else
        {
            buff[index++] = ch;
        }
    }
    free(buff);
}

void FreeList(struct LLNode* list)
{
    while (list != NULL)
    {
        struct LLNode* next = list->Next;
        free(list->Word);
        free(list);
        list = next;
    }
}

int main(void)
{
    struct LLNode* list = NULL;
    PushWords(&list, "Hello world this is a hello world test bla");
    GetCounts(list);
    FreeList(list); /* release our memory here */
    return 0;
}

I only just wrote this, so it might not work, but that is the general idea.

Another sketch, this time in C++ (note: std::map has pretty good search times):

#include <cstring>
#include <iostream>
#include <map>
#include <string>

using namespace std;

typedef map<string, int> CountMap;

// Split words on spaces and count each word in the map.
void PushWords(CountMap& list, const char* words)
{
    size_t len = strlen(words);
    string str;
    for (size_t i = 0; i < len; i++)
    {
        char ch = words[i];
        if (ch != ' ')
        {
            str += ch;
        }
        else if (!str.empty())
        {
            list[str]++; // operator[] starts a missing count at 0
            str.clear();
        }
    }

    // Count the last word when the input does not end with a space.
    if (!str.empty())
    {
        list[str]++;
    }
}

void PrintCount(const CountMap& list)
{
    for (CountMap::const_iterator iter = list.begin(); iter != list.end(); ++iter)
    {
        cout << iter->first << " : " << iter->second << '\n';
    }
}


int main()
{
    CountMap counts;
    PushWords(counts, "Hello world this is a hello world test bla");
    PrintCount(counts);
}

2008-12-25 22:52:03 nlaq

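Just for the curious, here is a simple Ruby solution to the word count problem. It should be basically the same algorithm as in C, where it just takes a lot more code.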

h = Hash.new(0) 
File.read("filename.txt").split.each do |w| 
    h[w] += 1 
end 
p h

2008-12-26 00:23:33 martinus

Does this count?

#include <stdio.h> 
#include <stdlib.h> 
int main(int argc, char **argv) 
{ 
    char buffer[2048]; 
    if (argc != 2) 
    { 
     fprintf(stderr, "Usage: %s file\n", argv[0]); 
     exit(EXIT_FAILURE); 
    } 
    snprintf(buffer, sizeof(buffer), "tr -cs '[a-z][A-Z]' '[\\n*]' < %s |" 
            " sort | uniq -c | sort -n", argv[1]); 
    return(system(buffer)); 
}

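It basically encapsulates the canonical script that illustrates how to count words on Unix with a shell script.

The 'tr' command translates anything that isn't an alphabetic character into a newline and squeezes out duplicates. The first 'sort' groups all occurrences of each word together. The 'uniq -c' counts the number of consecutive appearances of each word, printing the word and its count. The second 'sort' puts them in order of increasing count. You may need to tinker with the options to 'tr'; it isn't the most stable command from system to system, and it routinely makes me fix things up by hand. On Solaris 10 using /usr/bin/tr, the code above produces (on its own source):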

1 
    1 A 
    1 EXIT 
    1 FAILURE 
    1 Usage 
    1 Z 
    1 a 
    1 c 
    1 cs 
    1 exit 
    1 file 
    1 fprintf 
    1 if 
    1 main 
    1 return 
    1 sizeof 
    1 snprintf 
    1 stderr 
    1 stdio 
    1 stdlib 
    1 system 
    1 tr 
    1 uniq 
    1 z 
    2 argc 
    2 char 
    2 h 
    2 include 
    2 int 
    2 s 
    2 sort 
    3 argv 
    3 n 
    4 buffer

2008-12-26 00:24:15

For individual words, there is no need to write a program at all, unless this is part of something larger:

sed -e 's/[[:space:]]/\n/g' < file.txt | grep -c WORD

2008-12-26 09:50:27

#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>

using namespace std;

struct stdt
{
     char name[20] ;
     int id ;

}; //std

int main()
{
     stdt boy = {} ;
     int a = 0 ;
     ofstream TextFile ;
     cout << "Begin File Creation \n" ;
     // Raw struct bytes are written, so open the file in binary mode.
     TextFile.open("F:\\C++ Book Chapter Program\\Ch 7\\File.txt", ios::binary);
     if (!TextFile)
     {
      cerr << "Error 100 Opening File.DAT" ;
      exit(100);
     }//end if
     while (a < 3)
     {
      cout << "\nEnter Name : " ;
      cin >> setw(sizeof boy.name) >> boy.name; // setw keeps the read within name[20]
      cout << "\nEnter ID : " ;
      cin >> boy.id ;
      // Write the record only after it has been filled in.
      TextFile.write((char*) &boy , sizeof (boy)) ;
      a++;
     }//end while

     TextFile.close();
     cout << "\nEnd File Creation" ;

     ifstream TextFile1 ;
     TextFile1.open("F:\\C++ Book Chapter Program\\Ch 7\\File.txt", ios::binary);
     while (TextFile1.read((char*) &boy , sizeof (boy)))
     {
      cout << "\nName : " << boy.name;
      cout << "\nID : " << boy.id ;
     }// end While

     return 0 ;
}//end main

2009-05-08 15:21:37

In Perl:

my %wordcount =(); 
while(<>){map {$wordcount{$_}++} (split /\s+/)} 
print "$_ = $wordcount{$_}\n" foreach sort keys %wordcount;

And in Perl golf (just for fun):

my%w;      
map{$w{$_}++}split/\s+/while(<>); 
print"$_=$w{$_}\n"foreach keys%w;

2009-05-08 15:47:35 dsm
