我需要將.text文件中的文本複製到另一個.txt文件中。我唯一的限制是，如果有任何重複，我只會把這個詞放在新文件中一次。分隔原始文件中單詞的唯一值是空格。我正在考慮將整個文本複製到一個字符串中，然後檢查重複，但我不知道如何檢查，如果是一個好主意。你們能幫我一些想法嗎？複製文件中的單詞

2015-07-20 Marco

使二叉樹（或類似線索，地圖（或設置）） – BLUEPIXY

戰略

我們需要的數據結構，可容納單詞的數量不受限制，可以稱之爲SET。

這個結構必須有一個操作來添加一個單詞，讓它叫做SET :: ADD。
這種操作的返回值必須表明單詞是否已添加或是否有錯誤。

的SET結構，就是一個字只能添加一次，所以有通過SET返回了兩個錯誤值的特點::地址：GENERIC_ERROR用於內部實現錯誤和DUPLICATE爲嘗試插入已經存在的單詞。

的SET結構也有一個操作對其進行初始化（SET :: INIT）和一個釋放它（SET ::免費）。

然後我們從輸入文件中讀取單詞，每次只讀一個單詞，然後我們將每個單詞添加到SET結構中。
如果插入成功，我們將這樣一個單詞寫入輸出文件，否則我們跳過這一步。

僞算法

1. S = SET::INIT 
2. FIN = OpenFile(input_file) 
3. FOUT = OpenFile(output_file); 
4. FOR EACH WORD IN FIN 
4.1 IF SET::ADD(S, WORD) == SUCCESS THEN 
4.1.1 WriteToFile(FOUT, WORD) 
4.2 END IF 
5. CloseFile(FIN); 
6. CloseFile(FOUT); 
7. SET::FREE(S);

的戰術

真正艱苦的工作在這裏正在實施SET結構。
數據結構由可在其上執行的操作以及此操作的前後條件定義。

所以理論上我們只需要做到這一點實現SET當簡單的事情::地址：搜索，如果這個詞已經存在，添加它：

1. FUNCTION SET::ADD(S, W) 
1.1 FOR EACH WORD IN S 
1.1.1 IF WORD == W THEN 
1.1.1.1 RETURN DUPLICATE 
1.1.2 END IF 
1.2 ADD W TO S

這兩個步驟嚴重依賴實施。
對於具有這種需求的數據結構有很多實現，例如非常天真的實現可以使用固定大小的指針陣列。然而這具有嚴重的缺點。
更好的實現可能是鏈接列表，這讓我們插入無限數量的單詞，但需要線性時間來插入和搜索！

所以我們進入時間複雜的領域。
現在我會告訴你，這個結構可以用分期固定的時間來實現。但讓我們從頭開始。

從鏈表的下一個邏輯步驟是：用二叉樹，使用和哈希表。
兩者都可以同時進行插入和搜索，即兩個操作都可以合併。
第一個採取ø（爲log N）來插入和搜索，第二次在Ö（）。
然而，第一個更結構化讓我們不僅做搜索，還做有序搜索。這個特性對這個問題沒有用，所以我們應該選擇哈希表。

我選擇了樹。這是因爲二叉樹讓你用指針和遞歸練習比散列表更好，這是一個Well Well Type（當你參加一個類型理論課你會感謝我！）。

如果你不熟悉二叉樹，現在就開始吧！詢問Google或您的老師！

實施

我們的樹的節點應該包含

它存儲
的鏈接文字（即指針）到它的左子樹
的鏈接（即指針）其右子樹

最後兩個很簡單，第一個可以實現爲poi或者作爲固定大小的結構內陣列。
我選擇了後者，因爲它使得節點只需要一個呼叫到malloc，這也得到了一個事實的支持，即由於我們用fscanf來讀取它，所以我們有一個詞的最大尺寸。

的一些注意事項，以代碼

我用指針的指針來實現add_word，這使溶液優雅，但保持鉛筆和紙在手！
此外，我（應該）已確保通過使用strncpy,snprintf並動態地將fscanf格式說明符與右len一起使用，不會發生緩衝區溢出。 這是很好的做法 -
我已經使用了assert檢查是否爲格式說明緩衝區分配的大小不夠大，這是因爲正確的大小可以通過編程來計算，一旦代碼被編譯它保持固定，所以不需要重運行檢查。
與fscanf使用的格式說明的形式爲％S其中是MAX_WORD_SIZE，因此，例如它原來是「40多歲的％」。
我使用遞歸獲得樂趣，將三個從葉子釋放到根。

#include <stdio.h> 
#include <string.h> 
#include <stdlib.h> 
#include <assert.h> 


/* Values returned by add_word */ 
/* The word has been added to the tree  */ 
#define AW_ADDED   0 
/* The word cannot be added as malloc failed*/   
#define AW_ERROR  -1 
/* The word is a duplicate     */ 
#define AW_DUPLICATE  1  

/* Maximum size of a word     */ 
#define MAX_WORD_SIZE 40   


/* Structure of the binary tree node  */ 
typedef struct node 
{ 
    struct node* left;    /* Ptr to left (less than) branch  */ 
    struct node* right;    /* Ptr to right (greater than) branch */ 
    char word[MAX_WORD_SIZE+1];  /* Word stored in the node    */ 
} node; 


/* 
    Add a word to the tree identified by the root pointer. 
    This function allocate all the memory itself, the root of a tree is a pointer to a node structure. 

    root is a pointer to the root (a ptr to ptr since the root may be updated) 
    word is the word to add 
*/ 
int add_word(node** root, const char* word) 
{ 
    int compare; 

    /* Traverse the tree until you find a null pointer (beware, we work with ptr to ptr) */ 
    while (*root) 
    { 
     /* Compare the current node word with the given one */ 
     compare = strcmp(word, (*root)->word); 

     /* They are equal? Easy, just return the appropriate value */ 
     if (!compare) 
      return AW_DUPLICATE; 

     /* Move to the left of right based on the sign of the comparison */ 
     root = compare < 0 ? &((*root)->left) : &((*root)->right); 
    } 

    /* We found a null ptr to update with a ptr to a new node */ 

    /* Allocate memory */ 
    if (!(*root = malloc(sizeof(node)))) 
     return AW_ERROR; 

    /* Init the new node */ 
    (*root)->left = (*root)->right = 0; 

    /* Copy the given word, up to MAX_WORD_SIZE byte*/ 
    strncpy((*root)->word, word, MAX_WORD_SIZE); 

    /* Set a null terminator on the last byte in the case the word is exactly MAX_WORD_SIZE char*/ 
    (*root)->word[MAX_WORD_SIZE] = 0; 

    return AW_ADDED; 
} 

/* 
    Free the memory used by the tree 
    Set the pointers to NULL. 
    Use recursion for didactic purpose, an iterative solution would consume less resources as 
    this is NOT tail recursion. 
*/ 
void free_tree(node** root) 
{ 
    if (*root) 
    { 
     /* Go to children nodes */ 
     free_tree(&((*root)->left)); 
     free_tree(&((*root)->right)); 

     /* Free current node */ 
     free(*root); 
     *root = NULL; 
    } 

} 

int main() 
{ 
    /* Open the files */ 
    FILE* fin = fopen("in.txt", "r"); 
    FILE* fout = fopen("out.txt", "w"); 

    /* Check the descriptors */ 
    if (!fin) 
     return printf("Cannot open input file\n"), 1; 

    if (!fout) 
     return printf("Cannot open output file\n"), 2; 

    /* This is out tree */ 
    node* root = NULL; 
    /* This is the buffer for reading word from fin*/ 
    char new_word[MAX_WORD_SIZE+1]; 
    /* This is the buffer for creating fscanf format specifier*/ 
    char format[32]; 

    /* Create the fscanf format specifier */ 
    int char_used = snprintf(format, sizeof(format), "%%%ds", MAX_WORD_SIZE); 
    assert(char_used + 1 <= sizeof(format)); 

    /* Read the file until the end */ 
    while (!feof(fin)) 
    { 
     /* Read a word and add it to the tree, if it is added, write it to new file */ 
     if (fscanf(fin, format, new_word) && add_word(&root, new_word) == AW_ADDED) 
      fprintf(fout, "%s ", new_word);        
    } 

    /* Close and exit */ 
    fclose(fin); 
    fclose(fout); 

    free_tree(&root); 

    return 0; 

}

來源

2015-07-20 23:22:54

您可以嘗試使用MD5或類似的東西散列字符串，然後將它們存儲在二叉樹。應具有相當低的平均時間複雜度。我不確定MD5的速度有多快;對於小字符串可能會有更好的散列算法。

你可以只存儲在陣列中的所有字符串和使用的strcmp在每次你拿起一個新的字符串時所有的人，如果你不關心效率雖然。

來源

2015-07-20 23:13:45

複製文件中的單詞

回答

戰略

的戰術

實施

相關問題