2013-03-30 49 views
0

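Read text files and insert the words that appear in them into a SQL Server table

I have a folder containing 25,000 text files. I want to read these files and insert their words into a table. The files are named 1.txt, 2.txt, and so on up to 25000.txt. Each text file contains words in the following form.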

sample contents of my file 
apple 
cat 
rat 
shoe 

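The same words may also appear in other text files. I want C# code that reads the text files, identifies both the words that repeat and those that do not, and inserts them into a SQL Server table laid out like the one below.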

keyword document name 
cat  1.txt,2.txt,3.txt 
rat  4.txt,1.txt 
fish  5.txt 

`

using System; 
using System.Collections.Generic; 
using System.ComponentModel; 
using System.Data; 
using System.Drawing; 
using System.Linq; 
using System.Text; 
using System.Windows.Forms; 
using System.IO; 
using System.Data.SqlClient; 

namespace RAMESH 
{ 
public partial class Form1 : Form 
{ 
    public Form1() 
    { 
     InitializeComponent(); 
    } 

    private void textBox1_TextChanged(object sender, EventArgs e) 
    { 

    } 

    private void button2_Click(object sender, EventArgs e) 
    { 

     string[] files = Directory.GetFiles(textBox1.Text, "*.txt"); 
     int i; 
     string sqlstmt,str; 
     SqlConnection con = new SqlConnection("data source=dell-pc\\sql1; initial   catalog=db; user id=sa; password=a;"); 
     SqlCommand cmd; 
     sqlstmt = "delete from Items"; 
     cmd = new SqlCommand(sqlstmt, con); 
     con.Open(); 
     cmd.ExecuteNonQuery(); 
     for (i = 0; i < files.Length; i++) 
     { 
      StreamReader sr = new StreamReader(files[i]); 
      FileInfo f = new FileInfo(files[i]); 
      string fname; 
      fname = f.Name; 
      fname = fname.Substring(0, fname.LastIndexOf('.')); 
      //MessageBox.Show(fname); 
      while ((str = sr.ReadLine()) != null) 
      { 
       int nstr=1; 
       //int x,y; 
       //for (x = 0; x < str.Length; x++) 
       //{ 
       // y = Convert.ToInt32(str.Substring(x,1)); 
       // if ((y < 48 && y > 75) || (y < 65 && y > 97) || (y < 97 && y > 122)) ; 
       //} 
       sqlstmt = "insert into Items values('" + str + "','" + fname + "')"; 
       cmd = new SqlCommand(sqlstmt, con);      
       try 
       { 
        cmd.ExecuteNonQuery(); 
       } 
       catch (Exception ex) 
       { 
        sqlstmt = "update Items set docname=docname + '," + fname + "' where itemname='" + str + "'"; 
        cmd = new SqlCommand(sqlstmt, con); 
        cmd.ExecuteNonQuery(); 
       } 
      } 
      sr.Close(); 
     } 
     MessageBox.Show("keywords added successfully"); 
     con.Close(); 
    } 
} 

} `

+3

What have you tried so far? Show us what you have done yourself and ask a specific question, and you will get specific answers. –

+0

OK, I'm sending my C# code now – rameshkumar

+0

Can you add a stored procedure to your database? This code is very inefficient and prone to many problems, such as SQL injection – Steve

Answers

1

First, I would add a stored procedure to the database to isolate the update-or-insert logic.

CREATE PROCEDURE UpsertWords 
@word nvarchar(MAX), @file nvarchar(256) 
as 

    Declare @cnt integer 
    Select @cnt = Count(*) from Items where ItemName = @word 
    if @cnt = 0 
     INSERT INTO Items (ItemName, docname) VALUES (@word, @file) 
    else 
     UPDATE Items SET docname = docname + ',' + @file where ItemName = @word 

Now we can simplify your code:

..... 

// Build the command just once, outside the loop, 
// and point it at the stored procedure above 
cmd = new SqlCommand("UpsertWords", con); 
cmd.CommandType = CommandType.StoredProcedure; 

// Create placeholder parameters; the actual values are supplied inside the loop 
cmd.Parameters.AddWithValue("@word", string.Empty); 
cmd.Parameters.AddWithValue("@file", string.Empty); 

// Now loop over every file 
for (i = 0; i < files.Length; i++) 
{ 
    // Open the current file and read all of its lines 
    string[] lines = File.ReadAllLines(files[i]); 

    // Get only the file name, without the extension 
    string fname = Path.GetFileNameWithoutExtension(files[i]); 

    // With one line per file this loop runs just once, 
    // but it also handles files with more than one line 
    foreach (string line in lines) 
    { 
        // Set the actual values of the parameters created outside the loop 
        cmd.Parameters["@word"].Value = line; 
        cmd.Parameters["@file"].Value = fname; 
        // Run the insert or update (the logic lives in the stored procedure) 
        cmd.ExecuteNonQuery(); 
    } 
} 

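At this point it is not clear whether each of your lines holds a single word or several words separated by a delimiter (tab, comma, semicolon). In the latter case you need to split the string and add another loop.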
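If a line can hold several words, the split-and-loop step might look like the following minimal sketch. The `WordSplitter` class and its `SplitWords` method are hypothetical names, and the separator set (tab, comma, semicolon, space) is an assumption to adjust to the actual file format.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Assumed separators; the real files may use a different set.
    private static readonly char[] Separators = { '\t', ',', ';', ' ' };

    // Splits one line into trimmed, non-empty words.
    public static List<string> SplitWords(string line)
    {
        var words = new List<string>();
        if (string.IsNullOrWhiteSpace(line))
            return words;

        foreach (string part in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = part.Trim();
            if (word.Length > 0)
                words.Add(word);
        }
        return words;
    }
}
```

Inside the `foreach (string line in lines)` loop, each word returned by `SplitWords(line)` would then be assigned to the `@word` parameter before calling `ExecuteNonQuery`.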

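However, I think your database schema is wrong. It would be better to add a new row for every word, together with the file in which it appears. That way a simple query like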

SELECT docname from Items where itemname = @word 

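will return all the files without any major performance problem, and you get a more searchable database.
Or, if you need to count the occurrences of a word: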

SELECT ItemName, COUNT(ItemName) as WordCount 
FROM Items 
GROUP BY ItemName 
ORDER BY Count(ItemName) ASC 
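
The one-row-per-word-and-file layout recommended above can be prepared in memory before anything is sent to the database. The sketch below is only an illustration: `WordFilePairs`, `Collect`, and the case-insensitive comparison are assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordFilePairs
{
    // Returns distinct (word, document) pairs, one per future table row.
    // Keys are compared case-insensitively (an assumption).
    public static List<KeyValuePair<string, string>> Collect(IEnumerable<string> files)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var rows = new List<KeyValuePair<string, string>>();

        foreach (string path in files)
        {
            string docName = Path.GetFileNameWithoutExtension(path);
            foreach (string line in File.ReadLines(path))
            {
                string word = line.Trim();
                if (word.Length == 0)
                    continue;

                // '\n' cannot occur inside a single line, so the combined key is unambiguous.
                if (seen.Add(word + "\n" + docName))
                    rows.Add(new KeyValuePair<string, string>(word, docName));
            }
        }
        return rows;
    }
}
```

Each pair would then go into its own row, for example through a parameterized `INSERT` that reuses one `SqlCommand`, which makes the comma-separated `docname` column unnecessary.
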
+0

Can you briefly explain the procedure? My text files contain only one word per line – rameshkumar

+0

I will add comments to the code above – Steve

+0

The format of the table is – rameshkumar

0

Try this approach:

Start with your files: iterate over them and create a simple XML document for each one.

var fname = "File12.txt"; 
var keywords = new List<string>(new[]{ "dog", "cat", "moose" }); 

var miXML = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("root")); 

foreach (var el in keywords.Select(i => new XElement("item", new XAttribute("key", i)))) 
{ 
    miXML.Root.Add(el); 
} 

using (var con = new SqlConnection("Server=localhost;Database=HT;Trusted_Connection=True;")) 
{ 
    con.Open(); 
    using (var cmd = new SqlCommand("uspUpsert", con) {CommandType = CommandType.StoredProcedure}) 
    { 
        cmd.Parameters.AddWithValue("@X", miXML.ToString()); 
        cmd.Parameters.AddWithValue("@fileName", fname); 
        cmd.ExecuteNonQuery(); 
    } 
} 
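
The snippet above hard-codes the file name and keywords. As a rough sketch, the XML payload for a real file could be built from its lines as follows. `KeywordXml` and `Build` are hypothetical names, and skipping repeated words within one file is an assumption; the `root`/`item`/`key` layout is taken from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class KeywordXml
{
    // Builds <root><item key="..."/>...</root> from the lines of one file,
    // skipping blank lines and words repeated within that file.
    public static XDocument Build(IEnumerable<string> lines)
    {
        var items = lines
            .Select(l => l.Trim())
            .Where(w => w.Length > 0)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .Select(w => new XElement("item", new XAttribute("key", w)));

        return new XDocument(new XElement("root", items));
    }
}
```

For a given `path`, `KeywordXml.Build(File.ReadLines(path)).ToString()` could then be passed as the `@X` parameter, with `Path.GetFileName(path)` as `@fileName`.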

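Then, for the stored procedure, you can call this proc, which converts the XML into a table and inserts the keywords and the file name into the database.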

CREATE PROCEDURE uspUpsert 
    @X xml, 
    @Filename varchar(100) 
AS 
BEGIN 
SET NOCOUNT ON; 

    WITH KV as (
     select 
      x.v.value('@key', 'varchar(20)') as Keyword 
      ,@FileName as FileName 
     FROM @x.nodes('/root/item') x(v) 
    ) 
    insert into Items 
    select KV.keyWord, KV.FileName 
    from KV 
    left outer join Items I on I.Keyword=KV.Keyword and I.FileName=KV.FileName 
    where I.id is null 
END 

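Since you probably don't want to search a string like "file1.txt file2.txt file3.txt" to find duplicates, you can use this query to find the files in which a word occurs: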

select * from items where keyword='dog' 

You can now also run counts and any other aggregates on this table.
