Advice on how to merge CSV records into a SQL table in ASP.NET

I have an ASP.NET MVC application for which I am trying to write an import function.

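I do have some specifics, such as the fact that I am using Entity Framework V4 in an MVC application, but I am mainly interested in which algorithm would work best, preferably with an explanation of what kind of performance it has and why.

The operation will be executed asynchronously, so execution time is not as much of a factor as something like RAM usage.

I should point out that there are a few things (the database being the main one) which I have been forced to inherit and which, due to time constraints, I will not be able to clean up until later.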

Details

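The import function takes an in-memory CSV file (which has been exported from Salesforce and uploaded) and merges it into an existing database table. The process needs to be prepared to:

Update existing records which may have been altered in the CSV, without deleting and re-adding the database record, in order to preserve the primary key of each record.
Add and remove any records as they change in the CSV file.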

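The current structure of the CSV and the database table is as follows:

The table and the CSV both contain 52 columns.
Every column in the existing database schema is a VARCHAR(100) field; I plan to optimise this, but cannot within the current time frame.
The database back end is MS SQL.
The CSV file has about 1,700 rows of data in it. I can't see this number exceeding 5,000, as there are apparently a lot of duplicate entries already.
For now, I am only going to import 10 of the columns from the CSV; the rest of the table's fields will be left null, and I will remove the unneeded columns at a later date.
The CSV file is being read into a DataTable to make it easier to work with.
I initially thought that the ContactID field in my Salesforce CSV was a unique identifier, but after doing some test imports it appears that there are zero unique fields in the CSV file itself, at least none that I can find.
Given this, I have been forced to add a primary key field to the Contacts table so that other tables can still maintain a valid relationship with any given contact. However, this obviously prevents me from simply deleting and recreating the records on each import.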

BEGIN EDIT

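It has become clear to me that what I was trying to achieve, performing updates on existing database records when no relationship exists between the table and the CSV, simply cannot be done.

It is not so much that I didn't know this beforehand, but more that I was hoping there was some bright idea I hadn't thought of which could do it.

With that in mind, I ended up deciding to make an assumption in my algorithm that ContactID is a unique identifier, and then to see how many duplicates I ended up with.

I am going to post a possible solution as an answer below, both the algorithm and an actual implementation. I will leave it for a few days, because I would much prefer to accept somebody else's better solution as the answer.

Here are some things I found after implementing my solution below:

I had to narrow down the columns supplied by the CSV so that they matched the columns being imported into the database.
The SqlDataReader is perfectly fine; what has the biggest impact is the individual UPDATE/INSERT queries being executed.
For a completely fresh import, the initial read of items into memory is not noticeable from the UI, and the insert process takes about 30 seconds to complete.
Only 15 duplicate IDs were skipped on a fresh data set, which is less than 1% of the total data set. I consider this an acceptable loss, as I am told the Salesforce database is going to undergo a cleanup anyway. I am hoping the IDs can be regenerated in these cases.
I have not yet collected any resource metrics during the import, but in terms of speed this is fine, because the progress bar I have implemented provides feedback to the user.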

END EDIT

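Resources

Given the allocated size of each field, even with this relatively small number of records, my main concern is the amount of memory that might be allocated during the import.

The application will not be run in a shared environment, so there is room to breathe in that respect. Also, this particular function would only be run manually, about once a week.

My goal is to at least be able to run comfortably on a semi-dedicated machine. Machine specs are variable, as the application may eventually be sold as a product (though, again, not one targeted at shared environments).

In terms of the run time of the import process, as mentioned, it is going to be asynchronous, and I have already put together some AJAX calls and a progress bar. So I would imagine that anywhere up to a minute or two would be fine.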

Solution

I found the following post, which seems to be close to what I want:

Compare two DataTables to determine rows in one but not the other

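It seems to me that performing lookups against a hash table is a good idea. However, as mentioned, I would prefer to avoid loading both the CSV and the Contacts table into memory in their entirety, and I can't see a way to avoid that with the hashtable method.

One thing I am not sure how to achieve is calculating a hash of each row for comparison when one set of data is a DataTable object and the other is an EntitySet of Contact items.

I am thinking that unless I want to manually iterate over each column value in order to calculate a hash, I will need both data sets to be the same object type, unless anybody has some fancy solution.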
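One possible way around the differing object types, sketched here under some assumptions, is to project both sides onto the same ordered list of column values and compare the resulting strings. The `RowFingerprint` class and the cut-down `Contact` class are hypothetical stand-ins, not part of Entity Framework or of the code in this thread. A separator character is used so that the values ("ab", "c") and ("a", "bc") do not collapse into the same string.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical stand-in for the Entity Framework Contact entity.
public class Contact
{
    public string ContactID { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class RowFingerprint
{
    // ASCII unit separator: assumed not to occur in the CSV text.
    private const string Separator = "\u001F";

    // Joins values in a fixed order; null and DBNull are treated as empty strings.
    public static string FromValues(IEnumerable<object> values)
    {
        return string.Join(Separator,
            values.Select(v => v == null || v == DBNull.Value ? "" : v.ToString()));
    }

    // Fingerprint of a DataRow, restricted to the imported columns.
    public static string FromDataRow(DataRow row, IList<string> columns)
    {
        return FromValues(columns.Select(c => row[c]));
    }

    // Fingerprint of an entity; the property order must match the column list
    // passed to FromDataRow.
    public static string FromContact(Contact c)
    {
        return FromValues(new object[] { c.ContactID, c.FirstName, c.LastName });
    }
}
```

Two rows are then considered equal when their fingerprints are equal, regardless of whether they came from the DataTable or from the entity set. The cost is still one pass over every column value, just hidden behind a helper.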

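Am I best off simply forgetting about Entity Framework for this procedure? I have certainly spent a lot of time trying to get it to perform bulk operations even remotely well, so I am more than happy to remove it from the equation.

If anything doesn't make sense or is missing, I apologise; I am very tired. Just let me know and I will fix it tomorrow.

I appreciate any help that can be offered, as I am starting to get desperate. I have spent a great deal of time agonising over how to solve this.

Thanks!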

Source

2011-02-15 Geekman

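Every table in Salesforce has a primary key column called Id, so you should be able to have a unique key for every row in the CSV. Also, depending on exactly how you generate the CSV, it may have 15-character IDs, which are case sensitive, e.g. 00a & 00A refer to different records. – superfell 2011-02-15 16:40:15

@superfell The case-sensitive values you refer to are in fact what is used in the ContactID and AccountID fields, but even after modifying the collation of the corresponding database columns I still ran into non-unique ContactIDs. A grep of the CSV confirmed this. I will have to look into adding the Id column you mention to the CSV, though. – Geekman 2011-02-15 23:13:18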

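Depending on your time scales, I would (and do) simply use DBAmp by Forceamp. It presents itself as an OLE DB driver and so can be used as a linked server in SQL Server.

The standard usage of the tool is to use the supplied stored procedures to replicate/refresh the Salesforce schema into SQL Server. I do this in some very large environments, refreshing every 15 minutes without the runs overlapping.

DBAmp maintains the column types in the underlying SQL Server tables.

One final point: be careful with 15-char Salesforce IDs (SObject IDs). These are only unique when compared case-sensitively. Salesforce reports generally output 15-char IDs, but API dumps are generally 18-char case-insensitive IDs. More information on conversion etc. here. If you still see collisions when comparing case-sensitively, I would be inclined to think that it is down to some pre-processing performed on the file, or possibly an error in the report used for the export.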
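For illustration, here is a minimal sketch of the 15-to-18 character conversion as it is commonly described: the ID is split into three chunks of five characters, each chunk yields a 5-bit number recording which positions are uppercase letters, and each number picks one suffix character. The `SalesforceId` class name is hypothetical, and this is my understanding of the scheme rather than code taken from Salesforce.

```csharp
using System;
using System.Text;

public static class SalesforceId
{
    // Suffix alphabet: A-Z followed by 0-5 (32 symbols, one per 5-bit value).
    private const string SuffixAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345";

    // Converts a 15-character case-sensitive ID to the 18-character form.
    public static string To18(string id15)
    {
        if (id15 == null || id15.Length != 15)
            throw new ArgumentException("Expected a 15-character ID.", "id15");

        var suffix = new StringBuilder(3);
        for (int chunk = 0; chunk < 3; chunk++)
        {
            int bits = 0;
            for (int i = 0; i < 5; i++)
            {
                char c = id15[chunk * 5 + i];
                // Bit i is set when the i-th character of the chunk is an uppercase letter.
                if (c >= 'A' && c <= 'Z')
                    bits |= 1 << i;
            }
            suffix.Append(SuffixAlphabet[bits]);
        }
        return id15 + suffix.ToString();
    }
}
```

With this, two IDs that differ only by case, such as `00a000000000000` and `00A000000000000`, get different suffixes, so they stay distinct even in a case-insensitive column.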

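Further to your comment, Salesforce IDs are globally unique, i.e. they are not repeated across different customers' production orgs. So even if you are pulling records from multiple orgs, they should not collide. A full-copy sandbox org has the same IDs as its "master" production org.

If you are interested in using the API directly, check out the Salesforce.Net library, which is a good help in getting started.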

Source

2011-02-18 00:31:53

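Based on the fact that you are dealing with no more than 5,000 rows at a time, I would be inclined to use ADO.Net (probably SqlDataReader) to just get the data into objects. WRT the primary key: I don't know the details of the data Salesforce exports, but c.f. @superfell's comment. (Failing that, you could generate your own PK for the objects.)

I would then use the methods available on the List<T> class to filter/iterate through the rows, comparing successive fields and so on.

This is mostly motivated by the fact that my C# is better than my SQL ;-)

Good luck.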

Source

2011-02-15 22:32:24 5arx

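I have come up with the following possible solution. After implementing it and running it against my test CSV, I found only 15 duplicate ContactIDs.

Given that this is less than 1% of the current contacts, I consider it acceptable.

To get this to work, I had to make the columns supplied by the CSV equal to the columns imported by the application, otherwise the comparisons would obviously fail.

Here is the algorithm I have put together: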

    /* Algorithm: 
     * ---------- 
     * I'm making the assumption that the ContactID field is going to be unique, and if not, I will ignore any duplicates. 
     * The reason for this is that I don't see a way to update an existing database record with the changes in the CSV 
     * unless a relationship exists to indicate which CSV record relates to which database record. 
     * 
     * - Load DB table into memory 
     * - Load CSV records into memory 
     * - For each record in the CSV: 
     *   - Add this record's ContactID to a list of IDs which need to remain in the DB. 
     *     If it already exists in the list, we have a duplicate ID. Skip this record. 
     *   - Concatenate CSV column values into a single string, store for later comparison. 
     *   - Select the top record from the DB DataTable where the ContactID field in the DB record matches that in the CSV. 
     *   - If no DB record was found: 
     *     - Add this new record to the DB and move on to the next CSV record. 
     *   - Concatenate column values for the DB record and compare this to the string generated previously. 
     *   - If the strings match, skip any further processing. 
     *   - For each column in the CSV record: 
     *     - Compare against the value for the same column in the DB record. 
     *     - If values do not match, use a StringBuilder to add the column to the UPDATE query for this record. 
     * 
     * - Now we need to clean out the records from the DB which no longer exist in the CSV. Use the previously built list of ContactIDs. 
     * - For each record in the DB: 
     *   - If the ContactID in the DB record is not in the list, use a StringBuilder to add this ID to a DELETE statement, e.g. OR [ContactID] = ... 
     */
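The classification step of this algorithm can also be expressed without any database access, which makes it easy to check in isolation. The sketch below is only an illustration: `ImportPlan` and `ImportPlanner` are hypothetical names, and each row is assumed to have been reduced to a ContactID plus a comparison string of its concatenated column values. A HashSet and a dictionary replace the list lookups, so each ID check is a hash lookup rather than a scan.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type: the IDs that fall into each category.
public class ImportPlan
{
    public List<string> Inserts = new List<string>();
    public List<string> Updates = new List<string>();
    public List<string> Deletes = new List<string>();
    public List<string> SkippedDuplicates = new List<string>();
}

public static class ImportPlanner
{
    // csvRows: (ContactID, comparison string) pairs in file order; may contain duplicate IDs.
    // dbRows:  ContactID -> comparison string of the row currently in the database.
    public static ImportPlan Build(
        IEnumerable<KeyValuePair<string, string>> csvRows,
        IDictionary<string, string> dbRows)
    {
        var plan = new ImportPlan();
        // Ordinal comparison keeps IDs that differ only by case distinct.
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var csv in csvRows)
        {
            // Add returns false when the ID was already seen: a duplicate, so skip it.
            if (!seen.Add(csv.Key))
            {
                plan.SkippedDuplicates.Add(csv.Key);
                continue;
            }

            string dbValues;
            if (!dbRows.TryGetValue(csv.Key, out dbValues))
                plan.Inserts.Add(csv.Key);      // not in the DB yet
            else if (!string.Equals(dbValues, csv.Value, StringComparison.Ordinal))
                plan.Updates.Add(csv.Key);      // present but changed
        }

        // Anything in the DB that never appeared in the CSV is stale.
        foreach (var id in dbRows.Keys)
            if (!seen.Contains(id))
                plan.Deletes.Add(id);

        return plan;
    }
}
```

The three lists can then drive the INSERT, UPDATE and DELETE statements, while the skipped duplicates can be reported back to the user.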

Here is my implementation:

public class ContactImportService : ServiceBase 
{ 

    private DataTable csvData; 

    //... 

    public void DifferentialImport(Guid ID) 
    { 

     //This is a list of ContactIDs which we come across in the CSV during processing. 
     //Any records in the DB which have an ID not in this list will be deleted. 
     List<string> currentIDs = new List<string>(); 

     lock (syncRoot) 
     { 
      jobQueue[ID].TotalItems = (short)csvData.Rows.Count; 
      jobQueue[ID].Status = "Loading contact records"; 
     } 

     //Load existing data into memory from Database. 
     SqlConnection connection = 
      new SqlConnection(Utilities.ConnectionStrings["MyDataBase"].ConnectionString); 
     //Note the leading spaces: without them the concatenated clauses run together. 
     SqlCommand command = new SqlCommand("SELECT " + 
       "[ContactID],[FirstName],[LastName],[Title]" + 
       // Etc... 
       " FROM [Contact]" + 
       " ORDER BY [ContactID]", connection); 

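     connection.Open(); 
     DataTable dbData = new DataTable(); 
     //Dispose the reader once the table is loaded; CommandBehavior.CloseConnection closes the connection with it. 
     using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.CloseConnection)) 
     { 
      dbData.Load(reader); 
     } 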

     lock (syncRoot) 
     { 
      jobQueue[ID].Status = "Merging records"; 
     } 

     int affected = -1; 
     foreach (DataRow row in csvData.Rows) 
     { 
      string contactID = row["ContactID"].ToString(); 
      //Have we already processed a record with this ID? If so, skip it. 
      //(continue, not break: break would abandon every remaining CSV row.) 
      if (currentIDs.Contains(contactID)) 
       continue; 

      currentIDs.Add(contactID); 

      string csvValues = Utilities.GetDataRowString(row); 

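      //Get the first row from our DB DataTable with the same ID that we got previously. 
      //FirstOrDefault matches "select the top record"; SingleOrDefault would throw if the DB already held duplicate ContactIDs. 
      DataRow dbRecord = (from record in dbData.AsEnumerable() 
          where record.Field<string>("ContactID") == contactID 
          select record).FirstOrDefault(); 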

      //Found an ID not in the database yet... add it. 
      if (dbRecord == null) 
      { 
       command = new SqlCommand("INSERT INTO [Contact] " + 
        "... VALUES ...", connection); 
       connection.Open(); 
       affected = command.ExecuteNonQuery(); 
       connection.Close(); 
       lock (syncRoot) 
       { 
        if (affected < 1) 
         jobQueue[ID].FailedChanges++; 
        jobQueue[ID].ProcessedItems++; 
       } 
       //There is no DB record to compare against, so move on to the next CSV row. 
       continue; 
      } 

      //Compare the DB record with the CSV record: 
      string dbValues = Utilities.GetDataRowString(dbRecord); 

      //Values are identical, so nothing needs updating for this record. 
      if (csvValues == dbValues) 
       continue; 

      //Values are different, we need to update the DB to match. 

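      //TODO: Dynamically build the update query based on the specific columns which don't match using StringBuilder. 
      //The command needs the connection, otherwise ExecuteNonQuery throws. 
      command = new SqlCommand("UPDATE [Contact] SET ... WHERE [Contact].[ContactID] = @ContactID", connection); 
      //... 
      //The four-argument Add overload takes a source column name, not a value, so set Value explicitly. 
      command.Parameters.Add("@ContactID", SqlDbType.VarChar, 100).Value = contactID; 
      connection.Open(); 
      affected = command.ExecuteNonQuery(); 
      connection.Close(); 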

      //Update job counters. 
      lock (syncRoot) 
      { 
       if (affected < 1) 
        jobQueue[ID].FailedChanges++; 
       else 
        jobQueue[ID].UpdatedItems++; 
       jobQueue[ID].ProcessedItems++; 
      } 

     } // CSV Rows 

     //The merge loop has finished; only now does the job move on to the cleanup stage. 
     lock (syncRoot) 
     { 
      jobQueue[ID].Status = "Deleting old records"; 
     } 

     //Now that we know all of the Contacts which exist in the CSV currently, use the list of IDs to build a DELETE query 
     //which removes old entries from the database. 
     StringBuilder deleteQuery = new StringBuilder("DELETE FROM [Contact] WHERE "); 

     //Find all the ContactIDs which are listed in our DB DataTable, but not found in our list of current IDs. 
     List<string> dbIDs = (from record in dbData.AsEnumerable() 
           where currentIDs.IndexOf(record.Field<string>("ContactID")) == -1 
           select record.Field<string>("ContactID")).ToList(); 

     if (dbIDs.Count != 0) 
     { 
      command = new SqlCommand(); 
      command.Connection = connection; 
      for (int i = 0; i < dbIDs.Count; i++) 
      { 
       deleteQuery.Append(i != 0 ? " OR " : ""); 
       deleteQuery.Append("[Contact].[ContactID] = @" + i.ToString()); 
       //Set Value explicitly: the four-argument Add overload expects a source column name there. 
       command.Parameters.Add("@" + i.ToString(), SqlDbType.VarChar, 100).Value = dbIDs[i]; 
      } 
      command.CommandText = deleteQuery.ToString(); 

      connection.Open(); 
      affected = command.ExecuteNonQuery(); 
      connection.Close(); 
     } 

     lock (syncRoot) 
     { 
      jobQueue[ID].Status = "Finished"; 
     } 

     remove(ID); 

    } 

}
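One thing to watch in the DELETE above is that it adds one parameter per stale ID, and SQL Server limits a single command to 2,100 parameters. With up to 5,000 rows, a large cleanup could exceed that. A possible workaround, sketched here with hypothetical names (`DeleteBatcher`, `@id0`, `@id1`, ...), is to split the IDs into batches and build one `IN (...)` statement per batch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeleteBatcher
{
    // Splits ids into consecutive batches of at most batchSize items.
    public static List<List<string>> Split(IList<string> ids, int batchSize)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");

        var batches = new List<List<string>>();
        for (int start = 0; start < ids.Count; start += batchSize)
            batches.Add(ids.Skip(start).Take(batchSize).ToList());
        return batches;
    }

    // Builds the parameterised statement text for a batch of the given size.
    // The values themselves are bound separately as @id0, @id1, ...
    public static string BuildDeleteText(int count)
    {
        if (count < 1) throw new ArgumentOutOfRangeException("count");

        var names = Enumerable.Range(0, count).Select(i => "@id" + i);
        return "DELETE FROM [Contact] WHERE [ContactID] IN ("
            + string.Join(", ", names) + ")";
    }
}
```

For each batch, the command text would come from `BuildDeleteText(batch.Count)` and each value would be bound with `command.Parameters.Add("@id" + i, SqlDbType.VarChar, 100).Value = batch[i];`. A batch size of 2,000 leaves some headroom under the limit.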

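The SqlDataReader seems to be sufficient; it is the individual update queries that take up the bulk of the time, and all other operations are negligible in comparison.

I would say that at this point it takes about 30 seconds to do a fresh import, where all of the records have to be inserted. With the progress feedback I have implemented, this is fast enough for the end user.

I have not measured any resource usage yet.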

Source

2011-02-18 02:41:11 Geekman
