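I have a database procedure that processes some XML and normalizes it into multiple tables. I'm not a huge database guy and don't know much about database locking, so I mostly just want to know whether anything I'm doing is completely wrong. How can I process a large amount of XML without locking up the database, and with the best performance?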
First, there is a results table that contains something like the following:
resultId | computerId (int) | rawData (xml)
-------------------------------------------
1 | 1 | <installedSoftware><software name="Google Chrome" version="1.0" /><software name="Mozilla Firefox" version="3.0" /></installedSoftware>
2 | 2 | <installedSoftware><software name="Internet Explorer" version="6" /><software name="Google Chrome" version="1.0" /></installedSoftware>
My stored procedure looks like this:
CREATE TABLE #ResultsToProcess
(
    resultId INT
)
-- Only trying to process 1000 results at a time. If I try to do too many at once, I sometimes get timeouts. Instead I do small chunks.
INSERT INTO #ResultsToProcess (resultId)
SELECT
    TOP 1000
    resultId
FROM
    results
CREATE TABLE #TempSoftware
(
computerId INT,
softwareName NVARCHAR(MAX),
softwareVersion NVARCHAR(MAX)
)
INSERT INTO #TempSoftware (computerId, softwareName, softwareVersion)
SELECT DISTINCT
    results.computerId,
    T.N.value('(@name[1])', 'NVARCHAR(MAX)') AS softwareName,
    T.N.value('(@version[1])', 'NVARCHAR(MAX)') AS softwareVersion
FROM
    results
    INNER JOIN #ResultsToProcess ON results.resultId = #ResultsToProcess.resultId
    CROSS APPLY results.rawData.nodes('/installedSoftware[1]/software') AS T(N)
-- May need to do some additional processing on the temporary data before actually using it.
-- To reduce duplicate data, we insert into a full list of software. There is an index based on softwareName and softwareVersion. The softwareTable has an auto increment int primary key.
INSERT INTO software(softwareName,softwareVersion)
SELECT DISTINCT softwareName, softwareVersion FROM #TempSoftware
WHERE NOT EXISTS(SELECT 1 FROM software WHERE software.softwareName = #TempSoftware.softwareName AND software.softwareVersion = #TempSoftware.softwareVersion)
-- Finally we will link any software to the computer. However in this case, the temp table does not have any indexes. Would it be worth-while to add some?
INSERT INTO computer_software(computerId,softwareId)
SELECT
    #TempSoftware.computerId,
    software.softwareId
FROM #TempSoftware INNER JOIN software ON #TempSoftware.softwareName = software.softwareName AND #TempSoftware.softwareVersion = software.softwareVersion
In addition to this, the procedure will also process other computer-based properties, all from the same table and column.
My questions about this code are:
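While this processing is running, new rows can be added to the results table (results.rawData) at any time. Selecting from the XML nodes to build my temp table takes a while, and I'm worried that anything trying to insert into the table while that is happening may be forced to wait. By collecting a batch of resultId values at the start of the procedure, I'm trying to fix the range of data that the procedure will work on in one run.

During this processing, the other tables may also be queried (for example, to look up which software exists on a computer). Since I'm only doing a short batch insert into those tables, I'm assuming there should be no problem there.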
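One way the concern about holding up inserts on results could be explored is to copy only the batch's raw XML into a private temp table first, and shred the XML from that copy. The following is a minimal sketch, not a tested solution: #BatchXml is a hypothetical name that is not part of the original procedure, and the sketch assumes results, #ResultsToProcess and #TempSoftware exist as defined above. Whether it actually reduces blocking depends on the isolation level and workload, and would need to be measured.

```sql
-- Sketch only: #BatchXml is a hypothetical helper table, not part of the original procedure.
-- Assumes results, #ResultsToProcess and #TempSoftware exist as defined in the procedure.

-- Step 1: copy just this batch's XML out of results.
-- This is the only statement that reads the busy table, and it does no XML shredding.
CREATE TABLE #BatchXml
(
    computerId INT,
    rawData XML
);

INSERT INTO #BatchXml (computerId, rawData)
SELECT r.computerId, r.rawData
FROM results AS r
INNER JOIN #ResultsToProcess AS p ON r.resultId = p.resultId;

-- Step 2: shred the XML from the private copy.
-- This statement does not touch results at all.
INSERT INTO #TempSoftware (computerId, softwareName, softwareVersion)
SELECT DISTINCT
    b.computerId,
    T.N.value('(@name)[1]', 'NVARCHAR(MAX)'),
    T.N.value('(@version)[1]', 'NVARCHAR(MAX)')
FROM #BatchXml AS b
CROSS APPLY b.rawData.nodes('/installedSoftware[1]/software') AS T(N);
```

The design choice here is to trade an extra copy of the XML for a shorter read against results; if the copy itself turns out to be expensive, this may not be a net win.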
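The #TempSoftware table has no indexes, and I'm doing a join on two NVARCHAR(MAX) columns. Is it worth creating an index on this table, or would the overhead of creating the index be worse than the join itself?

Am I doing anything stupid here that I should be smacked for?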
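For reference, an indexed variant of the temp table might look roughly like the sketch below. NVARCHAR(MAX) columns cannot be used as index key columns in SQL Server, so the sketch first gives the columns bounded lengths. The lengths (200 and 100) and the index name are assumptions made for illustration, not values taken from the real data.

```sql
-- Sketch only: the column lengths and the index name are assumptions.
-- NVARCHAR(MAX) cannot be an index key column, so bounded lengths are used instead.
CREATE TABLE #TempSoftware
(
    computerId INT,
    softwareName NVARCHAR(200),
    softwareVersion NVARCHAR(100)
);

-- ... load #TempSoftware here with the INSERT ... SELECT from the procedure ...

-- Build the index once, after the load, so the rows are sorted a single time
-- instead of the index being maintained row by row during the insert.
CREATE CLUSTERED INDEX IX_TempSoftware_NameVersion
    ON #TempSoftware (softwareName, softwareVersion);
```

With these assumed lengths the key is at most 600 bytes, which stays under the 900-byte limit for a clustered index key. Whether the index pays for itself on a batch of this size is exactly the open question, so comparing the execution plans with and without it would be the way to find out.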
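Thanks for any advice. Again, I'm not a big database guy. I'm assuming that doing all of the processing directly in the database is better than pulling the data back into C#, processing it there, and re-inserting it into the database.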