2016-11-10 58 views
3

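Slow execution in U-SQL

I created a simple script that computes similarity scores between two strings. The U-SQL script and the .NET code-behind are below.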

CN_Matcher.usql:

REFERENCE ASSEMBLY master.FuzzyString; 

@searchlog = 
     EXTRACT ID int, 
       Input_CN string, 
       Output_CN string 
     FROM "/CN_Matcher/Input/sample.txt" 
     USING Extractors.Tsv(); 

@CleansCheck = 
    SELECT ID,Input_CN, Output_CN, CN_Validator.trial.cleanser(Input_CN) AS Input_CN_Cleansed, 
      CN_Validator.trial.cleanser(Output_CN) AS Output_CN_Cleansed 
    FROM @searchlog; 

@CheckData= SELECT ID,Input_CN, Output_CN, Input_CN_Cleansed, Output_CN_Cleansed, 
        CN_Validator.trial.Hamming(Input_CN_Cleansed, Output_CN_Cleansed) AS HammingScore, 
        CN_Validator.trial.LevinstienDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS LevinstienDistance, 
        FuzzyString.ComparisonMetrics.JaroWinklerDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS JaroWinklerDistance 
             FROM @CleansCheck; 

OUTPUT @CheckData 
    TO "/CN_Matcher/CN_Full_Run.txt" 
    USING Outputters.Tsv(); 

CN_Matcher.usql.cs:

using Microsoft.Analytics.Interfaces; 
using Microsoft.Analytics.Types.Sql; 
using System; 
using System.Collections.Generic; 
using System.IO; 
using System.Linq; 
using System.Text; 

namespace CN_Validator 
{ 
    public static class trial 
    { 

     public static string cleanser(string val) 
     { 
      List<string> wordsToRemove = "l.p. registered pc bldg pllc lp. l.c. div. national l p l.l.c international r. limited school azioni joint co-op corporation corp., (corp) inc., societa company llp liability l.l.l.p llc bancorporation manufacturing c dst (inc) jv ltd. llc. technology ltd., s.a. mfg rllp incorporated per venture l.l.p c. p.l.l.c l.p.. p. partnership corp co-operative s.p.a tech schl bancorp association lllp n r ltd inc. l.l.p. p.c. co district int intl assn. sa inc l.p co, co. division lc intl. lp professional corp. a l. l.l.c. building r.l.l.p co.,".Split(' ').ToList(); 
      return string.Join(" ", val.ToLower().Split(' ').Except(wordsToRemove)); 
     } 

     public static int Hamming(string source, string target) 
     { 
      int distance = 0; 
      if (source.Length == target.Length) 
      { 
       for (int i = 0; i < source.Length; i++) 
       { 
        if (!source[i].Equals(target[i])) 
        { 
         distance++; 
        } 
       } 
       return distance; 
      } 
      else { return 99999; } 
     } 

     public static int LevinstienDistance(string source, string target) 
     { 
      int n = source.Length; 
      int m = target.Length; 
      int[,] d = new int[n + 1, m + 1]; // matrix 
      int cost; // cost 
      // Step 1 
      if (n == 0) return m; 
      if (m == 0) return n; 
      for (int i = 0; i <= n; d[i, 0] = i++) ; 
      for (int j = 0; j <= m; d[0, j] = j++) ; 
      for (int i = 1; i <= n; i++) 
      { 
       for (int j = 1; j <= m; j++) 
       { 
        cost = (target.Substring(j - 1, 1) == source.Substring(i - 1, 1) ? 0 : 1); 
        d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), 
           d[i - 1, j - 1] + cost); 
       } 
      } 
      return d[n, m]; 
     } 

    } 
} 

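I ran a sample batch of 100 inputs with parallelism set to 1 and priority set to 1000. The job completed in 1.6 minutes.

I then wanted to test the same job with 1000 inputs, again with parallelism 1 and priority 1000. Since 100 inputs took 1.6 minutes, I estimated that 1000 inputs would take around 20 minutes, but the job ran for more than 50 minutes and I did not see any progress.

So I submitted one more job with 100 inputs, and it ran in the same time as before. I then increased the parallelism to 3 and ran it again, and it did not complete even after 1 hour.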

JOB_ID = 07c0850d-0770-4430-a288-5cddcfc26699

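The main problem is that I cannot see any progress or status.

Please let me know if I am doing something wrong.

Also, is there any way to use a constructor in U-SQL? If I could do that, I would not need to repeat the same cleansing setup again and again.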
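On the constructor question, one option to consider is a static field in the code-behind class, which C# initializes once when the class is first used rather than on every call. The following is a minimal sketch, not a tested fix for the slow job. The names `CleanserSketch`, `WordsToRemove` and `Cleanse` are hypothetical, the word list is shortened for illustration, and the sketch assumes the U-SQL runtime keeps the class loaded while it processes rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace CN_Validator
{
    // Hypothetical class name; not part of the original code-behind.
    public static class CleanserSketch
    {
        // Initialized once, when the class is first used, instead of on every call.
        // Shortened list for illustration; the full list is in cleanser() above.
        private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
            "l.p. registered pc bldg pllc inc inc. ltd ltd. llc corp corp. co co. limited company"
                .Split(' '));

        public static string Cleanse(string val)
        {
            // Assumption: treat null or empty input as an empty result
            // (the original cleanser() would throw on null).
            if (string.IsNullOrEmpty(val)) return string.Empty;

            // Where + Contains keeps word order and repeated words.
            return string.Join(" ",
                val.ToLower().Split(' ').Where(w => !WordsToRemove.Contains(w)));
        }
    }
}
```

One behavioral difference to be aware of: `Enumerable.Except` in the original `cleanser` is a set operation, so it also drops repeated words from the input, whereas the `Where` filter here keeps them. Scores for names with repeated words may therefore differ between the two versions.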

Answers

2

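I assume you are using the file set syntax to specify 1000 files? Unfortunately, the current default implementation of file sets does not scale well, and the compilation (preparation) phase will take a long time, as will the execution. We currently have a better implementation in preview. Could you send me an email at usql at microsoft dot com? I will tell you how to try out the preview implementation.

Thanks, Michael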

+0

Hi Michael, it is not 1000 files; it is one file with 1000 inputs. I will mail you. Thanks for the response. – The6thSense

0

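I would look at doing this in a more set-based way. For example, instead of holding the words to remove in the code-behind file, hold them in a U-SQL table, so they are easy to add to: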

CREATE TABLE IF NOT EXISTS dbo.wordsToRemove 
(
    word string, 

    INDEX cdx_wordsToRemvoe CLUSTERED (word ASC) 
    DISTRIBUTED BY HASH (word) 
); 

INSERT INTO dbo.wordsToRemove (word) 
SELECT word 
FROM (
VALUES 
    ("l.p."), 
    ("registered"), 
    ("pc"), 
    ("bldg"), 
    ("pllc"), 
    ("lp."), 
    ("l.c."), 
    ("div."), 
    ("national"), 
    ("l"), 
    ("p"), 
    ("l.l.c"), 
    ("international"), 
    ("r."), 
    ("limited"), 
    ("school"), 
    ("azioni"), 
    ("joint"), 
    ("co-op"), 
    ("corporation"), 
    ("corp.,"), 
    ("(corp)"), 
    ("inc.,"), 
    ("societa"), 
    ("company"), 
    ("llp"), 
    ("liability"), 
    ("l.l.l.p"), 
    ("llc"), 
    ("bancorporation"), 
    ("manufacturing"), 
    ("c"), 
    ("dst"), 
    ("(inc)"), 
    ("jv"), 
    ("ltd."), 
    ("llc."), 
    ("technology"), 
    ("ltd.,"), 
    ("s.a."), 
    ("mfg"), 
    ("rllp"), 
    ("incorporated"), 
    ("per"), 
    ("venture"), 
    ("l.l.p"), 
    ("c."), 
    ("p.l.l.c"), 
    ("l.p.."), 
    ("p."), 
    ("partnership"), 
    ("corp"), 
    ("co-operative"), 
    ("s.p.a"), 
    ("tech"), 
    ("schl"), 
    ("bancorp"), 
    ("association"), 
    ("lllp"), 
    ("n"), 
    ("r"), 
    ("ltd"), 
    ("inc."), 
    ("l.l.p."), 
    ("p.c."), 
    ("co"), 
    ("district"), 
    ("int"), 
    ("intl"), 
    ("assn."), 
    ("sa"), 
    ("inc"), 
    ("l.p"), 
    ("co,"), 
    ("co."), 
    ("division"), 
    ("lc"), 
    ("intl."), 
    ("lp"), 
    ("professional"), 
    ("corp."), 
    ("a"), 
    ("l."), 
    ("l.l.c."), 
    ("building"), 
    ("r.l.l.p"), 
    ("co.,") 
) AS words(word); 

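Then, to do the comparison, I split the original phrases up, remove the words we do not want, and put the phrases back together again, something like this: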

//DECLARE @inputFile string = "input/input.csv"; // 500 companies, Standard & Poor 500 companies from wikipedia 
DECLARE @inputFile string = "input/input2.csv"; // 850,000 companies, part 1 of extract from Companies House 


@searchlog = 
    EXTRACT id int, 
      Input_CN string, 
      Output_CN string 
    FROM @inputFile 
    USING Extractors.Csv(silent : true); 
    //USING Extractors.Csv(skipFirstNRows:1); 


// Split the input string to remove unwanted words 
@Input_CN = 
    SELECT id, 
      new SQL.ARRAY<string>(Input_CN.Split(' ')) AS splitWords 
    FROM @searchlog; 


@Output_CN = 
    SELECT id, 
      new SQL.ARRAY<string>(Output_CN.Split(' ')) AS splitWords 
    FROM @searchlog; 


// Remove unwanted words from input string 
@Input_CN = 
    SELECT * 
    FROM 
    (
     SELECT o.id, 
       x.splitWord.ToLower() AS splitWord 
     FROM @Input_CN AS o 
      CROSS APPLY 
       EXPLODE(splitWords) AS x(splitWord) 
    ) AS y  
    ANTISEMIJOIN 
     dbo.wordsToRemove AS w 
    ON y.splitWord == w.word; 

// Remove unwanted words from output string 
@Output_CN = 
    SELECT * 
    FROM 
    (
     SELECT o.id, 
       x.splitWord.ToLower() AS splitWord 
     FROM @Output_CN AS o 
      CROSS APPLY 
       EXPLODE(splitWords) AS x(splitWord) 
    ) AS y 
    ANTISEMIJOIN 
     dbo.wordsToRemove AS w 
    ON y.splitWord == w.word; 




// Put the input string back together again 
@Input_CN = 
    SELECT id, 
      String.Join(" ", ARRAY_AGG (splitWord)) AS Input_CN_Cleansed 
    FROM @Input_CN 
    GROUP BY id; 


@Output_CN = 
    SELECT id, 
      String.Join(" ", ARRAY_AGG (splitWord)) AS Output_CN_Cleansed 
    FROM @Output_CN 
    GROUP BY id; 



@output = 
    SELECT i.id, 
      i.Input_CN_Cleansed, 
      o.Output_CN_Cleansed, 
      CN_Validator.trial.Hamming(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS HammingScore, 
      CN_Validator.trial.LevinstienDistance(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS LevinstienDistance 
    FROM @Input_CN AS i 
     INNER JOIN 
      @Output_CN AS o 
     ON i.id == o.id; 



OUTPUT @output 
    TO "/output/output.csv" 
    USING Outputters.Csv(); 

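I found the performance similar, but the design is arguably more maintainable. In any case, my code takes only a few minutes to run over 850k+ records, not 50+ minutes, so maybe there is another problem. NB: I did not have the FuzzyString library, so it was not included in my tests; that could account for the difference.

If you get an update from Microsoft on this, please post back to this thread, or even mark it as the answer if you like.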
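One other thing that might be worth ruling out is the code-behind itself: `LevinstienDistance` in the question allocates two one-character strings via `Substring` for every cell of the matrix. The following is a minimal sketch of the same dynamic-programming recurrence that compares characters directly and keeps only two rows. The names `DistanceSketch` and `Levenshtein` are hypothetical, and this has not been measured against the original job, so treat it as something to try rather than a confirmed cause.

```csharp
using System;

namespace CN_Validator
{
    // Hypothetical class name; not part of the original code-behind.
    public static class DistanceSketch
    {
        public static int Levenshtein(string source, string target)
        {
            int n = source.Length;
            int m = target.Length;
            if (n == 0) return m;
            if (m == 0) return n;

            // Only the previous and current rows of the matrix are needed.
            int[] previous = new int[m + 1];
            int[] current = new int[m + 1];
            for (int j = 0; j <= m; j++) previous[j] = j;

            for (int i = 1; i <= n; i++)
            {
                current[0] = i;
                for (int j = 1; j <= m; j++)
                {
                    // Compare chars directly; no Substring allocations.
                    int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                    current[j] = Math.Min(
                        Math.Min(previous[j] + 1, current[j - 1] + 1),
                        previous[j - 1] + cost);
                }

                // Swap the rows for the next iteration.
                int[] swap = previous;
                previous = current;
                current = swap;
            }

            // After the final swap, the last computed row is in 'previous'.
            return previous[m];
        }
    }
}
```

It is meant to return the same distances as the original method, while using memory proportional to the length of `target` rather than to the product of the two lengths.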

+0

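If I get a resolution on this issue, I will definitely post it here. Thanks for the code revamp. Since normalizing data in SQL is not advised, I thought of doing this in .NET, but your code looks more maintainable, and it looks like you are using the full power of U-SQL. – The6thSense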
