2016-11-26
1

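I am quite new to U-SQL and am trying to write a U-SQL script that searches for a string, then groups by that string and gets the count of distinct files.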

str1 = \global\europe\Moscow\12345\File1.txt

str2 = \global.bee.com\europe\Moscow\12345\File1.txt

str3 = \global\europe\Amsterdam\54321\File1.Rvt

str4 = \global.bee.com\europe\amsterdam\12345\File1.Rvt

Case 1: How do I get "\europe\Moscow\12345\File1.txt" from str1 and str2, then group by "\global\europe\Moscow\12345" and take the count of distinct files under the path "\europe\Moscow\12345\"?

So the output would look like this:

distinct_filesby_Location_Date

To solve the case above I tried the U-SQL code below, but I am not sure whether I am writing the script correctly:

@inArray = SELECT new SQL.ARRAY<string>(
       filepath.Contains("\\europe")) AS path 
    FROM @t; 

@filesbyloc = 
    SELECT [ID], 
     path.Trim() AS path1 
    FROM @inArray 
    CROSS APPLY 
    EXPLODE(path1) AS r(location); 

OUTPUT @filesbyloc 
TO "/Outputs/distinctfilesbylocation.tsv" 
USING Outputters.Tsv(); 

Any help would be much appreciated.

Answers

1

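One way to do this is to put all the strings you want to process in a file, e.g. strings.txt, and save it in your U-SQL input folder. Also create a file with the cities you want to match, e.g. cities.txt. Then try the following U-SQL script: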

@input = 
    EXTRACT filepath string 
    FROM "/input/strings.txt" 
    USING Extractors.Tsv(); 

// Give the strings a row-number 
@input = 
    SELECT ROW_NUMBER() OVER() AS rn, 
      filepath 
    FROM @input; 


// Get the cities 
@cities = 
    EXTRACT city string 
    FROM "/input/cities.txt" 
    USING Extractors.Tsv(); 

// Ensure there is a lower-case version of city for matching/joining 
@cities = 
    SELECT city, 
      city.ToLower() AS lowercase_city 
    FROM @cities; 


// Split the filepath into an array of path elements 
@working = 
    SELECT rn, 
      new SQL.ARRAY<string>(filepath.Split('\\')) AS pathElement 
    FROM @input AS i; 

// Explode the array into one row per path element, also changing to lower case 
@working = 
    SELECT rn, 
      x.pathElement.ToLower() AS pathElement 
    FROM @working AS i 
     CROSS APPLY 
      EXPLODE(pathElement) AS x(pathElement); 


// Create the output query, joining on the lower-case city name but displaying the normal-case name 
@output = 
    SELECT c.city, 
      COUNT(*) AS records 
    FROM @working AS w 
     INNER JOIN 
      @cities AS c 
     ON w.pathElement == c.lowercase_city 
    GROUP BY c.city; 


// Output the result 
OUTPUT @output TO "/output/output.txt" 
USING Outputters.Tsv(); 

//OUTPUT @working TO "/output/output2.txt" 
//USING Outputters.Tsv(); 

My results:

My output file results

HTH
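
For readers who want to sanity-check the expected counts before submitting a job, the split, explode and join steps of the script above can be mirrored in a few lines of Python. This is only a sketch of the same logic, not U-SQL. It assumes strings.txt holds the four sample paths from the question and that cities.txt lists Moscow and Amsterdam; the function name `count_by_city` is made up for illustration.

```python
from collections import Counter

# The four sample paths from the question (assumed contents of strings.txt).
paths = [
    r"\global\europe\Moscow\12345\File1.txt",
    r"\global.bee.com\europe\Moscow\12345\File1.txt",
    r"\global\europe\Amsterdam\54321\File1.Rvt",
    r"\global.bee.com\europe\amsterdam\12345\File1.Rvt",
]

# Assumed contents of cities.txt.
cities = ["Moscow", "Amsterdam"]


def count_by_city(paths, cities):
    """Count path elements that match a known city, ignoring case."""
    # Lower-case city -> display name, like the lowercase_city column.
    lookup = {city.lower(): city for city in cities}
    counts = Counter()
    for path in paths:
        # Mirrors Split('\\') followed by CROSS APPLY EXPLODE.
        for element in path.split("\\"):
            city = lookup.get(element.lower())
            if city is not None:
                counts[city] += 1
    return dict(counts)


print(count_by_city(paths, cities))
```

As in the join, a path element only counts when it equals a city name exactly (after lower-casing), so "amsterdam" and "Amsterdam" fall into the same group.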

+1

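Thank you very much, wBob, you really made my work easy; I had just been googling for some way to do this. One more thing, Bob: if you looked at my output link, you will have seen two fields, "Location" and "Date", meaning the count of files by location and date. How can that be added to the solution you provided above? Please advise. Once again, thank you very much for replying to my post so quickly :-)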

+0

Great, you should consider marking it as the answer! – wBob

+0

Where is the date? It is not clear from your sample data. Is it in the file name, or do you need to collect it from the file itself? – wBob

1

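Treating the free-form input file as a TSV file, and without knowing the semantics of all the columns, here is one way to write the query. Note that I made some assumptions, which are stated in the comments.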

@d = 
    EXTRACT path string, 
      user string, 
      num1 int, 
      num2 int, 
      start_date string, 
      end_date string, 
      flag string, 
      year int, 
      s string, 
      another_date string 
    FROM @"\users\temp\citypaths.txt" 
    USING Extractors.Tsv(encoding: Encoding.Unicode); 

// I assume that you have only one DateTime format culture in your file. 
// If it becomes dependent on the region or city as expressed in the path, you need to add a lookup. 
@d = 
SELECT new SqlArray<string>(path.Split('\\')) AS steps, 
     DateTime.Parse(end_date, new CultureInfo("fr-FR", false)).Date.ToString("yyyy-MM-dd") AS end_date 
FROM @d; 

// This assumes your paths have a fixed formatting/mapping into the city 
@d = 
SELECT steps[4].ToLowerInvariant() AS city, 
     end_date 
FROM @d; 

@res = 
SELECT city, 
     end_date, 
     COUNT(*) AS count 
FROM @d 
GROUP BY city, 
     end_date; 

OUTPUT @res 
TO "/output/result.csv" 
USING Outputters.Csv(); 

// Now let's pivot the date and count. 

@res2 = 
SELECT city, MAP_AGG(end_date, count) AS date_count 
FROM @res 
GROUP BY city; 

// This assumes you know exactly which dates you are looking for. Otherwise keep it in the first file representation. 
@res2 = 
SELECT city, 
     date_count["2016-11-21"] AS [2016-11-21], 
     date_count["2016-11-22"] AS [2016-11-22] 
FROM @res2; 

OUTPUT @res2 
TO "/output/res2.csv" 
USING Outputters.Csv(); 
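
The grouping step of this script (city taken from a fixed position in the path, date normalised to yyyy-MM-dd, then a count per city and date) can be mirrored in Python to check the expected numbers locally. This is a sketch under assumptions, not the script itself: the rows are hypothetical, the dates are assumed to be date-only strings in day/month/year order, and `count_by_city_and_date` is an illustrative name.

```python
from collections import Counter
from datetime import datetime

# Same fixed position as steps[4] in the script.
CITY_INDEX = 4

# Hypothetical (path, end_date) rows; the real file layout is not shown in the thread.
rows = [
    (r"\\server\share\Istanbul\proj\a.rvt", "21/11/2016"),
    (r"\\server\share\istanbul\proj\b.rvt", "21/11/2016"),
    (r"\\server\share\Belfast\proj\c.rvt", "22/11/2016"),
]


def count_by_city_and_date(rows):
    """Group rows by (lower-case city, ISO date) and count them."""
    counts = Counter()
    for path, end_date in rows:
        steps = path.split("\\")
        city = steps[CITY_INDEX].lower()
        # Assumed day/month/year input, normalised to yyyy-MM-dd.
        day = datetime.strptime(end_date, "%d/%m/%Y").strftime("%Y-%m-%d")
        counts[(city, day)] += 1
    return dict(counts)


print(count_by_city_and_date(rows))
```

If the city sits at a different depth in your paths, only `CITY_INDEX` needs to change, which is the same adjustment as changing the index in `steps[4]`.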

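Update after receiving some sample data by private email:

Based on the data you sent me, you want to pivot the rowset city, count, date into the rowset date, city1, city2, ..., where each row contains the date and the counts for each city. The extraction and counting of the cities can be done either with a join as outlined in Bob's answer, in which case you need to know your cities in advance, or by taking the string at the city position in the path as in my example, in which case you do not.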

You can easily adapt my example above by changing the @res2 calculation as follows:

// Now let's pivot the city and count. 
@res2 = SELECT end_date, MAP_AGG(city, count) AS city_count 
     FROM @res 
     GROUP BY end_date; 

// This assumes you know exactly which cities you are looking for. Otherwise keep it in the first file representation or use a script generation (see below). 
@res2 = 
SELECT end_date, 
     city_count["istanbul"] AS istanbul, 
     city_count["midlands"] AS midlands, 
     city_count["belfast"] AS belfast, 
     city_count["acoustics"] AS acoustics, 
     city_count["amsterdam"] AS amsterdam 
FROM @res2; 
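
The two-step pivot (aggregate city and count into a map per date, then look up each known city by key) can be illustrated with a small Python sketch. It is not U-SQL, the rows are hypothetical, and `pivot_by_date` is an illustrative name; in this sketch a city with no rows for a given date comes out as `None`.

```python
from collections import defaultdict

# Hypothetical (city, end_date, count) rows, shaped like @res.
res = [
    ("istanbul", "2016-11-21", 2),
    ("belfast", "2016-11-21", 1),
    ("belfast", "2016-11-22", 3),
]


def pivot_by_date(rows, cities):
    """Return one wide row per date with a column for each listed city."""
    # Step 1: one city -> count map per date (the MAP_AGG step).
    per_date = defaultdict(dict)
    for city, end_date, count in rows:
        per_date[end_date][city] = count
    # Step 2: look up every known city by key to build the wide rows.
    return [
        {"end_date": day, **{city: city_map.get(city) for city in cities}}
        for day, city_map in sorted(per_date.items())
    ]


for row in pivot_by_date(res, ["istanbul", "belfast"]):
    print(row)
```

The list of cities passed in plays the same role as the enumerated columns in the SELECT: anything not listed is silently dropped from the output.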

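Note that in my example you need to enumerate all the cities in the pivot statement by looking them up in the SQL.MAP column. If they are not known in advance, you will first have to submit a script that creates the script for you. For example, assuming your city, count, date rowset is in a file (or you could duplicate the statements that produce the rowset in both the generating script and the generated script), you could write it as the following script. Then submit the result as the actual processing script.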

// Get the rowset (could also be the actual calculation from the original file) 
@in = EXTRACT city string, count int?, date string 
     FROM "/users/temp/Revit_Last2Months_Results.tsv" 
     USING Extractors.Tsv(); 

// Generate the statements for the preparation of the data before the pivot 
@stmts = SELECT * FROM (VALUES 
        ("@s1", "EXTRACT city string, count int?, date string FROM \"/users/temp/Revit_Last2Months_Results.tsv\" USING Extractors.Tsv();"), 
        ("@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;") 
       ) AS T(stmt_name, stmt); 

// Now generate the statement doing the pivot 
@cities = SELECT DISTINCT city FROM @in; 

@pivots = 
SELECT "@s3" AS stmt_name, "SELECT date, "+String.Join(", ", ARRAY_AGG("city_count[\""+city+"\"] AS ["+city+"]"))+ " FROM @s2;" AS stmt 
FROM @cities; 

// Now generate the OUTPUT statement after the pivot. Note that the OUTPUT does not have a statement name. 
@output = 
SELECT "OUTPUT @s3 TO \"/output/pivot_gen.tsv\" USING Outputters.Tsv();" AS stmt 
FROM (VALUES(1)) AS T(x); 

// Now put the statements into one rowset. Note that nulls sort high in U-SQL, so the OUTPUT row (null stmt_name) ends up last 
@result = 
SELECT stmt_name, "=" AS assign, stmt FROM @stmts 
UNION ALL SELECT stmt_name, "=" AS assign, stmt FROM @pivots 
UNION ALL SELECT (string) null AS stmt_name, (string) null AS assign, stmt FROM @output; 

// Now output the statements in order of the stmt_name 
OUTPUT @result 
TO "/pivot.usql" 
ORDER BY stmt_name 
USING Outputters.Text(delimiter:' ', quoting:false); 

Now download the generated script and submit it.
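
To see what the generated pivot statement should look like before running the generator, the string-building step can be reproduced in Python. This is a sketch of the same concatenation, not part of the U-SQL job; `pivot_statement` and its parameters are illustrative names, and the city list is hypothetical.

```python
def pivot_statement(cities, source="@s2", target="@s3"):
    """Build the U-SQL pivot SELECT text for the distinct cities given."""
    # One map lookup per distinct city, aliased to a bracketed column name.
    columns = ", ".join(
        f'city_count["{city}"] AS [{city}]' for city in sorted(set(cities))
    )
    return f"{target} = SELECT date, {columns} FROM {source};"


print(pivot_statement(["belfast", "istanbul", "belfast"]))
```

Comparing this text with the `@s3` line in the downloaded pivot.usql is a quick way to confirm that every city in the data received its own column.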

+0

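Hi Michael, thanks for your comment. I tried to apply the code you suggested above; it may be a solution, but it did not give me the result I expected for my requirement. If you can share your "email ID", I can send you the details, because the space for sharing details here is very limited.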

+0

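You can contact me via usql at Microsoft. One thing I recommend is to look at the code and identify the differences between my assumptions and your scenario, so you can determine where the sample needs to change.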
