2014-03-03 55 views
1

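Split a string into multiple columns with BigQuery

I have a table with millions of rows in BigQuery, and I want to split the adx_catg_id column into multiple new columns. Note that the adx_catg_id column contains an arbitrary number of words separated by spaces.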

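The example query below splits adx_catg_id into multiple columns, provided the string contains no more than five words. I could extend it to support more words, but I need to automate it.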

SELECT 
    TS, str0, str2, str4, str6, str7 
    from 
    -- each level peels the last word off the remaining prefix and passes the earlier columns through
    (select TS, str0, str2, str4, str6, REGEXP_EXTRACT(str5, r'^(.*) .*') as str7 
    from 
    (select TS, str0, str2, str4, str5, SUBSTR(str5, LENGTH(REGEXP_EXTRACT(str5, r'^(.*) .*')) + 2, LENGTH(str5)) as str6 
    from 
    (select TS, str0, str2, str4, REGEXP_EXTRACT(str3, r'^(.*) .*') as str5 
    from 
    (select TS, str0, str2, str3, SUBSTR(str3, LENGTH(REGEXP_EXTRACT(str3, r'^(.*) .*')) + 2, LENGTH(str3)) as str4 
    from 
    (select TS, str0, str2, REGEXP_EXTRACT(str1, r'^(.*) .*') as str3 
    from 
    (select TS, str0, str1, SUBSTR(str1, LENGTH(REGEXP_EXTRACT(str1, r'^(.*) .*')) + 2, LENGTH(str1)) as str2 
    from 
    (select TS, str0, REGEXP_EXTRACT(TS, r'^(.*) .*') as str1 
    from 
    (select TS, SUBSTR(TS, LENGTH(REGEXP_EXTRACT(TS, r'^(.*) .*')) + 2, LENGTH(TS)) as str0 
    from 
    (select adx_catg_id TS from [mydataset.conversions]) 
)))))))) 

How can I loop the query above so that it puts every word into its own new column, however many words the string contains?

+0

Possible duplicate of [BigQuery: SPLIT() returns only one value](https://stackoverflow.com/questions/27060396/bigquery-split-returns-only-one-value) – marengaz

Answers

3

Check this out...

SELECT 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){0}([^\s]*)\s?') as Word0, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){1}([^\s]*)\s?') as Word1, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){2}([^\s]*)\s?') as Word2, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){3}([^\s]*)\s?') as Word3, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){4}([^\s]*)\s?') as Word4, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){5}([^\s]*)\s?') as Word5, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){6}([^\s]*)\s?') as Word6, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){7}([^\s]*)\s?') as Word7, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){8}([^\s]*)\s?') as Word8, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){9}([^\s]*)\s?') as Word9, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){10}([^\s]*)\s?') as Word10, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){11}([^\s]*)\s?') as Word11, 
Regexp_extract(StringToParse,r'^(?:[^\s]*\s){12}([^\s]*)\s?') as Word12, 
FROM 
(SELECT 'arbitrary number of words separated by space.' as StringToParse) 

Or, if you want the words in reverse order, counting from the end of the string:

SELECT 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){1}$') as Word1, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){2}$') as Word2, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){3}$') as Word3, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){4}$') as Word4, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){5}$') as Word5, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){6}$') as Word6, 
Regexp_extract(StringToParse,r'\s?([^\s]*)(?:[^\s]*\s?){7}$') as Word7, 
FROM 
(SELECT 'arbitrary number of words separated by space.' as StringToParse) 

It still produces a fixed number of fields, but the code is simpler and more readable.

Hope this helps.
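
To see what the forward pattern does for each index, here is a minimal Python sketch. It assumes Python's `re` module handles this simple expression the same way BigQuery's regex engine does; the helper name `nth_word` is hypothetical and exists only for illustration.

```python
import re

def nth_word(text, n):
    """Return the n-th (0-based) space-separated word, or None if there is none."""
    # Skip n groups of "word followed by one whitespace character",
    # then capture the next run of non-whitespace characters.
    pattern = r'^(?:[^\s]*\s){%d}([^\s]*)\s?' % n
    match = re.search(pattern, text)
    return match.group(1) if match else None

sample = 'arbitrary number of words separated by space.'
words = [nth_word(sample, i) for i in range(9)]
print(words)
```

The sample string has seven words, so indexes 7 and 8 find no match and come back as `None`, which corresponds to a NULL column in the query.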

+0

I just noticed that this is very similar to the solution Fh pointed you to... –

+0

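Many thanks, NN, for your query. It is clearer and more readable, but it still works with a fixed number of columns, and even with the total word count it cannot print the words starting from the last one. – gadhgadhi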

+0

I added an example that parses the words starting from the last one. –

0

Unfortunately, there is no simple SPLIT() in BigQuery today, but it would make a good feature request.

I like the answer you developed, and I will experiment with it some more. For an alternative approach, you can also try https://stackoverflow.com/a/18711812/132438

In the meantime, the best way to automate this is probably to generate the query automatically outside of BigQuery.
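
As one way to generate the query outside of BigQuery, here is a minimal Python sketch. It reuses the REGEXP_EXTRACT pattern from the other answer and assumes the maximum number of words (`max_words`) is known in advance. The function name `build_split_query` and its parameters are hypothetical, not part of any BigQuery tooling.

```python
def build_split_query(column, table, max_words):
    """Build a legacy-SQL query that splits `column` into `max_words` columns."""
    select_items = []
    for i in range(max_words):
        # Skip i words, then capture the next one (NULL when there is no i-th word).
        regex = r"^(?:[^\s]*\s){%d}([^\s]*)\s?" % i
        select_items.append(
            "  REGEXP_EXTRACT(%s, r'%s') AS Word%d" % (column, regex, i)
        )
    return "SELECT\n%s\nFROM [%s]" % (",\n".join(select_items), table)

# Assumed example values, taken from the question.
query = build_split_query("adx_catg_id", "mydataset.conversions", 3)
print(query)
```

The generated text can then be submitted like any hand-written query. Supporting longer strings only means raising `max_words`.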