u-sql script to search for a string then Groupby t

2019-07-27 04:02发布

问题:

I am quite new to u-sql, trying to solve

str1=\global\europe\Moscow\12345\File1.txt

str2=\global.bee.com\europe\Moscow\12345\File1.txt

str3=\global\europe\amsterdam\54321\File1.Rvt str4=\global.bee.com\europe\amsterdam\12345\File1.Rvt

case1: how do i get just "\europe\Moscow\12345\File1.txt" from the strings variable str1 & str2, i want to just take ("\europe\Moscow\12345\File1.txt") from str1 and str2 then "Groupby(\global\europe\Moscow\12345)" and take the count of distinct files from the path (""\europe\Moscow\12345\")

so the output would be something like this:

distinct_filesby_Location_Date

to solve the above case i tried the below u-sql code but not quite sure whether i am writing the right script or not:

@inArray = SELECT new SQL.ARRAY<string>(
                filepath.Contains("\\europe")) AS path
    FROM @t;

@filesbyloc =
    SELECT [ID],
        path.Trim() AS path1
    FROM @inArray
    CROSS APPLY
    EXPLODE(path1) AS r(location);

OUTPUT @filesbyloc
TO "/Outputs/distinctfilesbylocation.tsv"
USING Outputters.Tsv();

any help would you greatly appreciated.

回答1:

One approach to this is to put all the strings you want to work with in a file, eg strings.txt and save it in your U-SQL input folder. Also have a file with the cities in you want to match, eg cities.txt. Then try the following U-SQL script:

@input =
    EXTRACT filepath string
    FROM "/input/strings.txt"
    USING Extractors.Tsv();

// Give the strings a row-number
@input =
    SELECT ROW_NUMBER() OVER() AS rn,
           filepath
    FROM @input;


// Get the cities
@cities =
    EXTRACT city string
    FROM "/input/cities.txt"
    USING Extractors.Tsv();

// Ensure there is a lower-case version of city for matching / joining
@cities =
    SELECT city,
           city.ToLower() AS lowercase_city
    FROM @cities;


// Explode the filepath into separate rows
@working =
    SELECT rn,
           new SQL.ARRAY<string>(filepath.Split('\\')) AS pathElement
    FROM @input AS i;

// Explode the filepath string, also changing to lower case
@working =
    SELECT rn,
           x.pathElement.ToLower() AS pathElement
    FROM @working AS i
         CROSS APPLY
             EXPLODE(pathElement) AS x(pathElement);


// Create the output query, joining on lower case city name, display, normal case name
@output =
    SELECT c.city,
           COUNT( * ) AS records
    FROM @working AS w
         INNER JOIN
             @cities AS c
         ON w.pathElement == c.lowercase_city
    GROUP BY c.city;


// Output the result
OUTPUT @output TO "/output/output.txt"
USING Outputters.Tsv();

//OUTPUT @working TO "/output/output2.txt"
//USING Outputters.Tsv();

My results:

HTH



回答2:

Taking the liberty to format your input file as TSV file, and not knowing all the column semantics, here is a way to write your query. Please note that I made the assumptions as provided in the comments.

@d =
    EXTRACT path string,
            user string,
            num1 int,
            num2 int,
            start_date string,
            end_date string,
            flag string,
            year int,
            s string,
            another_date string
    FROM @"\users\temp\citypaths.txt"
    USING Extractors.Tsv(encoding: Encoding.Unicode);

// I assume that you have only one DateTime format culture in your file. 
// If it becomes dependent on the region or city as expressed in the path, you need to add a lookup.
@d =
SELECT new SqlArray<string>(path.Split('\\')) AS steps,
       DateTime.Parse(end_date, new CultureInfo("fr-FR", false)).Date.ToString("yyyy-MM-dd") AS end_date
FROM @d;

// This assumes your paths have a fixed formatting/mapping into the city
@d =
SELECT steps[4].ToLowerInvariant() AS city,
       end_date
FROM @d;

@res =
SELECT city,
       end_date,
       COUNT( * ) AS count
FROM @d
GROUP BY city,
         end_date;

OUTPUT @res
TO "/output/result.csv"
USING Outputters.Csv();

// Now let's pivot the date and count.

OUTPUT @res2
TO "/output/res2.csv"
USING Outputters.Csv();
        @res2 = 
SELECT city, MAP_AGG(end_date, count) AS date_count 
FROM @res 
GROUP BY city;

// This assumes you know exactly with dates you are looking for. Otherwise keep it in the first file representation.
@res2 =
SELECT city,
       date_count["2016-11-21"]AS [2016-11-21],
       date_count["2016-11-22"]AS [2016-11-22]
FROM @res2;

UPDATE AFTER RECEIVING SOME EXAMPLE DATA IN PRIVATE EMAIL:

Based on the data you sent me (after the extraction and counting of the cities that you either could do with the join as outlined in Bob's answer where you need to know your cities in advance, or with the taking the string from the city location in the path as in my example, where you do not need to know the cities in advance), you want to pivot the rowset city, count, date into the rowset date, city1, city2, ... were each row contains the date and the counts for each city.

You could easily adjust my example above by changing the calculations of @res2 in the following way:

// Now let's pivot the city and count.
@res2 = SELECT end_date, MAP_AGG(city, count) AS city_count 
        FROM @res 
        GROUP BY end_date;

// This assumes you know exactly with cities you are looking for. Otherwise keep it in the first file representation or use a script generation (see below).
@res2 =
SELECT end_date,
       city_count["istanbul"]AS istanbul,
       city_count["midlands"]AS midlands,
       city_count["belfast"] AS belfast, 
       city_count["acoustics"] AS acoustics, 
       city_count["amsterdam"] AS amsterdam
FROM @res2;

Note that as in my example, you will need to enumerate all cities in the pivot statement by looking it up in the SQL.MAP column. If that is not known apriori, you will have to first submit a script that creates the script for you. For example, assuming your city, count, date rowset is in a file (or you could just duplicate the statements to generate the rowset in the generation script and the generated script), you could write it as the following script. Then take the result and submit it as the actual processing script.

// Get the rowset (could also be the actual calculation from the original file
@in = EXTRACT  city string, count int?, date string
      FROM "/users/temp/Revit_Last2Months_Results.tsv"
      USING Extractors.Tsv();

// Generate the statements for the preparation of the data before the pivot 
@stmts = SELECT * FROM (VALUES
                  ( "@s1", "EXTRACT  city string, count int?, date string FROM \"/users/temp/Revit_Last2Months_Results.tsv\" USING Extractors.Tsv();"),
                  ( "@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;" )
                  ) AS  T( stmt_name, stmt);

// Now generate the statement doing the pivot
@cities = SELECT DISTINCT city FROM @in2;

@pivots = 
SELECT "@s3" AS stmt_name, "SELECT date, "+String.Join(", ", ARRAY_AGG("city_count[\""+city+"\"] AS ["+city+"]"))+ " FROM @s2;" AS stmt 
FROM @cities;

// Now generate the OUTPUT statement after the pivot. Note that the OUTPUT does not have a statement name.
@output = 
SELECT "OUTPUT @s3 TO \"/output/pivot_gen.tsv\" USING Outputters.Tsv();" AS stmt 
FROM (VALUES(1)) AS T(x);

// Now put the statements into one rowset. Note that null are ordering high in U-SQL
@result = 
SELECT stmt_name, "=" AS assign, stmt FROM @stmts
UNION ALL SELECT stmt_name, "=" AS assign, stmt FROM @pivots
UNION ALL SELECT (string) null AS stmt_name, (string) null AS assign, stmt FROM @output;

// Now output the statements in order of the stmt_name
OUTPUT @result 
TO "/pivot.usql" 
ORDER BY stmt_name 
USING Outputters.Text(delimiter:' ', quoting:false);

Now download the file and submit it.



标签: u-sql