I have the Data in Azure Data Lake Store and I am processing the data present there with Azure Data Analytic Job with U-SQL. I have several CSV files which contain spatial data, similar to this:
File_20170301.csv
longtitude| lattitude | date | hour | value1
----------+-----------+--------------+------+-------
45.121 | 21.123 | 2017-03-01 | 01 | 20
45.121 | 21.123 | 2017-03-01 | 02 | 10
45.121 | 21.123 | 2017-03-01 | 03 | 50
48.121 | 35.123 | 2017-03-01 | 01 | 60
48.121 | 35.123 | 2017-03-01 | 02 | 15
48.121 | 35.123 | 2017-03-01 | 03 | 80
File_20170302.csv
longtitude| lattitude | date | hour | value1
----------+-----------+--------------+------+-------
45.121 | 21.123 | 2017-03-02 | 01 | 20
45.121 | 21.123 | 2017-03-02 | 02 | 10
45.121 | 21.123 | 2017-03-02 | 03 | 50
48.121 | 35.123 | 2017-03-02 | 01 | 60
48.121 | 35.123 | 2017-03-02 | 02 | 15
48.121 | 35.123 | 2017-03-02 | 03 | 80
Each file contains data for a different day and for all longtitude-lattitude combinations.
I would like to merge all files I have and split the data so I can end up with one file for each longtitude-lattitude combination.
So, after looping through all files in my folder and appending all data for all days, i would end up with the following:
File_45_21.csv
longtitude| lattitude | date | hour | value1
----------+-----------+--------------+------+-------
45.121 | 21.123 | 2017-03-01 | 01 | 20
45.121 | 21.123 | 2017-03-01 | 02 | 10
45.121 | 21.123 | 2017-03-01 | 03 | 50
45.121 | 21.123 | 2017-03-02 | 01 | 20
45.121 | 21.123 | 2017-03-02 | 02 | 10
45.121 | 21.123 | 2017-03-02 | 03 | 50
File_48_35.csv
longtitude| lattitude | date | hour | value1
----------+-----------+--------------+------+-------
48.121 | 35.123 | 2017-03-01 | 01 | 60
48.121 | 35.123 | 2017-03-01 | 02 | 15
48.121 | 35.123 | 2017-03-01 | 03 | 80
48.121 | 35.123 | 2017-03-02 | 01 | 60
48.121 | 35.123 | 2017-03-02 | 02 | 15
48.121 | 35.123 | 2017-03-02 | 03 | 80
In theory the following should happen:
- Find distinct values for a combination of longtitude and lattitude in the data
- take the above Array of distinct values and create a file for each combination and extract the data to it from the source files, based on the two parameters (longtitude and lattitude)
What I am struggling with is how to get started on the looping and the extraction of data based on two parameters in the source and how to "partition" the data source by the distinct values of the combination of parameters.
I would recommend to look at the notion of File Sets to operate on many files (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx) and some of the suggestions to do dynamic outputs based on values until that feature becomes available (see How do I partition a large file into files/directories using only U-SQL and certain fields in the file? as an example).
U-SQL does not support dynamic U-SQL directly, but it is possible to use "script the script" techniques to create your output. You could then manually run this output, or use something like Powershell or Azure Data Factory to run it.
I have created a simple example based on your test data partially based on the example from here.