Here is my complete task description:
I have to extract data from multiple files using u-sql and output it into csv file. Every input file contains multiple reports based on some string rows ("START OF ..." and "END OF ..." working as report separator). Here is an example (data format) of a single source (input) file :
START OF DAILY ACCOUNT
some data 1
some data 2
some data 3
some data n
END OF DAILY ACCOUNT
START OF LEDGER BALANCE
some data 1
some data 2
some data 3
some data 4
some data 5
some data n
END OF LEDGER BALANCE
START OF DAILY SUMMARY REPORT
some data 1
some data 2
some data 3
some data n
END OF DAILY SUMMARY REPORT
So now my question is that how can I fetch records between "START OF ..." and "END OF ..." rows for all files?
I want something like this at the end :
@dailyAccountResult = [select all rows between "START OF DAILY ACCOUNT" and "END OF DAILY ACCOUNT" rows]
@ledgerBalanceResult = [select all rows between "START OF LEDGER BALANCE" and "END OF LEDGER BALANCE" rows]
@dailySummaryReportResult = [select all rows between "START OF DAILY SUMMARY REPORT" and "END OF DAILY SUMMARY REPORT" rows]
Do I need to write custom extractor for this? If yes then please suggest me how.
I think this is possible using normal U-SQL without a custom extractor. I have created a simple example based on your sample data:
My results:
Now each section is tagged with a section name, you can easily assign the data to different variables and optionally include header/footer rows, eg
Give it a try and let me know how you get on.
The relevant questions to ask:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
ensures the entire input gets processed sequentially by 1 instance, which is sufficient and may be feasible for this case depending on the input size.Do Rowsets have order?
No! Rowsets are unordered concepts - think of them as non-deduping HashSets.
var input = new HashSet<string>(File.ReadLines(@In_Data)); File.WriteAllLines(@Out_NewData, input)
isn't expected to preserve the original line order (even if it does for some inputs, that's implementation detail, not guaranteed semantic behaviour).
Ditto for rowsets - the input order of lines is lost (unguaranteed) the moment data gets translated into a rowset. Attempting to use ROW_NUMBER() is therefore fruitless - there is no order to preserve by the time ROW_NUMBER() can be invoked. The only way to use ROW_NUMBER() is if the rowset had some Key whose sort order could recreate the original order of lines.
Because rowsets do not have order, you need a custom extractor no matter what - it is the only part of the script able to observe the order of lines in the file, given
You can choose to include all your logic in the custom extractor or simply add a numbered column to mimic ROW_NUMBER and use native U-Sql for the logic.