Handling Files With Different Columns in USQL

2019-04-10 13:41发布

问题:

I have a USQL script and CSV extractor to load my files. However some months the files may contain 4 columns and some months they may contain 5 columns.

If I setup up my extractor with a column list for either 4 or 5 fields I get an error about the expected width of the file. Go check delimiters etc etc. No surprise.

What is the work around to this problem please given USQL is still in a newbie and missing some basic error handling?

I've tried using the silent clause in the extractor to ignore wider columns which is handy for 4 columns. Then getting a row count of the rowset with an IF condition that then has an extractor for 5 columns. However this leads to a world of rowset variables not being used as scalar variables in the IF expression.

Also I tried a C# style count and a sizeof(@AttemptExtractWith4Cols). Neither work.

Code snippet to give you a feel for the approach I'm taking:

DECLARE @SomeFilePath string = @"/MonthlyFile.csv";

@AttemptExtractWith4Cols =
    EXTRACT Col1 string,
            Col2 string,
            Col3 string,
            Col4 string
    FROM @SomeFilePath
    USING Extractors.Csv(silent : true); //can't be good.

//can't assign rowset to scalar variable!
DECLARE @RowSetCount int = (SELECT COUNT(*) FROM @AttemptExtractWith4Cols);

//tells me @AttemptExtractWith4Cols doesn't exist in the current context!
DECLARE @RowSetCount int = @AttemptExtractWith4Cols.Count();

IF (@RowSetCount == 0) THEN
    @AttemptExtractWith5Cols =
        EXTRACT Col1 string,
                Col2 string,
                Col3 string,
                Col4 string,
                Col5 string
        FROM @SomeFilePath
        USING Extractors.Csv(); //not silent
END;


//etc

Of course if there was such a thing as a TRY CATCH block in USQL this would be a lot easier.

Is this even a reasonable approach to take?

Any input would be greatly appreciated.

Thank you for your time.

回答1:

U-SQL now supports OUTER UNION so you can handle it like this:

// Scenario 1; file has 4 columns
DECLARE @file1 string = @"/input/file1.csv";

// Scenario 2; file has 5 columns
//DECLARE @file1 string = @"/input/file2.csv";


@file =
    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string
    FROM @file1
    USING Extractors.Csv(silent : true)

    OUTER UNION ALL BY NAME ON (col1, col2, col3, col4)

    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string,
            col5 string
    FROM @file1
    USING Extractors.Csv(silent : true);


@output =
    SELECT *
    FROM @file;


OUTPUT @output
    TO "/output/output.csv"
USING Outputters.Csv();

In my example, file1 has 4 columns and file2 has 5 columns. The script runs successfully in either scenario.

My results:

Hope that makes sense.



回答2:

The OUTER UNION is a great solution. Alternatively, you can also write your own generic extractor if you expect your rows in a file to be different. See this blog post for an example.



回答3:

Here's another solution which I found helpful. You can read the file as a single text column (using "\t" as delimiter since there aren't any) and then dynamically split using C# string functions. I have tested this on a similar problem. The advantage of this approach is that you could potentially use the same approach for any number of columns.

SELECT
      (String)(ColList[0])   AS ColA
    , (String)(ColList[1])   AS ColB
    , (String)(ColList[2])   AS ColC
    , (String)(ColList[3])   AS ColD
    , (int?)(NumColumns >= 5 ? (String)(ColList[4]) : (String)null)
                             AS ColE
FROM (
    SELECT ColList
         , ColList.Count AS NumColumns
    FROM (
        SELECT SqlArray.Create(RowText.Split(','))   AS ColList
        FROM (
            EXTRACT RowText string
            FROM @SomeFilePath
            USING Extractors.Text(delimiter: '\t', quoting: false)
        ) AS [T1]
    ) AS [T2]
) AS [T3]

Warning: This solution is not aware of text quoting. Any commas in field values will break this logic.



标签: c# u-sql