可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I try create a custom parallel extractor, but i have no idea how do it correctly. I have a big files (more than 250 MB), where data for each row are stored in 4 lines. One file row store data for one column. Is this possible to create working parallely extractor for large files? I am afraid that data for one row, will be in different extents after file splitting.

Example:

...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...

Sorry for my English.

回答1:

I think, you can process this data using U-SQL sequentially not in parallel. You have to write a custom applier to take a single/multiple rows and return single/multiple rows. And then, you can invoke it with CROSS APPLY. You can take help from this applier.

回答2:

U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.

Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). In either way though, the extractor UDO model would not know if your 4 rows are all inside the same extent or across them.

So you have two options:

Mark the extractor as operating on the whole file with adding the following line before the extractor class:
```
[SqlUserDefinedExtractor(AtomicFileProcessing = true)] 
```
Now the extractor will see the full file. But you lose the scale out of the file processing.
You extract one row per line and use a U-SQL statement (eg. using Window Functions or a custom REDUCER) to merge the rows into a single row.

回答3:

I have discovered that I cant use static method to get an instance of IExtractor implementation in USING statement if I want use AtomicFileProcessing set on true.