I'm trying to create an SSIS package to process files from a directory that contains many years' worth of files. The files are all named numerically, so to avoid processing everything, I want to pass SSIS a minimum number and only enumerate files whose name (converted to a number) is higher than my minimum.
I've tried letting the ForEach File loop enumerate everything and then exclude files in a Script Task, but when dealing with hundreds of thousands of files, this is way too slow to be suitable.
The FileSpec property lets you specify a file mask to dictate which files you want in the collection, but I can't quite see how to specify an expression to make that work, as it's essentially a string match.
If there's an expression within the component somewhere which basically says "Should I Enumerate? - Yes / No", that would be perfect. I've been experimenting with the expression below, but can't find a property to which to apply it.
(DT_I4)REPLACE( SUBSTRING(@[User::ActiveFilePath],FINDSTRING( @[User::ActiveFilePath], "\", 7 ) + 1 ,100),".txt","") > @[User::MinIndexId] ? "True" : "False"
The best you can do is use FileSpec to specify a mask, as you said. You could include at least some specs in it, like files starting with "201" for 2010, 2011 and 2012. Then, in some other task, you could filter out those you don't want to process (for instance, 2010).
From investigating how the ForEach loop works in SSIS (with a view to creating my own to solve the issue) it seems that the way it works (as far as I could see anyway) is to enumerate the file collection first, before any mask is specified. It's hard to tell exactly what's going on without seeing the underlying code for the ForEach loop but it seems to be doing it this way, resulting in slow performance when dealing with over 100k files.
While @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially just the same process, except using an Expression Task to test the filename, rather than a Script Task (this does seem to offer some improvement).
So, I decided to take a totally different approach and rather than use a file-based ForEach loop, enumerate the collection myself in a Script Task, apply my filtering logic, and then iterate over the remaining results. This is what I did:
In my Script Task, I use the DirectoryInfo.EnumerateFiles method, which is the recommended approach for large file collections, as it streams results lazily rather than making you wait for the entire collection to be built before applying any logic. Here's the code:
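A minimal sketch of this step (not the original listing), assuming the file names are integers with a .txt extension as in the question. The class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; the names here are illustrative only.
public static class NumericFileFilter
{
    // Returns the full paths of files in 'folder' matching 'searchPattern'
    // whose name (without extension) parses as an integer greater than
    // 'minIndexId'.
    public static List<string> GetFilesAbove(string folder, string searchPattern, int minIndexId)
    {
        var matches = new List<string>();
        var directory = new DirectoryInfo(folder);

        // EnumerateFiles yields entries lazily, so filtering starts before
        // the whole directory listing has been read.
        foreach (FileInfo file in directory.EnumerateFiles(searchPattern))
        {
            string stem = Path.GetFileNameWithoutExtension(file.Name);
            int index;
            if (int.TryParse(stem, out index) && index > minIndexId)
            {
                matches.Add(file.FullName);
            }
        }

        return matches;
    }
}
```

Inside the Script Task's Main method, the result would then be stored in the Object variable, along the lines of `Dts.Variables["User::ActiveFilenames"].Value = NumericFileFilter.GetFilesAbove(folder, "*.txt", minIndexId);`, where `folder` and `minIndexId` are assumed to have been read from package variables first.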
So, I enumerate the collection, applying my logic as files are discovered and immediately adding the file path to my list for output. Once complete, I then assign this to an SSIS Object variable named ActiveFilenames which I'll use as the collection for my ForEach loop.
I configured the ForEach loop as a ForEach From Variable Enumerator, which now iterates over a much smaller collection (a post-filtered List<string>, compared to what I can only assume was an unfiltered List<FileInfo> or something similar in SSIS' built-in ForEach File Enumerator).
So the tasks inside my loop can just be dedicated to processing the data, since it has already been filtered before hitting the loop. Although it doesn't seem to be doing much different to either my initial package or Siva's example, in production (for this particular case, anyway) it seems like filtering the collection while enumerating lazily provides a massive boost over using the built-in ForEach File Enumerator.
I'm going to continue investigating the ForEach loop container and see if I can replicate this logic in a custom component. If I get this working I'll post a link in the comments.
Here is one way you can achieve this. You could use Expression Task combined with Foreach Loop Container to match the numerical values of the file names. Here is an example that illustrates how to do this. The sample uses SSIS 2012.
This may not be very efficient but it is one way of doing this.
Let's assume there is a folder with a bunch of files named in the format YYYYMMDD. The folder contains files for the first day of every month since 1921, like 19210101, 19210201, 19210301 ... all the way up to the current month, 20121101. That adds up to 1,103 files.
Let's say the requirement is only to loop through the files that were created since June 1948. That would mean the SSIS package has to loop through only the files greater than 19480601.
On the SSIS package, create the following three parameters. It is better to configure parameters for these because these values are configurable across environments.
ExtensionToMatch - This parameter of String data type will contain the extension that the package has to loop through. This will supply the value for the FileSpec variable that will be used on the Foreach Loop container.
FolderToEnumerate - This parameter of String data type will store the folder path that contains the files to loop through.
MinIndexId - This parameter of Int32 data type will contain the minimum numerical value above which the files should match the pattern.
Create the following four variables that will help us loop through the files.
ActiveFilePath - This variable of String data type will hold the file name as the Foreach Loop container loops through each file in the folder. This variable is used in the expression of another variable. To avoid an error, set it to a non-empty value, say 1.
FileCount - This dummy variable of Int32 data type is used in this sample to show the number of files that the Foreach Loop container will loop through.
FileSpec - This variable of String data type will hold the file pattern to loop through. Set the expression of this variable to the value mentioned below. This expression will use the extension specified on the parameters. If there are no extensions, it will use *.* to loop through all files.
ProcessThisFile - This variable of Boolean data type will evaluate whether a particular file matches the criteria or not.
Configure the package as shown below. The Foreach Loop container will loop through all the files matching the pattern specified on the FileSpec variable. An expression specified on the Expression Task will evaluate during runtime and will populate the variable ProcessThisFile. The variable will then be used on the Precedence constraint to determine whether to process the file or not.
The script task within the Foreach Loop container will increment the counter of variable FileCount by 1 for each file that successfully matches the expression.
The script task outside the Foreach loop will simply display how many files were looped through by the Foreach Loop container.
Configure the Foreach loop container to loop through the folder using the parameter and the files using the variable.
Store the file name in variable ActiveFilePath as the loop passes through each file.
On the Expression task, set the expression to the following value. The expression will convert the file name without the extension to a number and then will check if it evaluates to greater than the given number in the parameter MinIndexId.
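To make the check concrete, here is a C# rendering of the test described above. It is not the SSIS expression itself, only a sketch of the same logic, and the class and method names are hypothetical:

```csharp
using System.IO;

// Hypothetical helper mirroring what the Expression Task is described as doing.
public static class FileNameCheck
{
    // True when the file name, minus its extension, parses as an integer
    // strictly greater than minIndexId. Non-numeric names are never processed.
    public static bool ShouldProcess(string filePath, int minIndexId)
    {
        string stem = Path.GetFileNameWithoutExtension(filePath);
        int index;
        return int.TryParse(stem, out index) && index > minIndexId;
    }
}
```

The comparison is strict, which matches the "excluding this" behaviour: a file named exactly after the MinIndexId value is skipped.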
Right-click on the Precedence constraint and configure it to use the variable ProcessThisFile on the expression. This tells the package to process the file only if it matches the condition set on the expression task.
On the first script task, I have the variable User::FileCount set to the ReadWriteVariables and the following C# code within the script task. This increments the counter for each file that successfully matches the condition.
On the second script task, I have the variable User::FileCount set to the ReadOnlyVariables and the following C# code within the script task. This simply outputs the total number of files that were processed.
When the package is executed with MinIndexId set to 19480601 (excluding this), it outputs the value 773.
When the package is executed with MinIndexId set to 20111201 (excluding this), it outputs the value 11.
Hope that helps.
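As a sanity check on the counts quoted above, the sample file names can be generated in memory and counted without touching the file system. This is only an illustrative sketch with hypothetical class and method names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that reproduces the sample data set as integers.
public static class SampleFileCounts
{
    // One YYYYMMDD value for the first day of each month,
    // from January 1921 through November 2012.
    public static List<int> MonthlyNames()
    {
        var names = new List<int>();
        var last = new DateTime(2012, 11, 1);
        for (var d = new DateTime(1921, 1, 1); d <= last; d = d.AddMonths(1))
        {
            names.Add(d.Year * 10000 + d.Month * 100 + d.Day);
        }
        return names;
    }

    // Number of sample names strictly greater than minIndexId.
    public static int CountAbove(int minIndexId)
    {
        return MonthlyNames().Count(n => n > minIndexId);
    }
}
```

With this data set, `MonthlyNames().Count` is 1103, `CountAbove(19480601)` is 773, and `CountAbove(20111201)` is 11, in line with the figures reported for the package.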