I currently have a program that merges a particular worksheet from multiple workbooks. It has to process around 300 workbooks and create a consolidated workbook. In the consolidated workbook, I have to add a few more data points to each of the rows copied from the other workbooks. The number of rows in the target workbook is likely to be around 100,000.
I have another Excel workbook that contains additional data related to each customer, such as the name, country, currency, etc., which I have to add as extra columns.
One workbook contains data from one customer only. So, for the data copied from one workbook, the data for all these columns will essentially be the same.
Right now, I am thinking of opening the customer data workbook, copying the relevant data based on customerID, and pasting it into the target workbook each time I paste the data from a source worksheet.
Now the code will go like this:
    For each workbook in the source folder
        Open the source workbook
        Copy the range from worksheet1
        Copy the customerID from worksheet2
        Open the target workbook
        Paste the range from source.worksheet1 into the target worksheet
        Fill the first column of the pasted range with the customerID
        Open the workbook with the customer master data
        Copy the data based on customerID
        Paste it into the target worksheet using a fill range
    Next workbook
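In VBA terms, I picture the loop roughly as in the sketch below (the sheet names, ranges, folder path and column positions are only placeholders, not my real layout):

```vba
Dim fso As Object, f As Object
Dim srcWB As Workbook, tgtWS As Worksheet
Dim custID As Variant, nextRow As Long, pastedRows As Long

Set fso = CreateObject("Scripting.FileSystemObject")
Set tgtWS = ThisWorkbook.Worksheets("Consolidated")        ' placeholder target sheet

For Each f In fso.GetFolder("C:\SourceFolder").Files       ' placeholder folder
    Set srcWB = Workbooks.Open(f.Path, ReadOnly:=True)
    custID = srcWB.Worksheets("Sheet2").Range("B2").Value  ' placeholder customerID cell

    With srcWB.Worksheets("Sheet1").UsedRange
        pastedRows = .Rows.Count
        nextRow = tgtWS.Cells(tgtWS.Rows.Count, 2).End(xlUp).Row + 1
        tgtWS.Cells(nextRow, 2).Resize(.Rows.Count, .Columns.Count).Value = .Value
    End With

    tgtWS.Cells(nextRow, 1).Resize(pastedRows, 1).Value = custID   ' stamp the customerID

    srcWB.Close SaveChanges:=False
    ' ...then look up the customer master data for custID and fill the extra columns...
Next f
```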
Another approach is to finish pasting all the data from all the worksheets first and then do the lookup of the customer data. The advantage here is that one customer's data can appear in multiple files, so there is some efficiency in avoiding repeated lookups, and the customer master data file has to be opened only once.
I would like to know whether the approach I have taken is right and whether there are any improvements I can make.
Specifically, I have the following questions:
- Is there a way to avoid opening each source file?
- Is it better to add all the data from the source workbooks to an array in memory and copy it once into the target workbook?
- If I go with the array approach, should I add the lookup to the array before pasting or should I do it after pasting the data in the destination range?
I am writing this as an answer because it would be too long for a comment.
1) There is no way to avoid opening the source files to retrieve their data. For operations like this, it is very important that for every workbook you open for retrieval you create a new instance of Excel (with CreateObject) and clear it entirely from memory when finished by calling its .Quit method. Workbook.Close is not as flawless as you would expect (the memory is not released cleanly), and the more workbooks you open and close in one loop, the higher the chance that your application will crash or freeze at some point.
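A minimal sketch of that pattern, with an invented file path, could look like this:

```vba
Dim xlApp As Object, wb As Object, data As Variant

Set xlApp = CreateObject("Excel.Application")    ' a separate, hidden Excel instance
xlApp.Visible = False
xlApp.DisplayAlerts = False

Set wb = xlApp.Workbooks.Open("C:\Source\Customer1.xlsx", ReadOnly:=True)
data = wb.Worksheets(1).UsedRange.Value          ' pull the sheet into memory in one read

wb.Close SaveChanges:=False
xlApp.Quit                                       ' release the whole instance
Set wb = Nothing
Set xlApp = Nothing
```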
2) Reading and writing operations take the most time, so it is indeed advisable to reduce their number to a minimum. The best approach is to read all the data of each retrieved workbook at once into a single range (and array), perform all operations in memory, and write the resulting array back to a single range once you are done.
You can load all the workbooks at once into an array as you suggest (as far as I know, the size of an array is limited only by the available memory). For practical reasons, though, I would suggest that you write the data to the target workbook every time you close a source file. If something unexpected happens, you don't need to re-run the entire code; you can simply restart the For...Next loop from where you were the last time.
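For example, assuming srcSheet, tgtSheet and custID are already set, the read and the write each happen in a single operation:

```vba
Dim arr As Variant, r As Long

arr = srcSheet.UsedRange.Value                   ' one read into a 2-D Variant array

For r = 1 To UBound(arr, 1)                      ' all further work happens in memory
    arr(r, 1) = custID                           ' e.g. stamp the customer ID in column 1
Next r

tgtSheet.Range("A2").Resize(UBound(arr, 1), UBound(arr, 2)).Value = arr   ' one write
```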
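A rough sketch of what I mean, assuming you already have the file names in an array and use a helper cell to remember your progress:

```vba
Dim i As Long, startAt As Long
startAt = 1                                      ' raise this to resume after a crash

For i = startAt To UBound(fileNames)
    ' ...open fileNames(i), read its data, write it to the target workbook...
    ThisWorkbook.Worksheets("Progress").Range("A1").Value = i   ' invented progress sheet
    ThisWorkbook.Save
Next i
```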
3) If you want to prevent the application from stopping every time a bug is hit (which is highly likely when you retrieve this amount of data), use VBA's On Error handling.
It is advisable to create a small error handler that writes the error number, the description, and the place where the error occurred to a separate log file, so you can track the problems afterwards. That way you can get your coffee while the code runs.
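A minimal version of such a handler could look like this; the log path and the currentFile variable are invented for the example:

```vba
Sub ConsolidateWorkbooks()
    Dim logFile As Integer, currentFile As String
    logFile = FreeFile
    Open "C:\Temp\merge_log.txt" For Append As #logFile

    On Error GoTo ErrHandler

    ' ...the main loop; set currentFile to the workbook being processed...

CleanUp:
    Close #logFile
    Exit Sub

ErrHandler:
    Print #logFile, Now & " | Error " & Err.Number & ": " & Err.Description & _
                    " | while processing " & currentFile
    Resume Next                                  ' log the problem and keep going
End Sub
```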
4) As for the lookup, I am not certain that I correctly understand what you are trying to do. The fastest lookup is naturally the one with the shortest loop; nested loops are usually rather expensive, certainly when they are long.
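What I would suggest here (this is my own addition, not something from your current plan) is to load the customer master data once into a Scripting.Dictionary keyed on the customer ID, so every lookup becomes a direct key access instead of a loop:

```vba
Dim dict As Object, master As Variant, r As Long

Set dict = CreateObject("Scripting.Dictionary")

master = masterSheet.UsedRange.Value             ' assumes column 1 holds the customerID

For r = 2 To UBound(master, 1)                   ' skip the header row
    If Not dict.Exists(master(r, 1)) Then
        dict.Add master(r, 1), r                 ' remember the master row per customer
    End If
Next r

If dict.Exists(custID) Then
    r = dict(custID)
    ' master(r, 2), master(r, 3), ... hold name, country, currency, and so on
End If
```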
I will just mention (as I usually do) that it is advisable to use MS Access (or any other RDBMS) for customer databases in the future (when I hear "customer ID", this is the first thing I think of). The main performance advantage is that you can access records directly by applying indexes, and it is easy to query with simple SQL statements, much easier than programming everything yourself in VBA and looping through arrays. SQL statements can be used to perform the general CRUD operations (Create, Retrieve, Update, Delete) and much more, and you can still use VBA to take care of any additional transformations.
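For illustration, querying such a database from VBA through ADO could look like the sketch below; the file path, table and column names are invented:

```vba
Dim cn As Object, rs As Object

Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source=C:\Data\Customers.accdb"

Set rs = cn.Execute("SELECT CustomerName, Country, CurrencyCode " & _
                    "FROM Customers WHERE CustomerID = 'C0001'")

If Not rs.EOF Then
    Debug.Print rs.Fields("CustomerName").Value, _
                rs.Fields("Country").Value, _
                rs.Fields("CurrencyCode").Value
End If

rs.Close
cn.Close
```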
I hope my reply was of some help to you, even though the snippets above are only rough sketches rather than a worked-out solution.