I need to copy files from one directory to another, depending on whether each file name exists in a table of a SQL Server database.
For this I use the following code:
using (SqlConnection connection = new SqlConnection("your connection string"))
{
    connection.Open();
    using (SqlCommand cmd = new SqlCommand("SELECT idPic, namePicFile FROM DocPicFiles", connection))
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // picList is an array that contains all the file names in a directory
            string namePicFile = reader["namePicFile"].ToString();
            string sourceFile = picList.FirstOrDefault(s => s.Contains(namePicFile));
            if (sourceFile != null)
            {
                // destinationDirectory is the folder the files are copied to
                File.Copy(sourceFile, Path.Combine(destinationDirectory, namePicFile + ".jpg"), false);
            }
        }
    }
}
Is there any way this can be done in less time? It currently takes about an hour for 20,876 records.
Allow me to make a guess - Mmmmm... No. There is no way to do it faster.
Why am I so confident? Because file copying requires talking to the disk, and that is a horribly slow operation. Worse, if you try multi-threading, the copy will usually get slower rather than faster, because the mechanical operation of moving the head over the disk is no longer sequential, which it may have been earlier by chance.
See answers to this question I asked earlier.
So yes, try moving to SSDs if you aren't using them yet; otherwise you are already getting about the best you can.
To put into perspective how slow disk writes are compared to caches: if a cache access took 10 minutes, a disk read would take roughly 2 years on the same scale. The various access times are shown in the image below. Clearly, when your code executes, the bottleneck will be the disk writes. The best you can do is keep those writes sequential.
File.Copy is as fast as it gets. Keep in mind that you depend on the file transfer speed dictated by your hardware, and with 20,000 files the latency of data access also comes into play. If you are doing this on an HDD, you could see a big improvement after switching to an SSD or some other fast medium.
In this case, the hardware is most likely your bottleneck.
EDIT: I consider keeping the database connection open for such a long time bad practice. I suggest you fetch all the needed data into some in-memory cache (array, list, whatever) and then iterate through that as you copy the files. A db connection is a precious resource, and on applications that must handle high concurrency (but not only those), releasing the connection quickly is a must. A sketch of that approach follows below.
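A minimal sketch of that idea, assuming a hypothetical connectionString, the picList array from the question, and a hypothetical destinationDirectory path:

    // Sketch only: read all the file names first, release the connection,
    // then do the slow disk work afterwards.
    var fileNames = new List<string>();

    using (var connection = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT namePicFile FROM DocPicFiles", connection))
    {
        connection.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                fileNames.Add(reader["namePicFile"].ToString());
            }
        }
    } // the connection is closed here, before any file is touched

    foreach (var name in fileNames)
    {
        var sourceFile = picList.FirstOrDefault(s => s.Contains(name));
        if (sourceFile != null)
        {
            File.Copy(sourceFile, Path.Combine(destinationDirectory, name + ".jpg"), false);
        }
    }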
Since your I/O subsystem is almost certainly the bottleneck here, using the Task Parallel Library is probably about as good as it gets:
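Something along these lines, assuming fileNames already holds the names pulled from the table and that sourceDir and destinationDir are hypothetical source and destination paths:

    // Sketch only: copy files with Parallel.ForEach; a modest MaxDegreeOfParallelism
    // avoids thrashing a spinning disk with too many concurrent head movements.
    Parallel.ForEach(
        fileNames,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        name =>
        {
            var source = Path.Combine(sourceDir, name + ".jpg");
            var target = Path.Combine(destinationDir, name + ".jpg");
            if (File.Exists(source))
            {
                File.Copy(source, target, false);
            }
        });

On an SSD a higher degree of parallelism may help; on an HDD it usually will not, for the reasons given in the other answer.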
I addressed this problem by creating a single compressed (.zip) file, using the option to just store the files (no compression). Creating that single .zip file, moving it, and then expanding it at the destination proved to be about 2x faster when dealing with thousands of files. A sketch follows below.
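A rough sketch of that idea with System.IO.Compression (reference System.IO.Compression.FileSystem), assuming the files to transfer have already been gathered into a hypothetical sourceDir and that destinationDir is the target folder:

    // Sketch only: store the files in a single uncompressed archive, move that one
    // file, then expand it at the destination.
    string archivePath = Path.Combine(Path.GetTempPath(), "pictures.zip");
    string movedArchive = Path.Combine(destinationDir, "pictures.zip");

    ZipFile.CreateFromDirectory(sourceDir, archivePath, CompressionLevel.NoCompression, false);
    File.Move(archivePath, movedArchive);
    ZipFile.ExtractToDirectory(movedArchive, destinationDir);

Because the archive is stored without compression, the CPU cost is negligible and the win comes from turning thousands of small transfers into one large sequential one.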