Enumerating 100,000+ folder hierarchy taking hours

2019-09-03 12:39发布

问题:

Why is this code taking hours to complete :

public void SetRoot(string path)
{
    Tag = path;
    BeginUpdate();
    AddFolderRecursive(Nodes, path);
    EndUpdate();
}

private void AddFolderRecursive(TreeNodeCollection nodes, string path)
{
    try
    {
        var dirs = Directory.EnumerateDirectories(path).OrderBy(d => d).Select(d => d.Split(Path.DirectorySeparatorChar).Last());
        TreeNode node;
        ShellFileGetInfo.FolderIcons fi;
        foreach (var d in dirs)
        {
            node = nodes.Add(Path.Combine(path, d), d, ImageList.Images.Count);
            node.Tag = Path.Combine(path, d);
            node.SelectedImageIndex = ImageList.Images.Count + 1;
            fi = ShellFileGetInfo.GetFolderIcon((string)node.Tag, false);
//                  ImageList.Images.Add(fi.closed);
//                  ImageList.Images.Add(fi.open);
            AddFolderRecursive(node.Nodes, (string)node.Tag);
        }
    }
    catch (UnauthorizedAccessException)
    {

    }
}

I have left this code running for 14 hours now and it still has not completed getting a list of all the folders when passing calling it like SetRoot( @"c:\" );. The code is working, it is adding to the tree, but this is just ridiculous.

Essentially, I want to popular the treeview with all folders on my drive (so the treeview is searchable) AND use the actual folder icons (which I am using SHGetFileInfo p/invoke with the requisite params to do so). Even without getting the icon (which there is another problem I am having is that getting the icon of the folder is unique on a folder by folder basis even though the icon images themselves may be the same. I can not see a way to determine - quickly - if I already have that image saved in my TreeView's ImageList - aka, the folder icon for 'c:\windows' is the same as 'c:\windows\system32', however the handle, etc all return different information so there doesn't seem to be anything to go on to uniquely index them.).

What can be adjusted in my code to make this process much faster, while still retaining the folder icons from the system ? Keep in mind, that I also want ALL folders without skipping empty ones as

(I can't even show a picture of the TreeView as I stopped the loop from running after 14 hours before it finished displaying).

To test the speed of the TreeView control, I wrote the following code :

DateTime start = DateTime.UtcNow;
treeView1.BeginUpdate();
await Task.Run(() =>
{
    int x = 0;
    while (x < 100000)
    {
        x++;
        if (treeView1.InvokeRequired)
        {
            treeView1.Invoke((MethodInvoker)delegate { treeView1.Nodes.Add("Node - " + x); } );
        }
    }
});
treeView1.EndUpdate();
Text = start.ToLongTimeString() + " - " + DateTime.UtcNow.ToLongTimeString();

And here is a screenshot of the results :

As you can see, to populate the TreeView with 100,000 items happens very quickly at approximately 2 minutes provided you use BeginUpdate and EndUpdate to prevent the control from drawing or refreshing on every item. This also shows that it is definitely not the TreeView control which is holding me back -- 14 hours is excessive - even with a drive from 1996, 14 hours to enumerate 100,000 folders, is way too long.

回答1:

The Windows tree view control is simply not designed to hold so many nodes. It is simply impossible for you to populate this control with thousands of nodes in realistic time. Furthermore it is surely unrealistic to even enumerate all of those items in short time. Even more awkward is attempting to extract icons for every object in the tree ahead of time.

The way forward is that you do not attempt to populate the control with all items. Just populate parent nodes. Then when they are opened enumerate and add the children. This is how all shell programs operate.



回答2:

Upon further research I found that the issue is that the methods available for enumerating files and folders is very slow, and that when a folder is not accessible an UnauthorizedAccessException is thrown which has an inherant delay of around 200ms per incident. These exceptions stack and cause a large delay.

Additionally, Davids statement with regards to a quadratic exponent in adding items to the TreeView causes further delay is also true, however in this case, the additional delay is disproportionate to the scaling of TreeView node addition only.

To solve this problem, I was able to narrow it down to 3 issues, two of which I have completely solved so those portions of the control function within a reasonable time-frame. Breaking it down, here are the 3 issues causing delay with the OP question:

  • TreeView node addition is exponentially slow the deeper the tree, and the more nodes are added.
  • File system access does not access the native Journaling system available for NTFS and thus each file or directory is individually sourced per call. Further, if a folder is marked as restricted, the UnauthorizedAccessException imposes an artificial delay of approximately 200ms per encounter.
  • Retrieval of custom folder icons requests several IO actions (with their respective delays), and further the method above for storing every icon pair is inefficient, causing it's own additional delays even though minor in this scope, the delay is also relevant.

After narrowing the scope of the delay, I was able to target these factors to reduce them one by one.


Speeding up getting a list of folders

The first thing I had to do, was replace the file system access method with something more viable - direct access to the NTFS Journal system which I was able to do by harvesting some code from StCroixSkipper's USN Journal Explorer v1.3 and MFT Scanner in VB.NET to make the following class NtfsUsnJournal.cs which I put on pastebin as this exceeds what I can post on StackOverflow.

This change allowed me to retrieve all folders recursively on the C drive, in under 4 seconds

NOTE: I have thus-far been unable to figure out a way to access the Journal without requiring the elevated (Administrator) privilege for the application. All attempts to access the Journal without elevation resulted in denied access exceptions.


Improving TreeView performance for large sets of nested Nodes

Next, I needed to improve the TreeView's performance to be able to add over 100,000 nested nodes as the current folder structure loaded. To do this, took a bit of Google-fu, and some tinkering to adjust the code to function for the above class's Usn format.

The result is the following addition to a customer usercontrol extending TreeView :

#region TreeViewFast
private readonly Dictionary<ulong, TreeNode> _treeNodes = new Dictionary<ulong, TreeNode>();

/// <summary>
/// Load the TreeView with items.
/// </summary>
/// <typeparam name="T">Item type</typeparam>
/// <param name="items">Collection of items</param>
/// <param name="getId">Function to parse Id value from item object</param>
/// <param name="getParentId">Function to parse parentId value from item object</param>
/// <param name="getDisplayName">Function to parse display name value from item object. This is used as node text.</param>
public void LoadItems<T>(IEnumerable<T> items, Func<T, ulong> getId, Func<T, ulong?> getParentId, Func<T, string> getDisplayName)
{
    // Clear view and internal dictionary
    Nodes.Clear();
    _treeNodes.Clear();

    // Load internal dictionary with nodes
    foreach (var item in items)
    {
        var id = getId(item);
        var displayName = getDisplayName(item);
        var node = new TreeNode { Name = id.ToString(), Text = displayName, Tag = item };
        _treeNodes.Add(getId(item), node);
    }

    // Create hierarchy and load into view
    foreach (var id in _treeNodes.Keys)
    {
        var node = GetNode(id);
        var obj = (T)node.Tag;
        var parentId = getParentId(obj);

        if (parentId.HasValue)
        {
            var parentNode = GetNode(parentId.Value);
            if(parentNode == null)
            {
                Nodes.Add(node);
            } else
            {
                parentNode.Nodes.Add(node);
            }
        }
        else
        {
            Nodes.Add(node);
        }
    }
}

/// <summary>
/// Get a handle to the object collection.
/// This is convenient if you want to search the object collection.
/// </summary>
public IQueryable<T> GetItems<T>()
{
    return _treeNodes.Values.Select(x => (T)x.Tag).AsQueryable();
}

/// <summary>
/// Retrieve TreeNode by Id.
/// Useful when you want to select a specific node.
/// </summary>
/// <param name="id">Item id</param>
public TreeNode GetNode(ulong id)
{
    try
    {
        return _treeNodes[id];
    } catch (KeyNotFoundException)
    {
        return null;
    }
}

/// <summary>
/// Retrieve item object by Id.
/// Useful when you want to get hold of object for reading or further manipulating.
/// </summary>
/// <typeparam name="T">Item type</typeparam>
/// <param name="id">Item id</param>
/// <returns>Item object</returns>
public T GetItem<T>(ulong id)
{
    return (T)GetNode(id).Tag;
}


/// <summary>
/// Get parent item.
/// Will return NULL if item is at top level.
/// </summary>
/// <typeparam name="T">Item type</typeparam>
/// <param name="id">Item id</param>
/// <returns>Item object</returns>
public T GetParent<T>(ulong id) where T : class
{
    var parentNode = GetNode(id).Parent;
    return parentNode == null ? null : (T)Parent.Tag;
}

/// <summary>
/// Retrieve descendants to specified item.
/// </summary>
/// <typeparam name="T">Item type</typeparam>
/// <param name="id">Item id</param>
/// <param name="deepLimit">Number of generations to traverse down. 1 means only direct children. Null means no limit.</param>
/// <returns>List of item objects</returns>
public List<T> GetDescendants<T>(ulong id, int? deepLimit = null)
{
    var node = GetNode(id);
    var enumerator = node.Nodes.GetEnumerator();
    var items = new List<T>();

    if (deepLimit.HasValue && deepLimit.Value <= 0)
        return items;

    while (enumerator.MoveNext())
    {
        // Add child
        var childNode = (TreeNode)enumerator.Current;
        var childItem = (T)childNode.Tag;
        items.Add(childItem);

        // If requested add grandchildren recursively
        var childDeepLimit = deepLimit.HasValue ? deepLimit.Value - 1 : (int?)null;
        if (!deepLimit.HasValue || childDeepLimit > 0)
        {
            var childId = ulong.Parse(childNode.Name);
            var descendants = GetDescendants<T>(childId, childDeepLimit);
            items.AddRange(descendants);
        }
    }
    return items;
}
#endregion

To use, I created a new method which acts as a simple 'loader' as follows :

public void PopulateTree(string path)
{
    Tag = path;
    using (NtfsUsnJournal ntfs = new NtfsUsnJournal(new DriveInfo(path)))
    {
        List<NtfsUsnJournal.UsnEntry> folders;
        ntfs.GetNtfsVolumeFolders(out folders);


        Func<NtfsUsnJournal.UsnEntry, ulong> getId = (x => x.FileReferenceNumber);
        Func<NtfsUsnJournal.UsnEntry, ulong?> getParentId = (x => x.ParentFileReferenceNumber);
        Func<NtfsUsnJournal.UsnEntry, string> getDisplayName = (x => x.Name);

        LoadItems(folders, getId, getParentId, getDisplayName);
    }
}

Testing this, it now only takes 6 seconds to fully load all 100,000+ folders into the TreeView, and user experience is instant expanding


Folders with custom icons

This is the last place which I am currently hung up on, and am still searching for a way to fully improve this.

What I have done so far, is to do a check to see if desktop.ini exists within the folder, and if it does, then call the SHGetFileInfo pinvoke to get the custom folder icon. I then add the folder that was being expanded to an internal list signaling that I already checked that folder and got any associated icons, which all happens inside the OnBeforeExpand event. While these calls are inexpensive, it still adds a significant delay (expanding c:\windows takes 12 seconds) to the process.

Here is the code for that (also inside the custom TreeView)

private List<string> _expandedCache;

protected override void OnBeforeExpand(TreeViewCancelEventArgs e)
{
    if (!_expandedCache.Contains(e.Node.FullPath))
    {
        BeginUpdate();
        ShellFileGetInfo.FolderIcons fi;
        _expandedCache.Add(e.Node.FullPath);
        string curPath;
        foreach(TreeNode n in e.Node.Nodes)
        {
            curPath = Path.Combine((string)Tag, n.FullPath.Replace('/', Path.DirectorySeparatorChar));
            if (File.Exists(Path.Combine(curPath, "desktop.ini")) == true)
            {
                fi = ShellFileGetInfo.GetFolderIcon(curPath, false);
                if(fi.closed != null || fi.open != null)
                {
                    ImageList.Images.Add(fi.closed);
                    ImageList.Images.Add(fi.open);
                    n.SelectedImageIndex = ImageList.Images.Count - 1;
                    n.ImageIndex = ImageList.Images.Count - 2;
                }
            }
        }
        EndUpdate();
    }
    base.OnBeforeExpand(e);
}

This is the last major set-back which I believe there is a way to access much faster than the conventional File.Exists() method and SHGetFileInfo pinvoke for obtaining custom folder icons in an inexpensive manner

Update: After more testing, I was able to narrow down this last problem to when adding an Icon to the ImageList. Each Time an image is added to an ImageList that is connected to the TreeView, the entire TreeView of nodes is refreshed. If anyone has an idea on how to make all of this work together while keeping the performance high at this magnitude of images, please let me know. Or if this internal refresh could somehow be pushed to the background in a way that doesn't lock the UI.



回答3:

I'm willing to bet that the code is not the reason, it can be one of two things:

  1. your hard drive is a mess , for that you can try fragmentation methods.

  2. (what also happend to me) your folders are not indexed by windows (indexing - https://en.m.wikipedia.org/wiki/Indexing_Service) to fix it you need to go to the main folder you are working on and ask windows to index the folder and all of its sub folders (somewhere on the folder box), this proccess will take about a day, but after that your programm should work fine (and fast).



标签: c# pinvoke