How do you deal with lots of small files?

Posted 2020-02-02 10:50

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time consuming.

I have tried turning off the indexing service, but that made no difference. I have also contemplated moving the file content into a database, zip files, or tarballs, but it is beneficial for us to access the files individually; the files are still needed for research purposes, and the researchers are not willing to deal with anything else.

Is there a way to optimize NTFS or Windows so that it can work with all these small files?

14 answers
爷的心禁止访问 · #2 · 2020-02-02 11:26

To create a folder structure that will scale to a large unknown number of files, I like the following system:

Split the filename into fixed length pieces, and then create nested folders for each piece except the last.

The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.

12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg

This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.

And here's a beautiful PowerShell one-liner to get you going!

$s = '123456'

-join  (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )
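
If you already have a flat folder full of files, a rough sketch of applying the same scheme after the fact might look like the following; the C:\Readings path and the numeric base names are assumptions, not something from the question:

    $root = 'C:\Readings'                        # assumed source folder
    Get-ChildItem -Path $root -File | ForEach-Object {
        # e.g. '123456' -> '12\34\' (same two replaces as the one-liner above)
        $rel  = $_.BaseName -replace '(..)(?!$)', '$1\' -replace '[^\\]*$', ''
        $dest = if ($rel) { Join-Path $root $rel } else { $root }
        if (-not (Test-Path $dest)) { New-Item -ItemType Directory -Path $dest | Out-Null }
        Move-Item -Path $_.FullName -Destination $dest
    }

Files whose names are only one or two characters long simply stay in the root, which is the "only as deep as it needs to be" property in action.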
我想做一个坏孩纸 · #3 · 2020-02-02 11:27

Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris with ZFS, for example).

乱世女痞 · #4 · 2020-02-02 11:29

NTFS performance degrades severely once a directory holds more than about 10,000 files. What you do is create an additional level in the directory hierarchy, with each subdirectory holding at most 10,000 files.

For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
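
As an illustration of that kind of bucketing, here is a hedged PowerShell sketch that groups numerically named files into subfolders of 1,000 (the SVN default); the C:\Readings path and the purely numeric file names are assumptions:

    $root = 'C:\Readings'                                       # assumed folder
    Get-ChildItem -Path $root -File | ForEach-Object {
        $id     = [int]$_.BaseName                              # e.g. 123456
        $bucket = '{0:D4}' -f [int][math]::Floor($id / 1000)    # 123456 -> '0123'
        $dest   = Join-Path $root $bucket
        if (-not (Test-Path $dest)) { New-Item -ItemType Directory -Path $dest | Out-Null }
        Move-Item -Path $_.FullName -Destination $dest
    }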

chillily · #5 · 2020-02-02 11:29

The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.

One obvious way to resolve this issue is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:

ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db

Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
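
A minimal PowerShell sketch of that mapping, assuming nine-character base names padded with zeroes and three-character folder prefixes (both assumptions, matching the example above):

    $name = 'ABCDEFGHI.db'
    $base = [System.IO.Path]::GetFileNameWithoutExtension($name).PadLeft(9, '0')
    $dir  = Join-Path ($base.Substring(0, 3)) ($base.Substring(3, 3))   # -> ABC\DEF
    Join-Path $dir $name                                                # -> ABC\DEF\ABCDEFGHI.db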

来,给爷笑一个 · #6 · 2020-02-02 11:31

I have run into this problem lots of times in the past. We tried storing by date, zipping up the files under each date so you don't have lots of small files, and so on. All of them were band-aids for the real problem of storing the data as lots of small files on NTFS.

You can go to ZFS or some other file system that handles small files better, but still stop and ask whether you NEED to store the small files at all.

In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1 MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
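
For illustration only (this is not the poster's actual format), a rough PowerShell sketch of that idea might write one day's readings into a single container file, each entry prefixed with its name and byte length so a reader can scan to the part it needs; the paths are assumptions:

    $day     = Get-Date -Format 'yyyyMMdd'
    $outPath = "C:\Archive\readings-$day.dat"               # assumed container path
    New-Item -ItemType Directory -Path 'C:\Archive' -Force | Out-Null
    $writer  = New-Object System.IO.BinaryWriter([System.IO.File]::Open($outPath, 'Create'))
    Get-ChildItem -Path 'C:\Readings' -File | ForEach-Object {
        $bytes = [System.IO.File]::ReadAllBytes($_.FullName)
        $writer.Write($_.Name)          # entry name (length-prefixed string)
        $writer.Write($bytes.Length)    # payload length in bytes
        $writer.Write($bytes)           # payload
    }
    $writer.Close()

A matching reader would walk the file entry by entry, reading each name and length and skipping or extracting the payload as needed.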

爷、活的狠高调 · #7 · 2020-02-02 11:31

Rename the folder each day with a time stamp.

If the application is saving the files into C:\Readings, then set up a scheduled task to rename C:\Readings at midnight and create a new empty folder.

Then you will get one folder for each day, each containing several thousand files.

You can extend the method further to group by month. For example, C:\Readings becomes C:\Archive\September\22.

You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
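
A sketch of what that midnight task could run, assuming the folder is C:\Readings and the month\day archive layout from the example above (both assumptions):

    $stamp   = (Get-Date).AddDays(-1)                                  # the day that just ended
    $archive = Join-Path 'C:\Archive' ('{0:MMMM}\{0:dd}' -f $stamp)    # e.g. C:\Archive\September\22
    New-Item -ItemType Directory -Path (Split-Path $archive) -Force | Out-Null
    Move-Item -Path 'C:\Readings' -Destination $archive                # renaming the folder is just a move
    New-Item -ItemType Directory -Path 'C:\Readings' | Out-Null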
