How do you deal with lots of small files?

Published 2020-02-02 10:50

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.

I have tried turning off the indexing service, but that made no difference. I have also contemplated moving the file content into a database, zip files, or tarballs, but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes, and the researchers are not willing to deal with anything else.

Is there a way to optimize NTFS or Windows so that it can work with all these small files?

14 answers
ら.Afraid
Answer 2 · 2020-02-02 11:37

One common trick is to simply create a handful of subdirectories and divvy up the files.

For instance, Doxygen, an automated code documentation program that can produce tons of HTML pages, has an option for creating a two-level-deep directory hierarchy. The files are then distributed evenly across the bottom-level directories.
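
As an illustration of the same idea (not Doxygen's actual algorithm), a hash of the file name can be used to spread files evenly over a fixed two-level hierarchy; the bucket_path function and the "readings" root are assumptions, not from the answer:

# Illustrative sketch: spread files over a two-level hierarchy by hashing
# the file name, giving 16 x 16 = 256 leaf directories.
import hashlib
from pathlib import Path

def bucket_path(filename: str, root: str = "readings") -> Path:
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # The first two hex digits of the hash pick the two directory levels.
    return Path(root, digest[0], digest[1], filename)

print(bucket_path("2007-03-12_1405.bin"))   # e.g. readings/3/a/2007-03-12_1405.bin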

兄弟一词,经得起流年.
Answer 3 · 2020-02-02 11:38

Aside from placing the files in sub-directories..

Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are still presented as individual files. In the background, the application would actually take these files and combine them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), to get rid of the mess you have.

That way you can still make it easy for the researchers to access the files they want, while also giving yourself more control over how everything is structured.
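
A minimal sketch of that idea, assuming every record is exactly 64 KiB and keeping a simple name-to-slot index in memory; the Packer class and its methods are hypothetical, not an existing library:

RECORD_SIZE = 64 * 1024  # every reading is a fixed 64 KiB blob

class Packer:
    # Hypothetical sketch: pack fixed-size readings into one container file
    # and read them back by name via an in-memory slot index.
    def __init__(self, container_path):
        self.path = container_path
        self.index = {}  # file name -> slot number inside the container

    def add(self, name, data):
        assert len(data) == RECORD_SIZE, "records are assumed to be exactly 64 KiB"
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[name] = len(self.index)  # slots are assigned in insertion order

    def read(self, name):
        with open(self.path, "rb") as f:
            f.seek(self.index[name] * RECORD_SIZE)
            return f.read(RECORD_SIZE)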

乱世女痞
Answer 4 · 2020-02-02 11:39

I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., the first and then the second letter of the filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
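
For instance (a sketch; the function name and example file name are illustrative), the path can be derived directly from the first two characters of the file name:

from pathlib import Path

def letter_path(filename: str, root: str = "data") -> Path:
    # e.g. "reading_042.bin" -> data/r/e/reading_042.bin
    return Path(root, filename[0], filename[1], filename)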

来,给爷笑一个
Answer 5 · 2020-02-02 11:39

You could try using something like Solid File System.

This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.

http://www.eldos.com/solfsdrv/

乱世女痞
Answer 6 · 2020-02-02 11:42

Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, such as one big tarball or a database.

If you really need a separate file for each reading, you should sort them into several subdirectories instead of keeping all of them in the same directory. You can do this by creating a hierarchy of directories and putting the files in different ones depending on the file name. This way you can still store and load your files knowing only the file name.

The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:

1.xml
24.xml
12331.xml
2304252.xml

You can sort them into directories like so:

data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml

This scheme will ensure that you never have more than 100 files in each directory (only the first two digits of a name are left to distinguish files within any one directory).
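
A minimal Python sketch of this mapping, assuming the directory levels are simply the characters of the name beyond the first two, reversed; the shard_path helper and the "data" root are illustrative:

from pathlib import Path

def shard_path(filename: str, root: str = "data") -> Path:
    stem = Path(filename).stem          # "12331" for "12331.xml"
    levels = reversed(stem[2:])         # everything past the first two characters, reversed
    return Path(root, *levels, filename)

for name in ["1.xml", "24.xml", "12331.xml", "2304252.xml"]:
    print(shard_path(name))
# data/1.xml
# data/24.xml
# data/1/3/3/12331.xml
# data/2/5/2/4/0/2304252.xml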

贪生不怕死
Answer 7 · 2020-02-02 11:44

If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.

The most obvious, general grouping is by date, which gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
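
A sketch of that date-based layout, assuming each reading carries a timestamp; the dated_path helper and file names are illustrative:

from datetime import datetime
from pathlib import Path

def dated_path(taken_at: datetime, filename: str, root: str = "readings") -> Path:
    # e.g. readings/2020/02/02/reading_1050.bin
    return Path(root, f"{taken_at:%Y}", f"{taken_at:%m}", f"{taken_at:%d}", filename)

print(dated_path(datetime(2020, 2, 2, 10, 50), "reading_1050.bin"))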

Even if you are able to improve the filesystem/file-browser performance, it sounds like this is a problem you will run into again in another two or three years... just listing 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to look at only smaller subsets of the files.

Using tools like 'find' (under Cygwin or MinGW) can make the presence of the subdirectory tree a non-issue when browsing files.
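
The same is true of a recursive search in a script; for example, a Python equivalent of 'find' (the root directory and pattern are illustrative):

from pathlib import Path

# Enumerate every reading regardless of which subdirectory it was sharded into.
for path in Path("data").rglob("*.xml"):
    print(path)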
