Tips for managing a large number of files?

Posted 2019-01-22 04:04

Question:

There are some very good questions here on SO about file management and storage within a large project.

Storing Images in DB - Yea or Nay?
Would you store binary data in database or in file system?

The first one has some great insights, and for my project I've decided to go the file route and not the DB route.

A major point against using the filesystem is backup. But our system has a solid backup scheme, so I am not worried about that.

The next question is how to store the actual files. I've thought about keeping each file's physical location fixed at all times and building a virtual directory system on the database side, so that links to a file never change.

The system I am building will have one global file store, so all files are accessible to all users. But many who have gone the file route mention problems with physical directory size (for example, when all the files sit in one directory).

So my question is: what are some tips or best practices for laying out folders for these static files? Or should I not go the virtual directory route at all?

(The project is on the LAMP stack (PHP), if that helps at all.)

Answer 1:

One way is to assign a unique number to each file and use it to look up the actual file location. You can then use that number to distribute files across different directories in the filesystem. For example, you could use a scheme like this:

/images/{0}/{1}/{2}

{0}: file_number % 100
{1}: floor(file_number / 100) % 100
{2}: file_number
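
A minimal sketch of the scheme above, written in Python purely for illustration (the function name `sharded_path` and the `root` default are assumptions, not part of the original answer). It assumes non-negative integer IDs and uses integer division for {1}:

```python
from pathlib import PurePosixPath


def sharded_path(file_number: int, root: str = "/images") -> PurePosixPath:
    """Map a numeric file ID to root/{0}/{1}/{2} as in the scheme above."""
    if file_number < 0:
        raise ValueError("file_number must be non-negative")
    level0 = file_number % 100            # {0}: last two decimal digits
    level1 = (file_number // 100) % 100   # {1}: next two digits (integer division)
    return PurePosixPath(root) / str(level0) / str(level1) / str(file_number)


print(sharded_path(123456))  # /images/56/34/123456
```

With two levels of 100 buckets each, files spread over up to 10,000 leaf directories. If this is ported to PHP, note that the `/` operator can return a float, so `intdiv()` (PHP 7+) or `floor()` would be needed for {1}.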



Answer 2:

I ran into this problem some time ago on a website that hosted a lot of files. What we did was take a GUID (which is also the primary key of the file record), e.g. BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301, and store the file like this: /B/C/C/BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301/filename.ext

This has certain advantages:

  • You can spread the files across multiple file servers (and assign specific directories to each one)
  • You don't have to rename the file
  • Your directories are guaranteed to be unique
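
The layout described above could be sketched as follows. This is an illustration in Python, not the original site's code: `guid_path` and its parameters are hypothetical names, and upper-casing the GUID is an assumption made here so that the same ID always maps to the same directory on a case-sensitive filesystem.

```python
import uuid
from pathlib import PurePosixPath


def guid_path(file_id: uuid.UUID, filename: str, root: str = "/") -> PurePosixPath:
    """Build root/<c1>/<c2>/<c3>/<GUID>/<filename> from the file's GUID."""
    # Reject names that could escape the per-file directory.
    if "/" in filename or filename in ("", ".", ".."):
        raise ValueError("filename must be a plain file name")
    guid = str(file_id).upper()  # assumption: normalise case for stable paths
    return PurePosixPath(root) / guid[0] / guid[1] / guid[2] / guid / filename


file_id = uuid.UUID("BCC46E3F-2F7A-42b1-92CE-DBD6EC6D6301")
print(guid_path(file_id, "filename.ext"))
# /B/C/C/BCC46E3F-2F7A-42B1-92CE-DBD6EC6D6301/filename.ext
```

Because each file gets its own GUID-named directory, the original filename can be kept unchanged, which matches the second advantage listed above.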

Hope this helps!



Answer 3:

To avoid creating an excessive number of entries in a single directory, you may want to derive the directories from pieces of the filename. For instance, if you have a file named d7f5ae9b7c5a.png, you could store it in media/d7/f5/d7f5ae9b7c5a.png. If your filenames are all hexadecimal, each two-character piece has 256 possible values, which restricts the number of entries in a single directory to 256 at every level except the final one.
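
A small sketch of this prefix-based layout, in Python for illustration only. The name `prefix_path` and the `levels` and `width` parameters are assumptions introduced here; the answer itself only describes two levels of two characters each.

```python
from pathlib import PurePosixPath


def prefix_path(filename: str, root: str = "media",
                levels: int = 2, width: int = 2) -> PurePosixPath:
    """Place a file under directories taken from the start of its name."""
    stem = filename.split(".", 1)[0]  # part of the name before the extension
    if len(stem) < levels * width:
        raise ValueError("filename too short for the requested directory depth")
    # Cut the first levels*width characters into fixed-width directory names.
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root).joinpath(*parts, filename)


print(prefix_path("d7f5ae9b7c5a.png"))  # media/d7/f5/d7f5ae9b7c5a.png
```

This only spreads files evenly if the leading characters of the names are well distributed, as they would be for hash-derived or random hexadecimal names.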



Answer 4:

  1. One user image is about 100 KB. With 10,000 users in the database and an average of 5 images each, the images alone add roughly 5 GB to the DB (100 KB × 5 × 10,000). Every image request then goes through the DB, and this extra traffic reduces overall DB server performance. You could use a DB cluster to avoid this, but that is likely to be expensive.

  2. A user reports an error on the live database (on test, everything works correctly). How would you create a dump and unpack it on a developer's machine? How long would it take?

  3. At some point you may decide to move the images to a CDN. What changes would that require in your source code?



Answer 5:

I usually take this approach:

Have a global settings variable for your application that points to the folder where you store uploaded files. In your database store the relative paths to the files (relative to what the settings variable points to).

So if a file is located at /www/uploads/image.jpg and your settings variable points to /www/uploads, your database row holds image.jpg. This is a flexible approach that decouples your system's directory structure from your application.

Further, you can fragment file storage into directories based on which database tables the files relate to. Say you have a table user_reports and a table user_photos. You store the files that relate to user_reports in /www/uploads/user_reports. If you have a large number of user uploads, you can take the fragmentation even further. Say a user uploads a file called report.pdf on 20.03.2009; you store it at /www/uploads/user_reports/2009/03/20/report.pdf.
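
A rough sketch of this approach, in Python for illustration. `UPLOAD_ROOT` stands in for the global settings variable, and `relative_upload_path` is a hypothetical helper; only the relative path it returns would be stored in the database row.

```python
from datetime import date
from pathlib import PurePosixPath

# Stand-in for the application's global settings variable.
UPLOAD_ROOT = PurePosixPath("/www/uploads")


def relative_upload_path(table: str, uploaded_on: date, filename: str) -> PurePosixPath:
    """Return the path to store in the DB: <table>/<YYYY>/<MM>/<DD>/<filename>."""
    return PurePosixPath(table) / f"{uploaded_on:%Y/%m/%d}" / filename


rel = relative_upload_path("user_reports", date(2009, 3, 20), "report.pdf")
print(rel)                # user_reports/2009/03/20/report.pdf
print(UPLOAD_ROOT / rel)  # /www/uploads/user_reports/2009/03/20/report.pdf
```

Moving the upload folder then only means changing `UPLOAD_ROOT`. One thing this sketch does not handle is two uploads with the same filename on the same day, which would need a unique prefix or ID.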



Answer 6:

I can't say much about how Apache and PHP manage files, but I can say something about the ext3 file system. ext3 does not seem to have problems with large numbers of files in the same directory; I've tested it with up to a million files. Make sure the dir_index option is enabled on the file system before creating the directories. You can check by running dumpe2fs and change the option by running tune2fs. Hashing the files into a tree of subdirectories can still be useful, because command-line tools can still have problems listing the contents of a very large directory.