Thousands of images, how should I organize the dir

Posted 2019-01-16 13:17

做个烂人

Question:

I am getting thousands of pictures uploaded by thousands of users on my Linux server, which is hosted by 1and1.com (I believe they use CentOS, but I am unsure of the version). This is a language-agnostic question, but for reference, I am using PHP.

My first thought was to just dump them all in the same directory. However, I remember that a while ago there was a limit on how many files or directories a single directory could hold.

My second thought was to partition the files into directories based on each user's email address (which is what I am using for the user name anyhow), but I don't want to run into the limit on directories in a directory....

Anyhow, for images from user@domain.com, I was going to do this:

/images/domain.com/user/images...

Is this smart to do? What if thousands of users are on, say, 'gmail'? Perhaps I could even go deeper, like this:

/images/domain.com/[first letter of user name]/user/images...

so for mike@gmail.com it would be...

/images/gmail.com/m/mike/images...

Is this a bad approach? What is everyone else doing? I don't want to run into problems with too many directories either...

How many files in a directory is too many?
Optimum web folder structure for ~250,000 images
How to store images in your filesystem
Tips for managing a large number of files?

Answer 1:

I would do the following:

1. Take an MD5 hash of each image as it comes in.
2. Write that MD5 hash to the database where you are keeping track of these things.
3. Store the images in a directory structure where the first couple of characters of the MD5 hex string are used as directory names. So if the hash is 'abcdef1234567890', you would store the file as 'a/b/abcdef1234567890'.

Using a hash also lets you merge the same image uploaded multiple times.
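
As a rough illustration of the steps in this answer, here is a minimal Java sketch. The class and method names (HashedImagePath, md5Hex, relativePath) are made up for this example, and the database write is left out; it only shows how the hash and the 'a/b/...' path could be derived.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the original answer.
final class HashedImagePath {

    /** Hex-encoded MD5 digest of the raw image bytes (32 lowercase hex characters). */
    static String md5Hex(byte[] imageBytes) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(imageBytes)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected with the standard JDK providers; surfaced as an unchecked error
            throw new IllegalStateException(e);
        }
    }

    /** Relative path of the form "<1st hex char>/<2nd hex char>/<full hash>". */
    static String relativePath(String hashHex) {
        return hashHex.charAt(0) + "/" + hashHex.charAt(1) + "/" + hashHex;
    }
}
```

With one hex character per level, this gives 16 x 16 = 256 leaf directories. Two uploads with identical bytes produce the same path, which is what makes the merging mentioned above possible.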

Answer 2:

To expand upon Joe Beda's approach:

database
database
database

If you care about grouping or finding files by user, original filename, upload date, photo-taken-on date (EXIF), etc., store this metadata in a database and use the appropriate queries to pick out the appropriate files.

Use the database primary key (whether a file hash or an autoincrementing number) to locate files among a fixed set of directories. Alternatively, use a fixed maximum number of files N per directory, and move on to the next directory when one fills up: the kth photo is stored at {somepath}/aaaaaa/bbbb.jpg, where aaaaaa = floor(k/N) and bbbb = mod(k,N), each formatted as decimal or hex. If that's too flat a hierarchy for you, use something like {somepath}/aa/bb/cc/dd/ee.jpg.
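
The floor(k/N) and mod(k,N) arithmetic above could be sketched in Java like this. The class name, the zero-padded decimal formatting and the choice of N in the test values are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical helper, not part of the original answer.
final class SequentialImagePath {

    /**
     * Relative path for the k-th image when each directory holds at most
     * filesPerDir files: directory = floor(k / N), file = k mod N.
     */
    static String pathFor(long k, int filesPerDir) {
        if (k < 0 || filesPerDir <= 0) {
            throw new IllegalArgumentException("k must be >= 0 and filesPerDir must be > 0");
        }
        long dir = k / filesPerDir;   // integer division equals floor for non-negative k
        long file = k % filesPerDir;
        return String.format(Locale.ROOT, "%06d/%04d.jpg", dir, file);
    }
}
```

For example, with N = 1000 the photo with k = 12345 lands in directory 000012 as file 0345.jpg.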

Don't expose the directory structure directly to your users. If they are using web browsers to access your server via HTTP, give them a url like www.myserver.com/images/{primary key} and encode the proper filetype in the Content-Type header.
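
One way to picture this indirection is the Java sketch below. It assumes an in-memory map standing in for the database table, and the ImageLookup and StoredImage names are invented here. The key from the URL is only ever used as a lookup key, never as part of a file path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the map stands in for a real database table.
final class ImageLookup {

    /** What the database would return for one image. */
    static final class StoredImage {
        final String relativePath;  // internal location, never sent to the client
        final String contentType;   // value for the Content-Type response header

        StoredImage(String relativePath, String contentType) {
            this.relativePath = relativePath;
            this.contentType = contentType;
        }
    }

    private final Map<Long, StoredImage> rows = new HashMap<>();

    void register(long id, String relativePath, String contentType) {
        rows.put(id, new StoredImage(relativePath, contentType));
    }

    /**
     * Resolves the key taken from a URL such as /images/{primary key}.
     * Returns null for unknown or malformed keys, which the caller
     * would turn into a 404 response.
     */
    StoredImage resolve(String keyFromUrl) {
        try {
            return rows.get(Long.parseLong(keyFromUrl));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Because the client never supplies a path, the directory layout can be changed later without breaking any URLs.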

Answer 3:

Here are two functions I wrote a while back for exactly this situation. They've been in use for over a year on a site with thousands of members, each of whom has lots of files.

In essence, the idea is to use the last digits of each member's unique database ID to calculate a directory structure, with a unique directory for everyone. Using the last digits, rather than the first, ensures a more even spread of directories: with sequentially assigned IDs the trailing digits cycle through every value, while the leading digits stay the same for long runs. A separate directory for each member makes maintenance tasks a lot simpler, plus you can see at a glance where people's stuff is.

// checks for member-directories & creates them if required
function member_dirs($user_id) {

    $user_id = sanitize_var($user_id);

    // last 1, 2 and 3 digits of the ID (a negative offset counts from the end of the string)
    $dir_1 = substr($user_id, -1);
    $dir_2 = substr($user_id, -2);
    $dir_3 = substr($user_id, -3);

    $user_dir[0] = $GLOBALS['site_path'] . "files/members/" . $dir_1 . "/";
    $user_dir[1] = $user_dir[0] . $dir_2 . "/";
    $user_dir[2] = $user_dir[1] . $dir_3 . "/";
    $user_dir[3] = $user_dir[2] . $user_id . "/";
    $user_dir[4] = $user_dir[3] . "sml/";
    $user_dir[5] = $user_dir[3] . "lrg/";

    foreach ($user_dir as $this_dir) {
        if (!is_dir($this_dir)) { // directory doesn't exist
            if (!mkdir($this_dir, 0777)) { // attempt to make it with read, write, execute permissions
                return false; // bug out if it can't be created
            }
        }
    }

    // if we've got to here all directories exist or have been created so all good
    return true;

}

// accompanying function to above
function make_path_from_id($user_id) {

    $user_id = sanitize_var($user_id);

    // last 1, 2 and 3 digits of the ID (a negative offset counts from the end of the string)
    $dir_1 = substr($user_id, -1);
    $dir_2 = substr($user_id, -2);
    $dir_3 = substr($user_id, -3);

    $user_path = "files/members/" . $dir_1 . "/" . $dir_2 . "/" . $dir_3 . "/" . $user_id . "/";
    return $user_path;

}

sanitize_var() is a supporting function for scrubbing input & ensuring it's numeric, $GLOBALS['site_path'] is the absolute path for the server. Hopefully, they'll be self-explanatory otherwise.
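
For readers not using PHP, the path calculation can be sketched in Java as follows. This is only an illustrative port under the assumption that IDs are non-negative integers; the MemberPath name is invented, and directory creation is omitted.

```java
// Hypothetical port of the path logic; not the author's code.
final class MemberPath {

    /** Builds "files/members/<last 1 digit>/<last 2 digits>/<last 3 digits>/<id>/". */
    static String pathFor(long userId) {
        if (userId < 0) {
            throw new IllegalArgumentException("user ID must be non-negative");
        }
        String id = Long.toString(userId);
        return "files/members/" + lastDigits(id, 1) + "/" + lastDigits(id, 2) + "/"
                + lastDigits(id, 3) + "/" + id + "/";
    }

    /** Last n characters of id, or the whole string when it is shorter than n. */
    private static String lastDigits(String id, int n) {
        return id.substring(Math.max(0, id.length() - n));
    }
}
```

Member 12345 ends up under files/members/5/45/345/12345/, so each level holds at most ten subdirectories until the final per-member level.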

Answer 4:

For a different requirement I used a simple convention that could also fit your needs.

Increment the last ID by 1, take the number of digits in the new ID, and prefix the ID with that digit count.

For example:

Assume 'a' is a variable that holds the last ID.

a = 564;                 // last ID issued
++a;                     // 565
prefix = length(a);      // number of digits in a: 3
id = concat(prefix, a);  // string concatenation, not addition: "3565"

Then, you can use a timestamp for the directory, using this convention:

20090523 (yyyymmdd)

Then you can explode your path like this:

2009/05/23/3565.jpg

(or more)

It's interesting because you can keep a sort order by date and by number at the same time (sometimes useful), and you can still decompose your path into more directories.
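
A small Java sketch of this convention, with invented names (DatedImagePath, lengthPrefixedId) and a fixed .jpg extension assumed for simplicity:

```java
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical helper, not part of the original answer.
final class DatedImagePath {

    /** Prefixes the ID with its own digit count, e.g. 565 becomes "3565". */
    static String lengthPrefixedId(long id) {
        String digits = Long.toString(id);
        return digits.length() + digits;
    }

    /** Relative path of the form yyyy/mm/dd/<length-prefixed id>.jpg. */
    static String pathFor(long id, LocalDate uploadDate) {
        return String.format(Locale.ROOT, "%04d/%02d/%02d/%s.jpg",
                uploadDate.getYear(), uploadDate.getMonthValue(),
                uploadDate.getDayOfMonth(), lengthPrefixedId(id));
    }
}
```

The digit-count prefix makes plain string sorting agree with numeric order: "3999" sorts before "41000". That holds while IDs have at most nine digits, since a prefix of "10" would sort before "9".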

Answer 5:

Joe Beda's answer is almost perfect, but please note that MD5 has been proven to be collidable in, iirc, 2 hours on a laptop?

That said, if you actually use the file's MD5 hash in the described way, your service will become vulnerable to attacks. What would the attack look like?

1. A hacker doesn't like a particular photo.
2. He confirms that you are using plain MD5 (an MD5 of image+secret_string can scare him off).
3. He uses a magic method to craft a picture of (use your imagination here) whose hash collides with that of the photo he doesn't like.
4. He uploads the crafted photo like he normally would.
5. Your service overwrites the old photo with the new one, which is now displayed in both places.

Someone might say: let's not overwrite it, then. But if it's possible to predict that someone will upload something (e.g. a popular picture on the web might get uploaded), it's possible to take its "hash place" first. A user who uploads a picture of a kitty would then find that it actually appears as (use your imagination here). I say: use SHA1, as it's been proven to be hackable in, iirc, 127 years by a cluster of 10,000 computers?
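
The "image+secret_string" idea mentioned above could be sketched in Java with a keyed hash. Note that this swaps in HMAC-SHA256 rather than the MD5 or SHA1 named in the thread (a practical SHA-1 collision was published in 2017), and the KeyedImageHash name and the way the secret is supplied are assumptions for illustration.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; HMAC-SHA256 is a substitution, not what the answer prescribes.
final class KeyedImageHash {

    /**
     * Hex-encoded HMAC-SHA256 of the image bytes under a server-side secret.
     * Without the secret, an uploader cannot compute the resulting name in advance.
     */
    static String keyedHashHex(byte[] imageBytes, byte[] serverSecret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(serverSecret, "HmacSHA256"));
            StringBuilder hex = new StringBuilder();
            for (byte b : mac.doFinal(imageBytes)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            // Covers a missing algorithm or an invalid key
            throw new IllegalStateException(e);
        }
    }
}
```

Identical images still map to the same name as long as the secret stays the same, so duplicate uploads can still be merged.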

Answer 6:

I might be late to the game on this, but one solution (if it fits your use case) could be file name hashing. It is a way to create an easily reproducible file path from the name of the file while also creating a well-distributed directory structure. For example, you can use the bytes of the filename's hash code as its path:

String fileName = "cat.gif";
int hash = fileName.hashCode();
int mask = 255;
int firstDir = hash & mask;
int secondDir = (hash >> 8) & mask;

This would result in the path being:

/172/029/cat.gif

You can then find cat.gif in the directory structure by reproducing the algorithm.

Using hex for the directory names is as easy as converting the int values:

String path = new StringBuilder(File.separator)
        .append(String.format("%02X", firstDir))   // uppercase hex, e.g. "AC"
        .append(File.separator)
        .append(String.format("%02X", secondDir))
        .append(File.separator)
        .append(fileName)
        .toString();

Resulting in:

/AC/1D/cat.gif

I wrote an article about this a few years ago and recently moved it to Medium. It has a few more details and some sample code: File Name Hashing: Creating a Hashed Directory Structure. Hope this helps!