Here is the best method I have come up with so far and I would like to know if there is an even better method (I'm sure there is!) for storing and fetching millions of user images:
In order to keep the directory sizes down and avoid having to make any additional calls to the DB, I am using nested directories that are calculated based on the User's unique ID as follows:
$firstDir = './images';
$secondDir = floor($userID / 100000);             // 0-999 for IDs up to 100M
$thirdDir = floor(substr($userID, -5, 5) / 100);  // 0-999 within each second dir
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";
User IDs ($userID) range from 1 into the millions.
So if I have User ID 7654321, for example, that user's first pic will be stored in:
./images/76/543/7654321/1.jpg
For User ID 654321:
./images/6/543/654321/1.jpg
For User ID 54321 it would be:
./images/0/543/54321/1.jpg
For User ID 4321 it would be:
./images/0/43/4321/1.jpg
For User ID 321 it would be:
./images/0/3/321/1.jpg
For User ID 21 it would be:
./images/0/0/21/1.jpg
For User ID 1 it would be:
./images/0/0/1/1.jpg
This ensures that with up to 100,000,000 users, I will never have a directory with more than 1,000 sub-directories, so it seems to keep things clean and efficient.
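For reference, the scheme above can be wrapped in a small helper function (a sketch; the name imgPath() is mine, not part of the original code):

```php
<?php
// Sketch of the directory scheme described above; imgPath() is a
// hypothetical helper name, not part of the original code.
function imgPath(int $userID): string
{
    $firstDir  = './images';
    $secondDir = floor($userID / 100000);                      // 0-999 for IDs up to 100M
    $thirdDir  = floor(substr((string) $userID, -5, 5) / 100); // 0-999 within each second dir
    return "$firstDir/$secondDir/$thirdDir/$userID/1.jpg";
}

echo imgPath(7654321), "\n"; // ./images/76/543/7654321/1.jpg
echo imgPath(321), "\n";     // ./images/0/3/321/1.jpg
```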
I benchmarked this method against the following "hash" method, which uses the fastest hash function available in PHP (crc32). This "hash" method takes the first 3 characters of the hash of the User ID as the Second Directory and the next 3 characters as the Third Directory, in order to distribute the files randomly but evenly, as follows:
$hash = crc32($userID);        // returns an integer, so substr() below
                               // operates on its decimal digits
$firstDir = './images';
$secondDir = substr($hash, 0, 3);
$thirdDir = substr($hash, 3, 3);
$fourthDir = $userID;
$imgLocation = "$firstDir/$secondDir/$thirdDir/$fourthDir/1.jpg";
However, this "hash" method is slower than the method I described earlier above, so it's no good.
I then went one step further and found an even faster way to calculate the Third Directory than in my original example (floor(substr($userID, -5, 5) / 100)), as follows:
$thirdDir = floor(substr($userID, -5, 3));
Now, this changes how/where the first 10,000 User IDs are stored, making some third-level directories hold either 1 user sub-directory or 111 instead of 100, but it has the advantage of being faster since we no longer divide by 100, so I think it is worth it in the long run.
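To make that trade-off concrete, here is a side-by-side of the two formulas (the helper names are mine, for illustration only); they agree for IDs of 10,000 and up but diverge below that:

```php
<?php
// Comparing the two Third Directory formulas; the helper names
// are hypothetical, used only for this comparison.
function thirdDirDivide(int $userID): int
{
    // original formula: take the last 5 digits, then divide by 100
    return (int) floor(substr((string) $userID, -5, 5) / 100);
}

function thirdDirTruncate(int $userID): int
{
    // faster variant: just take 3 digits, no division
    return (int) floor(substr((string) $userID, -5, 3));
}

echo thirdDirDivide(7654321), ' ', thirdDirTruncate(7654321), "\n"; // 543 543
echo thirdDirDivide(4321), ' ', thirdDirTruncate(4321), "\n";       // 43 432
```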
Once the directory structure is defined, here is how I plan on storing the actual individual images: if a user uploads a 2nd pic, for example, it goes in the same directory as their first pic, but is named 2.jpg. The default pic of the user is always 1.jpg, so if they decide to make their 2nd pic the default, 2.jpg is renamed to 1.jpg and 1.jpg is renamed to 2.jpg.
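That swap could be done with three rename() calls through a temporary name, so that neither file is ever overwritten (a sketch; makeDefault() and the temp-file name are mine):

```php
<?php
// Sketch of swapping pic N with the default 1.jpg via a temporary
// name; makeDefault() is a hypothetical helper, and $dir is assumed
// to be the user's image directory.
function makeDefault(string $dir, int $picNum): void
{
    if ($picNum === 1) {
        return; // already the default pic
    }
    $tmp = "$dir/tmp.jpg";
    rename("$dir/1.jpg", $tmp);               // move the old default aside
    rename("$dir/$picNum.jpg", "$dir/1.jpg"); // promote the chosen pic
    rename($tmp, "$dir/$picNum.jpg");         // old default takes its slot
}
```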
Last but not least, if I needed to store multiple sizes of the same image, I would store them as follows for User ID 1 (for example):
1024px:
./images/0/0/1/1024/1.jpg
./images/0/0/1/1024/2.jpg
640px:
./images/0/0/1/640/1.jpg
./images/0/0/1/640/2.jpg
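Extending the same path logic with a size directory gives something like this sketch (imgPathSized() and its parameter names are mine):

```php
<?php
// Sketch of the sized-image layout above; imgPathSized() is a
// hypothetical helper name.
function imgPathSized(int $userID, int $size, int $picNum): string
{
    $secondDir = floor($userID / 100000);
    $thirdDir  = floor(substr((string) $userID, -5, 5) / 100);
    return "./images/$secondDir/$thirdDir/$userID/$size/$picNum.jpg";
}

echo imgPathSized(1, 1024, 1), "\n"; // ./images/0/0/1/1024/1.jpg
echo imgPathSized(1, 640, 2), "\n";  // ./images/0/0/1/640/2.jpg
```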
That's about it.
So, are there any flaws with this method? If so, could you please point them out?
Is there a better method? If so, could you please describe it?
Before I embark on implementing this, I want to make sure I have the best, fastest, and most efficient method for storing and retrieving images, so that I don't have to change it later.
Thanks!