How do URL shorteners guarantee unique URLs when t

2019-03-09 03:28发布

问题:

There are a lot of questions here on stackoverflow regarding URL shorteners as well as elsewhere on the internet, e.g.

How to code a URL shortener?

How do URL shortener calculate the URL key? How do they work?

http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html

However, there is one thing I don't understand. For instance, http://goo.gl uses four characters at the moment. However, they claim their short URLs don't expire. As mentioned in the article on coding horror, if they can't recycle URLs, the only possible solution is at one point to add an additional character.

Ok, so far so good. With 4 characters that means about 15 million unique addresses. For something like Google Maps, I don't think that is very much and if you can't recycle, my guess is they run out of available addresses fairly quickly.

Now for the part I don't get. While handing out addresses, they start to run out of unused addresses. They have to check if a newly generated address has not been issued yet. The chance this has happened and the address is already in use increases. The straightforward solution of course is to generate a new URL over and over again until they find a free one or until they have generated all 1.5 million alternatives. However, this surely can't be how they actually do it, because this would be far too time-consuming. So how do they manage this?

Also, there are probably several visitors at once asking for a short URL, so they must have some synchronization going on as well. But how should the situation be managed when the fifth character needs to be added?

Finally, when doing some research on how the URLs from http://goo.gl work, of course I requested a short URL for a map on Google Maps several times. None of them will ever be used. However, when Google would strictly enforce the policy of URLs not expiring once issued, this means that there are lots and lots of dormant URLs in the system. Again, I assume Google (and the other services as well) have come up with a solution to this problem as well. I could imagine a clean up service which recycle URLs which have not been visited in the first 48 hours after creation or less then 10 times in the first week. I hope someone can shed some light on this issue as well.

In short, I get the general principle of URL shorteners, but I see several problems when these URLs cannot expire. Does anyone know how the problems mentioned above might be resolved and are there any other problems?


EDIT

Ok, so this blog post sheds some light on things. These services don't randomly generate anything. They rely on the underlying database's auto-increment functionality and apply a simple conversion on the resulting id. That eliminates the need to check if an id already exists (it doesn't) and the database handles synchronization. That still leaves one of my three questions unanswered. How do these services "know" if a link is actually used once created?

回答1:

Why URL shorteners don't delete entries

I did write TinyURL (a decade ago) to give back an entry that I didn't need. Their reply made me realize how ridiculous I was: they told me "just create all the URLs you need". And the figures speak themselves:

A - With 26 low-case + 26 upper case letters + 10 digits (the choice of reasonable sites), using ONE character gives you 62 positions (i.e. 62 shortened URLs), then each additional char multiplies the position number by 62:

  • 0 char = 1 URLs
  • 1 char = 62 URLs
  • 2 chars = 3,844 (1 URL for each human in a village)
  • 3 chars = 238,328 (idem, in a city)
  • 4 chars = 14,776,336 (in Los Angeles area)
  • 5 chars = 916,132,832 (in Americas, N+Central+S)
  • 6 chars ~ 56,800,235,580 (8 URLs for each human in the world)
  • 7 chars ~ 3,521,614,606,000 (503 for each human, 4 for each web page in the world)
  • 8 chars ~ 218,340,105,600,000 (31,191 URLs for each human)
  • 9 chars ~ 13,537,708,655,000,000 (~2 million URLs for each human)
  • 10 chars ~ 839,299,365,900,000,000 (~120 billions URLs for each human)
  • 11 chars ~ 52,036,560,680,000,000,000

B - Actually the needs and uses are lower than what one could expect. Few people are creating short URLs, and each person creates few URLs. Original URLs are enough in most cases. Results are that the most popular shorteners, after years, still cover the needs today with just 4 or 5 chars, and adding another one when needed will cost nearly zero. Apparently goo.gl and goo.gl/maps are each using 5 chars, youtube is using 11 (using the 62 chars above, plus the dash and maybe a few others).

C - The cost of hosting (storing + operating) an URL is, say, $1000/year for 1Terabyte, with each TB able to contain 5 billions URLs, hence 1 URL costs 0.2 micro-dollar/year in hosting. However the benefit for the Shortener is also very thin, whence the business is not very strong. For the user, the benefit of an URL is hard to evaluate, however a missed link would cost far more than the hosting.

D - There is no point for the user to create a short URL if it risks to become inoperative in the next years, so persistency is a major appeal of a Shortener, and a serious Shortener will probably never cease to serve them unless they are forced out of business; yet this happened already, and anyway Short URLs have their load of drawbacks as well as benefits, as well explained in Wikipedia "URL shortening" (risks of all sorts of hacking, against the user, or the target sites, or the Shortener; e.g. one could attack a Shortener by bot-requesting giga-numbers of URLs, a threat surely fended off by most Shorteners).

Versailles, Tue 12 Mar 2013 20:48:00 +0100, edited 21:01:25