URL shortener: best encoding method?

Posted 2019-03-13 09:29

Question:

I'm creating a link shortening service and I'm using base64 encoding/decoding of an incremented ID field to create my urls. A url with the ID "6" would be: http://mysite.com/Ng==

I need to also allow users to create a custom url name, like http://mysite.com/music

Here's my (possibly faulty) approach so far. Help in fixing it would be appreciated.

When someone creates a new link:

  • I get the largest link ID from the database (it's not auto incremented)
  • Increment the ID by 1
  • Generate a short URL code (http://website.com/[short url name]) by base64_encoding that ID
  • Insert into links table: id, short_url_code, destination_url

When someone creates a new link and passes a custom short URL:

  • My plan was base64_decode their custom string and use that as the link ID, but I didn't realize that you can't just base64_decode any alphanumeric string and turn it into a number.

Is there a better encoding method that will let me turn any number into a short string, and any string into a number, so I can always lookup short urls (whether custom or autogenerated) by turning the name into a number and querying for a link with an ID equal to that number?

Answer 1:

First and foremost, make sure you have uniqueness constraints in place on the ID and short_url_code columns.

When someone creates a new link:

  1. Get the next link ID from the database (for performance reasons, and to avoid collisions between parallel inserts, you should really use autoincrement or SEQUENCE, depending on what your RDBMS offers; otherwise go ahead and select MAX(ID)+1)
  2. Generate a short URL code (http://website.com/[short url name]) from ID using base64_encode or any other custom or standard encoding scheme
  3. Insert into the links table: ID, short_url_code, destination_url
  4. If the insert fails because of a constraint violation go back to step 1 to try a new ID; you may have had a violation because:

    1. the same ID has already been used (i.e. inserted) in parallel by another thread/process etc. (this will not happen if you used autoincrement or SEQUENCE, and may happen quite often otherwise), and/or
    2. the same short_url_code has already been used as a custom URL (this will happen very rarely unless someone is trying to cause trouble on your site)
  5. If the insert succeeded, commit and return the short URL to the user
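
The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: it assumes Python with the standard sqlite3 module, an in-memory database, and hypothetical helper names (open_db, encode_id, create_link). It uses the MAX(ID)+1 variant so the retry path is visible; with autoincrement or a SEQUENCE the ID collision cannot occur.

```python
import base64
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory links table with uniqueness constraints (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links ("
        " id INTEGER PRIMARY KEY,"
        " short_url_code TEXT NOT NULL UNIQUE,"
        " destination_url TEXT NOT NULL)"
    )
    return conn


def encode_id(link_id: int) -> str:
    """Base64-encode the decimal digits of the ID, as in the question (6 -> 'Ng==')."""
    return base64.b64encode(str(link_id).encode("ascii")).decode("ascii")


def create_link(conn: sqlite3.Connection, destination_url: str, max_attempts: int = 5) -> str:
    """Insert a link with a generated code, retrying on constraint violations."""
    candidate = 0
    for _ in range(max_attempts):
        # Step 1: next ID, never reusing a candidate that already failed.
        next_id = conn.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM links").fetchone()[0]
        candidate = max(next_id, candidate + 1)
        # Step 2: derive the short code from the ID.
        code = encode_id(candidate)
        try:
            # Steps 3 and 5: insert and commit; "with conn" rolls back on error.
            with conn:
                conn.execute(
                    "INSERT INTO links (id, short_url_code, destination_url) VALUES (?, ?, ?)",
                    (candidate, code, destination_url),
                )
            return code
        except sqlite3.IntegrityError:
            # Step 4: the ID or the code is already taken; try a new ID.
            continue
    raise RuntimeError("could not allocate a short URL code")
```

Note that a plain MAX(ID)+1 would return the same ID again when the collision is on short_url_code rather than on ID, which is why the sketch also advances past the failed candidate.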

When someone creates a new link and passes a custom short URL:

  1. Perform the same step 1 as above
  2. Instead of generating the short URL part from ID as in step 2 above, use the custom short_url_code provided by the user
  3. Perform the same step 3 as above
  4. If the insert failed because of:
    1. a constraint violation on ID: go back to step 1 to try a new ID
    2. a constraint violation on short_url_code: return an error to the user asking them to pick a different custom URL, as the one they provided has already been used
  5. Perform the same step 5 as above
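
A rough sketch of the custom-URL path, again assuming Python's sqlite3 module and hypothetical names (open_links_db, create_custom_link, ShortCodeTaken). Here the database assigns the ID itself, so the only uniqueness violation that can occur is on short_url_code, and it is reported back to the caller instead of being retried.

```python
import sqlite3


class ShortCodeTaken(Exception):
    """Raised when the requested custom short URL code is already in use."""


def open_links_db() -> sqlite3.Connection:
    """Create an in-memory links table; the id column is assigned by SQLite (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " short_url_code TEXT NOT NULL UNIQUE,"
        " destination_url TEXT NOT NULL)"
    )
    return conn


def create_custom_link(conn: sqlite3.Connection, custom_code: str, destination_url: str) -> int:
    """Insert a link under a user-chosen code and return the ID the database assigned."""
    try:
        # "with conn" commits on success and rolls back if the insert fails.
        with conn:
            cursor = conn.execute(
                "INSERT INTO links (short_url_code, destination_url) VALUES (?, ?)",
                (custom_code, destination_url),
            )
    except sqlite3.IntegrityError as exc:
        # The ID is generated by the database, so the violated constraint
        # is the UNIQUE one on short_url_code: ask for a different name.
        raise ShortCodeTaken(custom_code) from exc
    return cursor.lastrowid
```

With this layout, lookups go through short_url_code for both custom and autogenerated links, so a custom name never has to be decoded into a number.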


Answer 2:

Base64 can be used to make short URLs, but it can also make the URL longer. For instance, the base64_encode of the number 1 is 'MQ==', which is 4 times the size of the input. Base64 turns every 3 input bytes into 4 output characters and pads the result with '=' to a multiple of 4, so an encoded ID is never shorter than 4 characters, which is not ideal for short URLs.
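
To make the size difference concrete, here is a small comparison sketch in Python. The base62 encoder is not something the answer proposes; it is a commonly used alternative, shown here with a hypothetical alphabet and function names, that converts the integer itself rather than its decimal digits.

```python
import base64

# Assumed alphabet: digits, lowercase, uppercase (62 URL-safe symbols).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base64_of_decimal(n: int) -> str:
    """Base64 of the decimal digits of n, as base64_encode would produce."""
    return base64.b64encode(str(n).encode("ascii")).decode("ascii")


def base62_encode(n: int) -> str:
    """Write a non-negative integer in base 62 using ALPHABET."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode; raises ValueError on characters outside ALPHABET."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    for link_id in (1, 6, 1000000):
        print(link_id, base64_of_decimal(link_id), base62_encode(link_id))
```

For the ID 1000000, the base64 form of the decimal digits is 12 characters long, while the base62 form needs 4.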

If size is the most important factor, then you may be able to produce the shortest URLs by relying on internationalization, that is, by using non-ASCII (Unicode) characters in the short code.

Percent-encoding such characters can make a URI rather long (up to 9 ASCII characters for a single Unicode character in the Basic Multilingual Plane, and 12 for one outside it), but the intention is that browsers only need to display the decoded form, and many protocols can send UTF-8 without the %HH escaping.
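
A quick way to see the escaped length is Python's urllib.parse.quote, which percent-encodes the UTF-8 bytes of a string. The sample characters and the helper name escaped_length are chosen here purely for illustration. A character in the Basic Multilingual Plane takes at most three UTF-8 bytes (9 characters once escaped); a character outside it takes four bytes (12 characters).

```python
from urllib.parse import quote


def escaped_length(text: str) -> int:
    """Length of text after %HH escaping of its UTF-8 bytes."""
    return len(quote(text, safe=""))


if __name__ == "__main__":
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        # Prints the raw character, its escaped form, and the escaped length.
        print(ch, quote(ch, safe=""), escaped_length(ch))
```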

Keep in mind that browsers work quite well with UTF-8, and Twitter will have no trouble with these URLs.