My name is Festus.
I need to convert strings to and from Base64 in a browser via JavaScript. The topic is covered quite well on this site and on Mozilla, and the suggested solution seems to be along these lines:
function toBase64(str) {
    return window.btoa(unescape(encodeURIComponent(str)));
}
function fromBase64(str) {
    return decodeURIComponent(escape(window.atob(str)));
}
I did a bit more research and found out that escape() and unescape() are deprecated and should no longer be used. With that in mind, I tried removing the calls to the deprecated functions, which yields:
function toBase64(str) {
    return window.btoa(encodeURIComponent(str));
}
function fromBase64(str) {
    return decodeURIComponent(window.atob(str));
}
This seems to work, but it raises the following questions:
(1) Why did the originally proposed solution include calls to escape() and unescape()? The solution was proposed prior to deprecation, but presumably these functions added some kind of value at the time.
(2) Are there certain edge cases where my removal of these deprecated calls will cause my wrapper functions to fail?
NOTE: There are other, far more verbose and complex solutions on StackOverflow to the problem of string=>Base64 conversion. I'm sure they work just fine but my question is specifically related to this particular popular solution.
Thanks,
Festus
TL;DR: In principle, escape()/unescape() are not necessary, and your second version without the deprecated functions is safe, but it generates longer Base64 output:
console.log(decodeURIComponent(atob(btoa(encodeURIComponent("€uro")))))
console.log(decodeURIComponent(escape(atob(btoa(unescape(encodeURIComponent("€uro")))))))
Both print "€uro", but the version without escape()/unescape() has a longer Base64 representation:
btoa(encodeURIComponent("€uro")).length // = 16
btoa(unescape(encodeURIComponent("€uro"))).length // = 8
The escape()/unescape() step only becomes necessary if the counterpart (e.g. a PHP script you cannot change) expects the Base64 to be produced in that specific way.
Long version:
First, to better understand the difference between the two versions of toBase64() and fromBase64() that you suggest above, let us have a look at btoa(), which is at the core of the issue. The documentation says that the name btoa is a mnemonic in which the "b" stands for "binary" and the "a" for "ASCII", which is somewhat misleading, as the documentation hastens to add that in practice both the input and the output are strings.
On top of that, btoa() only accepts characters in the range U+0000 to U+00FF. Put plainly, only text made of Latin-1 characters (English text plus a number of Western European accented letters) works with btoa().
The purpose of encodeURIComponent(), which you have in both of your versions, is to help out with strings that contain characters outside the range U+0000 to U+00FF. An example would be the string "aä€", which has three characters:
a (U+0061)
ä (U+00E4)
€ (U+20AC)
Here only the first two characters are in range. The third character, the Euro sign, is outside it, and window.btoa("€") raises an out-of-range error. To avoid such an error, we need a way to represent "€" within the set U+0000 to U+00FF. This is what window.encodeURIComponent() does:
window.encodeURIComponent("aä€")
creates the following string:
"a%C3%A4%E2%82%AC"
in which some characters have been encoded:
a = a (stayed the same)
ä = %C3%A4 (changed to its UTF-8 representation)
€ = %E2%82%AC (changed to its UTF-8 representation)
The "changed to its UTF-8 representation" part works by writing the character "%" followed by a two-digit hexadecimal number for each byte of the character's UTF-8 representation. The "%" is U+0025 and hence allowed inside the btoa() range. The result of window.encodeURIComponent("aä€") can then be fed to btoa(), as it no longer has any out-of-range characters:
btoa("a%C3%A4%E2%82%AC") // = "YSVDMyVBNCVFMiU4MiVBQw=="
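The steps above can be tried out with a minimal sketch like the following. The variable names are only illustrative, and it assumes an environment where btoa() is available as a global (a browser, or a recent Node.js):

```javascript
// btoa() rejects any character above U+00FF.
let rejected = false;
try {
  btoa("€");
} catch (e) {
  rejected = true; // browsers report an InvalidCharacterError here
}

// encodeURIComponent() rewrites non-ASCII characters as %xx UTF-8 bytes,
// so its result contains ASCII characters only and is safe for btoa().
const percentEncoded = encodeURIComponent("aä€");
const encoded = btoa(percentEncoded);

console.log(rejected);       // true
console.log(percentEncoded); // "a%C3%A4%E2%82%AC"
console.log(encoded);        // "YSVDMyVBNCVFMiU4MiVBQw=="
```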
The crux of using unescape() between btoa() and encodeURIComponent() is that every byte of the UTF-8 representation uses up three characters (%xx) in order to store any possible byte value from 0x00 to 0xFF. This is where unescape() can play an optional role: it takes each such %xx sequence and replaces it with a single Unicode character in the allowed range U+0000 to U+00FF. To check:
btoa(encodeURIComponent("aä€")).length // = 24
btoa(unescape(encodeURIComponent("aä€"))).length // = 8
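To see where the shorter length comes from, here is a small sketch that inspects what unescape() produces. The variable names are illustrative only, and the deprecated unescape() is used purely for demonstration:

```javascript
const escaped = encodeURIComponent("aä€"); // "a%C3%A4%E2%82%AC", 16 characters
const byteString = unescape(escaped);      // one character per UTF-8 byte

// Character codes of the resulting "byte string", printed as hex.
const codes = Array.from(byteString, (ch) => ch.charCodeAt(0).toString(16));

console.log(escaped.length);    // 16
console.log(byteString.length); // 6
console.log(codes.join(" "));   // "61 c3 a4 e2 82 ac"
console.log(btoa(byteString));  // "YcOk4oKs"
```

Every character of byteString is at most U+00FF, so btoa() accepts it, and it encodes 6 characters instead of 16.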
The main difference is a shorter Base64 representation of the text, at the cost of additional parsing via the optional escape()/unescape(), which is minimal anyway for mainly ASCII text.
The main lesson is that btoa() is misleadingly named and requires characters in the range U+0000 to U+00FF, which encodeURIComponent() by itself produces. The deprecated escape()/unescape() pair only saves space, which may be desirable but is not necessary. The problem of Unicode symbols above U+00FF is known as the btoa/atob "Unicode problem", and discussions of it also describe ways to Base64-encode arbitrary Unicode text as UTF-8 in modern browsers.
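One such modern approach, sketched here under the assumption that TextEncoder and TextDecoder are available (current browsers and Node.js), avoids the deprecated functions altogether. The helper names toBase64Utf8 and fromBase64Utf8 are made up for this example and are not part of any library:

```javascript
// Hypothetical helpers: encode a string as UTF-8 bytes, then Base64.
function toBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one character (<= U+00FF) per byte
  }
  return btoa(binary);
}

function fromBase64Utf8(b64) {
  const binary = atob(b64);
  // Turn each character back into a byte, then decode the bytes as UTF-8.
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const b64 = toBase64Utf8("€uro");
console.log(b64);                 // "4oKsdXJv"
console.log(fromBase64Utf8(b64)); // "€uro"
console.log(b64 === btoa(unescape(encodeURIComponent("€uro")))); // true
```

For this input the output matches the unescape(encodeURIComponent()) variant, so it should also suit a counterpart that expects Base64 of the UTF-8 bytes. One difference to be aware of: for strings containing lone surrogates, encodeURIComponent() throws a URIError, whereas TextEncoder substitutes U+FFFD.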