I really do not understand the difference between #encode and #force_encoding in Ruby for the String class. I understand that "kam".force_encoding("UTF-8") will force "kam" to be in UTF-8 encoding, but how is #encode(encoding) different?
http://ruby-doc.org/core-2.0/String.html#method-i-encoding
The difference is pretty big. force_encoding sets the encoding of the given string but does not change the string itself, i.e. it does not change its representation in memory:
'łał'.bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII').bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII') #=> "\xC5\x82a\xC5\x82"
encode assumes that the current encoding is correct and tries to change the string so that it reads the same way in the second encoding:
'łał'.encode('UTF-16') #=> 'łał'
'łał'.encode('UTF-16').bytes #=> [254, 255, 1, 66, 0, 97, 1, 66]
In short, force_encoding changes the way the string is read from its bytes, and encode changes the way the string is written, without changing the output (if possible).
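To make that summary concrete, here is a minimal sketch that checks both behaviours side by side. It uses the same word as above, written with \u escapes so the snippet does not depend on the source file encoding; the variable names are just for illustration, and UTF-16BE is chosen instead of UTF-16 only to avoid the byte order mark.

```ruby
# "\u0142a\u0142" is the three-character word used above (5 bytes in UTF-8).
original = "\u0142a\u0142"

# force_encoding: relabel a copy; the bytes stay exactly the same.
relabelled = original.dup.force_encoding("ASCII-8BIT")

# encode: transcode; the bytes change, the characters do not.
transcoded = original.encode("UTF-16BE")

same_bytes_after_force  = relabelled.bytes == original.bytes   # true
same_bytes_after_encode = transcoded.bytes == original.bytes   # false

# Relabelling changes how the bytes are counted as characters...
relabelled.length   # 5 (one "character" per byte)
original.length     # 3

# ...while transcoding can be reversed without losing anything.
round_trip = transcoded.encode("UTF-8")
round_trip == original   # true
```

Note the dup before force_encoding: it mutates the receiver, so calling it directly on a frozen string raises an error.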
Read this section of the documentation, "Changing an encoding":
The associated Encoding of a String can be changed in two different ways.
First, it is possible to set the Encoding of a string to a new Encoding without changing the internal byte representation of the string, with String#force_encoding. This is how you can tell Ruby the correct encoding of a string.
Example:
string = "R\xC3\xA9sum\xC3\xA9"
string.encoding #=> #<Encoding:ISO-8859-1>
string.force_encoding(Encoding::UTF_8) #=> "R\u00E9sum\u00E9"
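As an aside to the quoted example, a typical case for force_encoding is data that arrives with a binary label even though you know the bytes are UTF-8. The sketch below simulates that with String#b; the variable names are hypothetical, and valid_encoding? is used to check whether a label actually fits the bytes.

```ruby
# Simulate bytes read from a socket or binary file: valid UTF-8, labelled binary.
raw = "R\xC3\xA9sum\xC3\xA9".b       # String#b returns an ASCII-8BIT copy
raw.length                            # 8, every byte counts as one character

# Tell Ruby what the bytes really are; nothing is converted.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
fixed.valid_encoding?                 # true
fixed.length                          # 6
fixed.bytes == raw.bytes              # true

# A wrong label is accepted silently, but the string is then invalid.
wrong = raw.dup.force_encoding(Encoding::US_ASCII)
wrong.valid_encoding?                 # false
```

force_encoding never validates the bytes against the new label, so checking valid_encoding? afterwards is a cheap safeguard.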
Second, it is possible to transcode a string, i.e. translate its internal byte representation to another encoding. Its associated encoding is also set to the other encoding. See String#encode for the various forms of transcoding, and the Encoding::Converter class for additional control over the transcoding process.
Example:
string = "R\u00E9sum\u00E9"
string.encoding #=> #<Encoding:UTF-8>
string = string.encode!(Encoding::ISO_8859_1) #=> "R\xE9sum\xE9"
string.encoding #=> #<Encoding:ISO-8859-1>
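Because encode really converts characters, it can fail when the target encoding has no equivalent for one of them, which never happens with force_encoding. The following is a small sketch of that, using an illustrative string with a check mark (U+2713) appended and the undef: and replace: options of String#encode.

```ruby
# U+2713 CHECK MARK cannot be represented in ISO-8859-1.
text = "R\u00E9sum\u00E9 \u2713"

# Plain transcoding raises for the unrepresentable character.
error_class = begin
  text.encode(Encoding::ISO_8859_1)
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Asking encode to substitute a replacement makes the conversion lossy but safe.
lossy = text.encode(Encoding::ISO_8859_1, undef: :replace, replace: "?")
lossy.bytes   # [82, 233, 115, 117, 109, 233, 32, 63]
```

Whether a lossy replacement is acceptable depends on the data; raising is the default precisely because the information cannot be recovered afterwards.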