encodings

Text has to be stored as zeros and ones (naturally…), and there are multiple standards for converting between binary and text.
Let's have a quick overview:

ASCII
ASCII is both a standard and an encoding. As an encoding, ASCII uses only 7 bits per character, which gives 128 possible values. As a standard, its character table is reused by many other encodings: in all of them, the ASCII characters keep the same binary representation, padded with leading 0s in 8-bit or wider encodings. For example, the letter n is 1101110 in ASCII and 01101110 in an 8-bit encoding.
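To make the padding idea concrete, here is a small Python sketch. The variable names are only illustrative; it prints the bit pattern of the letter n and checks that three ASCII-compatible encodings produce the same byte for it.

```python
# The letter "n" has the code 110, which is 1101110 in binary.
letter = "n"
code = ord(letter)

# Format the same number with 7 bits (ASCII) and with 8 bits (one byte).
seven_bit = format(code, "07b")
eight_bit = format(code, "08b")

# ASCII-compatible encodings all store "n" as the same single byte.
ascii_bytes = letter.encode("ascii")
latin1_bytes = letter.encode("latin-1")
utf8_bytes = letter.encode("utf-8")

print(seven_bit, eight_bit)
print(ascii_bytes == latin1_bytes == utf8_bytes)
```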

ISO 8859-1 or latin1
This encoding is widely used for Western European languages. It is ASCII compatible and uses 8 bits per character. Unfortunately, 8 bits (256 values) are not enough to also include Arabic letters, let alone Chinese characters, which is why there are many similar ASCII-compatible 8-bit encodings, each covering a different set of languages.
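The limits of a single 8-bit table can be tried out in Python. This is a minimal sketch: try_encode is a hypothetical helper written for this example, and ISO 8859-6 is picked here as one of the sibling 8-bit encodings (the one for Arabic).

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# latin1 has a slot for the German umlaut ...
umlaut = try_encode("ä", "latin-1")

# ... but none for the Arabic letter beh.
arabic_in_latin1 = try_encode("ب", "latin-1")

# ISO 8859-6 does have one, and it also fits into a single byte.
arabic_in_8859_6 = try_encode("ب", "iso8859_6")

# The very same byte means something else when read as latin1.
same_byte_as_latin1 = arabic_in_8859_6.decode("latin-1")

print(umlaut, arabic_in_latin1, arabic_in_8859_6, same_byte_as_latin1)
```

The last line is the root of the mess: a byte on its own does not say which 8-bit table it belongs to.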

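UTF-8, UTF-16, UTF-32, ...
The Unicode Transformation Formats (UTF) are one of the solutions to the ISO encodings mess. UTF-8 is ASCII compatible and does not have a fixed size: every character takes at least 8 bits, and ASCII characters take exactly 8, but characters outside ASCII such as äöü take 16 bits, and others take 24 or 32 bits. UTF-16 uses 16 or 32 bits per character and UTF-32 always uses 32, so neither is byte-compatible with ASCII. Now every character in Unicode can be used with the same encoding!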
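The different sizes are easy to measure. The following Python sketch uses an arbitrary set of sample characters and a hypothetical helper bit_width; the little-endian codec variants are chosen only so that no byte order mark is counted.

```python
samples = ["n", "ä", "€", "😀"]

def bit_width(character, encoding):
    """Number of bits the character occupies in the given encoding."""
    return len(character.encode(encoding)) * 8

# "-le" variants are used so that no byte order mark is added to the count.
utf8_bits = {ch: bit_width(ch, "utf-8") for ch in samples}
utf16_bits = {ch: bit_width(ch, "utf-16-le") for ch in samples}
utf32_bits = {ch: bit_width(ch, "utf-32-le") for ch in samples}

for ch in samples:
    print(ch, utf8_bits[ch], utf16_bits[ch], utf32_bits[ch])
```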

With this in mind, you can now understand what went wrong when looking at misinterpreted text:

Figure (latin1 vs utf8): left: correct; middle: UTF-8 interpreted as latin1; right: latin1 interpreted as UTF-8
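Both kinds of misinterpretation can be reproduced in a few lines of Python. This is only a sketch with the sample text "äöü". The option errors="replace" is an assumption made here so that invalid bytes show up as replacement characters instead of raising an error.

```python
text = "äöü"

# UTF-8 bytes read as latin1: every umlaut turns into two wrong characters.
utf8_as_latin1 = text.encode("utf-8").decode("latin-1")

# latin1 bytes read as UTF-8: the bytes are not valid UTF-8, so the decoder
# substitutes the replacement character U+FFFD.
latin1_as_utf8 = text.encode("latin-1").decode("utf-8", errors="replace")

# The first mistake is reversible as long as no bytes were lost.
repaired = utf8_as_latin1.encode("latin-1").decode("utf-8")

print(utf8_as_latin1)
print(latin1_as_utf8)
print(repaired)
```

The second mistake cannot be undone this way, because the replacement characters no longer carry the original bytes.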
