Character Encodings and Python

Or

Why Unicode Doesn't Have To Hurt

The Pain

Unicode Hurts?

Encountered This?

A Unicode error in Python 2.7

How did you fix it?

  • Randomly apply .encode() and .decode()?
  • The unidecode module?
  • Or Just Ignore it?

¯\_(ツ)_/¯

Did it go away?

You probably didn't really fix the problem

At best you got lucky, at worst you corrupted data!
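
A minimal sketch of how "just ignore it" corrupts data. It assumes Python 3, where text is `str` and encoded data is `bytes`; the sample string is made up for illustration. The `errors="ignore"` and `errors="replace"` handlers make the exception go away, but the accented characters are lost for good.

```python
# Sample text with two non-ASCII characters (written as escapes to be explicit).
text = "na\u00efve caf\u00e9"

# The strict default raises, which is the error most people run into.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "Fixing" it by ignoring errors silently drops characters.
ignored = text.encode("ascii", errors="ignore")

# Replacing is visible, but the original characters are still gone.
replaced = text.encode("ascii", errors="replace")

print(strict_failed)  # True
print(ignored)        # b'nave caf'
print(replaced)       # b'na?ve caf?'
```

Neither result can be turned back into the original text, which is why silencing the error is not a fix.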

What's the root cause?

History Time!

How did we get here?

Human communication is complicated and messy

Communicating via computers only makes this worse

Glyphs

Written language uses glyphs

Need to convert analog glyphs to digital

Character Encodings!

1836: Morse Code
1874: Baudot Code (5-bit)
1928: IBM Binary Coded Decimal (6-bit)
1963: American Standard Code for Information Interchange (7-bit)
1963: IBM Extended Binary Coded Decimal Interchange Code (8-bit)

For a good example...

TO JAPAN!

文字化け (Mojibake)

Garbled Characters
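
Mojibake is easy to reproduce: encode text with one encoding and decode the bytes with another. The sketch below assumes Python 3 and picks Latin-1 as the "wrong" codec only because it accepts every byte value, so the mistake never raises an error.

```python
# Encode Japanese text as UTF-8, then decode the bytes with the wrong codec.
original = "文字化け"
utf8_bytes = original.encode("utf-8")

# Latin-1 maps every byte to some character, so this never raises;
# it quietly produces garbage instead.
garbled = utf8_bytes.decode("latin-1")

# Reversing the exact same mistake recovers the text, because Latin-1
# round-trips every byte without loss.
recovered = garbled.encode("latin-1").decode("utf-8")

print(len(original))          # 4 characters
print(len(utf8_bytes))        # 12 bytes
print(len(garbled))           # 12 garbage characters
print(recovered == original)  # True
```

Recovery only works here because the wrong codec was lossless. With most other codec pairs the damage is permanent.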

ASCII

One of the better old encodings

Binary Value => Character

  • 0b0000000 => NUL
  • 0b1111111 => DEL
  • 0b1000001 => A
  • 0b1000010 => B
  • 0b1100001 => a
  • 0b1100010 => b
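
The mapping can be checked from Python itself. This is a small sketch, assuming Python 3, that uses the built-ins `ord` and `chr` to go between characters and their 7-bit ASCII values; the `ascii_bits` dictionary is just an illustrative name.

```python
# Map a few characters to their 7-bit ASCII patterns.
ascii_bits = {ch: format(ord(ch), "07b") for ch in "ABab"}

for ch, bits in ascii_bits.items():
    print(ch, ord(ch), "0b" + bits)

# Going the other way: a bit pattern back to a character.
print(repr(chr(0b0000000)))  # '\x00' (NUL)
print(repr(chr(0b1111111)))  # '\x7f' (DEL)

# Upper and lower case differ by a single bit (0b0100000).
print(ord("a") - ord("A"))   # 32
```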

Not bad... for English

ISO Latin-1 (ISO-8859-1) adds an 8th bit

What about other languages?

We need a modern solution

Unicode Consortium Logo

Universal Encoding

Define abstract "Code Points"

Leave representation up to other software

Maximum of 17 * 2^16 or 1,114,112 Code Points
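
The arithmetic behind that limit, as a quick sketch: 17 planes of 2^16 code points each. It assumes Python 3.3 or later, where `sys.maxunicode` is always the highest code point; the constant names are illustrative.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

total_code_points = PLANES * CODE_POINTS_PER_PLANE
print(total_code_points)    # 1114112

# Code points are numbered from 0, so the highest one is total - 1.
print(hex(sys.maxunicode))  # 0x10ffff
print(sys.maxunicode == total_code_points - 1)  # True
```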

Unicode

Version 9.0 (June 2016)

135 Scripts

128,237 Characters (11.5% of total)

A Code Point Spec

Code Point: U+2603
UTF-8: E2 98 83
Name: SNOWMAN
Alias: Snowy Weather
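
The same facts can be looked up with the standard library's `unicodedata` module. A short sketch, assuming Python 3:

```python
import unicodedata

snowman = "\u2603"

# The code point is just a number; the name comes from the Unicode database.
code_point = "U+{:04X}".format(ord(snowman))
name = unicodedata.name(snowman)

# The UTF-8 representation is a separate step: code point -> bytes.
utf8_hex = " ".join("{:02X}".format(b) for b in snowman.encode("utf-8"))

print(code_point)  # U+2603
print(name)        # SNOWMAN
print(utf8_hex)    # E2 98 83

# Names work in the other direction too.
print(unicodedata.lookup("SNOWMAN") == snowman)  # True
```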

Need an encoding!

  • Binary
  • Variable Length
  • Backwards Compatibility
  • Can't Have Eight Zeros In A Row (NULL!)

Common Encodings

  • UTF-8
  • UTF-16
  • UTF-32
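
To see how the three differ in practice, the sketch below encodes one short string with each and compares the byte counts. It assumes Python 3 and uses the explicit little-endian codecs so that no byte order mark is added to the totals; the sample string is made up.

```python
# Five ASCII characters plus U+2603 SNOWMAN: six code points in total.
text = "snow \u2603"

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    sizes[codec] = len(text.encode(codec))
    print(codec, sizes[codec])

# utf-8      8  (1 byte per ASCII character, 3 for the snowman)
# utf-16-le 12  (2 bytes per code point here)
# utf-32-le 24  (always 4 bytes per code point)
```

Same code points, three different byte sequences: that is the difference between Unicode and an encoding of it.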

UTF-8

UTF-8 Table
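
The UTF-8 bit layout can be written out by hand. The function below is a teaching sketch, not how Python implements it; `utf8_encode` is a hypothetical name, and in real code `str.encode("utf-8")` does this job. It assumes Python 3.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x41))    # b'A'
print(utf8_encode(0x2603))  # b'\xe2\x98\x83'
```

Every byte of a multi-byte sequence has its high bit set, so ASCII bytes (including NUL) never appear inside one.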

UTF-8 Growth

Growth of UTF-8 on the Web

Let's Look At Some Code

What to do!?

(╯°□°)╯︵ ┻━┻

Python Best Practices

  1. Make a Unicode sandwich!
  2. Always know your encoding
  3. Otherwise guess
  4. Test with Unicode inputs
  5. Use UTF-8
  6. DON’T PLAY WHACK-A-MOLE
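
A sketch of the Unicode sandwich: bytes on the outside, text on the inside. Decode as soon as data comes in, work only with text, and encode just before it goes out. It assumes Python 3 (`bytes` and `str`; in Python 2.7 the equivalent pair is `str` and `unicode`). The function `shout` and the UTF-8 default are illustrative choices.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Upper-case some encoded text, sandwich style."""
    text = raw.decode(encoding)      # bottom slice: bytes -> text at the input edge
    result = text.upper()            # filling: all processing happens on text
    return result.encode(encoding)   # top slice: text -> bytes at the output edge


# Test with non-ASCII input, not just "hello".
incoming = "caf\u00e9 \u2603".encode("utf-8")
outgoing = shout(incoming)
print(outgoing.decode("utf-8") == "CAF\u00c9 \u2603")  # True

# Knowing the encoding matters: Latin-1 bytes are not valid UTF-8 here,
# and the strict default fails loudly instead of corrupting the data.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    shout(latin1_bytes)
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True
print(failed_loudly)                                       # True
print(shout(latin1_bytes, encoding="latin-1") == b"CAF\xc9")  # True
```

A loud failure at the edge is the useful outcome: it points at the one place where the encoding assumption was wrong, instead of inviting a round of whack-a-mole further in.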

Summary

  • Languages are messy
  • Many incompatible character encodings
  • Unicode fixes this
  • Unicode and UTF-8 are not the same thing, but they are closely related
  • Know the encodings you are working with
  • Always make a unicode sandwich

Resources

Thanks For Your Attention!

Questions?