Python 3 Unicode and Byte Strings

A notable difference between Python 2 and Python 3 is that character data is stored using Unicode instead of bytes. When migrating existing code or writing new code you may well be unaware of this change, as most string algorithms work with either representation, but you cannot intermix the two.

If you are working with web service libraries such as urllib (formerly urllib2) and requests, network sockets, binary files, or serial I/O with pySerial, you will find that data is now stored as byte strings.

You are most likely to notice problems when comparing data against string constants. Comparing Unicode strings against byte strings will not work as intended: an ordered comparison (<, <=, >, >=) raises a TypeError, while equality (==) always returns False and inequality (!=) always returns True.
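A minimal sketch of these comparison rules, using made-up variable names and assuming the interpreter is run without the -b or -bb options:

```python
text = 'abc'     # Unicode string (str)
data = b'abc'    # byte string (bytes)

# Equality never matches across the two types, even for "identical" content.
equal = (text == data)
not_equal = (text != data)

# Ordered comparisons between str and bytes raise TypeError.
try:
    text < data
    ordered_error = None
except TypeError as exc:
    ordered_error = type(exc).__name__

print(equal, not_equal, ordered_error)   # False True TypeError
```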

Character Sets … a bit of history

If you are already familiar with Unicode or have encountered character set mappings you can skip forward to the Python and Unicode section.

Computers only work with numbers, so in order to handle text each character is assigned a unique number known as its character code. You are almost certainly familiar with ASCII, which defines standard codes for 128 characters: 95 printable characters (including space) and 33 non-printing control codes such as linefeed (often called newline), tab, escape and carriage return (generated by the Enter key). It dates back to 1963 and was designed for use in telegraph and serial communications using 7 of the 8 bits available in a byte (the top bit was often used as a parity bit for error checking).

ASCII only defines codes for the English characters and symbols commonly used in the USA. While it has a code for the dollar sign ($, code 24 hexadecimal or 36 decimal) it doesn't have codes for, amongst others, the GB pound (£), Yen (¥) or Euro (€) currency symbols.

Once data started being stored on disk or transmitted via error-checking network protocols, the 8th bit of each byte could be used for mapping an additional 128 characters. However there was a plethora of ways this could be, and was, defined. While an extended (8-bit) ASCII set added the GB pound and Yen symbols, it pre-dated the definition of the Euro.

Multiple definitions for 8-bit code sets were developed after Extended ASCII, with the most commonly occurring in the UK being Latin-1 (ISO-8859-1) and Microsoft CP1252. Both encodings mapped the GB pound and Yen symbols as well as several Western European accented characters. The sets are not 100% compatible with each other and neither originally included the Euro symbol.

The Euro was eventually defined in ISO-8859-15, which is essentially 8859-1 with some minor changes to characters, so many sites may use 8859-15 yet refer to it as 8859-1. The Euro was added to CP1252 version 3.
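The differences between these 8-bit sets can be seen with Python's built-in codecs. This is only an illustrative sketch; the codec names latin_1, iso8859_15 and cp1252 are the standard Python spellings and the variable names are made up:

```python
euro = '\u20ac'

# CP1252 and ISO-8859-15 both have a Euro, but at different byte values.
cp1252_bytes = euro.encode('cp1252')          # b'\x80'
latin9_bytes = euro.encode('iso8859_15')      # b'\xa4'

# Latin-1 (ISO-8859-1) has no Euro at all, so encoding fails.
try:
    euro.encode('latin_1')
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The same byte read as Latin-1 is the generic currency sign instead.
misread = latin9_bytes.decode('latin_1')      # '\xa4'
```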

All of this just added to the general confusion over character set mappings and there is a list of at least 50 different character encoding standards on Wikipedia.

Unicode

By the late 1980s attempts were made to define a universal character set (Unicode) using two bytes that would uniquely define over 65,000 different characters. This included western, Greek, Cyrillic, Arabic, Coptic and other characters in Europe and the Middle East, plus Asian character sets including the Han characters used in China and Japan (where they are known as Kanji).

The first 256 characters of Unicode are the same as the ISO-8859-1 (Latin-1) characters. The Euro is officially defined as code #20AC.
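As a quick check of this overlap, and assuming Python's built-in latin_1 codec, each of the 256 single-byte values should decode to the Unicode character with the same number:

```python
# Decode every possible byte value as Latin-1 and compare with chr().
latin1_matches_unicode = all(
    bytes([code]).decode('latin_1') == chr(code)
    for code in range(256)
)

euro_code = ord('\u20ac')   # 0x20AC, well outside the first 256 codes
```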

Unicode was intended to encode characters widely used in modern languages using 2 bytes. In 1996 Unicode 2 defined extension planes that allowed four byte character codes to support mappings for historic and specialized character sets.

Unicode 2 defined character codes D800–DBFF as high-surrogate code points, each of which must be followed by a second two byte code (a low-surrogate code point in the range DC00–DFFF). Despite requiring four bytes, the two combined together form a three byte code for characters from 10000 to 10FFFF (in the current specification). Emojis are a good example of codes requiring the four byte extension points: the happy face 🙂 is character code 1F642 (encoded as D83DDE42).
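The surrogate arithmetic can be sketched as follows. The helper name to_surrogates is hypothetical and written only for this illustration; the result is checked against Python's own utf-16-be codec:

```python
def to_surrogates(code_point):
    """Split a code point above FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F642)
pair = '{:04X}{:04X}'.format(high, low)    # 'D83DDE42'

# Python's big-endian UTF-16 encoder produces the same four bytes.
utf16_hex = '\U0001F642'.encode('utf-16-be').hex()
```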

As of this blog post there are over 137,000 Unicode characters. There are also several unofficial Unicode mappings including constructed scripts for languages such as pIqad Klingon.

Drawbacks to Unicode

The obvious disadvantage of Unicode is the need to use at least two bytes for every character. This increases memory consumption, disk usage and I/O times, and reduces effective data transmission rates.

To support a more efficient method of handling western character sets the UTF-8 encoding scheme was defined, which uses up to four bytes to store any Unicode character. ASCII codes require a single byte; codes up to 7FF require two bytes (this includes most European, Middle Eastern and Cyrillic characters). Character codes up to FFFF require three bytes and the rest require four. This means the Euro symbol (20AC) requires three bytes while emojis like the happy face (1F642) require four bytes.
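These sizes are easy to confirm by encoding one character from each range. A small sketch, with sample characters chosen purely for illustration:

```python
samples = ['A', '\u00a3', '\u20ac', '\U0001F642']   # A, pound, Euro, happy face

# Map each code point (as hex) to the number of bytes UTF-8 needs for it.
utf8_sizes = {
    '{:X}'.format(ord(ch)): len(ch.encode('utf-8'))
    for ch in samples
}

print(utf8_sizes)   # {'41': 1, 'A3': 2, '20AC': 3, '1F642': 4}
```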

Python and Unicode

Your first brush with Python Unicode strings may happen when reading a text file and you get an encoding error, or the characters do not display on the screen correctly.

Python 3 creates a text I/O object (io.TextIOWrapper) when opening a text file, and this uses a default encoding for mapping bytes in the file into Unicode characters. The default is taken from the platform's locale settings: under Linux and macOS it is normally UTF-8, while Windows typically assumes a legacy code page such as CP1252.

If your text file does not use the default encoding assumed by Python you will need to specify the encoding when you open the file. To read a file encoded using Latin-1 (ISO-8859-1) use 'latin_1':

with open('example.txt', mode='r', encoding='latin_1') as fp:
    text = fp.read()  # bytes are decoded from Latin-1 into a Unicode str
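To see the effect end to end, the sketch below writes a few Latin-1 bytes to a temporary file and reads them back with an explicit encoding. The file name and sample text are invented for the example, and locale.getpreferredencoding(False) is used here as one way to inspect the platform default:

```python
import locale
import os
import tempfile

# The encoding open() would normally fall back to on this machine.
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt')

    # Write raw bytes: 0xA3 is the pound sign in Latin-1.
    with open(path, 'wb') as fp:
        fp.write(b'Price: \xa3100\n')

    # Read the same file as text, naming the encoding explicitly.
    with open(path, mode='r', encoding='latin_1') as fp:
        text = fp.read()

print(default_encoding, repr(text))
```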

A full list of supported encodings is on the Python API codecs page.

All Python 3 string literals are Unicode. Use the lower case \u escape sequence with 4 digits (\uxxxx) to define codes up to FFFF. To define the Euro symbol use:

euro = '\u20AC'

For Unicode values above FFFF use an upper case \U with 8 digits (\Uxxxxxxxx). The happy face emoji would be:

smile = '\U0001F642'

If you print this value it will only display the happy face if your output device supports emojis (command prompts and IDEs typically do not).
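If the output device cannot show a character, the standard unicodedata module offers another way to confirm what a literal contains. A short sketch, assuming a Python version whose Unicode database includes the emoji (any recent Python 3 should):

```python
import unicodedata

euro = '\u20AC'
smile = '\U0001F642'

# Each literal is a single character, whatever escape form was used.
lengths = (len(euro), len(smile))

# Look up the official Unicode names of the two characters.
euro_name = unicodedata.name(euro)
smile_name = unicodedata.name(smile)
```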

If you have used Unicode text in Python 2 you'll be familiar with Unicode string literals defined using a u prefix to the string literal (u'Hello world!'). This notation is supported in Python 3 (from version 3.3 onwards), just no longer required.

Python and Byte Strings

If you work with low level data connections such as serial lines or network sockets (which include web connections and Bluetooth) you will find that Python 3 transfers data as byte strings: data type bytes. Similarly if you open a file in binary mode you’ll be working with byte strings.

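Byte strings support most of the methods provided with Unicode strings (data type str). If your code uses string methods and slices it is quite likely to continue working with byte strings. One difference to watch for is that subscripting a byte string with a single index returns an integer (the byte value) rather than a one character string.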
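A brief sketch of what carries over from str and what differs, using sample data invented for the illustration:

```python
data = b'Hello world!'

# Familiar string methods work and return byte strings.
upper = data.upper()           # b'HELLO WORLD!'
words = data.split()           # [b'Hello', b'world!']

# A slice is still a byte string...
first_slice = data[0:1]        # b'H'

# ...but a single subscript is an int (the byte value), not a 1-char string.
first_item = data[0]           # 72
```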

However methods that take string parameters (such as startswith) will fail if you pass a string literal (constant) because your code will be passing a Unicode string literal to a byte string method. To define a byte string use a b prefix to string literals. For example:

byte_hello = b'Hello world!'
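The startswith case might look like this in practice. The request line is made-up sample data:

```python
line = b'GET /index.html HTTP/1.1'

# Passing a byte string literal works as expected.
is_get = line.startswith(b'GET')

# Passing a Unicode literal to a bytes method raises TypeError.
try:
    line.startswith('GET')
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
```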

Byte strings support the usual backslash escape characters and can contain hexadecimal codes as in '\xA3' for the GB pound sign (assuming a Latin-1 character set). You can define raw byte strings to disable backslash escape recognition using either br or rb string prefixes.
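For illustration, a few literals showing the escape and raw forms side by side (the variable names are arbitrary):

```python
pound = b'\xA3'                      # one byte with value 0xA3
as_text = pound.decode('latin_1')    # the pound sign when read as Latin-1

escaped = b'\n'      # escape recognised: a single newline byte
raw = rb'\n'         # raw: a backslash followed by the letter n

sizes = (len(pound), len(escaped), len(raw))   # (1, 1, 2)
```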

Note that the two string representations are incompatible so the following expression is always False:

'Hello world!' == b'Hello world!'

Byte strings have no format() method: the str.format() method is only available on Unicode strings. If you pass a byte string as an argument to a Unicode format string you will always see the byte string's representation form. For example:

message = b'world'
print('Hello {}!'.format(message))

will output as

Hello b'world'!

Encoding and Decoding Strings

To convert byte strings to Unicode use the bytes.decode() method, which accepts an optional encoding parameter. The default encoding is UTF-8 on all platforms. We can correct the previous example to:

print('Hello {}!'.format(message.decode()))

If we are working with Windows CP1252 character sets and had read the text from a binary file we would use (data is the bytes string):

print('Hello {}!'.format(data.decode('cp1252')))
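Choosing the wrong encoding usually shows up as a UnicodeDecodeError. The sketch below uses invented sample bytes to show the failure, the fix, and the optional errors parameter of decode() as a fallback:

```python
data = b'Cost \xa3100'     # CP1252 bytes; 0xA3 is not valid UTF-8 on its own

# Decoding with the default (UTF-8) fails on the pound sign byte.
try:
    data.decode()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Naming the correct encoding recovers the text.
text = data.decode('cp1252')

# errors='replace' substitutes U+FFFD for bytes that cannot be decoded.
lossy = data.decode('utf-8', errors='replace')
```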

To convert Unicode to a byte string use the str.encode() method, again with an optional encoding parameter (default UTF-8). To write our hello world message to a Windows CP1252 text file we'd use:

message = 'world'
with open('hello.txt', 'wb') as fp:
    fp.write('Hello {}!'.format(message).encode('cp1252'))

In reality, for this example we’d probably just write the byte strings separately, or perhaps use string concatenation:

b'Hello '+message.encode('cp1252')+b'!'
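With plain ASCII text every common encoding gives the same bytes, so the difference only shows up with other characters. A small round trip sketch, using a sample word chosen for its accented letter:

```python
original = 'caf\u00e9'                      # "cafe" with an acute accent

as_cp1252 = original.encode('cp1252')       # b'caf\xe9': one byte for the accent
as_utf8 = original.encode()                 # b'caf\xc3\xa9': default UTF-8, two bytes

# Decoding with the matching encoding returns the original text.
round_trip = as_cp1252.decode('cp1252')
```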

Format Strings

As a Python 2 programmer, particularly with a C background, you may have used the % operator for formatting output. This is still supported in Python 3 for both Unicode strings and, since Python 3.5, byte strings. However the format string and any string parameters must be of the same type. To use byte string printf style formatting use:

print(b'Hello %s!' % (b'world'))
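A slightly fuller sketch of byte string % formatting, assuming Python 3.5 or later and using invented values. It also shows what happens when a Unicode string is passed by mistake:

```python
count = 3

# Byte string format with a bytes argument and an integer argument.
line = b'%s has %d items' % (b'basket', count)    # b'basket has 3 items'

# Mixing in a Unicode string argument is rejected with TypeError.
try:
    b'Hello %s!' % 'world'
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__
```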

But if you haven't yet moved over to format strings (available since Python 2.6, with the empty {} placeholders used here since 2.7) you should really do so as they are far more powerful than the printf style. The recommended approach for formatting output is to use the str.format() method. In its simplest form the format string uses braces to represent a replaceable parameter with similar data type and field width conventions to printf:

print('Hello {:s}!'.format('world'))
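The field width and precision conventions might be used like this. The values are arbitrary samples:

```python
# Left-aligned text in 10 columns, a float in 8 columns with 2 decimals,
# and an integer zero-padded to 4 digits.
row = '{:<10s}|{:>8.2f}|{:04d}'.format('world', 3.14159, 42)

print(row)   # world     |    3.14|0042
```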

Python 3.6 introduced a new Formatted String Literals feature which uses an f prefix on a Unicode string to allow any Python expression within the braces. Often abbreviated to f-strings these avoid the need to call the format() method for simple formatting cases such as this variant on the hello world example:

message = 'world'
print(f'Hello {message}!')

In fact any valid Python expression can be used within the braces so it's certainly possible to call functions and perform arithmetic such as:

import random
print(f'Random value {random.random()**2:.2f}')

But this probably isn’t a good idea in practice as it is very difficult to identify the expression (random.random()**2) from the formatting (:.2f) within the string literal. Stick to simple variable names as in the first example.
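One way to follow this advice is to compute the value first and keep only a simple name inside the braces. A sketch of that approach, with a variable name picked for the example:

```python
import random

# Do the arithmetic outside the string literal...
value = random.random() ** 2

# ...so the f-string contains just a name and its format specification.
message = f'Random value {value:.2f}'
print(message)
```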

If you use formatted string literals then an IDE that parses f-strings is indispensable as it will highlight syntax errors and incorrect variable names.

At the time of writing PyCharm highlights both syntax errors and incorrect variable names. Of the other popular Python IDEs (Atom, Eclipse/pydev, Spyder and VS Code) there is limited support for syntax checking (not always correctly) but no semantic checks for variable names. That may well have changed by the time you read this so make sure your IDE is kept up to date so you get f-string support when it is available.

Summary

The Python 3 string class (str) stores Unicode strings and a new byte string class (bytes) stores sequences of single bytes. The two are different types so string expressions must use one form or the other. String literals are Unicode unless prefixed with a lower case b.

Conversion to Unicode requires knowledge of the underlying character set encoding with UTF-8 being the most commonly used, especially on web pages. To convert byte strings to Unicode use the bytes.decode() method and use str.encode() to convert Unicode to a byte string. Both methods allow the character set encoding to be specified as an optional parameter if something other than UTF-8 is required.

A new formatted string literal (or f-string) allows expressions to be evaluated within the braces of a formatting string providing an alternative approach to using the str.format() method.

Martin Bond

An independent IT trainer Martin has over 40 years academic and commercial experience in open systems software engineering. He has worked with a range of technologies from real time process controllers, through compilers, to large scale parallel processing systems; and across multiple sectors including industrial systems, semi-conductor manufacturing, telecomms, banking, MoD, and government.

This entry was posted in Python, Python3.

2 Responses to Python 3 Unicode and Byte Strings

  1. SultansOfSwing says:

    Many thanks for this tutorial -- it shows exactly the traps that will catch us.
    I'm currently trying to port a Python2 code to Python3. I don't have a full overview of how modules call each other, so it's quite cumbersome to find all the places where (possibly) different string types (bytes and unicode) are compared. (In contrast, any other string operations throw exceptions that show the exact place where I have to correct something). Is there a simple method to make the `==` operator throw exceptions whenever different string types are being compared?

  2. Martin Bond says:

    In simple terms the behaviour of the == operator is to call the __eq__() method of the object using the normal attribute resolution. But when a str is compared with a bytes object, the built-in __eq__ methods of both types decline the comparison, so == always returns False in Python 3. These are built-in types whose methods cannot be replaced, so this behaviour cannot be changed. My advice would be to write a function (call it something like any_string_compare) that takes two parameters and checks the types of the parameters to define any comparison rules you want; this technique is used with languages that don't support operator overloading.

