Making New Collations with Python SQLite create_collation

Collations are functions used to compare TEXT data in a SQLite database. They are invoked by expressions with relational operators, by sorting with GROUP BY and ORDER BY clauses, and by indices. The default collation sorts TEXT data in a case-sensitive way, but it lacks any support for locale collations. In this tutorial we will describe all predefined collations, and we will learn how to define our own. The Python source for this tutorial can be found here

Using Collations
- Associating a Column with a Collation
- The COLLATE Operator
  - Collations Affect the ORDER BY and GROUP BY Clauses
  - Column Collation Has the Lowest Precedence
Predefined Collations
Defining New Collations
- A Collation for Latin-1 Characters
- Collation Functions and Unicode
  - Case-insensitive Unicode Equality
  - Sorting TEXT Data with locale
Remarks on Collation Functions
Closing Thoughts

Using Collations

Collations are used to compare TEXT values in a SQL database. Once they have been registered in a SQLite connection, they will be applied to relational operators, ORDER BY and GROUP BY clauses, or indices. Given a table with values of various data types, SQLite will sort them using the following algorithm:

NULLs comes first.
Then, numeric types are listed, using the usual sorting algorithm for real values; e.g. 1 < 1.2 < 1.23
TEXT values follow, using BINARY collation or an explicit collation.
Finally, BLOBs are sorted using BINARY collation.

Regardless of the collation, SQLite calls collation functions only on TEXT data. Let’s see how these rules work:

import sqlite3

def test_only_text():
    """ Collation functions are called only on TEXT data. """

    def collate_verbose(test1, test2):
        print('collate(%s, %s)' % (test1, test2))
        return 1 if test1 > test2 \
            else -1 if test1 < test2 else 0

    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('verbose', collate_verbose)
        
        conn.execute('CREATE TABLE test(c);')
        
        conn.executemany('INSERT INTO test VALUES(?);', 
            ( (1, ), ('lantern', ), (b'coffee', ),
              ('dog', ), (b'dish', ), (b'kettle', ),
              (2.3, ), (None, ), ('home', )
            ))
        
        for item in conn.execute(
            'SELECT c FROM test ORDER BY c COLLATE verbose;'):
            print(item[0])
            
test_only_text()

The collate_verbose() statement works like typical Python string comparison, except it prints its arguments before returning a value. This function has been registered under the name 'verbose'. In this example, we create a test table, holding at least one sample for each SQLite native type. Finally, we use a SELECT statement to sort all data in the table in ascending order. The COLLATE operator assigns the verbose collation to the c column. We will get back to this topic soon. As we can see from the output of test_only_text(), SQLite calls collate_verbose() only when both parameters are of TEXT type:

collate(lantern, dog)
collate(dog, home)
collate(lantern, home)
None
1
2.3
dog
home
lantern
b'coffee'
b'dish'
b'kettle'

Get Our Python Developer Kit for Free

I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.

Associating a Column with a Collation

Before using a collation, we must associate it with a set of values. There are two ways doing that:

either by stating the collation in the column declaration, or
by using the COLLATE operator (as we will see in the next section).

The COLLATE coll clause, which comes after the name and type of the column, associates a coll collation with the column; e.g. col TEXT COLLATE nocase, where a col column has type TEXT and uses the nocase collation. Let’s see an example:

import sqlite3

def test_column_declaration():
    """ Testing collation via column declaration. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('CREATE TABLE animals(' \
            'name TEXT COLLATE nocase);')
        
        conn.executemany('INSERT INTO animals VALUES(?);', 
             ( ('dog', )  , ('cat', ) , ('crocodile', ),
               ('OTTER', ), ('duck', ), ('BEAVER', ),
               ('bear', ) , ('FISH', ), ('pigeon', )
             ) )
        
        for animal in conn.execute(
            'SELECT * FROM animals ORDER BY name ASC;'):
            print(animal[0])
        
test_column_declaration()

Here we define an animals table, holding 9 animal’s names, either lowercase or uppercase. It has a single name column of type TEXT and using the predefined NOCASE collation. Finally, we sort all animal’s names in the table in descending order using a SELECT statement. The output of the test_column_declaration() function will be:

bear
BEAVER
cat
crocodile
dog
duck
FISH
OTTER
pigeon

As we can see, all data have been compared in a case-insensitive way, so that, for example, 'dog' < 'FISH'. The default algorithm would have listed all uppercase strings first, then the lowercase strings.

**The COLLATE Operator**

The collate() operator associates a specific collation with an expression. It doesn’t perform any transformation on the result of the expression, but it changes how equality and ordering are tested, according to the following rules:

use the explicit collation of the leftmost operand;
use the explicit collation of the rightmost operand;
if the leftmost expression is a direct reference to a column with a collation, the column collation is used;
if the rightmost expression is a direct reference to a column with a collation, the column collation is used;
otherwise, the default BINARY collation is used.

SQLite follows this algorithm from 1 to 5, and stops at the first rule it can apply. Here’s an example demonstrating this concept:

import sqlite3

def test_collation_and_equality():
    """ Effects of collation on equality testing. """
    with sqlite3.connect(':memory:') as conn:
        
        # Using leftmost collation
        print("collate('home', NOCASE) == 'HOME'? %s" \
            % bool(conn.execute(
            "SELECT 'home' COLLATE NOCASE == 'HOME' ;").fetchone()))
        
        print("collate('HOME', BINARY) == collate('home', NOCASE)? %s" \
            % bool(conn.execute(
            "SELECT 'HOME' COLLATE BINARY == 'home' COLLATE NOCASE;").fetchone()))
        
        # Using rightmost collation
        print("'HOME' == collate('home', NOCASE)? %s" \
            % bool(conn.execute(
            "SELECT 'HOME' == 'home' COLLATE NOCASE;").fetchone()))

test_collation_and_equality()

In this example we execute three SELECT statements, all comparing two string samples for equality. In the first case, the left operand has an explicit NOCASE collation, while the right operand has no collation. According to rule 1, this test will succeed. In the second case, both the left and the right operands have an explicit collation, but rule 1 is applied first, so the comparison uses the BINARY collation, and the test will fail. Finally, in the third cases only the rightmost operand has an explicit NOCASE collation, so, according to rule 2, the test will succeed. Let’s see the results of the test_collation_and_equality() function:

collate('home', NOCASE) == 'HOME'? True
collate('HOME', BINARY) == collate('home', NOCASE)? False
'HOME' == collate('home', NOCASE)? True

Beware that the same results holds for any other relational operator: !=, <, <=, >, >=, IS and IS NOT. Let’s see an example involving all these operators:

import sqlite3

def test_relational_operators():
    """ Using collations with other relational operators. """
    with sqlite3.connect(':memory:') as conn:
        for op in ('!=', '<', '<=', '>', '>=', 'IS', 'IS NOT'):
            ans = conn.execute(
                "SELECT 'Chuck' COLLATE NOCASE %s 'chuck';" % op).fetchone()
            print("'Chuck' %-6s 'chuck'? %s" \
                % (op, bool(ans[0])))
    
test_relational_operators()

The IS operator behaves like the = operator, except that it will return 1 if both operands are NULL, and 0 if one operand is NULL and the other is not. Since the first operand has NOCASE collation, all expressions will be evaluated as both operands were equal, so the output will be:

'Chuck' !=     'chuck'? False
'Chuck' <      'chuck'? False
'Chuck' <=     'chuck'? True
'Chuck' >      'chuck'? False
'Chuck' >=     'chuck'? True
'Chuck' IS     'chuck'? True
'Chuck' IS NOT 'chuck'? False

Get Our Python Developer Kit for Free

**Collations Affect the ORDER BY and GROUP BY Clauses**

Collations affect the ORDER BY clause as well, since its execution involves comparisons between the values of a column:

import sqlite3

def test_collation_and_orderby():
    """ Effects of collation on the `ORDER BY` clause. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('CREATE TABLE wardrobe(item TEXT);')
        
        conn.executemany(
            'INSERT INTO wardrobe VALUES(?);',
            ( ('jacket', )   , ('skirt', ),
              ('TIE', )      , ('JACKET', ),
              ('shirt', )    , ('jumper', ),
              ('JUMPER', )   , ('SHIRT', ),
              ('SKIRT', )    , ('tie', ) ) )
        
        for item in conn.execute(
            'SELECT item FROM wardrobe '\
            'ORDER BY item COLLATE NOCASE;'):
            print(item[0])

test_collation_and_orderby()

We create a wardrobe table holding various clothing items’ names, some uppercase and others lowercase. The SELECT statement will sort them in a case-insensitive manner, since in its ORDER BY clause we assigned a NOCASE collation to the item column:

jacket
JACKET
jumper
JUMPER
shirt
SHIRT
skirt
SKIRT
TIE
tie

Setting a collation also affects the GROUP BY clause, and the related HAVING clause, as we can see from the following example:

import sqlite3

def test_collation_and_groupby():
    """ Effects of collations on the `GROUP BY` clause. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('CREATE TABLE artists(' \
            'name TEXT NOT NULL, ' \
            'country TEXT NOT NULL COLLATE NOCASE);')
        
        conn.executemany(
            'INSERT INTO artists VALUES (?, ?);', 
            ( ('Simone Simons', 'NLD'),
              ('Federica Lanna', 'ITA'),
              ('Joe Dassin', 'fra'),
              ('Sharon den Adel', 'nld'),
              ('Clementine Delauney', 'fra'),
              ('Pierre Perret', 'FRA'),
              ('Sara Squadrani', 'ita'),
              ('Edith Piaf', 'fra'),
              ('Floor Jansen', 'NLD'),
              ('Charles Aznavour', 'fra'),
              ('Sergio Endrigo', 'ITA'),
            ))
        
        for country, n in conn.execute(
            'SELECT country, count(country) AS n ' \
            'FROM artists GROUP BY country ' \
            "HAVING country=='ita' OR country=='fra' " \
            'ORDER BY n DESC;'):
            print('%-10s%d' % (country.upper(), n))

test_collation_and_groupby()

Here we create an artists table, holding some artists and their country of origin. Notice that the country column has NOCASE collation. The SELECT statement computes how many artists are French or Italian:

FRA       5
ITA       3

There are 5 French artists and 3 Italian artists. We’re able to extract this data because the HAVING clause filter its values in a case-insensitive manner, and so does the GROUP BY clause, Otherwise, we would have had 4 different entries: FRA, fra, ITA, and ita.

Column Collation Has the Lowest Precedence

If both column collation and the COLLATE operator are available, the COLLATE operator has higher precedence:

import sqlite3

def test_column_precedence():
    """ Precedence of the column collation. """
    
    def collate_nocase_reverse(test1, test2):
        t1, t2 = test1.lower(), test2.lower()
        return 1 if t2 > t1 else -1 if t1 > t2 else 0
    
    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('reverse',
            collate_nocase_reverse)
        
        conn.execute('CREATE TABLE words(w COLLATE NOCASE);')
        
        conn.executemany('INSERT INTO words VALUES(?);',
            ( ('pencil', ), ('STICKER', ), ('ERASER', ),
              ('BOOK', ), ('PEN', ), ('ink', )
            ))
        
        print('Using column collation:')
        for item in conn.execute(
            'SELECT w FROM words ORDER BY w ASC;'):
            print(item[0])
            
        print('\nUsing the `COLLATE` operator:')
        for item in conn.execute(
            'SELECT w FROM words ' \
            'ORDER BY w COLLATE reverse ASC;'):
            print(item[0])

test_column_precedence()

The collate_nocase_reverse() function is similar to the default NOCASE collation, but it sorts all data in reverse order. In the example above, the words table holds both uppercase and lowercase letters. Its w column has NOCASE collation, so the first SELECT statement sorts all data in ascending order, regardless of the characters’ case. The second SELECT statement specifies the reverse collation in the ORDER BY clause, so now the w column has two possible collations, but the COLLATE operator has precedence over column collation, so all data will be sorted in reverse order:

Using column collation:
BOOK
ERASER
ink
PEN
pencil
STICKER

Using the `COLLATE` operator:
STICKER
pencil
PEN
ink
ERASER
BOOK

Predefined Collations

Before diving into user-defined collations, we will describe all predefined collations. You can use the PRAGMA collation_list; statement to list all available collations for a SQLite connection:

import sqlite3

def test_default_collations():
    """ Listing all collations. """
    with sqlite3.connect(':memory:') as conn:
        print('Default collations:')
        for cid, name in conn.execute(
            'PRAGMA collation_list;'):
            print(cid, name)
    
test_default_collations()

which lists the following collations:

Default collations:
0 RTRIM
1 NOCASE
2 BINARY

Collations inserted last have lower integer indices.

Ignoring Trailing Spaces with RTRIM

The RTRIM (Right Trim) collation sorts all data ignoring trailing spaces. Leading spaces are still relevant.

import sqlite3

def test_rtrim_collation():
    """ Testing the RTRIM collation. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute(
            'CREATE TABLE paragraph(' \
            'id INTEGER PRIMARY KEY, ' \
            'textobj TEXT NOT NULL, ' \
            'background TEXT COLLATE RTRIM, ' \
            'foreground TEXT COLLATE RTRIM);')
        
        conn.executemany(
            'INSERT INTO paragraph( ' \
            'textobj, background, foreground) ' \
            'VALUES (?, ?, ?);', 
            ( ('home' , 'yellow  ', 'black  '),
              ('hound', 'green  ' , 'red '),
              ('chair', 'red '    , 'black     '),
              ('hat'  , 'yellow ' , 'blue '),
              ('cat'  , 'red    ' , 'black ')
            ))
        
        for text, foreg in conn.execute(
            'SELECT textobj, foreground FROM paragraph ' \
            "WHERE foreground == 'black';"):
            print('%r has a %s foreground.' \
                % (text, foreg.rstrip()))

test_rtrim_collation()

In this example we create a paragraph table, holding formatting information about textobj items. Both the background color and the foreground color have RTRIM collation. Then, we insert 5 records into the table, with colors having different trailing sequences of spaces. Finally, we use a SELECT statement to retrieve all text objects having a black foreground. Since we associated the foreground column with the RTRIM collation, all trailing spaces are ignored, and the SELECT statement returns 3 matches, as we can see from the test_trim_collation()’s output:

'home' has a black foreground.
'chair' has a black foreground.
'cat' has a black foreground.

Case-Insensitive Sorting with NOCASE

The NOCASE collation ignores a character’s case when comparing ASCII characters (i.e. from U+0000 to U+007F); e.g. 'A' == 'a', since they are, respectively, U+0041 and U+0061. But that doesn’t hold for 'ä' == 'a' since the leftmost character (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) is not in the ASCII range.

import sqlite3

def test_nocase_collation():
    """ Testing the NOCASE collation. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('CREATE TABLE words(' \
            'lower TEXT COLLATE NOCASE, upper TEXT);')
        
        conn.executemany(
            'INSERT INTO words VALUES(?, ?);', 
            ( ('desert' , 'DESERT'),
              ('country', 'COUNTRY'),
              ('città'  , 'CITTÀ'),    # non-ASCII
              ('hôpital', 'HÔPITAL'),  # non-ASCII
              ('niño'   , 'NIÑO'),     # non-ASCII
            ))
        
        for lower, upper, match in conn.execute(
            'SELECT lower, upper, lower == upper ' \
            'FROM words;'):
            print('%r == %r? %s' \ 
                % (lower, upper, bool(match)))

test_nocase_collation()

In this example we define a words table, holding 5 lowercase words and their respective uppercase values. The first 2 words contain only ASCII characters, while the last 3 have one non-ASCII character. When we use the SELECT statement to compare the two columns for equality, the NOCASE collation will be applied to both columns since the lower column has been declared with that collation, and, by the rules stated above, the rightmost term of comparison will be assigned to the same collation. SO, the output of the test_nocase_collation() will be:

'desert' == 'DESERT'? True
'country' == 'COUNTRY'? True
'città' == 'CITTÀ'? False
'hôpital' == 'HÔPITAL'? False
'niño' == 'NIÑO'? False

which reveals that, while the case-insensitive equality works fine in the first two samples, which evaluate to True, it fails for the last three samples since NOCASE collation sees no-ASCII characters with different case as distinct values. Don’t worry about it: Python itself will soon come to our rescue!

Treating TEXT Values as a Sequence of Bytes with BINARY

The BINARY collation treats text data as a sequence of bytes, comparing two samples byte by byte. If both sequence have the same bytes, the shorter comes first. As we have already stated above, all BLOB objects are compared using this collation. By assigning the BINARY collation to TEXT objects, we may expect that TEXT and BLOB objects will be sorted together, since they share the same collation. Let’s see if we are right:

import sqlite3

def test_binary_collation():
    """ Using BINARY collation on TEXT objects. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('CREATE TABLE spices('
            'name COLLATE BINARY);')
        
        conn.executemany('INSERT INTO spices VALUES(?);',
            ( (b'cinnamon', ), ('rosemary', ),
              ('thyme', )    , (b'jasmine', ),
              ('saffron', )  , ('sage', )
            ))
        
        for item in conn.execute(
            'SELECT name FROM spices ' \
            'ORDER BY name ASC;'):
            print(item[0])
    
test_binary_collation()

We set up a spices table, holding the names of some spices. We do this both with TEXT and BLOB values. This column has BINARY collation. When we try to sort it in ascending order, we get these results:

rosemary
saffron
sage
thyme
b'cinnamon'
b'jasmine'

All TEXT data are still sorted before BLOB data! So, even when using TEXT data with BINARY collation, we must remember the general rule for using Collations: all numeric data come first, then TEXT data, finally BLOBs.

Get Our Python Developer Kit for Free

Defining New Collations

There are times when more complex comparison algorithms are required, and we can’t always rely on predefined collations. SQLite allows developers to define their own collations by registering collation functions, which are invoked to compare strings as part of any sorting process. A collation function accepts two TEXT values as parameters, and it returns:

-1 (or any other negative value) if the second string is greater than the first .
0 if the two strings compare equal.
1 (or any other positive value) if the first string is greater than the second.

You must use the create_collation(name, func) method of a SQLite connection to register a new collation, where name is the name of the new collation, and func is the function implementing the collation algorithm. The func argument can either be a standalone function definition or can be entered in-line. Collations’ name are case-insensitive. You can call create_collation('coll', None) to delete a previously registered collation named coll. Let’s see an example:

import sqlite3
import re

def test_band_names():
    """ Defining a collation algorithm for band names. """

    def collate_band(name1, name2):
        """ A collation skipping leading articles from a name. """  
        n1 = re.sub(r'^(a |the |an )', '', name1.lower())
        n2 = re.sub(r'^(a |the |an )', '', name2.lower())
        return 1 if n1 > n2 else -1 if n1 < n2 else 0

    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('band', collate_band)

        conn.execute('CREATE TABLE bands(' \
            'id INTEGER PRIMARY KEY, ' \
            'name TEXT NOT NULL COLLATE band, ' \
            'country TEXT NOT NULL);')

        conn.executemany(
            'INSERT INTO bands(name, country) VALUES (?, ?);',
            ( ('A Taste of Honey', 'USA'),
              ('The Killers', 'USA'),
              ('A-ha', 'NOR'), 
              ('The Beatles', 'GBR'), 
              ('The Clash', 'GBR')
            ) )

        for band in conn.execute(
            'SELECT name FROM bands ORDER BY name ASC;'):
            print(band[0])

test_band_names()

In this example we define a collate_band collation, which sorts all band names by skipping the leading a, an, or the article, if any; e.g. we should use Supremes instead of The Supremes during comparison. The re.sub(pattern, repl, sample) function finds the leftmost occurrence of pattern in sample, and replaces it by repl. If there isn’t any occurrence of pattern, then sample is returned unmodified. We use this function to find the leading article (followed by a single space) from the band’s names, and replace it with an empty string, thus deleting it. After performing the substitution, these two names are compared using the Python built-in < operator for str objects. Then, we create a small database of bands. Finally, we use the SELECT statement to sort all band names in ascending order, and we get this list:

A-ha
The Beatles
The Clash
The Killers
A Taste of Honey

All bands have a leading article, except for A-ha, where ‘A’ is not followed by a single space, so it isn’t stripped before comparison.

A Collation for Latin-1 Characters

Here we describe a collation for Latin-1 characters, registered as 'latin1', which handles all characters in a TEXT value as follows:

It removes all non-alphanumeric characters (e.g. punctuation, spaces).
It maps all ASCII lowercase characters to their respective uppercase versions.
It maps all Latin-1 alphabetic characters to ASCII alphabetic characters which resemble them, e.g. 'ä' (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) is mapped to 'A' (U+0041 LATIN CAPITAL LETTER A).
It removes all characters outside the Latin-1 range.

After all characters in both strings have been converted, a case-insensitive comparison, resembling the default NOCASE collation, takes place. This time, we implement our collation as a class with a __call__() method (a callable object).

import sqlite3

def test_latin1_collation():
    """ Defining a Latin-1 collation. """
    
    class Latin1Collator:
        """ A collation for Latin-1 strings. """
        
        LATIN1_MAP =  [
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , '' , '' , '' , '' , '' , '' , # 00
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , '' , '' , '' , '' , '' , '' , # 10
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , '' , '' , '' , '' , '' , '' , # 20
            '0', '1', '2', '3', '4', '5', '6', '7', 
            '8', '9', '' , '' , '' , '' , '' , '' , # 30
            '' , 'A', 'B', 'C', 'D', 'E', 'F', 'G', 
            'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', # 40
            'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 
            'X', 'Y', 'Z', '' , '' , '' , '' , '' , # 50
            '' , 'A', 'B', 'C', 'D', 'E', 'F', 'G', 
            'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', # 60
            'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 
            'X', 'Y', 'Z', '' , '' , '' , '' , '' , # 70
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , '' , '' , '' , '' , '' , '' , # 80
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , '' , '' , '' , '' , '' , '' , # 90
            '' , '' , '' , '' , '' , '' , '' , '' , 
            '' , '' , 'A', '' , '' , '' , '' , '' , # A0
            '' , '' , '2', '3', '' , '' , 'P', '' , 
            '' , '1', 'O', '' , '' , '' , '' , '' , # B0
            'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 
            'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', # C0
            'D', 'N', 'O', 'O', 'O', 'O', 'O', '' , 
            'O', 'U', 'O', 'U', 'U', 'Y', '' , 'B', # D0
            'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 
            'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', # E0
            'D', 'N', 'O', 'O', 'O', 'O', 'O', '' , 
            'O', 'U', 'O', 'U', 'U', 'Y', '' , 'Y'  # F0
            ]
        
        def map_string(self, s):
            res = ''
            for c in s:
                if 0 <= ord(c) <= 255:
                    res += self.LATIN1_MAP[ord(c)]
            return res
        
        def __call__(self, x, y):
            s1, s2 = self.map_string(x), self.map_string(y)
            return 1 if s1 > s2 else -1 if s1 < s2 else 0
                
    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('latin1', Latin1Collator())
                    
        conn.execute('CREATE TABLE words(' \
            'w1 TEXT NOT NULL COLLATE latin1, ' \
            'w2 TEXT NOT NULL COLLATE latin1, ' \
            'translation TEXT NOT NULL);')
        
        conn.executemany(
            'INSERT INTO words VALUES(?, ?, ?);', 
            ( ('schön'  , 'sch-on'  , 'beautiful'),# German
              ('Hyvästi', 'HYVASTI!', 'goodbye'),  # Finnish
              ('igår'   , 'igar'    , 'yesterday'),# Swedish
              ('perché' , 'Perchè?' , 'why'),      # Italian
              ('être'   , 'etre'    , 'to be'),    # French
              ('año'    , 'A.Ñ.O'   , 'year'),     # Spanish
            ))
                
        for w1, w2, match in conn.execute(
            'SELECT w1, w2, w1 == w2 FROM words;'):
            print('%r == %r? %s' \
                % (w1, w2, bool(match)))

test_latin1_collation()

The Latin1Collator class uses the LATIN1_MAP list to quickly map all characters within the Latin-1 range (U+0000 to U+00FF) into a character in the ASCII alphanumeric range (U+0030 to U+0039 for digits, and U+0041 to U+005A for uppercase letters). The map_to_latin1() method implements the conversion algorithm described above. The __call__() method converts both its arguments to ASCII uppercase strings, then performs the comparison. We define a words table holding all our sample data. The w1 column includes valid words in 6 languages, while the w2 column has the same words, but with additional punctuation characters, or replacing uppercase characters with their respective lowercase character (or vice versa). Finally, the SELECT statement performs an equality test using the latin1 collation. Using this special Python SQLite3 collation function, all the left strings will match the right strings, as we can see from the output of the test_latin1_collation() function:

'schön' == 'sch-on'? True
'Hyvästi' == 'HYVASTI!'? True
'igår' == 'igar'? True
'perché' == 'Perchè?'? True
'être' == 'etre'? True
'año' == 'A.Ñ.O'? True

The conversion enforced by LATIN1_MAP may seem a little bit esoteric, but it is useful when dealing with file systems strictly complying with the ISO 9660 standard, which allows for a limited character set for directories and file names.

Get Our Python Developer Kit for Free

Collation Functions and Unicode

As we have seen earlier in this tutorial, the NOCASE collation is not Unicode-aware. Neither are the upper() and lower() predefined scalar functions; they both only convert ASCII characters, so we can’t use them to build a case-insensitive comparison function of Unicode characters.

The bottom line is this: if we want to sort Unicode TEXT data in a case-insensitive way, we must write our own function.

Case-insensitive Unicode Equality

If we want to test two arbitrary Unicode strings for equality in a case-insensitive way, we can’t use the NOCASE collation, since it only properly handles ASCII characters. Luckily, Python has an extensive built-in support for Unicode strings. Notably, the upper() and lower() methods of the str class are Unicode-aware; e.g. â (U+00E2) is mapped into Â (U+00C2) by upper(). Using these Python features, we can readily implement our custom case-insensitive equality test collation as follows:

import sqlite3

def test_unicode_equality():
    """ Equality of Unicode strings. """
    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('unicode_nocase',
            lambda x, y : 1 if x.lower() > y.lower() \
                else -1 if x.lower() < y.lower() else 0)
        
        conn.execute('CREATE TABLE words(' \
            'lower TEXT COLLATE unicode_nocase, ' \
            'upper TEXT NOT NULL, ' \
            'translation TEXT NOT NULL);')
        
        conn.executemany(
            'INSERT INTO words VALUES(?, ?, ?);', 
            ( ('Glück'  , 'GLÜCK'  , 'luck'),     # German
              ('hedelmä', 'HEDELMÄ', 'fruit'),    # Finnish
              ('kärlek' , 'KÄRLEK' , 'love'),     # Swedish
              ('λoγος'  , 'ΛOΓΟΣ'  , 'reason'),   # Greek
              ('città'  , 'CITTÀ'  , 'city'),     # Italian
              ('hôpital', 'HÔPITAL', 'hospital'), # French
              ('niño'   , 'NIÑO'   , 'boy'),      # Spanish
            ))
        
        for lower, upper, match in conn.execute(
            'SELECT lower, upper, lower == upper ' \
            'FROM words;'):
            print('%r == %r? %s' \
                % (lower, upper, bool(match)))
            
test_unicode_equality()

Take a look at the create_collation statement. Thanks to Python’s built-in support for Unicode, we can easily overcome the shortcomings of the SQLite NOCASE collation. By simply defining a unicode_nocase collation, we now have a collation function capable of correctly handling Unicode characters’ equality tests. Much like our earlier NOCASE collation example, we define a words table, filling it with the lowercase and uppercase versions of words. Each word includes at least one non-ASCII character. Let’s take a look at the output of the SELECT statement:

'Glück' == 'GLÜCK'? True
'hedelmä' == 'HEDELMÄ'? True
'kärlek' == 'KÄRLEK'? True
'λoγος' == 'ΛOΓΟΣ'? True
'città' == 'CITTÀ'? True
'hôpital' == 'HÔPITAL'? True
'niño' == 'NIÑO'? True

This time all equality tests are successful, since Python knows how to handle the upper() method properly for Unicode characters.

Sorting TEXT Data with locale

Comparing two Unicode strings is harder than testing for equality, since we need to define the < or > relational operators as well. The default implementation of __lt__ and __gt__ for the str class just sorts Unicode characters using their code points; e.g. 'A' < 'Z' < 'a' because ord('A') < ord('Z') < ord('a'), where ord() returns the code point for the character.

We usually expect lowercase characters and their respective uppercase values to be next to each other. Once again, Python comes to the rescue with its locale module, which allows for locale-dependent comparisons. Locale settings are applied using multiple locale categories, prefixed by LC_. The locale.LC_ALL category affects all comparison and sorting functions. The locale.LC_COLLATE category only affects the locale.strcoll() and the locale.strxfrm() functions. Let’s see them at work:

import locale

def test_locale_comparison():
    """ Compare strings using locale settings. """
        
    samples = 'aäàAboözZ1.'
        
    languages = { 
        'English' : 'en_US.UTF-8',
        'French'  : 'fr_FR.UTF-8',
        'German'  : 'de_DE.UTF-8',
        'Italian' : 'it_IT.UTF-8',
        'Swedish' : 'sv_SE.UTF-8' } 
    
    # Query current locale categories    
    default_collate = locale.setlocale(locale.LC_COLLATE, '')
    print('Default collation is %s' % default_collate)

    print('%-10s: %s' \
        % ('default', sorted(samples)))
    
    print('%-10: %s' \
        % ('codepoints', sorted(samples, key=ord)))
                
    for lang in languages:
        try:
            locale.setlocale(locale.LC_COLLATE, languages[lang])
            print('%-10s: %s' \
                % (lang, sorted(samples, key=locale.strxfrm)))
        except locale.Error:
            print('%s language package not available.' % lang)               

    # Reset previous locale settings
    locale.setlocale(locale.LC_COLLATE, default_collate)

test_locale_comparison()

The test_locale_comparison tests various comparison algorithms on the samples string. The locale.setlocale() function retrieves the current value for the LC_COLLATE category (notice the second argument of the function call is an empty string). The sorted() function sorts all items from an iterable object. If the optional key parameter is specified, it applies that function to each item before the comparison; e.g. sorted('aA', key=ord) will call ord('a') < ord('A'), and evaluates to [ 'A', 'a' ].

The locale.strxfrm(s) function formats a string using the current locale. We will demonstrate the behavior of various locale settings. Notice how the locale-dependent comparison is enclosed within a try statement. We do this because some locale settings may not be available on your operating system, in which case a locale.Error exception will be raised.

Default collation is en_US.UTF-8
default   : ['.', '1', 'A', 'Z', 'a', 'b', 'o', 'z', 'à', 'ä', 'ö']
codepoints: ['.', '1', 'A', 'Z', 'a', 'b', 'o', 'z', 'à', 'ä', 'ö']
English   : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z']
French    : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z']
German    : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z']
Italian   : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z']
Swedish   : ['.', '1', 'a', 'A', 'à', 'b', 'o', 'z', 'Z', 'ä', 'ö']

As we can see from the output, each locale applies a different sorting algorithm. Now we are ready to define our locale-dependent collation:

import sqlite3
import locale

def test_locale_aware_collation():
    """ A locale-aware collation for sorting Unicode data. """
    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('unicode', locale.strcoll)
        
        conn.execute(
            'CREATE TABLE words(w COLLATE unicode);')
        
        conn.executemany('INSERT INTO words VALUES(?);',
        ( ('château', )  , ('Château', ), 
          ('chambre', )  , ('chanteuse', ),
          ('Città', )    , ('città', ),
          ('Schlösser', ), ('schlösser', ),
          ('Zenit', )    , ('zenit', )
        ) )
        
        for word in conn.execute(
            'SELECT w FROM words ORDER BY w ASC;'):
            print(word[0])

test_locale_aware_collation()

The locale.strcoll(x, y) function compares x to y using the current value for the LC_COLLATE category. It returns 1 if x > y, -1 if x < y, otherwise 0. We use this function to create a new locale named unicode. Then, we build a words table, holding 10 French, German, and Italian words. Finally, we use a SELECT statement to sort all data in the table. The ORDER BY clause will call the locale.strcoll() function on all pairs of words in the table. The following output refers to an en_US locale:

chambre
chanteuse
château
Château
città
Città
schlösser
Schlösser
zenit
Zenit

See how much more powerful these custom collation functions are than the default collate sequences? By combining Python with SQLite3, you’re able to build robust collations for handling your database data.

Get Our Python Developer Kit for Free

Remarks on Collation Functions

In this section we consolidated useful information you should be aware of before trying to define your own SQLite collation functions using Python.

Handling Exceptions

Unlike scalar and aggregate functions, whose exceptions are caught by SQLite and re-raised as sqlite3.Error objects, all exceptions raised by collation functions will propagate to its caller. In other words, trying to catch just sqlite3.Error exceptions won’t be enough.

import sqlite3

def test_collation_exception():
    """ A collation function raising a ValueError exception. """
    with sqlite3.connect(':memory:') as conn:
        
        def faulty_collation(x, y):
            raise ValueError(
                'Faulty Collation for (%s, %s)' % (x, y))
        
        conn.create_collation('faulty', faulty_collation)
        
        try:
            print("'home' > 'house'? %s" % conn.execute(
                "SELECT 'home' COLLATE faulty > 'house';"))
        except sqlite3.Error:
            print("This handler won't work.")
        except ValueError:
            print('This handler will be called instead.')

test_collation_exception()

The faulty_collation function raises a ValueError exception regardless of its parameters. When we try to compare ‘home’ to ‘house’, the faulty collation raises an exception, which is not caught by the first except clause. The first clause deals with all exceptions from the sqlite3 module. Instead, the exception is caught by the second clause, as we can see from the output:

This handler will be called instead.

**Collations and the DISTINCT Keyword**

Since collations affect equality testing, they also control the way the DISTINCT keyword works.

import sqlite3
import random

def test_distinct_keyword():
    """ Collations affect the `DISTINCT` keyword. """
    
    class Choice:
        
        def __init__(self):
            self.choices = []
            
        def step(self, item):
            print('adding %r' % item)
            self.choices.append(item)
                
        def finalize(self):
            return random.choice(self.choices)
    
    with sqlite3.connect(':memory:') as conn:
        conn.create_aggregate('choice', 1, Choice)
        
        conn.execute(
            'CREATE TABLE icecream(' \
            'flavor TEXT COLLATE NOCASE);')
                    
        conn.executemany(
            'INSERT INTO icecream VALUES(?);',
            ( ('chocolate', ) , ('CHOCOLATE', ),
              ('lemon', )     , ('LEMON', ),
              ('strawberry', ), ('STRAWBERRY', ),
              ('vanilla', )   , ('VANILLA', ),
              ('watermelon', ), ('WATERMELON', )
            ) )
                    
        print("Pick an ice cream's flavor: %s" \
            % conn.execute(
                'SELECT choice(DISTINCT flavor) FROM icecream;'
            ).fetchone()[0] )

test_distinct_keyword()

First, we define a Choice class, and register it as an aggregate function under the name 'choice'. At each call, this function prints its parameter, and then inserts it in a set of choices. Then, we define an icecream table, including both the uppercase and the lowercase version of some ice cream’s flavors. Finally, we use the choice() function to pick a random flavor from the table. A typical output of the test_distinct_keyword() is:

adding 'chocolate'
adding 'lemon'
adding 'strawberry'
adding 'vanilla'
adding 'watermelon'
Pick an ice cream's flavor: strawberry

The DISTINCT keyword would usually tell the uppercase and lowercase version of the same word apart, but since the flavor column has been declared with NOCASE collation, only one version of the same word will be selected. So, the choice() function will choose from a set having just an instance for each ice cream’s flavor.

**The Strange Case of the LIKE Operator**

While collations change the way relational operators work, they don’t affect the LIKE, GLOB, MATCH, and REGEX operators, which can be defined by registering suitable scalar functions. As for the LIKE operator, if you want case-insensitive comparisons you don’t need the COLLATE NOCASE operator (or column definition). The PRAGMA case_sensitive_like controls the behavior of the LIKE operator. For example:

import sqlite3

def test_like_operator():
    """ Collations don't affect the `LIKE` operator. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute('PRAGMA case_sensitive_like = 1;')
        
        print("'Python' LIKE 'py%%'? %s" \
            % bool(conn.execute(
                "SELECT 'Python' COLLATE NOCASE " \
                "LIKE 'py%';").fetchone()[0]))

test_like_operator()

The test_like_operator() function uses a PRAGMA statement to switch to case-sensitive comparison for the LIKE operator. Then, we set the collation for the first argument of the LIKE operator to NOCASE. This move doesn’t affect the behavior of the LIKE operator, which still uses case-sensitive comparison, as we can see from the function’s output:

'Python' LIKE 'py%'? False

This PRAGMA statement applies to all databases attached to the connection which executes the statement. It affects only the built-in version of the LIKE operator, which can be overwritten by registering a custom like() scalar function with two arguments.

Get Our Python Developer Kit for Free

Collations and Indices

Since collations affect comparison, indices rely on them, too. So, when you define an index on a column with an explicit collation, the index will use that collation to sort all data.

import sqlite3

def test_collation_and_indices():
    """ Collations also affect indices. """
    with sqlite3.connect(':memory:') as conn:
        conn.execute(
            'CREATE TABLE people(' \
            'id INTEGER PRIMARY KEY, ' \
            'name TEXT NOT NULL, ' \
            'country TEXT NOT NULL, ' \
            'behavior TEXT COLLATE NOCASE);')
        
        conn.executemany(
            'INSERT INTO people (name, country, behavior) ' \
            'VALUES(?, ?, ?);', 
            ( ('John Hurt', 'GBR', 'kind'),
              ('Elsa Dwight', 'USA', 'Stubborn'),
              ('Mario Rossi', 'ITA', 'reckless'),
              ('Françoise Rauchelle', 'FRA', 'selfish'),
              ('Juan Gutierrez', 'ESP', 'Generous'),
              ('Olga Persson', 'SWE', 'Shy'),
              ('Georg Schelling', 'DEU', 'stubborn'),
              ('Mary Fitzgerald', 'USA', 'Kind')
            ))
        
        conn.execute(
            'CREATE INDEX person_behavior ' \
            'ON people(behavior);')
        
        for b in conn.execute(
            'SELECT behavior FROM people;'):
            print(b[0])

test_collation_and_indices()

In this example, we define a people table, which holds a set of people described by their name, country of origin, and behavior. Then, we create an index on the behavior column, which has a NOCASE collation, meaning that all data will be treated in a case-insensitive manner, as we can see from the output:

Generous
kind
Kind
reckless
selfish
Shy
Stubborn
stubborn

Think about the typical case-sensitive comparison where 'A' < ... < 'Z' < 'a' < ... < 'z'. In our example, uppercase data and their respective lowercase values appear one after the other, instead of the records being grouped apart.

Registering Multiple Collations Under the Same Name

The sqlite3 API differs slightly from the SQLite C API, which allows you to register multiple collation functions under the same name, as long as you specify a different string encoding form. In the C API we use the sqlite3_create_collation(conn, name, enc, udp, func) function to register a new collation, where:

the conn parameter represents the connection.
we register the collation under name.
the enc parameter is the encoding form of the string to which the collation applies. The available string encoding forms are: UTF-8, UTF-16LE, and UTF-16BE.
we can pass additional data to the collation function using the udp.
the func parameter is the collation function.

When using C, if more than one comparison function is available under the same name, SQLite picks the one requiring the least amount of conversion. For example, if we have database encoded in UTF-16LE, and we register both a UTF-8 and a UTF-16LE collation function under the same name, then SQLite will pick the UTF-16LE version, since it requires no conversion. If the UTF-16LE version weren’t available, SQLite would have chosen the UTF-8 version, converting all samples to UTF-8 before comparing them.

On the other hand, the sqlite3 API for Python registers all collation functions under the UTF-8 version, so each time we register a new collation function under the same name, we really overwrite the previous one. Let’s see an example:

import sqlite3

def test_same_name():
    """ Registering multiple collations under the same name. """
    
    with sqlite3.connect(':memory:') as conn:
        conn.create_collation('str',
            lambda x, y: 1 if x > y \
                else -1 if x < y else 0)
        print("First version of 'str' : 'cat' > 'fish' is %s" \
            % bool(conn.execute(
                "SELECT 'cat' > 'fish' COLLATE str;").fetchone()[0]))
                        
        conn.create_collation('str', 
            lambda x, y: 1 if x < y \
                else -1 if x > y else 0)
        print("Second version of 'str': 'cat' > 'fish' is %s" \
            % bool(conn.execute(
                "SELECT 'cat' > 'fish' COLLATE str;").fetchone()[0]))

test_same_name()

The test_same_name() function creates a new str collation, which acts like the usual Python string comparison. The first SELECT statement uses this version of the collation. Then, we register a second collation function under the same name, which sorts all strings in the reverse order, as we can see from the output of the second SELECT statement:

First version of 'str' : 'cat' > 'fish' is False
Second version of 'str': 'cat' > 'fish' is True

Beware that registering multiple collations under the same name may fail, raising an exception, since SQLite has to make sure there aren’t pending operations on the database still requiring the old collation function.

Closing Thoughts

In this tutorial we took a closer look at how comparisons and sorting work in a SQLite database. We saw that collation functions are called only on TEXT values, while all other data types define their own non-customizable sorting algorithm. Then, we described all predefined collations: NOCASE for case-insensitive comparisons, RTRIM for ignoring trailing spaces, and BINARY for treating strings like sequences of bytes (which is the default collation for BLOBs).

In the second part of the tutorial we learned how to define our own collations using Python and the sqlite3 create_collation method. A collation is simply a function (or any other callable object) returning a positive value, a negative value, or 0, according to its two arguments. We gave many examples of user-defined collation functions. When it comes to Unicode characters, the NOCASE collation has severe shortcomings, so we addressed an alternate solution using the locale module from the Python standard library.

Finally, in the last part of the tutorial, we warned the user against common mistakes and “gotchas” concerning collation functions. First, SQLite will propagate any exception raised by a collation function, so just catching sqlite3.Error exceptions won’t suffice. We must be aware that collation functions will also affect the sorting algorithm of indices, but it doesn’t control how the LIKE operator and all other pattern matching operators work. We also pointed out that registering multiple collations under the same name isn’t a good idea, since it merely overwrites previously registered functions.

Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come. After that, please subscribe to our email list for more Python tutorials.

Get Our Python Developer Kit for Free

List of All Examples in This Tutorial

Collation Functions Apply Only to TEXT Data
Using Column Declaration to Set a Collation
Effects of Collation on Equality Testing
Collation and Relational Operators
Effects of Collation on ORDER BY Clause
Effects of Collation on GROUP BY Clause
Precedence of Column Collation
Listing All Registered Collations
The RTIM Collation Ignores Trailing Spaces
NOCASE Ignores the case of ASCII Characters
Using BINARY Collation on TEXT Objects
A Collation Algorithm for Band Names
A Collation for the Latin-1 Character Set
Equality of Unicode Strings
Sorting Under Different Locale Settings
A Locale-Aware Collation for Sorting Unicode Data
A Collation Function Raising a ValueError Exception
How Collations Affect the DISTINCT Keyword
Collations and the LIKE Operator
Collations and Indices
Registering Multiple Collations Under the Same Name

Making New Collations with Python SQLite create_collation

The Python Tutorials Blog

Using Collations

Associating a Column with a Collation

**The COLLATE Operator**

**Collations Affect the ORDER BY and GROUP BY Clauses**

Column Collation Has the Lowest Precedence

Predefined Collations

Ignoring Trailing Spaces with RTRIM

Case-Insensitive Sorting with NOCASE

Treating TEXT Values as a Sequence of Bytes with BINARY

Defining New Collations

A Collation for Latin-1 Characters

Collation Functions and Unicode

Case-insensitive Unicode Equality

Sorting TEXT Data with locale

Remarks on Collation Functions

Handling Exceptions

**Collations and the DISTINCT Keyword**

**The Strange Case of the LIKE Operator**

Collations and Indices

Registering Multiple Collations Under the Same Name

Closing Thoughts

List of All Examples in This Tutorial

About The Python Tutorials Blog