Collations are functions used to compare TEXT data in a SQLite database. They are invoked by expressions with relational operators, by sorting with GROUP BY
and ORDER BY
clauses, and by indices. The default collation sorts TEXT data in a case-sensitive way, but it lacks any support for locale collations. In this tutorial we will describe all predefined collations, and we will learn how to define our own. The Python source for this tutorial can be found here
- Using Collations
- Predefined Collations
- Defining New Collations
- Remarks on Collation Functions
- Closing Thoughts
Using Collations
Collations are used to compare TEXT values in a SQL database. Once they have been registered in a SQLite connection, they will be applied to relational operators, ORDER BY
and GROUP BY
clauses, or indices. Given a table with values of various data types, SQLite will sort them using the following algorithm:
- NULLs comes first.
- Then, numeric types are listed, using the usual sorting algorithm for real values; e.g.
1 < 1.2 < 1.23
- TEXT values follow, using BINARY collation or an explicit collation.
- Finally, BLOBs are sorted using BINARY collation.
Regardless of the collation, SQLite calls collation functions only on TEXT data. Let’s see how these rules work:
import sqlite3
def test_only_text():
""" Collation functions are called only on TEXT data. """
def collate_verbose(test1, test2):
print('collate(%s, %s)' % (test1, test2))
return 1 if test1 > test2 \
else -1 if test1 < test2 else 0
with sqlite3.connect(':memory:') as conn:
conn.create_collation('verbose', collate_verbose)
conn.execute('CREATE TABLE test(c);')
conn.executemany('INSERT INTO test VALUES(?);',
( (1, ), ('lantern', ), (b'coffee', ),
('dog', ), (b'dish', ), (b'kettle', ),
(2.3, ), (None, ), ('home', )
))
for item in conn.execute(
'SELECT c FROM test ORDER BY c COLLATE verbose;'):
print(item[0])
test_only_text()
The collate_verbose()
statement works like typical Python string comparison, except it prints its arguments before returning a value. This function has been registered under the name 'verbose'
. In this example, we create a SELECT
statement to sort all data in the table in ascending order. The COLLATE
operator assigns the test_only_text()
, SQLite calls collate_verbose()
only when both parameters are of TEXT type:
collate(lantern, dog) collate(dog, home) collate(lantern, home) None 1 2.3 dog home lantern b'coffee' b'dish' b'kettle'
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Associating a Column with a Collation
Before using a collation, we must associate it with a set of values. There are two ways doing that:
- either by stating the collation in the column declaration, or
- by using the
COLLATE
operator (as we will see in the next section).
The COLLATE coll
clause, which comes after the name and type of the column, associates a col TEXT COLLATE nocase
, where a TEXT
and uses the
import sqlite3
def test_column_declaration():
""" Testing collation via column declaration. """
with sqlite3.connect(':memory:') as conn:
conn.execute('CREATE TABLE animals(' \
'name TEXT COLLATE nocase);')
conn.executemany('INSERT INTO animals VALUES(?);',
( ('dog', ) , ('cat', ) , ('crocodile', ),
('OTTER', ), ('duck', ), ('BEAVER', ),
('bear', ) , ('FISH', ), ('pigeon', )
) )
for animal in conn.execute(
'SELECT * FROM animals ORDER BY name ASC;'):
print(animal[0])
test_column_declaration()
Here we define an TEXT
and using the predefined NOCASE collation. Finally, we sort all animal’s names in the table in descending order using a SELECT
statement. The output of the test_column_declaration()
function will be:
bear BEAVER cat crocodile dog duck FISH OTTER pigeon
As we can see, all data have been compared in a case-insensitive way, so that, for example, 'dog' < 'FISH'
. The default algorithm would have listed all uppercase strings first, then the lowercase strings.
The COLLATE Operator
The collate()
operator associates a specific collation with an expression. It doesn’t perform any transformation on the result of the expression, but it changes how equality and ordering are tested, according to the following rules:
- use the explicit collation of the leftmost operand;
- use the explicit collation of the rightmost operand;
- if the leftmost expression is a direct reference to a column with a collation, the column collation is used;
- if the rightmost expression is a direct reference to a column with a collation, the column collation is used;
- otherwise, the default BINARY collation is used.
SQLite follows this algorithm from 1 to 5, and stops at the first rule it can apply. Here’s an example demonstrating this concept:
import sqlite3
def test_collation_and_equality():
""" Effects of collation on equality testing. """
with sqlite3.connect(':memory:') as conn:
# Using leftmost collation
print("collate('home', NOCASE) == 'HOME'? %s" \
% bool(conn.execute(
"SELECT 'home' COLLATE NOCASE == 'HOME' ;").fetchone()))
print("collate('HOME', BINARY) == collate('home', NOCASE)? %s" \
% bool(conn.execute(
"SELECT 'HOME' COLLATE BINARY == 'home' COLLATE NOCASE;").fetchone()))
# Using rightmost collation
print("'HOME' == collate('home', NOCASE)? %s" \
% bool(conn.execute(
"SELECT 'HOME' == 'home' COLLATE NOCASE;").fetchone()))
test_collation_and_equality()
In this example we execute three SELECT
statements, all comparing two string samples for equality. In the first case, the left operand has an explicit NOCASE collation, while the right operand has no collation. According to rule 1, this test will succeed. In the second case, both the left and the right operands have an explicit collation, but rule 1 is applied first, so the comparison uses the BINARY collation, and the test will fail. Finally, in the third cases only the rightmost operand has an explicit NOCASE collation, so, according to rule 2, the test will succeed. Let’s see the results of the test_collation_and_equality()
function:
collate('home', NOCASE) == 'HOME'? True collate('HOME', BINARY) == collate('home', NOCASE)? False 'HOME' == collate('home', NOCASE)? True
Beware that the same results holds for any other relational operator: !=
, <
, <=
, >
, >=
, IS
and IS NOT
. Let’s see an example involving all these operators:
import sqlite3
def test_relational_operators():
""" Using collations with other relational operators. """
with sqlite3.connect(':memory:') as conn:
for op in ('!=', '<', '<=', '>', '>=', 'IS', 'IS NOT'):
ans = conn.execute(
"SELECT 'Chuck' COLLATE NOCASE %s 'chuck';" % op).fetchone()
print("'Chuck' %-6s 'chuck'? %s" \
% (op, bool(ans[0])))
test_relational_operators()
The IS
operator behaves like the =
operator, except that it will return 1 if both operands are NULL, and 0 if one operand is NULL and the other is not. Since the first operand has NOCASE collation, all expressions will be evaluated as both operands were equal, so the output will be:
'Chuck' != 'chuck'? False 'Chuck' < 'chuck'? False 'Chuck' <= 'chuck'? True 'Chuck' > 'chuck'? False 'Chuck' >= 'chuck'? True 'Chuck' IS 'chuck'? True 'Chuck' IS NOT 'chuck'? False
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Collations Affect the ORDER BY and GROUP BY Clauses
Collations affect the ORDER BY
clause as well, since its execution involves comparisons between the values of a column:
import sqlite3
def test_collation_and_orderby():
""" Effects of collation on the `ORDER BY` clause. """
with sqlite3.connect(':memory:') as conn:
conn.execute('CREATE TABLE wardrobe(item TEXT);')
conn.executemany(
'INSERT INTO wardrobe VALUES(?);',
( ('jacket', ) , ('skirt', ),
('TIE', ) , ('JACKET', ),
('shirt', ) , ('jumper', ),
('JUMPER', ) , ('SHIRT', ),
('SKIRT', ) , ('tie', ) ) )
for item in conn.execute(
'SELECT item FROM wardrobe '\
'ORDER BY item COLLATE NOCASE;'):
print(item[0])
test_collation_and_orderby()
We create a SELECT
statement will sort them in a case-insensitive manner, since in its ORDER BY
clause we assigned a NOCASE collation to the
jacket JACKET jumper JUMPER shirt SHIRT skirt SKIRT TIE tie
Setting a collation also affects the GROUP BY
clause, and the related HAVING
clause, as we can see from the following example:
import sqlite3
def test_collation_and_groupby():
""" Effects of collations on the `GROUP BY` clause. """
with sqlite3.connect(':memory:') as conn:
conn.execute('CREATE TABLE artists(' \
'name TEXT NOT NULL, ' \
'country TEXT NOT NULL COLLATE NOCASE);')
conn.executemany(
'INSERT INTO artists VALUES (?, ?);',
( ('Simone Simons', 'NLD'),
('Federica Lanna', 'ITA'),
('Joe Dassin', 'fra'),
('Sharon den Adel', 'nld'),
('Clementine Delauney', 'fra'),
('Pierre Perret', 'FRA'),
('Sara Squadrani', 'ita'),
('Edith Piaf', 'fra'),
('Floor Jansen', 'NLD'),
('Charles Aznavour', 'fra'),
('Sergio Endrigo', 'ITA'),
))
for country, n in conn.execute(
'SELECT country, count(country) AS n ' \
'FROM artists GROUP BY country ' \
"HAVING country=='ita' OR country=='fra' " \
'ORDER BY n DESC;'):
print('%-10s%d' % (country.upper(), n))
test_collation_and_groupby()
Here we create an NOCASE
collation. The SELECT
statement computes how many artists are French or Italian:
FRA 5 ITA 3
There are 5 French artists and 3 Italian artists. We’re able to extract this data because the HAVING
clause filter its values in a case-insensitive manner, and so does the GROUP BY
clause, Otherwise, we would have had 4 different entries: FRA, fra, ITA, and ita.
Column Collation Has the Lowest Precedence
If both column collation and the COLLATE
operator are available, the COLLATE
operator has higher precedence:
import sqlite3
def test_column_precedence():
""" Precedence of the column collation. """
def collate_nocase_reverse(test1, test2):
t1, t2 = test1.lower(), test2.lower()
return 1 if t2 > t1 else -1 if t1 > t2 else 0
with sqlite3.connect(':memory:') as conn:
conn.create_collation('reverse',
collate_nocase_reverse)
conn.execute('CREATE TABLE words(w COLLATE NOCASE);')
conn.executemany('INSERT INTO words VALUES(?);',
( ('pencil', ), ('STICKER', ), ('ERASER', ),
('BOOK', ), ('PEN', ), ('ink', )
))
print('Using column collation:')
for item in conn.execute(
'SELECT w FROM words ORDER BY w ASC;'):
print(item[0])
print('\nUsing the `COLLATE` operator:')
for item in conn.execute(
'SELECT w FROM words ' \
'ORDER BY w COLLATE reverse ASC;'):
print(item[0])
test_column_precedence()
The collate_nocase_reverse()
function is similar to the default NOCASE collation, but it sorts all data in reverse order. In the example above, the SELECT
statement sorts all data in ascending order, regardless of the characters’ case. The second SELECT
statement specifies the ORDER BY
clause, so now the COLLATE
operator has precedence over column collation, so all data will be sorted in reverse order:
Using column collation: BOOK ERASER ink PEN pencil STICKER Using the `COLLATE` operator: STICKER pencil PEN ink ERASER BOOK
Predefined Collations
Before diving into user-defined collations, we will describe all predefined collations. You can use the PRAGMA collation_list;
statement to list all available collations for a SQLite connection:
import sqlite3
def test_default_collations():
""" Listing all collations. """
with sqlite3.connect(':memory:') as conn:
print('Default collations:')
for cid, name in conn.execute(
'PRAGMA collation_list;'):
print(cid, name)
test_default_collations()
which lists the following collations:
Default collations: 0 RTRIM 1 NOCASE 2 BINARY
Collations inserted last have lower integer indices.
Ignoring Trailing Spaces with RTRIM
The RTRIM (Right Trim) collation sorts all data ignoring trailing spaces. Leading spaces are still relevant.
import sqlite3
def test_rtrim_collation():
""" Testing the RTRIM collation. """
with sqlite3.connect(':memory:') as conn:
conn.execute(
'CREATE TABLE paragraph(' \
'id INTEGER PRIMARY KEY, ' \
'textobj TEXT NOT NULL, ' \
'background TEXT COLLATE RTRIM, ' \
'foreground TEXT COLLATE RTRIM);')
conn.executemany(
'INSERT INTO paragraph( ' \
'textobj, background, foreground) ' \
'VALUES (?, ?, ?);',
( ('home' , 'yellow ', 'black '),
('hound', 'green ' , 'red '),
('chair', 'red ' , 'black '),
('hat' , 'yellow ' , 'blue '),
('cat' , 'red ' , 'black ')
))
for text, foreg in conn.execute(
'SELECT textobj, foreground FROM paragraph ' \
"WHERE foreground == 'black';"):
print('%r has a %s foreground.' \
% (text, foreg.rstrip()))
test_rtrim_collation()
In this example we create a SELECT
statement to retrieve all text objects having a black foreground. Since we associated the SELECT
statement returns 3 matches, as we can see from the test_trim_collation()
’s output:
'home' has a black foreground. 'chair' has a black foreground. 'cat' has a black foreground.
Case-Insensitive Sorting with NOCASE
The NOCASE collation ignores a character’s case when comparing ASCII characters (i.e. from U+0000 to U+007F); e.g. 'A' == 'a'
, since they are, respectively, U+0041 and U+0061. But that doesn’t hold for 'ä' == 'a'
since the leftmost character (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) is not in the ASCII range.
import sqlite3
def test_nocase_collation():
""" Testing the NOCASE collation. """
with sqlite3.connect(':memory:') as conn:
conn.execute('CREATE TABLE words(' \
'lower TEXT COLLATE NOCASE, upper TEXT);')
conn.executemany(
'INSERT INTO words VALUES(?, ?);',
( ('desert' , 'DESERT'),
('country', 'COUNTRY'),
('città' , 'CITTÀ'), # non-ASCII
('hôpital', 'HÔPITAL'), # non-ASCII
('niño' , 'NIÑO'), # non-ASCII
))
for lower, upper, match in conn.execute(
'SELECT lower, upper, lower == upper ' \
'FROM words;'):
print('%r == %r? %s' \
% (lower, upper, bool(match)))
test_nocase_collation()
In this example we define a SELECT
statement to compare the two columns for equality, the NOCASE collation will be applied to both columns since the test_nocase_collation()
will be:
'desert' == 'DESERT'? True 'country' == 'COUNTRY'? True 'città' == 'CITTÀ'? False 'hôpital' == 'HÔPITAL'? False 'niño' == 'NIÑO'? False
which reveals that, while the case-insensitive equality works fine in the first two samples, which evaluate to True, it fails for the last three samples since NOCASE collation sees no-ASCII characters with different case as distinct values. Don’t worry about it: Python itself will soon come to our rescue!
Treating TEXT Values as a Sequence of Bytes with BINARY
The BINARY collation treats text data as a sequence of bytes, comparing two samples byte by byte. If both sequence have the same bytes, the shorter comes first. As we have already stated above, all BLOB objects are compared using this collation. By assigning the BINARY collation to TEXT objects, we may expect that TEXT and BLOB objects will be sorted together, since they share the same collation. Let’s see if we are right:
import sqlite3
def test_binary_collation():
""" Using BINARY collation on TEXT objects. """
with sqlite3.connect(':memory:') as conn:
conn.execute('CREATE TABLE spices('
'name COLLATE BINARY);')
conn.executemany('INSERT INTO spices VALUES(?);',
( (b'cinnamon', ), ('rosemary', ),
('thyme', ) , (b'jasmine', ),
('saffron', ) , ('sage', )
))
for item in conn.execute(
'SELECT name FROM spices ' \
'ORDER BY name ASC;'):
print(item[0])
test_binary_collation()
We set up a
rosemary saffron sage thyme b'cinnamon' b'jasmine'
All TEXT data are still sorted before BLOB data! So, even when using TEXT data with BINARY collation, we must remember the general rule for using Collations: all numeric data come first, then TEXT data, finally BLOBs.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Defining New Collations
There are times when more complex comparison algorithms are required, and we can’t always rely on predefined collations. SQLite allows developers to define their own collations by registering collation functions, which are invoked to compare strings as part of any sorting process. A collation function accepts two TEXT values as parameters, and it returns:
- -1 (or any other negative value) if the second string is greater than the first .
- 0 if the two strings compare equal.
- 1 (or any other positive value) if the first string is greater than the second.
You must use the create_collation(name, func)
method of a SQLite connection to register a new collation, where create_collation('coll', None)
to delete a previously registered collation named
import sqlite3
import re
def test_band_names():
""" Defining a collation algorithm for band names. """
def collate_band(name1, name2):
""" A collation skipping leading articles from a name. """
n1 = re.sub(r'^(a |the |an )', '', name1.lower())
n2 = re.sub(r'^(a |the |an )', '', name2.lower())
return 1 if n1 > n2 else -1 if n1 < n2 else 0
with sqlite3.connect(':memory:') as conn:
conn.create_collation('band', collate_band)
conn.execute('CREATE TABLE bands(' \
'id INTEGER PRIMARY KEY, ' \
'name TEXT NOT NULL COLLATE band, ' \
'country TEXT NOT NULL);')
conn.executemany(
'INSERT INTO bands(name, country) VALUES (?, ?);',
( ('A Taste of Honey', 'USA'),
('The Killers', 'USA'),
('A-ha', 'NOR'),
('The Beatles', 'GBR'),
('The Clash', 'GBR')
) )
for band in conn.execute(
'SELECT name FROM bands ORDER BY name ASC;'):
print(band[0])
test_band_names()
In this example we define a re.sub(pattern, repl, sample)
function finds the leftmost occurrence of <
operator for str
objects. Then, we create a small database of bands. Finally, we use the SELECT
statement to sort all band names in ascending order, and we get this list:
A-ha The Beatles The Clash The Killers A Taste of Honey
All bands have a leading article, except for A-ha, where ‘A’ is not followed by a single space, so it isn’t stripped before comparison.
A Collation for Latin-1 Characters
Here we describe a collation for Latin-1 characters, registered as 'latin1'
, which handles all characters in a TEXT value as follows:
- It removes all non-alphanumeric characters (e.g. punctuation, spaces).
- It maps all ASCII lowercase characters to their respective uppercase versions.
- It maps all Latin-1 alphabetic characters to ASCII alphabetic characters which resemble them, e.g.
'ä'
(U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) is mapped to'A'
(U+0041 LATIN CAPITAL LETTER A). - It removes all characters outside the Latin-1 range.
After all characters in both strings have been converted, a case-insensitive comparison, resembling the default NOCASE collation, takes place. This time, we implement our collation as a class with a __call__()
method (a callable object).
import sqlite3
def test_latin1_collation():
""" Defining a Latin-1 collation. """
class Latin1Collator:
""" A collation for Latin-1 strings. """
LATIN1_MAP = [
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , '' , '' , '' , '' , '' , '' , # 00
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , '' , '' , '' , '' , '' , '' , # 10
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , '' , '' , '' , '' , '' , '' , # 20
'0', '1', '2', '3', '4', '5', '6', '7',
'8', '9', '' , '' , '' , '' , '' , '' , # 30
'' , 'A', 'B', 'C', 'D', 'E', 'F', 'G',
'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', # 40
'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
'X', 'Y', 'Z', '' , '' , '' , '' , '' , # 50
'' , 'A', 'B', 'C', 'D', 'E', 'F', 'G',
'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', # 60
'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
'X', 'Y', 'Z', '' , '' , '' , '' , '' , # 70
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , '' , '' , '' , '' , '' , '' , # 80
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , '' , '' , '' , '' , '' , '' , # 90
'' , '' , '' , '' , '' , '' , '' , '' ,
'' , '' , 'A', '' , '' , '' , '' , '' , # A0
'' , '' , '2', '3', '' , '' , 'P', '' ,
'' , '1', 'O', '' , '' , '' , '' , '' , # B0
'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C',
'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', # C0
'D', 'N', 'O', 'O', 'O', 'O', 'O', '' ,
'O', 'U', 'O', 'U', 'U', 'Y', '' , 'B', # D0
'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C',
'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', # E0
'D', 'N', 'O', 'O', 'O', 'O', 'O', '' ,
'O', 'U', 'O', 'U', 'U', 'Y', '' , 'Y' # F0
]
def map_string(self, s):
res = ''
for c in s:
if 0 <= ord(c) <= 255:
res += self.LATIN1_MAP[ord(c)]
return res
def __call__(self, x, y):
s1, s2 = self.map_string(x), self.map_string(y)
return 1 if s1 > s2 else -1 if s1 < s2 else 0
with sqlite3.connect(':memory:') as conn:
conn.create_collation('latin1', Latin1Collator())
conn.execute('CREATE TABLE words(' \
'w1 TEXT NOT NULL COLLATE latin1, ' \
'w2 TEXT NOT NULL COLLATE latin1, ' \
'translation TEXT NOT NULL);')
conn.executemany(
'INSERT INTO words VALUES(?, ?, ?);',
( ('schön' , 'sch-on' , 'beautiful'),# German
('Hyvästi', 'HYVASTI!', 'goodbye'), # Finnish
('igår' , 'igar' , 'yesterday'),# Swedish
('perché' , 'Perchè?' , 'why'), # Italian
('être' , 'etre' , 'to be'), # French
('año' , 'A.Ñ.O' , 'year'), # Spanish
))
for w1, w2, match in conn.execute(
'SELECT w1, w2, w1 == w2 FROM words;'):
print('%r == %r? %s' \
% (w1, w2, bool(match)))
test_latin1_collation()
The map_to_latin1()
method implements the conversion algorithm described above. The __call__()
method converts both its arguments to ASCII uppercase strings, then performs the comparison. We define a SELECT
statement performs an equality test using the latin1
collation. Using this special Python SQLite3 collation function, all the left strings will match the right strings, as we can see from the output of the test_latin1_collation()
function:
'schön' == 'sch-on'? True 'Hyvästi' == 'HYVASTI!'? True 'igår' == 'igar'? True 'perché' == 'Perchè?'? True 'être' == 'etre'? True 'año' == 'A.Ñ.O'? True
The conversion enforced by
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Collation Functions and Unicode
As we have seen earlier in this tutorial, the NOCASE collation is not Unicode-aware. Neither are the upper()
and lower()
predefined scalar functions; they both only convert ASCII characters, so we can’t use them to build a case-insensitive comparison function of Unicode characters.
The bottom line is this: if we want to sort Unicode TEXT data in a case-insensitive way, we must write our own function.
Case-insensitive Unicode Equality
If we want to test two arbitrary Unicode strings for equality in a case-insensitive way, we can’t use the NOCASE
collation, since it only properly handles ASCII characters. Luckily, Python has an extensive built-in support for Unicode strings. Notably, the upper()
and lower()
methods of the str
class are Unicode-aware; e.g. upper()
. Using these Python features, we can readily implement our custom case-insensitive equality test collation as follows:
import sqlite3
def test_unicode_equality():
""" Equality of Unicode strings. """
with sqlite3.connect(':memory:') as conn:
conn.create_collation('unicode_nocase',
lambda x, y : 1 if x.lower() > y.lower() \
else -1 if x.lower() < y.lower() else 0)
conn.execute('CREATE TABLE words(' \
'lower TEXT COLLATE unicode_nocase, ' \
'upper TEXT NOT NULL, ' \
'translation TEXT NOT NULL);')
conn.executemany(
'INSERT INTO words VALUES(?, ?, ?);',
( ('Glück' , 'GLÜCK' , 'luck'), # German
('hedelmä', 'HEDELMÄ', 'fruit'), # Finnish
('kärlek' , 'KÄRLEK' , 'love'), # Swedish
('λoγος' , 'ΛOΓΟΣ' , 'reason'), # Greek
('città' , 'CITTÀ' , 'city'), # Italian
('hôpital', 'HÔPITAL', 'hospital'), # French
('niño' , 'NIÑO' , 'boy'), # Spanish
))
for lower, upper, match in conn.execute(
'SELECT lower, upper, lower == upper ' \
'FROM words;'):
print('%r == %r? %s' \
% (lower, upper, bool(match)))
test_unicode_equality()
Take a look at the create_collation
statement. Thanks to Python’s built-in support for Unicode, we can easily overcome the shortcomings of the SQLite NOCASE
collation. By simply defining a unicode_nocase
collation, we now have a collation function capable of correctly handling Unicode characters’ equality tests. Much like our earlier NOCASE collation example, we define a SELECT
statement:
'Glück' == 'GLÜCK'? True 'hedelmä' == 'HEDELMÄ'? True 'kärlek' == 'KÄRLEK'? True 'λoγος' == 'ΛOΓΟΣ'? True 'città' == 'CITTÀ'? True 'hôpital' == 'HÔPITAL'? True 'niño' == 'NIÑO'? True
This time all equality tests are successful, since Python knows how to handle the upper()
method properly for Unicode characters.
Sorting TEXT Data with locale
Comparing two Unicode strings is harder than testing for equality, since we need to define the <
or >
relational operators as well. The default implementation of __lt__
and __gt__
for the str
class just sorts Unicode characters using their code points; e.g. 'A' < 'Z' < 'a'
because ord('A') < ord('Z') < ord('a')
, where ord()
returns the code point for the character.
We usually expect lowercase characters and their respective uppercase values to be next to each other. Once again, Python comes to the rescue with its locale
module, which allows for locale-dependent comparisons. Locale settings are applied using multiple locale categories, prefixed by LC_
. The locale.LC_ALL
category affects all comparison and sorting functions. The locale.LC_COLLATE
category only affects the locale.strcoll()
and the locale.strxfrm()
functions. Let’s see them at work:
import locale
def test_locale_comparison():
""" Compare strings using locale settings. """
samples = 'aäàAboözZ1.'
languages = {
'English' : 'en_US.UTF-8',
'French' : 'fr_FR.UTF-8',
'German' : 'de_DE.UTF-8',
'Italian' : 'it_IT.UTF-8',
'Swedish' : 'sv_SE.UTF-8' }
# Query current locale categories
default_collate = locale.setlocale(locale.LC_COLLATE, '')
print('Default collation is %s' % default_collate)
print('%-10s: %s' \
% ('default', sorted(samples)))
print('%-10: %s' \
% ('codepoints', sorted(samples, key=ord)))
for lang in languages:
try:
locale.setlocale(locale.LC_COLLATE, languages[lang])
print('%-10s: %s' \
% (lang, sorted(samples, key=locale.strxfrm)))
except locale.Error:
print('%s language package not available.' % lang)
# Reset previous locale settings
locale.setlocale(locale.LC_COLLATE, default_collate)
test_locale_comparison()
The test_locale_comparison
tests various comparison algorithms on the locale.setlocale()
function retrieves the current value for the LC_COLLATE
category (notice the second argument of the function call is an empty string). The sorted()
function sorts all items from an iterable object. If the optional sorted('aA', key=ord)
will call ord('a') < ord('A')
, and evaluates to [ 'A', 'a' ]
.
The locale.strxfrm(s)
function formats a string using the current locale. We will demonstrate the behavior of various locale settings. Notice how the locale-dependent comparison is enclosed within a try
statement. We do this because some locale settings may not be available on your operating system, in which case a locale.Error
exception will be raised.
Default collation is en_US.UTF-8 default : ['.', '1', 'A', 'Z', 'a', 'b', 'o', 'z', 'à', 'ä', 'ö'] codepoints: ['.', '1', 'A', 'Z', 'a', 'b', 'o', 'z', 'à', 'ä', 'ö'] English : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z'] French : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z'] German : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z'] Italian : ['.', '1', 'a', 'A', 'à', 'ä', 'b', 'o', 'ö', 'z', 'Z'] Swedish : ['.', '1', 'a', 'A', 'à', 'b', 'o', 'z', 'Z', 'ä', 'ö']
As we can see from the output, each locale applies a different sorting algorithm. Now we are ready to define our locale-dependent collation:
import sqlite3
import locale
def test_locale_aware_collation():
""" A locale-aware collation for sorting Unicode data. """
with sqlite3.connect(':memory:') as conn:
conn.create_collation('unicode', locale.strcoll)
conn.execute(
'CREATE TABLE words(w COLLATE unicode);')
conn.executemany('INSERT INTO words VALUES(?);',
( ('château', ) , ('Château', ),
('chambre', ) , ('chanteuse', ),
('Città', ) , ('città', ),
('Schlösser', ), ('schlösser', ),
('Zenit', ) , ('zenit', )
) )
for word in conn.execute(
'SELECT w FROM words ORDER BY w ASC;'):
print(word[0])
test_locale_aware_collation()
The locale.strcoll(x, y)
function compares LC_COLLATE
category. It returns 1 if x > y
, -1 if x < y
, otherwise 0. We use this function to create a new locale named unicode
. Then, we build a SELECT
statement to sort all data in the table. The ORDER BY
clause will call the locale.strcoll()
function on all pairs of words in the table. The following output refers to an en_US
locale:
chambre chanteuse château Château città Città schlösser Schlösser zenit Zenit
See how much more powerful these custom collation functions are than the default collate sequences? By combining Python with SQLite3, you’re able to build robust collations for handling your database data.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Remarks on Collation Functions
In this section we consolidated useful information you should be aware of before trying to define your own SQLite collation functions using Python.
Handling Exceptions
Unlike scalar and aggregate functions, whose exceptions are caught by SQLite and re-raised as sqlite3.Error
objects, all exceptions raised by collation functions will propagate to its caller. In other words, trying to catch just sqlite3.Error
exceptions won’t be enough.
import sqlite3
def test_collation_exception():
""" A collation function raising a ValueError exception. """
with sqlite3.connect(':memory:') as conn:
def faulty_collation(x, y):
raise ValueError(
'Faulty Collation for (%s, %s)' % (x, y))
conn.create_collation('faulty', faulty_collation)
try:
print("'home' > 'house'? %s" % conn.execute(
"SELECT 'home' COLLATE faulty > 'house';"))
except sqlite3.Error:
print("This handler won't work.")
except ValueError:
print('This handler will be called instead.')
test_collation_exception()
The ValueError
exception regardless of its parameters. When we try to compare ‘home’ to ‘house’, the except
clause. The first clause deals with all exceptions from the sqlite3
module. Instead, the exception is caught by the second clause, as we can see from the output:
This handler will be called instead.
Collations and the DISTINCT Keyword
Since collations affect equality testing, they also control the way the DISTINCT keyword works.
import sqlite3
import random
def test_distinct_keyword():
""" Collations affect the `DISTINCT` keyword. """
class Choice:
def __init__(self):
self.choices = []
def step(self, item):
print('adding %r' % item)
self.choices.append(item)
def finalize(self):
return random.choice(self.choices)
with sqlite3.connect(':memory:') as conn:
conn.create_aggregate('choice', 1, Choice)
conn.execute(
'CREATE TABLE icecream(' \
'flavor TEXT COLLATE NOCASE);')
conn.executemany(
'INSERT INTO icecream VALUES(?);',
( ('chocolate', ) , ('CHOCOLATE', ),
('lemon', ) , ('LEMON', ),
('strawberry', ), ('STRAWBERRY', ),
('vanilla', ) , ('VANILLA', ),
('watermelon', ), ('WATERMELON', )
) )
print("Pick an ice cream's flavor: %s" \
% conn.execute(
'SELECT choice(DISTINCT flavor) FROM icecream;'
).fetchone()[0] )
test_distinct_keyword()
First, we define a 'choice'
. At each call, this function prints its parameter, and then inserts it in a set of choices. Then, we define an choice()
function to pick a random flavor from the table. A typical output of the test_distinct_keyword()
is:
adding 'chocolate' adding 'lemon' adding 'strawberry' adding 'vanilla' adding 'watermelon' Pick an ice cream's flavor: strawberry
The DISTINCT
keyword would usually tell the uppercase and lowercase version of the same word apart, but since the NOCASE
collation, only one version of the same word will be selected. So, the choice()
function will choose from a set having just an instance for each ice cream’s flavor.
The Strange Case of the LIKE Operator
While collations change the way relational operators work, they don’t affect the LIKE
, GLOB
, MATCH
, and REGEX
operators, which can be defined by registering suitable scalar functions. As for the LIKE
operator, if you want case-insensitive comparisons you don’t need the COLLATE NOCASE
operator (or column definition). The PRAGMA case_sensitive_like
controls the behavior of the LIKE
operator. For example:
import sqlite3
def test_like_operator():
""" Collations don't affect the `LIKE` operator. """
with sqlite3.connect(':memory:') as conn:
conn.execute('PRAGMA case_sensitive_like = 1;')
print("'Python' LIKE 'py%%'? %s" \
% bool(conn.execute(
"SELECT 'Python' COLLATE NOCASE " \
"LIKE 'py%';").fetchone()[0]))
test_like_operator()
The test_like_operator()
function uses a PRAGMA
statement to switch to case-sensitive comparison for the LIKE
operator. Then, we set the collation for the first argument of the LIKE
operator to NOCASE
. This move doesn’t affect the behavior of the LIKE
operator, which still uses case-sensitive comparison, as we can see from the function’s output:
'Python' LIKE 'py%'? False
This PRAGMA
statement applies to all databases attached to the connection which executes the statement. It affects only the built-in version of the LIKE
operator, which can be overwritten by registering a custom like()
scalar function with two arguments.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
Collations and Indices
Since collations affect comparison, indices rely on them, too. So, when you define an index on a column with an explicit collation, the index will use that collation to sort all data.
import sqlite3
def test_collation_and_indices():
""" Collations also affect indices. """
with sqlite3.connect(':memory:') as conn:
conn.execute(
'CREATE TABLE people(' \
'id INTEGER PRIMARY KEY, ' \
'name TEXT NOT NULL, ' \
'country TEXT NOT NULL, ' \
'behavior TEXT COLLATE NOCASE);')
conn.executemany(
'INSERT INTO people (name, country, behavior) ' \
'VALUES(?, ?, ?);',
( ('John Hurt', 'GBR', 'kind'),
('Elsa Dwight', 'USA', 'Stubborn'),
('Mario Rossi', 'ITA', 'reckless'),
('Françoise Rauchelle', 'FRA', 'selfish'),
('Juan Gutierrez', 'ESP', 'Generous'),
('Olga Persson', 'SWE', 'Shy'),
('Georg Schelling', 'DEU', 'stubborn'),
('Mary Fitzgerald', 'USA', 'Kind')
))
conn.execute(
'CREATE INDEX person_behavior ' \
'ON people(behavior);')
for b in conn.execute(
'SELECT behavior FROM people;'):
print(b[0])
test_collation_and_indices()
In this example, we define a NOCASE
collation, meaning that all data will be treated in a case-insensitive manner, as we can see from the output:
Generous kind Kind reckless selfish Shy Stubborn stubborn
Think about the typical case-sensitive comparison where 'A' < ... < 'Z' < 'a' < ... < 'z'
. In our example, uppercase data and their respective lowercase values appear one after the other, instead of the records being grouped apart.
Registering Multiple Collations Under the Same Name
The sqlite3
API differs slightly from the SQLite C API, which allows you to register multiple collation functions under the same name, as long as you specify a different string encoding form. In the C API we use the sqlite3_create_collation(conn, name, enc, udp, func)
function to register a new collation, where:
- the
conn parameter represents the connection. - we register the collation under
name . - the
enc parameter is the encoding form of the string to which the collation applies. The available string encoding forms are: UTF-8, UTF-16LE, and UTF-16BE. - we can pass additional data to the collation function using the
udp . - the
func parameter is the collation function.
When using C, if more than one comparison function is available under the same name, SQLite picks the one requiring the least amount of conversion. For example, if we have database encoded in UTF-16LE, and we register both a UTF-8 and a UTF-16LE collation function under the same name, then SQLite will pick the UTF-16LE version, since it requires no conversion. If the UTF-16LE version weren’t available, SQLite would have chosen the UTF-8 version, converting all samples to UTF-8 before comparing them.
On the other hand, the sqlite3
API for Python registers all collation functions under the UTF-8 version, so each time we register a new collation function under the same name, we really overwrite the previous one. Let’s see an example:
import sqlite3
def test_same_name():
""" Registering multiple collations under the same name. """
with sqlite3.connect(':memory:') as conn:
conn.create_collation('str',
lambda x, y: 1 if x > y \
else -1 if x < y else 0)
print("First version of 'str' : 'cat' > 'fish' is %s" \
% bool(conn.execute(
"SELECT 'cat' > 'fish' COLLATE str;").fetchone()[0]))
conn.create_collation('str',
lambda x, y: 1 if x < y \
else -1 if x > y else 0)
print("Second version of 'str': 'cat' > 'fish' is %s" \
% bool(conn.execute(
"SELECT 'cat' > 'fish' COLLATE str;").fetchone()[0]))
test_same_name()
The test_same_name()
function creates a new SELECT
statement uses this version of the collation. Then, we register a second collation function under the same name, which sorts all strings in the reverse order, as we can see from the output of the second SELECT
statement:
First version of 'str' : 'cat' > 'fish' is False Second version of 'str': 'cat' > 'fish' is True
Beware that registering multiple collations under the same name may fail, raising an exception, since SQLite has to make sure there aren’t pending operations on the database still requiring the old collation function.
Closing Thoughts
In this tutorial we took a closer look at how comparisons and sorting work in a SQLite database. We saw that collation functions are called only on TEXT values, while all other data types define their own non-customizable sorting algorithm. Then, we described all predefined collations: NOCASE for case-insensitive comparisons, RTRIM for ignoring trailing spaces, and BINARY for treating strings like sequences of bytes (which is the default collation for BLOBs).
In the second part of the tutorial we learned how to define our own collations using Python and the sqlite3 create_collation
method. A collation is simply a function (or any other callable object) returning a positive value, a negative value, or 0, according to its two arguments. We gave many examples of user-defined collation functions. When it comes to Unicode characters, the NOCASE collation has severe shortcomings, so we addressed an alternate solution using the locale
module from the Python standard library.
Finally, in the last part of the tutorial, we warned the user against common mistakes and “gotchas” concerning collation functions. First, SQLite will propagate any exception raised by a collation function, so just catching sqlite3.Error
exceptions won’t suffice. We must be aware that collation functions will also affect the sorting algorithm of indices, but it doesn’t control how the LIKE
operator and all other pattern matching operators work. We also pointed out that registering multiple collations under the same name isn’t a good idea, since it merely overwrites previously registered functions.
Did you find this free tutorial helpful? Share this article with your friends, classmates, and coworkers on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come. After that, please subscribe to our email list for more Python tutorials.
Get Our Python Developer Kit for Free
I put together a Python Developer Kit with over 100 pre-built Python scripts covering data structures, Pandas, NumPy, Seaborn, machine learning, file processing, web scraping and a whole lot more - and I want you to have it for free. Enter your email address below and I'll send a copy your way.
List of All Examples in This Tutorial
- Collation Functions Apply Only to TEXT Data
- Using Column Declaration to Set a Collation
- Effects of Collation on Equality Testing
- Collation and Relational Operators
- Effects of Collation on
ORDER BY
Clause - Effects of Collation on
GROUP BY
Clause - Precedence of Column Collation
- Listing All Registered Collations
- The RTIM Collation Ignores Trailing Spaces
- NOCASE Ignores the case of ASCII Characters
- Using BINARY Collation on TEXT Objects
- A Collation Algorithm for Band Names
- A Collation for the Latin-1 Character Set
- Equality of Unicode Strings
- Sorting Under Different Locale Settings
- A Locale-Aware Collation for Sorting Unicode Data
- A Collation Function Raising a ValueError Exception
- How Collations Affect the
DISTINCT
Keyword - Collations and the
LIKE
Operator - Collations and Indices
- Registering Multiple Collations Under the Same Name