Introduction to CSV Dialects with the Python CSV Module

Welcome to the first of a four part tutorial on the Python csv library. In this part we will get to know the CSV data format, and we will see how the csv library represents it and its dialects. The second part will address the Python CSV reading API. The third part will teach you how to write CSV files using the Python writing API, and the final part will introduce a useful extension to the library providing it with the ability to convert from Python types to CSV data (and the other way around).

The CSV Data Format
Handling CSV Data with the csv Library
- CSV Dialects
- Syntax Diagrams
  - How to Read Syntax Diagrams
  - Syntax Diagrams for the RFC Specification
Closing Thoughts
References

The CSV Data Format

The Comma-Separated Values (CSV) is a data format suitable for storing tabular data, i.e. data organized in rows with the same number of columns. This format is widely used by database management systems and spreadsheets as a data interchange format. The problem is that, as of today, there is no standard definition of such a format, so that different implementations use their own dialect. In this section we’ll set a common ground for all these dialects. In the following sections we’ll see how the Python csv library, or Python CSV module, deals with them.

What Do We Mean by CSV?

Since there are so many dialects of CSV, what do we really mean by CSV format? The closest attempt to a standard definition is RFC 4180:2005, which is just an informational memo. We’ll use this definition in the following sections. As we will see, the guidelines from the RFC memo will leave some questions unanswered, but we will deal with them later in the tutorial. Syntax diagrams are also provided to sum up all syntax rules for a conforming CSV file.

Headers and Records

Every CSV file is divided into many rows, called records, by a line delimiting sequence of chars that we’ll call newline sequence (NL). The RFC memo uses carriage return (CR) followed by a line feed (LF) character as newline sequence. The last line of the file may omit the line delimiter.

Before any record, there may be a list of field names called header. This list has the same newline sequence, field delimiter, and number of fields of the records that follow it.

The RFC memo gives no utter meaning to headers. For example, we cannot tell the type of data stored in a field by its name. Let’s not worry about that, though! Just know our tutorial on adapters and converters will suggest a loophole.

Fields

A row can be further split into fields by a field delimiting character. In CSV files the field delimiter is a comma. There should be the same number of fields for each record in the file.

There are times when our data contains the field delimiter or the line delimiter. How do you structure your CSV in such a way that it doesn’t think you’re trying to type a new delimiter? CSV allows you to do this by escaping the field, i.e. putting it between double quote characters; e.g. in a,b,"c,d",e the c,d field is escaped using " because it contains the field delimiter ,. Each " in a field must be escaped with a "". In other words, if your field already has a quotation mark, you must escape it with another quotation mark. For example, a "b" c must be escaped as a ""b"" c.

Questions Left Unanswered

Up to this point we concerned ourselves with the syntax of CSV data, but there are still some issues to be taken care of:

We stated that CSV is a text format, i.e. a sequence of characters, but we dodged the problem of encoding the characters, i.e. how each character is represented in binary code. The RFC memo uses ISO 646:1991 encoding, but that’s not mandatory. Each library dealing with CSV data must first decode the underlying text. We’ll see in the second part of this tutorial how the Python csv library does that.
All reading and writing operations on a CSV file are performed by some library. The RFC memo does not describe such a tool, notably, it does not deal with error reporting and it does not say how to treat ill-formed data. We’ll see how error handling is done in the Python csv library in the second part of this tutorial.
We often referred to CSV data as files, and the RFC memo does that, too. The word “file” gives the idea of something stored in a file system, for which some I/O operation must be performed in order to retrieve it. What about other data sources, like function generators and a database cursor? We’re going to show you how the Python csv library can handle data sources other than files both in part two and three.
The RFC memo requires that all records and the header have the same number of fields. How is such a requirement is enforced by a library? The Python csv library can perform these checks using the [DictReader class], as we will see in part two.
All data fields of a spreadsheet or a database table have some type, e.g. numeric, text, date. How can this type of information be retained in the CSV file? The RFC memo makes it clear that fields are plain character strings, so no data type is associated with them. The Python csv library has limited capabilities. Nevertheless, we’ll provide a solution to such a shortcoming using adapters and converters in part four of this tutorial.
We said that CSV data may or may not have a header. So how can we tell when there is a header? The RFC memo defines appropriate MIME parameters for that, but the MIME type may not be available to the library that handles the CSV file. Fortunately, the Python csv library provides a Sniffer class to detect headers from CSV data.

**Handling CSV Data with the Python csv Library**

In the first section of this tutorial we clarified the meaning of the CSV format. We saw that even the RFC memo leaves some questions unanswered. Now we’ll turn our attention to the Python csv module. We’ll see how it represents CSV dialects and how it addresses the issues listed above.

CSV Dialects

As stated previously, there are many dialects of the CSV format. This section summarizes the common differences between different CSV implementation methods:

Record delimiter: the RFC memo uses CR LF, other libraries may use LF or CR.
Field delimiter: the RFC memo uses COMMA, but many libraries and applications, notably Excel and LibreOffice Calc, allow the user to choose an alternative single-character field delimiter.
Field escaping: the RFC memo states that only fields containing LF, CR, COMMA and DOUBLE QUOTE must be escaped. Some libraries escape all non-numeric fields, others leave the choice to the user.
Enforcing record length: the RFC memo requires that all records (including the header) have the same number of fields, i.e. the same length. Some libraries don’t check this requirements.

In the following sections we will learn how the Python csv library deals with all these differences.

Formatting Parameters

To let its functions and classes handle a variety of CSV dialects, the Python csv library defines the following set of so-called formatting parameters:

Parameter	Description	Defaults to
delimiter	One-character field delimiter.	`,`
doublequote	If `True`, it uses two quotation marks to escape a quotation mark, else it uses an escape sequence.	`True`
escapechar	Character used to prefix escape sequence. If `None`, escaping is disabled.	`None`
lineterminator	Sequence of characters used to delimit a record.	`\r\n`
quotechar	Character used as a delimiter for escaped fields.	`"`
quoting	Controls whether quotes can be generated by the writer and read by the reader. The available options are: `csv.QUOTE_ALL`: the writer will quote all fields. `csv.QUOTE_MINIMAL`: the writer will quote just those fields in which there is a special character. `csv.QUOTE_NONNUMERIC`: the writer will quote all non-numeric fields. The reader will convert all numeric data to `float`. `csv.QUOTE_NONE`: the writer will not quote fields.	`csv.QUOTE_MINIMAL`
skipinitialspace	If `True`, all spaces after the field delimiter are ignored, i.e. they are not part of the field.	`True`
strict	If `True`, it raises and exception when the CSV file is ill-formed.	`False`

Code More, Distract Less: Support Our Ad-Free Site

You might have noticed we removed ads from our site - we hope this enhances your learning experience. To help sustain this, please take a look at our Python Developer Kit and our comprehensive cheat sheets. Each purchase directly supports this site, ensuring we can continue to offer you quality, distraction-free tutorials.

Get Our Python Dev Kit

As we’ll see in parts two and three, these parameters can be passed as arguments to functions dealing with CSV data, or as attributes of dialect classes. From what we said earlier about the CSV format, the default values for these parameters meet the requirements of the RFC memo, which we’re using as our “standard” for CSV formatting. Moreover, there is no way to toggle the field length check.

Defining a New Dialect

The Python csv library allows you to store all formatting parameters in classes inherited from the csv.Dialect class. The derived classes don’t need to have all the parameters listed in our formatting table. If you don’t list all the parameters, they’ll simply use their default values. For example, a dialect for the CSV format conforming the RFC memo may be defined as:

def dialect_report(dialect: csv.Dialect):
        """ Prints out all relevant formatting parameters of a dialect. """
        print(
            "delimiter        = %s\n"
            "doublequote      = %s\n"
            "lineterminator   = %s\n"
            "quotechar        = %s\n"
            "skipinitialspace = %s\n" % (
                repr(dialect.delimiter), dialect.doublequote,
                repr(dialect.lineterminator), dialect.quotechar,
                dialect.skipinitialspace))

    class RFCDialect(Dialect):
        delimiter      = ','
        doublequote    = True
        lineterminator = '\r\n'
        quotechar      = '"'
        quoting        = csv.QUOTE_MINIMAL

This dialect_report() function just prints out some useful formatting parameters. It’s designed to give you an idea of the Python csv module dialect syntax.

Before being used elsewhere, a name must be assigned to the new dialect using the function csv.register_dialect(name, dialect, **fmtparams), where dialect is a class derived form the csv.Dialect class, and fmtparams are keyword-only parameters from the formatting table we created earlier. If both are provided, keyword-only parameters will override the respective parameters in the class. A registered name for a dialect can be erased using csv.unregister_dialect(name). Let’s see another example:

class RFCUnixDialect(RFCDialect):
        """ Like RFCDialect, but uses \\n as a line terminator. """
        lineterminator = '\n'

    def register_rfc():
        """ Register the RFC dialect. """
        csv.register_dialect('rfc', RFCDialect)
        print(csv.list_dialects())

    def test_register():
        """ Test all code about dialect definition. """
        register_rfc()
        csv.register_dialect('rfc', RFCUnixDialect)
        dialect_report(csv.get_dialect('rfc'))

Here we defined another CSV dialect, RFCUnixDialect, which is like RFCDialect, but it uses \n as the new line delimiter. register_rfc() registers the RFCDialect as 'rfc'. csv.list_dialects() returns the list of all registered dialects. test_register() shows that if we register another dialect with the same name of a previously registered dialect, the former gets overwritten without any notification. There are some predefined dialects, like csv.excel. In order to avoid name clashes with predefined dialects, you may use the csv.list_dialects() as follows:

def is_available_name(name: str) -> bool:
        """ Return `True` if `name` hasn't been registered yet. """
        return not name in csv.list_dialects()

As of Python 3.8, the csv library defines the following dialects:

Parameter	default	excel	excel-tab	unix
delimiter	,	,	\t	,
doublequote	True	True	True	True
escapechar	None	None	None	None
lineterminator	\r\n	\r\n	\r\n	\n
quotechar	"	"	"	"
quoting	csv.QUOTE_MINIMAL	csv.QUOTE_MINIMAL	csv.QUOTE_MINIMAL	csv.QUOTE_ALL
skipinitialspace	True	False	False	False
strict	False	False	False	False

In parts two and three of this tutorial, we’re going to show you how to use dialect classes and format parameters for reading and writing CSV data in Python based on a certain dialect.

Detecting a CSV Dialect

The Python csv library features a Sniffer class, which can detect a dialect suitable for representing certain CSV data. We will use the studio discography of the Dutch symphonic metal band Epica as sample for our dialect detection function:

Title	Label	Release Date
The Phantom Agony	Transmission	2003-06-05
Consign to Oblivion	Transmission	2005-04-21
The Divine Conspiracy	Nuclear Blast	2007-09-07
Design Your Universe	Nuclear Blast	2009-10-16
Requiem for the Indifferent	Nuclear Blast	2012-03-09
The Quantum Enigma	Nuclear Blast	2014-05-02
The Holographic Principle	Nuclear Blast	2016-09-30

If you’d like to follow along, you can download the dialect-sample.csv file directly.

As you can see, the table has a header with 3 fields (Title, Label and Release Date), and 7 records. The detect_dialect_and_read() function is a wrapper around the csv.read() function, which we will describe in the second part of our tutorial.

import csv

    def detect_dialect_and_read(source: str):
        """ Detect the dialect of a source and read it. """
        with open(source, encoding='utf-8') as src:
            # Guess a suitable dialect by reading at most 1024 byte
            sample = src.read(1024)
            dialect = csv.Sniffer().sniff(sample)
            dialect_report(dialect)
            print("Does the source have a header? %s"
                  % csv.Sniffer().has_header(sample))

            # Reset the file cursor to the beginning of the file
            src.seek(0)

            # Create a new reader object using the CSV dialect
            for row in csv.reader(src, dialect):
                print(row)

    def test_detect():
        """ Tests all code about dialect detection. """
        detect_dialect_and_read('dialect-sample.csv')

Code More, Distract Less: Support Our Ad-Free Site

Get Our Python Dev Kit

open() creates a new context manager src, representing the file. The CSV data in the source file dialect-sample.csv has been encoded using UTF-8, so we must remember to set the encoding parameter accordingly. After we do this, we use the sniff() method to guess the CSV dialect from the first 1024 bytes (at most) read from the source.

It’s worth noting that the suggested algorithm for determining the encoding of an HTML page also uses at most 1024 bytes from the start of the HTML page to guess its encoding.

If the Sniffer can’t find a suitable dialect, it raises an exception. Finally, the dialect is passed to the csv.read() function to create an iterator object, from which all CSV data can be read one record at a time. This object has a dialect attribute representing the CSV dialect in use.

The Sniffer can also tackle the problem of determining whether a CSV file has a header by using its has_header() method on a sample data from the file. It returns False if no header was found in the sample, True otherwise. In our example, it returns True.

Now that you understand the nuances of working with CSV files in Python, you’re ready to use the Python csv module to read and write CSV files. Our next tutorial will show you how. All sample files referenced in our tutorial series on the Python csv module can be extracted from this .zip file.

Syntax Diagrams

How to Read Syntax Diagrams

Since syntax diagrams are new to this blog, we’re going to spend a few minutes describing what they are and how they’re read.

Syntax diagrams are a graphical way of representing the grammar of a language. Each diagram stands for a rule, and it is labeled with a name, by which it can be referred from other diagrams. Syntax diagrams are built up from a few graphic primitives:

A rectangle with a label referring to another syntax diagram (the so-called non-terminal symbol).
A rounded rectangle containing a sequence of one or more characters (the so-called terminal-symbols).
Lines connecting various parts of the diagram.
Arrows marking the only direction in which a line can be traversed.

Syntax diagrams should be read as follows:

Each diagram has a rule representing the whole language (a so-called start symbol); e.g. in the syntax diagrams below, file is the start symbol.
A diagram should be read from left to right, following the arrows.
When a line splits into two or more lines, you can follow only one line at a time. This case roughly equals to selection statements in programming languages.
Sometimes an arrow goes back to a point of the diagram which has already been traversed. This case roughly equals to iterative statements in programming languages.
All lines of the diagram will eventually be merged into a single line at the exiting point of the rule.

Let’s see an example. These are the syntax diagrams for a number’s exponent in the JSON data format (the start symbol is exponent):

Syntax Diagrams for JSON Exponents

The following diagrams shows how to build the exponent e+18 following the syntax diagrams. The path that we follow at each step is marked by a thick black line. You can pretend that you are following that path with your fingertip.

Building the exponent e+18

Syntax Diagrams for the RFC Specification

The following syntax diagrams sum up the grammar of the CSV data format, as defined by RFC 4180:2005. For reasons discussed previously, we chose not to tie these syntax diagrams to a specific charset, as the RFC memo does. So be aware that:

The NL symbol stands for the implementation-dependent newline sequence.
The COMMA, CR (carriage return), LF (line feed), DQUOTE (double quote) symbols stand for their respective encoding-dependent sequence of bytes. For example in the ASCII charset these symbols would have been encoded, respectively, as 0x2C, 0x0D, 0x0A and 0x22.
The textdata rule stands for all graphic characters of the charset other than COMMA and DQUOTE.
The name rule equals the field rule.

CSV Syntax Diagrams

Code More, Distract Less: Support Our Ad-Free Site

Get Our Python Dev Kit

Closing Thoughts

In this tutorial we discussed the CSV data format, and we came to terms with its dialects. Then we learned how the Python csv module handles all these flavors of CSV using formatting parameters and the Dialect class. We even built our own dialect conforming to the RFC specification.

Subscribe to our Python Tutorials Blog, so you won’t miss the next part of our CSV series, where we discuss reading CSV files. Once you subscribe, share this article with your friends on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.

References

This section features normative references for the CSV format and for the csv library.

Y. Shafranovich, Common Format and MIME Type for Comma-Separated Values (CSV) Files, RFC 4180:2005
Kevin Altis, Dave Cole, Andrew McNamara, Skip Montano, Cliff Wells, CSV File API, PEP 0305:2003

Introduction to CSV Dialects with the Python CSV Module

The Python Tutorials Blog

The CSV Data Format

What Do We Mean by CSV?

Headers and Records

Fields

Questions Left Unanswered

**Handling CSV Data with the Python csv Library**

CSV Dialects

Formatting Parameters

Defining a New Dialect

Detecting a CSV Dialect

Syntax Diagrams

How to Read Syntax Diagrams

Syntax Diagrams for the RFC Specification

Closing Thoughts

References

About The Python Tutorials Blog