Welcome to the first of a four part tutorial on the Python
csv library. In this part we will get to know the CSV data format, and we will see how the
csv library represents it and its dialects. The second part will address the Python CSV reading API. The third part will teach you how to write CSV files using the Python writing API, and the final part will introduce a useful extension to the library providing it with the ability to convert from Python types to CSV data (and the other way around).
- The CSV Data Format
- Handling CSV Data with the csv Library
- Closing Thoughts
The CSV Data Format
The Comma-Separated Values (CSV) is a data format suitable for storing tabular data, i.e. data organized in rows with the same number of columns. This format is widely used by database management systems and spreadsheets as a data interchange format. The problem is that, as of today, there is no standard definition of such a format, so that different implementations use their own dialect. In this section we’ll set a common ground for all these dialects. In the following sections we’ll see how the Python
csv library, or Python CSV module, deals with them.
What Do We Mean by CSV?
Since there are so many dialects of CSV, what do we really mean by CSV format? The closest attempt to a standard definition is RFC 4180:2005, which is just an informational memo. We’ll use this definition in the following sections. As we will see, the guidelines from the RFC memo will leave some questions unanswered, but we will deal with them later in the tutorial. Syntax diagrams are also provided to sum up all syntax rules for a conforming CSV file.
Headers and Records
Every CSV file is divided into many rows, called records, by a line delimiting sequence of chars that we’ll call newline sequence (NL). The RFC memo uses carriage return (CR) followed by a line feed (LF) character as newline sequence. The last line of the file may omit the line delimiter.
Before any record, there may be a list of field names called header. This list has the same newline sequence, field delimiter, and number of fields of the records that follow it.
The RFC memo gives no utter meaning to headers. For example, we cannot tell the type of data stored in a field by its name. Let’s not worry about that, though! Just know our tutorial on adapters and converters will suggest a loophole.
A row can be further split into fields by a field delimiting character. In CSV files the field delimiter is a comma. There should be the same number of fields for each record in the file.
There are times when our data contains the field delimiter or the line delimiter. How do you structure your CSV in such a way that it doesn’t think you’re trying to type a new delimiter? CSV allows you to do this by escaping the field, i.e. putting it between double quote characters; e.g. in
Questions Left Unanswered
Up to this point we concerned ourselves with the syntax of CSV data, but there are still some issues to be taken care of:
- We stated that CSV is a text format, i.e. a sequence of characters, but we dodged the problem of encoding the characters, i.e. how each character is represented in binary code. The RFC memo uses ISO 646:1991 encoding, but that’s not mandatory. Each library dealing with CSV data must first decode the underlying text. We’ll see in the second part of this tutorial how the Python
csvlibrary does that.
- All reading and writing operations on a CSV file are performed by some library. The RFC memo does not describe such a tool, notably, it does not deal with error reporting and it does not say how to treat ill-formed data. We’ll see how error handling is done in the Python
csvlibrary in the second part of this tutorial.
- We often referred to CSV data as files, and the RFC memo does that, too. The word “file” gives the idea of something stored in a file system, for which some I/O operation must be performed in order to retrieve it. What about other data sources, like function generators and a database cursor? We’re going to show you how the Python
csvlibrary can handle data sources other than files both in part two and three.
- The RFC memo requires that all records and the header have the same number of fields. How is such a requirement is enforced by a library? The Python
csvlibrary can perform these checks using the [DictReader class], as we will see in part two.
- All data fields of a spreadsheet or a database table have some type, e.g. numeric, text, date. How can this type of information be retained in the CSV file? The RFC memo makes it clear that fields are plain character strings, so no data type is associated with them. The Python
csvlibrary has limited capabilities. Nevertheless, we’ll provide a solution to such a shortcoming using adapters and converters in part four of this tutorial.
- We said that CSV data may or may not have a header. So how can we tell when there is a header? The RFC memo defines appropriate MIME parameters for that, but the MIME type may not be available to the library that handles the CSV file. Fortunately, the Python
csvlibrary provides a Sniffer class to detect headers from CSV data.
Handling CSV Data with the Python csv Library
In the first section of this tutorial we clarified the meaning of the CSV format. We saw that even the RFC memo leaves some questions unanswered. Now we’ll turn our attention to the Python
csv module. We’ll see how it represents CSV dialects and how it addresses the issues listed above.
As stated previously, there are many dialects of the CSV format. This section summarizes the common differences between different CSV implementation methods:
- Record delimiter: the RFC memo uses
CR LF, other libraries may use
- Field delimiter: the RFC memo uses
COMMA, but many libraries and applications, notably Excel and LibreOffice Calc, allow the user to choose an alternative single-character field delimiter.
- Field escaping: the RFC memo states that only fields containing
DOUBLE QUOTEmust be escaped. Some libraries escape all non-numeric fields, others leave the choice to the user.
- Enforcing record length: the RFC memo requires that all records (including the header) have the same number of fields, i.e. the same length. Some libraries don’t check this requirements.
In the following sections we will learn how the Python
csv library deals with all these differences.
To let its functions and classes handle a variety of CSV dialects, the Python
csv library defines the following set of so-called formatting parameters:
|delimiter||One-character field delimiter.||
|escapechar||Character used to prefix escape sequence. If
|lineterminator||Sequence of characters used to delimit a record.||
|quotechar||Character used as a delimiter for escaped fields.||
|quoting||Controls whether quotes can be generated by the writer and read by the reader. The available options are:
Enter your email address for more free Python tutorials and tips.
As we’ll see in parts two and three, these parameters can be passed as arguments to functions dealing with CSV data, or as attributes of dialect classes. From what we said earlier about the CSV format, the default values for these parameters meet the requirements of the RFC memo, which we’re using as our “standard” for CSV formatting. Moreover, there is no way to toggle the field length check.
Defining a New Dialect
The Python csv library allows you to store all formatting parameters in classes inherited from the
csv.Dialect class. The derived classes don’t need to have all the parameters listed in our formatting table. If you don’t list all the parameters, they’ll simply use their default values. For example, a dialect for the CSV format conforming the RFC memo may be defined as:
def dialect_report(dialect: csv.Dialect): """ Prints out all relevant formatting parameters of a dialect. """ print( "delimiter = %s\n" "doublequote = %s\n" "lineterminator = %s\n" "quotechar = %s\n" "skipinitialspace = %s\n" % ( repr(dialect.delimiter), dialect.doublequote, repr(dialect.lineterminator), dialect.quotechar, dialect.skipinitialspace)) class RFCDialect(Dialect): delimiter = ',' doublequote = True lineterminator = '\r\n' quotechar = '"' quoting = csv.QUOTE_MINIMAL
dialect_report() function just prints out some useful formatting parameters. It’s designed to give you an idea of the Python csv module dialect syntax.
Before being used elsewhere, a name must be assigned to the new dialect using the function
csv.register_dialect(name, dialect, **fmtparams), where
dialect is a class derived form the
csv.Dialect class, and
csv.unregister_dialect(name). Let’s see another example:
class RFCUnixDialect(RFCDialect): """ Like RFCDialect, but uses \\n as a line terminator. """ lineterminator = '\n' def register_rfc(): """ Register the RFC dialect. """ csv.register_dialect('rfc', RFCDialect) print(csv.list_dialects()) def test_register(): """ Test all code about dialect definition. """ register_rfc() csv.register_dialect('rfc', RFCUnixDialect) dialect_report(csv.get_dialect('rfc'))
Here we defined another CSV dialect,
RFCUnixDialect, which is like
RFCDialect, but it uses
\n as the new line delimiter.
register_rfc() registers the
csv.list_dialects() returns the list of all registered dialects.
test_register() shows that if we register another dialect with the same name of a previously registered dialect, the former gets overwritten without any notification. There are some predefined dialects, like
csv.excel. In order to avoid name clashes with predefined dialects, you may use the
csv.list_dialects() as follows:
def is_available_name(name: str) -> bool: """ Return `True` if `name` hasn't been registered yet. """ return not name in csv.list_dialects()
As of Python 3.8, the
csv library defines the following dialects:
In parts two and three of this tutorial, we’re going to show you how to use dialect classes and format parameters for reading and writing CSV data in Python based on a certain dialect.
Detecting a CSV Dialect
csv library features a
Sniffer class, which can detect a dialect suitable for representing certain CSV data. We will use the studio discography of the Dutch symphonic metal band Epica as sample for our dialect detection function:
|The Phantom Agony||Transmission||2003-06-05|
|Consign to Oblivion||Transmission||2005-04-21|
|The Divine Conspiracy||Nuclear Blast||2007-09-07|
|Design Your Universe||Nuclear Blast||2009-10-16|
|Requiem for the Indifferent||Nuclear Blast||2012-03-09|
|The Quantum Enigma||Nuclear Blast||2014-05-02|
|The Holographic Principle||Nuclear Blast||2016-09-30|
If you’d like to follow along, you can download the dialect-sample.csv file directly.
As you can see, the table has a header with 3 fields (Title, Label and Release Date), and 7 records. The
detect_dialect_and_read() function is a wrapper around the
csv.read() function, which we will describe in the second part of our tutorial.
import csv def detect_dialect_and_read(source: str): """ Detect the dialect of a source and read it. """ with open(source, encoding='utf-8') as src: # Guess a suitable dialect by reading at most 1024 byte sample = src.read(1024) dialect = csv.Sniffer().sniff(sample) dialect_report(dialect) print("Has the source a header? %s" % csv.Sniffer().has_header(sample)) # Reset the file cursor to the beginning of the file src.seek(0) # Create a new reader object using the CSV dialect for row in csv.reader(src, dialect): print(row) def test_detect(): """ Tests all code about dialect detection. """ detect_dialect_and_read('dialect-sample.csv')
Enter your email address for more free Python tutorials and tips.
open() creates a new context manager
dialect-sample.csv has been encoded using UTF-8, so we must remember to set the
encoding parameter accordingly. After we do this, we use the
sniff() method to guess the CSV dialect from the first 1024 bytes (at most) read from the source.
It’s worth noting that the suggested algorithm for determining the encoding of an HTML page also uses at most 1024 bytes from the start of the HTML page to guess its encoding.
Sniffer can’t find a suitable dialect, it raises an exception. Finally, the dialect is passed to the
csv.read() function to create an iterator object, from which all CSV data can be read one record at a time. This object has a
dialect attribute representing the CSV dialect in use.
Sniffer can also tackle the problem of determining whether a CSV file has a header by using its
has_header() method on a sample data from the file. It returns
False if no header was found in the sample,
True otherwise. In our example, it returns
Now that you understand the nuances of working with CSV files in Python, you’re ready to use the Python csv module to read and write CSV files. Our next tutorial will show you how. All sample files referenced in our tutorial series on the Python csv module can be extracted from this .zip file.
How to Read Syntax Diagrams
Since syntax diagrams are new to this blog, we’re going to spend a few minutes describing what they are and how they’re read.
Syntax diagrams are a graphical way of representing the grammar of a language. Each diagram stands for a rule, and it is labeled with a name, by which it can be referred from other diagrams. Syntax diagrams are built up from a few graphic primitives:
- A rectangle with a label referring to another syntax diagram (the so-called non-terminal symbol).
- A rounded rectangle containing a sequence of one or more characters (the so-called terminal-symbols).
- Lines connecting various parts of the diagram.
- Arrows marking the only direction in which a line can be traversed.
Syntax diagrams should be read as follows:
- Each diagram has a rule representing the whole language (a so-called start symbol); e.g. in the syntax diagrams below,
fileis the start symbol.
- A diagram should be read from left to right, following the arrows.
- When a line splits into two or more lines, you can follow only one line at a time. This case roughly equals to selection statements in programming languages.
- Sometimes an arrow goes back to a point of the diagram which has already been traversed. This case roughly equals to iterative statements in programming languages.
- All lines of the diagram will eventually be merged into a single line at the exiting point of the rule.
Let’s see an example. These are the syntax diagrams for a number’s exponent in the JSON data format (the start symbol is
The following diagrams shows how to build the exponent
e+18 following the syntax diagrams. The path that we follow at each step is marked by a thick black line. You can pretend that you are following that path with your fingertip.
Syntax Diagrams for the RFC Specification
The following syntax diagrams sum up the grammar of the CSV data format, as defined by RFC 4180:2005. For reasons discussed previously, we chose not to tie these syntax diagrams to a specific charset, as the RFC memo does. So be aware that:
NLsymbol stands for the implementation-dependent newline sequence.
DQUOTE(double quote) symbols stand for their respective encoding-dependent sequence of bytes. For example in the ASCII charset these symbols would have been encoded, respectively, as
textdatarule stands for all graphic characters of the charset other than
namerule equals the
Enter your email address for more free Python tutorials and tips.
In this tutorial we discussed the CSV data format, and we came to terms with its dialects. Then we learned how the Python
csv module handles all these flavors of CSV using formatting parameters and the
Dialect class. We even built our own dialect conforming to the RFC specification.
Subscribe to our Python Tutorials Blog, so you won’t miss the next part of our CSV series, where we discuss reading CSV files. Once you subscribe, share this article with your friends on Facebook and Twitter! When you spread the word on social media, you’re helping us grow so we can continue to provide free tutorials like this one for years to come.
This section features normative references for the CSV format and for the csv library.
- Y. Shafranovich, Common Format and MIME Type for Comma-Separated Values (CSV) Files, RFC 4180:2005
- Kevin Altis, Dave Cole, Andrew McNamara, Skip Montano, Cliff Wells, CSV File API, PEP 0305:2003