4.1. Working with Files
A very common programming pattern proceeds as follows:
1. Read the contents of one or more files from disk and load the data into one or more data structures.
2. Manipulate the data in some way.
3. (Optional) Write the resulting data back to disk.
Programmers often connect multiple programs written using this pattern into data-processing pipelines.
In this chapter, we’ll discuss how to work with several different file formats. We’ll start with the most basic mechanisms and work our way up to higher-level tools and more complex formats.
4.1.1. Basic file I/O
Reading file input and writing file output is often referred to as
I/O, or input/output. We’ll start our discussion of basic file I/O
with a simple example: load the contents of the text file
instructor-email.txt
into a sorted list. Here are the contents of
the file:
amr@cs.uchicago.edu
borja@cs.uchicago.edu
yanjingl@cs.uchicago.edu
mwachs@cs.uchicago.edu
dupont@cs.uchicago.edu
and here’s the desired result:
["amr@cs.uchicago.edu",
"borja@cs.uchicago.edu",
"dupont@cs.uchicago.edu",
"mwachs@cs.uchicago.edu",
"yanjingl@cs.uchicago.edu"]
To access the contents of a file, we first need to open it using the
built-in function open
:
>>> f = open("instructor-email.txt")
The open
function returns a data structure, known as a file
pointer, that we can use to work with the contents of the file. It is
common to use the word file to refer to both a file on disk and a file
pointer.
Once we have a file pointer, we can perform a number of
operations on it. In a sense, text files are simply very long strings
that are stored on disk. So, for example, we can read the entire
contents of the file into a string in a single operation using the
read
method:
>>> s = f.read()
>>> s
'amr@cs.uchicago.edu\nborja@cs.uchicago.edu\nyanjingl@cs.uchicago.edu\nmwachs@cs.uchicago.edu\ndupont@cs.uchicago.edu\n'
The characters \n
are what is known as an escape sequence, which
in this case encodes the newline character. Notice that if we print
the value of s
, all of the occurrences of the \n
escape sequence are rendered as newlines, as expected:
>>> print(s)
amr@cs.uchicago.edu
borja@cs.uchicago.edu
yanjingl@cs.uchicago.edu
mwachs@cs.uchicago.edu
dupont@cs.uchicago.edu
We can convert the string into a list of email addresses using the
string split
method, which, by default, breaks the string into
tokens using whitespace (that is, spaces, tabs, newlines, etc.) as
the delimiter.
>>> s.split()
['amr@cs.uchicago.edu', 'borja@cs.uchicago.edu', 'yanjingl@cs.uchicago.edu', 'mwachs@cs.uchicago.edu', 'dupont@cs.uchicago.edu']
Finally, we can call the built-in function sorted
to get the
desired result, a sorted list of email addresses:
>>> email_addresses = sorted(s.split())
When reading from a file, the operating system keeps track of the current
position in the file. In this case, the file pointer has already
reached the end of the file (EOF), so if we call read
again, we
don’t get the contents of the file; we get an empty string instead:
>>> data = f.read()
>>> data
''
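If we want to read the contents again without reopening the file, we can move this position back to the beginning using the seek method. Here is a quick sketch (just an aside; we won’t need seek in the rest of this chapter):
>>> f.seek(0)     # move the position back to the start of the file
0
>>> f.read()
'amr@cs.uchicago.edu\nborja@cs.uchicago.edu\nyanjingl@cs.uchicago.edu\nmwachs@cs.uchicago.edu\ndupont@cs.uchicago.edu\n'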
Once we’re done working with a file, we need to close the file pointer:
>>> f.close()
Following the close, you can no longer use that file pointer to access the file. (You’d have to reopen the file to use it again.) It is important to close files to free the associated resources and, as we’ll see later on when writing to files, to ensure that all of your updates to the file are written to disk.
It is easy to forget to close a file, so it is common to use a
with
statement, which guarantees that the file will be closed no
matter what happens in the body of the statement.
>>> with open("instructor-email.txt") as f:
...     s = f.read()
...     email_addresses = sorted(s.split())
...
>>> print(email_addresses)
['amr@cs.uchicago.edu', 'borja@cs.uchicago.edu', 'dupont@cs.uchicago.edu', 'mwachs@cs.uchicago.edu', 'yanjingl@cs.uchicago.edu']
The with
statement introduces a new name, in this case, f
,
that refers to the file pointer returned by the call to open
. At
the end of the with
block, file f
is closed automatically.
Instead of reading the file in one chunk, we can also read it line by
line. One approach is to use a for loop that iterates over a text
file line by line. For example, here’s some code that reads and
prints each line in the instructor-email.txt
file:
>>> with open("instructor-email.txt") as f:
...     for line in f:
...         print(line)
...
amr@cs.uchicago.edu

borja@cs.uchicago.edu

yanjingl@cs.uchicago.edu

mwachs@cs.uchicago.edu

dupont@cs.uchicago.edu

This result may look a bit funny to you. Why the extra empty lines?
Each line from the file includes a newline at the end, and
print
adds a newline as well. We can see the actual representation
of each line using the built-in repr
function:
>>> with open("instructor-email.txt") as f:
...     for line in f:
...         print(repr(line))
...
'amr@cs.uchicago.edu\n'
'borja@cs.uchicago.edu\n'
'yanjingl@cs.uchicago.edu\n'
'mwachs@cs.uchicago.edu\n'
'dupont@cs.uchicago.edu\n'
When reading lines from a file, we can use the string strip
method to remove leading and trailing whitespace (including the
newline):
>>> with open("instructor-email.txt") as f:
...     for line in f:
...         print(line.strip())
...
amr@cs.uchicago.edu
borja@cs.uchicago.edu
yanjingl@cs.uchicago.edu
mwachs@cs.uchicago.edu
dupont@cs.uchicago.edu
To accomplish our goal of creating a sorted list of email addresses,
we can combine a familiar pattern for constructing lists with a use of
with
and a call to the list sort
method.
>>> email_addresses = []
>>> with open("instructor-email.txt") as f:
... for line in f:
... email = line.strip()
... email_addresses.append(email)
...
...
>>> email_addresses.sort()
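File pointers also provide a readline method, which reads a single line, and a readlines method, which reads all of the remaining lines into a list. (We will use readline later in this chapter to skip the header row of a CSV file.) Here is a brief sketch using the same instructor-email.txt file:
>>> with open("instructor-email.txt") as f:
...     first = f.readline()     # read just the first line
...     rest = f.readlines()     # read the remaining lines into a list
...
>>> first
'amr@cs.uchicago.edu\n'
>>> len(rest)
4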
4.1.2. Writing data to a file
To write to a file, we must open the file in write mode (note the use of
"w"
as a second parameter to open
to specify that we’re opening
the file in write mode):
open("names2.txt", "w")
It is very important to understand that when you open an existing file in write mode, all of its existing contents will be wiped away! If you open a file that doesn’t already exist in write mode, a new file will be created.
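As an aside, if you want to add data to the end of an existing file without erasing its contents, you can open the file in append mode instead by passing "a" as the second argument:
open("names.txt", "a")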
Once we have a writable file pointer, we can append a string to the file using
write
. For example, after we run this code:
with open("names.txt", "w") as f:
f.write("Anne Rogers\n")
f.write("Borja Sotomayor\n")
f.write("Yanjing Li\n")
f.write("Matthew Wachs\n")
f.write("Todd Dupont\n")
The file names.txt
will contain:
Anne Rogers
Borja Sotomayor
Yanjing Li
Matthew Wachs
Todd Dupont
We could also use the print
function to generate this output, which
has the advantage that it adds the newline automatically. The
file
keyword parameter allows us to specify a file pointer as the
destination of a call to print
.
>>> with open("names2.txt", "w") as f:
... print("Anne Rogers", file=f)
... print("Borja Sotomayor", file=f)
... print("Yanjing Li", file=f)
... print("Matthew Wachs", file=f)
... print("Todd Dupont", file=f)
...
Internally, writes to files are often stored in a buffer and then written out to disk in batches. When you close a file, you flush any buffered data to disk. If you do not close your file, the data from your last few writes may remain in the buffer and never make it to disk.
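If you need buffered data to reach the disk before the file is closed (in a long-running program, for example), you can flush the buffer explicitly with the flush method. A minimal sketch (the file name log.txt is just a made-up example):
with open("log.txt", "w") as f:
    f.write("checkpoint 1\n")
    f.flush()    # push any buffered data out to the operating system now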
Let’s put these pieces together to write a function that transforms a file
with a list of email addresses into a new file with the domain name
(that is, @cs.uchicago.edu
) stripped off.
>>> def strip_domain(input_filename, output_filename):
...     '''
...
...     Strip the domain names off the email addresses from the input
...     file and write the resulting usernames to the output file.
...
...     Inputs:
...       input_filename: (string) name of a file with email addresses
...       output_filename: (string) name for the output file.
...     '''
...
...     # Load data into a data structure (a list of strings)
...     email_addresses = []
...     with open(input_filename) as f:
...         for line in f:
...             email = line.strip()
...             email_addresses.append(email)
...
...     # Transform the data
...     usernames = []
...     for email in email_addresses:
...         username, domain = email.split("@")
...         usernames.append(username)
...
...     # Write the data
...     with open(output_filename, "w") as f:
...         for username in usernames:
...             print(username, file=f)
...
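For example, assuming the instructor-email.txt file from earlier is in the current directory, we could call the function like this (usernames.txt is just a name we chose for the output file):
>>> strip_domain("instructor-email.txt", "usernames.txt")
which would produce a usernames.txt file containing:
amr
borja
yanjingl
mwachs
dupont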
In this case, the operation is simple enough that we could’ve produced the usernames during the input loop or during the output loop, but in general, it’s good to separate input, transformation, and output into separate steps.
Python’s basic file I/O functionality is sufficient for working with simple files, but sometimes we need to work with more complex data files. In the next few sections, we’ll introduce a few existing data formats and the libraries that make it easy to work with them. In general, it is better to use an existing format and its associated libraries, if you can, than to invent your own ad hoc format.
4.1.3. CSV
The acronym CSV stands for Comma Separated Values. CSV files contain values separated by commas (and sometimes by other delimiters) and are typically used to represent tabular data, that is, any data that can be organized into rows, each with the same columns (or fields).
Here are the contents of a CSV file named instructors.csv
:
id,lname,fname,email
amr,Rogers,Anne,amr@cs.uchicago.edu
borja,Sotomayor,Borja,borja@cs.uchicago.edu
yanjingl,Li,Yanjing,yanjingl@cs.uchicago.edu
mwachs,Wachs,Matthew,mwachs@cs.uchicago.edu
dupont,Dupont,Todd,dupont@cs.uchicago.edu
The first line is the header row; it includes the names of the
columns/fields (id
, lname
, fname
, and email
). The
remaining lines contain the data.
With what we’ve seen so far, we could just use the existing file functions to read this file and generate some simple output:
>>> with open("instructors.csv") as f:
...     header = f.readline()   # Skip the header row
...
...     for row in f:
...         fields = row.strip().split(",")
...
...         id = fields[0]
...         last_name = fields[1]
...         first_name = fields[2]
...         email = fields[3]
...
...         print("{} {}'s e-mail is {}".format(first_name, last_name, email))
...
Anne Rogers's e-mail is amr@cs.uchicago.edu
Borja Sotomayor's e-mail is borja@cs.uchicago.edu
Yanjing Li's e-mail is yanjingl@cs.uchicago.edu
Matthew Wachs's e-mail is mwachs@cs.uchicago.edu
Todd Dupont's e-mail is dupont@cs.uchicago.edu
We could similarly use the existing file functions to write a CSV file.
This process, however, can be very error-prone, and it doesn’t account
for a number of peculiarities in the CSV file format. For example, what if a
column holds a string value with a comma in it? Such a value would be
represented in double quotes, like "Hello, world!"
, but the code
above would split it incorrectly. To make this concrete, let’s look at what would
happen if we added this line to the file:
spade,"Spade, Jr",Sam,spade@cs.uchicago.edu
We’d get:
Jr" "Spade's e-mail is Sam
as the output for this line, because the split
method is not
designed to handle commas embedded in quoted sub-strings.
Fortunately, Python includes a csv
module that allows us to work
with CSV files more naturally.
>>> import csv
The DictReader
class allows us to iterate over the rows in a CSV
file and access each field in a row via a dictionary (keyed by the
field names given in the header row, or by the optional
fieldnames
parameter).
>>> with open("instructors.csv") as f:
...     reader = csv.DictReader(f)
...
...     for row in reader:
...         print("{} {}'s e-mail is {}".format(row["fname"], row["lname"], row["email"]))
...
Anne Rogers's e-mail is amr@cs.uchicago.edu
Borja Sotomayor's e-mail is borja@cs.uchicago.edu
Yanjing Li's e-mail is yanjingl@cs.uchicago.edu
Matthew Wachs's e-mail is mwachs@cs.uchicago.edu
Todd Dupont's e-mail is dupont@cs.uchicago.edu
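DictReader also copes with the quoted-comma case that tripped up our hand-rolled parser. Here is a quick sketch that uses io.StringIO to stand in for a file containing the problematic row:
>>> import io
>>> data = 'id,lname,fname,email\nspade,"Spade, Jr",Sam,spade@cs.uchicago.edu\n'
>>> for row in csv.DictReader(io.StringIO(data)):
...     print("{} {}'s e-mail is {}".format(row["fname"], row["lname"], row["email"]))
...
Sam Spade, Jr's e-mail is spade@cs.uchicago.edu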
We can use a similar class, DictWriter
, to write a CSV file.
fieldnames = ["id", "lname", "fname", "email"]
with open("instructors-122.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    row = {"id": "amr",
           "lname": "Rogers",
           "fname": "Anne",
           "email": "amr@cs.uchicago.edu"}
    writer.writerow(row)
    row = {"id": "mwachs",
           "lname": "Wachs",
           "fname": "Matthew",
           "email": "mwachs@cs.uchicago.edu"}
    writer.writerow(row)
The csv
module also has reader
and writer
classes that use
lists to represent rows rather than dictionaries. This allows us to
interact with the fields positionally instead of by name.
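For example, here is a sketch that uses csv.reader to re-read instructors.csv; each row comes back as a list of strings, so we pick out fields by position:
>>> with open("instructors.csv") as f:
...     reader = csv.reader(f)
...     header = next(reader)     # the first row holds the column names
...     for fields in reader:
...         print(fields[0], fields[3])
...
amr amr@cs.uchicago.edu
borja borja@cs.uchicago.edu
yanjingl yanjingl@cs.uchicago.edu
mwachs mwachs@cs.uchicago.edu
dupont dupont@cs.uchicago.edu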
4.1.4. JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format that web services commonly use. It is also used as a data storage format. JSON supports a few different types:
Object: looks like a Python dictionary, with string-value pairs (string: value) separated by commas,
Array: an empty list or a list of values,
Value: a string, number, object, array, true, false, or null
Note the nesting of object and array in the definition of value.
An application might receive a string in JSON format from another
application, like a web service, or as a file. Python has a json
module for handling JSON data in either form. We’ll start by looking
at the functions that operate on strings, but first, we need to import
the library.
>>> import json
The dumps
function takes a Python data structure and encodes it in
JSON format:
>>> l = ['baz', None, 1.0, 2]
>>> json.dumps(l)
'["baz", null, 1.0, 2]'
Notice that the result is a string representation of the data, with some minor differences from Python (e.g., “null” instead of “None”).
JSON allows for nested data structures, and so the
dumps
function handles them as well:
>>> data = [ 'foo', {'bar': l} ]
>>> json.dumps(data)
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
The loads
function takes a string containing data encoded
in JSON format and decodes it, yielding the corresponding Python data
structures:
>>> json.loads("42")
42
>>> json.loads("[1,2,3]")
[1, 2, 3]
>>> json.loads('["foo", {"bar": ["baz", null, 1.0, 2]}]')
['foo', {'bar': ['baz', None, 1.0, 2]}]
The load
and dump
functions perform analogous operations, but
on files. The dump
function takes two arguments—an appropriate
data structure and a file pointer—and writes a JSON encoding of the
data structure to the file. Here’s a sample call:
>>> with open("saved_data.json", "w") as f:
...     json.dump(data, f)
...
that yields a file named saved_data.json
with the following contents:
["foo", {"bar": ["baz", null, 1.0, 2]}]
For a large and complex data structure, the standard encoding can be
hard to read. You can use the optional indent
parameter, which
allows you to specify the number of spaces to indent for each level of
nesting, and the sort_keys
parameter to generate output that is
easier to read. This code, for example,
>>> with open("saved_data.json", "w") as f:
...     json.dump(data, f, indent=4, sort_keys=True)
...
yields a version of saved_data.json
with the following contents:
[
    "foo",
    {
        "bar": [
            "baz",
            null,
            1.0,
            2
        ]
    }
]
The load
function takes a file pointer, reads the data from the
file, and returns the decoded data structure:
>>> saved_data = None
>>> with open("saved_data.json") as f:
...     saved_data = json.load(f)
...
>>> saved_data
['foo', {'bar': ['baz', None, 1.0, 2]}]
4.1.5. YAML
YAML (YAML Ain’t Markup Language) is, like JSON, a lightweight data format that is intended to be both machine-readable and human-readable. It uses indentation rather than braces and brackets to represent nesting. We won’t describe this format in detail, other than to point out that these files are easy to read and to process automatically.
YAML is often used for configuration files, as well as files that
need to be processed by a program, but also need to be readable
by a non-technical user. For example, the following rubric.yml
file could
be used to provide the results of grading a programming assignment:
Points:
    Tests:
        Points Possible: 50
        Points Obtained: 45
    Implementing foo():
        Points Possible: 20
        Points Obtained: 10
    Implementing bar():
        Points Possible: 20
        Points Obtained: 20
    Code Style:
        Points Possible: 10
        Points Obtained: 7.5
Penalties:
    Code comments are written in Old English: -5
Bonuses:
    Worked alone: 10
Total Points: 87.5 / 100
Comments: >
    Well done!
As with any library, we need to import the yaml
library before we
can use it. (Unlike csv and json, yaml is not part of Python’s standard
library; it comes from the third-party PyYAML package, which may need
to be installed first.)
>>> import yaml
We can load a YAML file by passing a file pointer to yaml.safe_load
(the loader recommended for plain data files):
>>> with open("rubric.yml") as f:
...     rubric = yaml.safe_load(f)
...
The result will be a dictionary:
>>> rubric
{'Points': {'Tests': {'Points Possible': 50, 'Points Obtained': 45}, 'Implementing foo()': {'Points Possible': 20, 'Points Obtained': 10}, 'Implementing bar()': {'Points Possible': 20, 'Points Obtained': 20}, 'Code Style': {'Points Possible': 10, 'Points Obtained': 7.5}}, 'Penalties': {'Code comments are written in Old English': -5}, 'Bonuses': {'Worked alone': 10}, 'Total Points': '87.5 / 100', 'Comments': 'Well done!\n'}
Notice that the nesting of the dictionary reflects the nesting of the indentation above.
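For example, we can reach a nested value by indexing level by level:
>>> rubric["Points"]["Tests"]["Points Obtained"]
45
>>> rubric["Total Points"]
'87.5 / 100'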