2.2. Dictionaries and Sets
This chapter introduces two new structures: dictionaries and sets.
2.2.1. Introducing Dictionaries
In the previous chapter, we saw that lists and tuples allow us to store multiple values in a single data structure, but they do so in a very specific way: the values are stored in a sequence. Python makes it very easy to iterate over that sequence, as well as access items at specific positions of the sequence. This design makes lists, tuples, etc great data structures for some use cases but not for others.
For example, suppose we were working with data on contributions to political campaigns, with the following information for each contribution:
First name of contributor
Last name of contributor
ZIP Code where contributor lives
Campaign that received the contribution
Contribution amount
We could represent an individual contribution like this:
c = ("John", "Doe", "60637", "Kang for President 2016", 27.50)
And a list of contributions like this:
contributions = [
("John", "Doe", "60637", "Kang for President 2016", 27.50),
("Jane", "Doe", "60637", "Kodos for President 2016", 100.00),
("James", "Roe", "07974", "Kang for President 2016", 50.00)
]
Accessing the individual values inside a contribution requires using the values’ positions in the tuple, which can lead to code that is difficult to read. For example, suppose we wanted to write a function that adds up all the contributions made to a given campaign. The function would look something like this:
def total_contributions_candidate(contributions, campaign):
"""
Compute total contributions to a candidate's campaign
Args:
contributions (List[Tuple(str, str, str, str, int)]]): one list per contribution
campaign (str): the name of the campaign
Returns (float): total amount contributed
"""
total = 0
for contribution in contributions:
if campaign == contribution[3]:
total += contribution[4]
return total
>>> total_contributions_candidate(contributions, "Kang for President 2016")
77.5
>>> total_contributions_candidate(contributions, "Kodos for President 2016")
100.0
Using contribution[3]
to access the name of the campaign and
contribution[4]
to access the contribution amount makes the code
hard to read. There are ways to make it a bit more readable, such as
defining variables to store the position of each value (e.g., we could
define CAMPAIGN_INDEX = 3
and then access the campaign by writing
contribution[CAMPAIGN_INDEX]
), but this approach is error-prone. We
could easily use the wrong value for the constant or the format of the data might
change over time. In the latter case, we would need to check the
fields we are using carefully to ensure that we are still accessing
the correct positions.
For this application, it would be better to use a data structure called a dictionary. Like lists, dictionaries can be used to store multiple values, but do so by associating each value with a unique key rather than with a position in a sequence. This arrangement is similar to a physical dictionary: the words are the “keys” and the definitions are the “values”.
We could store an individual contribution in a dictionary like this:
d = {"first_name": "John",
"last_name": "Doe",
"zip_code": "60637",
"campaign": "Kang for President 2016",
"amount": 27.50}
Notice that we have a sequence of entries separated by commas where
each entry is a key-value pair separated by a colon. Instead of
accessing values by their positions in the data structure, we access
them by their keys. For example, if we wanted to access the ZIP Code
of this contribution, we would use the key "zip_code"
:
>>> d["zip_code"]
'60637'
Because values are accessed by key rather than by position, dictionaries are also referred to in other programming languages as “associative arrays” or “maps” (in the sense that they associate or map a key to a value).
In this example, all of the keys are strings. Although this design is fairly common, it is not required. In fact, many different Python types can be used for keys in a dictionary. For now, we’ll restrict ourselves to types with values that cannot be changed (i.e. immutable types), which includes strings, integers, booleans, tuples of strings, etc. We’ll come back to the question of acceptable types for dictionary keys briefly after we introduce Python classes.
While the types used for dictionary keys are restricted, the values
can have any type and different keys can have values with different
types. Notice, for example, that not all of the keys in our example
have values with the same type: the value for "amount"
is a float,
while the rest are strings.
Our list of contributions would now look like this:
contributions = [
{"first_name": "John",
"last_name": "Doe",
"zip_code": "60637",
"campaign": "Kang for President 2016",
"amount": 27.50},
{"first_name": "Jane",
"last_name": "Doe",
"zip_code": "60637",
"campaign": "Kodos for President 2016",
"amount": 100.00},
{"first_name": "James",
"last_name": "Roe",
"zip_code": "07974",
"campaign": "Kang for President 2016",
"amount": 50.00}
]
And our function to compute the total contributions now looks like this:
def total_contributions_candidate(contributions, campaign):
"""
Compute total contributions to a candidate's campaign
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
campaign (string): the name of the campaign
Returns (float): total amount contributed
"""
total = 0
for contribution in contributions:
if campaign == contribution["campaign"]:
total += contribution["amount"]
return total
>>> total_contributions_candidate(contributions, "Kang for President 2016")
77.5
>>> total_contributions_candidate(contributions, "Kodos for President 2016")
100.0
The new function looks very similar to the original, but it is easier to read and maintain.
Dictionaries do more than simply provide convenient syntax for accessing values by key instead of position. As we’ll see later in the chapter, dictionaries can also perform certain operations much more efficiently than lists and allow us to implement certain algorithms more elegantly.
It is very common to use a dictionary to represent a single “object” (in this case, a single contribution to a campaign, with each attribute of the contribution represented by a key in the dictionary), especially when exchanging data between different systems. In fact, later in the book, we will see a data format, JSON, that works well with Python dictionaries and lists. Later on, we will also see how object-oriented programming provides a third way of working with “objects” in our programs.
2.2.1.1. Practice Problems
The practice problems in this chapter use a data set that contains customer complaints. Each complaint includes:
the name of the company,
the date the complaint was received,
a unique ID,
the issue,
the product,
the consumer’s complaint narrative,
the company’s public response, and
the consumer’s home state.
This information will be stored in a dictionary where both the keys
and the values are strings (i.e., the type of the dictionaries will be
Dict[str, str]
) . Here’s an example complaint:
complaint_as_dict =
{'Company': "Smith's Forge",
'Date received': '07/29/2013',
'Complaint ID': '468882',
'Issue': 'Managing the loan or lease',
'Product': 'Consumer Loan',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed with explanation',
'State': 'VA'}
Problem 1
Write a function, count_complaints_about
, that takes a list of
complaints and the name of a company and returns the number of
complaints in the list for that company.
What might make good test cases for this function?
A possible pitfall: floats as keys
Before we move on talking about useful dictionary operations, we’d like to stop and consider a simple example an example that may surprise you:
>>> d = {0.3: "found"}
>>> d.get(0.1 + 0.1 + 0.1, "not found")
'not found'
Using your understanding of real numbers, you might expect get
to return "found"
, the value associated with 0.3
. Using
floats, however, 0.1 + 0.1 + 0.1
does not equal 0.3
and so,
get
does not find the key in the dictionary and returns the
default value "not_found"
.
So, while you technically use floats as keys for a dictionary, you should only do so when you are using values that can be represented exactly with a power of two or as the sum of powers of two.
2.2.2. Useful operations with dictionaries
Earlier, we saw that we can define a dictionary like this:
d = {"first_name": "John",
"last_name": "Doe",
"zip_code": "60637",
"campaign": "Kang for President 2016",
"amount": 27.50}
And access individual values by their keys like this:
>>> d["zip_code"]
'60637'
Note that when we attempt to access values in the dictionary, specifying a key that doesn’t exist in the dictionary produces an error:
>>> d["affiliation"]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'affiliation'
In addition to the square-bracket notation, we can access a value
using the get
method, which method returns None
if the
dictionary does not contain the specified key.
>>> st = d.get("first_name")
>>> print(st)
John
>>> st = d.get("affiliation")
>>> print(st)
None
The get
method also takes an optional second parameter to specify a default value that
should be returned if the key doesn’t exist in the dictionary:
>>> d.get("affiliation", "Unknown")
'Unknown'
We can check if a dictionary contains a given key using the in
operator:
>>> "first_name" in d
True
>>> "affiliation" in d
False
Dictionaries are also mutable: we can update the value associated with a key in a dictionary and add new key-value pairs. We assign a new value to a key like this:
>>> d["zip_code"] = "94305"
>>> d["zip_code"]
'94305'
and add a new key-value pair using the same syntax:
>>> d["affiliation"] = "Kodosican"
>>> d
{'first_name': 'John', 'last_name': 'Doe', 'zip_code': '94305', 'campaign': 'Kang for President 2016', 'amount': 27.5, 'affiliation': 'Kodosican'}
We can also create a dictionary by starting from an empty (or partial) dictionary and assigning values:
d = {}
d["first_name"] = "John"
d["last_name"] = "Doe"
d["zip_code"] = "60637"
d["campaign"] = "Kang for President 2016"
d["amount"] = 27.50
d["affiliation"] = "Kodosican"
>>> d
{'first_name': 'John', 'last_name': 'Doe', 'zip_code': '60637', 'campaign': 'Kang for President 2016', 'amount': 27.5, 'affiliation': 'Kodosican'}
We can remove an entry in the dictionary using the del
operator:
>>> "affiliation" in d
True
>>> del d["affiliation"]
>>> "affiliation" in d
False
We can iterate over the keys in a dictionary using the dictionary’s
keys
method:
>>> for key in d.keys():
... print(key)
...
first_name
last_name
zip_code
campaign
amount
This operation is sufficiently common that Python provides a shorthand: iterating over the dictionary itself is equivalent to iterating over the keys. That is, when you iterate over a dictionary, the loop variable will iterate over the keys of the dictionary:
>>> for key in d:
... print(key)
...
first_name
last_name
zip_code
campaign
amount
We can also iterate over the values using the dictionary’s values
method:
>>> for value in d.values():
... print(value)
...
John
Doe
60637
Kang for President 2016
27.5
And, finally, we can iterate over both the keys and the values using
the dictionary’s items
method:
>>> for key, value in d.items():
... print(key, value)
...
first_name John
last_name Doe
zip_code 60637
campaign Kang for President 2016
amount 27.5
The keys and values are not printed in any particular order. Most
notably, the keys are not shown in alphabetical order. (They may
happen to be shown in the order that they are added to the dictionary,
but that is not guaranteed.) This behavior is another big difference
between dictionaries and lists: lists store values in a specific
order, and iterating over a list will always yield those values in
that order. There is no such guarantee with dictionaries: if we
iterate over the contents of a dictionary, we cannot assume any
specific order and, in fact, that order can even change from one
for
loop to another!
A common pitfall: changing the set of keys in a dictionary as you iterate over it
There is an algorithm that allows us to compute the K most frequently seen items in a stream of data. We are not going to present the whole algorithm here. Instead, we will focus on a specific task within that algorithm that is used to exclude keys that have not been seen recently: take a dictionary that maps strings to positive integers, decrement all the values by one and remove any key-value pair where the resulting value is zero.
You might be tempted to write the following incorrect function to do this task:
def decr_and_remove(d):
"""
Given a dictionary that maps keys to positive integers,
decrement the values and remove any key-value pair in which
the decremented value becomes zero
Modifies the dictionary in-place.
Args:
d (Dict[str, int]): maps keys to positive integers
Returns: None
"""
for key, value in d.items():
d[key] = value - 1
if d[key] == 0:
del d[key]
This code is not valid because you cannot modify a data structure as you iterate over it! Here’s one alternative solution:
def decr_and_remove(d):
"""
Given a dictionary that maps keys to positive integers,
decrement the values and remove any key-value pair in which
the decremented value becomes zero
Modifies the dictionary in-place.
Args:
d (Dict[str, int]): maps keys to positive integers
Returns: None
"""
keys_to_remove = []
for key, value in d.items():
d[key] = value - 1
if d[key] == 0:
keys_to_remove.append(key)
for key in keys_to_remove:
del d[key]
Here’s another that computes and returns a new dictionary with the desired keys rather than removing the keys from the input dictionary.
def decr_and_keep_pos(d):
"""
Given a dictionary that maps keys to positive integers,
subtract one from each value and include in the result only
those key-value pairs that still have positive values after
the decrement.
Args:
d (Dict[str, int]): keys to positive integers
Returns (Dict[str, int]): a new dictionary that maps a key to
an integer value, where the keys are the keys from the
original dictionary that had values greater than one and
the original values have been decremented by one.
"""
rv = {}
for key, value in d.items():
if value > 1:
rv[key] = value - 1
return rv
2.2.3. Other ways to construct dictionaries
As we have seen, we can construct dictionaries using dictionary
literals or updating an empty dictionary with
new key-value pairs. We can also construct dictionaries use the
dict
function and dictionary comprehensions.
The dict
function allows us to construct a new dictionary from a
variety of different types of values including a list of key-value
pairs or another dictionary. For example, here are three different
ways to construct the same dictionary: the first uses a dictionary
literal, the second uses a call to dict
with a list of pairs, and
the third uses a call to dict
with a dictionary as the argument.
d = {"first_name": "John",
"last_name": "Doe",
"zip_code": "60637",
"campaign": "Kang for President 2016",
"amount": 27.50}
keys_and_data = [("first_name", "John"),
("last_name", "Doe"),
("zip_code", "60637"),
("campaign", "Kang for President 2016"),
("amount", 27.50)]
d_from_list = dict(keys_and_data)
d_from_dict = dict(d)
Dictionary comprehensions are similar to list comprehensions. In place of one transformation expression, dictionary comprehensions require two: one that yields a key and another that yields a value. For example, here’s another way to construct our sample dictionary:
>>> d = {key: value for key, value in keys_and_data}
This example uses very simple expressions for the key and the value,
but in general, you can use arbitrarily complex expressions for either.
You can also specify a boolean expression to use as a filter. For
example, we could rewrite the decr_and_keep_pos
function from the
previous section using a dictionary comprehension with a filter:
def decr_and_keep_pos(d):
"""
Given a dictionary that maps keys to positive integers,
subtract one from each value and include in the result only
those key-value pairs that still have positive values after
the decrement.
Args:
d (Dict[str, int]): keys to positive integers
Returns (Dict[str, int]): a new dictionary that maps a key to
an integer value, where the keys are the keys from the
original dictionary that had values greater than one.
The values have been decremented by one.
"""
return {key: value - 1 for key, value in d.items() if value > 1}
A key-value pair from the input dictionary will be used in the construction of the result only if the original value is greater than one.
2.2.4. Accumulating values with dictionaries
Dictionaries are often used to accumulate a value based on a key. For example, earlier we computed the total number of contributions for a specific campaign. What if we wanted to compute the total contributions for each campaign? We could call our previous function over and over again, once per campaign, but there is a better approach. A dictionary that maps a campaign’s name to the total contributions to that campaign is a perfect tool for this task. We start with an empty dictionary and add new entries as we encounter new campaigns:
def total_contributions_by_campaign(contributions):
"""
Compute total contributions by campaign
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
Returns (Dict[str, float]): a dictionary that maps a campaign name to the
total contributions to the campaign.
"""
rv = {}
for contribution in contributions:
campaign = contribution["campaign"]
if campaign not in rv:
rv[campaign] = 0
rv[campaign] += contribution["amount"]
return rv
Here’s a sample run of this function using the sample contributions list defined earlier:
>>> total_contributions_by_campaign(contributions)
{'Kang for President 2016': 77.5, 'Kodos for President 2016': 100.0}
Notice that the result is a dictionary.
This implementation uses the not in
operator to identify campaigns
that are being seen for the first time. An alternative is to use
get
with a default value of zero for previously unseen campaigns:
def total_contributions_by_campaign(contributions):
"""
Compute total contributions by campaign
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
Returns (Dict[str, float]): a dictionary that maps a campaign name to the
total contributions to the campaign.
"""
rv = {}
for contribution in contributions:
campaign = contribution["campaign"]
rv[campaign] = rv.get(campaign, 0) + contribution["amount"]
return rv
Accumulating values in this way is very common.
2.2.4.1. Practice Problems
Problem 2
Write a function, count_complaints_by_state
, that take a list of
complaint dictionaries and computes a dictionary that maps each state
to the count the number of complaints received for that state. The
result should include only states that have at least one complaint.
2.2.5. Nested dictionaries
Dictionaries can be nested, meaning that the value associated with a key can itself be a dictionary. For example, we might have a dictionary that maps each candidate to a dictionary that maps a ZIP Code as a string to the total contributions to that candidate from that ZIP Code. Here’s a version of this dictionary constructed using the contributions listed above:
contributions_by_cand_by_zipcode = {
"Kodos for President 2016": {"60637": 100.00},
"Kang for President 2016": {"07974": 50.00,
"60637": 27.50}
}
Note that we chose to represent ZIP Codes as strings, not integers, to make sure ZIP Codes like 07974 don’t lose their leading zero and become 7974.
We can extract the total amount of contributions from 60637
to
the Kang for President 2016
campaign using this expression:
>>> contributions_by_cand_by_zipcode["Kang for President 2016"]["60637"]
27.5
The first set of square brackets extracts the sub-dictionary
associated with "Kang for President 2016"
in
contributions_by_cand_by_zipcode
, while the second retrieves the
value associated with "60637"
from that sub-dictionary.
The code to compute this dictionary is not that much more complex than the code for accumulating total contributions by candidate.
def total_contributions_by_campaign_by_zip(contributions):
"""
Compute the total contributions from each ZIP Code for each
campaign
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
Returns (Dict[str, Dict[str, float]]): a dictionary that maps a
campaign name to a sub-dictionary that map a ZIP Codes (represented as
a string) to the total contributions to that campaign from that zipcode
(represented as a float).
"""
rv = {}
for contribution in contributions:
campaign = contribution["campaign"]
zipcode = contribution["zip_code"]
if campaign not in rv:
rv[campaign] = {}
rv[campaign][zipcode] = rv[campaign].get(zipcode, 0) + contribution["amount"]
return rv
For each contribution, we first ensure that the campaign is
initialized properly in the return value. We then use the get
method to retrieve the total for contributions seen thus far from this
zipcode for the campaign, which will be zero for the first
contribution from this zipcode. Once we have that value , we can just
add in the amount of the current contribution and update the
dictionary. Here’s a sample run of the function using the list of
contributions defined earlier in the chapter:
>>> total_contributions_by_campaign_by_zip(contributions)
{'Kang for President 2016': {'60637': 27.5, '07974': 50.0}, 'Kodos for President 2016': {'60637': 100.0}}
Notice that, as expected, the result contains nested dictionaries.
2.2.5.1. Practice Problems
Problem 3
Here is an example dictionary, complaints_by_company
that maps a
company name to a list of complaints about that company:
complaints_by_company = \
{"Smith's Forge": [{'Company': "Smith's Forge",
'Date received': '07/29/2013',
'Complaint ID': '468882',
'Issue': 'Managing the loan or lease',
'Product': 'Consumer Loan',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed with explanation',
'State': 'VA'}],
'Acme Anvils': [{'Company': 'Acme Anvils',
'Date received': '08/23/2013',
'Complaint ID': '567783',
'Issue': 'Not heavy enough',
'Product': 'Anvil',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed without explanation',
'State': 'IL'},
{'Company': 'Acme Anvils',
'Date received': '09/17/2013',
'Complaint ID': '779986',
'Issue': 'too heavy enough',
'Product': 'Anvil',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed with explanation',
'State': 'VA'}]}
What is the result of evaluating the following expressions:
len(complaints_by_company['Acme Anvils'])
complaints_by_company["Smith's Forge"][0]
complaints_by_company['Acme Anvils'][1]
Write an expression that extracts the state from the first complaint in the list associated with 'Acme Anvils`
.
Problem 4
Write a function, organize_complaints_by_company
, that takes list
of complaint dictionaries, creates a new dictionary that maps the name
of a company to a list of the complaint dictionaries that concern that
company. For example, the dictionary shown in Problem 3 was created
by our implementation of this function.
2.2.6. Data structures and their complexity
The complexity of a computational solution to a problem, and how well that solution performs, depends not only on the algorithm we use, but also on our choice of data structures for that solution. Now that you know about two data structures, lists and dictionaries, it is important to understand the implications of using one or the other. This decision will not be just a matter of personal preference: your choice of data structure can have a considerable impact on how well (or how poorly) your code performs.
For example, our political contribution data only has postal ZIP Codes for the contributions, but not full addresses. One thing we might want to do is compute the total contributions in each state and, to do so, we need some way to map ZIP Codes to states. For the purposes of this example, let’s assume we only care about three ZIP Codes: 60637 in Illinois, 07974 in New Jersey, and 94043 in California.
We could represent this data easily enough with a dictionary:
zip_codes_dict = {"60637": "IL",
"07974": "NJ",
"94305": "CA"}
But we could also store this mapping from ZIP Codes to states in a list of tuples, where each tuple represents a mapping from a ZIP Code to that ZIP Code’s state:
zip_codes_list = [("60637", "IL"), ("07974", "NJ"), ("94305", "CA")]
Of course, the actual list of all the ZIP Codes would be much larger, with more than 40,000 entries.
Given a ZIP Code, obtaining that ZIP Code’s state would involve iterating through the list until we find the tuple that contains the mapping from that ZIP Code to the corresponding state. We can write a simple function to perform this operation:
def get_state_from_zip(zip_code, zip_codes_list):
"""
Find the state associated with a given ZIP Code
Args:
zip_code (str): a ZIP Code
zip_codes_list (List[Tuple[str, str]]): pairs of ZIP Codes and state abbreviations
Returns (str | None): state abbreviation or None, if the ZIP Code
does not appear in the list
"""
for zc, st in zip_codes_list:
if zc == zip_code:
return st
return None
>>> get_state_from_zip("60637", zip_codes_list)
'IL'
>>> get_state_from_zip("90210", zip_codes_list)
Notice that, if there is no corresponding ZIP Code in our list, the
function simply returns None
.
So, at this point we have two solutions that, essentially, accomplish the same task:
>>> zip_codes_dict
{'60637': 'IL', '07974': 'NJ', '94305': 'CA'}
>>> zip_codes_list
[('60637', 'IL'), ('07974', 'NJ'), ('94305', 'CA')]
>>> zip_codes_dict.get("60637")
'IL'
>>> get_state_from_zip("60637", zip_codes_list)
'IL'
>>> zip_codes_dict.get("07974")
'NJ'
>>> get_state_from_zip("07974", zip_codes_list)
'NJ'
>>> zip_codes_dict.get("94305")
'CA'
>>> get_state_from_zip("94305", zip_codes_list)
'CA'
>>> zip_codes_dict.get("11111")
>>> get_state_from_zip("11111", zip_codes_list)
Finding a given ZIP Code’s state in the list, however, takes an amount of time that is proportional to the size of the list. If we had 40,000 ZIP Codes, and the ZIP Code we’re looking for is at the end of the list, we would have to traverse the entire list to find that value. While it’s true that sometimes we’ll search for a value that is towards the start of the list, the time to find a value will on average be proportional to the number of elements in the list. If we had 20,000,000 values in the list, the average time to find a value would be larger than if we have 40,000 values in the list.
Dictionaries, on the other hand, are implemented internally using a data structure called a hash table that is optimized to access key-value mappings very quickly. In fact, the amount of time it takes to access a key-value mapping is not proportional to the size of the input (in this case, the ZIP Codes). So, finding a value in a dictionary with 40,000 values will (roughly) take the same time as finding it in a dictionary with 20,000,000 values.
Big-O notation
The performance or complexity of an algorithm or, in this case, of an operation on a data structure, is typically specified using big-O notation.
While big-O notation has a formal definition, it is not essential to understand the concept. In a nutshell, if \(n\) is the size of the input to a problem (e.g., in this example, \(n\) would be the number of ZIP Codes), then saying that something runs in \(O(n^2)\) means that its running time is roughly proportional to \(n^2\) (i.e., as n gets bigger, the running time grows quadratically). On the other hand, if an algorithm runs in \(O(\log n)\), its running time grows logarithmically.
Big-O notation helps us compare the performance of algorithms or individual operations on data structures. If one data structure can perform an operation in \(O(n)\) and another data structure can perform that same operation in \(O(n^2)\), we know that, on average, the first data structure performs that operation faster than the second data structure.
However, there is no “golden data structure” that beats every other data structure in every possible operation. In fact, when a data structure provides good performance in one operation, there will usually be a trade-off: other operations may be less efficient, or even not supported by the data structure. So, choosing the right data structure often involves asking yourself what operations you will be using the most, and whether each data structure can perform those operations efficiently.
In this case, we are looking at just one operation: finding a value in a list versus finding a value (through its key) in a dictionary. If \(n\) is the number of values, this operation can be done on lists in \(O(n)\) (the running time is linearly proportional to \(n\)), while this operation can be done on dictionaries in \(O(1)\), meaning that the running time is not proportional to \(n\) (this is often referred to as “constant time”; strictly speaking, this operation is done in something called “amortized constant time”, but you don’t need to concern yourself with this distinction). So, in this case, a dictionary would be a better choice.
However, you should not jump to the conclusion that we should now use dictionaries for everything. If we were evaluating a different operation, it could turn out that lists would be a better choice. We explore this proposition in more detail below.
We can observe this difference in performance empirically by running our code with
the full ZIP Code database. We will show you the result of doing this experiment but,
if you would like to follow along, you can use the following files provided in our
example code: the data_structures/dictionaries/zipcodes.py
module,
as well as the ZIP Code database, data_structures/dictionaries/us_postal_codes.csv
.
You will also need to use IPython, not the regular Python interpreter (make sure to run
it in the same directory that contains the zipcodes.py
and us_postal_codes.csv
files.
From the IPython interpreter, run the following:
In [1]: run zipcodes.py
This command will run the code in zipcodes.py
, which reads the ZIP
Code database and loads it into both a list (zip_codes_list
) and a
dictionary (zip_codes_dict
). It will also define the
get_state_from_zip
function we wrote earlier, as well as a
function called gen_random_zip
that generates a random ZIP
Code. We can see that they work as expected:
In [2]: get_state_from_zip("60637", zip_codes_list)
Out[2]: 'IL'
In [3]: get_state_from_zip("90210", zip_codes_list)
Out[3]: 'CA'
In [4]: zip_codes_dict.get("60637")
Out[4]: 'IL'
In [5]: zip_codes_dict.get("90210")
Out[5]: 'CA'
In [6]: gen_random_zip()
Out[6]: '48344'
In [7]: gen_random_zip()
Out[7]: '22219'
Note: You will very likely get different ZIP Codes when you call gen_random_zip
.
When you make these individual calls, they may seem to be returning nearly instantaneously. Unfortunately, testing code performance informally in the interpreter obscures what will happen when you write a program that involves calling a function thousands or even millions of times.
IPython includes a handy %timeit
command that will allow us to see how a piece of code performs when
run many times. As it turns out, the dictionary version is nearly 200 times faster than the list-based
version!
In [8]: %timeit zip_codes_dict.get(gen_random_zip())
100000 loops, best of 3: 4.9 µs per loop
In [9]: %timeit get_state_from_zip(gen_random_zip(), zip_codes_list)
1000 loops, best of 3: 955 µs per loop
Note: You will likely get different running times, but the order of magnitude between the two times should be roughly the same.
Things get more interesting if we actually look at how these running times change depending
on the size of the dataset. Our zipcodes.py
file also defines a small_zip_codes_list
and a small_zip_codes_dict
with just 500 ZIP Codes, and a medium_zip_codes_list
and a medium_zip_codes_dict
with just 2000 ZIP Codes. If we use the %timeit
command
to time our list-based and dictionary-based implementations with these datasets, we get
the following times:
n = 5000 |
n = 20000 |
n = 43624 |
|
---|---|---|---|
Lists |
0.142 ms |
0.493 ms |
1.01 ms |
Dictionaries |
0.00477 ms |
0.00478 ms |
0.0048 ms |
So, not only are dictionaries faster than lists in an absolute sense, their performance is independent of the size of the input. On the other hand, notice that as the dataset grows the list-based implementation takes more and more time.
This result occurs because, as we saw earlier, finding the mapping from a ZIP Code to a state in a list takes \(O(n)\) time, while finding a mapping in a dictionary takes \(O(1)\) time. If \(n\) is the number of ZIP Codes in our dataset, then the running time of a \(O(n)\) solution will increase (roughly) linearly as our dataset gets bigger, while the running time of a \(O(1)\) operation will remain (roughly) constant.
So, based on all this, it sounds like we should just use dictionaries everywhere instead of lists, right? Not so fast! As we mentioned earlier, there is no such thing as a “golden data structure” that performs every possible operation in an optimal fashion. There is, however, “the right data structure for this job” and, if “this job” is finding a mapping between a key and a value, then dictionaries are definitely the right data structure.
There are other operations, however, for which dictionaries are not so great. Most notably,
as we saw earlier, dictionaries are unordered data structures, which means that iterating
over a dictionary is not guaranteed to yield the keys in any particular order (and that
order can even change from one for
loop to another). On top of that, even if we don’t
care about the order in which the values are processed, a for
loop over a dictionary
will usually be slower than a for
loop over a list containing the same values.
So, if the order of the data matters, or if we are going to process it in sequence often,
then using a list will probably be a better choice.
Note
Why did we use random ZIP Codes instead of just testing the code with a fixed ZIP Code? If we used a fixed ZIP Code, the results could be skewed depending on where that ZIP Code is located in the list. For example, if we used 00210 (the first ZIP Code in the list), our list implementation would always find that ZIP Code very quickly (because it doesn’t have to iterate through the rest of the list), making it look like the list version is just as good as the dictionary version:
In [10]: %timeit zip_codes_dict.get("00210")
The slowest run took 21.25 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 63.3 ns per loop
In [11]: %timeit get_state_from_zip("00210", zip_codes_list)
The slowest run took 17.80 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 127 ns per loop
In this case, the dictionary version is only two times faster than the list version. Notice that we also get a message about “an intermediate result being cached”. This phrase refers to the fact that computers are usually smart about detecting when a location in memory is being accessed frequently, and making sure it is “cached” in higher-speed memory in the computer’s processor (instead of having to go to the computer’s main memory, which is slower). Testing our implementation with the same inputs over and over again could show performance gains that are due to this caching mechanism, not the efficiency of the data structure itself.
2.2.7. Sets
Dictionaries are great for associating values with unique keys, but sometimes we may simply want to have a collection of unique keys that we can access as efficiently as a dictionary, but without associating a value with each key. To do this, we can simply use a set data structure, which allows us to store an unordered collection of distinct elements.
Typical set operations include:
membership (\(e \in S\)): which tests whether a given element, \(e\), appears in the set \(S\),
subset (\(S \subseteq T\)): which tests whether every element in \(S\) also appears as an element in \(T\).
union (\(S \cup T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and/or \(e \in T\),
intersection (\(S \cap T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and \(e \in T\), and
difference (\(S - T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and \(e \notin T\).
Sets are sufficiently useful that Python provides a built-in set
data structure.
We can define a set literal by surrounding the initial elements of the
set with curly braces or using the built-in set
function with a
sequence (list, string, etc.) as the parameter:
>>> zipcodes = {"60637", "07974"}
>>> zipcodes
{'60637', '07974'}
>>> zipcodes = set(["60637", "07974"])
>>> zipcodes
{'60637', '07974'}
>>> vowels = set("aeiou")
>>> vowels
{'o', 'u', 'e', 'i', 'a'}
To construct an empty set, we use the set
function with no arguments:
>>> empty_set = set()
We cannot use curly braces to create an empty set, because Python
interprets {}
as an empty dictionary.
We can determine the number of elements in a set using the len
function:
>>> len(empty_set)
0
>>> len(zipcodes)
2
>>> len(vowels)
5
and we can test for set membership using the in
operator:
>>> "14850" in zipcodes
False
>>> "60637" in zipcodes
True
We can determine whether one set is a subset of another set using the
issubset
method. Specifically, a call of the form
S.issubset(T)
tests whether the set S
is a subset of the set
T
. For example:
>>> {"07974"}.issubset(zipcodes)
True
>>> {"60637", "14850"}.issubset(zipcodes)
False
Sets are mutable: we can add and remove elements. We can add elements
to a set using the add
method:
>>> zipcodes
{'60637', '07974'}
>>> zipcodes.add("14850")
>>> zipcodes.add("60637")
>>> zipcodes
{'60637', '14850', '07974'}
The second example illustrates an important point about sets: adding an element that is already a member of the set has no effect.
We can take elements out of a set using either the remove
method
or the discard
method. The remove
method removes a value from
a set, if it is a member, and throws a KeyError
exception if it is
not. The discard
method, in contrast, removes the value if it is
a member of the set and does nothing if the value is not a member of
the set.
>>> zipcodes
{'60637', '14850', '07974'}
>>> zipcodes.remove("60637")
>>> zipcodes.remove("60615")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: '60615'
>>> zipcodes
{'14850', '07974'}
>>> zipcodes.discard("14850")
>>> zipcodes.discard("60615")
>>> zipcodes
{'07974'}
Using the add
method and len
function, we can easily write
a function to determine how many different ZIP Codes had at least one
contribution.
def compute_num_zipcodes(contributions):
"""
Compute the number of zipcodes with at least one contribution
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
Returns (int): the number of zipcodes with at least one
contribution.
"""
zipcodes_seen = set()
for contribution in contributions:
zipcodes_seen.add(contribution["zip_code"])
return len(zipcodes_seen)
>>> compute_num_zipcodes(contributions)
2
Alternatively, we can use set comprehensions, which are like dictionary comprehensions with only keys, to do the same computation:
def compute_num_zipcodes(contributions):
"""
Compute the number of zipcodes with at least one contribution
Args:
contributions (List[Dict[str, (str | float)]]): one dictionary per
contribution
Returns (int): the number of zipcodes with at least one
contribution
"""
zipcodes_seen = {c["zip_code"] for c in contributions}
return len(zipcodes_seen)
>>> compute_num_zipcodes(contributions)
2
Sets come with a variety of useful operations. We can compute a new
set that is the union of two sets using the |
operator or the
union
method, the intersection of two sets using the &
operator or the intersection
method, and the difference of two
sets using the -
operator or the difference
method:
>>> zipcodes = {"60637", "07974"}
>>> zipcodes | {"60637", "60615"}
{'60637', '07974', '60615'}
>>> zipcodes.union({"60637", "60615"})
{'60637', '07974', '60615'}
>>> zipcodes & {"60637", "60615"}
{'60637'}
>>> zipcodes.intersection({"60637", "60615"})
{'60637'}
>>> zipcodes - {"60637", "60615"}
{'07974'}
>>> zipcodes.difference({"60637", "60615"})
{'07974'}
Finally, since sets are collections, we can iterate over the elements in a set:
>>> for zc in zipcodes:
... print(zc)
...
60637
07974
As with dictionaries, sets are unordered and so the elements will be printed in an arbitrary order.
2.2.7.1. Practice Problem
Problem 5
Write a function, count_unique_states_per_company
, that takes a
list of complaint dictionaries and returns a dictionary that maps a
company name to the number of different states that had complaints
about that company. For example, if a company had two complaints from
Viginia and two from New Jersey, the result for that company would be
two.
Hint: build an intermediate data structure that maps the name of a company to set a of the states that had at least one complaint for that company.
2.2.8. Practice Problem Solutions
Here is an example list of complaints that will be used in test code for the practice problems.
complaint_list = [
{'Company': "Smith's Forge",
'Date received': '07/29/2013',
'Complaint ID': '468882',
'Issue': 'Managing the loan or lease',
'Product': 'Consumer Loan',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed with explanation',
'State': 'VA'},
{'Company': 'Acme Anvils',
'Date received': '08/23/2013',
'Complaint ID': '567783',
'Issue': 'Not heavy enough',
'Product': 'Anvil',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed without explanation',
'State': 'IL'},
{'Company': 'Acme Anvils',
'Date received': '09/17/2013',
'Complaint ID': '779986',
'Issue': 'too heavy enough',
'Product': 'Anvil',
'Consumer complaint narrative': '',
'Company response to consumer': 'Closed with explanation',
'State': 'VA'}]
Problem 1
Here is a simple solution for this problem:
def count_complaints_about(complaints, company_name):
"""
Count complaints about the specified company
Args:
complaints (List[Dict[str, str]) A list of complaints, where each
complaint is a dictionary
company_name [str]: The company name to count complaints for
Returns [int]: count of complaints
"""
count = 0
for c in complaints:
if c["Company"] == company_name:
count = count + 1
return count
And here is some simple test code:
def test_count_complaints_about():
""" Simple tests for count_complaints_about """
# Test the empty list
assert count_complaints_about([], "Smith's Forge") == 0
# Test one element lists using a complaint about Wells Fargo.
assert count_complaints_about([complaint_list[0]], "Smith's Forge") == 1
assert count_complaints_about([complaint_list[0]], 'Acme Anvils') == 0
# Test a list with multiple complaints
longer_list = complaint_list + [complaint_list[0]] * 2
assert count_complaints_about(longer_list, "Smith's Forge") == 3
assert count_complaints_about(longer_list, 'Acme Anvils') == 2
assert count_complaints_about(longer_list, 'Acme Sweaters') == 0
Problem 2
Here is a solution for this problem that uses the in
operator:
def count_complaints_by_state(complaints):
"""
Compute the number of complaints for each state.
Args:
complaints (List[Dict[str, str]]): A list of complaints, where each
complaint is a dictionary
Returns: (Dict[str, int]): a dictionary that maps the name of a state to
the number of complaints from that state.
"""
d = {}
for c in complaints:
state = c["State"]
if state in d:
d[state] = d[state] + 1
else:
d[state] = 1
return d
And here is a solution that uses the dictionary get
method:
def count_complaints_by_state(complaints):
"""
Compute the number of complaints for each state.
Args:
complaints (List[Dict[str, str]]): A list of complaints, where each
complaint is a dictionary
Returns: (Dict[str, int]): a dictionary that maps the name of a state to
the number of complaints from that state.
"""
d = {}
for c in complaints:
state = c["State"]
if state in d:
d[state] = d[state] + 1
else:
d[state] = 1
return d
And here is some simple test code:
def test_count_complaints_by_state():
""" Simple tests for count_complaints_by_state """
# Test the empty list
assert count_complaints_by_state([]) == {}
# Test one element lists using a complaint from VA
assert count_complaints_by_state([complaint_list[0]]) == {'VA': 1}
# Test a list with multiple complaints
longer_list = complaint_list + [complaint_list[0]] * 2
assert count_complaints_by_state(longer_list) == {'VA' : 4, 'IL' : 1}
Problem 3
>>> complaints_by_company = \
... {"Smith's Forge": [{'Company': "Smith's Forge",
... 'Date received': '07/29/2013',
... 'Complaint ID': '468882',
... 'Issue': 'Way way too hot',
... 'Product': 'Consumer Loan',
... 'Consumer complaint narrative': '',
... 'Company response to consumer': 'Closed with explanation',
... 'State': 'VA'}],
... 'Acme Anvils': [{'Company': 'Acme Anvils',
... 'Date received': '08/23/2013',
... 'Complaint ID': '567783',
... 'Issue': 'Not heavy enough',
... 'Product': 'Anvil',
... 'Consumer complaint narrative': '',
... 'Company response to consumer': 'Closed without explanation',
... 'State': 'IL'},
... {'Company': 'Acme Anvils',
... 'Date received': '09/17/2013',
... 'Complaint ID': '779986',
... 'Issue': 'too heavy',
... 'Product': 'Anvil',
... 'Consumer complaint narrative': '',
... 'Company response to consumer': 'Closed with explanation',
... 'State': 'VA'}]}
>>> len(complaints_by_company['Acme Anvils'])
2
>>> complaints_by_company["Smith's Forge"][0]
{'Company': "Smith's Forge", 'Date received': '07/29/2013', 'Complaint ID': '468882', 'Issue': 'Way way too hot', 'Product': 'Consumer Loan', 'Consumer complaint narrative': '', 'Company response to consumer': 'Closed with explanation', 'State': 'VA'}
>>> complaints_by_company['Acme Anvils'][1]
{'Company': 'Acme Anvils', 'Date received': '09/17/2013', 'Complaint ID': '779986', 'Issue': 'too heavy', 'Product': 'Anvil', 'Consumer complaint narrative': '', 'Company response to consumer': 'Closed with explanation', 'State': 'VA'}
>>> complaints_by_company['Acme Anvils'][0]['State']
'IL'
Problem 4
Here is a solution for this problem:
def organize_complaints_by_company(complaints):
"""
Create a dictionary that maps the name of a company to a list of the
complaint dictionaries that concern that company.
Args:
complaints (List[Dict[str, str]) A list of complaints, where each
complaint is a dictionary
Returns: (Dict[str, List[Dict[str, str]]]): a dictionary that names the name
of a company to a list of complaints for that company.
"""
d = {}
for complaint in complaints:
c = complaint["Company"]
if c not in d:
d[c] = []
# we are guaranteed that c will be in d at this
# point and that d[c] will be a list.
d[c].append(complaint)
return d
And here is some simple test code:
def test_organize_complaints_by_company():
""" Simple tests for organize_complaints_by_company """
# Empty list of complaints
assert organize_complaints_by_company([]) == {}
# List with one complaint
c = complaint_list[0]
assert organize_complaints_by_company([c]) == {c["Company"] : [c]}
# Test a list with multiple complaints
longer_list = complaint_list
assert organize_complaints_by_company(longer_list) == \
{"Smith's Forge" : [complaint_list[0]],
'Acme Anvils' : complaint_list[1:]}
Problem 5
Here is a solution for this problem:
def test_count_unique_states_per_company():
""" Simple tests for count_unique_states_per_company """
# Empty list of complaints
assert count_unique_states_per_company([]) == {}
# List with one complaint
c = complaint_list[0]
assert count_unique_states_per_company([c]) == {c["Company"] : 1}
# Test a list with multiple complaints
longer_list = complaint_list
assert count_unique_states_per_company(longer_list) == \
{"Smith's Forge" : 1,
'Acme Anvils' : 2}
Our solution uses an intermediate data structure that maps the name of
a company to a set of the states that had at least one complaint from
that company. It then uses a dictionary comprehesion to compute the
final result. Notice the use of the items
method in the
dictionary comprehension.
Here is some simple test code for this problem:
def test_count_unique_states_per_company():
""" Simple tests for count_unique_states_per_company """
# Empty list of complaints
assert count_unique_states_per_company([]) == {}
# List with one complaint
c = complaint_list[0]
assert count_unique_states_per_company([c]) == {c["Company"] : 1}
# Test a list with multiple complaints
longer_list = complaint_list
assert count_unique_states_per_company(longer_list) == \
{"Smith's Forge" : 1,
'Acme Anvils' : 2}