2.2. Dictionaries and Sets

This chapter introduces two new structures: dictionaries and sets.

2.2.1. Introducing Dictionaries

In the previous chapter, we saw that lists and tuples allow us to store multiple values in a single data structure, but they do so in a very specific way: the values are stored in a sequence. Python makes it very easy to iterate over that sequence, as well as access items at specific positions of the sequence. This design makes lists, tuples, etc great data structures for some use cases but not for others.

For example, suppose we were working with data on contributions to political campaigns, with the following information for each contribution:

  • First name of contributor

  • Last name of contributor

  • ZIP Code where contributor lives

  • Campaign that received the contribution

  • Contribution amount

We could represent an individual contribution like this:

c = ("John", "Doe", "60637", "Kang for President 2016", 27.50)

And a list of contributions like this:

contributions = [
    ("John", "Doe", "60637", "Kang for President 2016", 27.50),
    ("Jane", "Doe", "60637", "Kodos for President 2016", 100.00),
    ("James", "Roe", "07974", "Kang for President 2016", 50.00)
    ]

Accessing the individual values inside a contribution requires using the values’ positions in the tuple, which can lead to code that is difficult to read. For example, suppose we wanted to write a function that adds up all the contributions made to a given campaign. The function would look something like this:

def total_contributions_candidate(contributions, campaign):
    """
    Compute total contributions to a candidate's campaign

    Args:
        contributions (List[Tuple(str, str, str, str, int)]]): one list per contribution
        campaign (str): the name of the campaign

    Returns (float): total amount contributed
    """
    total = 0
    for contribution in contributions:
        if campaign == contribution[3]:
            total += contribution[4]
    return total
>>> total_contributions_candidate(contributions, "Kang for President 2016")
77.5
>>> total_contributions_candidate(contributions, "Kodos for President 2016")
100.0

Using contribution[3] to access the name of the campaign and contribution[4] to access the contribution amount makes the code hard to read. There are ways to make it a bit more readable, such as defining variables to store the position of each value (e.g., we could define CAMPAIGN_INDEX = 3 and then access the campaign by writing contribution[CAMPAIGN_INDEX]), but this approach is error-prone. We could easily use the wrong value for the constant or the format of the data might change over time. In the latter case, we would need to check the fields we are using carefully to ensure that we are still accessing the correct positions.

For this application, it would be better to use a data structure called a dictionary. Like lists, dictionaries can be used to store multiple values, but do so by associating each value with a unique key rather than with a position in a sequence. This arrangement is similar to a physical dictionary: the words are the “keys” and the definitions are the “values”.

We could store an individual contribution in a dictionary like this:

d = {"first_name": "John",
     "last_name": "Doe",
     "zip_code": "60637",
     "campaign": "Kang for President 2016",
     "amount": 27.50}

Notice that we have a sequence of entries separated by commas where each entry is a key-value pair separated by a colon. Instead of accessing values by their positions in the data structure, we access them by their keys. For example, if we wanted to access the ZIP Code of this contribution, we would use the key "zip_code":

>>> d["zip_code"]
'60637'

Because values are accessed by key rather than by position, dictionaries are also referred to in other programming languages as “associative arrays” or “maps” (in the sense that they associate or map a key to a value).

In this example, all of the keys are strings. Although this design is fairly common, it is not required. In fact, many different Python types can be used for keys in a dictionary. For now, we’ll restrict ourselves to types with values that cannot be changed (i.e. immutable types), which includes strings, integers, booleans, tuples of strings, etc. We’ll come back to the question of acceptable types for dictionary keys briefly after we introduce Python classes.

While the types used for dictionary keys are restricted, the values can have any type and different keys can have values with different types. Notice, for example, that not all of the keys in our example have values with the same type: the value for "amount" is a float, while the rest are strings.

Our list of contributions would now look like this:

contributions = [
   {"first_name": "John",
    "last_name": "Doe",
    "zip_code": "60637",
    "campaign": "Kang for President 2016",
    "amount": 27.50},
   {"first_name": "Jane",
    "last_name": "Doe",
    "zip_code": "60637",
    "campaign": "Kodos for President 2016",
    "amount": 100.00},
   {"first_name": "James",
    "last_name": "Roe",
    "zip_code": "07974",
    "campaign": "Kang for President 2016",
    "amount": 50.00}
]

And our function to compute the total contributions now looks like this:

def total_contributions_candidate(contributions, campaign):
    """
    Compute total contributions to a candidate's campaign

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
        contribution
      campaign (string): the name of the campaign

    Returns (float): total amount contributed
    """

    total = 0
    for contribution in contributions:
        if campaign == contribution["campaign"]:
            total += contribution["amount"]
    return total
>>> total_contributions_candidate(contributions, "Kang for President 2016")
77.5
>>> total_contributions_candidate(contributions, "Kodos for President 2016")
100.0

The new function looks very similar to the original, but it is easier to read and maintain.

Dictionaries do more than simply provide convenient syntax for accessing values by key instead of position. As we’ll see later in the chapter, dictionaries can also perform certain operations much more efficiently than lists and allow us to implement certain algorithms more elegantly.

It is very common to use a dictionary to represent a single “object” (in this case, a single contribution to a campaign, with each attribute of the contribution represented by a key in the dictionary), especially when exchanging data between different systems. In fact, later in the book, we will see a data format, JSON, that works well with Python dictionaries and lists. Later on, we will also see how object-oriented programming provides a third way of working with “objects” in our programs.

2.2.1.1. Practice Problems

The practice problems in this chapter use a data set that contains customer complaints. Each complaint includes:

  • the name of the company,

  • the date the complaint was received,

  • a unique ID,

  • the issue,

  • the product,

  • the consumer’s complaint narrative,

  • the company’s public response, and

  • the consumer’s home state.

This information will be stored in a dictionary where both the keys and the values are strings (i.e., the type of the dictionaries will be Dict[str, str]) . Here’s an example complaint:

complaint_as_dict =
   {'Company': "Smith's Forge",
    'Date received': '07/29/2013',
    'Complaint ID': '468882',
    'Issue': 'Managing the loan or lease',
    'Product': 'Consumer Loan',
    'Consumer complaint narrative': '',
    'Company response to consumer': 'Closed with explanation',
    'State': 'VA'}

Problem 1

Write a function, count_complaints_about, that takes a list of complaints and the name of a company and returns the number of complaints in the list for that company.

What might make good test cases for this function?

A possible pitfall: floats as keys

Before we move on talking about useful dictionary operations, we’d like to stop and consider a simple example an example that may surprise you:

>>> d = {0.3: "found"}
>>> d.get(0.1 + 0.1 + 0.1, "not found")
'not found'

Using your understanding of real numbers, you might expect get to return "found", the value associated with 0.3. Using floats, however, 0.1 + 0.1 + 0.1 does not equal 0.3 and so, get does not find the key in the dictionary and returns the default value "not_found".

So, while you technically use floats as keys for a dictionary, you should only do so when you are using values that can be represented exactly with a power of two or as the sum of powers of two.

2.2.2. Useful operations with dictionaries

Earlier, we saw that we can define a dictionary like this:

d = {"first_name": "John",
     "last_name": "Doe",
     "zip_code": "60637",
     "campaign": "Kang for President 2016",
     "amount": 27.50}

And access individual values by their keys like this:

>>> d["zip_code"]
'60637'

Note that when we attempt to access values in the dictionary, specifying a key that doesn’t exist in the dictionary produces an error:

>>> d["affiliation"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'affiliation'

In addition to the square-bracket notation, we can access a value using the get method, which method returns None if the dictionary does not contain the specified key.

>>> st = d.get("first_name")
>>> print(st)
John
>>> st = d.get("affiliation")
>>> print(st)
None

The get method also takes an optional second parameter to specify a default value that should be returned if the key doesn’t exist in the dictionary:

>>> d.get("affiliation", "Unknown")
'Unknown'

We can check if a dictionary contains a given key using the in operator:

>>> "first_name" in d
True
>>> "affiliation" in d
False

Dictionaries are also mutable: we can update the value associated with a key in a dictionary and add new key-value pairs. We assign a new value to a key like this:

>>> d["zip_code"] = "94305"
>>> d["zip_code"]
'94305'

and add a new key-value pair using the same syntax:

>>> d["affiliation"] = "Kodosican"
>>> d
{'first_name': 'John', 'last_name': 'Doe', 'zip_code': '94305', 'campaign': 'Kang for President 2016', 'amount': 27.5, 'affiliation': 'Kodosican'}

We can also create a dictionary by starting from an empty (or partial) dictionary and assigning values:

d = {}
d["first_name"] = "John"
d["last_name"] = "Doe"
d["zip_code"] = "60637"
d["campaign"] = "Kang for President 2016"
d["amount"] = 27.50
d["affiliation"] = "Kodosican"
>>> d
{'first_name': 'John', 'last_name': 'Doe', 'zip_code': '60637', 'campaign': 'Kang for President 2016', 'amount': 27.5, 'affiliation': 'Kodosican'}

We can remove an entry in the dictionary using the del operator:

>>> "affiliation" in d
True
>>> del d["affiliation"]
>>> "affiliation" in d
False

We can iterate over the keys in a dictionary using the dictionary’s keys method:

>>> for key in d.keys():
...     print(key)
... 
first_name
last_name
zip_code
campaign
amount

This operation is sufficiently common that Python provides a shorthand: iterating over the dictionary itself is equivalent to iterating over the keys. That is, when you iterate over a dictionary, the loop variable will iterate over the keys of the dictionary:

>>> for key in d:
...     print(key)
... 
first_name
last_name
zip_code
campaign
amount

We can also iterate over the values using the dictionary’s values method:

>>> for value in d.values():
...     print(value)
... 
John
Doe
60637
Kang for President 2016
27.5

And, finally, we can iterate over both the keys and the values using the dictionary’s items method:

>>> for key, value in d.items():
...     print(key, value)
... 
first_name John
last_name Doe
zip_code 60637
campaign Kang for President 2016
amount 27.5

The keys and values are not printed in any particular order. Most notably, the keys are not shown in alphabetical order. (They may happen to be shown in the order that they are added to the dictionary, but that is not guaranteed.) This behavior is another big difference between dictionaries and lists: lists store values in a specific order, and iterating over a list will always yield those values in that order. There is no such guarantee with dictionaries: if we iterate over the contents of a dictionary, we cannot assume any specific order and, in fact, that order can even change from one for loop to another!

A common pitfall: changing the set of keys in a dictionary as you iterate over it

There is an algorithm that allows us to compute the K most frequently seen items in a stream of data. We are not going to present the whole algorithm here. Instead, we will focus on a specific task within that algorithm that is used to exclude keys that have not been seen recently: take a dictionary that maps strings to positive integers, decrement all the values by one and remove any key-value pair where the resulting value is zero.

You might be tempted to write the following incorrect function to do this task:

def decr_and_remove(d):
    """
    Given a dictionary that maps keys to positive integers,
    decrement the values and remove any key-value pair in which
    the decremented value becomes zero

    Modifies the dictionary in-place.

    Args:
       d (Dict[str, int]): maps keys to positive integers

    Returns: None
    """
    for key, value in d.items():
        d[key] = value - 1
        if d[key] == 0:
            del d[key]

This code is not valid because you cannot modify a data structure as you iterate over it! Here’s one alternative solution:

def decr_and_remove(d):
    """
    Given a dictionary that maps keys to positive integers,
    decrement the values and remove any key-value pair in which
    the decremented value becomes zero

    Modifies the dictionary in-place.

    Args:
       d (Dict[str, int]): maps keys to positive integers

    Returns: None
    """
    keys_to_remove = []
    for key, value in d.items():
        d[key] = value - 1
        if d[key] == 0:
            keys_to_remove.append(key)

    for key in keys_to_remove:
        del d[key]

Here’s another that computes and returns a new dictionary with the desired keys rather than removing the keys from the input dictionary.

def decr_and_keep_pos(d):
    """
    Given a dictionary that maps keys to positive integers,
    subtract one from each value and include in the result only
    those key-value pairs that still have positive values after
    the decrement.

    Args:
      d (Dict[str, int]): keys to positive integers

    Returns (Dict[str, int]): a new dictionary that maps a key to
        an integer value, where the keys are the keys from the
        original dictionary that had values greater than one and
        the original values have been decremented by one.
    """
    rv = {}
    for key, value in d.items():
        if value > 1:
            rv[key] = value - 1
    return rv

2.2.3. Other ways to construct dictionaries

As we have seen, we can construct dictionaries using dictionary literals or updating an empty dictionary with new key-value pairs. We can also construct dictionaries use the dict function and dictionary comprehensions.

The dict function allows us to construct a new dictionary from a variety of different types of values including a list of key-value pairs or another dictionary. For example, here are three different ways to construct the same dictionary: the first uses a dictionary literal, the second uses a call to dict with a list of pairs, and the third uses a call to dict with a dictionary as the argument.

d = {"first_name": "John",
     "last_name": "Doe",
     "zip_code": "60637",
     "campaign": "Kang for President 2016",
     "amount": 27.50}

keys_and_data = [("first_name", "John"),
                 ("last_name", "Doe"),
                 ("zip_code", "60637"),
                 ("campaign", "Kang for President 2016"),
                 ("amount", 27.50)]

d_from_list = dict(keys_and_data)

d_from_dict = dict(d)

Dictionary comprehensions are similar to list comprehensions. In place of one transformation expression, dictionary comprehensions require two: one that yields a key and another that yields a value. For example, here’s another way to construct our sample dictionary:

>>> d = {key: value for key, value in keys_and_data}

This example uses very simple expressions for the key and the value, but in general, you can use arbitrarily complex expressions for either. You can also specify a boolean expression to use as a filter. For example, we could rewrite the decr_and_keep_pos function from the previous section using a dictionary comprehension with a filter:

def decr_and_keep_pos(d):
   """
   Given a dictionary that maps keys to positive integers,
   subtract one from each value and include in the result only
   those key-value pairs that still have positive values after
   the decrement.

   Args:
       d (Dict[str, int]): keys to positive integers

   Returns (Dict[str, int]): a new dictionary that maps a key to
       an integer value, where the keys are the keys from the
       original dictionary that had values greater than one.
       The values have been decremented by one.
   """
   return {key: value - 1 for key, value in d.items() if value > 1}

A key-value pair from the input dictionary will be used in the construction of the result only if the original value is greater than one.

2.2.4. Accumulating values with dictionaries

Dictionaries are often used to accumulate a value based on a key. For example, earlier we computed the total number of contributions for a specific campaign. What if we wanted to compute the total contributions for each campaign? We could call our previous function over and over again, once per campaign, but there is a better approach. A dictionary that maps a campaign’s name to the total contributions to that campaign is a perfect tool for this task. We start with an empty dictionary and add new entries as we encounter new campaigns:

def total_contributions_by_campaign(contributions):
    """
    Compute total contributions by campaign

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
          contribution

    Returns (Dict[str, float]): a dictionary that maps a campaign name to the
        total contributions to the campaign.
    """
    rv = {}
    for contribution in contributions:
        campaign = contribution["campaign"]
        if campaign not in rv:
            rv[campaign] = 0
        rv[campaign] += contribution["amount"]
    return rv

Here’s a sample run of this function using the sample contributions list defined earlier:

>>> total_contributions_by_campaign(contributions)
{'Kang for President 2016': 77.5, 'Kodos for President 2016': 100.0}

Notice that the result is a dictionary.

This implementation uses the not in operator to identify campaigns that are being seen for the first time. An alternative is to use get with a default value of zero for previously unseen campaigns:

def total_contributions_by_campaign(contributions):
    """
    Compute total contributions by campaign

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
        contribution

    Returns (Dict[str, float]): a dictionary that maps a campaign name to the
        total contributions to the campaign.
    """
    rv = {}
    for contribution in contributions:
        campaign = contribution["campaign"]
        rv[campaign] = rv.get(campaign, 0) + contribution["amount"]
    return rv

Accumulating values in this way is very common.

2.2.4.1. Practice Problems

Problem 2

Write a function, count_complaints_by_state, that take a list of complaint dictionaries and computes a dictionary that maps each state to the count the number of complaints received for that state. The result should include only states that have at least one complaint.

2.2.5. Nested dictionaries

Dictionaries can be nested, meaning that the value associated with a key can itself be a dictionary. For example, we might have a dictionary that maps each candidate to a dictionary that maps a ZIP Code as a string to the total contributions to that candidate from that ZIP Code. Here’s a version of this dictionary constructed using the contributions listed above:

contributions_by_cand_by_zipcode = {
    "Kodos for President 2016": {"60637": 100.00},
    "Kang for President 2016": {"07974": 50.00,
                                "60637": 27.50}
}

Note that we chose to represent ZIP Codes as strings, not integers, to make sure ZIP Codes like 07974 don’t lose their leading zero and become 7974.

We can extract the total amount of contributions from 60637 to the Kang for President 2016 campaign using this expression:

>>> contributions_by_cand_by_zipcode["Kang for President 2016"]["60637"]
27.5

The first set of square brackets extracts the sub-dictionary associated with "Kang for President 2016" in contributions_by_cand_by_zipcode, while the second retrieves the value associated with "60637" from that sub-dictionary.

The code to compute this dictionary is not that much more complex than the code for accumulating total contributions by candidate.

def total_contributions_by_campaign_by_zip(contributions):
    """
    Compute the total contributions from each ZIP Code for each
    campaign

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
        contribution

    Returns (Dict[str, Dict[str, float]]): a dictionary that maps a
      campaign name to a sub-dictionary that map a ZIP Codes (represented as
      a string) to the total contributions to that campaign from that zipcode
      (represented as a float).
    """
    rv = {}
    for contribution in contributions:
        campaign = contribution["campaign"]
        zipcode = contribution["zip_code"]
        if campaign not in rv:
            rv[campaign] = {}
        rv[campaign][zipcode] = rv[campaign].get(zipcode, 0) + contribution["amount"]
    return rv

For each contribution, we first ensure that the campaign is initialized properly in the return value. We then use the get method to retrieve the total for contributions seen thus far from this zipcode for the campaign, which will be zero for the first contribution from this zipcode. Once we have that value , we can just add in the amount of the current contribution and update the dictionary. Here’s a sample run of the function using the list of contributions defined earlier in the chapter:

>>> total_contributions_by_campaign_by_zip(contributions)
{'Kang for President 2016': {'60637': 27.5, '07974': 50.0}, 'Kodos for President 2016': {'60637': 100.0}}

Notice that, as expected, the result contains nested dictionaries.

2.2.5.1. Practice Problems

Problem 3

Here is an example dictionary, complaints_by_company that maps a company name to a list of complaints about that company:

complaints_by_company = \
  {"Smith's Forge": [{'Company': "Smith's Forge",
                              'Date received': '07/29/2013',
                              'Complaint ID': '468882',
                              'Issue': 'Managing the loan or lease',
                              'Product': 'Consumer Loan',
                              'Consumer complaint narrative': '',
                              'Company response to consumer': 'Closed with explanation',
                              'State': 'VA'}],
   'Acme Anvils': [{'Company': 'Acme Anvils',
                    'Date received': '08/23/2013',
                    'Complaint ID': '567783',
                    'Issue': 'Not heavy enough',
                    'Product': 'Anvil',
                    'Consumer complaint narrative': '',
                    'Company response to consumer': 'Closed without explanation',
                    'State': 'IL'},
                   {'Company': 'Acme Anvils',
                    'Date received': '09/17/2013',
                    'Complaint ID': '779986',
                    'Issue': 'too heavy enough',
                    'Product': 'Anvil',
                    'Consumer complaint narrative': '',
                    'Company response to consumer': 'Closed with explanation',
                    'State': 'VA'}]}

What is the result of evaluating the following expressions:

  • len(complaints_by_company['Acme Anvils'])

  • complaints_by_company["Smith's Forge"][0]

  • complaints_by_company['Acme Anvils'][1]

Write an expression that extracts the state from the first complaint in the list associated with 'Acme Anvils`.

Problem 4

Write a function, organize_complaints_by_company, that takes list of complaint dictionaries, creates a new dictionary that maps the name of a company to a list of the complaint dictionaries that concern that company. For example, the dictionary shown in Problem 3 was created by our implementation of this function.

2.2.6. Data structures and their complexity

The complexity of a computational solution to a problem, and how well that solution performs, depends not only on the algorithm we use, but also on our choice of data structures for that solution. Now that you know about two data structures, lists and dictionaries, it is important to understand the implications of using one or the other. This decision will not be just a matter of personal preference: your choice of data structure can have a considerable impact on how well (or how poorly) your code performs.

For example, our political contribution data only has postal ZIP Codes for the contributions, but not full addresses. One thing we might want to do is compute the total contributions in each state and, to do so, we need some way to map ZIP Codes to states. For the purposes of this example, let’s assume we only care about three ZIP Codes: 60637 in Illinois, 07974 in New Jersey, and 94043 in California.

We could represent this data easily enough with a dictionary:

zip_codes_dict = {"60637": "IL",
                  "07974": "NJ",
                  "94305": "CA"}

But we could also store this mapping from ZIP Codes to states in a list of tuples, where each tuple represents a mapping from a ZIP Code to that ZIP Code’s state:

zip_codes_list = [("60637", "IL"), ("07974", "NJ"), ("94305", "CA")]

Of course, the actual list of all the ZIP Codes would be much larger, with more than 40,000 entries.

Given a ZIP Code, obtaining that ZIP Code’s state would involve iterating through the list until we find the tuple that contains the mapping from that ZIP Code to the corresponding state. We can write a simple function to perform this operation:

def get_state_from_zip(zip_code, zip_codes_list):
    """
    Find the state associated with a given ZIP Code

    Args:
      zip_code (str): a ZIP Code
      zip_codes_list (List[Tuple[str, str]]): pairs of ZIP Codes and state abbreviations

    Returns (str | None): state abbreviation or None, if the ZIP Code
      does not appear in the list
    """
    for zc, st in zip_codes_list:
        if zc == zip_code:
            return st

    return None
>>> get_state_from_zip("60637", zip_codes_list)
'IL'
>>> get_state_from_zip("90210", zip_codes_list)

Notice that, if there is no corresponding ZIP Code in our list, the function simply returns None.

So, at this point we have two solutions that, essentially, accomplish the same task:

>>> zip_codes_dict
{'60637': 'IL', '07974': 'NJ', '94305': 'CA'}
>>> zip_codes_list
[('60637', 'IL'), ('07974', 'NJ'), ('94305', 'CA')]
>>> zip_codes_dict.get("60637")
'IL'
>>> get_state_from_zip("60637", zip_codes_list)
'IL'
>>> zip_codes_dict.get("07974")
'NJ'
>>> get_state_from_zip("07974", zip_codes_list)
'NJ'
>>> zip_codes_dict.get("94305")
'CA'
>>> get_state_from_zip("94305", zip_codes_list)
'CA'
>>> zip_codes_dict.get("11111")
>>> get_state_from_zip("11111", zip_codes_list)

Finding a given ZIP Code’s state in the list, however, takes an amount of time that is proportional to the size of the list. If we had 40,000 ZIP Codes, and the ZIP Code we’re looking for is at the end of the list, we would have to traverse the entire list to find that value. While it’s true that sometimes we’ll search for a value that is towards the start of the list, the time to find a value will on average be proportional to the number of elements in the list. If we had 20,000,000 values in the list, the average time to find a value would be larger than if we have 40,000 values in the list.

Dictionaries, on the other hand, are implemented internally using a data structure called a hash table that is optimized to access key-value mappings very quickly. In fact, the amount of time it takes to access a key-value mapping is not proportional to the size of the input (in this case, the ZIP Codes). So, finding a value in a dictionary with 40,000 values will (roughly) take the same time as finding it in a dictionary with 20,000,000 values.

Big-O notation

The performance or complexity of an algorithm or, in this case, of an operation on a data structure, is typically specified using big-O notation.

While big-O notation has a formal definition, it is not essential to understand the concept. In a nutshell, if \(n\) is the size of the input to a problem (e.g., in this example, \(n\) would be the number of ZIP Codes), then saying that something runs in \(O(n^2)\) means that its running time is roughly proportional to \(n^2\) (i.e., as n gets bigger, the running time grows quadratically). On the other hand, if an algorithm runs in \(O(\log n)\), its running time grows logarithmically.

Big-O notation helps us compare the performance of algorithms or individual operations on data structures. If one data structure can perform an operation in \(O(n)\) and another data structure can perform that same operation in \(O(n^2)\), we know that, on average, the first data structure performs that operation faster than the second data structure.

However, there is no “golden data structure” that beats every other data structure in every possible operation. In fact, when a data structure provides good performance in one operation, there will usually be a trade-off: other operations may be less efficient, or even not supported by the data structure. So, choosing the right data structure often involves asking yourself what operations you will be using the most, and whether each data structure can perform those operations efficiently.

In this case, we are looking at just one operation: finding a value in a list versus finding a value (through its key) in a dictionary. If \(n\) is the number of values, this operation can be done on lists in \(O(n)\) (the running time is linearly proportional to \(n\)), while this operation can be done on dictionaries in \(O(1)\), meaning that the running time is not proportional to \(n\) (this is often referred to as “constant time”; strictly speaking, this operation is done in something called “amortized constant time”, but you don’t need to concern yourself with this distinction). So, in this case, a dictionary would be a better choice.

However, you should not jump to the conclusion that we should now use dictionaries for everything. If we were evaluating a different operation, it could turn out that lists would be a better choice. We explore this proposition in more detail below.

We can observe this difference in performance empirically by running our code with the full ZIP Code database. We will show you the result of doing this experiment but, if you would like to follow along, you can use the following files provided in our example code: the data_structures/dictionaries/zipcodes.py module, as well as the ZIP Code database, data_structures/dictionaries/us_postal_codes.csv. You will also need to use IPython, not the regular Python interpreter (make sure to run it in the same directory that contains the zipcodes.py and us_postal_codes.csv files.

From the IPython interpreter, run the following:

In [1]: run zipcodes.py

This command will run the code in zipcodes.py, which reads the ZIP Code database and loads it into both a list (zip_codes_list) and a dictionary (zip_codes_dict). It will also define the get_state_from_zip function we wrote earlier, as well as a function called gen_random_zip that generates a random ZIP Code. We can see that they work as expected:

In [2]: get_state_from_zip("60637", zip_codes_list)
Out[2]: 'IL'

In [3]: get_state_from_zip("90210", zip_codes_list)
Out[3]: 'CA'

In [4]: zip_codes_dict.get("60637")
Out[4]: 'IL'

In [5]: zip_codes_dict.get("90210")
Out[5]: 'CA'

In [6]: gen_random_zip()
Out[6]: '48344'

In [7]: gen_random_zip()
Out[7]: '22219'

Note: You will very likely get different ZIP Codes when you call gen_random_zip.

When you make these individual calls, they may seem to be returning nearly instantaneously. Unfortunately, testing code performance informally in the interpreter obscures what will happen when you write a program that involves calling a function thousands or even millions of times.

IPython includes a handy %timeit command that will allow us to see how a piece of code performs when run many times. As it turns out, the dictionary version is nearly 200 times faster than the list-based version!

In [8]: %timeit zip_codes_dict.get(gen_random_zip())
100000 loops, best of 3: 4.9 µs per loop

In [9]: %timeit get_state_from_zip(gen_random_zip(), zip_codes_list)
1000 loops, best of 3: 955 µs per loop

Note: You will likely get different running times, but the order of magnitude between the two times should be roughly the same.

Things get more interesting if we actually look at how these running times change depending on the size of the dataset. Our zipcodes.py file also defines a small_zip_codes_list and a small_zip_codes_dict with just 500 ZIP Codes, and a medium_zip_codes_list and a medium_zip_codes_dict with just 2000 ZIP Codes. If we use the %timeit command to time our list-based and dictionary-based implementations with these datasets, we get the following times:

n = 5000

n = 20000

n = 43624

Lists

0.142 ms

0.493 ms

1.01 ms

Dictionaries

0.00477 ms

0.00478 ms

0.0048 ms

So, not only are dictionaries faster than lists in an absolute sense, their performance is independent of the size of the input. On the other hand, notice that as the dataset grows the list-based implementation takes more and more time.

This result occurs because, as we saw earlier, finding the mapping from a ZIP Code to a state in a list takes \(O(n)\) time, while finding a mapping in a dictionary takes \(O(1)\) time. If \(n\) is the number of ZIP Codes in our dataset, then the running time of a \(O(n)\) solution will increase (roughly) linearly as our dataset gets bigger, while the running time of a \(O(1)\) operation will remain (roughly) constant.

So, based on all this, it sounds like we should just use dictionaries everywhere instead of lists, right? Not so fast! As we mentioned earlier, there is no such thing as a “golden data structure” that performs every possible operation in an optimal fashion. There is, however, “the right data structure for this job” and, if “this job” is finding a mapping between a key and a value, then dictionaries are definitely the right data structure.

There are other operations, however, for which dictionaries are not so great. Most notably, as we saw earlier, dictionaries are unordered data structures, which means that iterating over a dictionary is not guaranteed to yield the keys in any particular order (and that order can even change from one for loop to another). On top of that, even if we don’t care about the order in which the values are processed, a for loop over a dictionary will usually be slower than a for loop over a list containing the same values. So, if the order of the data matters, or if we are going to process it in sequence often, then using a list will probably be a better choice.

Note

Why did we use random ZIP Codes instead of just testing the code with a fixed ZIP Code? If we used a fixed ZIP Code, the results could be skewed depending on where that ZIP Code is located in the list. For example, if we used 00210 (the first ZIP Code in the list), our list implementation would always find that ZIP Code very quickly (because it doesn’t have to iterate through the rest of the list), making it look like the list version is just as good as the dictionary version:

In [10]: %timeit zip_codes_dict.get("00210")
The slowest run took 21.25 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 63.3 ns per loop

In [11]: %timeit get_state_from_zip("00210", zip_codes_list)
The slowest run took 17.80 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 127 ns per loop

In this case, the dictionary version is only two times faster than the list version. Notice that we also get a message about “an intermediate result being cached”. This phrase refers to the fact that computers are usually smart about detecting when a location in memory is being accessed frequently, and making sure it is “cached” in higher-speed memory in the computer’s processor (instead of having to go to the computer’s main memory, which is slower). Testing our implementation with the same inputs over and over again could show performance gains that are due to this caching mechanism, not the efficiency of the data structure itself.

2.2.7. Sets

Dictionaries are great for associating values with unique keys, but sometimes we may simply want to have a collection of unique keys that we can access as efficiently as a dictionary, but without associating a value with each key. To do this, we can simply use a set data structure, which allows us to store an unordered collection of distinct elements.

Typical set operations include:

  • membership (\(e \in S\)): which tests whether a given element, \(e\), appears in the set \(S\),

  • subset (\(S \subseteq T\)): which tests whether every element in \(S\) also appears as an element in \(T\).

  • union (\(S \cup T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and/or \(e \in T\),

  • intersection (\(S \cap T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and \(e \in T\), and

  • difference (\(S - T\)): which yields a new set that contains the elements \(e\) where \(e \in S\) and \(e \notin T\).

Sets are sufficiently useful that Python provides a built-in set data structure.

We can define a set literal by surrounding the initial elements of the set with curly braces or using the built-in set function with a sequence (list, string, etc.) as the parameter:

>>> zipcodes = {"60637", "07974"}
>>> zipcodes
{'60637', '07974'}
>>> zipcodes = set(["60637", "07974"])
>>> zipcodes
{'60637', '07974'}
>>> vowels = set("aeiou")
>>> vowels
{'o', 'i', 'u', 'a', 'e'}

To construct an empty set, we use the set function with no arguments:

>>> empty_set = set()

We cannot use curly braces to create an empty set, because Python interprets {} as an empty dictionary.

We can determine the number of elements in a set using the len function:

>>> len(empty_set)
0
>>> len(zipcodes)
2
>>> len(vowels)
5

and we can test for set membership using the in operator:

>>> "14850" in zipcodes
False
>>> "60637" in zipcodes
True

We can determine whether one set is a subset of another set using the issubset method. Specifically, a call of the form S.issubset(T) tests whether the set S is a subset of the set T. For example:

>>> {"07974"}.issubset(zipcodes)
True
>>> {"60637", "14850"}.issubset(zipcodes)
False

Sets are mutable: we can add and remove elements. We can add elements to a set using the add method:

>>> zipcodes
{'60637', '07974'}
>>> zipcodes.add("14850")
>>> zipcodes.add("60637")
>>> zipcodes
{'60637', '07974', '14850'}

The second example illustrates an important point about sets: adding an element that is already a member of the set has no effect.

We can take elements out of a set using either the remove method or the discard method. The remove method removes a value from a set, if it is a member, and throws a KeyError exception if it is not. The discard method, in contrast, removes the value if it is a member of the set and does nothing if the value is not a member of the set.

>>> zipcodes
{'60637', '07974', '14850'}
>>> zipcodes.remove("60637")
>>> zipcodes.remove("60615")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '60615'
>>> zipcodes
{'07974', '14850'}
>>> zipcodes.discard("14850")
>>> zipcodes.discard("60615")
>>> zipcodes
{'07974'}

Using the add method and len function, we can easily write a function to determine how many different ZIP Codes had at least one contribution.

def compute_num_zipcodes(contributions):
    """
    Compute the number of zipcodes with at least one contribution

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
        contribution

    Returns (int): the number of zipcodes with at least one
      contribution.
    """

    zipcodes_seen = set()
    for contribution in contributions:
        zipcodes_seen.add(contribution["zip_code"])
    return len(zipcodes_seen)
>>> compute_num_zipcodes(contributions)
2

Alternatively, we can use set comprehensions, which are like dictionary comprehensions with only keys, to do the same computation:

def compute_num_zipcodes(contributions):
    """
    Compute the number of zipcodes with at least one contribution

    Args:
      contributions (List[Dict[str, (str | float)]]): one dictionary per
        contribution

    Returns (int): the number of zipcodes with at least one
      contribution
    """

    zipcodes_seen = {c["zip_code"] for c in contributions}
    return len(zipcodes_seen)
>>> compute_num_zipcodes(contributions)
2

Sets come with a variety of useful operations. We can compute a new set that is the union of two sets using the | operator or the union method, the intersection of two sets using the & operator or the intersection method, and the difference of two sets using the - operator or the difference method:

>>> zipcodes = {"60637", "07974"}
>>> zipcodes | {"60637", "60615"}
{'60637', '60615', '07974'}
>>> zipcodes.union({"60637", "60615"})
{'60637', '60615', '07974'}
>>> zipcodes & {"60637", "60615"}
{'60637'}
>>> zipcodes.intersection({"60637", "60615"})
{'60637'}
>>> zipcodes - {"60637", "60615"}
{'07974'}
>>> zipcodes.difference({"60637", "60615"})
{'07974'}

Finally, since sets are collections, we can iterate over the elements in a set:

>>> for zc in zipcodes:
...      print(zc)
... 
60637
07974

As with dictionaries, sets are unordered and so the elements will be printed in an arbitrary order.

2.2.7.1. Practice Problem

Problem 5

Write a function, count_unique_states_per_company, that takes a list of complaint dictionaries and returns a dictionary that maps a company name to the number of different states that had complaints about that company. For example, if a company had two complaints from Viginia and two from New Jersey, the result for that company would be two.

Hint: build an intermediate data structure that maps the name of a company to set a of the states that had at least one complaint for that company.

2.2.8. Practice Problem Solutions

Here is an example list of complaints that will be used in test code for the practice problems.

complaint_list = [
    {'Company': "Smith's Forge",
     'Date received': '07/29/2013',
     'Complaint ID': '468882',
     'Issue': 'Managing the loan or lease',
     'Product': 'Consumer Loan',
     'Consumer complaint narrative': '',
     'Company response to consumer': 'Closed with explanation',
     'State': 'VA'},

    {'Company': 'Acme Anvils',
     'Date received': '08/23/2013',
     'Complaint ID': '567783',
     'Issue': 'Not heavy enough',
     'Product': 'Anvil',
     'Consumer complaint narrative': '',
     'Company response to consumer': 'Closed without explanation',
     'State': 'IL'},
    
    {'Company': 'Acme Anvils',
     'Date received': '09/17/2013',
     'Complaint ID': '779986',
     'Issue': 'too heavy enough',
     'Product': 'Anvil',
     'Consumer complaint narrative': '',
     'Company response to consumer': 'Closed with explanation',
     'State': 'VA'}]

Problem 1

Here is a simple solution for this problem:

def count_complaints_about(complaints, company_name):
    """
    Count complaints about the specified company

    Args:
        complaints (List[Dict[str, str]) A list of complaints, where each 
            complaint is a dictionary
        company_name [str]: The company name to count complaints for

    Returns [int]: count of complaints
    """
    count = 0
    for c in complaints:
        if c["Company"] == company_name:
            count = count + 1
    return count

And here is some simple test code:

def test_count_complaints_about():
    """ Simple tests for count_complaints_about """

    # Test the empty list
    assert count_complaints_about([], "Smith's Forge") == 0

    # Test one element lists using a complaint about Wells Fargo.
    assert count_complaints_about([complaint_list[0]], "Smith's Forge") == 1
    assert count_complaints_about([complaint_list[0]], 'Acme Anvils') == 0

    # Test a list with multiple complaints
    longer_list = complaint_list + [complaint_list[0]] * 2
    assert count_complaints_about(longer_list, "Smith's Forge") == 3
    assert count_complaints_about(longer_list, 'Acme Anvils') == 2
    assert count_complaints_about(longer_list, 'Acme Sweaters') == 0

Problem 2

Here is a solution for this problem that uses the in operator:

def count_complaints_by_state(complaints):
    """
    Compute the number of complaints for each state.

    Args:
         complaints (List[Dict[str, str]]): A list of complaints, where each 
            complaint is a dictionary

    Returns: (Dict[str, int]): a dictionary that maps the name of a state to
      the number of complaints from that state.
    """
    d = {}
    for c in complaints:
        state = c["State"]
        if state in d:
            d[state] = d[state] + 1
        else:
            d[state] = 1
    return d

And here is a solution that uses the dictionary get method:

def count_complaints_by_state(complaints):
    """
    Compute the number of complaints for each state.

    Args:
         complaints (List[Dict[str, str]]): A list of complaints, where each 
            complaint is a dictionary

    Returns: (Dict[str, int]): a dictionary that maps the name of a state to
      the number of complaints from that state.
    """
    d = {}
    for c in complaints:
        state = c["State"]
        if state in d:
            d[state] = d[state] + 1
        else:
            d[state] = 1
    return d

And here is some simple test code:

def test_count_complaints_by_state():
    """ Simple tests for count_complaints_by_state """

    # Test the empty list
    assert count_complaints_by_state([]) == {}

    # Test one element lists using a complaint from VA
    assert count_complaints_by_state([complaint_list[0]]) == {'VA': 1}

    # Test a list with multiple complaints
    longer_list = complaint_list + [complaint_list[0]] * 2
    assert count_complaints_by_state(longer_list) == {'VA' : 4, 'IL' : 1}

Problem 3

>>> complaints_by_company = \
...   {"Smith's Forge": [{'Company': "Smith's Forge",
...                       'Date received': '07/29/2013',
...                       'Complaint ID': '468882',
...                       'Issue': 'Way way too hot',
...                       'Product': 'Consumer Loan',
...                       'Consumer complaint narrative': '',
...                       'Company response to consumer': 'Closed with explanation',
...                       'State': 'VA'}],
...    'Acme Anvils': [{'Company': 'Acme Anvils',
...                     'Date received': '08/23/2013',
...                     'Complaint ID': '567783',
...                     'Issue': 'Not heavy enough',
...                     'Product': 'Anvil',
...                     'Consumer complaint narrative': '',
...                     'Company response to consumer': 'Closed without explanation',
...                     'State': 'IL'},
...                    {'Company': 'Acme Anvils',
...                     'Date received': '09/17/2013',
...                     'Complaint ID': '779986',
...                     'Issue': 'too heavy',
...                     'Product': 'Anvil',
...                     'Consumer complaint narrative': '',
...                     'Company response to consumer': 'Closed with explanation',
...                     'State': 'VA'}]}
>>> len(complaints_by_company['Acme Anvils'])
2
>>> complaints_by_company["Smith's Forge"][0]
{'Company': "Smith's Forge", 'Date received': '07/29/2013', 'Complaint ID': '468882', 'Issue': 'Way way too hot', 'Product': 'Consumer Loan', 'Consumer complaint narrative': '', 'Company response to consumer': 'Closed with explanation', 'State': 'VA'}
>>> complaints_by_company['Acme Anvils'][1]
{'Company': 'Acme Anvils', 'Date received': '09/17/2013', 'Complaint ID': '779986', 'Issue': 'too heavy', 'Product': 'Anvil', 'Consumer complaint narrative': '', 'Company response to consumer': 'Closed with explanation', 'State': 'VA'}
>>> complaints_by_company['Acme Anvils'][0]['State']
'IL'

Problem 4

Here is a solution for this problem:

def organize_complaints_by_company(complaints):
    """
    Create a dictionary that maps the name of a company to a list of the
    complaint dictionaries that concern that company.

    Args:
        complaints (List[Dict[str, str]) A list of complaints, where each 
            complaint is a dictionary

    Returns: (Dict[str, List[Dict[str, str]]]): a dictionary that names the name
      of a company to a list of complaints for that company.
    """
    d = {}
    for complaint in complaints:
        c = complaint["Company"]
        if c not in d:
            d[c] = []
        # we are guaranteed that c will be in d at this
        # point and that d[c] will be a list.
        d[c].append(complaint)
    return d

And here is some simple test code:

def test_organize_complaints_by_company():
    """ Simple tests for organize_complaints_by_company """

    # Empty list of complaints
    assert organize_complaints_by_company([]) == {}

    # List with one complaint
    c = complaint_list[0]
    assert organize_complaints_by_company([c]) == {c["Company"] : [c]}

    
    # Test a list with multiple complaints
    longer_list = complaint_list
    assert organize_complaints_by_company(longer_list) == \
        {"Smith's Forge" : [complaint_list[0]],
         'Acme Anvils' : complaint_list[1:]}

Problem 5

Here is a solution for this problem:

def test_count_unique_states_per_company():
    """ Simple tests for count_unique_states_per_company """

    # Empty list of complaints
    assert count_unique_states_per_company([]) == {}

    # List with one complaint
    c = complaint_list[0]
    assert count_unique_states_per_company([c]) == {c["Company"] : 1}

    
    # Test a list with multiple complaints
    longer_list = complaint_list
    assert count_unique_states_per_company(longer_list) == \
        {"Smith's Forge" : 1,
         'Acme Anvils' : 2}

Our solution uses an intermediate data structure that maps the name of a company to a set of the states that had at least one complaint from that company. It then uses a dictionary comprehesion to compute the final result. Notice the use of the items method in the dictionary comprehension.

Here is some simple test code for this problem:

def test_count_unique_states_per_company():
    """ Simple tests for count_unique_states_per_company """

    # Empty list of complaints
    assert count_unique_states_per_company([]) == {}

    # List with one complaint
    c = complaint_list[0]
    assert count_unique_states_per_company([c]) == {c["Company"] : 1}

    
    # Test a list with multiple complaints
    longer_list = complaint_list
    assert count_unique_states_per_company(longer_list) == \
        {"Smith's Forge" : 1,
         'Acme Anvils' : 2}