Source: this section is not based on external sources.
As we have seen in the previous section, lists can be used to store collections of data of arbitrary length. Often one is faced with the problem of finding information in a list. Consider the following example. We are given a list of computers in a large IT department:
computer_names = ["apple", "pear", "cherry", "banana", "mango", "grape", "peach"]
(Clearly, this IT department has drawn inspiration from fruit names.)
If a new computer needs to be installed, the IT department needs to check whether a computer with a given name (for instance, "pear" or "orange") already exists. How do we check this? In Python there is an easy answer to this question. We can write code such as this:
if "pear" in computer_names: print ( "Pear exists ") else: print ( "Pear does not exist")
Easy, right?
Actually, not always.
To understand the weaknesses of this code, it is important to have an understanding of algorithms. An algorithm is an unambiguous specification of how to solve a class of problems, such as the problem of finding a name in a list of names. Algorithms for finding elements in lists are often called search algorithms.
How can a computer find a name in a list? The simplest approach is the following:
def linear_search ( name, list_of_names ): for list_name in list_of_names: if name == list_name: return True return False
Underneath, the in statement in Python works similar to the above function. Consequently, this line of code:
if name in list_of_names:
will give you the same result as this code:
if linear_search(list, list_of_names):
What are the drawbacks of the linear_search function (as well as the in statement)? Let us execute our function on some examples to understand this. Let us first consider the example
if linear_search("apple", computer_names):
In this example, our algorithm is very fast. The first name that we will take from the list is apple. We will find out that this name is equal to the name we are looking for. The linear search function will immediately return the value True.
Let us consider a second example.
if linear_search("orange", computer_names):
The situation is very different in this case. Our algorithm will first look at apple; as we are not looking for apples, it will continue with pear, then cherry, ... and so on, till we arrive at the end of the list. The algorithm will then return False.
Hence, in the worst case, our algorithm will need to look at all elements in the list. If our list has only 9 elements, this is not problematic.
However, suppose that we would use this algorithm to search among all names of users on Facebook (2.3 billion in total). In that case the algorithm would take very long. If we would be looking for a name that is not present on Facebook, we would have to retrieve all 2.3 billion names from the list before we can answer True or False!
As in the worst case we would need to look at every element in the list one after the other, this algorithm is known as a linear search algorithm. The number of elements that we look at in the worst case is a linear function of the number of elements in the list.
This raises the question whether we can do better. Fortunately, in some cases we can.
Assume that we would have stored our computer names in a sorted order:
sorted_computer_names = ["apple", "banana", "cherry", "grape", "mango", "peach", "pear"]
Note that we can also obtain such a sorted list using the sorted function:
computer_names = ["apple", "pear", "cherry", "banana", "mango", "grape", "peach"] sorted_computer_names = sorted ( computer_names )
On these sorted lists, we can use an approach that is inspired by how we search for words in a dictionary: we start with the middle page of the dictionary, and based on the words at that page, decide where we continue looking in the dictionary. The following code in Python shows this approach:
def binary_search ( name, list_of_names ): first = 0 last = len(list_of_names)-1 found = False while first<=last and not found: middle = (first + last)//2 if list_of_names[middle] == name: found = True else: if name < list_of_names[middle]: last = middle-1 else: first = middle+1 return found
This code is a little bit more complex than the code we have seen till now!
Let's start with a simple case to understand how this code works. Consider binary_search("grape",sorted_computer_names). Before the start of the while loop, the value for last is 6. As 0<6, we calculate the middle next. The outcome of this is 3. This yields the following situation:
first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
We retrieve the 3rd value of sorted_computer_names next, which happens to be equal to grape. At this moment, the code stops and returns True.
Let's continue with a more complex case: binary_search("orange",sorted_computer_names). Similar to the previous case, we first check list_of_names[3]. Again, we are in the following situation:
first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
Here, we determine that "orange" >="grape". We set first=4, giving the following situation:
middle first last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
Basically, the program has decided at this moment that if the word orange is present in the sorted list, it must be at position 4 or higher.
As last is still 6, and 4<=6, the loop continues. We calculate a new middle: (4 + 6) // 2, which yields the value 5. This gives this situation:
first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
The name at position 5 is peach. As "orange"< "peach", the loop continues; we set last = 5-1, as orange must be somewhere before peach in the list:
first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
As first<=last, we continue:
first last middle +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
We test whether "mango"=="orange"; as "mango" < "orange" we obtain the following situation:
last first middle +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6
In words, the algorithm believes now that if orange is in the list, it must be after mango, at position 5 or higher. As at this moment last<first, the algorithm stops and returns False.
It is instructive to consider the number of elements in the list that the algorithm compared with orange:
grape peach mango
In total, only 3 elements were considered! This is significantly less than in our earlier algorithm.
At this moment, it can be instructive to use the algorithm to search for other elements in the sorted list. You will see that on this list, the binary_search algorithm will never look at more than 3 elements.
One can wonder how many elements would be considered in the worst case for a sorted list of any other length. To understand how we can generalize our results to other lists, let us first consider a list of 15 elements:
first middle last +---+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+ | | | | | | | | | | | | | | | | | +---+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
What would our algorithm do in this case? There are three cases:
In the last two cases, we know that the worst case is that we are looking at 3 elements. In total, we hence will look at 4 elements in the worst case.
We can continue this argument. On a list of 31 elements, the worst case number of elements considered will be 5. On a list of 63 elements, it will be 6:
7 3 15 4 31 5 63 6 127 7 255 8 ... ... 4294967295 32
In general, the pattern is this: for a list of length
in the worst case k elements are considered.
As a result, if we have more than 4 billion names in a sorted list, we only need to look at 32 of these names in the worst case to determine whether a given name is present in the list!
This is an enormous improvement over the linear search algorithm.
This improvement was the consequence of a careful consideration of two aspects:
These two aspects are the topics studied in courses on algorithms and data structures. A good understanding of algorithms and data structures can in practice help a lot to develop programs that work efficiently. You will be able to study these topics in more detail in later courses.
The code that we saw in the previous section is useful if we want to test that a given element occurs in a sorted list. We can use this two write programs such as
if binary_search ( computer_name, sorted_computer_names ): print ( "Welcome to our system" ) else: print ( "The specified computer does not exist")
In other words, programs in which we only wish to test the presence of a given element.
Often, however, one does not only want to check that a given computer exists, but one also wants to retrieve information associated with the computer, such as its operating system.
Using nested data structures, we could store such associated information in a sorted list as follows:
sorted_computer_names_os = [("apple","MacOS"), ("banana","Linux"), ("cherry","Linux"), \ ("grape", "MacOS"), ("mango", "Windows"), ("peach", "Windows"), \ ("pear", "Linux")]
Hence, our list now consists of tuples, where the first part of the tuple stores the computer name and the second part stores the operating system.
We cannot use our existing binary_search to look for the data associated with a given name. In our binary_search algorithm, we perform a test like
if list_of_names[middle] == name:
If our lists consists of tuples, list_of_names[middle] will be a tuple, such as ("apple","MacOS"). We can only use our current implementation to search for complete tuples such as ("apple", "MacOS").
The following modification does allow us to retrieve associated information:
def binary_search_retrieve ( name, list_of_names ): first = 0 last = len(list_of_names)-1 found = False while first<=last and not found: middle = (first + last)//2 middle_name, middle_info = list_of_names[middle] if middle_name == name: found = True else: if name < middle_name: last = middle-1 else: first = middle+1 if found: return middle_info else: return None
Observe that in this code, we assume that our list now contains tuples, each of which we can unpack. If the element we are looking for exists, we return the associated value; otherwise, we made the choice to return the special value None. We can now write code such as:
if binary_search_retrieve ( computer_name, sorted_computer_names_os ) == "Windows": print ( "Welcome to our system" ) else: print ( "Your machine does not exist or your operating system is not supported")
While in this example, we associated one string with a computer name, nothing limits us to associate more complex information.
sorted_computer_names_os_cc = [("apple",("MacOS","BE")), ("banana",("Linux","BE")), ("cherry",("Linux","FR")), \ ("grape", ("MacOS","FR")), ("mango", ("Windows","NL")), ("peach", ("Windows", "DE")), \ ("pear", ("Linux","DE"))]
We can still use binary_search_retrieve to write programs such as this:
result = binary_search_retrieve ( computer_name, sorted_computer_names_os_cc ) if result is not None: os, country = result print ( "Welcome! Your operating system is " + os + " and your country " + country ) else: print ( "Your computer is not known")
Our binary search algorithm only works for data that is sorted. Assume we are given more complex data. How we can sort this data? Fortunately, the sorted function that we saw earlier, also works on tuples. In this case, it will use a lexicographical order, in which the order between two tuples is determined by the second element in the tuple if the first elements are equal.
>>> unsorted_numbers = [ (3,4), (6,3), (1,2), (3,5), (6,2) ] >>> sorted (unsorted_numbers) [(1, 2), (3, 4), (3, 5), (6, 2), (6, 3)]
Note that the order of information in a tuple is important when using the sort function. If the information is not in the correct order, one solution is to recreate the data in the desired order. For instance, to sort our computer_names on operating system, we can write
>>>> sorted ( [ ((cnoc[1][0], cnoc[1][1]), cnoc[0]) for cnoc in sorted_computer_names_os_cc ] ) [(('Linux', 'BE'), 'banana'), (('Linux', 'DE'), 'pear'), (('Linux', 'FR'), 'cherry'), \ (('MacOS', 'BE'), 'apple'), (('MacOS', 'FR'), 'grape'), (('Windows', 'DE'), 'peach'), \ (('Windows', 'NL'), 'mango')]
Recreating the data just to order it differently may seem a little complex. Furthermore, our indexing (cnoc[1][0]) becomes cumbersome and hard to read. We will see later in the course that there is an alternative solution to this problem.
In our explanation, we made our lives easy by making a number of implicit assumptions:
It is important to ask yourself what the code would do if these restrictions no longer hold. We will not discuss these questions in detail in this reference. However, as small exercises consider doing the following:
- algorithm
- An unambiguous specification of how to solve a class of problems.
- binary search
- A search algorithm that searches by repeatedly splitting a list in two parts
- linear search
- A search algorithm that considers all elements of a list in the worst case
- search algorithm
- An algorithm for searching an element that fulfills a well-defined set of requirements