*Source: this section is not based on external sources*.

As we have seen in the previous section, lists can be used to store collections of data of arbitrary length. Often one is faced with the problem of finding information in a list. Consider the following example. We are given a list of computers in a large IT department:

computer_names = ["apple", "pear", "cherry", "banana", "mango", "grape", "peach"]

(Clearly, this IT department has drawn inspiration from fruit names.)

If a new computer needs to be installed, the IT department needs to check whether a computer with a given name (for instance, "pear" or "orange") already exists. How do we check this? In Python there is an easy answer to this question. We can write code such as this:

if "pear" in computer_names: print ( "Pear exists ") else: print ( "Pear does not exist")

Easy, right?

Actually, not always.

To understand the weaknesses of this code, it is important
to have an understanding of **algorithms**. An
algorithm is an unambiguous specification of how to solve a class of problems,
such as the problem of finding a name in a list of names.
Algorithms for finding elements in lists are often called **search algorithms**.

How can a computer find a name in a list? The simplest approach is the following:

def linear_search ( name, list_of_names ): for list_name in list_of_names: if name == list_name: return True return False

Underneath, the `in` statement in Python works similar to the above function.
Consequently, this line of code:

if name in list_of_names:

will give you the same result as this code:

if linear_search(list, list_of_names):

What are the drawbacks of the `linear_search` function (as well as the `in` statement)?
Let us execute our function on some examples to understand this.
Let us first consider the example

if linear_search("apple", computer_names):

In this example, our algorithm is very fast. The first name that we will take from the list
is `apple`. We will find out that this name is equal to the name we are looking for.
The linear search function will immediately return the value `True`.

Let us consider a second example.

if linear_search("orange", computer_names):

The situation is very different in this case. Our algorithm will first look at `apple`;
as we are not looking for apples, it will continue with `pear`, then `cherry`, ... and so on,
till we arrive at the end of the list. The algorithm will then return `False`.

Hence, in the worst case, our algorithm will need to look at all elements in the list. If our list has only 9 elements, this is not problematic.

However, suppose that we would use this algorithm to search among all names of users on
Facebook (2.3 billion in total). In that case the algorithm would take very long.
If we would be looking for a name that is not present on Facebook, we
would have to retrieve all 2.3 billion names from the list before we can answer
`True` or `False`!

As in the worst case we would need to look at every element in the list one
after the other, this algorithm is known as a **linear search algorithm**.
The number of elements that we look at in the worst case is a linear function
of the number of elements in the list.

This raises the question whether we can do better. Fortunately, in some cases we can.

Assume that we would have stored our computer names in a sorted order:

sorted_computer_names = ["apple", "banana", "cherry", "grape", "mango", "peach", "pear"]

Note that we can also obtain such a sorted list using the `sorted` function:

computer_names = ["apple", "pear", "cherry", "banana", "mango", "grape", "peach"] sorted_computer_names = sorted ( computer_names )

On these sorted lists, we can use an approach that is inspired by how we search for words in a dictionary: we start with the middle page of the dictionary, and based on the words at that page, decide where we continue looking in the dictionary. The following code in Python shows this approach:

def binary_search ( name, list_of_names ): first = 0 last = len(list_of_names)-1 found = False while first<=last and not found: middle = (first + last)//2 if list_of_names[middle] == name: found = True else: if name < list_of_names[middle]: last = middle-1 else: first = middle+1 return found

This code is a little bit more complex than the code we have seen till now!

Let's start with a simple case to understand how this code works.
Consider `binary_search("grape",sorted_computer_names)`. Before the start
of the `while` loop, the value for `last` is 6. As `0<6`, we
calculate the `middle` next. The outcome of this is 3. This yields the following
situation:

first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

We retrieve the 3rd value of `sorted_computer_names` next, which happens
to be equal to `grape`. At this moment, the code stops and returns `True`.

Let's continue with a more complex case: `binary_search("orange",sorted_computer_names)`.
Similar to the previous case, we first check `list_of_names[3]`. Again,
we are in the following situation:

first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

Here, we
determine that `"orange" >="grape"`. We set `first=4`, giving the following
situation:

middle first last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

Basically, the program has decided at this moment that if the word `orange` is
present in the sorted list, it must be at position 4 or higher.

As `last` is still 6, and `4<=6`, the loop continues. We calculate a new
`middle`: `(4 + 6) // 2`, which yields the value 5. This gives this situation:

first middle last +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

The name at position 5 is
`peach`. As `"orange"< "peach"`, the loop continues;
we set `last = 5-1`, as `orange` must be somewhere before `peach` in the list:

As `first<=last`, we continue:

first last middle +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

We test whether `"mango"=="orange"`; as `"mango" < "orange"` we obtain
the following situation:

last first middle +-------+--------+--------+-------+-------+-------+------+ | apple | banana | cherry | grape | mango | peach | pear | +-------+--------+--------+-------+-------+-------+------+ 0 1 2 3 4 5 6

In words, the algorithm believes now that if `orange` is in the list, it must be after
`mango`, at position 5 or higher. As at this moment `last<first`, the
algorithm stops and returns `False`.

It is instructive to consider the number of elements in the list that the algorithm
compared with `orange`:

grape peach mango

In total, only 3 elements were considered! This is significantly less than in our earlier algorithm.

At this moment, it can be instructive to use the algorithm to search for
other elements in the sorted list. You will see that on this list, the
`binary_search` algorithm will *never* look at more than 3 elements.

One can wonder how many elements would be considered in the worst case for a sorted list of any other length. To understand how we can generalize our results to other lists, let us first consider a list of 15 elements:

first middle last +---+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+ | | | | | | | | | | | | | | | | | +---+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

What would our algorithm do in this case? There are three cases:

- We stop, as the element we are looking for is in the middle. Clearly, this is not the worst case.
- We move
`last`to position 7; in this case we are effectively continuing the algorithm on a part of the list having 7 elements. - We move
`first`to position 9; in this case we are effectively continuing the algorithm on a part of the list having 7 elements as well.

In the last two cases, we know that the worst case is that we are looking at 3 elements. In total, we hence will look at 4 elements in the worst case.

We can continue this argument. On a list of 31 elements, the worst case number of elements considered will be 5. On a list of 63 elements, it will be 6:

7 3 15 4 31 5 63 6 127 7 255 8 ... ... 4294967295 32

In general, the pattern is this: for a list of length

2^{k} − 1

in the worst case k elements are considered.

As a result, if we have more than 4 billion names in a sorted list, we only need to look at 32 of these names in the worst case to determine whether a given name is present in the list!

This is an enormous improvement over the linear search algorithm.

This improvement was the consequence of a careful consideration of two aspects:

- The organisation of data: we did not put the elements in the list in an arbitrary order; the elements had to be sorted.
- The algorithm operating on the data: we tuned our algorithm such that it worked much better for the chosen organisation of the data.

These two aspects are the topics studied in courses on *algorithms* and *data structures*.
A good understanding of algorithms and data structures can in practice help a lot
to develop programs that work efficiently. You will be able to study these
topics in more detail in later courses.

The code that we saw in the previous section is useful if we want to test that a given element occurs in a sorted list. We can use this two write programs such as

if binary_search ( computer_name, sorted_computer_names ): print ( "Welcome to our system" ) else: print ( "The specified computer does not exist")

In other words, programs in which we only wish to test the presence of a given element.

Often, however, one does not only want to check that a given computer exists, but one also wants to retrieve information associated with the computer, such as its operating system.

Using nested data structures, we could store such associated information in a sorted list as follows:

sorted_computer_names_os = [("apple","MacOS"), ("banana","Linux"), ("cherry","Linux"), \ ("grape", "MacOS"), ("mango", "Windows"), ("peach", "Windows"), \ ("pear", "Linux")]

Hence, our list now consists of tuples, where the first part of the tuple stores the computer name and the second part stores the operating system.

We cannot use our existing `binary_search` to look for the data associated with
a given name. In our `binary_search` algorithm, we perform a test
like

if list_of_names[middle] == name:

If our lists consists of tuples, `list_of_names[middle]` will be a tuple,
such as `("apple","MacOS")`. We can only use our current implementation to
search for complete tuples such as `("apple", "MacOS")`.

The following modification does allow us to retrieve associated information:

def binary_search_retrieve ( name, list_of_names ): first = 0 last = len(list_of_names)-1 found = False while first<=last and not found: middle = (first + last)//2 middle_name, middle_info = list_of_names[middle] if middle_name == name: found = True else: if name < middle_name: last = middle-1 else: first = middle+1 if found: return middle_info else: return None

Observe that in this code, we assume that our list now contains tuples,
each of which we can unpack. If the element we are looking for exists,
we return the associated value; otherwise, we made the choice to
return the special value `None`. We can now write code such as:

if binary_search_retrieve ( computer_name, sorted_computer_names_os ) == "Windows": print ( "Welcome to our system" ) else: print ( "Your machine does not exist or your operating system is not supported")

While in this example, we associated one string with a computer name, nothing limits us to associate more complex information.

sorted_computer_names_os_cc = [("apple",("MacOS","BE")), ("banana",("Linux","BE")), ("cherry",("Linux","FR")), \ ("grape", ("MacOS","FR")), ("mango", ("Windows","NL")), ("peach", ("Windows", "DE")), \ ("pear", ("Linux","DE"))]

We can still use `binary_search_retrieve` to write programs such as this:

result = binary_search_retrieve ( computer_name, sorted_computer_names_os_cc ) if result is not None: os, country = result print ( "Welcome! Your operating system is " + os + " and your country " + country ) else: print ( "Your computer is not known")

Our binary search algorithm only works for data that is sorted. Assume we are given more complex
data. How we can sort this data? Fortunately, the `sorted` function that we saw earlier, also works on
tuples. In this case, it will use a lexicographical order, in which the order between
two tuples is determined by the second element in the tuple if the first elements are equal.

>>> unsorted_numbers = [ (3,4), (6,3), (1,2), (3,5), (6,2) ] >>> sorted (unsorted_numbers) [(1, 2), (3, 4), (3, 5), (6, 2), (6, 3)]

Note that the order of information in a tuple is important when using the `sort` function.
If the information is not in the correct order, one solution is to recreate the data
in the desired order. For instance, to sort our `computer_names` on operating system, we
can write

>>>> sorted ( [ ((cnoc[1][0], cnoc[1][1]), cnoc[0]) for cnoc in sorted_computer_names_os_cc ] ) [(('Linux', 'BE'), 'banana'), (('Linux', 'DE'), 'pear'), (('Linux', 'FR'), 'cherry'), \ (('MacOS', 'BE'), 'apple'), (('MacOS', 'FR'), 'grape'), (('Windows', 'DE'), 'peach'), \ (('Windows', 'NL'), 'mango')]

Recreating the data just to order it differently may seem a little complex. Furthermore,
our indexing (`cnoc[1][0]`) becomes cumbersome and hard to read.
We will see later in the course that there is an alternative solution to this
problem.

In our explanation, we made our lives easy by making a number of implicit assumptions:

- Our lists contain every element only once;
- Our lists had lengths 7, 15, 31, ...

It is important to ask yourself what the code would do if these restrictions no longer hold. We will not discuss these questions in detail in this reference. However, as small exercises consider doing the following:

- Apply the
`binary_search`algorithm on a list of a different length. - Consider modifying the
`binary_search_retrieve`function such that it returns a list of all values associated with a given name.

- algorithm
- An unambiguous specification of how to solve a class of problems.
- binary search
- A search algorithm that searches by repeatedly splitting a list in two parts
- linear search
- A search algorithm that considers all elements of a list in the worst case
- search algorithm
- An algorithm for searching an element that fulfills a well-defined set of requirements