Search Algorithms

Search Algorithms

Source: this section is not based on external sources.

Binary Search on Complex Data structures

The code that we saw in the previous section is useful if we want to test that a given element occurs in a sorted list. We can use this two write programs such as

if binary_search ( computer_name, sorted_computer_names ):
  print ( "Welcome to our system" )
else:
  print ( "The specified computer does not exist")

In other words, programs in which we only wish to test the presence of a given element.

Often, however, one does not only want to check that a given computer exists, but one also wants to retrieve information associated with the computer, such as its operating system.

Using nested data structures, we could store such associated information in a sorted list as follows:

sorted_computer_names_os = [("apple","MacOS"), ("banana","Linux"), ("cherry","Linux"), \
                            ("grape", "MacOS"), ("mango", "Windows"), ("peach", "Windows"), \
                            ("pear", "Linux")]

Hence, our list now consists of tuples, where the first part of the tuple stores the computer name and the second part stores the operating system.

We cannot use our existing binary_search to look for the data associated with a given name. In our binary_search algorithm, we perform a test like

if list_of_names[middle] == name:

If our lists consists of tuples, list_of_names[middle] will be a tuple, such as ("apple","MacOS"). We can only use our current implementation to search for complete tuples such as ("apple", "MacOS").

The following modification does allow us to retrieve associated information:

def binary_search_retrieve ( name, list_of_names ):
  first = 0
  last = len(list_of_names)-1
  found = False

  while first<=last and not found:
    middle = (first + last)//2
    middle_name, middle_info = list_of_names[middle]
      if middle_name == name:
      found = True
    else:
      if name < middle_name:
        last = middle-1
      else:
        first = middle+1

  if found:
    return middle_info
  else:
    return None

Observe that in this code, we assume that our list now contains tuples, each of which we can unpack. If the element we are looking for exists, we return the associated value; otherwise, we made the choice to return the special value None. We can now write code such as:

if binary_search_retrieve ( computer_name, sorted_computer_names_os ) == "Windows":
  print ( "Welcome to our system" )
else:
  print ( "Your machine does not exist or your operating system is not supported")

While in this example, we associated one string with a computer name, nothing limits us to associate more complex information.

sorted_computer_names_os_cc = [("apple",("MacOS","BE")), ("banana",("Linux","BE")), ("cherry",("Linux","FR")), \
                               ("grape", ("MacOS","FR")), ("mango", ("Windows","NL")), ("peach", ("Windows", "DE")), \
                               ("pear", ("Linux","DE"))]

We can still use binary_search_retrieve to write programs such as this:

result = binary_search_retrieve ( computer_name, sorted_computer_names_os_cc )
if result is not None:
  os, country = result
  print ( "Welcome! Your operating system is " + os + " and your country " + country )
else:
  print ( "Your computer is not known")

Our binary search algorithm only works for data that is sorted. Assume we are given more complex data. How we can sort this data? Fortunately, the sorted function that we saw earlier, also works on tuples. In this case, it will use a lexicographical order, in which the order between two tuples is determined by the second element in the tuple if the first elements are equal.

>>> unsorted_numbers = [ (3,4), (6,3), (1,2), (3,5), (6,2) ]
>>> sorted (unsorted_numbers)
[(1, 2), (3, 4), (3, 5), (6, 2), (6, 3)]

Note that the order of information in a tuple is important when using the sort function. If the information is not in the correct order, one solution is to recreate the data in the desired order. For instance, to sort our computer_names on operating system, we can write

>>>> sorted ( [ ((cnoc[1][0], cnoc[1][1]), cnoc[0]) for cnoc in sorted_computer_names_os_cc ] )
[(('Linux', 'BE'), 'banana'), (('Linux', 'DE'), 'pear'), (('Linux', 'FR'), 'cherry'), \
(('MacOS', 'BE'), 'apple'), (('MacOS', 'FR'), 'grape'), (('Windows', 'DE'), 'peach'), \
(('Windows', 'NL'), 'mango')]

Recreating the data just to order it differently may seem a little complex. Furthermore, our indexing (cnoc[1][0]) becomes cumbersome and hard to read. We will see later in the course that there is an alternative solution to this problem.

Corner cases

In our explanation, we made our lives easy by making a number of implicit assumptions:

  • Our lists contain every element only once;
  • Our lists had lengths 7, 15, 31, ...

It is important to ask yourself what the code would do if these restrictions no longer hold. We will not discuss these questions in detail in this reference. However, as small exercises consider doing the following:

  • Apply the binary_search algorithm on a list of a different length.
  • Consider modifying the binary_search_retrieve function such that it returns a list of all values associated with a given name.

Glossary

algorithm
An unambiguous specification of how to solve a class of problems.
binary search
A search algorithm that searches by repeatedly splitting a list in two parts
linear search
A search algorithm that considers all elements of a list in the worst case
search algorithm
An algorithm for searching an element that fulfills a well-defined set of requirements