A table is used to store elements. An element may be inserted in the table, searched for (looked up) in the table, and sometimes deleted from the table. The idea of a table is very general and important in information processing systems. For example, a university data base might contain tables of departments, staff, courses, lectures, students, exam results. A compiler might use tables of keywords, variables, types, constants, procedures and functions. A table is a mutable data structure, it has state, it changes with time. Its properties are therefore dynamic and are expressed by pre and postconditions (assertions):
Clear, insert and delete change the state of the table.
When the table has just been cleared,
lookup returns false for any element.
When an element is added to the table,
lookup returns true for that element
and the contents are otherwise unaffected.
If a new element is added to the table and then deleted
there is no net effect on the table.
Two further situations need to be considered.
If an element is in the table and an attempt is made
to insert it again, the result could either be an
Sometimes the element type consists of a key plus some other attributes or data. The key might be a number or a name (string); it is something unique to identify an element. In this case the lookup operation usually returns the associated attributes rather than just true or false. If there is no extra data we call the element itself the key.
It is important for a program to be efficient: to use little space and to run quickly. How these aims can be achieved depends on the properties of the data and particularly the key, on the operations on the table and on the frequency and pattern of use of the different operations.
We say that the key type is sparse if the key values in use form a very small fraction of the number of possible values in the type. We call it a dense type if most of the possible values come into play. These are only loose definitions and each case must be judged on its merits. For example, if keys are alpha-numeric strings they are necessarily sparse because there are infinitely many possible strings. Even if the string length is limited but reasonably large there are a very great many strings. On the other hand, the days of the week form a dense enumerated type. The sparseness of (some) key-types has a large influence on the method of choice for implementing a table. Indeed many of the data structures described in this book have been devised precisely to deal with sparse types efficiently.
A table where the element is all key can also be thought of as a set. If the element type is a dense index type then the table can be implemented as a bit mask. Lookup is done by the membership test, in, addition and deletion by set union and difference. There is a limit of the range of element values in some programming languages the provide a built-in `set' data-type. A large, dense set can be implemented by an array of boolean so this can also be used to implement the corresponding table. Conversely, a sparse set can be implemented by the methods for tables described below.
A table to hold a limited number of elements with a sparse key can be implemented using an array and a counter. Initially the table is empty and the counter is set to zero. When a new element is inserted the counter is incremented and the new element stored at that index. The elements are in no particular order other than that of time of insertion. Lookup is by the linear-search algorithm which steps sequentially through the array.
A person does not use linear search to find a name in a telephone directory, except perhaps for Dustin Hoffman's character in `The Rain Man'. Rather the directory is opened at a likely place, and the key compared with names on that page. If the key is alphabetically before the names on the page, the search is continued in the front part of the directory otherwise the back part. The fact that the directory is alphabetically ordered or sorted is essential to this process. The binary-search algorithm makes use of an ordering of the key type in a similar way. The table is implemented by a sorted array. Two "pointers" are used to delimit the search area, initially the occupied part of array. The middle element is compared with the key and either the lower pointer moved up or the upper pointer moved down as appropriate.
Change the data in the HTML FORM below, click `go', and experiment. The region of the array from `lo' to `hi-1' is [bracketted] and a[mid] is starred `*' in the trace:
The algorithm is short but there are some subtle points to observe. Firstly, if the key is present it lies between `lo' inclusive and `hi' exclusive - this is the loop invariant. The invariant is true at the start of the first iteration because lo=1 and hi=n+1. The loop body ensures that the invariant remains true at the start of each subsequent iteration. This is so because of the choice of test in the central if statement. If the mid element is less than or equal to the key and lo is moved up to mid the invariant is still true. The converse is that the mid element is greater than the key and when hi is moved down the invariant is still true. Note the connection between the hi-1 in the invariant and the `>=' test. When hi=lo+1 the search has narrowed to just one element, A[lo..hi-1] = A[lo..lo] = A[lo], which can be compared to the search key for equality.
The algorithm terminates because the two integer-valued pointers are moving closer together and the exit test must eventually succeed. If hi>=lo+1 then hi>lo+2 and so lo<mid<hi. Therefore regardless of whether `lo' is moved up to mid or hi is moved down to mid, lo and hi move closer together at each iteration and closer to satisfying the termination test.
It is very easy to get the details of binary search wrong when coding it so it is a good idea to test it on at least the following cases:
Finally, the algorithm takes O(log(n)) time as each iteration halves the search area. Note that it is not worth testing for equality within the loop and exiting. To do so would add an extra test and time to every iteration. The average number of iterations saved would be limited by
Binary-search takes O(log(n)) time in all cases which is much faster than linear search. If the table does not change once created, ie. there is no insertion or deletion, it can be sorted once in O(n*log(n)) time; see the chapter on sorting. Insertion and deletion must keep the array sorted and they take O(n) time at worst and on average which is relatively slow. This is because insertion and deletion to the left of position n require elements to be moved up or down to make space or to fill the hole. Using a sorted array may have an extra benefit if the elements are also processed sequentially, for example to print alphabetical lists.
For an apparently crazy idea hashing works remarkably well. A hash function maps key values onto the index type of an array which implements the hash table. Elements are accessed directly using the hash index as an array index. The surprising part is that the best hash functions are essentially pseudo-random functions. The ideal hash function maps the actual key values uniformly onto the index type; unfortunately the ideal is not always achieved.
A reasonable hash function for strings is given below.
It is impossible to design a good hash function for all circumstances. In general it is necessary to analyse the statistical properties of the key type and design a hash function to suit. For example, if a hash table is used to manage a table of identifiers in a compiler, (some) programmers are fond of names like x1, x2, x3, y1, ..., z1, ... some of which are liable to hash to the same index. Such an occurrence is called a collision. Collisions are in fact inevitable because the key type is infinite and cannot otherwise be squeezed into a finite index range. The solution is to hash the key and examine the table at that index. If a different key is already at that position then a collision has occurred. A probing strategy then looks at a sequence of positions until a blank is found.
Linear probing is the simplest probing strategy. It looks at successive elements, modulo the table size, until a blank is found. The new element is inserted there.
Quadratic probing is a more sophisticated probing strategy.
Simulations can give an estimate of the performance of a hash table. The time taken by insertion and lookup is dependent on the number of collisions. The results given here are based on randomly generated keys with a uniform distribution. This is the best possible case; typical results may therefore be somewhat worse.
If the number of entries in the table is liable to grow, a common strategy is to increase the table size when the number of entries reaches a certain percentage of the table's current capacity or when the collision rate reaches some critical level. Elements may have to be re-hashed when moved into the enlarged table. The table may also be shrunk if the number of entries falls below some limit.
Another way of dealing with collisions is known as chaining. Here, each hash table entry contains a pointer to a list of colliding elements. The lists are created in a collection. Deletion is more easily implemented under chaining. A linear search is used within a list. Provided the hash function is good, each list is kept short and the operations are fast. See the chapter on lists for the necessary techniques.
The hash function randomises the order in which elements are stored in the hash table. This makes it difficult to process the elements sequentially and may rule against the use of a hash table in some applications.
Various tree-based data-structures can be used to implement tables efficiently, e.g.
L.A. ©, Department of Computer Science U.W.A., Perth, W.A., 1984-86.