Hi everyone,
I want to start by saying thank you to everyone for supporting Algorithmic Thinking 2/e. The book is resonating with many people. I’m grateful for the opportunity to help people learn and succeed in their programming endeavors.
Of course, every book has tradeoffs – I’ve really been enjoying these kinds of discussions with readers, and the book was even discussed on Hacker News.
But now, the main thing I wanted to discuss in this post: making our own hash functions!
Chapter 1 starts with the Unique Snowflakes problem. (Yep – still my favourite hash tables problem!) We solved it by devising our own hash function that we wrote from scratch.
Oh oh – don’t worry if you don’t have the book. You can download Chapter 1 right now.
That hash function we came up with hashed a snowflake by adding up the values of its arms. Feel free to re-read our discussion in the chapter to remind you of why we did this but, in brief, we needed to put “possibly identical” snowflakes together in the same bucket, so that we didn’t waste time comparing snowflakes that are obviously not identical.
Why didn’t we use any old hash function here instead of coming up with our own? It’s because our definition of “identical snowflakes” is more complex than just “exactly equal”. Typical hash functions hash exactly equal elements into the same bucket and try to spread everything else around. For example, if we just plucked some hash function from some book or website, it would likely hash snowflake 1 2 3 4 5 6
into a different bucket than snowflake 6 5 4 3 2 1
. That would be bad because identical snowflakes like that are supposed to end up in the same bucket; if they end up in different buckets, then our code in Listing 1-12 is not going to find that they’re identical.
Once we finished with the Unique Snowflakes problem and learned all about hash tables, we moved on to solve the Login Mayhem problem. That one’s all about passwords – and for two passwords to be “identical”, they have to be exactly the same characters in exactly the same order. And because our definition of “identical” here is “exactly equal”, this time we can use whatever hash function we want. In the book, I used that oaat (one-at-a-time) hash function designed by Bob Jenkins (a hash function expert, unlike me!).
Some readers have been asking, though: “can’t we design our own function for Login Mayhem instead of using someone else’s pre-built hash function?”
Answer: we can! I didn’t do it in the book because I wanted to show what an industry-strength hash function looks like and how to use it. But in the spirit of “we’re going to build everything from scratch”, I probably should have included it in the Appendix.
Let’s remedy that now. :)
OK. Now we’re working with passwords instead of snowflakes. So, we’re working with strings of lowercase characters rather than arrays of six integers. That’s OK, though: we can still do essentially the same thing we did for the snowflake hashcode function. Instead of adding up the values of the snowflake arms and then using mod, we’re going to add the characters of the password and then use mod.
“Add the characters”? Yep! For example, we’ll think of a
as 0, b
as 1, c
as 2, and so on. To make that happen, we can take each character and subtract 'a'
from it. (We used a similar trick in Listing 2-16 in Chapter 2.)
That hash function would look like this:
What’s the hashcode for hello
? Well, h
is 7, e
is 4, l
is 11, l
is 11 again, and o
is 14. Add those up and we get 47.
What’s the hashcode for olelh
? That’s a different password than hello
, so we would hope its hashcode is different, too. Unfortunately, it’s still 47, and that’s because the sum of its characters is the same as hello
! The characters are jumbled, sure, but our hash function doesn’t care about that: it just adds them up.
And that’s the problem with this hash function. There are too many collisions. Our hash function is using only the characters of a password, not their location in the password.
If you build a solution to the Login Mayhem problem using this hash function and submit it, you’ll see that the programming judge agrees: we need to do better. This code leads to too many collisions and is therefore too slow.
Suppose you had a stream of digits coming in and had to combine them into a single number. How would you do it?
Let’s suppose that the digits that come in are 3, 2, 1, 4, 5. Here’s what we do:
See that trick? Multiply by 10 each time and then add the new digit.
Think of that 32145 as our hashcode for the digits 3, 2, 1, 4, 5.
The best part of this is that rearranging the digits changes the hashcode that we get. For example, if the digits come in the order 5, 4, 3, 2, 1, then we get 54321. Not the same hashcode as 32145!
Now, in the realm of our passwords, we’re dealing with lowercase characters a-z, not digits 0-9. That’s OK: instead of multiplying by 10, we’ll multiply by 26! This maintains the nice fact that different rearrangements of the same letters will result in different hashcodes.
Our new hash function is almost the same as before, just with the new multiply by 26:
This hash function is good enough to pass the test cases in time!
Here’s the full solution code that includes our own hash function. It’s otherwise nearly identical to the code in the book!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_PASSWORD 10
#define SIZE 1000000
typedef struct password_node {
char password[MAX_PASSWORD + 1];
int total;
struct password_node *next;
} password_node;
unsigned long hash(char *password) {
unsigned long hash = 0, i;
for (i = 0; i < strlen(password); i++)
hash = hash * 26 + (password[i] - 'a');
return hash % SIZE;
}
password_node *in_hash_table(password_node *hash_table[], char *find) {
unsigned password_code;
password_node *password_ptr;
password_code = hash(find);
password_ptr = hash_table[password_code];
while (password_ptr) {
if (strcmp(password_ptr->password, find) == 0)
return password_ptr;
password_ptr = password_ptr->next;
}
return NULL;
}
void add_to_hash_table(password_node *hash_table[],
char *find) {
unsigned password_code;
password_node *password_ptr;
password_ptr = in_hash_table(hash_table, find);
if (!password_ptr) {
password_code = hash(find);
password_ptr = malloc(sizeof(password_node));
if (password_ptr == NULL) {
fprintf(stderr, "malloc error\n");
exit(1);
}
strcpy(password_ptr->password, find);
password_ptr->total = 0;
password_ptr->next = hash_table[password_code];
hash_table[password_code] = password_ptr;
}
password_ptr->total++;
}
int already_added(char all_substrings[][MAX_PASSWORD + 1],
int total_substrings, char *find) {
int i;
for (i = 0; i < total_substrings; i++)
if (strcmp(all_substrings[i], find) == 0)
return 1;
return 0;
}
int main(void) {
static password_node *hash_table[SIZE] = {NULL};
int num_ops, op, op_type, i, j;
char password[MAX_PASSWORD + 1], substring[MAX_PASSWORD + 1];
password_node *password_ptr;
int total_substrings;
char all_substrings[MAX_PASSWORD * MAX_PASSWORD][MAX_PASSWORD + 1];
scanf("%d", &num_ops);
for (op = 0; op < num_ops; op++) {
scanf("%d%s", &op_type, password);
if (op_type == 1) {
total_substrings = 0;
for (i = 0; i < strlen(password); i++)
for (j = i; j < strlen(password); j++) {
strncpy(substring, &password[i], j - i + 1);
substring[j - i + 1] = '\0';
if (!already_added(all_substrings, total_substrings, substring)) {
add_to_hash_table(hash_table, substring);
strcpy(all_substrings[total_substrings], substring);
total_substrings++;
}
}
} else {
password_ptr = in_hash_table(hash_table, password);
if (!password_ptr)
printf("0\n");
else
printf("%d\n", password_ptr->total);
}
}
return 0;
}
Now, in practice, I do recommend using a strong and tested hash function – but now we’ve closed the loop on this problem and built our hash function from scratch :)
Written on March 3rd , 2024 by Daniel Zingaro