Hi everyone,
Let’s solve the COCI (Croatian Open Competition in Informatics) programming puzzle that I gave at the end of Last week’s post. Give it a try if you haven’t already done so…
This problem is a …
a …
a …
a hash table problem! In Chapter 1 of my book Algorithmic Thinking, I offered a clue for when hash tables might apply: it’s when you’re getting bogged down in repeated searches. (You can read Chapter 1 of the book for free.) I’ll move quickly here; I encourage you to read much more about hash functions and hash table design in Chapter 1.
I think it’s a bit tricky to see why this is a problem that requires lots of repeated searches. So let’s first step back and solve this thing without hash tables and see what happens.
We’re looking for all pairs of strings where the first string is a substring of the second string. (This corresponds to a password that’s a substring of another password.) We can implement this using two nested for-loops to consider each pair of strings in turn. For the substring check, we can use the C strstr
function, which searches its first parameter for an occurrence of its second parameter.
Here’s my code.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_passwords 20000
#define MAX_LENGTH 10
int main(void) {
static char passwords[MAX_passwords][MAX_LENGTH + 1];
int n, total, i, j;
scanf("%d", &n);
for (i = 0; i < n; i++)
scanf("%s", passwords[i]);
total = 0;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
if (i != j && strstr(passwords[j], passwords[i]))
total++;
printf("%d\n", total);
return 0;
}
The i != j
check (19) is important here. It prevents us from spuriously counting the fact that any user can log-in as themselves.
There are two nested for-loops here, both of which depend on n
(the number of passwords), so we’re looking at an $O(n^2)$ algorithm. This means that we’re doing about $n^2$ string searches. We can have up to 20000 passwords, so our program could do up to $20000^2 = 400000000$ string searches. The problem has a tight timeline of just 1 second. We’re not going to be able to do even close to 400 million string searches in that time. (As a very loose estimate, we might be able to do 100 million basic operations per second. Not only might we have to do more than this number of calls to strstr
, but what strstr
does is certainly not a basic operation, because it might have to try a lot of starting points before it finds a complete match.) Still, if you submit what we have, you should pass some test cases in time, giving us some confidence that we’re doing the right thing. We just need to do it faster.
Consider this test case from the problem description:
5
mir
mirta
ta
ir
t
In how many passwords is mir
a substring? To answer that, we consider each other string, performing a strstr
with it. Is mir
a substring of mirta
? Yes – and strstr
will tell us so. Is mir
a substring of ta
? No, and strstr
will tell us not. We continue, checking whether mir
is a substring of each other string.
And that’s where we lose, because this is too much searching. We’re wasting too much time searching for mir
over and over. After that, we do the same whole lot of searching for mirta
, then a whole lot of searching for ta
and so on.
Suppose instead that we knew in advance the number of other substrings in which mir
showed up… like someone just told us, “hey, the answer is 1.” Then we wouldn’t have to do all of those string searches; we could just look up the answer straight away.
That’s precisely what a hash table is going to allow us to do. We’re going to populate it with the number of times that each substring of each password shows up. Then, we’re going to just look up those answers, dodging the massive number of string searches that we were doing before.
We’ll start with an empty hash table. Then, we’ll go through each password, adding each of its substrings to the hash table. The hash table will keep track of the total number of occurrences of each substring.
For mir
, we’ll add an occurrence of each of these strings to the hash table:
m
mi
mir
i
ir
r
For mirta
, we’ll add an occurrence of each of these strings:
m
mi
mir
mirt
mirta
i
ir
irt
irta
r
rt
rta
t
ta
a
Some of these strings, like m
and mi
, now have counts of 2.
For ta
, we’ll add an occurrence of each of these strings:
t
ta
a
For ir
, we’ll add an occurrence of each of these strings:
i
r
And, finally, for t
, we add an occurrence of just:
t
There we go! That’s our hash table. Here’s a summary of all of its strings and counts:
string | count |
---|---|
a | 2 |
i | 3 |
ir | 3 |
irt | 1 |
irta | 1 |
m | 2 |
mi | 2 |
mir | 2 |
mirt | 1 |
mirta | 1 |
r | 3 |
rt | 1 |
rta | 1 |
t | 3 |
ta | 2 |
In how many passwords is mir
a substring? Check the hash table: it’s 2! Except that one of them is from mir
itself, so we don’t want to count that one. So we have just 1 other password in which mir
is a substring. Our total so far is 1.
How about mirta
: in how many other passwords is mirta
a substring? The hash table stores 1
for mirta
. We subtract 1 to conclude that we have 0 relevant strings here. So our total is still 1.
Continuing: ta
is a substring in 1 other password, ir
is a substring in 2 other passwords, and t
is a substring in 2 other passwords. Including our previous total of 1, we have a grand total of $1+1+2+2=6$. Our answer is 6!
Incidentally, you might notice that there’s a lot of garbage in this hash table, in the form of strings that we’ll never look up. For example, the string a
. No one has a password of a
, so we don’t care how many occurrences of a
there are. That’s OK though: we’ll just fill up the hash table, and then use only what we need.
Now let’s implement this strategy using hash tables.
Here’s my code.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_passwords 20000
#define MAX_LENGTH 10
typedef struct password_node {
char *password;
int total;
struct password_node *next;
} password_node;
// A simple hash function (suffices for this problem)
int hash(char *password) {
int i = 0;
int result = 1;
while (password[i] != '\0') {
result = (result * 26 + password[i]) % MAX_passwords;
i++;
}
return result;
}
// Return hash table node for password; add it if it isn't already there
password_node *find(password_node *hash_table[], char *password) {
int password_code;
password_node *passwordptr;
password_code = hash(password);
passwordptr = hash_table[password_code];
while (passwordptr) {
if (strcmp(passwordptr->password, password) == 0)
return passwordptr;
passwordptr = passwordptr->next;
}
// Password isn't in hash table. Add it
password_node *node = malloc(sizeof(password_node)); // elided error checking
node->password = malloc(MAX_LENGTH + 1);
strcpy(node->password, password);
node->total = 0;
node->next = hash_table[password_code];
hash_table[password_code] = node;
return node;
}
// Return true iff search is in first num_elements of array
int in_array(char array[][MAX_LENGTH + 1], int num_elements, char *search) {
int i;
for (i = 0; i < num_elements; i++)
if (strcmp(array[i], search) == 0)
return 1;
return 0;
}
int main(void) {
static password_node *hash_table[MAX_passwords] = {NULL};
static char passwords[MAX_passwords][MAX_LENGTH + 1];
char temp[MAX_LENGTH + 1];
char substrings[MAX_LENGTH * MAX_LENGTH][MAX_LENGTH + 1];
int n, total, i, len, start, end, num_substrings;
scanf("%d", &n);
for (i = 0; i < n; i++) {
scanf("%s", passwords[i]);
len = strlen(passwords[i]);
num_substrings = 0;
// Loop through endpoints of all possible substrings
for (start = 0; start < len; start++)
for (end = start; end < len; end++) {
strncpy(temp, &passwords[i][start], end - start + 1);
temp[end - start + 1] = '\0';
if (!in_array(substrings, num_substrings, temp)) {
strcpy(substrings[num_substrings], temp);
num_substrings++;
password_node *node = find(hash_table, temp);
node->total++;
}
}
}
// Hash table is ready. Now add up occurrences of each password
total = 0;
for (i = 0; i < n; i++)
total = total + find(hash_table, passwords[i])->total - 1;
printf("%d\n", total);
return 0;
}
In the book, I used a wowzer of a hash function written by Bob Jenkins that blends the keys and distributes them quite well. Here, we can get away with a simpler hash function (15) that for each character multiplies the current result by 26 and then adds the new character. My code using this hash function passes the time limit, so let’s stick with this.
Next, we have a find
function (28) that searches the hash table for a password. If it finds it (36), it returns the associated node. If it doesn’t find it (41), then it means that this is the first time we’re seeing this password, so we add it with a total count of 0
(the code in the main
function will make sure to increment this to 1
).
What’s up with that in_array
function – what’s it for? Well, consider a password like aaa
. The substring a
shows up 3 times in there, but we only want to count it once. If we counted it three times, we’d be saying that a
was a substring of three passwords, when of course that isn’t true. So we use in_array
to check whether we’ve already considered a given substring of the current password.
Those are all of the pieces we need for the main
function (62), so let’s talk about main
next.
In addition to the hash table itself (63), I also have an array to keep track of all of the passwords (64). We also need an array to keep track of the substrings of the current password (66), so that we can correctly call in_array
. How big should the substrings
array be – how many substrings might it have to hold? Well, each password is at most 10 characters, there are 10 places where a substring can start, and there are 10 places where a substring can end. So a safe upper bound here is to use MAX_LENGTH * MAX_LENGTH
or 100
.
The rest of the code works in two phases. First, we add all substrings of all passwords to the hash table (75). Once that’s done, we go through each password, adding up the number of occurrences in the hash table (89). Remember that we need to subtract 1 each time (92) so that we don’t count the fact that every password is a substring of itself.
If you submit this solution, you should see that it passes all test cases within the time limit. But…
Remember that we’re trying to do better than $O(n^2)$ time here. You might be skeptical at this point. Because, check it out: we have not only two nested loops, but three (70)(76)(77)! Three! Have e somehow made things even worse than before?
No. Because only one of these loops depends on $n$. The other two are puny loops that depend only on the length of the current password. And each password is only at most 10 characters. Focusing only on $n$, we have an expected $O(n)$ algorithm here: we take $O(n)$ steps to populate the hash table and another $O(n)$ steps to go through the passwords and look up their totals.
I have another hash table problem for you, if you’d like. I won’t go through it in next week’s post – I have something else planned – but I’d encourage you to give it a try for extra practice. Here it is: the Cakey McCakeFace problem from ICPC (International Collegiate Programming Contest) SWERC 2017.
Please feel free to contact me if you have further ideas, requests, etc. I hope these posts are helpful and I’d like to continue to make them better.
I also have a newsletter where I send occasional updates. Please see my Contact and Newsletter page if you’d like to subscribe.
Written on October 21st , 2020 by Daniel Zingaro