# Alien Dictionary

You are given a list of words written in an alien language, where the words are sorted lexicographically by the rules of this language. Surprisingly, the aliens also use English lowercase letters, but possibly in a different order.

Given a list of words written in the alien language, return a string of unique letters sorted in the lexicographical order of the alien language, as derived from the list of words.

If there is no solution, that is, no valid lexicographical ordering, return an empty string "".

If multiple valid orderings exist, you may return any of them.

> Note: A string a is considered lexicographically smaller than a string b if:
> 1. At the first position where they differ, the character in a comes before the character in b in the alien alphabet.
> 2. If one string is a prefix of the other, the shorter string is considered smaller.

## Constraints

- 1 <= `words.length` <= 10^3
- 1 <= `words[i].length` <= 20
- All characters in `words[i]` are English lowercase letters
## Examples
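As a concrete illustration, here is a small Python checker (our own helper, not part of the solution itself) that verifies whether a proposed alphabet is consistent with a sorted word list. It uses the word list and one valid ordering from the walkthrough in the solution below:

```python
def is_valid_ordering(words, order):
    """Check that `words` is sorted according to the alphabet `order`."""
    rank = {ch: i for i, ch in enumerate(order)}
    # Every letter in the words must appear in the proposed alphabet.
    if any(ch not in rank for word in words for ch in word):
        return False
    for first, second in zip(words, words[1:]):
        for c1, c2 in zip(first, second):
            if c1 != c2:
                if rank[c1] > rank[c2]:
                    return False
                break
        else:
            # No differing letter: the first word must not be longer
            # than the second (a prefix always comes first).
            if len(first) > len(second):
                return False
    return True

words = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu",
         "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
print(is_valid_ordering(words, "omuzxqwvsar"))  # True: one valid alien alphabet
```

Note that other orderings can also pass the check; the problem allows returning any of them.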
## Solution

We can solve this problem using the topological sort pattern. Topological sort finds a linear ordering of elements that have dependencies on, or priority over, one another. For example, if A depends on B, or B has priority over A, then B is listed before A in the topological order.

Using the list of words, we identify the relative precedence of the letters and generate a graph to represent this ordering. We can then traverse the graph with breadth-first search to find the letters’ order.

We can essentially map this problem to a graph problem, but before exploring the exact details of the solution, there are a few things that we need to keep in mind:
| 42 | + |
| 43 | +1. The letters within a word don’t tell us anything about the relative order. For example, the word “educative” in the list |
| 44 | + doesn’t tell us that the letter “e” is before the letter “d.” |
| 45 | + |
| 46 | +2. The input can contain words followed by their prefix, such as “educated” and then “educate.” These cases will never result |
| 47 | + in a valid alphabet because in a valid alphabet, prefixes are always first. We need to make sure our solution detects |
| 48 | + these cases correctly. |
| 49 | +3. There can be more than one valid alphabet ordering. It’s fine for our algorithm to return any one of them. |
| 50 | +4. The output dictionary must contain all unique letters within the words list, including those that could be in any position |
| 51 | + within the ordering. It shouldn’t contain any additional letters that weren’t in the input. |
| 52 | + |
### Step-by-step solution construction

We can break this graph problem into three parts:

1. Extract the necessary information to identify the dependency rules from the words. For example, in the words [“patterns”, “interview”], the letter “p” comes before “i.”
2. Put these dependency rules into a directed graph, with the letters as nodes and the dependencies (order) as the edges.
3. Topologically sort the graph nodes to generate the letter ordering (dictionary).

Let’s look at each part in more depth.
#### Part 1: Identifying the dependencies

Let’s start with example words and observe the initial ordering through simple reasoning:

`["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]`

Just as in an English dictionary, where all the words starting with “a” come first, followed by the words starting with “b,” “c,” “d,” and so on, we can expect the first letters of the words to appear in alphabetical order:

`["m", "m", "x", "x", "x", "x", "s", "s", "s", "s", "r", "r", "r"]`

Removing the duplicates, we get the following:

`["m", "x", "s", "r"]`
Following this intuition, we can assume that these first letters are in alphabetical order:

`m -> x -> s -> r`

We now know the relative order of these four letters, but we don’t know how they fit in with the rest of the letters. To get more information, we need to look further into our English dictionary analogy. The word “dirt” comes before “dorm” because, when the first letters are the same, we compare the second letters, and “i” comes before “o” in the alphabet.

We can apply the same logic to our alien words and look at the first two words, “mzosr” and “mqov.” As the first letter is the same in both words, we look at the second letter. The first word has “z,” and the second one has “q.” Therefore, we can safely say that “z” comes before “q” in this alien language. We now have two fragments of the letter order:

`m -> x -> s -> r` and `z -> q`

> Note: Notice that we didn’t mention rules such as “m -> a”. This is fine because we can derive this relation from “m -> x” and “x -> a”.

This is it for the first part. Let’s put the pieces that we have in place.

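The comparison logic above can be sketched in Python. The function name `extract_relations` and its return convention are our own, not from the original lesson; it walks each adjacent pair of words, records the first differing pair of letters, and flags the invalid word-followed-by-its-prefix case:

```python
def extract_relations(words):
    """Derive one precedence rule from each adjacent pair of words.

    Returns a set of (before, after) letter pairs, or None when a word
    is followed by its own prefix (no valid alphabet can exist).
    """
    relations = set()
    for first, second in zip(words, words[1:]):
        for c1, c2 in zip(first, second):
            if c1 != c2:
                relations.add((c1, c2))  # c1 comes before c2
                break
        else:
            # No differing letter was found: "educated" before "educate"
            # style input is invalid.
            if len(first) > len(second):
                return None
    return relations

words = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu",
         "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
print(sorted(extract_relations(words)))  # the nine unique relations
```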
#### Part 2: Representing the dependencies

We now have a set of relations describing the relative order of pairs of letters:

`["z -> q", "m -> x", "x -> a", "x -> v", "x -> s", "z -> x", "v -> a", "s -> r", "o -> w"]`

Now the question arises: how can we put these relations together? It might be tempting to start chaining them, for example, `m -> x -> a`, `m -> x -> v -> a`, and `z -> x -> s -> r`.

We can observe that some letters appear in more than one chain, so putting the chains into the output list one after the other won’t work; some letters would be duplicated, resulting in an invalid ordering. Let’s visualize the relations with a graph instead. The nodes are the letters, and an edge from letter “x” to letter “y” represents that “x” comes before “y” in the alien words.

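As a concrete picture of that graph, the nine relations can be written down as an adjacency list in Python (the variable name `adj` is ours):

```python
# Adjacency list for the example relations. Letters with no outgoing
# edges (including the isolated letter "u") still get an empty entry so
# that every node in the graph is represented.
adj = {
    "z": {"q", "x"},
    "m": {"x"},
    "x": {"a", "v", "s"},
    "v": {"a"},
    "s": {"r"},
    "o": {"w"},
    "q": set(), "w": set(), "a": set(), "r": set(), "u": set(),
}
assert sum(len(targets) for targets in adj.values()) == 9  # one edge per relation
```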
#### Part 3: Generating the dictionary

As we can see from the graph, four of the letters have no incoming edges. This means that no letter has to come before any of these four.

> Remember: There could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.

Therefore, a valid start to the ordering we return would be as follows:

`["o", "m", "u", "z"]`

We can now remove these letters and their edges from the graph, because any other letters that required them first will now have this requirement satisfied.

Three letters on this reduced graph now have no incoming edges. We can add these to our output list:

`["o", "m", "u", "z", "x", "q", "w"]`

Again, we can remove these from the graph.

Then, we add the two new letters with no incoming edges:

`["o", "m", "u", "z", "x", "q", "w", "v", "s"]`

This leaves the final two letters. We can place them in our output list and return the ordering:

`["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]`

Let’s now review how we can implement this approach.

Identifying the dependencies and representing them in the form of a graph is pretty straightforward: we extract the relations and insert them into an adjacency list.

Next, we need to generate the dictionary from the extracted relations by repeatedly identifying the letters (nodes) with no incoming links. Determining whether a particular letter (node) has any incoming links can be a little complicated with our adjacency list format. A naive approach is to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.

This naive method would be fine for our case, but perhaps we can do it more optimally.

An alternative is to keep two adjacency lists:

- One with the same contents as the one above, holding the outgoing links.
- One reversed, showing the incoming links.

This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list.

Can we do better than this? Instead of tracking all the incoming links of each letter, we can track only how many incoming edges there are. We keep the in-degree count of every letter along with the forward adjacency list.

> In-degree is the number of incoming edges of a node.

Now, we can decrement the in-degree count of a node instead of removing an edge from a reverse adjacency list. When the in-degree of a node reaches 0, that node has no incoming links left.

We perform BFS on all the letters that are reachable, that is, the letters whose in-degree count is zero. A letter only becomes reachable once all the letters that need to come before it have been added to the output.

We use a queue to keep track of the reachable nodes. Initially, we put in the letters that have an in-degree count of zero. We keep adding letters to the queue as their in-degree counts become zero.

We continue this until the queue is empty. Next, we check whether all the letters in the words have been added to the output. If some letters are missing, they still have incoming edges left, which means there is a cycle, and we return an empty string.

> Remember: There can be letters that don’t have any incoming edges. This can result in different orderings for the same set of words, and that’s all right.

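The BFS just described (Kahn’s algorithm) can be sketched over the example graph. The variable names `adj`, `in_degree`, `queue`, and `result` are our own:

```python
from collections import deque

# Forward adjacency list for the example, plus in-degree counts derived
# from it.
adj = {"z": {"q", "x"}, "m": {"x"}, "x": {"a", "v", "s"}, "v": {"a"},
       "s": {"r"}, "o": {"w"}, "q": set(), "w": set(), "a": set(),
       "r": set(), "u": set()}
in_degree = {node: 0 for node in adj}
for targets in adj.values():
    for target in targets:
        in_degree[target] += 1

# Start from the sources: letters with no incoming edges.
queue = deque(node for node in adj if in_degree[node] == 0)
result = []
while queue:
    node = queue.popleft()
    result.append(node)
    for nxt in adj[node]:
        in_degree[nxt] -= 1
        if in_degree[nxt] == 0:  # all its predecessors are in the output
            queue.append(nxt)

# A cycle would leave some letters unprocessed, so check the length.
order = "".join(result) if len(result) == len(adj) else ""
print(order)  # one valid ordering of all 11 letters
```

Because the sources can be dequeued in any order, different runs may print different, equally valid, orderings.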
### Solution summary

To recap, the solution to this problem can be divided into the following parts:

1. Build a graph from the given words and keep track of the in-degree of each letter in a hash map.
2. Add the sources (letters with an in-degree of zero) to a result list.
3. Remove the sources and update the in-degrees of their children. If the in-degree of a child becomes 0, it’s the next source.
4. Repeat until all letters are covered.

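Putting the three parts together, a complete sketch of the approach might look like this in Python (the function name `alien_order` is our choice; the lesson doesn’t fix one):

```python
from collections import deque

def alien_order(words):
    # Parts 1 and 2: build the adjacency list and in-degree counts.
    adj = {ch: set() for word in words for ch in word}
    in_degree = {ch: 0 for ch in adj}
    for first, second in zip(words, words[1:]):
        for c1, c2 in zip(first, second):
            if c1 != c2:
                if c2 not in adj[c1]:  # avoid counting a relation twice
                    adj[c1].add(c2)
                    in_degree[c2] += 1
                break
        else:
            # A word followed by its own prefix: no valid alphabet exists.
            if len(first) > len(second):
                return ""
    # Part 3: BFS (topological sort) from the zero in-degree sources.
    queue = deque(ch for ch in in_degree if in_degree[ch] == 0)
    result = []
    while queue:
        ch = queue.popleft()
        result.append(ch)
        for nxt in adj[ch]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    # If a cycle kept some letters above zero in-degree, return "".
    return "".join(result) if len(result) == len(adj) else ""
```

On the example word list from Part 1, this returns an 11-letter ordering such as the one derived above; on input like `["educated", "educate"]`, or on contradictory input, it returns the empty string.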
### Time Complexity

There are three parts to the algorithm:

- Identifying all the relations.
- Putting them into an adjacency list.
- Converting it into a valid alphabet ordering.

In the worst case, the identification and initialization parts require checking every letter of every word, which is O(c), where c is the total length of all the words in the input list added together.

For the generation part, recall that a breadth-first search has a cost of O(v + e), where v is the number of vertices and e is the number of edges. Our algorithm has the same cost as BFS because it visits each edge and node once.

> Note: Here, a node is visited once all of its edges are visited, unlike traditional BFS, where it’s visited once any edge reaches it.

Therefore, determining the cost of our algorithm requires counting the nodes and edges in the graph.

**Nodes**: There is one vertex for each unique letter, that is, O(u) vertices, where u is the total number of unique letters in the words. While u is limited to 26 in our case, we still look at how it would impact the complexity if this weren’t so.

**Edges**: We generate each edge in the graph by comparing two adjacent words in the input list. There are n−1 pairs of adjacent words, and at most one edge can be generated from each pair, where n is the total number of words in the input list. We can again look at the English dictionary analogy to make sense of this:

"dirt"
"dorm"

The only conclusion we can draw is that “i” comes before “o”; this is why "dirt" appears before "dorm" in an English dictionary. The remaining letters, “rt” and “rm,” are irrelevant for determining the alphabetical ordering.

> Remember: We only generate rules from adjacent words and don’t add the “implied” rules to the adjacency list.

So with this, we know that there are at most n−1 edges.

We can place one additional upper limit on the number of edges, since it’s impossible to have more than one edge between each pair of nodes. With u nodes, this means there can’t be more than u^2 edges.

Because the number of edges is bounded by both n−1 and u^2, it is at most the smaller of the two: min(u^2, n−1).

We can now substitute the number of nodes and the number of edges into our breadth-first search cost:

- v = u
- e = min(u^2, n−1)

This gives us the following:

> O(v + e) = O(u + min(u^2, n−1)) = O(u + min(u^2, n))

Finally, we combine the three parts: O(c) for the first two parts and O(u + min(u^2, n)) for the third part. Since the parts are independent, we can add them and look at the final formula to see whether we can identify any relation between the terms:

> O(c) + O(u + min(u^2, n)) = O(c + u + min(u^2, n))

So, what do we know about the relative values of n, c, and u? Both n, the total number of words, and u, the total number of unique letters, are at most c, the total number of letters, because each word contains at least one letter and there can’t be more unique letters than there are letters.

We know that c is the biggest of the three, but we don’t know the relation between n and u.

Since the u term is insignificant compared to c, we can simplify:

> O(c + u + min(u^2, n)) → O(c + min(u^2, n))

Let’s now consider two cases to simplify it a little further:

- If u^2 is smaller than n, then min(u^2, n) = u^2. We have established that n is at most c, so u^2 < n <= c, and u^2 is definitely less than c. This leaves us with O(c).
- If u^2 is larger than n, then min(u^2, n) = n. Because n <= c, we’re again left with O(c).

So in all cases, min(u^2, n) <= c, which gives us a final time complexity of O(c).

### Space Complexity

The adjacency list uses O(v + e) memory. As explained in the time complexity analysis, e is at most min(u^2, n), so in general the adjacency list takes O(u + min(u^2, n)) space, and that is the space complexity for an arbitrarily large number of letters. For our use case, however, u is fixed at a maximum of 26 and the number of relations at 26^2, so O(u + min(u^2, n)) = O(26 + 26^2) = O(1).