# Find Duplicate File in System

Given a list `paths` of directory info strings, including the directory path and all the files with contents in this
directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any
order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively
in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root
directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that
have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

## Examples

Example 1:

```text
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
```

Example 2:

```text
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
```

## Constraints

- 1 <= paths.length <= 2 * 10^4
- 1 <= paths[i].length <= 3000
- 1 <= sum(paths[i].length) <= 5 * 10^5
- paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
- You may assume no files or directories share the same name in the same directory.
- You may assume each given directory info represents a unique directory. A single blank space separates the directory
  path and file info.

Follow up:

- Imagine you are given a real file system, how will you search files? DFS or BFS?
- If the file content is very large (GB level), how will you modify your solution?
- If you can only read the file by 1kb each time, how will you modify your solution?
- What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part
  of it? How to optimize?
- How to make sure the duplicate files you find are not false positives?
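For the large-file follow-ups, one common approach is to group candidate files by size first, then hash their contents incrementally (e.g. 1 KB at a time), and finally byte-compare any files whose hashes collide to rule out false positives. A hedged sketch of the incremental-hash step using Python's hashlib (the chunk size and function name are illustrative; file I/O is elided by hashing in-memory bytes):

```python
import hashlib

def chunked_digest(data: bytes, chunk_size: int = 1024) -> str:
    """Compute a SHA-256 digest by feeding content in 1 KB chunks,
    so the whole file never has to be held in memory at once."""
    h = hashlib.sha256()
    for offset in range(0, len(data), chunk_size):
        h.update(data[offset:offset + chunk_size])
    return h.hexdigest()
```

Because the hash is built incrementally, the same routine works when the file can only be read 1 KB at a time.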

## Topics

- Array
- Hash Table
- String

## Solution

First, split each string in the given `paths` list to obtain the directory path and each file's name and contents. To
find files with duplicate contents, use a hash map that stores entries of the form
(contents, list_of_file_paths_with_this_content). For every file, check whether its contents already exist as a key in
the map. If so, append the current file's path to the list for those contents; otherwise, create a new entry with the
contents as the key and a list containing only the current file's path as the value.

At the end, collect the contents for which at least two file paths exist. The result `res` is a list of lists, each
containing the file paths that share the same contents.
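
The steps above can be sketched in Python as follows (the function name `find_duplicate` is illustrative):

```python
from collections import defaultdict
from typing import List

def find_duplicate(paths: List[str]) -> List[List[str]]:
    groups = defaultdict(list)  # contents -> list of file paths
    for entry in paths:
        directory, *files = entry.split(" ")
        for file_info in files:
            # "1.txt(abcd)" -> name "1.txt", contents "abcd"
            name, _, rest = file_info.partition("(")
            groups[rest[:-1]].append(f"{directory}/{name}")
    # keep only contents shared by at least two files
    return [group for group in groups.values() if len(group) > 1]
```

On Example 1 above this produces the two expected groups (in some order), since the map key "abcd" collects two paths and "efgh" collects three.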
### Complexity Analysis

- Time complexity: O(n * x), since n strings of average length x are parsed.
- Space complexity: O(n * x). The map and the result list `res` can grow up to O(n * x) in size.