
Commit b914c4e

feat(algorithms, hash-table): duplicate in file system
1 parent 28655bf commit b914c4e

3 files changed

Lines changed: 202 additions & 0 deletions


Lines changed: 79 additions & 0 deletions
# Find Duplicate File in System

Given a list `paths` of directory info, including the directory path and all the files with contents in this directory,
return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively
in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root
directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that
have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

## Examples

Example 1:

```text
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
```

Example 2:

```text
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
```

## Constraints

- 1 <= paths.length <= 2 * 10^4
- 1 <= paths[i].length <= 3000
- 1 <= sum(paths[i].length) <= 5 * 10^5
- paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
- You may assume no files or directories share the same name in the same directory.
- You may assume each given directory info represents a unique directory. A single blank space separates the directory
  path and file info.

Follow up:

- Imagine you are given a real file system, how will you search files? DFS or BFS?
- If the file content is very large (GB level), how will you modify your solution?
- If you can only read the file by 1kb each time, how will you modify your solution?
- What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part
  of it? How to optimize?
- How to make sure the duplicated files you find are not false positives?
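For the large-file follow-ups, one common answer (a sketch, not part of the committed solution) is to fingerprint file contents incrementally in fixed-size chunks, so only a small buffer is ever in memory; to rule out false positives from hash collisions, candidate duplicates can then be confirmed with a byte-by-byte comparison.

```python
import hashlib
from io import BytesIO

CHUNK_SIZE = 1024  # read 1 KB at a time, per the follow-up constraint


def content_fingerprint(stream) -> str:
    """Hash a file-like object incrementally so large files never fit in memory."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Streams with identical bytes produce the same fingerprint; differing bytes do not.
a = content_fingerprint(BytesIO(b"x" * 5000))
b = content_fingerprint(BytesIO(b"x" * 5000))
c = content_fingerprint(BytesIO(b"y" * 5000))
```

The fingerprint then plays the role of the `content` key in the hash map, instead of the raw file contents.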

## Topics

- Array
- Hash Table
- String

## Solution

First, we split each string in the given paths list to obtain the directory path, the file names, and their contents
separately. To find the files with duplicate contents, we use a hash map that stores entries of the form
(content, list_of_file_paths_with_this_content). For every file's contents, we check whether the same content already
exists in the map. If so, we append the current file's path to the list of files corresponding to that content.
Otherwise, we create a new entry in the map, with the current contents as the key and a single-element list (the current
file's path) as the value.

At the end, we collect the contents for which at least two file paths exist. The resultant list res is a list of lists,
each containing the file paths that share the same contents.
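The parsing step can be illustrated on a single entry from Example 1 (a minimal sketch using only `str.split`):

```python
entry = "root/a 1.txt(abcd) 2.txt(efgh)"

parts = entry.split()   # first token is the directory, the rest are "name(content)" pairs
directory = parts[0]    # "root/a"

pairs = []
for token in parts[1:]:
    name, content = token.split("(")       # e.g. "1.txt" and "abcd)"
    pairs.append((f"{directory}/{name}", content[:-1]))  # drop the trailing ")"
```

Here `pairs` holds (file_path, content) tuples, which is exactly what gets folded into the hash map.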

### Complexity Analysis

- Time complexity: O(n*x), where n strings of average length x are parsed.
- Space complexity: O(n*x); the map and res grow up to size n*x.
Lines changed: 37 additions & 0 deletions
from collections import defaultdict
from typing import Dict, List


def find_duplicate(paths: List[str]) -> List[List[str]]:
    if not paths:
        return []

    # Dictionary mapping file content to the list of file paths with that content
    file_map: Dict[str, List[str]] = defaultdict(list)

    for path in paths:
        values = path.split()
        directory = values[0]
        # Start at 1 to skip the directory name; iterate through each file in this directory
        for i in range(1, len(values)):
            # Split into file name and content
            name_content = values[i].split("(")
            file_name = name_content[0]
            # Extract the content part, dropping the trailing ")"
            content = name_content[1][:-1]

            # Construct the full file path
            file_path = f"{directory}/{file_name}"

            # Group this path with the other paths that share the same content
            file_map[content].append(file_path)

    result = []
    # Keep only the contents shared by more than one file
    for file_path_value in file_map.values():
        if len(file_path_value) > 1:
            result.append(file_path_value)

    return result
Lines changed: 86 additions & 0 deletions
import unittest
from typing import List

from parameterized import parameterized

from algorithms.hash_table.duplicate_file_in_system import find_duplicate

FIND_DUPLICATE_FILE_IN_SYSTEM_TEST_CASES = [
    (
        [
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)",
        ],
        [
            ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
            ["root/a/1.txt", "root/c/3.txt"],
        ],
    ),
    (
        [
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
        ],
        [["root/a/2.txt", "root/c/d/4.txt"], ["root/a/1.txt", "root/c/3.txt"]],
    ),
    (
        [
            "data/files 1.csv(data1)",
            "data/files/processed 2.csv(data2)",
            "data/files/backup 3.csv(data1)",
            "data/archives 4.csv(data3)",
            "data/archives/old 5.csv(data2)",
        ],
        [
            ["data/files/1.csv", "data/files/backup/3.csv"],
            ["data/files/processed/2.csv", "data/archives/old/5.csv"],
        ],
    ),
    (
        [
            "usr/local/bin 1.sh(scriptX) 2.sh(scriptY)",
            "usr/local/lib 3.sh(scriptZ)",
            "usr/local/share 4.sh(scriptX)",
            "usr/local/share/tools 5.sh(scriptY)",
        ],
        [
            ["usr/local/bin/1.sh", "usr/local/share/4.sh"],
            ["usr/local/bin/2.sh", "usr/local/share/tools/5.sh"],
        ],
    ),
    (
        [
            "documents/reports 1.pdf(report1) 2.pdf(report2)",
            "documents/presentations 3.pdf(presentation1)",
            "documents/reports/old 4.pdf(report1)",
            "documents/archive 5.pdf(archive1)",
        ],
        [["documents/reports/1.pdf", "documents/reports/old/4.pdf"]],
    ),
    (
        [
            "root/notes 1.md(note1) 2.md(note2)",
            "root/notes/meetings 3.md(note3)",
            "root/notes/summaries 4.md(note1)",
            "root/summaries 5.md(note4)",
            "root/summaries/meetings 6.md(note2)",
        ],
        [
            ["root/notes/1.md", "root/notes/summaries/4.md"],
            ["root/notes/2.md", "root/summaries/meetings/6.md"],
        ],
    ),
]


class FindDuplicateFileInSystemTestCase(unittest.TestCase):
    @parameterized.expand(FIND_DUPLICATE_FILE_IN_SYSTEM_TEST_CASES)
    def test_find_duplicate(self, paths: List[str], expected: List[List[str]]):
        actual = find_duplicate(paths)
        actual.sort()
        self.assertListEqual(sorted(expected), actual)


if __name__ == "__main__":
    unittest.main()
