@@ -16,16 +16,20 @@ def i2b(x: int) -> bytes:
 def compute_histogram(path: str) -> dict[bytes, int]:
     """Frequency of every 2-byte bigram in the file at ``path``."""
     # Step 1: read the whole file into memory as a single bytes object.
-    counts = [0 for _ in range(0, 2 ** 16)]
+    counts = [0 for _ in range(2 ** 16)]
 
     source = open(path, "rb", buffering=0)
     data = mmap(source.fileno(), 0, access=ACCESS_READ)
 
     # Step 2: slide a 2-byte window across the buffer. For ``b"ABCD"`` the
     # iterations produce ``b"AB"``, ``b"BC"``, then ``b"CD"``. For each window,
     # bump the matching bucket in a ``dict`` keyed by the bigram itself.
+    previous = data[0]
     for i in range(len(data) - 1):
-        bigram = b2i(data[i], data[i + 1])
-        counts[bigram] += 1
+        current = data[i + 1]
+        counts[current + (previous << 8)] += 1
+        previous = current
 
-    return {i2b(idx): value for idx, value in enumerate(counts) if value != 0}
+    return {
+        i2b(idx): value for idx, value in enumerate(counts) if value != 0
+    }
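
For reviewers who want to try the new loop in isolation, here is a minimal sketch of the same rolling-window idea. It is not the code from this change: `bigram_histogram` is a hypothetical name, it takes an in-memory `bytes` value instead of an `mmap`, and it assumes `i2b` is equivalent to `int.to_bytes(2, "big")`, which the diff does not show. The guard for inputs shorter than two bytes is also an addition for the sketch.

```python
def bigram_histogram(data: bytes) -> dict[bytes, int]:
    """Count every overlapping 2-byte window in ``data`` (hypothetical helper)."""
    # One bucket per possible 16-bit value, indexed by the bigram as an integer.
    counts = [0] * (2 ** 16)

    # Assumption for this sketch: fewer than two bytes means no bigrams at all.
    if len(data) < 2:
        return {}

    # Carry the previous byte forward so each iteration reads one new byte
    # instead of indexing the buffer twice.
    previous = data[0]
    for i in range(1, len(data)):
        current = data[i]
        # The first byte of the window is the high byte, the second the low byte.
        counts[(previous << 8) | current] += 1
        previous = current

    # Assumed stand-in for ``i2b``: big-endian 2-byte encoding of the index.
    return {
        idx.to_bytes(2, "big"): value
        for idx, value in enumerate(counts)
        if value != 0
    }
```

Calling `bigram_histogram(b"ABCD")` yields one count each for `b"AB"`, `b"BC"`, and `b"CD"`, matching the windows described in the Step 2 comment.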