Commit 71b4b99

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize (#7585)
* shuf: Move NonrepeatingIterator to own module

* shuf: correctness: Flush output after writing

  This is important since the output is buffered and errors may end up ignored otherwise.

  `shuf -e a b c > /dev/full` now errors while it didn't before.

* shuf: perf: Bump output buffer to 64KB

* shuf: correctness: Do not use panics to report --random-source read errors

* shuf: perf: Use itoa for integer formatting

  This gives a 1.8× speedup over a stdlib formatted write for `shuf -r -n1000000 -i1-1024`.

  The original version of this commit replaced a formatted write, but before it got merged main received optimized manual formatting from another PR. The speedup of itoa over the manual write is around 1.1×, much less dramatic.

* shuf: correctness: Make --random-source compatible with GNU shuf

  When the --random-source option is used, uutils shuf now gives identical output to GNU shuf in many (but not all) cases. This is helpful to users who use it to get deterministic output, e.g. by combining it with `openssl` as suggested in the GNU info pages.

  I reverse engineered the algorithm from GNU shuf's output. There may be bugs.

  Other modes of shuffling still use `rand`'s `ThreadRng`, though they now sample a uniform distribution directly without going through the slice helper trait.

  Additionally, switch from `usize` to `u64` for `--input-range` and `--head-count`. This way the same range of numbers can be generated on 32-bit platforms as on 64-bit platforms.

* shuf: feature: Add --random-seed option

  This adds a new option to get reproducible output from a seed. This was already possible with --random-source, but doing that properly was tricky and had poor performance.

  Adding this option implies a commitment to keep using the exact same algorithms in the future. For that reason we only use third-party libraries for well-known algorithms and implement our own distributions on top of that.

  -----

  As a teenager on King's Day I once used `shuf` for divination. People paid €0.50 to enter a cramped tent and sat down next to me behind an old netbook. I would ask their name and their sun sign and pipe this information into `shuf --random-source=/dev/stdin`, which selected pseudo-random dictionary words and `tee`d them into `espeak`.

  If someone's name was too short, `shuf` crashed with an end of file error. --random-seed would have worked better.

* shuf: correctness: Use Fisher-Yates for nonrepeating integers

  We used to use a clever homegrown way to sample integers. But GNU shuf with --random-source observably uses Fisher-Yates, and the output of the old version depended on a heuristic (making it dangerous for --random-seed). So now we do Fisher-Yates here, just like we do for other inputs.

  In deterministic modes the output for --input-range is identical to that for piping `seq` into `shuf`.

  We imitate the old algorithm's method for keeping the resource use in check. The performance of the new version is very close to that of the old version: I haven't found any cases where it's much faster or much slower.
2 parents 938039e + b81a018 commit 71b4b99
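
The flush fix described in the commit message can be illustrated with a minimal sketch. It assumes a plain `std::io::BufWriter` with a 64 KiB buffer; `write_lines` and `FullDevice` are hypothetical names made up for this example and are not taken from the shuf source. The point is that a small write only fills the buffer, so the failure surfaces on `flush()`, and dropping the writer without flushing would discard it.

```rust
use std::io::{self, BufWriter, Write};

/// Writes each line through a 64 KiB buffer and flushes explicitly, so that
/// a late write failure is returned to the caller.
/// (Hypothetical helper for illustration, not the shuf implementation.)
fn write_lines<W: Write>(out: W, lines: &[&str]) -> io::Result<()> {
    let mut writer = BufWriter::with_capacity(64 * 1024, out);
    for line in lines {
        writer.write_all(line.as_bytes())?;
        writer.write_all(b"\n")?;
    }
    // Without this call, BufWriter's Drop impl would attempt the flush
    // and silently ignore any error.
    writer.flush()
}

/// Hypothetical sink that rejects every write, standing in for /dev/full.
struct FullDevice;

impl Write for FullDevice {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "no space left on device"))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Calling `write_lines(FullDevice, &["a", "b", "c"])` returns an error only because of the explicit flush; the three short lines never leave the buffer before that point.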

13 files changed

Lines changed: 758 additions & 403 deletions

.vscode/cspell.dictionaries/jargon.wordlist.txt

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,7 @@ fileio
 filesystem
 filesystems
 flamegraph
+footgun
 freeram
 fsxattr
 fullblock
@@ -93,6 +94,7 @@ mergeable
 microbenchmark
 microbenchmarks
 microbenchmarking
+monomorphized
 multibyte
 multicall
 nmerge
@@ -107,6 +109,7 @@ nolinks
 nonblock
 nonportable
 nonprinting
+nonrepeating
 nonseekable
 notrunc
 nowrite

.vscode/cspell.dictionaries/people.wordlist.txt

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,9 @@ Boden Garman
 Chirag B Jadwani
 Chirag
 Jadwani
+Daniel Lemire
+Daniel
+Lemire
 Derek Chiang
 Derek
 Chiang

.vscode/cspell.dictionaries/workspace.wordlist.txt

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ getrandom
 globset
 indicatif
 itertools
+itoa
 iuse
 langid
 lscolors

Cargo.lock

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 2 additions & 0 deletions
@@ -335,6 +335,7 @@ icu_locale = "2.0.0"
 icu_provider = "2.0.0"
 indicatif = "0.18.0"
 itertools = "0.14.0"
+itoa = "1.0.15"
 jiff = "0.2.18"
 libc = "0.2.172"
 lscolors = { version = "0.21.0", default-features = false, features = [
@@ -355,6 +356,7 @@ phf_codegen = "0.13.1"
 platform-info = "2.0.3"
 procfs = "0.18"
 rand = { version = "0.9.0", features = ["small_rng"] }
+rand_chacha = { version = "0.9.0" }
 rand_core = "0.9.0"
 rayon = "1.10"
 regex = "1.10.4"

src/uu/shuf/Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -19,8 +19,11 @@ path = "src/shuf.rs"

 [dependencies]
 clap = { workspace = true }
+itoa = { workspace = true }
 rand = { workspace = true }
+rand_chacha = { workspace = true }
 rand_core = { workspace = true }
+sha3 = { workspace = true }
 uucore = { workspace = true }
 fluent = { workspace = true }

src/uu/shuf/locales/en-US.ftl

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,7 @@ shuf-help-echo = treat each ARG as an input line
 shuf-help-input-range = treat each number LO through HI as an input line
 shuf-help-head-count = output at most COUNT lines
 shuf-help-output = write result to FILE instead of standard output
+shuf-help-random-seed = seed with STRING for reproducible output
 shuf-help-random-source = get random bytes from FILE
 shuf-help-repeat = output lines can be repeated
 shuf-help-zero-terminated = line delimiter is NUL, not newline
@@ -19,6 +20,8 @@ shuf-error-unexpected-argument = unexpected argument { $arg } found
 shuf-error-failed-to-open-for-writing = failed to open { $file } for writing
 shuf-error-failed-to-open-random-source = failed to open random source { $file }
 shuf-error-read-error = read error
+shuf-error-read-random-bytes = reading random bytes failed
+shuf-error-end-of-random-bytes = end of random source
 shuf-error-no-lines-to-repeat = no lines to repeat
 shuf-error-start-exceeds-end = start exceeds end
 shuf-error-missing-dash = missing '-'
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+// This file is part of the uutils coreutils package.
+//
+// For the full copyright and license information, please view the LICENSE
+// file that was distributed with this source code.
+
+use std::{io::BufRead, ops::RangeInclusive};
+
+use uucore::error::{FromIo, UResult, USimpleError};
+use uucore::translate;
+
+/// A uniform integer generator that tries to exactly match GNU shuf's --random-source.
+///
+/// It's not particularly efficient and possibly not quite uniform. It should *only* be
+/// used for compatibility with GNU: other modes shouldn't touch this code.
+///
+/// All the logic here was black box reverse engineered. It might not match up in all edge
+/// cases but it gives identical results on many different large and small inputs.
+///
+/// It seems that GNU uses fairly textbook rejection sampling to generate integers, reading
+/// one byte at a time until it has enough entropy, and recycling leftover entropy after
+/// accepting or rejecting a value.
+///
+/// To do your own experiments, start with commands like these:
+///
+///     printf '\x01\x02\x03\x04' | shuf -i0-255 -r --random-source=/dev/stdin
+///
+/// Then vary the integer range and the input and the input length. It can be useful to
+/// see when exactly shuf crashes with an "end of file" error.
+///
+/// To spot small inconsistencies it's useful to run:
+///
+///     diff -y <(my_shuf ...) <(shuf -i0-{MAX} -r --random-source={INPUT}) | head -n 50
+pub struct RandomSourceAdapter<R> {
+    reader: R,
+    state: u64,
+    entropy: u64,
+}
+
+impl<R> RandomSourceAdapter<R> {
+    pub fn new(reader: R) -> Self {
+        Self {
+            reader,
+            state: 0,
+            entropy: 0,
+        }
+    }
+}
+
+impl<R: BufRead> RandomSourceAdapter<R> {
+    fn generate_at_most(&mut self, at_most: u64) -> UResult<u64> {
+        while self.entropy < at_most {
+            let buf = self
+                .reader
+                .fill_buf()
+                .map_err_context(|| translate!("shuf-error-read-random-bytes"))?;
+            let Some(&byte) = buf.first() else {
+                return Err(USimpleError::new(
+                    1,
+                    translate!("shuf-error-end-of-random-bytes"),
+                ));
+            };
+            self.reader.consume(1);
+            // Is overflow OK here? Won't it cause bias? (Seems to work out...)
+            self.state = self.state.wrapping_mul(256).wrapping_add(byte as u64);
+            self.entropy = self.entropy.wrapping_mul(256).wrapping_add(255);
+        }
+
+        if at_most == u64::MAX {
+            // at_most + 1 would overflow but this case is easy.
+            let val = self.state;
+            self.entropy = 0;
+            self.state = 0;
+            return Ok(val);
+        }
+
+        let num_possibilities = at_most + 1;
+
+        // If the generated number falls within this margin at the upper end of the
+        // range then we retry to avoid modulo bias.
+        let margin = ((self.entropy as u128 + 1) % num_possibilities as u128) as u64;
+        let safe_zone = self.entropy - margin;
+
+        if self.state <= safe_zone {
+            let val = self.state % num_possibilities;
+            // Reuse the rest of the state.
+            self.state /= num_possibilities;
+            // We need this subtraction, otherwise we consume new input slightly more
+            // slowly than GNU. Not sure if it checks out mathematically.
+            self.entropy -= at_most;
+            self.entropy /= num_possibilities;
+            Ok(val)
+        } else {
+            self.state %= num_possibilities;
+            self.entropy %= num_possibilities;
+            // I sure hope the compiler optimizes this tail call.
+            self.generate_at_most(at_most)
+        }
+    }
+
+    pub fn choose_from_range(&mut self, range: RangeInclusive<u64>) -> UResult<u64> {
+        let offset = self.generate_at_most(*range.end() - *range.start())?;
+        Ok(*range.start() + offset)
+    }
+
+    pub fn choose_from_slice<T: Copy>(&mut self, vals: &[T]) -> UResult<T> {
+        assert!(!vals.is_empty());
+        let idx = self.generate_at_most(vals.len() as u64 - 1)? as usize;
+        Ok(vals[idx])
+    }
+
+    pub fn shuffle<'a, T>(&mut self, vals: &'a mut [T], amount: usize) -> UResult<&'a mut [T]> {
+        // Fisher-Yates shuffle.
+        // TODO: GNU does something different if amount <= vals.len() and the input is stdin.
+        // The order changes completely and depends on --head-count.
+        // No clue what they might do differently and why.
+        let amount = amount.min(vals.len());
+        for idx in 0..amount {
+            let other_idx = self.generate_at_most((vals.len() - idx - 1) as u64)? as usize + idx;
+            vals.swap(idx, other_idx);
+        }
+        Ok(&mut vals[..amount])
+    }
+}
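
The doc comment above mentions "fairly textbook rejection sampling". As a rough illustration of that idea only, here is a simplified sketch that reads whole bytes per attempt and throws away leftovers. It deliberately omits the entropy recycling that `RandomSourceAdapter` performs, so it should not be expected to reproduce GNU's or uutils' output for a given random source. The function name `uniform_at_most` is hypothetical.

```rust
use std::io::{self, Read};

/// Draws a uniform value in `0..=at_most` from raw bytes by rejection sampling.
///
/// Simplification (assumption of this sketch): each attempt reads a fresh
/// group of bytes and discards leftovers instead of recycling them.
fn uniform_at_most<R: Read>(reader: &mut R, at_most: u64) -> io::Result<u64> {
    // Find how many bytes are needed so that the candidate range covers `at_most`.
    let mut nbytes = 0usize;
    let mut limit: u64 = 0; // largest candidate representable in `nbytes` bytes
    while limit < at_most {
        limit = (limit << 8) | 0xFF;
        nbytes += 1;
    }

    // Use u128 so that `+ 1` cannot overflow when a bound is u64::MAX.
    let span = at_most as u128 + 1; // number of possible outputs
    let total = limit as u128 + 1; // number of possible candidates
    let accepted = total - total % span; // largest multiple of `span` <= `total`

    loop {
        let mut buf = [0u8; 8];
        // Fails with UnexpectedEof once the source runs dry.
        reader.read_exact(&mut buf[..nbytes])?;
        let candidate = buf[..nbytes]
            .iter()
            .fold(0u64, |acc, &b| (acc << 8) | b as u64);
        // Candidates at or above `accepted` would bias the modulo, so retry.
        if (candidate as u128) < accepted {
            return Ok((candidate as u128 % span) as u64);
        }
    }
}
```

With `at_most = 9`, a single byte gives 256 candidates of which the top six (250 to 255) are rejected, so every residue modulo 10 remains equally likely.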
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+use std::collections::HashMap;
+use std::ops::RangeInclusive;
+
+use uucore::error::UResult;
+
+use crate::WrappedRng;
+
+/// An iterator that samples from an integer range without repetition.
+///
+/// This is based on Fisher-Yates, and it's required for backward compatibility
+/// that it behaves exactly like Fisher-Yates if --random-source or --random-seed
+/// is used. But we have a few tricks:
+///
+/// - In the beginning we use a hash table instead of an array. This way we lazily
+///   keep track of swaps without allocating the entire range upfront.
+///
+/// - When the hash table starts to get big relative to the remaining items
+///   we switch over to an array.
+///
+/// - We store the array backwards so that we can shrink it as we go and free excess
+///   memory every now and then.
+///
+/// Both the hash table and the array give the same output.
+///
+/// There's room for optimization:
+///
+/// - Switching over from the hash table to the array is costly. If we happen to know
+///   (through --head-count) that only few draws remain then it would be better not
+///   to switch.
+///
+/// - If the entire range gets used then we might as well allocate an array to start
+///   with. But if the user e.g. pipes through `head` rather than using --head-count
+///   we can't know whether that's the case, so there's a tradeoff.
+///
+///   GNU decides the other way: --head-count is noticeably faster than | head.
+pub(crate) struct NonrepeatingIterator<'a> {
+    rng: &'a mut WrappedRng,
+    values: Values,
+}
+
+enum Values {
+    Full(Vec<u64>),
+    Sparse(RangeInclusive<u64>, HashMap<u64, u64>),
+}
+
+impl<'a> NonrepeatingIterator<'a> {
+    pub(crate) fn new(range: RangeInclusive<u64>, rng: &'a mut WrappedRng) -> Self {
+        let values = Values::Sparse(range, HashMap::default());
+        NonrepeatingIterator { rng, values }
+    }
+
+    fn produce(&mut self) -> UResult<u64> {
+        match &mut self.values {
+            Values::Full(items) => {
+                let this_idx = items.len() - 1;
+
+                let other_idx = self.rng.choose_from_range(0..=items.len() as u64 - 1)? as usize;
+                // Flip the index to pretend we're going left-to-right
+                let other_idx = items.len() - other_idx - 1;
+
+                items.swap(this_idx, other_idx);
+
+                let val = items.pop().unwrap();
+                if items.len().is_power_of_two() && items.len() >= 512 {
+                    items.shrink_to_fit();
+                }
+                Ok(val)
+            }
+            Values::Sparse(range, items) => {
+                let this_idx = *range.start();
+                let this_val = items.remove(&this_idx).unwrap_or(this_idx);
+
+                let other_idx = self.rng.choose_from_range(range.clone())?;
+
+                let val = if this_idx == other_idx {
+                    this_val
+                } else {
+                    items.insert(other_idx, this_val).unwrap_or(other_idx)
+                };
+                *range = *range.start() + 1..=*range.end();
+
+                Ok(val)
+            }
+        }
+    }
+}
+
+impl Iterator for NonrepeatingIterator<'_> {
+    type Item = UResult<u64>;
+
+    fn next(&mut self) -> Option<Self::Item> {
+        match &self.values {
+            Values::Full(items) if items.is_empty() => return None,
+            Values::Full(_) => (),
+            Values::Sparse(range, _) if range.is_empty() => return None,
+            Values::Sparse(range, items) => {
+                let range_len = range.size_hint().0 as u64;
+                if items.len() as u64 >= range_len / 8 {
+                    self.values = Values::Full(hashmap_to_vec(range.clone(), items));
+                }
+            }
+        }
+
+        Some(self.produce())
+    }
+}
+
+fn hashmap_to_vec(range: RangeInclusive<u64>, map: &HashMap<u64, u64>) -> Vec<u64> {
+    let lookup = |idx| *map.get(&idx).unwrap_or(&idx);
+    range.rev().map(lookup).collect()
+}
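
The hash table trick from the doc comment above can be tried in isolation. The following is a minimal standalone sketch, not the code from this commit: it keeps only the sparse phase (no switch to an array), and it substitutes a tiny LCG for `WrappedRng` purely so the example needs no external crates. The names `Lcg` and `sample_sparse` are hypothetical, and the LCG ignores modulo bias for brevity.

```rust
use std::collections::HashMap;

/// Tiny deterministic generator, used only to keep this sketch self-contained.
/// (Assumption: stands in for a real RNG; not what shuf uses.)
struct Lcg(u64);

impl Lcg {
    /// Returns a value in `lo..=hi`. Modulo bias is ignored for brevity, and
    /// the range is assumed to be narrower than the full u64 domain.
    fn choose(&mut self, lo: u64, hi: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        lo + (self.0 >> 33) % (hi - lo + 1)
    }
}

/// Draws up to `count` distinct values from `lo..=hi` with Fisher-Yates,
/// recording only the displaced positions in a hash map instead of
/// materializing the whole range.
fn sample_sparse(lo: u64, hi: u64, count: usize, rng: &mut Lcg) -> Vec<u64> {
    // Position -> value for every slot whose value differs from its index.
    let mut displaced: HashMap<u64, u64> = HashMap::new();
    let mut out = Vec::with_capacity(count);
    let mut start = lo;
    while out.len() < count && start <= hi {
        // Value currently sitting at the front of the remaining range.
        let this_val = displaced.remove(&start).unwrap_or(start);
        let other = rng.choose(start, hi);
        let val = if other == start {
            this_val
        } else {
            // Swap: the front value moves to `other`, and we emit what was there.
            displaced.insert(other, this_val).unwrap_or(other)
        };
        out.push(val);
        if start == hi {
            break; // avoid overflowing `start` at the end of the range
        }
        start += 1;
    }
    out
}
```

Drawing 5 values from `10..=1_000_000` this way touches at most 5 map entries, which is the memory saving the doc comment describes for small `--head-count` values over large ranges.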
