Commit 3bf22a4

Add rank and percentile rank statistics
1 parent 4be4d0f commit 3bf22a4

26 files changed

Lines changed: 679 additions & 196 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,13 @@
 # Changelog

+## 1.5.1 - 2026-05-12
+- Adding `rank()` method for assigning 1-based ranks to data points, with support for `average`, `min`, `max`, `dense`, and `ordinal` tie strategies
+- Adding `percentileRank()` method for calculating the percentile position of a value, with `weak`, `strict`, `mean`, and `rank` variants
+- Adding fluent `Statistics::rank()` and `Statistics::percentileRank()` wrapper methods
+- Fixing `Statistics::tTestPaired()` to preserve the original input order for paired observations
+- Updating README documentation and examples for ranking and percentile-rank usage
+- Improving `Statistics` test coverage for two-sample and paired t-test wrappers
+
 ## 1.5.0 - 2026-03-07
 - Adding `logarithmicRegression()`, `powerRegression()`, and `exponentialRegression()` methods for non-linear regression models

README.md

Lines changed: 41 additions & 0 deletions
@@ -79,6 +79,8 @@ The various mathematical statistics are listed below:
 | `thirdQuartile()` | 3rd quartile, is the value at which 75 percent of the data is below it |
 | `firstQuartile()` | first quartile, is the value at which 25 percent of the data is below it |
 | `percentile()` | value at any percentile (0–100) with linear interpolation |
+| `rank()` | rank each data point, with configurable tie handling |
+| `percentileRank()` | percentile position of a value within a dataset |
 | `pstdev()` | Population standard deviation |
 | `stdev()` | Sample standard deviation |
 | `sem()` | Standard error of the mean (SEM) — measures precision of the sample mean |
@@ -335,6 +337,45 @@ $value = Stat::percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 90);
 // 91.0
 ```

+#### Stat::rank( array $data, string $method = Stat::RANK_AVERAGE )
+Return 1-based ranks for each data point, preserving the original array keys.
+
+The `$method` parameter controls how tied values are ranked:
+- `Stat::RANK_AVERAGE` (default): tied values receive the average rank.
+- `Stat::RANK_MIN`: tied values receive the lowest rank in the tied group.
+- `Stat::RANK_MAX`: tied values receive the highest rank in the tied group.
+- `Stat::RANK_DENSE`: tied values receive the same rank and ranks do not skip numbers.
+- `Stat::RANK_ORDINAL`: tied values are ranked by their sorted order, preserving input order inside ties.
+
+```php
+use HiFolks\Statistics\Stat;
+
+$ranks = Stat::rank([10, 20, 20, 30]);
+// [1, 2.5, 2.5, 4]
+
+$ranks = Stat::rank([10, 20, 20, 30], Stat::RANK_DENSE);
+// [1, 2, 2, 3]
+```
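
To make the five tie strategies concrete, the following is a minimal, self-contained sketch of how they could be computed. It is not the library's implementation: `sketchRank()` and its plain-string method names are hypothetical and exist only for this illustration, and the code assumes PHP 8.0 or later (for `match` and `throw` expressions).

```php
<?php

/**
 * Hypothetical helper, NOT part of the library: assigns 1-based ranks,
 * preserving the original array keys. $method is one of
 * 'average', 'min', 'max', 'dense', 'ordinal'.
 */
function sketchRank(array $data, string $method = 'average'): array
{
    $keys = array_keys($data);
    $values = array_values($data);
    $n = count($values);
    if ($n === 0) {
        return [];
    }

    // Sort positions by value; ties fall back to the input position,
    // so equal values keep their original relative order.
    $order = range(0, $n - 1);
    usort($order, fn (int $a, int $b) => [$values[$a], $a] <=> [$values[$b], $b]);

    $ranks = [];
    $dense = 0;
    $i = 0;
    while ($i < $n) {
        // Find the end (exclusive) of the group of tied values starting at $i.
        $j = $i;
        while ($j < $n && $values[$order[$j]] == $values[$order[$i]]) {
            $j++;
        }
        $dense++;

        // The group occupies sorted positions $i + 1 .. $j (1-based).
        for ($k = $i; $k < $j; $k++) {
            $ranks[$keys[$order[$k]]] = match ($method) {
                'average' => ($i + 1 + $j) / 2,
                'min' => $i + 1,
                'max' => $j,
                'dense' => $dense,
                'ordinal' => $k + 1,
                default => throw new InvalidArgumentException("Unknown method: $method"),
            };
        }
        $i = $j;
    }

    // Return the ranks in the original key order.
    $result = [];
    foreach ($keys as $key) {
        $result[$key] = $ranks[$key];
    }

    return $result;
}

$ranks = sketchRank([10, 20, 20, 30], 'min');
// [1, 2, 2, 4]

$ranks = sketchRank([10, 20, 20, 30], 'max');
// [1, 3, 3, 4]

$ranks = sketchRank([10, 20, 20, 30], 'ordinal');
// [1, 2, 3, 4]
```

In this sketch, the two tied values `20` occupy sorted positions 2 and 3, so every strategy differs only in which number it picks from that range.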
+
+#### Stat::percentileRank( array $data, int|float $value, string $kind = Stat::PERCENTILE_RANK_WEAK, ?int $round = null )
+Return the percentile position of a value within a dataset.
+
+The `$kind` parameter controls the calculation:
+- `Stat::PERCENTILE_RANK_WEAK` (default): percentage of values less than or equal to the value.
+- `Stat::PERCENTILE_RANK_STRICT`: percentage of values strictly less than the value.
+- `Stat::PERCENTILE_RANK_MEAN`: average of weak and strict percentile ranks.
+- `Stat::PERCENTILE_RANK_RANK`: average percentage rank for exact matches, falling back to mean when the value is absent.
+
+```php
+use HiFolks\Statistics\Stat;
+
+$rank = Stat::percentileRank([10, 20, 20, 30, 40], 20);
+// 60.0
+
+$rank = Stat::percentileRank([10, 20, 20, 30, 40], 20, Stat::PERCENTILE_RANK_STRICT);
+// 20.0
+```
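
The weak, strict, and mean kinds reduce to two counts: how many values are below the given value and how many are equal to it. The sketch below illustrates that arithmetic under stated assumptions. `sketchPercentileRank()` and its plain-string kind names are hypothetical, it is not the library's code, and it requires PHP 8.0 or later. The `rank` kind and the `$round` parameter are left out because their exact behavior is not spelled out here.

```php
<?php

/**
 * Hypothetical helper, NOT part of the library: percentile rank of $value
 * within $data. $kind is one of 'weak', 'strict', 'mean'.
 */
function sketchPercentileRank(array $data, int|float $value, string $kind = 'weak'): float
{
    $n = count($data);
    if ($n === 0) {
        throw new InvalidArgumentException('The data must not be empty.');
    }

    // Count the values strictly below and exactly equal to $value.
    $below = 0;
    $equal = 0;
    foreach ($data as $x) {
        if ($x < $value) {
            $below++;
        } elseif ($x == $value) {
            $equal++;
        }
    }

    $strict = 100.0 * $below / $n;            // share of values < $value
    $weak = 100.0 * ($below + $equal) / $n;   // share of values <= $value

    return match ($kind) {
        'weak' => $weak,
        'strict' => $strict,
        'mean' => ($weak + $strict) / 2,
        default => throw new InvalidArgumentException("Unknown kind: $kind"),
    };
}

$rank = sketchPercentileRank([10, 20, 20, 30, 40], 20, 'mean');
// 40.0
```

For `[10, 20, 20, 30, 40]` and the value `20`, one value is below and two are equal, so the weak rank is 3/5 = 60%, the strict rank is 1/5 = 20%, and their mean is 40%.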
+
 #### Stat::pstdev( array $data )
 Return the **Population** Standard Deviation, a measure of the amount of variation or dispersion of a set of values.
 A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

TODO.md

Lines changed: 54 additions & 29 deletions
@@ -1,31 +1,56 @@
 ## Missing Functions

-
-
-
-### Correlation & Regression
-
-
-- Kendall tau correlation - another rank-based correlation
-- Multiple/polynomial regression
-
-### Hypothesis Testing
-
-- ~~T-test (two-sample, paired) — one-sample is done~~ DONE: `tTestTwoSample()` (Welch's) and `tTestPaired()`
-- Chi-squared test
-
-### Other Distributions (beyond Normal)
-
-- Chi-squared distribution
-- Binomial distribution
-- Poisson distribution
-- Uniform distribution
-- Exponential distribution
-
-
-
-
-### Ranking & Order Statistics
-
-- Rank - assign ranks to data points
-- Percentile rank - what percentile a given value falls at
+### Priority 1: Ranking & Order Statistics
+
+- DONE: `rank()` - assign ranks to data points.
+- Supports tie strategies: `average`, `min`, `max`, `dense`, `ordinal`.
+- DONE: `percentileRank()` - calculate what percentile a given value falls at.
+- Supports `weak`, `strict`, `mean`, and `rank` variants.
+
+### Priority 2: Correlation
+
+- `kendallTau()` - Kendall tau rank correlation.
+- Useful for ordinal data and small samples.
+- Complements the existing Pearson and Spearman support in `correlation()`.
+- Consider extending `correlation()` with a Kendall method option.
+
+### Priority 3: Hypothesis Testing
+
+- ~~T-test (two-sample, paired) - one-sample is done~~ DONE: `tTestTwoSample()` (Welch's) and `tTestPaired()`.
+- `chiSquaredTest()` - chi-squared goodness-of-fit test.
+- `chiSquaredIndependence()` - chi-squared test for contingency tables.
+
+### Priority 4: Distributions
+
+- `ChiSquaredDist`
+- Needed for chi-squared tests.
+- Include `pdf()`, `cdf()`, `invCdf()` if practical, mean, variance.
+- `BinomialDist`
+- Include `pmf()`, `cdf()`, mean, variance, samples.
+- `PoissonDist`
+- Include `pmf()`, `cdf()`, mean, variance, samples.
+- `ExponentialDist`
+- Include `pdf()`, `cdf()`, `invCdf()`, mean, variance, samples.
+- `UniformDist`
+- Include `pdf()`, `cdf()`, `invCdf()`, mean, variance, samples.
+
+### Priority 5: Statistics Wrapper Completeness
+
+Add fluent `Statistics` wrapper methods for existing `Stat` APIs where useful:
+
+- `correlation()`
+- `covariance()`
+- `linearRegression()`
+- `logarithmicRegression()`
+- `powerRegression()`
+- `exponentialRegression()`
+- `rSquared()`
+- `kde()`
+- `kdeRandom()`
+
+### Priority 6: Regression & Modeling
+
+- `polynomialRegression()` - fit polynomial models of configurable degree.
+- `multipleLinearRegression()` - fit linear models with multiple predictors.
+- This likely needs a small matrix/linear-algebra helper layer.
+- Add after simpler ranking, correlation, testing, and distribution work.
