Commit 3cc0146

committed
Grouped Median function
1 parent 0586823 commit 3cc0146

6 files changed

Lines changed: 285 additions & 0 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
 # Changelog

+## 1.2.2 - WIP
+- Adding `medianGrouped()` method for estimating the median of grouped/binned continuous data using interpolation
+
+
 ## 1.2.1 - 2026-02-20
 - Adding `invCdf()` method to normal distribution
 - Adding `getVariance()` method to normal distribution (sigma squared)

README.md

Lines changed: 30 additions & 0 deletions
@@ -67,6 +67,7 @@ The various mathematical statistics are listed below:
 | `median()` | median or "middle value" of data |
 | `medianLow()` | low median of data |
 | `medianHigh()` | high median of data |
+| `medianGrouped()` | median of grouped data, using interpolation |
 | `mode()` | single mode (most common value) of discrete or nominal data |
 | `multimode()` | list of modes (most common values) of discrete or nominal data |
 | `quantiles()` | cut points dividing the range of a probability distribution into continuous intervals with equal probabilities |
@@ -192,6 +193,35 @@ $median = Stat::medianHigh([1, 3, 5, 7]);
 // 5
 ```

+#### Stat::medianGrouped( array $data, float $interval = 1.0 )
+Estimate the median for numeric data that has been grouped or binned around the midpoints of consecutive, fixed-width intervals.
+The `$interval` parameter specifies the width of each bin (default `1.0`). This function uses interpolation within the median interval, assuming values are evenly distributed across each bin.
+
+```php
+use HiFolks\Statistics\Stat;
+$median = Stat::medianGrouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]);
+// 3.7
+$median = Stat::medianGrouped([1, 3, 3, 5, 7]);
+// 3.25
+$median = Stat::medianGrouped([1, 3, 3, 5, 7], 2);
+// 3.5
+```
+
+For example, demographic data summarized into ten-year age groups:
+```php
+use HiFolks\Statistics\Stat;
+// 172 people aged 20-30, 484 aged 30-40, 387 aged 40-50, etc.
+$data = array_merge(
+    array_fill(0, 172, 25),
+    array_fill(0, 484, 35),
+    array_fill(0, 387, 45),
+    array_fill(0, 22, 55),
+    array_fill(0, 6, 65),
+);
+round(Stat::medianGrouped($data, 10), 1);
+// 37.5
+```
+
 #### Stat::quantiles( array $data, $n=4, $round=null )
 Divide data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals.
 Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles which gives the 99 cut points that separate data into 100 equal-sized groups.
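
To make the interpolation described for `medianGrouped()` concrete, the sketch below traces each term of `L + interval * (n/2 - cf) / f` on the first README example. The function `groupedMedianTrace()` is a hypothetical name used only for illustration; it uses a linear scan rather than the binary search in the commit, and it is not part of the package.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative only: returns the intermediate terms of the grouped-median
 * interpolation. Not part of HiFolks\Statistics.
 *
 * @param array<int|float> $data
 * @return array{x: float, L: float, cf: int, f: int, median: float}
 */
function groupedMedianTrace(array $data, float $interval = 1.0): array
{
    if ($data === []) {
        throw new InvalidArgumentException('The data must not be empty.');
    }
    sort($data);
    $n = count($data);

    // Value in the middle of the sorted data, taken as the midpoint of the median bin.
    $x = (float) $data[intdiv($n, 2)];

    // cf: observations strictly below x; f: observations equal to x.
    $cf = count(array_filter($data, fn ($v) => $v < $x));
    $f = count(array_filter($data, fn ($v) => $v == $x));

    // Lower limit of the median bin.
    $L = $x - $interval / 2.0;

    return [
        'x' => $x,
        'L' => $L,
        'cf' => $cf,
        'f' => $f,
        'median' => $L + $interval * ($n / 2.0 - $cf) / $f,
    ];
}

$trace = groupedMedianTrace([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]);
// x = 4.0, L = 3.5, cf = 4, f = 5, so median = 3.5 + 1.0 * (5 - 4) / 5 = 3.7
print_r($trace);
```

With `$interval = 2` and the data `[1, 3, 3, 5, 7]`, the same steps give `L = 2.0`, `cf = 1`, `f = 2`, and `2.0 + 2 * (2.5 - 1) / 2 = 3.5`, matching the README value.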

TODO.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+Missing Functions
+
+Python Function: median_grouped(data, interval)
+Description: Median of grouped/binned continuous data
+Status: Missing
+────────────────────────────────────────
+Python Function: kde(data, h, kernel)
+Description: Kernel Density Estimation
+Status: Missing
+────────────────────────────────────────
+Python Function: kde_random(data, h, kernel)
+Description: Random sampling from KDE
+Status: Missing
+
+Missing Parameters/Variants
+
+Feature: correlation() with method='ranked'
+Python: Supports both Pearson and Spearman rank correlation
+This Package: Only Pearson
+────────────────────────────────────────
+Feature: linear_regression() with proportional=True
+Python: Supports proportional regression (intercept forced to 0)
+This Package: No proportional option
+────────────────────────────────────────
+Feature: variance(data, xbar) / pvariance(data, mu)
+Python: Can pass pre-computed mean to avoid recalculation
+This Package: No pre-computed mean parameter
+────────────────────────────────────────
+Feature: quantiles() with method='inclusive'
+Python: Supports both exclusive and inclusive methods
+This Package: No method parameter
+
+Summary
+
+The package is actually very close to full parity with Python's statistics
+module. The gaps are:
+
+1. median_grouped - interpolation-based median for grouped/binned data
+2. kde / kde_random - Kernel Density Estimation (added in Python 3.13,
+relatively new)
+3. Spearman rank correlation - via method parameter on correlation()
+4. Proportional linear regression - forcing intercept through origin
+5. Minor parameter additions (xbar/mu on variance/stdev, method on quantiles)
+
+Items 1, 3, and 4 would be the most practical additions to reach near-complete
+parity with Python's statistics module. The KDE functions (2) are newer and
+more niche.
+
+
+
+
+Currently Implemented (for reference)
+
+Central tendency, variance/stdev, median variants, mode/multimode,
+geometric/harmonic mean, quantiles, covariance, correlation, linear
+regression, normal distribution (PDF, CDF, inverse CDF, z-score), frequency
+tables.
+
+---
+Missing Functions
+
+Descriptive Statistics
+
+- Trimmed/Truncated mean - mean after removing outliers (top/bottom x%)
+- Weighted median - median with weights (like fmean supports weights, but
+median doesn't)
+- Skewness - measure of asymmetry of the distribution
+- Kurtosis - measure of "tailedness" of the distribution
+- Standard error of the mean (SEM)
+- Coefficient of variation (CV) - stdev / mean, useful for comparing
+variability across datasets
+- Mean absolute deviation (MAD)
+- Percentile - arbitrary percentile (e.g., 90th percentile) — quantiles()
+exists but a direct percentile($data, $p) would be convenient
+
+Correlation & Regression
+
+- Spearman rank correlation - non-parametric correlation
+- Kendall tau correlation - another rank-based correlation
+- Multiple/polynomial regression
+- R-squared (coefficient of determination)
+
+Hypothesis Testing
+
+- T-test (one-sample, two-sample, paired)
+- Chi-squared test
+- Z-test
+- P-value calculation
+- Confidence intervals
+
+Other Distributions (beyond Normal)
+
+- Student's t-distribution
+- Chi-squared distribution
+- Binomial distribution
+- Poisson distribution
+- Uniform distribution
+- Exponential distribution
+
+Outlier Detection
+
+- IQR-based outlier detection (the building blocks exist with
+firstQuartile/thirdQuartile, but no dedicated method)
+- Z-score based outlier detection
+
+Ranking & Order Statistics
+
+- Rank - assign ranks to data points
+- Percentile rank - what percentile a given value falls at
+
+---
+The most impactful additions would likely be skewness, kurtosis, coefficient
+of variation, percentile, and Spearman correlation — these are commonly needed
+and align well with the package's existing scope (inspired by Python's
+statistics module).
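
As an illustration of one gap listed in TODO.md, proportional linear regression (intercept forced to 0), here is a minimal sketch. It assumes the usual least-squares result for a line through the origin, slope = Σxy / Σx². The function name `proportionalRegressionSketch()` and its return shape are hypothetical and do not reflect any existing or planned API of the package.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of the package: least-squares fit of
 * y = slope * x, with the intercept fixed at 0.
 *
 * @param array<int|float> $x
 * @param array<int|float> $y
 * @return array{slope: float, intercept: float}
 */
function proportionalRegressionSketch(array $x, array $y): array
{
    $x = array_values($x);
    $y = array_values($y);
    $n = count($x);
    if ($n !== count($y) || $n < 2) {
        throw new InvalidArgumentException('x and y need the same length, with at least 2 points.');
    }

    $sxy = 0.0; // sum of x * y
    $sxx = 0.0; // sum of x squared
    foreach ($x as $k => $xi) {
        $sxy += $xi * $y[$k];
        $sxx += $xi * $xi;
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('x must contain at least one non-zero value.');
    }

    return ['slope' => $sxy / $sxx, 'intercept' => 0.0];
}

$fit = proportionalRegressionSketch([1, 2, 3], [2, 4, 6]);
// Σxy = 28, Σx² = 14, so slope = 2.0
printf("slope = %.1f, intercept = %.1f\n", $fit['slope'], $fit['intercept']);
```

Unlike ordinary linear regression, the sums are not centred on the means, which is what pins the fitted line to the origin.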

src/Stat.php

Lines changed: 90 additions & 0 deletions
@@ -143,6 +143,96 @@ public static function median(
         };
     }

+    /**
+     * Estimate the median for grouped data that has been binned
+     * around the midpoints of consecutive, fixed-width intervals.
+     *
+     * Uses interpolation within the median interval:
+     * L + interval * (n/2 - cf) / f
+     *
+     * where:
+     * - L is the lower limit of the median interval
+     * - cf is the cumulative frequency of the preceding interval
+     * - f is the number of elements in the median interval
+     *
+     * @param array<int|float> $data
+     * @param float $interval the width of each bin
+     * @return float the estimated median for grouped data
+     *
+     * @throws InvalidDataInputException if the data is empty
+     */
+    public static function medianGrouped(array $data, float $interval = 1.0): float
+    {
+        sort($data);
+        $n = count($data);
+        if ($n === 0) {
+            throw new InvalidDataInputException("The data must not be empty.");
+        }
+
+        // Find the value at the midpoint (midpoint of the class interval)
+        $x = (float) $data[intdiv($n, 2)];
+
+        // Find where all the x values occur in the sorted data
+        // All x will lie within data[i:j]
+        $i = self::bisectLeft($data, $x);
+        $j = self::bisectRight($data, $x, $i);
+
+        // Lower limit of the median interval
+        $L = $x - $interval / 2.0;
+        // Cumulative frequency of the preceding interval
+        $cf = $i;
+        // Number of elements in the median interval
+        $f = $j - $i;
+
+        return $L + $interval * ($n / 2.0 - $cf) / $f;
+    }
+
+    /**
+     * Binary search: find the leftmost position where $target can be inserted
+     * in $data while keeping it sorted.
+     *
+     * @param array<int|float> $data sorted array
+     * @param float $target value to locate
+     */
+    private static function bisectLeft(array $data, float $target): int
+    {
+        $lo = 0;
+        $hi = count($data);
+        while ($lo < $hi) {
+            $mid = intdiv($lo + $hi, 2);
+            if ($data[$mid] < $target) {
+                $lo = $mid + 1;
+            } else {
+                $hi = $mid;
+            }
+        }
+
+        return $lo;
+    }
+
+    /**
+     * Binary search: find the rightmost position where $target can be inserted
+     * in $data while keeping it sorted.
+     *
+     * @param array<int|float> $data sorted array
+     * @param float $target value to locate
+     * @param int $lo lower bound for the search
+     */
+    private static function bisectRight(array $data, float $target, int $lo = 0): int
+    {
+        $hi = count($data);
+        while ($lo < $hi) {
+            $mid = intdiv($lo + $hi, 2);
+            if ($data[$mid] <= $target) {
+                $lo = $mid + 1;
+            } else {
+                $hi = $mid;
+            }
+        }
+
+        return $lo;
+    }
+
     /**
      * Return the low median of data.
      * The low median is always a member of the data set.
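
On a sorted array, the gap between the leftmost and rightmost insertion points of a value equals its number of occurrences, which is how `medianGrouped()` obtains `$cf` and `$f`. Because `bisectLeft()` and `bisectRight()` are private, the sketch below re-declares the same logic as free functions under hypothetical names (`bisectLeftSketch()`, `bisectRightSketch()`) so their behaviour can be checked in isolation.

```php
<?php

declare(strict_types=1);

/** Leftmost insertion point for $target in a sorted array (illustrative copy). */
function bisectLeftSketch(array $data, float $target): int
{
    $lo = 0;
    $hi = count($data);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($data[$mid] < $target) {
            $lo = $mid + 1; // target lies strictly to the right of $mid
        } else {
            $hi = $mid;     // $mid may itself be the leftmost match
        }
    }

    return $lo;
}

/** Rightmost insertion point for $target in a sorted array (illustrative copy). */
function bisectRightSketch(array $data, float $target, int $lo = 0): int
{
    $hi = count($data);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($data[$mid] <= $target) {
            $lo = $mid + 1; // step past every element equal to the target
        } else {
            $hi = $mid;
        }
    }

    return $lo;
}

$sorted = [1, 3, 3, 5, 7];
$i = bisectLeftSketch($sorted, 3.0);      // 1: index of the first 3
$j = bisectRightSketch($sorted, 3.0, $i); // 3: index just past the last 3
echo "cf = {$i}, f = " . ($j - $i) . PHP_EOL; // prints "cf = 1, f = 2"
```

Passing `$i` as the lower bound of the second search skips the part of the array already known to be below the target. Since the target is taken from the data itself, `$f` is at least 1, so the division in the formula is safe.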

src/Statistics.php

Lines changed: 12 additions & 0 deletions
@@ -173,6 +173,18 @@ public function median(): mixed
         return Stat::median($this->values);
     }

+    /**
+     * Estimate the median for grouped data.
+     *
+     * @param float $interval the width of each bin
+     *
+     * @see Stat::medianGrouped()
+     */
+    public function medianGrouped(float $interval = 1.0): float
+    {
+        return Stat::medianGrouped($this->numericalArray(), $interval);
+    }
+
     /**
      * Return the first quartile.
      *

tests/StatTest.php

Lines changed: 34 additions & 0 deletions
@@ -92,6 +92,40 @@ public function test_calculates_median_high_with_empty_array(): void
         Stat::medianHigh([]);
     }

+    public function test_calculates_median_grouped(): void
+    {
+        // Python: median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]) == 3.7
+        $this->assertEquals(3.7, Stat::medianGrouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]));
+
+        // Python: median_grouped([52, 52, 53, 54]) == 52.5
+        $this->assertEquals(52.5, Stat::medianGrouped([52, 52, 53, 54]));
+
+        // Python: median_grouped([1, 3, 3, 5, 7]) == 3.25
+        $this->assertEquals(3.25, Stat::medianGrouped([1, 3, 3, 5, 7]));
+
+        // With interval=2: median_grouped([1, 3, 3, 5, 7], interval=2) == 3.5
+        $this->assertEquals(3.5, Stat::medianGrouped([1, 3, 3, 5, 7], 2));
+
+        // Demographics example from Python docs (interval=10)
+        $data = array_merge(
+            array_fill(0, 172, 25),
+            array_fill(0, 484, 35),
+            array_fill(0, 387, 45),
+            array_fill(0, 22, 55),
+            array_fill(0, 6, 65),
+        );
+        $this->assertEquals(37.5, round(Stat::medianGrouped($data, 10), 1));
+
+        // Single element: L = 1 - 0.5 = 0.5, result = 0.5 + 1*(0.5-0)/1 = 1.0
+        $this->assertEquals(1.0, Stat::medianGrouped([1]));
+    }
+
+    public function test_calculates_median_grouped_with_empty_array(): void
+    {
+        $this->expectException(InvalidDataInputException::class);
+        Stat::medianGrouped([]);
+    }
+
     public function test_calculates_mode(): void
     {
         $this->assertEquals(3, Stat::mode([1, 1, 2, 3, 3, 3, 3, 4]));
