The label_distribution measurement is supposed to quantify how biased a column of labels is towards one value or another. It computes the fraction each label value takes up out of all labels (which is useful) and, problematically, the skewness of that distribution:

    def _compute(self, data):
        """Returns the fraction of each label present in the data"""
        c = Counter(data)
        label_distribution = {"labels": [k for k in c.keys()], "fractions": [f / len(data) for f in c.values()]}
        if isinstance(data[0], str):
            label2id = {label: id for id, label in enumerate(label_distribution["labels"])}
            data = [label2id[d] for d in data]
        skew = stats.skew(data)
        return {"label_distribution": label_distribution, "label_skew": skew}

This is statistical nonsense.
A class label is a multinoulli (categorical) variable: a discrete variable that can take a finite number of values whose magnitudes have no meaning beyond being distinct from each other. If you have classes {cat, dog, giraffe}, it makes no difference whether we encode cat as 0 or 3.1415, nor whether it is cat or dog that gets encoded as 0, as long as the encodings are different.
Skewness, however, is meant for numerical variables whose magnitudes are meaningful, and it is sensitive both to those magnitudes and to which value is assigned to which class. It measures the symmetry of the distribution, not its uniformity.
- The label column [0, 0, 0, 1, 1, 1, 2, 2, 2] has 0 skewness, because it is symmetrical.
- The label column [0, 0, 1, 1, 1, 1, 1, 2, 2] has 0 skewness, because it is symmetrical. Yet, clearly, there is heavy bias towards label 1.
- The label column [0, 0, 1, 1, 2, 2, 2, 2, 2] is exactly as biased as the previous one (2 labels, 2 labels, 5 labels) and yet now it has skewness != 0 because the weight of the distribution is "on the right", which has zero meaning.
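
The three cases above can be checked numerically. The sketch below uses a hand-written `skewness` helper (a hypothetical name, not part of the library) so that it runs without SciPy. It computes the biased Fisher-Pearson coefficient m3 / m2^1.5, which to my knowledge is what `stats.skew` returns with its default arguments. It also swaps the ids of two classes in the third column to show that the result depends on the encoding alone.

```python
def skewness(values):
    """Biased Fisher-Pearson skewness m3 / m2**1.5.

    Assumption: this mirrors the default behaviour of scipy.stats.skew.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


uniform = [0, 0, 0, 1, 1, 1, 2, 2, 2]
biased_middle = [0, 0, 1, 1, 1, 1, 1, 2, 2]
biased_right = [0, 0, 1, 1, 2, 2, 2, 2, 2]

# Same column as biased_right, but with the ids of classes 1 and 2 swapped.
# The class frequencies are unchanged; only the arbitrary encoding differs.
swap = {0: 0, 1: 2, 2: 1}
relabelled = [swap[v] for v in biased_right]

columns = {
    "uniform": uniform,
    "biased_middle": biased_middle,
    "biased_right": biased_right,
    "relabelled": relabelled,
}
for name, column in columns.items():
    print(f"{name}: skew = {skewness(column):.3f}")
```

The uniform column and the column biased towards the middle id are indistinguishable by skewness, while the column biased towards the last id gets a negative value that disappears again once two class ids are swapped.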
To measure uniformity, what you are looking for is the entropy of the labels, not their skewness. Entropy is invariant under class permutations and is maximised by the uniform distribution. If you want to normalise it, you can divide by that maximal entropy (the Hartley function, i.e. the logarithm of the number of classes).
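
As a rough sketch of that suggestion, normalised entropy can be computed directly from the label counts. The function name `normalised_label_entropy` is hypothetical and not part of the library. It assumes the number of classes K is the number of distinct labels observed in the column (if the full label set is known, K should come from there instead), and it returns 0.0 for columns with fewer than two distinct labels, which is a convention chosen here because the ratio is undefined in that case.

```python
from collections import Counter
from math import log


def normalised_label_entropy(labels):
    """Shannon entropy of the label frequencies divided by log(K).

    K is taken to be the number of distinct labels observed (an assumption).
    Returns a value in [0, 1]; 1 means perfectly uniform.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k < 2:
        # Convention chosen for this sketch: log(1) == 0 makes the ratio undefined.
        return 0.0
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy / log(k)


uniform = [0, 0, 0, 1, 1, 1, 2, 2, 2]
biased_middle = [0, 0, 1, 1, 1, 1, 1, 2, 2]
biased_right = [0, 0, 1, 1, 2, 2, 2, 2, 2]
as_strings = ["cat", "cat", "dog", "dog", "giraffe", "giraffe", "giraffe", "giraffe", "giraffe"]

for name, column in [
    ("uniform", uniform),
    ("biased_middle", biased_middle),
    ("biased_right", biased_right),
    ("as_strings", as_strings),
]:
    print(f"{name}: normalised entropy = {normalised_label_entropy(column):.3f}")
```

Because only the counts enter the computation, the two biased columns and the string-labelled column get the same score, and no label-to-id mapping is needed at all.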
The snippet quoted above is from evaluate/measurements/label_distribution/label_distribution.py, lines 85 to 93 in 55f1bc6.