| description | Make your training procedure more effective |
|---|---|
- When looking at precision P and recall R (for example), we may not be able to tell which model is best
- So we create a single evaluation metric that combines P and R
- Now we can choose the best model according to our new metric
- For example, a popular combined metric is the F1 score:
$$F1 = \frac{2}{\frac{1}{P}+\frac{1}{R}}$$
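As a quick sketch, combining P and R into one number turns model selection into a simple argmax; the models and their precision/recall values below are hypothetical:

```python
def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2 / (1/P + 1/R)."""
    return 2 / (1 / p + 1 / r)

# Hypothetical models: neither dominates on both P and R on its own...
models = {
    "model_A": {"P": 0.95, "R": 0.90},
    "model_B": {"P": 0.98, "R": 0.85},
}

# ...but with one combined metric, picking the best model is a simple argmax.
best = max(models, key=lambda name: f1_score(models[name]["P"], models[name]["R"]))
print(best)  # model_A (F1 ~ 0.924 beats model_B's F1 ~ 0.910)
```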
To summarize: we can construct our own metrics, suited to our models and priorities, so that we can make the best choice
For better evaluation, we classify our metrics as follows:
| Metric Type | Description |
|---|---|
| Optimizing Metric | A metric we want to do as well as possible on |
| Satisficing Metric | A metric that just has to be good enough (meet a threshold) |
In general, if we have N metrics, we optimize 1 metric and satisfice the other N-1 metrics
Clarification: a satisficing metric is judged against a threshold that we determine
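A minimal sketch of this rule, assuming hypothetical candidate models where accuracy is the optimizing metric and running time is the satisficing metric:

```python
# Hypothetical candidates: accuracy is the optimizing metric,
# running time is the satisficing metric.
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},  # most accurate but too slow
]

MAX_RUNTIME_MS = 100  # the threshold we determine for the satisficing metric

# Step 1: keep only models that are "good enough" on the satisficing metric.
feasible = [m for m in models if m["runtime_ms"] <= MAX_RUNTIME_MS]

# Step 2: among those, pick the best on the optimizing metric.
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # "B": the most accurate model that meets the threshold
```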
- It is recommended to choose the dev and test sets from the same distribution: shuffle the data randomly and then split it (see the sketch below)
- As a result, both the dev and test sets contain data from all categories
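A minimal shuffle-then-split sketch in NumPy, using a placeholder array in place of a real dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10_000)   # stand-in for a real dataset of 10,000 examples

rng.shuffle(data)          # shuffle BEFORE splitting, so every split
                           # draws from the same distribution

n = len(data)
train = data[: int(0.6 * n)]              # 60% training
dev = data[int(0.6 * n): int(0.8 * n)]    # 20% dev
test = data[int(0.8 * n):]                # 20% test
```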
We should choose dev and test sets - from the same distribution - that reflect the data we expect to get in the future and consider important to do well on
- If we have a small dataset (m < 10,000)
  - 60% training, 20% dev, 20% test is a good split
- If we have a huge dataset (1M examples, for example)
  - 98% training, 1% dev, 1% test is acceptable
- Considering these two cases, we can choose an appropriate ratio; the sketch below illustrates this rule of thumb
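A small sketch of the rule of thumb above; the hard `m < 10,000` cutoff is a simplification for illustration:

```python
def split_ratios(m: int) -> tuple[float, float, float]:
    """Return (train, dev, test) fractions for a dataset of m examples."""
    if m < 10_000:
        return 0.60, 0.20, 0.20  # small dataset: classic 60/20/20
    # Huge dataset: dev/test only need enough examples to evaluate reliably.
    return 0.98, 0.01, 0.01

print(split_ratios(5_000))      # (0.6, 0.2, 0.2)
print(split_ratios(1_000_000))  # (0.98, 0.01, 0.01)
```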
- Guideline: if doing well on your metric + dev/test set does not correspond to doing well in the real-world application, change your metric and/or your dev/test set