So, I've found what I think is a meaningful downside to having `ExtPos` in the features instead of the Misc column: it's really difficult for a model to learn in imbalanced situations. For example, `a` in Spanish occurs 13,000 times in AnCora, 1,000 of them in an MWT. That's actually not an unreasonable ratio to learn. But `de` occurs about 40,000 times, only 400 of which are in an MWT. Not surprisingly, Stanza's models completely punt on this.
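To put the imbalance in numbers, here is a quick sketch using only the rough counts quoted above (they are approximations from this comment, not re-measured from AnCora; the `counts` dict and `mwt_rate` helper are just illustrative names):

```python
# Approximate counts quoted in this comment (AnCora, Spanish).
# These are rounded figures, not freshly measured from the treebank.
counts = {
    "a": {"total": 13_000, "in_mwt": 1_000},
    "de": {"total": 40_000, "in_mwt": 400},
}


def mwt_rate(total: int, in_mwt: int) -> float:
    """Fraction of a token's occurrences that fall inside an MWT."""
    return in_mwt / total


rates = {tok: mwt_rate(**c) for tok, c in counts.items()}

for tok, rate in rates.items():
    # Print the positive-class rate and the equivalent "1 in N" ratio.
    print(f"{tok}: {rate:.1%} in an MWT, about 1 in {round(1 / rate)}")
```

So the positive class is roughly 1 in 13 for `a` but only about 1 in 100 for `de`, which is the gap I'm worried about.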
@dan-zeman - I remember you saying you didn't like ExtPos in the features either...