Hi @nihonlanguageprocessing @artificialfintelligence @JDryv @stephanyvargas
We need to define the machine learning task we will be solving.
There are a few options; here are my thoughts.
Looking at the excerpt of SEMCOR data below, we see that the input is a sequence of tokens <wf>, and the prediction is a set of attributes: lemma, pos, and instance_id. I believe instance_id corresponds to a specific entry in the 国語辞典 dictionary.
<sentence id="d000.s00000">
<wf lemma="" pos="">ここ</wf>
<wf lemma="" pos="">に</wf>
<wf lemma="" pos="">ラマイム</wf>
<wf lemma="" pos="">の</wf>
<wf lemma="" pos="">子</wf>
<wf lemma="" pos="">サレトマ</wf>
<wf lemma="" pos="">という</wf>
<wf lemma="" pos="">者'</wf>
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>
I would like to use JMDict as our base dictionary instead of the 国語辞典, since it is available under a CC license. Hence, we need to convert the instance id bn:00083184v for いる into the ent_seq identifier of the same entry in JMDict: https://lindict.api.linalgo.com/v1/ja/entries/69fa42cb-12d9-4eab-9080-5fd5667e069e/.
Also, we'll need to disambiguate between different candidate entries. For example, for いる there are 10 possible candidates: https://lindict.api.linalgo.com/v1/ja/search/?query=いる
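To make these two steps concrete, here is a minimal sketch, assuming we already have a mapping from SEMCOR instance ids to JMDict ent_seq values and a way to list candidate entries per surface form. Neither exists yet: both tables below are hypothetical placeholders with made-up ent_seq values, and the names `SENSE_TO_ENT_SEQ`, `CANDIDATES`, and `gold_candidate_index` are only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from SEMCOR instance ids to JMDict ent_seq values.
# The ent_seq numbers are placeholders, not real JMDict identifiers.
SENSE_TO_ENT_SEQ: Dict[str, int] = {"bn:00083184v": 9000001}

# Hypothetical candidate entries per surface form (placeholder ent_seq values).
CANDIDATES: Dict[str, List[int]] = {"いる": [9000001, 9000002, 9000003]}


def gold_candidate_index(surface: str, instance_id: str) -> Optional[int]:
    """Return the position of the gold entry among the candidates, or None."""
    ent_seq = SENSE_TO_ENT_SEQ.get(instance_id)
    candidates = CANDIDATES.get(surface, [])
    if ent_seq is None or ent_seq not in candidates:
        # Unmapped sense, or the gold entry is missing from the candidate list.
        return None
    return candidates.index(ent_seq)


print(gold_candidate_index("いる", "bn:00083184v"))
```

Under this framing, disambiguation would become a classification over the candidate list for each tagged token, and tokens whose sense cannot be mapped would have to be skipped or handled separately.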
Some additional remarks:
- I am unsure how the SEMCOR data was tokenized. We need to make sure its tokenization is consistent with the method we are using.
- This could be cast as a sequence-to-sequence task. In fact, I am currently writing a parser that returns two sequences.
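A minimal sketch of what such a parser might look like, assuming the two sequences are the surface tokens and a per-token label (the instance id for tagged tokens, a placeholder otherwise). The function name `sentence_to_sequences` and the `"O"` placeholder are assumptions for illustration, not the actual parser:

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

NO_SENSE = "O"  # hypothetical placeholder label for tokens without a sense tag


def sentence_to_sequences(sentence_xml: str) -> Tuple[List[str], List[str]]:
    """Turn one <sentence> element into parallel token and label sequences."""
    sentence = ET.fromstring(sentence_xml)
    tokens: List[str] = []
    labels: List[str] = []
    for child in sentence:
        if child.tag == "instance":
            # Sense-tagged token: use its instance id as the label.
            tokens.append(child.text or "")
            labels.append(child.get("id", NO_SENSE))
        elif child.tag == "wf":
            # Plain token: no sense annotation.
            tokens.append(child.text or "")
            labels.append(NO_SENSE)
    return tokens, labels


# Shortened version of the excerpt above.
EXAMPLE = """<sentence id="d000.s00000">
<wf lemma="" pos="">が</wf>
<instance id="bn:00083184v" lemma="いる" pos="VERB">いる</instance>
<wf lemma="" pos="">.</wf>
</sentence>"""

tokens, labels = sentence_to_sequences(EXAMPLE)
print(tokens)
print(labels)
```

Keeping the two sequences the same length makes it easy to treat the task as token-level tagging as well, should that turn out to be a better fit than seq-to-seq.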
Please let me know your thoughts. In particular, @nihonlanguageprocessing, do you know more about the base SEMCOR task and how current state-of-the-art models solve it?