feat: ODM without categories#162
Conversation
|
Okay, I forgot to run the tests, so some linting tests may fail. If it fails, I will fix them! |
|
@romitjain please fix the tests and let me know |
|
@kmehant the builds failed due to an OSError Do you know how to fix it? |
|
Please amend or remove this commit - 0adf9c0 |
|
@kmehant How would that resolve the OSError? |
|
@romitjain thats for DCO. For no space, we need to see how to make the runner slim if we can avoid installation of any dependencies or removing existing files. Do you want to do it as part of this PR like profiling the disk? |
Signed-off-by: romit <romit@ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Padmanabha Venkatagiri Seshadri <seshapad@in.ibm.com>
0a846a5 to
931dfe2
Compare
|
@kmehant I will figure out why the linting github actions are not running, but meanwhile, you can review the logic and PR |
Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
| # distributed setup | ||
| dataloader_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True) | ||
| accelerator = Accelerator(split_batches=True, dataloader_config=dataloader_config) | ||
| accelerator = Accelerator(dataloader_config=dataloader_config) |
There was a problem hiding this comment.
What was the reason to remove split_batches=True ?
There was a problem hiding this comment.
DataLoaderConfiguration already includes split_batches. Do we still need to add it to Accelerator?
Signed-off-by: romit <romit@ibm.com>
| tokenizer: Optional[Any] = None | ||
|
|
||
|
|
||
| class DatasetAutoCategorizer: |
There was a problem hiding this comment.
Any thoughts on supporting when the dataset is iterable?
There was a problem hiding this comment.
Good catch, at least for the current implementation - no. Clustering would require all the data to be in memory, so with iterable datasets, we would need to fetch all the records and then run clustering.
This auto categorization is only suitable for smaller datasets (sub million).
There was a problem hiding this comment.
Sure, do you want to raise an error or something that iterable is currently not supported if the dataset is of that type?
Signed-off-by: romit <romit@ibm.com>
DatasetAutoCategorizerclass that creates an embedding out of text data and clusters it using KMeans (GPU accelerated)python -m pytest tests/test_auto_categorization.py