Finetune tokenizer padding is not set to custom max length #18

Description

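Hey. In `finetune.py`, the tokenizer is called with `padding=True`, which pads each batch only to the longest sequence in that batch, so the padded length changes from batch to batch. To pad every batch to the custom max length (`MAX_LEN`), `padding` should be set to `'max_length'`. Thanks.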

PROBLEM:

```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding=True,  # pads only to the longest sequence in the batch
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```

FIX:

```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding='max_length',  # pad every batch to MAX_LEN, not to the batch's longest sequence
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
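
To make the difference concrete, here is a minimal plain-Python sketch of the two padding strategies. It is a toy model of the behaviour, not the Hugging Face implementation: `pad_batch` and `PAD_ID` are hypothetical names introduced only for this illustration, and the `"longest"` mode stands in for `padding=True`.

```python
# Toy illustration of the two padding strategies; not the tokenizer's real code.
PAD_ID = 0  # hypothetical padding token id


def pad_batch(batch, padding, max_length):
    """Truncate each sequence to max_length, then pad the batch.

    padding="longest"    -> pad to the longest sequence in this batch
    padding="max_length" -> pad every sequence to max_length
    """
    truncated = [seq[:max_length] for seq in batch]
    if padding == "longest":
        target = max(len(seq) for seq in truncated)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    return [seq + [PAD_ID] * (target - len(seq)) for seq in truncated]


batch_a = [[5, 6, 7], [8, 9]]
batch_b = [[1, 2, 3, 4, 5], [6]]

# "longest": the padded width depends on which sequences are in the batch.
print([len(s) for s in pad_batch(batch_a, "longest", max_length=8)])
print([len(s) for s in pad_batch(batch_b, "longest", max_length=8)])

# "max_length": every batch comes out with the same width.
print([len(s) for s in pad_batch(batch_a, "max_length", max_length=8)])
print([len(s) for s in pad_batch(batch_b, "max_length", max_length=8)])
```

With the `"longest"` mode the two batches end up with different widths, while `"max_length"` gives both the same fixed width, which is the behaviour the fix above asks for.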
