VIMA's inference relies strongly on the images in the prompts and the objects in the environment #44

@shure-dev

I tested how robust the VIMA model is to changes in the wording of the prompt.
For example, I changed this task prompt
Put the {dragged_texture} object in {scene} into the {base_texture} object.
to
jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.
which makes no sense to a human.

I expected the model to perform poorly; however, the success rate was still almost 100%.

This needs further investigation, but I suspect the model effectively only looks at the images, i.e. it has overfitted to the image inputs and largely ignores the text.
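
For anyone who wants to repeat this on other tasks, here is a minimal sketch of how a scrambled prompt like the one above could be generated automatically. It is plain Python and does not touch the VIMA code; `scramble_template` and `PLACEHOLDER` are hypothetical names I made up for illustration, and I am assuming the image placeholders always appear as `{name}` in the template string.

```python
import random
import re
import string

# Matches the image placeholders, e.g. {dragged_texture} (assumed format).
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def scramble_template(template, seed=0):
    """Replace every word outside {placeholders} with random lowercase letters.

    Placeholders, spaces, and punctuation are left untouched, so only the
    natural-language part of the prompt becomes gibberish.
    """
    rng = random.Random(seed)  # seeded so the scrambled prompt is reproducible

    def gibberish(match):
        # Same length as the original word, but random letters.
        return "".join(rng.choice(string.ascii_lowercase) for _ in match.group())

    parts = []
    last = 0
    for m in PLACEHOLDER.finditer(template):
        # Scramble the text before this placeholder, then keep the placeholder.
        parts.append(re.sub(r"[A-Za-z]+", gibberish, template[last:m.start()]))
        parts.append(m.group())
        last = m.end()
    # Scramble whatever follows the last placeholder.
    parts.append(re.sub(r"[A-Za-z]+", gibberish, template[last:]))
    return "".join(parts)


original = "Put the {dragged_texture} object in {scene} into the {base_texture} object."
scrambled = scramble_template(original)
print(scrambled)
```

Running the same evaluation once with `original` and once with `scrambled` would show whether the text has any effect on the success rate for a given task.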
