Question
I would like to do some mech interp on a generative video model (a diffusion model with temporal attention blocks). Note that this model would have no text conditioning; it would just be a transformer predicting the next video frame (like Sora and other SOTA generative video models). It will be video-to-video (V2V). As far as I know, the current version of TransformerLens does not support a model like this, but I would like to get started on my research quite soon and hence would like to tailor TransformerLens to handle it. Two questions:
- Does it make sense for me to expand TransformerLens to this case? (i.e., would it be better to just start from scratch and not use TransformerLens for this?)
- How would I go about doing this?
While question 1 may have a simple answer, I realize that question 2 may not. I am happy to have a longer conversation about this and get working on it after that.
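For concreteness, here is a very rough sketch of the kind of temporal attention block I would want to instrument, wrapped in TransformerLens-style hook points. Everything here (`TemporalAttentionBlock`, `HookedVideoDenoiser`, the shapes and hyperparameters) is a placeholder I made up to illustrate what I mean, not the real model:

```python
# Rough sketch only -- all module names, shapes, and the toy wrapper below are
# placeholders for illustration, not the actual video model.
import torch
import torch.nn as nn
from transformer_lens.hook_points import HookedRootModule, HookPoint


class TemporalAttentionBlock(nn.Module):
    """Self-attention over the frame axis of a video latent [batch, frames, tokens, d_model]."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hook_attn_in = HookPoint()   # activations entering temporal attention
        self.hook_attn_out = HookPoint()  # activations leaving temporal attention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, t, d = x.shape
        # Fold spatial tokens into the batch so attention only mixes information across frames.
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.hook_attn_in(x)
        out, _ = self.attn(x, x, x)
        out = self.hook_attn_out(out)
        return out.reshape(b, t, f, d).permute(0, 2, 1, 3)


class HookedVideoDenoiser(HookedRootModule):
    """Toy stand-in for one denoising pass, just to show the hook plumbing."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            TemporalAttentionBlock(d_model, n_heads) for _ in range(n_layers)
        )
        self.setup()  # registers every HookPoint so run_with_cache / run_with_hooks work

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + block(x)  # residual stream over video latents
        return x


model = HookedVideoDenoiser()
latents = torch.randn(1, 8, 16, 64)  # [batch=1, frames=8, tokens=16, d_model=64]
_, cache = model.run_with_cache(latents)
print(cache["blocks.0.hook_attn_out"].shape)  # torch.Size([16, 8, 64])
```

The appeal of building on TransformerLens rather than starting from scratch would be getting `run_with_cache` / `run_with_hooks` and the surrounding tooling for free and only having to add the video/diffusion-specific pieces, but I am not sure how much of the existing `HookedTransformer` machinery would actually carry over.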