Hello @kjsman,
This is more of a feature proposal than an actual issue. Instead of requiring the user to download and unpack the tar file containing the weights and the vocabulary from your Hugging Face Hub repository, the model_loader and the Tokenizer could download and cache them directly.
For the first part, it only requires replacing torch.load(...) here (and in the other 3 functions in the same file) with
torch.hub.load_state_dict_from_url(weights_url, check_hash=True)
All it takes on your side is to upload the 4 .pt files to the Hugging Face Hub (as individual files, not inside an archive), and that's it.
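One detail worth noting: with check_hash=True, torch.hub expects the file name in the URL to follow the pattern filename-<sha256>.ext, where <sha256> is the first eight or more hex digits of the file's SHA256 hash. A minimal sketch of a helper that computes such a name before uploading could look like this (hashed_name and the file name encoder.pt are hypothetical, used only for illustration):

```python
import hashlib
from pathlib import Path


def hashed_name(path, digits=8):
    """Return '<stem>-<sha256 prefix><suffix>' for the file at `path`.

    Hypothetical helper: it only builds the name that check_hash=True
    would look for; it does not rename or upload anything.
    """
    path = Path(path)
    sha = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large weight files do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return f"{path.stem}-{sha.hexdigest()[:digits]}{path.suffix}"
```

For example, hashed_name("encoder.pt") would give something like encoder-<8 hex digits>.pt, which is the name the file would need on the hub. If renaming the files is not desirable, dropping check_hash=True should also work, at the cost of skipping the integrity check.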
As for the tokenizer, it only takes adding a default_bpe() function:
import os
from functools import lru_cache
from urllib.request import urlretrieve


@lru_cache()
def default_bpe():
    # Expected location of the vocabulary, next to this module.
    p = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz"
    )
    if not os.path.exists(p):
        # urlretrieve always returns a (filename, headers) tuple, so save
        # straight to p and return the path; later runs then reuse the file.
        urlretrieve(
            "https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz?raw=true",
            p,
        )
    return p
Another option, if you prefer to keep your vocab.json and merges.txt, is to upload them to the Hugging Face Hub as well (not in a tar file), or directly to GitHub, as the original repository does with its vocab.
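For that second option, the download-and-cache step could be sketched roughly as follows. This is only an illustration under assumptions: fetch_cached, the cache directory and the URLs are hypothetical names, not something that exists in the repository.

```python
import os
from urllib.request import urlretrieve


def fetch_cached(url, cache_dir, filename, download=urlretrieve):
    """Download `url` to cache_dir/filename once and reuse it afterwards.

    Hypothetical helper; `download` is injectable so the logic can be
    exercised without network access.
    """
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    if not os.path.exists(target):
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated file under the final name.
        tmp = target + ".part"
        download(url, tmp)
        os.replace(tmp, target)
    return target
```

The Tokenizer could then call it once for vocab.json and once for merges.txt, with whatever URLs the files end up having, and open the returned paths as it does today.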
If you like it, I will open a PR; otherwise, please let me know if you have a better idea, or close this issue if you are not interested in this feature 😄