Add configs to run int4 inference #37
RezaYazdaniAminabadi wants to merge 5 commits into huggingface:main from
Conversation
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
    quantization_bits=8 if args.dtype == 'int8' else 4,
what happens with `--dtype float16`?
probably best to set this in kwargs only if a quantization dtype is provided
The quantization-bits should not be used when running in half-precision. But I agree, we can do it in the kwargs and only for quantized inference mode.
these demos are already used by many users, so let's make them nice and clean configuration-wise, so it's clear to the reader which bits should be enabled and when.
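To make that concrete, here is a minimal sketch of the conditional kwargs, reusing the script's variable names from the diff above (`world_size`, `repo_root`, `infer_dtype`, `args`); the exact kwargs accepted by `deepspeed.init_inference` depend on the installed DeepSpeed version:

```python
# Hedged sketch: only pass quantization_bits for quantized dtypes, so that
# --dtype float16 is unaffected. Variable names mirror the script above.
kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
)
if args.dtype in ("int8", "int4"):
    # 8 bits for int8, 4 bits for int4
    kwargs["quantization_bits"] = 8 if args.dtype == "int8" else 4

model = deepspeed.init_inference(model, **kwargs)
```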
    input_sentences *= math.ceil(args.batch_size / len(input_sentences))

-   generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
+   generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)
this is already a very different type of change.
If int4 requires do_sample=True, then again, let's change it only if it's --dtype int4
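A one-line sketch of that conditional, using the script's `args.dtype` and `num_tokens`:

```python
# Hedged sketch: sample only for int4, keeping greedy decoding otherwise.
generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=(args.dtype == "int4"))
```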
Sure, I will double-check do_sample=False again to see if the generated text makes sense. If not, I will set it to true for int4.
I just checked with do_sample=False and I see the text is produced in the same way as for FP16 and INT8:
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep learning. It is a Python library, and it is also a framework. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and
So, I am gonna turn it off for now.
Also should probably assert [...] if I can take care of that.
Sounds good to me. Thanks @stas00
    model_name = args.name

-   infer_dtype = args.dtype
+   infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
would it make for a more user-friendly API to

- keep the `dtype` intact
- drop `quantization_bits`
- let `deepspeed.init_inference` derive the number of bits from `dtype`?

not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information twice, in `dtype` and in `quantization_bits`
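For illustration, a hedged sketch of deriving the bit width from the dtype string via a hypothetical helper (not an existing DeepSpeed function):

```python
# Hypothetical helper: map the user-facing dtype string to a quantization bit
# width; returns None for non-quantized dtypes such as float16/bfloat16.
def bits_from_dtype(dtype_str):
    return {"int8": 8, "int4": 4}.get(dtype_str)
```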
oh, wait, torch.int4 still doesn't exist, does it?
let's find the feature request.
still not implemented pytorch/pytorch#74627
so that's why you had to do the odd workarounds, right?
I guess we can drop it once it's implemented, @stas00?
For now, this might be the best way to do it.
it's pointless to wait, since they won't have int3 and int12
> would it make for a more user-friendly API to
>
> - keep the `dtype` intact
> - drop `quantization_bits`
> - let `deepspeed.init_inference` derive the number of bits from `dtype`?
>
> not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information twice, in `dtype` and in `quantization_bits`
@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to `init_inference`. We are keeping it backwards compatible, but if we are okay to make changes to this file, I would advocate for writing a config dictionary for DeepSpeed and passing that to `init_inference` instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173
Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
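A minimal sketch of that config-dict style, modelled on the linked gist; the key names follow the DeepSpeedInferenceConfig docs above but may differ across DeepSpeed versions, and the model name here is just a small stand-in:

```python
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # stand-in model

ds_config = {
    "dtype": torch.float16,                      # inference dtype
    "tensor_parallel": {"tp_size": world_size},  # tensor-parallel degree
    "replace_with_kernel_inject": True,          # use DeepSpeed inference kernels
}

model = deepspeed.init_inference(model, config=ds_config)
```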
That definitely works.
@awan-10, may I suggest you make the inference config accept `dict_or_path` just like ZeRO does? It might be easier for some users to write out a separate file.
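A sketch of the `dict_or_path` pattern being referred to (hypothetical helper, mirroring how ZeRO configs accept either form):

```python
import json

def resolve_config(dict_or_path):
    # Accept an in-memory dict as-is, or treat the argument as a path to a
    # JSON config file, as ZeRO configs already allow.
    if isinstance(dict_or_path, dict):
        return dict_or_path
    with open(dict_or_path) as f:
        return json.load(f)
```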
@stas00 - thanks for the suggestion. Created an issue so we can track it: deepspeedai/DeepSpeed#2532. Mike and I will work on it.
OK, I think I understand the limitations of pytorch, and it'll get only worse when you try [...]. I propose we break the currently proposed API and draw a better one, with only 2 user-configurable args related to how deepspeed-inference operates [...].

Now the API is simple, unambiguous and future-proof (as in [...]). For back-compat [...]. What do you think, Reza?
Huh?
Hi @stas00,

In that case, we [...]

In this case, we can simply pass the bits to the DeepSpeed-inference config:
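The snippet from the original comment did not survive extraction; a hedged reconstruction of the idea, with the `quant`/`num_bits` key names assumed from the DeepSpeedInferenceConfig docs and possibly differing between DeepSpeed versions:

```python
import torch

ds_config = {
    "dtype": torch.int8,                   # storage dtype for the quantized weights
    "quant": {"weight": {"num_bits": 4}},  # assumed keys: requested bit width
}
```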
may I suggest that the just added [...]? Why not have a flat structure of simple key=value pairs; once you've got the info on your side you can re-arrange it to any nesting level you want.
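For illustration, a hedged sketch of that flat structure (hypothetical key names), re-nested internally:

```python
# Flat user-facing keys (hypothetical) ...
flat = {"quant_bits": 4, "quant_groups": 1}

# ... re-arranged into whatever nesting the DeepSpeed side wants.
nested = {"quant": {"weight": {"num_bits": flat["quant_bits"],
                               "q_groups": flat["quant_groups"]}}}
```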
I agree, let me work on that and fix it.
@RezaYazdaniAminabadi -- please see my comment above: #37 (comment)
thanks @awan-10. Please go ahead and push your changes.
Update bloom-ds-inference.py
force-pushed from 235af1a to 134b703
force-pushed from abfc97f to 9d48dbf
force-pushed from 655179f to 114b912
Add some minor config changes to support int4 inference through DeepSpeed-Inference.
The int4 support itself will be added to DeepSpeed through a separate PR.
cc: @stas00