Skip to content

RISCV: compressed insn treats the uncompressed version as an alias#2959

Draft
slate5 wants to merge 2 commits into
capstone-engine:nextfrom
slate5:fix/uncompressed-alias
Draft

RISCV: compressed insn treats the uncompressed version as an alias#2959
slate5 wants to merge 2 commits into
capstone-engine:nextfrom
slate5:fix/uncompressed-alias

Conversation

@slate5

@slate5 slate5 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Your checklist for this pull request

  • I've documented or updated the documentation of every API function and struct this PR changes.
  • I've added tests that prove my fix is effective or that my feature works (if possible)

Detailed description

This draft is connected to #2923.
It's a very simple change. All compressed instructions that have an uncompressed counterpart use that instruction as an alias. A better solution would be to include a new uncompressed_id instead of an alias, but that would require more modifications (cs_insn)...

Test plan

...

Closing issues

...

@github-actions github-actions Bot added the RISCV Arch label Jun 9, 2026
@slate5 slate5 force-pushed the fix/uncompressed-alias branch from 646c2ac to bc9042a Compare June 9, 2026 03:41

@Rot127 Rot127 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some tests of course.
Otherwise I think we can start with that one.

@slate5

slate5 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Will do, but let's see what @moste00 says

@moste00

moste00 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Hello @Rot127 @slate5, thanks for notifying me of this.

So a couple of problems that I see:

1- We're still keeping the "(un)compressing is the same thing as aliasing" paradigm, and the flags of noalias and noaliascompressed with their old meanings. Didn't we talk about removing them because LLVM doesn't see compression as an aliasing ? will we remove them in a later PR or just give up and keep them for now?

2- The logic assigns a normal CS instruction ID (the ID of the uncompressed instruction) to the alias ID, but are the 2 ID spaces mutually exclusive? What if they intersect in some IDs ?

3- While I didn't run this yet as I'm not on my laptop now, the details logic will probably not solve my problem because when it prints the details for if(is_uncompressed) path, it prints for the MI struct, not McInstr. MI is the original compressed instruction, so this will print the original details of the compressed instruction. I want the details of the uncompressed instruction.

All in all: I think the main problem here is that is that we're just keeping most of the logic as-is and patching 1 problem ad-hoc instead of refactoring enough, but if both of you think we're not ready for the somewhat complicated approach in the other PR, and if (3) is addressed without breaking any existing tests, I'm okay with this PR.

Personally, I would like the other approach I described in the other PR, it just exhaustively lists all possibilities and lays out the exact thing to do for each. I might be biased though and if you think it's unclear or too verbose a description then so will capstone users. Maybe I just mis-explained it ?

@slate5

slate5 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Hi @moste00,

1- We're still keeping the "(un)compressing is the same thing as aliasing" paradigm...

Yes, that is what I also addressed. Maybe it would be better to have a separate cs_insn element for this specifically (eg, uncompressed_id). On the other hand, if we start complicating with noalias, nocompressed, keep this or that, we have to expect that the user is an expert in riscv. I will be the first to forget this "mess" with compressed insn, and its "aliases" in a week. From the perspective of an asm writer, it makes sense like this because in your source file you put addi t0, t0, -1 and the assembler translates that into c.addi (same reg is used and imm is in range of signed 2^6) or not if imm is bigger or there is no riscv C extension on system/assembler. What I'm pointing out with this is that addi is kinda an alias either way because until the assembler decides its encoding, it stands as a logical representation of some functionality and not a raw instruction.

I'm making this long... I lean towards fewer features (flags) that tune options without the explicit need of end users. It's clutter, and i'm not sure if there is demand for it. To me, this differentiation between +noaliascompressed and +noalias (as it is now) is irrelevant. Either i want aliases or not, i don't care if insn is compressed. As you said in the other PR, "Or you can do uncompression alone but stop at just the alias mapping, using +noaliascompressed" makes more sense to redefine the purpose of it (it would stop the logic of this PR, no uncompressed aliases, e.g., c.addi doesn't have an alias ID).

2- The logic assigns a normal CS instruction ID (the ID of the uncompressed instruction) to the alias ID, but are the 2 ID spaces mutually exclusive? What if they intersect in some IDs ?

They are in the same enum riscv_insn, separated by RISCV_INS_ENDING and RISCV_INS_ALIAS_BEGIN, if that is your question

3- While I didn't run this yet as I'm not on my laptop now...

But if you want real details, specify -r or what am i missing?

I don't think this PR solution is great, but i'm trying to stick to occam's razor :) And yes, I think if you explain to me a bit simpler in terms of what the end goal of your approach is, and what problem it solves (or what benefit it gives), it would help me a lot

@moste00

moste00 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

3- While I didn't run this yet as I'm not on my laptop now...

But if you want real details, specify -r or what am i missing?

Hey, sorry for saying that I wil test this at the time then delaying it so much. I was busy.

So what I mean is: I want to be able to give the -r flag and that would mean the "real" details of compressed instruction (those of the uncompressed equivalent) would appear. This is very useful to me in (e.g.) Rizin lifting, where I can just lift all the non-compressed instructions and then every c. instruction is just a special case, but this only works if the c. instruction can be made to always output the details of their uncompressed equivalents.

In some sense, this is the opposite POV to what you're saying. You're thinking of addi as the fictitious alias and the c_addi as the real instruction, and you're right in the sense that when a programmer writes "addi" in the asm file but the assembler writes "c_addi" then indeed it IS "addi" that is the fictional alias that doesn't really exist in this situation.

However, my POV is that the "original" 4-byte instructions are the "real" ones, and the C style instructions are just aliases for shorthand operations. It's kinda how in Bash you define alias commands to achieve the same thing as a longer command but with shorter text, both truly fictitious aliases (like ret) and the C extension are like aliases in this way. The fictitious aliases are shorter in text but not in bytes, and the C extension instruction are shorter in bytes but not in text or description. In both cases, they achieve the same effect as an plain 4-byte instruction but being "shorter" in some way. That's why I see them both as the same kind of thing.

All in all, your PR doesn't break the test suite either way, even if you change MI to McInstr in the way that would achieve my use case (as suggested in review). So maybe we can call this whole thing done and simplify the logic another day.

Comment thread arch/RISCV/RISCVInstPrinter.c Outdated
MI->flat_insn->detail->riscv.op_count = 0;
// re-disassemble again with no printing in order to obtain the full details
// including the whole operands array
printInstruction(MI, MI->address, O);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usse McInstr here instead of MI, so that if the instruction is uncompressed the details would be those of the uncompressed equivalent.

Comment thread arch/RISCV/RISCVInstPrinter.c Outdated
MI->flat_insn->detail->riscv.op_count = 0;
// re-disassemble again with no printing in order to obtain the full details
// including the whole operands array
printInstruction(MI, MI->address, O);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

@moste00

moste00 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

I left some comments on the code, if you apply them and no tests are broken, then LGTM.

(I already applied them on my local branch and they didn't break any tests, but check again just in case.)

@slate5

slate5 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Now I am sorry for the delay, hehe
Thank you for clarifying things and the reasoning behind it.

So what I mean is: I want to be able to give the -r flag and that would mean the "real" details of compressed instruction (those of the uncompressed equivalent) would appear. This is very useful to me in (e.g.) Rizin lifting, where I can just lift all the non-compressed instructions and then every c. instruction is just a special case, but this only works if the c. instruction can be made to always output the details of their uncompressed equivalents.

Ok, so you want the opposite behaviour for uncompressed ones with -r, so that operands are consistent between compressed and full instructions. From the ISA perspective, that doesn't make a lot of sense, but I understand your POV, and it makes sense. Now i'm thinking about new flag, haha... nvm

Your explanations about aliases and Bash are okay with me. I cannot see it fully like that, but ofc c. can be seen as a shorthand alias for the real insn. Thinking about those writebacks reaffirms this perspective (c.addi sp, -16, sp is RW, which means first cpu reads from it, adds imm, then writes to it. So, it's still addi sp,sp,-16).

This is set then, but i would like to hear your opinion on @Rot127's last msg in #2923. I would like to replace noaliascompressed with keepcompressed, and the role would be to just ignore this uncompressed complexity. Is that ok?

Also, how much complexity would be added to Rizin if we don't use alias_id for this uncompressed stuff, but instead add a new one, uncompressed_id?

I'm just gonna swap those two (MI and McInstr) as u proposed, but to get this opposite functionality completely (even with +noalias), we have to do more changes.

@slate5 slate5 force-pushed the fix/uncompressed-alias branch from bc9042a to 66b162f Compare June 16, 2026 05:34
@moste00

moste00 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@slate5 Thank you!

Yes, I read Rot's comment on the other PR and I like it, I was planning to implement it. If you want to implement it then of course be my welcome guest, although don't feel pressured to, it was originally my idea to do the refactoring so I don't want to just throw work on you!

That said, if you want to implement it, of course I have no objections. If you don't feel like it then it's totally OK and I can resume the unfinished other PR.

While you're at it, you can also seperate the details from the text, i.e. make it so the text is completely independent from details, such that I can request -r while still using +keepcompressed and everything will work fine: the compressed text is used but -r makes the details like those of the uncompressed equivalent. This can be implemented by just making the details-checking a seperate pass done after the text printing and indepedent from it, like the code in the other PR does.

So in short: yes, definitely feel free to implement Rot's modification, but don't feel pressured because it might be a hassle and I don't want to assign boring work to you. (in particular, tests might break and require passing extra flags for it to work.)

Also, how much complexity would be added to Rizin if we don't use alias_id for this uncompressed stuff, but instead add a new one, uncompressed_id?

No complexity either way, Rizin doesn't use any of those IDs to drive any interesting analysis as far as I'm aware, but adding uncompressed_id will be cleaner in general, this is because uncompression and aliases are orthogonal, for some instructions both can happen, so two fields encode the full detail of what happened to the instruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

RISCV Arch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants