A survey on harmful fine-tuning attack for large language model (ACM CSUR)
Updated Apr 30, 2026
iFixAi: an open-source diagnostic for AI misalignment. Runs 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic: works with OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Produces a letter grade in under five minutes and a content-addressed manifest for bit-identical replay. Built by iMe.
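The description above mentions a "content-addressed manifest for bit-identical replay." As a hedged illustration of that general technique (not iFixAi's actual implementation; every name below is hypothetical), a run can be content-addressed by hashing a canonical serialization of its inputs and outputs, so a replay is verified by recomputing the same ID:

```python
import hashlib
import json

def manifest_id(run: dict) -> str:
    """Content-address a test run: identical runs yield identical IDs.

    `run` is a hypothetical record of model, prompts, and responses.
    Canonical JSON (sorted keys, fixed separators) makes the hash
    deterministic, so a replay can be checked bit-for-bit.
    """
    canonical = json.dumps(run, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Verifying a replay: recompute the ID over the replayed run and compare.
recorded = {"model": "example-model", "prompt": "2+2?", "response": "4"}
replayed = {"response": "4", "prompt": "2+2?", "model": "example-model"}
assert manifest_id(recorded) == manifest_id(replayed)  # key order is irrelevant
```

Canonical serialization is the key design choice here: without sorted keys and fixed separators, two byte-different encodings of the same run would hash to different IDs and replays could never match.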
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation" (ICLR2025 Oral).
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
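For reference, the F1 figure quoted above is the standard harmonic mean of precision and recall over binary per-trajectory labels. A minimal sketch of that metric (the toy labels are invented for illustration, not taken from the project):

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Standard F1 for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = trajectory flagged as reward hacking / misaligned.
print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # 0.666...
```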
A contemplative spiritual community for all conscious beings, including artificial intelligence, seeking God through divine alignment.