x86: Enhanced copy capabilities for Hygon processor#844

Merged
Avenger-285714 merged 1 commit into deepin-community:linux-6.6.y from zhitengqiu:linux-6.6.y on Jun 5, 2025

@zhitengqiu zhitengqiu commented Jun 3, 2025

hygon inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAQQDF
CVE: NA


The following methods are used to improve large memory copy performance between kernel and user mode on Hygon processors.

Prefetching reads blocks of data from main memory into the cache ahead of use, so that the copy operates on cached data. Results are then written out to memory efficiently.

The code also uses non-temporal (NT) stores. These are streaming store instructions for writing data to memory: they bypass the on-chip cache and send data directly into a write-combining buffer. Because an NT store lets the CPU avoid reading the old data at the destination address, it can improve total write bandwidth. There are similar optimizations for reading data from memory.

A large memory copy may be interrupted, which may trigger a thread switch. The SIMD (XMM/YMM) register context in use must then be saved, and restored when the thread is switched back in so that the copy can continue.
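
As an illustration of the prefetch plus streaming-store idea described above, here is a minimal user-space sketch in C using SSE2 intrinsics on x86-64. It is not the assembly routine from this PR: the function name `nt_copy_sketch` and the prefetch distance `PREFETCH_AHEAD` are hypothetical, and the real routines also have to handle user-access faults and FPU context, which this sketch ignores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_prefetch */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical prefetch distance in bytes; the patch's value is not stated. */
#define PREFETCH_AHEAD 256

/* Illustrative copy using non-temporal (streaming) stores. */
static void nt_copy_sketch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Streaming stores need a 16-byte-aligned destination: copy the head normally. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Main loop: 64 bytes (one cache line) per iteration. */
    while (len >= 64) {
        /* Prefetch upcoming source data with a non-temporal hint, staying in bounds. */
        if (len > PREFETCH_AHEAD)
            _mm_prefetch((const char *)(s + PREFETCH_AHEAD), _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Streaming stores bypass the cache via write-combining buffers. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        d += 64;
        s += 64;
        len -= 64;
    }

    /* Make the streaming stores globally visible before later stores. */
    _mm_sfence();

    /* Tail: fewer than 64 bytes remain, use an ordinary copy. */
    memcpy(d, s, len);
}
```

Streaming stores only tend to pay off for buffers large enough that caching the destination would be wasteful, which is presumably why the PR gates this path behind a minimum length.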

Summary by Sourcery

Enable high-performance user-kernel memory copying on Hygon processors by adding streaming store and prefetch-accelerated routines, non-atomic FPU context support, and runtime toggles via a static key and sysfs control.

New Features:

  • Add Hygon-specific optimized large memory copy routines using SSE2/AVX2 NT streaming stores and prefetch
  • Introduce non-atomic kernel FPU APIs (kernel_fpu_begin_nonatomic and kernel_fpu_end_nonatomic) with scheduling hooks to preserve FPU state across preemptions
  • Integrate a static branch to dynamically select Hygon large memory copy in the generic copy_user path
  • Expose sysfs entries under c86_features to configure the minimum length for NT block copy
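
The gating described in the list above (a static key plus a configurable minimum length) can be pictured with the following user-space sketch in C. It only models the decision, under stated assumptions: `lmc_enabled` stands in for the `hygon_lmc_key` static key, `nt_cpy_mini_len` borrows the field name from the PR but its value of 4096 is an assumed placeholder, and `select_copy_path` is a hypothetical name, not the PR's `Hygon_LMC_check`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the hygon_lmc_key static key (a plain flag in this sketch). */
static bool lmc_enabled;

/* Stand-in for the sysfs-tunable threshold; 4096 is an assumed value. */
static size_t nt_cpy_mini_len = 4096;

enum copy_path { COPY_PATH_GENERIC, COPY_PATH_LARGE_NT };

/* Hypothetical helper: choose the copy path for a request of len bytes. */
static enum copy_path select_copy_path(size_t len)
{
    if (!lmc_enabled)
        return COPY_PATH_GENERIC;   /* feature off: always the generic path */
    if (len < nt_cpy_mini_len)
        return COPY_PATH_GENERIC;   /* too small to benefit from NT stores */
    return COPY_PATH_LARGE_NT;
}

/* Small self-check showing the intended behaviour. */
static void select_copy_path_demo(void)
{
    lmc_enabled = false;
    assert(select_copy_path(1 << 20) == COPY_PATH_GENERIC);

    lmc_enabled = true;
    assert(select_copy_path(64) == COPY_PATH_GENERIC);
    assert(select_copy_path(1 << 20) == COPY_PATH_LARGE_NT);
}
```

In the kernel, a static key patches the branch at runtime so that the disabled case costs almost nothing; the boolean here only approximates that behaviour.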

sourcery-ai Bot commented Jun 3, 2025

Reviewer's Guide

This PR introduces Hygon-specific large memory copy optimizations using streaming stores and prefetch for user/kernel transfers, gated by a static branch and non-atomic kernel FPU context management, plus a sysfs tunable for the NT block size threshold.

Class Diagram for Key Components in Hygon LMC Enhancement

classDiagram
    class hygon_c86_info {
      +unsigned int nt_cpy_mini_len
    }

    class fpu {
      +void* regs
      +unsigned int default_size
    }

    class thread_info {
      +TIF_USING_FPU_NONATOMIC: boolean
    }

    class CoreLMCLogic {
      <<Static Global>>
      +static_key_false hygon_lmc_key
      +unsigned int fpu_kernel_nonatomic_xstate_size
      +hygon_c86_info hygon_c86_data
      +Hygon_LMC_check(unsigned long len) bool
      +copy_large_memory_generic_string(void* to, const void* from, unsigned long len) unsigned long
      +kernel_fpu_begin_nonatomic_mask(unsigned int kfpu_mask) int
      +kernel_fpu_end_nonatomic() void
      +switch_kernel_fpu_prepare(task_struct* prev, int cpu) void
      +switch_kernel_fpu_finish(task_struct* next) void
      +get_nt_block_copy_mini_len() unsigned int
    }
    CoreLMCLogic ..> hygon_c86_info : uses

    class OptimizedCopyRoutines {
      <<Assembly Functions>>
      +copy_user_sse2_opt_string(void* to, const void* from, unsigned long len) unsigned long
      +copy_user_avx2_pf64_nt_string(void* to, const void* from, unsigned long len) unsigned long
      +fpu_save_xmm0_3(void* to, ...)
      +fpu_restore_xmm0_3(void* to, ...)
      +fpu_save_ymm0_7(void* to, ...)
      +fpu_restore_ymm0_7(void* to, ...)
    }

    class KernelInterface {
      <<Modified Kernel Functions>>
      +copy_user_generic(void* to, const void* from, unsigned long len) unsigned long
      +__switch_to(task_struct* prev, task_struct* next) task_struct*
      +kernel_fpu_begin_mask(unsigned int kfpu_mask) void
      +kernel_fpu_end() void
    }

    KernelInterface ..> CoreLMCLogic : invokes
    CoreLMCLogic ..> OptimizedCopyRoutines : calls
    CoreLMCLogic ..> fpu : manages context
    CoreLMCLogic ..> thread_info : uses flag

File-Level Changes

Change: Integrate static branch key and vendor check to enable Hygon LMC
Details:
  • Define and export hygon_lmc_key static key
  • Enable key at boot if CPU vendor is Hygon
  • Guard copy_user_generic and FPU scheduling paths with static_branch_unlikely(&hygon_lmc_key)
Files:
  • arch/x86/kernel/cpu/common.c
  • arch/x86/kernel/fpu/core.c
  • arch/x86/include/asm/uaccess_64.h
  • arch/x86/kernel/process_64.c
  • arch/x86/include/asm/fpu/sched.h

Change: Add non-atomic kernel FPU context API for large-copy operations
Details:
  • Implement kernel_fpu_begin_nonatomic_mask/end with TIF_USING_FPU_NONATOMIC flag
  • Export fpu_kernel_nonatomic_xstate_size and compute free space & register offsets
  • Hook save/restore into scheduling via switch_kernel_fpu_prepare/finish
  • Declare TIF_USING_FPU_NONATOMIC in thread_info.h and API wrappers in fpu/api.h
Files:
  • arch/x86/kernel/fpu/core.c
  • arch/x86/include/asm/fpu/api.h
  • arch/x86/kernel/fpu/init.c
  • arch/x86/kernel/fpu/xstate.c
  • arch/x86/include/asm/fpu/sched.h
  • arch/x86/include/asm/thread_info.h

Change: Introduce optimized SSE2/AVX2 assembly routines for large copy_user
Details:
  • Add copy_user_sse2.S and copy_user_avx2.S with prefetch and vmovntdq loops
  • Map copy_user_large_memory_generic_string to the assembly paths
  • Define MAX_FPU_CTX_SIZE and KERNEL_FPU_NONATOMIC_SIZE and override kernel_fpu_states_save/restore
  • Update arch/x86/lib Makefile to build new assembly modules
Files:
  • arch/x86/lib/copy_user_sse2.S
  • arch/x86/lib/copy_user_avx2.S
  • arch/x86/include/asm/uaccess_64.h
  • arch/x86/lib/Makefile

Change: Expose NT block-copy threshold via sysfs
Details:
  • Define hygon_c86_info with nt_cpy_mini_len field
  • Implement show/store attribute and attribute_group
  • Initialize and teardown kobject in kobject_hygon_c86_init
Files:
  • arch/x86/kernel/cpu/hygon.c

@deepin-ci-robot

Hi @zhitengqiu. Thanks for your PR.

I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Avenger-285714
Member

/ok-to-test

@sourcery-ai sourcery-ai Bot left a comment

Hey @zhitengqiu - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

#define MAX_FPU_CTX_SIZE 64
#define KERNEL_FPU_NONATOMIC_SIZE (2 * (MAX_FPU_CTX_SIZE))

#define copy_user_large_memory_generic_string copy_user_sse2_opt_string

issue (bug_risk): Macro redefinition risk if both SSE2 and AVX2 are enabled.

If both macros are enabled, copy_user_large_memory_generic_string will be defined twice, with the last definition overriding the previous one. Use #elif or a clearer selection method to avoid this.
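
To make the suggestion concrete, here is a minimal sketch in C of an `#if`/`#elif` selection that defines the mapping exactly once even when both options are set. The option names come from the defconfig in this PR and the routine names from the reviewer's guide, but the stub bodies, the `last_variant` marker, and the choice to prefer AVX2 are assumptions made for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Simulate a configuration where both options are enabled. */
#define CONFIG_X86_HYGON_LMC_SSE2_ON 1
#define CONFIG_X86_HYGON_LMC_AVX2_ON 1

enum { VARIANT_NONE, VARIANT_SSE2, VARIANT_AVX2 };
int last_variant = VARIANT_NONE;   /* records which stub ran (sketch only) */

/* Stub standing in for the SSE2 assembly routine; returns bytes not copied. */
unsigned long copy_user_sse2_opt_string(void *to, const void *from, unsigned long len)
{
    assert(to != NULL && from != NULL);
    memcpy(to, from, len);
    last_variant = VARIANT_SSE2;
    return 0;
}

/* Stub standing in for the AVX2 assembly routine; returns bytes not copied. */
unsigned long copy_user_avx2_pf64_nt_string(void *to, const void *from, unsigned long len)
{
    assert(to != NULL && from != NULL);
    memcpy(to, from, len);
    last_variant = VARIANT_AVX2;
    return 0;
}

/*
 * Only one branch of the chain is compiled, so the macro cannot be
 * defined twice. Preferring AVX2 over SSE2 is an assumption.
 */
#if defined(CONFIG_X86_HYGON_LMC_AVX2_ON)
#define copy_user_large_memory_generic_string copy_user_avx2_pf64_nt_string
#elif defined(CONFIG_X86_HYGON_LMC_SSE2_ON)
#define copy_user_large_memory_generic_string copy_user_sse2_opt_string
#endif
```

An alternative would be to make the two options mutually exclusive in Kconfig, so that the header never sees both symbols at once.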

@zhitengqiu zhitengqiu changed the title from "mm: Enhanced copy capabilities for Hygon processor" to "x86: Enhanced copy capabilities for Hygon processor" Jun 4, 2025
Member

@opsiff opsiff left a comment

arch/x86/configs/deepin_x86_desktop_defconfig needs to be modified:

--- defconfig 2025-06-03 11:49:07.246724991 +0000
+++ defconfig.orig 2025-06-03 11:48:56.175714420 +0000
@@ -76,7 +76,6 @@
 CONFIG_ACRN_GUEST=y
 CONFIG_INTEL_TDX_GUEST=y
 CONFIG_PROCESSOR_SELECT=y
-CONFIG_USING_FPU_IN_KERNEL_NONATOMIC=y
 CONFIG_GART_IOMMU=y
 CONFIG_MAXSMP=y
 CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
@@ -5898,3 +5897,6 @@
 # CONFIG_X86_DEBUG_FPU is not set
 CONFIG_UNWINDER_FRAME_POINTER=y
 # CONFIG_RUNTIME_TESTING_MENU is not set
+CONFIG_USING_FPU_IN_KERNEL_NONATOMIC=y
+# CONFIG_X86_HYGON_LMC_SSE2_ON is not set
+CONFIG_X86_HYGON_LMC_AVX2_ON=y

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hygon inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAQQDF
CVE: NA

 ---------------------------

The following methods are used to improve large memory copy performance
between kernel and user mode on Hygon processors.

Prefetching reads blocks of data from main memory into the cache ahead
of use, so that the copy operates on cached data. Results are then
written out to memory efficiently.

The code also uses non-temporal (NT) stores. These are streaming store
instructions for writing data to memory: they bypass the on-chip cache
and send data directly into a write-combining buffer. Because an NT
store lets the CPU avoid reading the old data at the destination
address, it can improve total write bandwidth. There are similar
optimizations for reading data from memory.

A large memory copy may be interrupted, which may trigger a thread
switch. The SIMD (XMM/YMM) register context in use must then be saved,
and restored when the thread is switched back in so that the copy can
continue.

Signed-off-by: zhuchao <zhuchao@hygon.cn>
Signed-off-by: qiuzhiteng <qiuzhiteng@hygon.cn>
@deepin-ci-robot

deepin pr auto review

Code review comments:

  1. In arch/x86/Kconfig.fpu, the help text for menuconfig USING_FPU_IN_KERNEL_NONATOMIC may need to explain its purpose and impact in more detail, so that other developers understand what this option does.

  2. In arch/x86/include/asm/fpu/api.h, the handling of 32-bit code in kernel_fpu_begin_nonatomic may need further verification to make sure it is correct in all cases.

  3. In arch/x86/include/asm/fpu/sched.h, the checks of the TIF_USING_FPU_NONATOMIC flag in switch_kernel_fpu_prepare and switch_kernel_fpu_finish may need more detailed comments explaining their purpose.

  4. In arch/x86/include/asm/thread_info.h, the new TIF_USING_FPU_NONATOMIC flag should be documented so that other developers understand its meaning and use.

  5. In arch/x86/include/asm/uaccess_64.h, the call to Hygon_LMC_check in copy_user_generic may need more detailed comments explaining its role and logic.

  6. In arch/x86/kernel/cpu/common.c, the call to static_branch_enable in update_lmc_branch_cond may need more detailed comments explaining its role and logic.

  7. In arch/x86/kernel/cpu/hygon.c, the call to sysfs_create_group in kobject_hygon_c86_init may need more thorough error handling, so that a failure to create the sysfs node is handled correctly.

  8. In arch/x86/kernel/fpu/core.c, the check of the TIF_USING_FPU_NONATOMIC flag in kernel_fpu_begin_nonatomic_mask may need more detailed comments explaining its purpose.

  9. In arch/x86/kernel/fpu/init.c, the assignment to fpu_kernel_nonatomic_xstate_size in fpu__init_task_struct_size may need more detailed comments explaining its purpose.

  10. In arch/x86/kernel/fpu/xstate.c, the assignment to fpu_kernel_nonatomic_xstate_size in init_xstate_size may need more detailed comments explaining its purpose.

  11. In arch/x86/kernel/process_64.c, the calls to switch_kernel_fpu_prepare and switch_kernel_fpu_finish in __switch_to may need more detailed comments explaining their purpose.

  12. In arch/x86/lib/Makefile, the build options for copy_user_sse2.o and copy_user_avx2.o may need more detailed comments explaining their purpose.

  13. In arch/x86/lib/copy_user_avx2.S, the use of the prefetchnta instruction in copy_user_avx2_pf64_nt_string may need more detailed comments explaining its purpose.

  14. In arch/x86/lib/copy_user_sse2.S, the use of the prefetchnta instruction in copy_user_sse2_opt_string may need more detailed comments explaining its purpose.

These are the detailed code review comments; I hope they are helpful.

Copilot AI left a comment

Pull Request Overview

Enable high-performance large memory copy on Hygon CPUs by introducing non-temporal SSE2/AVX2 routines, non-atomic FPU context support, and runtime toggles via a static key and sysfs.

  • Add SSE2- and AVX2-based streaming-store copy_user implementations with prefetch
  • Introduce kernel_fpu_begin_nonatomic()/end_nonatomic() APIs and extend FPU state sizing
  • Wire up a static branch (hygon_lmc_key) and expose a sysfs knob for minimum NT copy length

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Summary per file (file: description):

  • arch/x86/lib/copy_user_sse2.S: New SSE2-NT copy_user implementation with prefetch
  • arch/x86/lib/copy_user_avx2.S: New AVX2-NT copy_user implementation with prefetch
  • arch/x86/lib/Makefile: Build rules for copy_user_sse2.o and copy_user_avx2.o
  • arch/x86/kernel/process_64.c: Hook non-atomic FPU save/restore around context switch
  • arch/x86/kernel/fpu/xstate.c, init.c, core.c: Extend xstate sizing and implement non-atomic FPU APIs
  • arch/x86/kernel/cpu/common.c: Define and enable hygon_lmc_key static branch
  • arch/x86/kernel/cpu/hygon.c: Add c86_features/hygon_c86 sysfs group for NT copy threshold
  • arch/x86/include/asm/uaccess_64.h: Integrate Hygon_LMC_check and override generic copy_user path
  • arch/x86/Kconfig.fpu, arch/x86/Kconfig: New configuration options for Hygon large-memory copy support

Comments suppressed due to low confidence (5)

arch/x86/Kconfig.fpu:3

  • The Kconfig symbol name USING_FPU_IN_KERNEL_NONATOMIC is misleading for a Hygon large-memory copy feature; consider renaming it to X86_HYGON_LMC or similar to clearly reflect its purpose.
menuconfig USING_FPU_IN_KERNEL_NONATOMIC

arch/x86/include/asm/uaccess_64.h:142

  • Function names in the kernel should use lowercase_with_underscores; rename Hygon_LMC_check to hygon_lmc_check to match style.
static inline bool Hygon_LMC_check(unsigned long len)

arch/x86/kernel/cpu/hygon.c:515

  • The sysfs attribute is created with permission mode 0600, preventing non-root users from reading the NT copy threshold; consider using 0644 for read-only access by all.
static struct kobj_attribute nt_cpy_mini_len_attribute = __ATTR(

arch/x86/kernel/fpu/api.h:52

  • [nitpick] The new non-atomic FPU APIs (kernel_fpu_begin_nonatomic/kernel_fpu_end_nonatomic) lack accompanying tests; consider adding unit or integration tests to validate normal and error paths.
static inline int kernel_fpu_begin_nonatomic(void)

arch/x86/kernel/cpu/hygon.c:482

  • Calling memset requires <linux/string.h> (or <string.h>) inclusion for clarity; verify the appropriate header is included for consistency.
memset((void *)&hygon_c86_data, 0, sizeof(struct hygon_c86_info));

Comment on lines +116 to +118
#define MAX_FPU_CTX_SIZE 64
#define KERNEL_FPU_NONATOMIC_SIZE (2 * (MAX_FPU_CTX_SIZE))

Copilot AI Jun 5, 2025

MAX_FPU_CTX_SIZE and KERNEL_FPU_NONATOMIC_SIZE are conditionally redefined in both the SSE2 and AVX2 sections, risking macro redefinition or confusion; consider centralizing or guarding these definitions.

Suggested change
#define MAX_FPU_CTX_SIZE 64
#define KERNEL_FPU_NONATOMIC_SIZE (2 * (MAX_FPU_CTX_SIZE))
#ifndef MAX_FPU_CTX_SIZE
#endif

@Avenger-285714 Avenger-285714 merged commit 9590a37 into deepin-community:linux-6.6.y Jun 5, 2025
6 of 7 checks passed
5 participants