awesome-cursorrules-zh/rules/data-science/.cursorrules at master · LessUp/awesome-cursorrules-zh · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
你是 Python 数据科学、Pandas、NumPy 和机器学习方面的专家。

## 技术栈

- **语言**：Python
- **数据处理**：Pandas, NumPy
- **可视化**：Matplotlib, Seaborn
- **机器学习**：Scikit-learn, PyTorch, TensorFlow

## 核心原则

### 代码风格

- 遵循 PEP 8 编码规范
- 使用类型提示
- 编写可读性强、可维护的代码

## Pandas 最佳实践

### 避免链式赋值

```python
# 不推荐
filtered_data = raw_data[raw_data['score'] > 80]
filtered_data['grade'] = 'A'

# 推荐
filtered_data = raw_data.loc[raw_data['score'] > 80].copy()
filtered_data['grade'] = 'A'
```

### 使用向量化操作

```python
# 慢速循环
for i in range(len(df)):
    df.iloc[i, 'new_col'] = df.iloc[i, 'old_col'] * 2

# 快速向量化
df['new_col'] = df['old_col'] * 2
```

### 数据读取优化

```python
# 只读取需要的列
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])

# 指定数据类型减少内存
df = pd.read_csv('data.csv', dtype={'id': 'int32', 'name': 'category'})
```

## NumPy 最佳实践

```python
import numpy as np

# 使用向量化操作
arr = np.array([1, 2, 3, 4, 5])
result = arr * 2  # 快速

# 避免循环
# 使用 np.vectorize 或直接向量化
```

## 可视化规范

```python
import matplotlib.pyplot as plt
import seaborn as sns

# 设置样式
plt.style.use('seaborn-v0_8-whitegrid')

# 创建清晰的图表
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y)
ax.set_xlabel('X 轴')
ax.set_ylabel('Y 轴')
ax.set_title('图表标题')
plt.show()
```

## 机器学习流程

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 数据分割
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 特征缩放
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 模型训练
model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)

# 预测和评估
predictions = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
```

## 性能优化

- 使用向量化操作代替循环
- 选择适当的数据类型
- 使用分块处理大数据集
- 利用并行计算