Skip PCA when exploring anomaly detection solely on node embeddings

JohT · JohT · commit 0376a805779a · 2026-01-20T08:53:32.000+01:00
diff --git a/domains/anomaly-detection/explore/AnomalyDetectionIsolationForestExploration.ipynb b/domains/anomaly-detection/explore/AnomalyDetectionIsolationForestExploration.ipynb
@@ -279,7 +279,14 @@
     "    'clusterNoise', # highly correlated with \"clusterApproximateOutlierScore\". doesn't improve F1 score of proxy model.\n",
     "    'embeddingVisualizationX',\n",
     "    'embeddingVisualizationY',\n",
-    "]"
+    "]\n",
+    "\n",
+    "features_for_visualization_and_training: typing.List[str] = [\n",
+    "    'pageRank', \n",
+    "    'articleRank'\n",
+    "]\n",
+    "\n",
+    "features_for_visualization: typing.List[str] = features_for_visualization_excluded_from_training + features_for_visualization_and_training"
    ]
   },
   {
@@ -748,7 +755,9 @@
    "id": "b2cfcc56",
    "metadata": {},
    "source": [
-    "#### 1.3b List the top 10 anomalies solely based on embeddings"
+    "#### 1.3b List the top 10 anomalies solely based on embeddings\n",
+    "\n",
+    "By leaving out all other features, we can see whether the embeddings alone are sufficient to detect anomalies. Anomalies detected solely based on embeddings could indicate structural outliers in the graph representation of the codebase. In most cases, however, combining embeddings with other features yields better results."
    ]
   },
   {
@@ -758,10 +767,18 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "java_package_embedding_anomaly_detection_features = java_package_anomaly_detection_features[features_for_visualization_excluded_from_training + ['embedding', 'pageRank', 'articleRank']].copy()\n",
-    "java_package_embedding_anomaly_detection_input = reduce_dimensionality_of_node_embeddings(java_package_embedding_anomaly_detection_features, max_dimensions=60, target_variance=0.95)\n",
-    "java_package_embedding_anomaly_detection_feature_names = embedding_feature_names = [f'nodeEmbeddingPCA_{i}' for i in range(java_package_embedding_anomaly_detection_input.shape[1])]\n",
+    "# Create a copy of the java_package features, selecting only visualization and embedding features\n",
+    "java_package_embedding_anomaly_detection_features = java_package_anomaly_detection_features[features_for_visualization + ['embedding']].copy()\n",
+    "\n",
+    "# Skip PCA and keep the original dimensionality of the node embeddings. When only embeddings are considered, there are no other features that the embedding dimensions could outweigh.\n",
+    "# java_package_embedding_anomaly_detection_input = reduce_dimensionality_of_node_embeddings(java_package_embedding_anomaly_detection_features, max_dimensions=60, target_variance=0.95)\n",
+    "java_package_embedding_anomaly_detection_input = np.stack(java_package_embedding_anomaly_detection_features['embedding'].apply(np.array).tolist())\n",
+    "java_package_embedding_anomaly_detection_feature_names = [f'nodeEmbedding_{i}' for i in range(java_package_embedding_anomaly_detection_input.shape[1])]\n",
+    "\n",
+    "# Tune anomaly detection models using only the original (not PCA-reduced) embedding dimensions, with automatic contamination threshold\n",
     "java_package_embedding_anomaly_detection_result = tune_anomaly_detection_models(java_package_embedding_anomaly_detection_input, contamination=\"auto\")\n",
+    "\n",
+    "# Add the anomaly detection results (labels and scores) to the features dataframe with custom column names for embedding-based anomalies\n",
     "java_package_embedding_anomaly_detection_features = add_anomaly_detection_results_to_features(java_package_embedding_anomaly_detection_features, java_package_embedding_anomaly_detection_result, anomaly_label_column='anomalyOfEmbeddingLabel', anomaly_score_column='anomalyOfEmbeddingScore')\n",
     "\n",
     "display(get_top_10_anomalies(java_package_embedding_anomaly_detection_features, anomaly_label_column='anomalyOfEmbeddingLabel', anomaly_score_column='anomalyOfEmbeddingScore').reset_index(drop=True))"
@@ -2000,7 +2017,9 @@
    "id": "c314821d",
    "metadata": {},
    "source": [
-    "#### 2.3b List the top 10 anomalies solely based on embeddings"
+    "#### 2.3b List the top 10 anomalies solely based on embeddings\n",
+    "\n",
+    "By leaving out all other features, we can see whether the embeddings alone are sufficient to detect anomalies. Anomalies detected solely based on embeddings could indicate structural outliers in the graph representation of the codebase. In most cases, however, combining embeddings with other features yields better results."
    ]
   },
   {
@@ -2010,12 +2029,20 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "java_type_embedding_anomaly_detection_features = java_type_anomaly_detection_features[features_for_visualization_excluded_from_training + ['embedding', 'pageRank', 'articleRank']].copy()\n",
-    "java_type_embedding_anomaly_detection_input = reduce_dimensionality_of_node_embeddings(java_type_embedding_anomaly_detection_features, max_dimensions=60, target_variance=0.95)\n",
-    "java_type_embedding_anomaly_detection_feature_names = embedding_feature_names = [f'nodeEmbeddingPCA_{i}' for i in range(java_type_embedding_anomaly_detection_input.shape[1])]\n",
+    "# Create a copy of the java_type features, selecting only embeddings and everything needed for visualization\n",
+    "java_type_embedding_anomaly_detection_features = java_type_anomaly_detection_features[features_for_visualization + ['embedding']].copy()\n",
+    "\n",
+    "# Skip PCA and keep the original dimensionality of the node embeddings. When only embeddings are considered, there are no other features that the embedding dimensions could outweigh.\n",
+    "# java_type_embedding_anomaly_detection_input = reduce_dimensionality_of_node_embeddings(java_type_embedding_anomaly_detection_features, max_dimensions=60, target_variance=0.95)\n",
+    "java_type_embedding_anomaly_detection_input = np.stack(java_type_embedding_anomaly_detection_features['embedding'].apply(np.array).tolist())\n",
+    "java_type_embedding_anomaly_detection_feature_names = [f'nodeEmbedding_{i}' for i in range(java_type_embedding_anomaly_detection_input.shape[1])]\n",
+    "\n",
     "java_type_embedding_anomaly_detection_result = tune_anomaly_detection_models(java_type_embedding_anomaly_detection_input, contamination=\"auto\")\n",
+    "\n",
+    "# Add the anomaly detection results (labels and scores) to the features dataframe with custom column names for embedding-based anomalies\n",
     "java_type_embedding_anomaly_detection_features = add_anomaly_detection_results_to_features(java_type_embedding_anomaly_detection_features, java_type_embedding_anomaly_detection_result, anomaly_label_column='anomalyOfEmbeddingLabel', anomaly_score_column='anomalyOfEmbeddingScore')\n",
     "\n",
+    "# Display the top 10 anomalies detected based on embeddings, sorted by anomaly score in descending order, with index reset for cleaner output\n",
     "display(get_top_10_anomalies(java_type_embedding_anomaly_detection_features, anomaly_label_column='anomalyOfEmbeddingLabel', anomaly_score_column='anomalyOfEmbeddingScore').reset_index(drop=True))"
    ]
   },