add Deepcover MoA proteomics data

lilydtaub · lilydtaub · commit e50f8a13705e · 2025-09-26T16:47:28.000-04:00
diff --git a/appyters/Drug_Gene_Budger2/appyter.json b/appyters/Drug_Gene_Budger2/appyter.json
@@ -2,7 +2,7 @@
     "$schema": "https://raw.githubusercontent.com/MaayanLab/appyter-catalog/main/schema/appyter-validator.json",
     "name": "Drug_Gene_Budger2",
     "title": "Dr. Gene Budger (DGB) 2",
-    "version": "0.0.6",
+    "version": "0.0.7",
     "description": "An appyter that retrieves drugs that up-regulate and down-regulate a single input gene across Connectivity Mapping datasets",
     "image": "dgb_logo.png",
     "authors": [
diff --git a/appyters/Drug_Gene_Budger2/drug_gene_budger2_appyter.ipynb b/appyters/Drug_Gene_Budger2/drug_gene_budger2_appyter.ipynb
@@ -95,13 +95,15 @@
    "id": "7ea9d01d",
    "metadata": {},
    "source": [
-    "This notebook takes a gene as input and identifies drugs that maximally up and down regulate the gene's expression in a collection of chemical perturbation datasets.\n",
+    "This notebook takes a gene as input and identifies drugs that maximally up- and down-regulate the gene's mRNA expression in a collection of connectivity mapping resources that measure transcriptional response to chemical perturbations:\n",
     "\n",
     "- Ginkgo GDPx1 and GPDx2: Limma-Voom based differential gene expression results for 1,354 drugs.\n",
     "- Novartis DRUG-seq: Differential: Limma-Trend based differential gene expression results for 4,343 drugs. \n",
     "- LINCS L1000 Chemical Perturbations: Limma-Voom based differential gene expression results for a subset of 4,091 drugs from the LINCS L1000 Chemical Perturbation dataset. \n",
     "\n",
-    "The Ginkgo dataset includes 4 primary cell types (epithelial melanocytes, smooth aortic muscle cells, skeletal muscle myoblasts and dermal fibroblasts) and one cell line (A549 lung carcinoma cell line). Previous analysis showed distinct transcriptional responses by cell type, so the drug rankings for the Ginkgo dataset are separated by cell type."
+    "The Ginkgo dataset includes 4 primary cell types (epithelial melanocytes, smooth aortic muscle cells, skeletal muscle myoblasts and dermal fibroblasts) and one cell line (A549 lung carcinoma cell line). Previous analysis showed distinct transcriptional responses by cell type, so the drug rankings for the Ginkgo dataset are separated by cell type.\n",
+    "\n",
+    "The Deepcover MoA proteomics dataset is used to present protein-level regulation of the query gene. You can compare protein-level and mRNA-level regulation for compounds used in both the Deepcover MoA and connectivity mapping resources."
    ]
   },
   {
@@ -142,10 +144,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Storage url for Ginkgo and Novartis DE files\n",
+    "# Storage URLs for DE gene files\n",
     "ginkgo_URL = 'https://appyters.maayanlab.cloud/storage/DrugRegulators_Appyter/ginkgo_de'\n",
     "novartis_URL = 'https://appyters.maayanlab.cloud/storage/DrugRegulators_Appyter/novartis_de'\n",
     "lincs_URL = 'https://appyters.maayanlab.cloud/storage/DrugRegulators_Appyter/lincs_de'\n",
+    "deepcover_moa_URL = 'https://appyters.maayanlab.cloud/storage/DrugRegulators_Appyter/deepcoverMoa_de'\n",
+    "\n",
     "# silence warnings\n",
     "warnings.filterwarnings('ignore')"
    ]
@@ -263,6 +267,33 @@
     "    raise Exception(\"Execution stopped, gene not found in any datasets\")"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5ac47199",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get proteomics data\n",
+    "in_deepcover = True\n",
+    "try:\n",
+    "    protein_de = pd.read_feather(f'{deepcover_moa_URL}/{gene_file}').set_index('index')\n",
+    "except Exception:  # no Deepcover MoA file for this gene, or the download failed\n",
+    "    in_deepcover = False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb86094c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get pubchem ID dataframe\n",
+    "pubchem_location = 'https://appyters.maayanlab.cloud/storage/DrugRegulators_Appyter/cmap_pubchem_ids.csv'\n",
+    "pubchem_ids = pd.read_csv(pubchem_location, dtype = {'Drug':str, 'CID':str})"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "dd12adb9",
@@ -698,13 +729,16 @@
     "    n_datasets = list()\n",
     "    for _,row in overlapping_df.iterrows():\n",
     "        n_datasets.extend([row['N Datasets']]*len(row['Overlap']))\n",
+    "        member_sets = row['Members']\n",
     "        for d in row['Overlap']:\n",
     "            n = 0\n",
     "            runsum_rank = 0\n",
     "            runsum_pctrank = 0\n",
     "            runsum_logFC = 0\n",
     "            runsum_pval = 0\n",
-    "            for _,df in data_dict.items():\n",
+    "            for source_name,df in data_dict.items():\n",
+    "                if not re.search(re.escape(source_name), member_sets):\n",
+    "                    continue\n",
     "                subset = df[df['Drug'].str.lower() == d.lower()]\n",
     "                n = n + subset.shape[0]\n",
     "                runsum_rank = runsum_rank + subset.Rank.sum()\n",
@@ -730,7 +764,20 @@
     "    else:\n",
     "        # sort based on N datasets and average adjusted p-value\n",
     "        res_df = res_df.sort_values(['N Datasets','Avg Adj.P.Val'], ascending=[False,True])\n",
-    "    return res_df"
+    "    return res_df\n",
+    "\n",
+    "\n",
+    "def join_proteomics(ranking_table, protein_de):\n",
+    "    # join with PubChem ID table\n",
+    "    with_cids = ranking_table.merge(pubchem_ids, how='left', on='Drug')\n",
+    "    # Drop those drugs that did not have PubChem IDs\n",
+    "    with_cids = with_cids[with_cids['CID'].notna()]\n",
+    "    # join with proteomics data on PubChem IDs\n",
+    "    with_proteins = with_cids.merge(protein_de[['UniprotID','Pubchem','logFC']], how='inner', left_on='CID', right_on='Pubchem')\n",
+    "    # clean column names\n",
+    "    with_proteins.rename(columns = {'logFC':'Protein logFC', 'Pubchem' : 'PubChem CID'}, inplace=True)\n",
+    "    with_proteins.drop(columns='CID',inplace=True)\n",
+    "    return with_proteins"
    ]
   },
   {
@@ -792,6 +839,69 @@
     "    display(HTML(download_link(overlapping_down_TargetRank, f'overlapping_drugs_averages_{query_gene}_DnReg.tsv')))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "2aaee19a",
+   "metadata": {},
+   "source": [
+    "### Protein Regulation\n",
+    "\n",
+    "Query gene regulation at the protein level is displayed in the tables below. Proteomics data are from the [Deepcover MoA dataset](https://wren.hms.harvard.edu/DeepCoverMOA/), in which cells from the HCT116 cancer cell line were exposed to 875 small-molecule compounds."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53cc7efd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if in_deepcover:\n",
+    "    up_protein = protein_de[protein_de['logFC'] > 0].loc[:,['UniprotID','Drug','Pubchem','logFC','Zscore','UpRank','PctUpRank']].sort_values('logFC',ascending=False).reset_index(drop=True)\n",
+    "    up_protein.rename(columns={'Pubchem':'PubChem CID','Zscore':'Z-score','UpRank':'Up Rank', 'PctUpRank':'Normalized Up Rank'}, inplace=True)\n",
+    "    display_markdown(\"**Up-regulating drugs**\", raw=True)\n",
+    "    display(up_protein.head(top_n))\n",
+    "    display(HTML(download_link(up_protein, f'DeepcoverMoa_protein_{query_gene}_UpReg.tsv')))\n",
+    "\n",
+    "    dn_protein = protein_de[protein_de['logFC'] < 0].loc[:,['UniprotID','Drug','Pubchem','logFC','Zscore','DnRank','PctDnRank']].sort_values('logFC', ascending=True).reset_index(drop=True)\n",
+    "    dn_protein.rename(columns={'Pubchem':'PubChem CID','Zscore':'Z-score','DnRank':'Down Rank', 'PctDnRank':'Normalized Down Rank'}, inplace=True)\n",
+    "    display_markdown(\"**Down-regulating drugs**\", raw=True)\n",
+    "    display(dn_protein.head(top_n))\n",
+    "    display(HTML(download_link(dn_protein, f'DeepcoverMoa_protein_{query_gene}_DnReg.tsv')))\n",
+    "else:\n",
+    "    display_markdown(f\"Protein for **{query_gene}** not found in the Deepcover MoA dataset\", raw=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a0630a08",
+   "metadata": {},
+   "source": [
+    "If the protein associated with the query gene was found in the Deepcover MoA proteomics dataset, the tables below show how the protein was up- or down-regulated by the consensus drugs identified in the connectivity mapping resources. The tables include only compounds that were used in both the connectivity mapping resources and the Deepcover MoA dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b46e327",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if in_deepcover:\n",
+    "    up_with_cid = join_proteomics(overlapping_up_TargetRank, protein_de)\n",
+    "    dn_with_cid = join_proteomics(overlapping_down_TargetRank, protein_de)\n",
+    "    \n",
+    "    display_markdown(\"**Up-regulating drugs with protein expression**\", raw=True)\n",
+    "    display(up_with_cid.head(n=top_n))\n",
+    "    display(HTML(download_link(up_with_cid, f'{query_gene}_mRNA_protein_UpReg.tsv')))\n",
+    "    \n",
+    "    display_markdown(\"**Down-regulating drugs with protein expression**\", raw=True)\n",
+    "    display(dn_with_cid.head(n=top_n))\n",
+    "    display(HTML(download_link(dn_with_cid, f'{query_gene}_mRNA_protein_DnReg.tsv')))\n",
+    "else:\n",
+    "    display_markdown(f\"Protein for **{query_gene}** not found in the Deepcover MoA dataset\", raw=True)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "2e7c13dd",
@@ -954,22 +1064,43 @@
     "        df['Label'] = df['Perturbation'] + '_' + df['Drug']\n",
     "    elif source == 'L1000':\n",
     "        df['Label'] = df['Perturbation']\n",
+    "    elif source == 'Deepcover MoA':\n",
+    "        df['Label'] = df['Drug']\n",
+    "        df['abs_Zscore'] = df['Zscore'].abs()\n",
     "\n",
     "    # set plot source\n",
-    "    plot_source = ColumnDataSource(df.loc[:,['Label','logFC','FC','log10adj.P.Val', 'Rank', 'PctRank']])\n",
-    "    x,y='logFC','log10adj.P.Val'\n",
-    "    hover = HoverTool(tooltips=[(\"Label\", \"@Label\"),\n",
+    "    if source != 'Deepcover MoA':\n",
+    "        plot_source = ColumnDataSource(df.loc[:,['Label','logFC','FC','log10adj.P.Val', 'Rank', 'PctRank']])\n",
+    "        x,y='logFC','log10adj.P.Val'\n",
+    "        xlabel,ylabel = 'Log2(Fold Change)','-Log10(Adj. p-value)'\n",
+    "        title = f'{gene_id} Regulation in {source} {cell_type}'\n",
+    "        hover = HoverTool(tooltips=[(\"Label\", \"@Label\"),\n",
     "                            (\"Log2(FC)\", \"@logFC\"),\n",
     "                            (\"Fold Change\", \"@FC\"),\n",
     "                            ('-Log10(Adj. p-value)',\"@{log10adj.P.Val}{0.00e}\"),\n",
     "                            (\"Raw Rank\", \"@Rank\"),\n",
     "                            (\"Normalized Rank\", \"@PctRank\")])\n",
+    "    else:\n",
+    "        plot_source = ColumnDataSource(df.loc[:,['Label','logFC','FC','abs_Zscore','UpRank','DnRank','PctUpRank','PctDnRank']])\n",
+    "        x,y = 'logFC','abs_Zscore'\n",
+    "        xlabel,ylabel = 'Log2(Fold Change)','Abs(Z-score)'\n",
+    "        title = f'{gene_id}: {df[\"UniprotID\"].iloc[0]} Regulation in {source} {cell_type}'\n",
+    "        hover = HoverTool(tooltips=[(\"Label\", \"@Label\"),\n",
+    "                            (\"Log2(FC)\", \"@logFC\"),\n",
+    "                            (\"Fold Change\", \"@FC\"),\n",
+    "                            ('abs(z-score)',\"@{abs_Zscore}{0.00e}\"),\n",
+    "                            (\"Up Rank\", \"@UpRank\"),\n",
+    "                            (\"Normalized Up Rank\",\"@PctUpRank\"),\n",
+    "                            (\"Down Rank\", \"@DnRank\"),\n",
+    "                            (\"Normalized Down Rank\",\"@PctDnRank\")])\n",
+    "\n",
+    "    \n",
     "        \n",
     "    # define figure\n",
     "    p = figure(\n",
-    "        title=f'{gene_id} Regulation in {source} {cell_type}',\n",
-    "        x_axis_label = 'Log2(Fold Change)',\n",
-    "        y_axis_label = '-Log10(Adj. p-value)',\n",
+    "        title=title,\n",
+    "        x_axis_label = xlabel,\n",
+    "        y_axis_label = ylabel,\n",
     "        tools = 'pan,wheel_zoom,box_zoom,reset,save'\n",
     "    )\n",
     "\n",
@@ -1053,6 +1184,30 @@
     "    display_markdown(f'**{query_gene}** not found in Novartis DRUG-seq dataset', raw=True)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8c3df5f5",
+   "metadata": {},
+   "source": [
+    "### Deepcover MoA\n",
+    "\n",
+    "The Deepcover MoA proteomics dataset consists of proteome fingerprints for 875 chemical perturbations. This volcano plot of protein expression shows logFC on the x-axis and the absolute z-score on the y-axis, that is, the number of standard deviations separating the protein's logFC for a given compound from the protein's mean logFC across all compounds."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "213d50f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if in_deepcover:\n",
+    "    # a gene may map to multiple proteins, so draw one plot per UniProt ID\n",
+    "    uniprot_ids = list(protein_de['UniprotID'].unique())\n",
+    "    for uid in uniprot_ids:\n",
+    "        create_bokeh_volcano_plot(protein_de[protein_de['UniprotID']==uid], query_gene, '', 'Deepcover MoA')"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "ba0c6439",
@@ -1070,7 +1225,9 @@
     "\n",
     "[5] “LINCS L1000 Reverse Search.” n.d. Accessed September 5, 2025. https://lincs-reverse-search-dashboard.dev.maayanlab.cloud/.\n",
     "\n",
-    "[6] Wang, Zichen, Edward He, Kevin Sani, Kathleen M. Jagodnik, Moshe C. Silverstein, and Avi Ma’ayan. 2019. “Drug Gene Budger (DGB): An Application for Ranking Drugs to Modulate a Specific Gene Based on Transcriptomic Signatures.” Bioinformatics (Oxford, England) 35 (7): 1247–48."
+    "[6] Wang, Zichen, Edward He, Kevin Sani, Kathleen M. Jagodnik, Moshe C. Silverstein, and Avi Ma’ayan. 2019. “Drug Gene Budger (DGB): An Application for Ranking Drugs to Modulate a Specific Gene Based on Transcriptomic Signatures.” Bioinformatics (Oxford, England) 35 (7): 1247–48.\n",
+    "\n",
+    "[7] Mitchell, Dylan C., Miljan Kuljanin, Jiaming Li, Jonathan G. Van Vranken, Nathan Bulloch, Devin K. Schweppe, Edward L. Huttlin, and Steven P. Gygi. 2023. “A Proteome-Wide Atlas of Drug Mechanism of Action.” Nature Biotechnology 41 (6): 845–57."
    ]
   }
  ],
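
Note on the new join: the `join_proteomics` helper added in this commit matches drug names to PubChem CIDs with a left merge, drops drugs without a CID, and then inner-joins the proteomics rows on that CID. The following is a minimal sketch of that two-step merge on toy data, not the notebook's actual code path. The function name `join_proteomics_sketch`, the drug names, the CIDs, and the UniProt ID are all hypothetical, and the sketch takes `pubchem_ids` as an argument instead of reading it from notebook scope.

```python
import pandas as pd


def join_proteomics_sketch(ranking_table, protein_de, pubchem_ids):
    """Attach protein-level logFC to an mRNA ranking table via PubChem CIDs."""
    # Step 1: map drug names to PubChem CIDs (drugs without a CID get NaN)
    with_cids = ranking_table.merge(pubchem_ids, how='left', on='Drug')
    # Step 2: drop drugs that have no PubChem CID
    with_cids = with_cids[with_cids['CID'].notna()]
    # Step 3: keep only drugs that also appear in the proteomics data
    with_proteins = with_cids.merge(
        protein_de[['UniprotID', 'Pubchem', 'logFC']],
        how='inner', left_on='CID', right_on='Pubchem',
    )
    # Step 4: tidy column names and drop the now-redundant key
    with_proteins = with_proteins.rename(
        columns={'logFC': 'Protein logFC', 'Pubchem': 'PubChem CID'}
    )
    return with_proteins.drop(columns='CID')


# Toy inputs (all values are made up for illustration)
ranking = pd.DataFrame({'Drug': ['drugA', 'drugB', 'drugC'],
                        'Avg logFC': [2.1, 1.4, 0.9]})
pubchem_ids = pd.DataFrame({'Drug': ['drugA', 'drugB'],
                            'CID': ['111', '222']})
protein_de = pd.DataFrame({'UniprotID': ['P00001', 'P00001'],
                           'Drug': ['drugA', 'drugX'],
                           'Pubchem': ['111', '999'],
                           'logFC': [0.8, -0.3]})

result = join_proteomics_sketch(ranking, protein_de, pubchem_ids)
print(result)
```

In this toy run only `drugA` survives: `drugC` has no CID and `drugB` has a CID that is absent from the proteomics table. Both key columns are strings here, mirroring the `dtype` passed to `pd.read_csv` in the diff; if the proteomics `Pubchem` column were numeric, the keys would need to be cast to a common type before merging.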