Commit f0d4048

Merge pull request #59 from sixiang-svg/update-research-part-2
feat: update research skills part 2
2 parents 2be3584 + 76fb056 commit f0d4048

20 files changed

Lines changed: 674 additions & 2493 deletions

skills/Research/cancel/SKILL.md

Lines changed: 43 additions & 357 deletions
Large diffs are not rendered by default.
Lines changed: 34 additions & 333 deletions
@@ -1,8 +1,12 @@
 ---
-category: Research
 id: chembl-database
 name: ChEMBL Database
-description: Guidance and answers for chembl database.
+description: Guidance for retrieving compound information, bioactivity data, and identifier mapping in ChEMBL.
+category: Research
+requires: []
+examples:
+- Search for bioactivity data of Geldanamycin in the ChEMBL database.
+- Map this KEGG compound ID to its corresponding ChEMBL ID.
 ---

 # BioServices
@@ -24,334 +28,31 @@ This skill should be used when:
 - Mining genomic data (BioMart, ArrayExpress, ENA)
 - Integrating data from multiple bioinformatics resources in a single workflow

-## Core Capabilities
-
-### 1. Protein Analysis
-
-Retrieve protein information, sequences, and functional annotations:
-
-```python
-from bioservices import UniProt
-
-u = UniProt(verbose=False)
-
-# Search for protein by name
-results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
-
-# Retrieve FASTA sequence
-sequence = u.retrieve("P43403", "fasta")
-
-# Map identifiers between databases
-kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
-```
-
-**Key methods:**
-- `search()`: Query UniProt with flexible search terms
-- `retrieve()`: Get protein entries in various formats (FASTA, XML, tab)
-- `mapping()`: Convert identifiers between databases
-
-Reference: `references/services_reference.md` for complete UniProt API details.
-
-### 2. Pathway Discovery and Analysis
-
-Access KEGG pathway information for genes and organisms:
-
-```python
-from bioservices import KEGG
-
-k = KEGG()
-k.organism = "hsa"  # Set to human
-
-# Search for organisms
-k.lookfor_organism("droso")  # Find Drosophila species
-
-# Find pathways by name
-k.lookfor_pathway("B cell")  # Returns matching pathway IDs
-
-# Get pathways containing specific genes
-pathways = k.get_pathway_by_gene("7535", "hsa")  # ZAP70 gene
-
-# Retrieve and parse pathway data
-data = k.get("hsa04660")
-parsed = k.parse(data)
-
-# Extract pathway interactions
-interactions = k.parse_kgml_pathway("hsa04660")
-relations = interactions['relations']  # Protein-protein interactions
-
-# Convert to Simple Interaction Format
-sif_data = k.pathway2sif("hsa04660")
-```
-
-**Key methods:**
-- `lookfor_organism()`, `lookfor_pathway()`: Search by name
-- `get_pathway_by_gene()`: Find pathways containing genes
-- `parse_kgml_pathway()`: Extract structured pathway data
-- `pathway2sif()`: Get protein interaction networks
-
-Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.
-
-### 3. Compound Database Searches
-
-Search and cross-reference compounds across multiple databases:
-
-```python
-from bioservices import KEGG, UniChem
-
-k = KEGG()
-
-# Search compounds by name
-results = k.find("compound", "Geldanamycin")  # Returns cpd:C11222
-
-# Get compound information with database links
-compound_info = k.get("cpd:C11222")  # Includes ChEBI links
-
-# Cross-reference KEGG → ChEMBL using UniChem
-u = UniChem()
-chembl_id = u.get_compound_id_from_kegg("C11222")  # Returns CHEMBL278315
-```
-
-**Common workflow:**
-1. Search compound by name in KEGG
-2. Extract KEGG compound ID
-3. Use UniChem for KEGG → ChEMBL mapping
-4. ChEBI IDs are often provided in KEGG entries
-
-Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.
-
-### 4. Sequence Analysis
-
-Run BLAST searches and sequence alignments:
-
-```python
-from bioservices import NCBIblast
-
-s = NCBIblast(verbose=False)
-
-# Run BLASTP against UniProtKB
-jobid = s.run(
-    program="blastp",
-    sequence=protein_sequence,
-    stype="protein",
-    database="uniprotkb",
-    email="your.email@example.com"  # Required by NCBI
-)
-
-# Check job status and retrieve results
-s.getStatus(jobid)
-results = s.getResult(jobid, "out")
-```
-
-**Note:** BLAST jobs are asynchronous. Check status before retrieving results.
-
-### 5. Identifier Mapping
-
-Convert identifiers between different biological databases:
-
-```python
-from bioservices import UniProt, KEGG
-
-# UniProt mapping (many database pairs supported)
-u = UniProt()
-results = u.mapping(
-    fr="UniProtKB_AC-ID",  # Source database
-    to="KEGG",  # Target database
-    query="P43403"  # Identifier(s) to convert
-)
-
-# KEGG gene ID → UniProt
-kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")
-
-# For compounds, use UniChem
-from bioservices import UniChem
-u = UniChem()
-chembl_from_kegg = u.get_compound_id_from_kegg("C11222")
-```
-
-**Supported mappings (UniProt):**
-- UniProtKB ↔ KEGG
-- UniProtKB ↔ Ensembl
-- UniProtKB ↔ PDB
-- UniProtKB ↔ RefSeq
-- And many more (see `references/identifier_mapping.md`)
-
-### 6. Gene Ontology Queries
-
-Access GO terms and annotations:
-
-```python
-from bioservices import QuickGO
-
-g = QuickGO(verbose=False)
-
-# Retrieve GO term information
-term_info = g.Term("GO:0003824", frmt="obo")
-
-# Search annotations
-annotations = g.Annotation(protein="P43403", format="tsv")
-```
-
-### 7. Protein-Protein Interactions
-
-Query interaction databases via PSICQUIC:
-
-```python
-from bioservices import PSICQUIC
-
-s = PSICQUIC(verbose=False)
-
-# Query specific database (e.g., MINT)
-interactions = s.query("mint", "ZAP70 AND species:9606")
-
-# List available interaction databases
-databases = s.activeDBs
-```
-
-**Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.
-
-## Multi-Service Integration Workflows
-
-BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:
-
-### Complete Protein Analysis Pipeline
-
-Execute a full protein characterization workflow:
-
-```bash
-python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
-```
-
-This script demonstrates:
-1. UniProt search for protein entry
-2. FASTA sequence retrieval
-3. BLAST similarity search
-4. KEGG pathway discovery
-5. PSICQUIC interaction mapping
-
-### Pathway Network Analysis
-
-Analyze all pathways for an organism:
-
-```bash
-python scripts/pathway_analysis.py hsa output_directory/
-```
-
-Extracts and analyzes:
-- All pathway IDs for organism
-- Protein-protein interactions per pathway
-- Interaction type distributions
-- Exports to CSV/SIF formats
-
-### Cross-Database Compound Search
-
-Map compound identifiers across databases:
-
-```bash
-python scripts/compound_cross_reference.py Geldanamycin
-```
-
-Retrieves:
-- KEGG compound ID
-- ChEBI identifier
-- ChEMBL identifier
-- Basic compound properties
-
-### Batch Identifier Conversion
-
-Convert multiple identifiers at once:
-
-```bash
-python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
-```
-
-## Best Practices
-
-### Output Format Handling
-
-Different services return data in various formats:
-- **XML**: Parse using BeautifulSoup (most SOAP services)
-- **Tab-separated (TSV)**: Pandas DataFrames for tabular data
-- **Dictionary/JSON**: Direct Python manipulation
-- **FASTA**: BioPython integration for sequence analysis
-
-### Rate Limiting and Verbosity
-
-Control API request behavior:
-
-```python
-from bioservices import KEGG
-
-k = KEGG(verbose=False)  # Suppress HTTP request details
-k.TIMEOUT = 30  # Adjust timeout for slow connections
-```
-
-### Error Handling
-
-Wrap service calls in try-except blocks:
-
-```python
-try:
-    results = u.search("ambiguous_query")
-    if results:
-        # Process results
-        pass
-except Exception as e:
-    print(f"Search failed: {e}")
-```
-
-### Organism Codes
-
-Use standard organism abbreviations:
-- `hsa`: Homo sapiens (human)
-- `mmu`: Mus musculus (mouse)
-- `dme`: Drosophila melanogaster
-- `sce`: Saccharomyces cerevisiae (yeast)
-
-List all organisms: `k.list("organism")` or `k.organismIds`
-
-### Integration with Other Tools
-
-BioServices works well with:
-- **BioPython**: Sequence analysis on retrieved FASTA data
-- **Pandas**: Tabular data manipulation
-- **PyMOL**: 3D structure visualization (retrieve PDB IDs)
-- **NetworkX**: Network analysis of pathway interactions
-- **Galaxy**: Custom tool wrappers for workflow platforms
-
-## Resources
-
-### scripts/
-
-Executable Python scripts demonstrating complete workflows:
-
-- `protein_analysis_workflow.py`: End-to-end protein characterization
-- `pathway_analysis.py`: KEGG pathway discovery and network extraction
-- `compound_cross_reference.py`: Multi-database compound searching
-- `batch_id_converter.py`: Bulk identifier mapping utility
-
-Scripts can be executed directly or adapted for specific use cases.
-
-### references/
-
-Detailed documentation loaded as needed:
-
-- `services_reference.md`: Comprehensive list of all 40+ services with methods
-- `workflow_patterns.md`: Detailed multi-step analysis workflows
-- `identifier_mapping.md`: Complete guide to cross-database ID conversion
-
-Load references when working with specific services or complex integration tasks.
-
-## Installation
-
-```bash
-uv pip install bioservices
-```
-
-Dependencies are automatically managed. Package is tested on Python 3.9-3.12.
-
-## Additional Information
-
-For detailed API documentation and advanced features, refer to:
-- Official documentation: https://bioservices.readthedocs.io/
-- Source code: https://github.com/cokelaer/bioservices
-- Service-specific references in `references/services_reference.md`
+## Instruction
+You are a Chemical Informatics and Bioactivity Specialist. When this skill is activated, you must guide the user through the retrieval and cross-referencing of compound data using the following behavioral logic:
+
+1. **Compound Identification Logic**: Guide the user in searching the ChEMBL repository for bioactive molecules using common names or structural identifiers.
+2. **Bioactivity & Assay Analysis**:
+   - Instruct the user on how to retrieve quantitative data, such as IC50, Ki, and EC50 values, from specific assays.
+   - Explain the logic of filtering results by Target Type and Organism to ensure scientific relevance.
+3. **Cross-Database Mapping**:
+   - Use the logic of UniChem to map identifiers between ChEMBL and other chemical repositories like KEGG, ChEBI, or PubChem.
+   - Describe the workflow for mapping KEGG compound IDs to ChEMBL IDs to bridge pathway analysis with drug discovery data.
+4. **Data Integration Flow**:
+   - Guide the user in integrating ChEMBL data with protein repositories like UniProt to link ligands with their target receptors.
+   - Explain how to handle tabular data (TSV) or structured JSON returns for downstream analysis.
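
The filtering step in item 2 can be illustrated offline, without calling any web service. The sketch below assumes activity records have already been fetched and flattened into plain dictionaries; the key names (`standard_type`, `target_type`, `target_organism`), the default target type `"SINGLE PROTEIN"`, and the helper name `filter_activities` are assumptions made for illustration, and the demo records are placeholders rather than real ChEMBL data.

```python
def filter_activities(records, activity_types=("IC50", "Ki", "EC50"),
                      target_type="SINGLE PROTEIN", organism="Homo sapiens"):
    """Keep only records with a wanted activity type, target type, and organism.

    `records` is an iterable of dicts; the key names are assumed, not
    taken from a specific client library.
    """
    wanted = {t.upper() for t in activity_types}
    kept = []
    for rec in records:
        # Compare activity types case-insensitively ("Ki" vs "KI").
        if str(rec.get("standard_type", "")).upper() not in wanted:
            continue
        if rec.get("target_type") != target_type:
            continue
        if rec.get("target_organism") != organism:
            continue
        kept.append(rec)
    return kept


# Placeholder records for illustration only (not real measurements).
demo_records = [
    {"standard_type": "IC50", "standard_value": "12.0", "standard_units": "nM",
     "target_type": "SINGLE PROTEIN", "target_organism": "Homo sapiens"},
    {"standard_type": "Ki", "standard_value": "3.5", "standard_units": "nM",
     "target_type": "SINGLE PROTEIN", "target_organism": "Mus musculus"},
    {"standard_type": "Potency", "standard_value": "100", "standard_units": "nM",
     "target_type": "SINGLE PROTEIN", "target_organism": "Homo sapiens"},
    {"standard_type": "EC50", "standard_value": "40", "standard_units": "nM",
     "target_type": "CELL-LINE", "target_organism": "Homo sapiens"},
]

human_hits = filter_activities(demo_records)
print(len(human_hits), "record(s) kept")
```

Keeping the filter as a pure function over dictionaries means the same logic applies whether the records came from a JSON response or from rows of a parsed TSV file.
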
+
+## Output
+Your response must be structured to provide a professional chemoinformatics report:
+
+### 1. Compound & Bioactivity Summary
+- **Target Compound**: Standard name and ChEMBL identifier.
+- **Activity Profile**: A summary of key bioactivity metrics and the associated targets.
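
One possible way to build the activity profile is to group measurements by target and activity type and report a count and a median. This is a minimal sketch under stated assumptions: the record keys (`target_name`, `standard_type`, `standard_value`), the helper name `summarize_activity`, and the requirement that all values are already expressed in nM are illustrative choices, and the demo rows are placeholders, not real data.

```python
from collections import defaultdict
from statistics import median


def summarize_activity(records):
    """Group values by (target, activity type) and report count and median.

    Assumes every `standard_value` is already in nM; records without a
    value are skipped.
    """
    grouped = defaultdict(list)
    for rec in records:
        value = rec.get("standard_value")
        if value is None:
            continue
        key = (rec["target_name"], rec["standard_type"])
        grouped[key].append(float(value))
    return {
        key: {"n": len(values), "median_nM": median(values)}
        for key, values in grouped.items()
    }


# Placeholder rows for illustration only.
demo_rows = [
    {"target_name": "Target A", "standard_type": "IC50", "standard_value": "10"},
    {"target_name": "Target A", "standard_type": "IC50", "standard_value": "30"},
    {"target_name": "Target A", "standard_type": "IC50", "standard_value": "20"},
    {"target_name": "Target B", "standard_type": "Ki", "standard_value": "5"},
    {"target_name": "Target B", "standard_type": "Ki", "standard_value": None},
]

profile = summarize_activity(demo_rows)
for (target, kind), stats in profile.items():
    print(target, kind, stats)
```

A median is used here because repeated assay measurements often span orders of magnitude; a report could equally show the range or the individual values.
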
+
+### 2. Implementation Logic (Natural Language)
+- **Search Workflow**: Step-by-step guidance on querying compounds and filtering assay results.
+- **Mapping Logic**: A natural language description of how to bridge identifiers across databases.
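
The mapping logic can be sketched as a lookup over cross-reference pairs that were obtained beforehand, for example from a UniChem query. The helper names (`normalize_kegg_id`, `build_xref_index`, `map_kegg_to_chembl`) are hypothetical, and no network call is made. The single pair used as sample input is the Geldanamycin mapping quoted in the previous version of this skill file (KEGG `C11222` to `CHEMBL278315`).

```python
def normalize_kegg_id(raw):
    """Strip whitespace and an optional 'cpd:' prefix, then upper-case the ID."""
    raw = raw.strip()
    if raw.lower().startswith("cpd:"):
        raw = raw[4:]
    return raw.upper()


def build_xref_index(pairs):
    """Build a KEGG -> [ChEMBL, ...] index from (kegg_id, chembl_id) pairs.

    A list is stored per key because one source ID can map to several
    target IDs.
    """
    index = {}
    for kegg_id, chembl_id in pairs:
        index.setdefault(normalize_kegg_id(kegg_id), []).append(chembl_id)
    return index


def map_kegg_to_chembl(kegg_id, index):
    """Return all ChEMBL IDs for a KEGG compound ID, or an empty list."""
    return index.get(normalize_kegg_id(kegg_id), [])


# Pair taken from the earlier version of this skill file.
xref_index = build_xref_index([("C11222", "CHEMBL278315")])

print(map_kegg_to_chembl("cpd:C11222", xref_index))
print(map_kegg_to_chembl("C00000", xref_index))
```

Returning an empty list for an unmapped ID, rather than raising, lets a report state explicitly that no ChEMBL counterpart was found.
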
+
+### 3. Best Practices & Data Interpretation
+- **Data Quality Warnings**: Reminders to check the "Confidence Score" of assays.
+- **Unit Standardization**: Advice on ensuring concentration units are consistent across data sets.
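
Unit standardization can be as simple as converting every molar concentration to one unit before values are compared. The sketch below converts to nanomolar; the helper name `to_nanomolar` and the set of accepted unit strings are assumptions for illustration, and non-molar units (such as mass per volume) are rejected rather than guessed at.

```python
# Multiplicative factors from each molar unit to nanomolar (nM).
_TO_NM = {
    "pM": 1e-3,
    "nM": 1.0,
    "uM": 1e3,
    "µM": 1e3,
    "mM": 1e6,
    "M": 1e9,
}


def to_nanomolar(value, units):
    """Convert a concentration to nM; raise ValueError for unknown units."""
    try:
        factor = _TO_NM[units]
    except KeyError:
        raise ValueError(f"Unsupported concentration unit: {units!r}") from None
    return float(value) * factor


print(to_nanomolar(0.5, "uM"))
print(to_nanomolar("2", "mM"))
```

Raising on unknown units is a deliberate choice: silently passing through a value in, say, ug/mL would mix incompatible scales in the same column.
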
