What is Python?
Python is a general-purpose programming language that can be used in a number of settings, from website development to robotics. For our purposes, the most relevant uses of Python are data analysis and machine learning.
While many researchers use R, a different language, for data analysis, Python also has important strengths in data analytics, especially in image analysis and natural language processing. Often, data analysis takes place in a special document format called a "notebook". If you've heard of "IPython" notebooks or "Jupyter" notebooks or labs, that's the format we're talking about. You may have also seen the notebook format in Google Colaboratory (or Colab). These notebooks allow your code to be interspersed with formatted text that is intended to communicate with other humans, not with a computer. In Arcus, we provide the JupyterLab environment to give users notebook functionality.
Additionally, Python is widely used in machine learning, a computational method for developing models that can classify data and make predictions on new data. Machine learning development often takes place in notebooks, which make trial and error and human intervention easy. Once a model is successful, it is scaled up for production use in an automated form that doesn't use notebooks but rather plain Python scripts, for speed and efficiency.
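To make "classify data and make predictions on new data" a little more concrete, here is a minimal sketch using only Python's standard library. It uses a simple nearest-centroid approach chosen purely for illustration; the function names, measurements, and labels are all made up for this example, and real projects would more likely use a dedicated machine learning library.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) for each label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        centroids,
        key=lambda label: squared_distance(centroids[label], features),
    )


# Made-up training data: [height_cm, weight_kg] with hypothetical labels.
training_samples = [[50, 3.5], [55, 4.5], [110, 19.0], [120, 23.0]]
training_labels = ["infant", "infant", "child", "child"]

# "Training" here just means summarizing each group by its average values.
centroids = fit_centroids(training_samples, training_labels)

# Predict a label for a new, unseen measurement.
print(predict(centroids, [115, 20.0]))  # prints "child"
```

In a notebook, you would typically run the training and prediction steps in separate cells so you can inspect the intermediate results along the way.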
What Makes Python Popular?
Like R, Python is free and open source, and promotes research reproducibility.
If you're already an Arcus user (you've signed our Terms of Use and completed CITI training), you can sign up for our Arcus On-Ramp webinars. In these webinars, you work in a real Arcus lab analyzing CHOP's electronic health record (EHR) to replicate an actual published study. Workshops focus either on exploring the data and defining a query for your study using SQL, or running the analysis in R/Python. No coding experience is required to attend. Registration closes one week before each workshop so we have time to add registered attendees as users in the webinar training lab. To sign up, please visit https://arcus.chop.edu/education/webinar-signup/. This link is only available for Arcus customers on the CHOP network.
For an example of how to use Python / JupyterLab in your Arcus lab, start with the training videos on your lab's landing page.
These videos are very introductory, but they help you understand specifically how to work with your Arcus lab.
We strongly encourage you to watch all of the videos, in order, even the ones that don't refer to Python specifically. It's only about an hour of your time, and we think it will answer many of your questions and save time in the long run.
Arcus training is a great place to get started with your Python education, but you will probably want to continue your education on your own, growing in skills that are specific to your own research goals or career needs.
You have several options when it comes to growing in your Python skills.
There are a number of university classes, online courses and live workshops that go in depth about how to use Python. Simply search for courses at the university or MOOC (e.g. Coursera) you prefer to use.
If you prefer something a bit more "just in time", however, we suggest the Python modules from the DART (Data and Analytics for Research Training) program.
DART includes dozens of data science modules, each 1 hour or less in duration, with a narrow focus and clear learning objectives. They are asynchronous, so you can take them at any time!
Arcus Education's DART modules are the result of a study funded by an NIH grant aimed at educating biomedical researchers. The active research phase of this program is complete, so we are no longer recruiting learners to be our subjects. However, if you'd like to receive updates about publications or applications of this research, please email us at dart@chop.edu.
Training modules:
To begin learning Python, there are a couple of options with regard to the DART self-guided tutorial modules.
If you want a comprehensive curriculum of nearly twenty modules, you might enjoy our Suggested Pathway 5: Analysis in Python curriculum, which includes overview materials about reproducible research and data organization, introductory material in Python, and some advanced topics you'll need as a biomedical researcher. While you're there, check out the other suggested pathways, too!
Here is a sneak preview of Suggested Pathway 5: Analysis in Python:
| Order | Module | Description | Estimated Time |
|---|---|---|---|
| 1 | Reproducibility, Generalizability, and Reuse | This module provides learners with an approachable introduction to the concepts and impact of research reproducibility, generalizability, and data reuse, and how technical approaches can help make these goals more attainable. | 60 min |
| 2 | How to Troubleshoot | Learning to use technical methods like coding and version control in your research inevitably means running into problems. Learn practical methods for troubleshooting and moving past error codes and other difficulties. | 30 min |
| 3 | Learning to Learn Data Science | Discover how learning data science is different from learning other subjects. | 20 min |
| 4 | Demystifying Python | This module introduces the Python programming language, explores why Python is useful in research, and describes how to download Python and Jupyter. | 20 min |
| 5 | Directories and File Paths | In this module, learners will explore what a directory is and how to describe the location of a file using its file path. | 15 min |
| 6 | Python Basics: Functions, Methods, and Variables | Learn the foundations of writing Python code, including the use of functions, methods, and variables. | 20 min |
| 7 | Python Basics: Lists and Dictionaries | Learn about collection objects, specifically lists and dictionaries, in Python. | 15 min |
| 8 | Python Basics: Loops and Conditionals | Learn how to use loops and conditional statements in Python. | 20 min |
| 9 | Python Basics: Exercise | Practice the skills acquired in the Python Basics sequence by working through an exercise. | 30 min |
| 10 | Transform Data with pandas | This is an introduction to transforming data using a Python library named pandas. | 60 min |
| 11 | Tidy Data | Tidy is a technical term in data analysis and describes an optimal way for organizing data that will be analyzed computationally. | 45 min |
| 12 | Data Visualization in Open Source Software | Introduction to principles of data visualization and typical data visualization workflows using two common open source libraries: ggplot2 and seaborn. | 20 min |
| 13 | Data Visualization in seaborn | This module includes code and explanations for several popular data visualizations using Python's seaborn library. It also includes examples of how to modify seaborn plots to customize them for different uses. | 60 min |
| 14 | Introduction to Null Hypothesis Significance Testing | This is an introduction to NHST for biomedical researchers. | 40 min |
| 15 | Statistical Tests in Open Source Software | This module provides an overview of the most commonly used kinds of statistical tests and links to code for running many of them in both R and Python. | 20 min |
| 16 | Python Practice | Use the basics of Python coding, data transformation, and data visualization to work with real data. | 60 min |
| 17 | Demystifying Machine Learning | An approachable and practical introduction to machine learning for biomedical researchers. | 60 min |
| 18 | Understanding the Bias-Variance Tradeoff | The bias-variance tradeoff is a central issue in nearly all machine learning analyses. This module explains what the tradeoff is, why it matters for machine learning, and what you can do to manage it in your own analyses. | 20 min |
If these pathways are close, but not quite right, you can also build your own pathway through these materials using our prototype curriculum development tool at https://learn.arcus.chop.edu.
If you're in a hurry and you just want a bit of specific Python instruction, we recommend starting with these modules:
- Demystifying Python
- Python Basics: Functions, Methods, and Variables
- Python Basics: Lists and Dictionaries
- Python Basics: Loops and Conditionals
- Transform Data with pandas
- Data Visualization in seaborn
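For a rough sense of the kind of code the Python Basics modules build toward, here is a small sketch that combines a function, a list, dictionaries, a loop, and a conditional. It is not taken from the DART modules themselves; the function name and the visit records are hypothetical and exist only for illustration.

```python
def summarize_visits(visits):
    """Count visits per department, skipping records with no department."""
    counts = {}
    for visit in visits:                      # loop over a list
        department = visit.get("department")  # look up a dictionary key
        if department is None:                # conditional: skip missing values
            continue
        counts[department] = counts.get(department, 0) + 1
    return counts


# Made-up records: a list of dictionaries, one per visit.
visits = [
    {"patient": "A", "department": "cardiology"},
    {"patient": "B", "department": "neurology"},
    {"patient": "C", "department": "cardiology"},
    {"patient": "D", "department": None},
]

visit_counts = summarize_visits(visits)
print(visit_counts)  # prints {'cardiology': 2, 'neurology': 1}
```

Libraries such as pandas can do this kind of counting in a single call, but writing it by hand first makes it clearer what those libraries are doing on your behalf.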
Beyond the NIH-funded DART modules, we also suggest other articles and miscellany, whether those are resources we've created in Arcus or things we recommend from the larger Python community.
Compendia of Resources:
- Our "Python 101" Guide includes links to articles, webinars, and other materials on a variety of topics.

