AWS S3 & Python Data Processing Lab

A hands-on lab for researchers: store data in Amazon S3 and process it with Python in a web-based VS Code environment.

Region: Singapore (ap-southeast-1)
Time: ~45–60 minutes

Project structure

.
├── LAB.md                 # This lab (instructions)
├── requirements.txt       # Python dependencies (boto3, pandas)
├── process_data.py        # Script: read from S3, process, write to S3
└── sample-data/
    ├── genomics_sample.csv
    └── sensor_readings.json

Objectives

  • Create an S3 bucket in the Singapore region via the AWS Management Console.
  • Upload sample research datasets (CSV and JSON) to S3.
  • Run a Python script that reads from S3, computes statistics, filters data, converts formats, and writes results back to S3.

Prerequisites

  • AWS account with console access.
  • Web-based VS Code environment with Python 3 and terminal access.
  • Basic familiarity with Python and CSV/JSON.

Part 1: Create an S3 Bucket (AWS Management Console)

  1. Sign in to AWS
    Go to https://console.aws.amazon.com and sign in.

  2. Open S3
    In the search bar, type S3 and open S3 (or go to Services → Storage → S3).

  3. Create bucket
    Click Create bucket.

  4. Bucket settings

    • Bucket name: Choose a globally unique name (e.g. research-data-yourname-2025).
      S3 bucket names must be unique across all AWS accounts and regions.
    • Region: Select Asia Pacific (Singapore) ap-southeast-1.
    • Block Public Access: Leave the default (Block all public access) unless your use case requires public access.
    • Bucket Versioning: Optional; you can leave it set to Disable for this lab.
    • Default encryption: Optional; Server-side encryption (SSE-S3) is a good choice for research data.
  5. Create
    Click Create bucket at the bottom.

  6. Note your bucket name
    You will need it for uploading data and for the Python script.
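If you prefer working from a terminal, the same bucket can be created with boto3. A minimal sketch (the bucket name is a placeholder; replace it with your own):

import boto3

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3 = boto3.client("s3", region_name="ap-southeast-1")
s3.create_bucket(
    Bucket="research-data-yourname-2025",  # must be globally unique
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-1"},
)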


Part 2: Upload Sample Datasets to S3

You will upload two sample datasets into your bucket: a genomics-style CSV and a sensor JSON file. Use the files provided in this lab's sample-data/ folder:

File                              Description
sample-data/genomics_sample.csv   Sample IDs, gene names, expression values, condition (control/treatment)
sample-data/sensor_readings.json  Sensor IDs, timestamps, numeric values, units (temperature, pH)

Upload steps (AWS Console)

  1. In the S3 console, click your bucket name.
  2. Click Upload.
  3. Add files:
    • sample-data/genomics_sample.csv → upload as genomics_sample.csv (or keep the path; see the prefix note in step 4 below)
    • sample-data/sensor_readings.json → upload as sensor_readings.json
  4. Under Destination, leave the prefix empty so files land in the bucket root. The script expects keys genomics_sample.csv and sensor_readings.json. If you use a prefix (e.g. raw/), set GENOMICS_KEY and SENSOR_KEY in process_data.py accordingly (e.g. raw/genomics_sample.csv).
  5. Click Upload, then Done.

Check: In the bucket, you should see your CSV and JSON files (e.g. in the root or under raw/).
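The console steps above are all you need, but the same upload also works from Python with boto3. A minimal sketch (assumes you run it from the lab folder and upload to the bucket root):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
bucket = "research-data-yourname-2025"  # your bucket name

# Keys match what process_data.py expects by default (bucket root).
s3.upload_file("sample-data/genomics_sample.csv", bucket, "genomics_sample.csv")
s3.upload_file("sample-data/sensor_readings.json", bucket, "sensor_readings.json")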


Part 3: Python Environment and Dependencies

In your VS Code environment, open a terminal and install dependencies:

pip install -r requirements.txt
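For reference, requirements.txt needs only the two libraries named in the project structure above, at minimum:

boto3
pandas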

Ensure you have AWS credentials configured (e.g. environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN, or an AWS CLI profile). The script will use the default credential chain (environment variables or ~/.aws/credentials).
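To confirm that your credentials resolve before running the lab script, you can ask AWS STS who you are; this quick check uses the same default credential chain:

import boto3

# Fails with a credentials error if nothing in the chain resolves.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])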


Part 4: Configure and Run the Processing Script

  1. Set your bucket name
    Open process_data.py and set the variable BUCKET_NAME to your bucket name (e.g. research-data-yourname-2025).

  2. Set region
    Ensure the script uses region ap-southeast-1 (Singapore) for S3; the sample script does this by default.

  3. Run the script

    python process_data.py
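For orientation, the configuration block at the top of process_data.py looks roughly like this (BUCKET_NAME, GENOMICS_KEY, and SENSOR_KEY are the names used in this lab; the region variable name is illustrative):

BUCKET_NAME = "research-data-yourname-2025"  # your bucket from Part 1
REGION = "ap-southeast-1"                    # Singapore
GENOMICS_KEY = "genomics_sample.csv"         # adjust if you uploaded under a prefix, e.g. raw/...
SENSOR_KEY = "sensor_readings.json"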

What the script does

  • Reads from your S3 bucket:
    • genomics_sample.csv (expression data) and sensor_readings.json (sensor readings).
  • Processes data:
    • Statistics: mean, median, min, max, count for genomics expression and for sensor value.
    • Filtering: genomics rows with expression >= 10; sensor records with value >= 22.
    • Format conversion: filtered genomics written as CSV and JSON; filtered sensor as JSON and CSV.
  • Writes results to the processed/ folder in the same bucket:
    • processed/genomics_stats.json, processed/genomics_filtered.csv, processed/genomics_filtered.json
    • processed/sensor_stats.json, processed/sensor_filtered.json, processed/sensor_filtered.csv
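A condensed sketch of that read-process-write flow for the genomics file, using boto3 and pandas. This is not the actual script; the expression column name and the processed/ keys follow the descriptions above:

import io
import json

import boto3
import pandas as pd

s3 = boto3.client("s3", region_name="ap-southeast-1")
bucket = "research-data-yourname-2025"  # your bucket name

# Read the CSV object from S3 into a DataFrame.
obj = s3.get_object(Bucket=bucket, Key="genomics_sample.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Statistics on the expression column (column name is an assumption).
col = df["expression"]
stats = {"mean": col.mean(), "median": col.median(),
         "min": col.min(), "max": col.max(), "count": int(col.count())}

# Filter, then write both outputs back under processed/.
filtered = df[col >= 10]
s3.put_object(Bucket=bucket, Key="processed/genomics_stats.json",
              Body=json.dumps(stats, default=float).encode("utf-8"))
s3.put_object(Bucket=bucket, Key="processed/genomics_filtered.csv",
              Body=filtered.to_csv(index=False).encode("utf-8"))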

After a successful run, refresh your bucket in the S3 console and open the processed/ folder to see the generated files.


Part 5: Verify Results in S3

  1. In the S3 console, open your bucket.
  2. Open the processed/ folder (or whatever output prefix you configured).
  3. Download and open the generated files (e.g. summary JSON, processed CSV/JSON) and confirm they match what the script describes (statistics, filtered records, format conversion).
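You can also verify the outputs from the terminal; a minimal boto3 listing sketch (assumes the default processed/ prefix):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
resp = s3.list_objects_v2(Bucket="research-data-yourname-2025", Prefix="processed/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])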

Extensions and Troubleshooting

  • Permissions: The IAM user/role used must have s3:GetObject and s3:PutObject (and s3:ListBucket if the script lists objects) on your bucket.
  • Region: Bucket and script must use the same region (e.g. ap-southeast-1).
  • Paths: If you uploaded files under a prefix (e.g. raw/), set the script’s input key names to match (e.g. raw/genomics_sample.csv).
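If you are unsure which region a bucket lives in, you can check from Python; note that buckets in us-east-1 report a LocationConstraint of None:

import boto3

resp = boto3.client("s3").get_bucket_location(Bucket="research-data-yourname-2025")
print(resp["LocationConstraint"])  # expect 'ap-southeast-1' for this lab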

Summary

Step  Action
1     Create an S3 bucket in Singapore (ap-southeast-1) via the AWS Console.
2     Upload genomics_sample.csv and sensor_readings.json to the bucket.
3     Install dependencies (pip install -r requirements.txt) and configure AWS credentials.
4     Set BUCKET_NAME in process_data.py and run python process_data.py.
5     Check the processed/ folder in S3 for the statistics and filtered files.

Lab designed for researchers learning AWS S3 and Python data processing.