Commit 41bd46d

Updated training instructions
1 parent 70399cd commit 41bd46d

1 file changed: 11 additions & 52 deletions

training.md
@@ -6,8 +6,8 @@ This guide explains how to set up the environment and train the HSTU/DLRM models
 
 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.
 
-### 1. Prerequisites
-Ensure you have **Python 3.11+** installed.
+#### 1. Prerequisites
+Ensure you have **Python 3.12+** installed.
 ```bash
 python3 --version
 ```
@@ -23,6 +23,11 @@ source venv/bin/activate
 ```
 
 ### 3. Install Dependencies
+
+Install the latest version of the jax-tpu-embedding library:
+```bash
+pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+```
 ```bash
 pip install -r requirements.txt
 ```
@@ -41,57 +46,17 @@ python dlrm_experiment_test.py
 
 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.
 
-### 1. Create a Dockerfile
-Create a file named `Dockerfile` in the root of the repository:
-
-```dockerfile
-# Use an official Python 3.12 runtime as a parent image
-FROM python:3.12-slim
-
-# Set the working directory
-WORKDIR /app
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# Install system tools if needed (e.g., git)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
-# Install the latest jax-tpu-embedding wheel
-COPY jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl ./
-RUN pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
-
-# Copy requirements.txt to current directory
-COPY requirements.txt ./
-
-# Install dependencies
-RUN pip install --upgrade pip
-RUN pip install -r ./requirements.txt
-
-# Force install the specific protobuf version
-RUN pip install "protobuf>=6.31.1" --no-deps
-
-# Copy the current directory contents into the container
-COPY . /app
-
-# Default command to run the training script
-CMD ["python", "recml/examples/dlrm_experiment_test.py"]
-```
-
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
-
-### 2. Build the Image
+### 1. Build the Image
 
 Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
 
 ```bash
 docker build -t recml-training .
 ```
 
-### 3. Run the Image
+### 2. Run the Image
+
+This will run the docker image and execute the command specified, which is currently set to run DLRM.
 
 ```bash
 docker run --rm --privileged \
@@ -100,9 +65,3 @@ docker run --rm --privileged \
 --name recml-experiment \
 recml-training
 ```
-
-### What is happening here?
-* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
-* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
-* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
-* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
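
Note on the version requirement: the wheel filename added in this commit carries the `cp312-cp312` tag, which pip accepts only on CPython 3.12, so "Python 3.12+" may be broader than what that wheel actually supports. Below is a minimal sketch of a pre-install check, assuming `python3` is the interpreter used to create the virtual environment. `check_py_tag` is a hypothetical helper for illustration and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): prints "ok" when the given
# "major.minor" version string matches the wheel's cp312 tag, "mismatch" otherwise.
check_py_tag() {
  if [ "$1" = "3.12" ]; then
    echo "ok"
  else
    echo "mismatch"
  fi
}

# Only query the interpreter if python3 is actually on PATH.
if command -v python3 >/dev/null 2>&1; then
  # Ask the interpreter for its major.minor version, e.g. "3.12".
  current="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  echo "python3 reports ${current}: $(check_py_tag "${current}")"
fi
```

If the check reports a mismatch, `pip install` of the `cp312` wheel would be expected to fail with an unsupported-wheel error, so creating the virtual environment with a 3.12 interpreter first is the safer order.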
