Commit fdb73b4

Merge pull request #147 from AI-Hypercomputer/ajkv/library-fix
Updated code with latest jax-tpu-embedding wheel to run on V7
2 parents 9f1c0bd + 632a42e commit fdb73b4

4 files changed

Lines changed: 24 additions & 49 deletions

Dockerfile

Lines changed: 13 additions & 6 deletions
@@ -1,24 +1,31 @@
-# Use an official Python 3.11 runtime as a parent image
-FROM python:3.11-slim
+# Use an official Python 3.12 runtime as a parent image
+FROM python:3.12-slim
 
 # Set the working directory
 WORKDIR /app
 
-# Copy the current directory contents into the container
-COPY . /app
-
 # This tells Python to look in /app for the 'recml' package
 ENV PYTHONPATH="${PYTHONPATH}:/app"
 
 # Install system tools if needed (e.g., git)
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
+# Install the latest jax-tpu-embedding wheel
+COPY jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl ./
+RUN pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+
+# Copy requirements.txt to current directory
+COPY requirements.txt ./
+
 # Install dependencies
 RUN pip install --upgrade pip
-RUN pip install -r requirements.txt
+RUN pip install -r ./requirements.txt
 
 # Force install the specific protobuf version
 RUN pip install "protobuf>=6.31.1" --no-deps
 
+# Copy the current directory contents into the container
+COPY . /app
+
 # Default command to run the training script
 CMD ["python", "recml/examples/dlrm_experiment_test.py"]
Binary file not shown.
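
Note on the wheel: the filename committed here carries the tags `cp312-cp312-manylinux_2_31_x86_64`, which is consistent with the move from Python 3.11 to 3.12 elsewhere in this commit. A `cp312` ABI tag targets CPython 3.12 specifically, so pip will not select this wheel on other interpreter versions. The sketch below is not part of the commit; it is a minimal standard-library check, with the hypothetical helper names `parse_wheel_filename` and `interpreter_tag`, that compares the wheel's Python tag with the running interpreter. It assumes a filename without the optional build tag.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its five dash-separated components.

    Assumes there is no optional build tag in the name.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return WheelTags(*parts)


def interpreter_tag(version_info=sys.version_info) -> str:
    """Return the CPython tag for an interpreter, e.g. 'cp312' for 3.12."""
    return f"cp{version_info[0]}{version_info[1]}"


# Wheel filename taken from the Dockerfile diff above.
WHEEL = "jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl"

tags = parse_wheel_filename(WHEEL)
compatible = tags.python_tag == interpreter_tag()
print(f"wheel tag: {tags.python_tag}, interpreter: {interpreter_tag()}, match: {compatible}")
```

Running this inside the virtual environment or the container before `pip install` gives an early, readable signal if the interpreter and the wheel do not match.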

requirements.txt

Lines changed: 1 addition & 2 deletions
@@ -34,7 +34,6 @@ importlib-resources==6.5.2
 iniconfig==2.1.0
 isort==6.0.1
 jax==0.8.2
-jax-tpu-embedding==0.1.0.dev20251208
 jaxlib==0.8.2
 jaxtyping==0.3.1
 Jinja2==3.1.6
@@ -123,4 +122,4 @@ wadler-lindig==0.1.5
 Werkzeug==3.1.3
 wheel==0.45.1
 wrapt==1.17.2
-zipp==3.21.0
+zipp==3.21.0
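
Note on the dropped pin: removing `jax-tpu-embedding==0.1.0.dev20251208` presumably keeps `pip install -r requirements.txt` from replacing the locally installed wheel with the older pinned release. That reading is an inference from the diff, not something the commit message states. The sketch below is not part of the commit; it uses a hypothetical helper, `pinned_packages`, to confirm that a requirements file no longer pins the package. The `BEFORE` and `AFTER` strings are short excerpts built from the hunk above, not the full file.

```python
import re


def pinned_packages(requirements_text: str) -> dict[str, str]:
    """Map normalized package names to their '==' pinned versions."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned entries
        name, version = line.split("==", 1)
        # Normalize names: lowercase, runs of '-', '_', '.' collapse to '-'.
        normalized = re.sub(r"[-_.]+", "-", name.strip()).lower()
        pins[normalized] = version.strip()
    return pins


# Excerpts of requirements.txt before and after this commit.
BEFORE = "jax==0.8.2\njax-tpu-embedding==0.1.0.dev20251208\njaxlib==0.8.2\n"
AFTER = "jax==0.8.2\njaxlib==0.8.2\n"

print("jax-tpu-embedding" in pinned_packages(BEFORE))  # True
print("jax-tpu-embedding" in pinned_packages(AFTER))   # False
```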

training.md

Lines changed: 10 additions & 41 deletions
@@ -7,7 +7,7 @@ This guide explains how to set up the environment and train the HSTU/DLRM models
 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.
 
 ### 1. Prerequisites
-Ensure you have **Python 3.11+** installed.
+Ensure you have **Python 3.12+** installed.
 ```bash
 python3 --version
 ```
@@ -23,6 +23,11 @@ source venv/bin/activate
 ```
 
 ### 3. Install Dependencies
+
+Install the latest version of the jax-tpu-embedding library:
+```bash
+pip install ./jax_tpu_embedding-0.1.0.dev20260121-cp312-cp312-manylinux_2_31_x86_64.whl
+```
 ```bash
 pip install -r requirements.txt
 ```
@@ -41,47 +46,17 @@ python dlrm_experiment_test.py
 
 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.
 
-### 1. Create a Dockerfile
-Create a file named `Dockerfile` in the root of the repository:
-
-```dockerfile
-# Use an official Python 3.11 runtime as a parent image
-FROM python:3.11-slim
-
-# Set the working directory
-WORKDIR /app
-
-# Copy the current directory contents into the container
-COPY . /app
-
-# This tells Python to look in /app for the 'recml' package
-ENV PYTHONPATH="${PYTHONPATH}:/app"
-
-# Install system tools if needed (e.g., git)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
-# Install dependencies
-RUN pip install --upgrade pip
-RUN pip install -r requirements.txt
-
-# Force install the specific protobuf version
-RUN pip install "protobuf>=6.31.1" --no-deps
-
-# Default command to run the training script
-CMD ["python", "recml/examples/dlrm_experiment_test.py"]
-```
-
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
-
-### 2. Build the Image
+### 1. Build the Image
 
 Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
 
 ```bash
 docker build -t recml-training .
 ```
 
-### 3. Run the Image
+### 2. Run the Image
+
+This will run the docker image and execute the command specified, which is currently set to run DLRM.
 
 ```bash
 docker run --rm --privileged \
@@ -90,9 +65,3 @@ docker run --rm --privileged \
 --name recml-experiment \
 recml-training
 ```
-
-### What is happening here?
-* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
-* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
-* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
-* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
