🎓 What is this? This tutorial provides hands-on experience with Data Version Control (DVC), demonstrating how to manage multiple datasets and machine learning model artifacts alongside your code. We'll use a classic image classification problem (cats vs. dogs) to illustrate DVC's capabilities for reproducibility and collaboration in MLOps.
👩‍💻 Who is this for? ML Engineers, Data Scientists, and AI Developers who need to manage large datasets and models within Git repositories, ensuring experiment reproducibility and traceable deployments. A basic understanding of Python, Git, and command-line operations is assumed.
🎯 What will you learn?
- How to initialize a DVC project within a Git repository.
- How to version large files and directories using dvc add.
- How DVC tracks data changes using .dvc files and a cache.
- How to switch between different versions of your data and models using dvc checkout.
- How to set up and use remote storage to store and share DVC-tracked data with dvc push and dvc pull.
- Advanced data access methods like dvc get, dvc import, and dvc import-url.
- The fundamentals of DVC pipelines using dvc stage add and dvc repro for automating ML workflows (covered conceptually only; this tutorial has no dedicated pipeline section).
🔍 How is it structured? We'll start by setting up our environment and performing basic file and directory versioning. Then, we'll demonstrate tracking changes and switching versions. Next, we'll configure remote storage for sharing data. Finally, we'll cover advanced data access patterns and introduce DVC pipelines for automated model versioning conceptually.
⏱️ How much time will it take? Approximately 55-70 minutes to complete, providing a solid foundation in DVC for your MLOps workflows.
- ⚙️ 1. Setting Up Your Environment
- ⭐ 2. Versioning Your First Dataset (File)
- 📁 3. Versioning a Directory (Cats & Dogs Dataset)
- ♻️ 4. Tracking Changes & Updating Data
- ⏪ 5. Switching Between Data Versions (dvc checkout)
- ☁️ 6. Storing and Sharing Data with Remotes
- 📥 7. Advanced Data Access Commands
- 🔍 8. Understanding DVC File and Cache Internals
- 🔗 Additional Resources
- 🎉 Conclusion & Next Steps
Let's prepare your system for working with DVC.
Make sure you have the following installed:
- Python 3.8+: For running the train.py script.
- Git: Essential for version control of your code and .dvc files.
- DVC (Data Version Control): Follow the official DVC installation instructions if you don't have it already.
- unzip: A command-line tool for extracting .zip archives.
We'll start by cloning a ready-made Git repository that contains a train.py script and requirements.txt.
git clone https://github.com/iterative/example-versioning.git
cd example-versioning
Output (example):
Cloning into 'example-versioning'...
remote: Enumerating objects: 49, done.
...
Now, let's create a new Git branch for our tutorial work to keep things organized.
git checkout -b tutorial
Output:
Switched to a new branch 'tutorial'
It's best practice to use a virtual environment for your Python projects to manage dependencies.
- Create a virtual environment:
  python3 -m venv .env
- Activate the virtual environment:
  - On macOS/Linux:
    source .env/bin/activate
  - On Windows (Command Prompt):
    .env\Scripts\activate.bat
  - On Windows (PowerShell):
    .env\Scripts\Activate.ps1
  You should see (.env) prepended to your command prompt, indicating the virtual environment is active.
- Install project requirements:
  pip install -r requirements.txt
  This might take a few minutes as it installs libraries like TensorFlow/Keras for the dummy model.
The first step in any DVC project is to initialize it within your Git repository.
dvc init
Output:
Initialized DVC repository.
You can now add your first data file or folder to DVC:
dvc add <filename|dirname>
...
Let's see what DVC initialization has done:
ls -la .dvc/
Output (example):
total 8
drwxr-xr-x 4 user user 4096 Apr 1 10:00 .
drwxr-xr-x 8 user user 4096 Apr 1 10:00 ..
-rw-r--r-- 1 user user 122 Apr 1 10:00 config
drwxr-xr-x 2 user user 4096 Apr 1 10:00 tmp
-rw-r--r-- 1 user user 8 Apr 1 10:00 .gitignore
DVC creates a .dvc/ directory and populates it with internal files. The most important one right now is .dvc/config, which stores DVC's settings. It also creates a .dvc/.gitignore file to ensure Git ignores DVC's internal cache.
Now, let's add these DVC-related files to Git and commit them. This ties your DVC setup to your Git repository.
git status
Output (example):
On branch tutorial
...
Untracked files:
(use "git add <file>..." to include in what will be committed)
.dvc/
git add .dvc/
git commit -m "Initialize DVC"
Output:
[tutorial 0c7d42a] Initialize DVC
2 files changed, 5 insertions(+)
create mode 100644 .dvc/.gitignore
create mode 100644 .dvc/config
Now that DVC is set up, let's learn how to track individual files. We'll start with a simple XML dataset.
We'll download a small data.xml file. Notice the -o data/data.xml flag, which specifies the output path.
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Output:
100%|█████████████████████████████████████████████████████████████████████████████████████| 2.72K/2.72K [00:00]
dvc get is like wget or curl but specifically designed to download files tracked by DVC (or Git) repositories.
Now, let's tell DVC to track this data.xml file.
dvc add data/data.xml -v
Output (example):
Added 'data/data.xml' to DVC cache.
To share it, run 'dvc push'.
To put it under Git version control, run 'git add data/data.xml.dvc'.
What happened?
- DVC stored the content of data.xml in its cache (located in .dvc/cache). The file data/data.xml itself stays in your workspace, linked or copied back from the cache.
- It created a small .dvc file next to it (specifically, data/data.xml.dvc). This .dvc file is a plain text file that acts as a pointer to the data's content in the DVC cache.
- DVC also automatically created data/.gitignore to tell Git to ignore the actual data.xml file, so Git only tracks the .dvc pointer file.
Let's inspect the .dvc file:
cat data/data.xml.dvc
Output (example):
outs:
- md5: a304afb96060aad90176268345e10355
path: data.xml
size: 2785
This .dvc file contains metadata about the data: an MD5 hash of its content, its path, and its size. The MD5 hash is crucial as it uniquely identifies the data's content in the DVC cache.
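To make the pointer file less abstract, here is a minimal Python sketch that computes the same kind of metadata (an MD5 digest and a byte size) for a file. The helper name describe_file and the sample content are hypothetical, and DVC's own hashing may differ in details between versions, so treat this as an illustration rather than a reimplementation.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=1 << 20):
    """Return pointer-style metadata (md5, size, path) for one file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "data.xml")  # hypothetical sample file
        with open(sample, "wb") as handle:
            handle.write(b"<rows></rows>\n")
        print(describe_file(sample))
```

Because the digest depends only on the bytes, renaming the file changes path but leaves md5 untouched, which is what lets a cache key content independently of file names.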
Since data.xml.dvc is just a small text file, we can commit it to Git. This links the specific version of your data to your Git commit history.
git add data/.gitignore data/data.xml.dvc
git commit -m "Add raw data.xml"
Output:
[tutorial 4d0e7f7] Add raw data.xml
2 files changed, 6 insertions(+)
create mode 100644 data/.gitignore
create mode 100644 data/data.xml.dvc
Now, your Git repository tracks the metadata about your data, while DVC manages the large data content efficiently.
DVC excels at versioning entire directories containing many files, which is common for image datasets.
Let's create a new Git branch to track our first version of the cats and dogs dataset.
git checkout -b cats-dogs-v1
Output:
Switched to a new branch 'cats-dogs-v1'
We'll download a larger dataset consisting of cat and dog images into a directory named datadir. We're using dvc get --rev to specify a particular version of the dataset from the remote DVC repository.
dvc get --rev cats-dogs-v1 https://github.com/iterative/dataset-registry use-cases/cats-dogs -o datadir
Output (example):
100%|█████████████████████████████████████████████████████████████████████████████████████| 67.5M/67.5M [00:03]
This command downloads the dataset, which includes train and validation subdirectories for cats and dogs images.
Let's confirm some files are there:
ls datadir/data/train/cats
Output (example):
cat.1.jpg cat.10.jpg cat.100.jpg ... cat.999.jpg
Now, add the entire datadir to DVC.
dvc add datadir
Output (example):
Added 'datadir' to DVC cache.
To share it, run 'dvc push'.
To put it under Git version control, run 'git add datadir.dvc'.
Similar to adding a file, dvc add stores the datadir content in the DVC cache and creates datadir.dvc. For directories, DVC also writes a special .dir file to its cache that lists the hash and relative path of every file inside the tracked directory. This allows efficient versioning of large folder structures.
Inspect the .dvc file for the directory:
cat datadir.dvc
Output (example):
outs:
- md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
path: datadir
Notice the .dir suffix on the MD5 hash, indicating it refers to a directory hash.
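The idea of a directory hash can be sketched in a few lines of Python: list every file with its own hash, serialize that listing, and hash the result. The function names and the JSON serialization below are assumptions made for illustration; the hashes this produces are not expected to match the ones DVC writes.

```python
import hashlib
import json
import os
import tempfile


def file_md5(path):
    """MD5 hex digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def directory_manifest(root):
    """List every file under root as {"md5", "relpath"}, sorted by path."""
    entries = []
    for current, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(current, name)
            relpath = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": relpath})
    return sorted(entries, key=lambda entry: entry["relpath"])


def directory_hash(root):
    """Hash the serialized manifest and mark the result with a .dir suffix."""
    payload = json.dumps(directory_manifest(root), sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest() + ".dir"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "train", "cats"))
        with open(os.path.join(root, "train", "cats", "cat.1.jpg"), "wb") as f:
            f.write(b"not really a jpeg")  # placeholder bytes
        print(directory_manifest(root))
        print(directory_hash(root))
```

Adding, removing, or editing any single file changes the manifest and therefore the directory hash, while two directories with identical contents end up with the same hash.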
Commit the datadir.dvc file to Git and add a Git tag to mark this specific data version.
git add .gitignore datadir.dvc # .gitignore is created by DVC when adding a dir
git commit -m "Add cats-dogs dataset (v1)"
git tag -a cats-dogs-v1 -m "Dataset version v1.0"
Output:
[cats-dogs-v1 2a3b4c5] Add cats-dogs dataset (v1)
2 files changed, 4 insertions(+)
create mode 100644 .gitignore
create mode 100644 datadir.dvc
You now have a Git tag cats-dogs-v1 that points to a specific version of your datadir dataset!
One of DVC's core strengths is detecting and tracking changes to your data.
Use dvc status to see the state of your DVC-tracked files.
dvc status
Output:
Data and pipelines are up to date.
This indicates that all DVC-tracked data in your workspace matches what DVC expects (i.e., the content pointed to by the .dvc files in your current Git commit).
Let's simulate an update where our cats-dogs dataset doubles in size (e.g., new images added after more data collection). We'll switch to a new Git branch first.
git checkout -b cats-dogs-v2
Output:
Switched to a new branch 'cats-dogs-v2'
Now, download the updated dataset version. This will overwrite the datadir in your workspace. If dvc get refuses because datadir already exists, rerun the command with -f (--force).
dvc get --rev cats-dogs-v2 https://github.com/iterative/dataset-registry use-cases/cats-dogs -o datadir
Output (example):
100%|█████████████████████████████████████████████████████████████████████████████████████| 90.7M/90.7M [00:04]
The datadir in your project now contains the "v2" version of the data.
Let's check dvc status again:
dvc status
Output (example):
datadir.dvc:
    changed outs:
        modified:           datadir
The report tells you that the content of datadir no longer matches the hash recorded in datadir.dvc by the last dvc add. The exact layout of the message can vary between DVC versions.
To capture this new version of the data, we simply run dvc add again.
dvc add datadir
Output (example):
M datadir
Updated 'datadir' in DVC cache.
To share it, run 'dvc push'.
To put it under Git version control, run 'git add datadir.dvc'.
DVC detects the change, updates its cache, and modifies datadir.dvc to point to the new content hash.
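Conceptually, spotting what changed between two dataset versions comes down to comparing two listings of path-to-hash pairs. The sketch below shows that comparison in Python; diff_manifests and the short hash strings are hypothetical, and this is not a description of how DVC implements it internally.

```python
def diff_manifests(old, new):
    """Compare two {relpath: md5} mappings and group the differences."""
    added = sorted(path for path in new if path not in old)
    deleted = sorted(path for path in old if path not in new)
    modified = sorted(
        path for path in new if path in old and new[path] != old[path]
    )
    return {"added": added, "deleted": deleted, "modified": modified}


if __name__ == "__main__":
    # Made-up hashes standing in for the v1 and v2 directory listings.
    v1 = {"train/cats/cat.1.jpg": "aa11", "train/dogs/dog.1.jpg": "bb22"}
    v2 = {
        "train/cats/cat.1.jpg": "aa11",
        "train/dogs/dog.1.jpg": "bb99",
        "train/dogs/dog.2.jpg": "cc33",
    }
    print(diff_manifests(v1, v2))
```

Files whose hash is unchanged never need to be stored again, which is why doubling a dataset only adds the new files to the cache.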
Now, commit the updated datadir.dvc file to Git and tag it as cats-dogs-v2.
git add datadir.dvc
git commit -m "Updated cats-dogs dataset (v2)"
git tag -a cats-dogs-v2 -m "Dataset version v2.0"
Output:
[cats-dogs-v2 b7d8e9f] Updated cats-dogs dataset (v2)
1 file changed, 2 insertions(+), 2 deletions(-)
You now have two distinct versions of your dataset (cats-dogs-v1 and cats-dogs-v2) linked to different Git tags, all managed efficiently by DVC.
The power of DVC comes from its ability to quickly switch between different versions of your data, just like Git switches code versions.
Let's switch back to our initial tutorial branch, which does not contain the datadir.dvc file in its Git history.
git checkout tutorial
Output:
Switched to branch 'tutorial'
Now, check your files:
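ls
Output (example):
.dvc .env data datadir main.py requirements.txt train.py
Notice that datadir is still in your workspace, even though the tutorial branch has no datadir.dvc. git checkout only swaps the files Git tracks; Git never tracked the data directory itself, so it leaves it untouched.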
To ensure your workspace accurately reflects the DVC-tracked data for the current Git commit, you need to run dvc checkout. This command will synchronize your workspace with the DVC cache based on the .dvc files that are currently in your Git index.
dvc checkout
Output:
M data/data.xml
D datadir
What happened?
- M data/data.xml: If the data/data.xml in your workspace (from previous steps) was different from what data/data.xml.dvc (tracked on the tutorial branch) points to, DVC updated it.
- D datadir: Since the tutorial branch's Git history does not include datadir.dvc, DVC recognized that the datadir directory in your workspace was no longer tracked for this Git commit. DVC therefore deleted datadir from your workspace to keep it consistent with the current branch's data versioning state.
This demonstrates how DVC intelligently removes untracked (by DVC) data from your workspace when you switch to a Git commit that doesn't expect it.
Let's restore the cats-dogs-v1 version of the dataset. First, switch Git to that version. The branch and the tag share the name cats-dogs-v1; Git resolves the ambiguous name to the branch, which still points at the tagged commit.
git checkout cats-dogs-v1
Output:
warning: refname 'cats-dogs-v1' is ambiguous.
Switched to branch 'cats-dogs-v1'
Now, if you check your files:
ls
Output (example):
.dvc .env data datadir.dvc main.py requirements.txt train.py
The datadir.dvc pointer file is back, but the datadir directory is not, because the previous dvc checkout deleted it. Why? Because git checkout only restores Git-tracked files such as the .dvc pointers. It does not restore the actual data content. That's DVC's job!
To truly bring the cats-dogs-v1 data content into your workspace, you must run dvc checkout:
dvc checkout
Output:
A datadir
A indicates that datadir was added to your workspace from the cache (you would see M instead if an outdated copy of the directory had still been present and was swapped for the v1 content). Now, your workspace's datadir should match the cats-dogs-v1 content.
ls datadir/data/train/dogs | head -n 5
You can verify the number of files or file names to ensure you are on v1 (e.g., v1 had 500 dogs, v2 had 1000). This demonstrates that DVC efficiently retrieves the correct data version from its cache without re-downloading or copying large files unnecessarily.
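The division of labour between the pointer (in Git) and the content (in the cache) can be imitated with a small Python sketch. The names store and checkout are hypothetical, and the sketch copies files for simplicity, whereas DVC can also link them from the cache; it only illustrates the "look up the wanted hash, then restore it" idea.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path):
    """MD5 hex digest of a file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def store(cache_dir, data):
    """Save bytes in the cache under their MD5 and return the hash."""
    digest = hashlib.md5(data).hexdigest()
    target = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(data)
    return digest


def checkout(cache_dir, workspace_path, wanted_md5):
    """Make workspace_path match wanted_md5; return "A", "M", or None."""
    if os.path.exists(workspace_path):
        if md5_of(workspace_path) == wanted_md5:
            return None  # workspace already matches the pointer
        status = "M"  # an older version is present and gets replaced
    else:
        status = "A"  # nothing there yet, so the file is added
    cached = os.path.join(cache_dir, wanted_md5[:2], wanted_md5[2:])
    shutil.copyfile(cached, workspace_path)
    return status


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        cache = os.path.join(base, "cache")
        target = os.path.join(base, "data.xml")
        v1 = store(cache, b"version one")
        v2 = store(cache, b"version two")
        print(checkout(cache, target, v2))  # file is created
        print(checkout(cache, target, v1))  # file is swapped back to v1
```

No network access is involved at any point: both versions already sit in the local cache, so switching is just a matter of restoring the right object.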
DVC's cache stores data locally. To share data with your team or deploy models, you need a remote storage location. This can be cloud storage (S3, GCS, Azure Blob), network file systems, or even a local directory.
For simplicity in this tutorial, we'll set up a local directory as our DVC remote. In a real-world scenario, this would be cloud storage.
mkdir -p /tmp/dvc_remote_storage
dvc remote add -d local_storage /tmp/dvc_remote_storage
Output (example):
Setting 'local_storage' as a default remote.
- mkdir -p /tmp/dvc_remote_storage: Creates a temporary directory. 👉 IMPORTANT: In a real project, NEVER use /tmp for long-term storage, as it's frequently cleared by your system. Use a persistent directory or, ideally, cloud storage.
- dvc remote add -d local_storage /tmp/dvc_remote_storage: Tells DVC about a new remote named local_storage and sets it as the default (-d).
This command modifies your .dvc/config file. Let's see:
cat .dvc/config
Output (example):
[core]
remote = local_storage
['remote "local_storage"']
url = /tmp/dvc_remote_storage
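If you ever want to read this file from a script, Python's standard configparser can handle the layout shown above. The sketch below is an assumption-laden illustration, not a DVC API: default_remote_url is a hypothetical helper, and it strips single quotes from section names so that both quoted and unquoted headers resolve.

```python
import configparser

# Sample text modelled on the listing above (hypothetical, not read from disk).
SAMPLE_CONFIG = """\
[core]
remote = local_storage
['remote "local_storage"']
url = /tmp/dvc_remote_storage
"""


def default_remote_url(config_text):
    """Return the URL of the remote named by core.remote."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Normalize section names such as 'remote "x"' by dropping outer quotes.
    sections = {name.strip("'"): parser[name] for name in parser.sections()}
    remote_name = sections["core"]["remote"]
    return sections['remote "%s"' % remote_name]["url"]


if __name__ == "__main__":
    print(default_remote_url(SAMPLE_CONFIG))
```

For anything beyond a quick look, dvc remote list and dvc config are the safer ways to inspect these settings.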
Now, commit this configuration change to Git so your team knows where the data is supposed to be stored.
git add .dvc/config
git commit -m "Add local DVC remote storage"
Output:
[cats-dogs-v1 8e9f0a1] Add local DVC remote storage
1 file changed, 3 insertions(+)
To upload the DVC-tracked data from your local cache to the configured remote, use dvc push.
dvc push -v
Output (example, showing data being pushed):
Preparing to push data.
...
Pushing 'datadir' to 'local_storage'
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00]
...
Pushed data to 'local_storage'.
DVC only pushes the actual data content, not the .dvc pointer files (which are in Git). It's intelligent enough to only upload data that isn't already present in the remote storage.
You can verify the data was pushed by checking the remote directory (it mirrors DVC's cache structure):
ls /tmp/dvc_remote_storage
Output (example):
a3 b6
These are directories named after the first two characters of the data's MD5 hashes.
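The "only upload what is missing" behaviour is essentially a set difference between the objects in the local cache and those already in the remote. Here is a minimal Python sketch of that idea for a directory-based remote; list_objects and push are hypothetical helpers, and real remotes (S3, GCS, and so on) would need their own transfer code.

```python
import os
import shutil
import tempfile


def list_objects(root):
    """Return the set of object paths (relative, '/'-separated) under root."""
    found = set()
    for current, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(current, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def push(cache_dir, remote_dir):
    """Copy objects missing from remote_dir; return what was uploaded."""
    missing = sorted(list_objects(cache_dir) - list_objects(remote_dir))
    for rel in missing:
        parts = rel.split("/")
        target = os.path.join(remote_dir, *parts)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(os.path.join(cache_dir, *parts), target)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        cache = os.path.join(base, "cache")
        remote = os.path.join(base, "remote")
        os.makedirs(os.path.join(cache, "a3"))
        with open(os.path.join(cache, "a3", "04af"), "wb") as handle:
            handle.write(b"payload")  # placeholder object
        print(push(cache, remote))  # first push uploads the object
        print(push(cache, remote))  # second push finds nothing to do
```

Because objects are named after their content hash, checking whether a name already exists in the remote is enough to know the content is there.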
Now, let's simulate a scenario where a new team member joins or you're working on a new machine. The data isn't in your local DVC cache.
- Remove local cache and data:
  rm -rf .dvc/cache
  rm -rf datadir
  rm -rf data/data.xml # Clean up the individual file too
  Your workspace should now be clean of the actual data files.
- Pull data from the remote:
  dvc pull -v
  Output (example):
  Preparing to pull data.
  ...
  100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00]
  Pulled data from 'local_storage'.
  dvc pull checks the .dvc files in your current Git commit, looks up the corresponding data hashes in the remote, downloads them to your local DVC cache, and then links them back into your workspace.
  Verify that your data is back:
  ls datadir
  ls data
  You should see both datadir and data/data.xml restored. This demonstrates how DVC enables seamless data sharing and environment setup for collaborative ML projects.
DVC offers more specialized commands for accessing and incorporating data from other DVC repositories or external URLs.
You can explore the contents of any DVC repository hosted on a Git server without cloning it first.
dvc list https://github.com/iterative/dataset-registry use-cases
Output (example):
cats-dogs/
get-started/
This shows the directories available under the use-cases path in that DVC dataset registry.
You've already used dvc get earlier. It downloads data from a DVC remote repository to your current working directory, but it does not automatically add the downloaded data to your DVC project.
Let's get cats-dogs again, but this time, into a new folder to confirm it's not DVC-tracked locally:
dvc get https://github.com/iterative/dataset-registry use-cases/cats-dogs -o cats-dogs-untracked
Output (example):
100%|█████████████████████████████████████████████████████████████████████████████████████| 90.7M/90.7M [00:04]
ls
You'll see a cats-dogs-untracked directory. If you check dvc status or look for a cats-dogs-untracked.dvc file, you won't find one. This is useful when you just need a copy of data for temporary exploration or a one-off task.
dvc import is like dvc get followed by dvc add. It downloads data from another DVC repository and automatically registers it with your local DVC project, creating a .dvc file for it.
dvc import git@github.com:iterative/example-get-started data/data.xml -o imported_data.xml
Output (example):
Importing 'data/data.xml' from 'git@github.com:iterative/example-get-started'.
100%|█████████████████████████████████████████████████████████████████████████████████████| 2.72K/2.72K [00:00]
Now, check your directory:
ls
cat imported_data.xml.dvc
You'll see imported_data.xml and its corresponding imported_data.xml.dvc file, indicating it's now tracked by your DVC project.
dvc import-url allows you to track data directly from an external URL (HTTP/HTTPS, S3, GCS, Azure Blob, etc.) into your DVC project. It downloads the data and creates a .dvc file for it. This is useful for managing external datasets that you don't control.
dvc import-url https://data.dvc.org/get-started/data.xml -o external_data.xml
Output (example):
100%|█████████████████████████████████████████████████████████████████████████████████████| 2.72K/2.72K [00:00]
You will find external_data.xml and external_data.xml.dvc in your project. If the data at the URL changes, dvc update external_data.xml.dvc is the command that fetches the new version and updates the .dvc file.
Let's take a closer look at the files and directories DVC creates and manages, which underpin its versioning capabilities.
We've already seen examples like data/data.xml.dvc and datadir.dvc. These are small YAML files that act as pointers to the actual data content, which lives in DVC's cache.
For a single file (data/data.xml.dvc), the .dvc file usually contains:
- md5: An MD5 hash of the file's content. This identifies the content in the cache.
- path: The relative path to the data file in the workspace.
- size: The size of the data file in bytes.
For a directory (datadir.dvc), the md5 field ends with .dir, indicating that the hash represents the content of the entire directory, not a single file. This "directory hash" is calculated from the hashes and relative paths of all files within it.
The .dvc/cache directory is where DVC stores the actual content of your data and model files.
ls -la .dvc/cache
Output (example):
drwxr-xr-x 4 user user 4096 Apr 1 10:00 .
drwxr-xr-x 8 user user 4096 Apr 1 10:00 ..
drwxr-xr-x 2 user user 4096 Apr 1 10:00 a3
drwxr-xr-x 2 user user 4096 Apr 1 10:00 b6
Inside .dvc/cache, DVC organizes files by their MD5 hash. The first two characters of the hash form a subdirectory, and the remaining characters become the filename within that subdirectory. This avoids putting too many files in one directory and keeps lookups efficient. (DVC 3.x nests this same layout under .dvc/cache/files/md5.)
For example, if your data.xml had an MD5 hash starting with a3, its content would be stored in .dvc/cache/a3/04afb96060aad90176268345e10355. If datadir had a directory hash starting with b6, its metadata (listing hashes of its contents) would be in .dvc/cache/b6/923e1e4ad16ea1a7e2b328842d56a2.dir.
Key Advantage: If you have multiple DVC-tracked files with different names but identical content, DVC only stores one copy of that content in the cache, saving disk space.
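Both the two-character directory split and the deduplication follow naturally from naming objects after their content hash. The Python sketch below illustrates this with hypothetical helpers cache_path and add_to_cache; it mirrors the layout described in this section and is not DVC's actual code.

```python
import hashlib
import os
import tempfile


def cache_path(cache_dir, md5_hex):
    """Split a hash into a two-character directory and the remaining name."""
    return os.path.join(cache_dir, md5_hex[:2], md5_hex[2:])


def add_to_cache(cache_dir, data):
    """Store bytes once per unique content; return (md5, newly_written)."""
    md5_hex = hashlib.md5(data).hexdigest()
    target = cache_path(cache_dir, md5_hex)
    if os.path.exists(target):
        return md5_hex, False  # identical content is already cached
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(data)
    return md5_hex, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache:
        # Two "files" with different names but the same bytes.
        print(add_to_cache(cache, b"same pixels"))
        print(add_to_cache(cache, b"same pixels"))
        print(cache_path(".dvc/cache", "a304afb96060aad90176268345e10355"))
```

The second call writes nothing, because the object for that content already exists; only the pointer files in your workspace differ.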
The .dvc/ directory is DVC's internal workspace for your project.
ls -la .dvc
Output (example):
drwxr-xr-x 9 user user 4096 Apr 1 10:00 .
drwxr-xr-x 8 user user 4096 Apr 1 10:00 ..
-rw-r--r-- 1 user user 122 Apr 1 10:00 config
drwxr-xr-x 2 user user 4096 Apr 1 10:00 cache
-rw-r--r-- 1 user user 8 Apr 1 10:00 .gitignore
drwxr-xr-x 2 user user 4096 Apr 1 10:00 tmp
# Other directories may appear depending on DVC version and features used:
# plots/, experiments/ etc.
- .dvc/config: Main DVC configuration (remotes, cache location, etc.).
- .dvc/cache: Where actual data/model content is stored.
- .dvc/.gitignore: Ensures Git ignores the cache and other DVC internal files.
- .dvc/tmp: Temporary files used by DVC.
- Other directories like .dvc/plots or .dvc/experiments might appear if you use DVC's plotting or experiment management features.
Understanding these internals helps you grasp how DVC efficiently links your Git-versioned code with your large, DVC-versioned data and models.
- DVC Official Documentation: https://dvc.org/doc
- DVC Get Started: https://dvc.org/doc/start
- DVC User Guide (Files & Directories): https://dvc.org/doc/user-guide/dvc-files-and-directories
- DVC Get Started: Data Pipelines: https://dvc.org/doc/start/data-pipelines
- Example: Tracking a remote file (for dvc import-url): https://dvc.org/doc/command-reference/import-url
Congratulations! You've successfully completed this hands-on tutorial on Data and Model Versioning with DVC.
- You've learned how to initialize DVC and version large files and directories alongside your Git repository.
- You now understand the crucial role of .dvc files and the DVC cache in maintaining data integrity and reproducibility.
- You've practiced switching between different data versions and sharing data efficiently using DVC remotes.
- You've gained an initial understanding of DVC pipelines, which allow you to track and reproduce entire ML workflows.
- Experiment Management (MLflow): Explore how DVC integrates with MLflow for comprehensive experiment tracking, logging metrics, and managing models in a registry.
- Continuous Integration/Delivery for ML (CI/CD for MLOps): Learn how to set up CI/CD pipelines that leverage dvc repro to automate model retraining, evaluation, and deployment upon code or data changes.
- Advanced DVC Features: Dive deeper into DVC's capabilities for metrics and plots tracking, experiment versioning (dvc exp), and managing more complex data dependencies.
- Cloud Remotes: Transition from local DVC remotes to cloud storage solutions (AWS S3, Google Cloud Storage, Azure Blob Storage) for real-world production deployments.
Keep practicing these MLOps fundamentals. The ability to version, reproduce, and share your data and models reliably is a superpower in modern AI development!