
Data Versioning using DVC for MLOps pipelines

Atul Yadav
4 min read · Sep 20, 2022


Before discussing why data versioning with DVC matters, let's first look at the day-to-day challenges data scientists face:

  1. Tracking the data science project so that results can be reproduced: Live data systems continuously ingest new data points while different users run different experiments on the same datasets. This leads to multiple versions of the same dataset.
  2. Auditing data and models: Several versions of a model trained on different versions of the same dataset can create discrepancies. Without proper auditing and versioning, this turns into a tangled web of datasets and experiments.

Why Is Data Versioning So Important?

Data versioning is useful for data reproducibility, trustworthiness, compliance, and auditing. A data version uniquely identifies a revision of a dataset, and that uniqueness lets consumers of the dataset know whether and how it has changed over a given period. They can identify exactly which version of the dataset is in use.
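To make the idea of a version that "uniquely identifies" a revision concrete, here is a minimal sketch that derives an identifier from a file's contents. It is only an illustration, not DVC's implementation: the helper names `dataset_version` and `demo`, the file name `train.csv`, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest that identifies the exact contents of a file.

    Hypothetical helper: SHA-256 is chosen here for illustration only.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo() -> tuple[str, str, str]:
    """Hash a toy dataset, hash it again, then hash it after new rows arrive."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"  # hypothetical dataset file
        data.write_text("id,label\n1,0\n2,1\n")
        v1 = dataset_version(data)
        v1_again = dataset_version(data)  # unchanged data, same identifier

        # Simulate a live system ingesting a new data point.
        with data.open("a") as f:
            f.write("3,1\n")
        v2 = dataset_version(data)  # changed data, new identifier
    return v1, v1_again, v2


if __name__ == "__main__":
    v1, v1_again, v2 = demo()
    print("same data, same version:", v1 == v1_again)
    print("new data, new version:  ", v1 != v2)
```

Because the identifier depends only on the bytes of the file, two people holding the same identifier can be confident they are working with the same revision, and any ingested row produces a different one.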

Data Versioning Use Cases

  1. Data Tracking and maintainability: As a data scientist, you might not only want to control different versions of your code but also control different versions of your…


Written by Atul Yadav

MLOps | DataOps | DevOps Practitioner
