Creating Mock Data Pipelines with Polyfactory and Python

Sam Torres
Updated March 6, 2026

Tests are only as useful as the data they run against, and realistic, well-structured mock data can make all the difference. Today, we’re diving into Polyfactory, a library that generates mock data straight from Python type hints. But what does that mean for developers? Let’s explore.

Setting Up Your Environment

Before we get started, it’s essential to set up our development environment properly. For this tutorial, you’ll need Python installed, along with Polyfactory, Pydantic, and attrs. The dataclasses module ships with the standard library, so it needs no separate install. If you haven’t already installed the rest, you can do so using pip:

pip install polyfactory pydantic attrs

Once you have your packages ready, let’s jump into creating our first mock data factory.
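Before moving on, it can help to confirm that the packages are actually visible to your interpreter. The following is a minimal sketch using only the standard library; the helper name installed_versions is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Print one line per package used in this tutorial.
    for name, found in installed_versions(["polyfactory", "pydantic", "attrs"]).items():
        print(f"{name}: {found or 'not installed'}")
```

Any package reported as not installed can be added with the pip command above.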

Building Factories for Data Classes

Polyfactory allows us to define factories that generate instances of Python classes. To illustrate this, let’s create a simple data class that models a user profile:

from dataclasses import dataclass

@dataclass
class UserProfile:
    username: str
    email: str
    age: int

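With our UserProfile class defined, we can create a factory for it. In Polyfactory 2.x, a factory is a subclass of DataclassFactory parameterized with the class it builds:

from polyfactory.factories import DataclassFactory

class UserProfileFactory(DataclassFactory[UserProfile]):
    __model__ = UserProfile

user = UserProfileFactory.build()

This short class gives us a factory whose build() method returns UserProfile instances with every field filled in from its type hint. But there’s so much more we can do!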

Customizing Your Data

One of the standout features of Polyfactory is its ability to customize the generated data. For instance, if we want the username generated by our factory to always start with a letter, we can declare that field on the factory and wrap a generator function in Use:

import string
from random import choice

from polyfactory import Use
from polyfactory.factories import DataclassFactory

def custom_username() -> str:
    # One leading letter, then seven letters or digits
    rest = ''.join(choice(string.ascii_letters + string.digits) for _ in range(7))
    return choice(string.ascii_letters) + rest

class UserProfileFactory(DataclassFactory[UserProfile]):
    __model__ = UserProfile
    username = Use(custom_username)

Now, every time we call UserProfileFactory.build(), it’ll generate a username that meets our criteria. This kind of customization is incredibly useful, especially when we’re trying to simulate real-world scenarios.

Pydantic Models and Validation

Next up, let’s discuss how we can integrate Pydantic models into our mock data pipeline. Pydantic provides data validation, which is particularly helpful when we need to ensure that our data meets specific criteria. Here’s how we can define a Pydantic model:

from pydantic import BaseModel

class UserProfileModel(BaseModel):
    username: str
    email: str
    age: int

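For Pydantic models, Polyfactory provides ModelFactory, which reads the model’s field types and constraints:

from polyfactory.factories.pydantic_factory import ModelFactory

class UserProfileModelFactory(ModelFactory[UserProfileModel]):
    __model__ = UserProfileModel

Because build() instantiates the model, Pydantic validates every generated instance, so a mismatch between the factory and the model surfaces as a validation error early in the development process.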

Nesting Models

But what if your data structures are more complex and require nested models? Polyfactory shines here as well. Let’s say we want to include an address in our user profile:

@dataclass
class Address:
    street: str
    city: str
    zip_code: str

@dataclass
class ExtendedUserProfile:
    username: str
    email: str
    age: int
    address: Address

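Polyfactory builds nested classes automatically, so a factory for ExtendedUserProfile needs no extra wiring for the Address field:

from polyfactory.factories import DataclassFactory

class ExtendedUserProfileFactory(DataclassFactory[ExtendedUserProfile]):
    __model__ = ExtendedUserProfile

profile = ExtendedUserProfileFactory.build()  # profile.address is an Address

This way, every generated profile comes with a fully populated address. By default the string fields hold random text rather than real street names; declaring those fields on a dedicated Address factory, as we did for username, is how you make them look realistic.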

Calculated Fields and Overrides

Let’s also talk about calculated fields. Sometimes, you might want to generate a field based on the values of other fields. For instance, if we want to derive the user’s year of birth from their age, we can add a calculated field like this:

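from datetime import date

# Add this property inside the ExtendedUserProfile class body
@property
def year_of_birth(self) -> int:
    return date.today().year - self.age  # approximate: ignores whether the birthday has passed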

By integrating this into our model, the derived value always stays consistent with the generated age. A property is not a dataclass field, so Polyfactory never generates it directly. To keep the ages themselves plausible, we can constrain the age field on the factory:

from random import randint

from polyfactory import Use
from polyfactory.factories import DataclassFactory

class ExtendedUserProfileFactory(DataclassFactory[ExtendedUserProfile]):
    __model__ = ExtendedUserProfile
    age = Use(randint, 18, 65)  # adult ages only

This approach ensures that our age values remain realistic, and the year of birth follows from them.

Conclusion: Moving Forward with Mock Data Pipelines

Designing mock data pipelines with Polyfactory comes down to a few building blocks: a factory class per model, field overrides for the values that matter, and nested factories for complex structures. The same pattern covers Python’s data classes and Pydantic models, and attrs-based classes are supported through AttrsFactory. From customization to validation and nested structures, Polyfactory earns a place in the developer’s toolkit. So, what are you waiting for? Start building your mock data pipelines today!

Sam Torres


Digital ethicist and technology critic. Believes in responsible AI development.
