Practising people analytics methods can be difficult in the absence of data. Unfortunately, privacy concerns associated with employee data can make accessing datasets difficult. As a consequence, it can be useful for analysts, both novice and seasoned, to be able to generate synthetic datatsets for testing and explaining analytical methods.
The current article focuses on the generation of synthetic data using R. We will examine the generation of the following two datasets:
- A basic random dataset; and
- Creating multiple relationships between variables.
We have approached the above steps in a “recursive” fashion. By this we mean that we have developed functions that you can use “as-is”, or modify, to create synthetic datasets in your own context.
1. A basic random dataset
When the dataset requirements are simple we can use R’s inbuilt random generators. This can be done in one of two ways:
- sampling from a statistical distribution using the - rnormfunction (e.g., for a random value from a Normal distribution). This may be appropriate when creating a continuous variable such as age, tenure, or wage.
- sampling from a defined list of options and probabilities with - sample. This is more appropriate when creating nominal variables in the dataset such as gender, job family, or location.
Using a combination of these functions, the following code can be used to generate synthetic data that approximates real world data.
Code
# libraries
library(dplyr)
library(purrr)
library(lubridate)
library(ggplot2)
# create a function that will generate 7 variables with fixed responses
generate_basic_data <- function(num_rows){
  
    
  sample_replace <- function(x, prob = NULL) {
    base::sample(x = x, prob = prob, size = num_rows, replace = T)
  }
tibble::tibble(
    id              = 1:num_rows,
    age             = rnorm(num_rows, mean = 40, sd = 5),
    hire_date       = sample_replace(seq.Date(from = dmy("01/01/1990"), to = today(), by = "1 day")),
    job_family      = sample_replace(c("Engineering", "Sales", "Administration"), prob = c(0.6,0.25, 0.15)),
    contract_type   = sample_replace(c("Full Time", "Part Time"), prob = c(0.9, 0.1)),
    employment_type = sample_replace(c("Permanent", "Contract"), prob = c(0.7,0.3)),
    state           = sample_replace(c("VIC","NSW","QLD"), prob = c(0.5, 0.3, 0.2))
  )
}
# generate 10 rows of data
generate_basic_data(num_rows = 10)# A tibble: 10 × 7
      id   age hire_date  job_family     contract_type employment_type state
   <int> <dbl> <date>     <chr>          <chr>         <chr>           <chr>
 1     1  35.2 2022-08-13 Engineering    Full Time     Permanent       NSW  
 2     2  43.4 2023-03-16 Administration Full Time     Contract        QLD  
 3     3  36.9 2010-10-26 Administration Full Time     Contract        VIC  
 4     4  35.9 2019-07-21 Sales          Full Time     Permanent       QLD  
 5     5  33.9 1999-08-26 Administration Full Time     Permanent       QLD  
 6     6  35.4 2003-02-11 Engineering    Full Time     Contract        VIC  
 7     7  41.2 2010-08-24 Engineering    Part Time     Permanent       VIC  
 8     8  43.5 2015-03-19 Engineering    Full Time     Permanent       VIC  
 9     9  37.8 2006-06-28 Engineering    Full Time     Permanent       VIC  
10    10  38.7 2006-05-07 Engineering    Full Time     Permanent       QLD  Explanation
This R code above defines a function named generate_basic_data that creates a dataset with seven variables. You specify the number of rows you want in the dataset by passing a value to the num_rows parameter when you call the function.
Here’s a step-by-step explanation:
a. Function Definition
The line generate_basic_data <- function(num_rows){ … } creates a function named generate_basic_data. When you want to use this function later (see the last line of code in the block), you’ll tell it how many rows you want by providing a number for num_rows.
b. Custom Sample Function
Inside the function, there’s a nested helper function named sample_replace. This helper function makes it easier to sample values repeatedly from a given list, with the possibility of repeating values.
c. Creating the Dataset
The tibble (i.e., Tidy terminology for a data frame) function creates a table of data (similar to an Excel sheet with rows and columns). Each line within this function is defining a column for our table:
- id: Just a sequential number from 1 to the number of rows you’ve asked for. Think of this as a unique identifier for each row.
- age: Generates random ages that are normally distributed with an average age of 40 and a standard deviation of 5. This means most ages will be close to 40, but there will be some variation.
- hire_date: Randomly selects dates between January 1, 1990, and today. Since- sample_replaceis used, some dates might be repeated in different rows.
- job_family: Randomly selects a job family from the choices “Engineering”, “Sales”, and “Administration”. There’s a 60% chance of picking “Engineering”, 25% chance for “Sales”, and 15% for “Administration”.
- contract_type: Randomly selects contract type status. “Full Time” will be picked 90% of the time, while “Part Time” will be picked 10% of the time.
- employment_type: Randomly selects the employment type. “Permanent” jobs will appear 70% of the time, while “Contract” jobs will appear 30% of the time.
- state: Randomly selects a state from “VIC”, “NSW”, and “QLD” with respective probabilities of 50%, 30%, and 20%.
Once you run the function, for example generate_basic_data(100), it will generate a dataset with 100 rows, filled with the kinds of values described above. In essence, this function is useful for generating a synthetic dataset about employees, including information about their ages, hiring dates, job families, employment types, etc.
2. Creating relationships between variables
If you’re trying to create data that has similar properties to a real dataset, you may want to begin by performing an exploratory analysis to understand what relationships are present in the existing data and decide which are relevant for your synthetic data (i.e., those you want to replicate). The previous code can be updated to take these differences into account. A separate data generation function will need to be created for each variable relationship.
1. Simple two variable relationship
Say the main office for our example company was in the Australian state of Victoria. We would expect the administrative staff to be more likely to work there, as opposed to being randomly distributed across the Australian states. To make this work we will need to change how we randomly select a state to be dependent on the job family. The following code below brings this relationship to life.
Code
sample_state <- function(job_family){
  
    # Different State probabilities depending on job family
    prob_for_family <- list(
                        "Engineering"    = c(0.5, 0.3, 0.2),
                        "Sales"          = c(0.5, 0.3, 0.2),
                        "Administration" = c(0.8, 0.1, 0.1)
                            )
    # Randomly select the state using the per-job probabilities
    purrr::map_chr(job_family, ~sample(c("VIC","NSW","QLD"), size = 1, prob = prob_for_family[[.]])
    )
}This code defines a function named sample_state that randomly samples (selects) a state (“VIC”, “NSW”, or “QLD”) based on the given job family. Different job families have different probabilities of being associated with a particular state. The probabilities are provided in the same order as the states.
Let’s break down the function:
sample_state <- function(job_family){ … } defines a function that takes a single argument job_family (e.g., “Engineering”). Inside the function, a list named prob_for_family is created. This list provides the probabilities of each state being selected for each job family. Engineering and Sales both have the same probabilities, while Administration has a much higher probability of being in Victoria (“VIC”).
The map_chr function does all the work. Essentially, for each job_family provided it samples a state based on the respective probabilities provided in the prob_for_family list, and the state is generated.
2. Complex multi-variable relationships
The approach above can be extended as the interactions become more complex. For example, we could imagine a scenario in which employment type is dependent on both the hire date and the job family. More specifically, short-term contracts could be more likely to be attributable to the following two conditions:
- Sales staff; or 
- Less tenured hires–those with less than 5 years tenure. 
The following code documents how these relationships are achieved when generating synthetic data.
Code
sample_employment <- function(hired, job) {
    # Probability map of Permanent/Contract employment for combinations of 
    # new/old hires and job family
    prob_map <- list(
      new_hire = list(
        "Engineering" = c(0.7, 0.3),
        "Sales" = c(0.4, 0.6),
        "Administration" = c(0.7, 0.3)
      ),
      old_hire = list(
        "Engineering" = c(0.9, 0.1),
        "Sales" = c(0.7, 0.3),
        "Administration" = c(0.9, 0.1)
      )
    )
 
    hire_status = ifelse(
      lubridate::time_length(hired %--% today(), "years") < 5, 
      "new_hire", 
      "old_hire"
    )
    
    purrr::map2_chr(
      hire_status, job, 
      ~sample(c("Permanent", "Contract"), size = 1, prob = prob_map[[.x]][[.y]])
    )
  }Let’s explain this code in simple terms.
a. Setting up probabilities
First, it creates a set of rules (the prob_map list). This set of rules gives different chances of having a “Permanent” or “Contract” job based on the job role and whether the person was newly hired (less than 5 years ago) or has been with the company longer.
For example, according to the rules:
- A new engineer has a 70% chance of being permanent and 30% chance of being on contract. 
- An engineer who’s been hired for over 5 years has a 90% chance of being permanent. 
b. Determining Hire Status
For each person, the code checks how long ago they were hired using the lubridate::time_length function. If it’s less than 5 years, they are labeled as “new_hire”. Otherwise, they’re an “old_hire”.
c. Assigning Employment Type
Based on the hire status and job role of each person, the code uses the rules set up in the first step to take a guess at whether they have a “Permanent” or “Contract” job. This is done using the purrr::map2_chr function, which applies the sample function to assign an employment type based on the given probabilities for each combination of hire status and job role.
So, by using this sample_employment function and providing a list of hiring dates and job roles, you’d get a list of assignments of whether each person has a “Permanent” or “Contract” job based on their job role and how long they’ve been hired.
3. Pulling it all together
To bring it all together, we can now swap out how we randomly selected state and employment type in the first basic example with our two functions above that randomly generate data based on the relationship with other variables. These other variables are generated earlier (e.g. hire_date & job_family), so that they can be used to inform the generation of employment_type.
The following code produces a randomised dataset with our desired relationships that can then be saved and used later for trialing and/or learning analytics methods.
Code
generate_complex_data <- function(num_rows){
    sample_replace <- function(x, prob = NULL){
        sample(x = x, prob = prob, size = num_rows, replace = T)
        }
tibble::tibble(
    id              = 1:num_rows,
    age             = rnorm(num_rows, mean = 40, sd = 5),
    hire_date       = sample_replace(seq.Date(from = dmy("01/01/1990"), to = today(), by = "1 day")),
    job_family      = sample_replace(c("Engineering", "Sales", "Administration"), prob = c(0.6,0.25, 0.15)),
    contract_type   = sample_replace(c("Full Time", "Part Time"), prob = c(0.9, 0.1)),
    employment_type = sample_employment(hire_date, job_family),
    state           = sample_state(job_family) 
  )
}
# call the generate_complex_data function to create a dataset of 10 rows
generate_complex_data(num_rows = 10)# A tibble: 10 × 7
      id   age hire_date  job_family     contract_type employment_type state
   <int> <dbl> <date>     <chr>          <chr>         <chr>           <chr>
 1     1  43.8 2008-09-27 Engineering    Full Time     Permanent       NSW  
 2     2  33.6 2010-06-04 Engineering    Full Time     Permanent       VIC  
 3     3  35.0 1990-11-27 Engineering    Full Time     Permanent       NSW  
 4     4  34.0 1991-07-25 Sales          Full Time     Permanent       VIC  
 5     5  43.8 2022-10-12 Administration Part Time     Contract        VIC  
 6     6  35.3 2021-11-24 Engineering    Full Time     Permanent       VIC  
 7     7  44.5 2016-02-27 Engineering    Full Time     Permanent       NSW  
 8     8  42.2 2013-06-23 Engineering    Full Time     Permanent       VIC  
 9     9  39.3 2021-06-18 Sales          Part Time     Contract        NSW  
10    10  41.8 2002-08-05 Engineering    Full Time     Permanent       VIC  Conclusion
The above approach works well when there are a smaller number of variables and relationships to account for. It is capable of creating synthetic datasets with a greater number of variables, howver, at the risk of greater complexity and difficulty to maintain.
Once your needs for synthetic data have outgrown the approach above, there are alternatives that may be worth exploring. The fabricatr package provides a method to generate synthetic datasets with quite sophisticated interactions between variables. Or if you have a specific real dataset whose characteristics you are looking to re-create, the synthpop package may be appropriate.
Irrespective, we hope the example above is useful in helping you develop realistic synthetic data that facilitates greater learning and experimentation in the field of People Analytics.
Reuse
Citation
@online{pearce2023,
  author = {Stephen Pearce and Adam D McKinnon},
  title = {Creating {Synthetic} {People} {Analytics} {Data}},
  date = {2023-08-10},
  url = {https://www.adam-d-mckinnon.com//posts/2023-08-10-synthetic_data},
  langid = {en}
}
