Forecasting Future Personnel Requirements Using Machine Learning

Modeltime Timetk Tidymodels echarts4r R

A practical demonstration of using machine learning to predict personnel requirements based on customer volumes.

Adam D McKinnon https://adam-d-mckinnon.com/
11-13-2021
Figure 1: Photo by [LYCS Architecture](https://unsplash.com/@lycs) on [Unsplash](https://unsplash.com/).

The goal is to know what’s coming…

Around this time last year I began immersing myself in Forecasting. My learning and professional experience since then suggest there are considerable opportunities to apply Forecasting methods to people-related decision-making. I see Forecasting informing decisions such as:

  • retail rostering, by forecasting sales volumes,
  • workloads for talent acquisition professionals, by forecasting requisition volumes,
  • job design, by forecasting volumes of service delivery staff,
  • Strategic Workforce Planning (SWP), by anticipating skill demand, and
  • many more (I suspect imagination may be the limiting factor)!

While the last article I wrote on Forecasting was fun, this article adopts a more pragmatic stance by providing a practical demonstration of forecasting customer contact volumes to manage staff rosters.

Business Context

The Brisbane City Council (Australia) resolves hundreds of customer contact enquiries from residents each day. In a bid to ensure ongoing customer centricity, the council offers multiple channels through which residents can make contact. Council staff are expected to resolve a certain number of customer contact enquiries each day, and the rate of resolution varies by channel, measured as follows (these rates are fictitious):

  1. Email: 175 / employee / day,
  2. Face to Face: 75 / employee / day,
  3. Mail: 75 / employee / day,
  4. Online (e.g., Twitter, LinkedIn, etc.): 150 / employee / day,
  5. Voice: 200 / employee / day, and
  6. Web (e.g., council website contact forms): 150 / employee / day.

Business Challenge

The council would like to roster just enough staff to resolve contact volumes, without “over rostering” and incurring unnecessary costs. To do this, we will train several Machine Learning models on historic rates of contact for each of the contact channels, and then forecast customer contact volumes by channel over the next month. With the forecasted volumes, we can then calculate the required number of staff based on the expected resolution rates (see above).
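To make the rostering arithmetic concrete up front, here is a minimal sketch of that final step, assuming a simple table of forecasted daily volumes per channel (the volumes below are hypothetical; the rates are the fictitious ones listed above):

library(dplyr)
library(tibble)

# fictitious resolution rates from above (enquiries / employee / day)
resolution_rates_tbl <- tibble::tibble(
    id   = c("email", "face2face", "mail", "online", "voice", "web"),
    rate = c(175, 75, 75, 150, 200, 150)
)

# hypothetical forecasted volumes for a single day
forecast_example_tbl <- tibble::tibble(
    id     = c("email", "face2face", "mail", "online", "voice", "web"),
    volume = c(520, 310, 150, 480, 1650, 610)
)

# staff required = forecasted volume / resolution rate, rounded up
forecast_example_tbl %>%
    dplyr::left_join(resolution_rates_tbl, by = "id") %>%
    dplyr::mutate(staff_required = ceiling(volume / rate))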

Workflow

The workflow for addressing this business challenge is as follows:

  1. Setup & Data

    • libraries
    • ingest the data
    • data wrangling / feature engineering & dataset creation
  2. Modeling

    • data preprocessing
    • create the ML models
    • evaluate model performance
    • determine the best model for each customer contact channel
  3. Forecasting & Rostering

    • generate future channel forecasts & determine staff requirements

1. Setup & Data

Libraries

We begin by bringing in the libraries required for the workflow. The most notable call-outs are the libraries related to the Modeltime ecosystem for time series analysis.

Show code
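The library calls are likely along these lines, inferred from the functions used throughout the workflow:

# core time series modelling ecosystem
library(tidymodels)    # recipes, parsnip, workflows
library(modeltime)     # (nested) time series modelling
library(timetk)        # feature engineering & time series plots

# data handling & presentation
library(tidyverse)     # dplyr, tidyr, purrr
library(readxl)        # read the Excel file
library(janitor)       # clean column names
library(tsibble)       # Australian public holiday data
library(reactable)     # interactive accuracy table
library(xaringanExtra) # panelsets for the channel tabs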

Ingest Data

The data comes from the following site. In preparation for this work I joined the files spanning January 2020 through to March 2021; this combined file is read in here, as are Queensland public holidays for 2020 and 2021 (from the tsibble package), since I felt public holidays may influence contact enquiry volumes. In addition, we define our forecast window (28 days), and then visualise the data for each of the customer contact channels to see what we are working with.

Show code
# ingest the customer contact time series
customer_contact_ts_tbl <- readxl::read_excel(path = "personnelforecasting_files/contact_customer_enquiries_jan2020_mar2021.xlsx") %>% 
                           janitor::clean_names() %>% 
                           dplyr::rename(id = channel, value = volume) 


# create a tibble of public holidays in QLD
qld_public_holidays_tbl <- tsibble::holiday_aus(year = 2020:2021, state = "QLD")

# define our forecast horizon
forecast_horizon <- 28


# plot the time series
graphs_tbl <- customer_contact_ts_tbl %>% 
    dplyr::group_by(id) %>% 
    tidyr::nest() %>% 
    dplyr::mutate(
        .plot = purrr::map(data, ~ timetk::plot_time_series(
                                                          .data = .x,
                                                          .date_var = date,
                                                          .value = value,
                                                          .smooth = TRUE))
                  ) %>% 
    dplyr::ungroup()


xaringanExtra::use_panelset() 

email

Show code
 graphs_tbl$.plot[[1]] 

face2face

Show code
 graphs_tbl$.plot[[2]] 

mail

Show code
 graphs_tbl$.plot[[3]] 

online

Show code
 graphs_tbl$.plot[[4]] 

voice

Show code
 graphs_tbl$.plot[[5]] 

web

Show code
 graphs_tbl$.plot[[6]] 

Data Wrangling & Feature Engineering

The process of creating the full dataset is comprehensive, so it is broken down into the following steps (each annotated in the code below):

  • pad the date range of each channel, filling missing days with zeros,
  • log transform the volumes (log1p) to stabilise variance,
  • extend each time series 28 days into the future for nested modelling,
  • join the Queensland public holiday data and convert it to a binary flag,
  • engineer Fourier terms, a 28-day lag, and rolling averages of that lag, and
  • nest the data by channel, separating actual from future data, and split the actual data into training and testing sets.

Show code
nested_data_tbl <- customer_contact_ts_tbl %>%

    # pad out the date range and add zeros where needed
    dplyr::group_by(id) %>%
    timetk::pad_by_time(
        .date_var = date,
        .by = "day",
        .pad_value = 0,
        .start_date = "2020-01-01",
        .end_date   = "2021-03-31"
    ) %>%
    
    dplyr::ungroup() %>%
    
    # log transform the data
    dplyr::mutate(value = log1p(value)) %>% 

    # extend the timeseries for nested modelling
    modeltime::extend_timeseries(
        .id_var        = id,
        .date_var      = date,
        .length_future = forecast_horizon
    ) %>%

    # add public holiday data - holidays may impact customer contact
    dplyr::left_join(qld_public_holidays_tbl, by = c("date" = "date")) %>%
    
    # change the public holidays data to a numeric variable
    dplyr::mutate(
        holiday = ifelse(is.na(holiday), 0, 1)
    ) %>%

    # add some new features based on the date and volume
    dplyr::group_by(id) %>%
    timetk::tk_augment_fourier(date, .periods = c(7, 14, 28)) %>%
    timetk::tk_augment_lags(.value = value, .lags = 28) %>%
    timetk::tk_augment_slidify(
        .value = value_lag28,
        .f = ~ mean(., na.rm = TRUE),
        .period = c(7, 14, 28, 28*2),
        .align = "center",
        .partial = TRUE
    ) %>%

    dplyr::filter(!is.na(value_lag28)) %>%
    dplyr::ungroup() %>%

    # nest the timeseries data, separating actual from future data
    modeltime::nest_timeseries(
        .id_var        = id,
        .length_future = forecast_horizon
    ) %>% 
    
    # split each channel's actual data into training and testing sets
    modeltime::split_nested_timeseries(
        .length_test = forecast_horizon
    )
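A quick sanity check of the nested structure can save debugging later, for example by peeking at the training split for the first channel (a sketch, assuming extract_nested_train_split()'s .row_id argument selects the channel):

# inspect the training split of the first channel
nested_data_tbl %>%
    modeltime::extract_nested_train_split(.row_id = 1) %>%
    dplyr::glimpse()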

2. Modeling

Data Preprocessing

The data pre-processing uses the Recipes package. The recipe creates a number of new features from the date variable, removes unwanted features among those created, normalises the numeric index and year features so they do not unduly influence the models, and one-hot encodes all nominal variables.

Show code
# define the recipe using the training data
recipe_spec <- recipes::recipe(value ~ ., data = extract_nested_train_split(nested_data_tbl)) %>%
    timetk::step_timeseries_signature(date) %>% # create date-based features
    recipes::step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>% # remove unwanted features
    recipes::step_normalize(date_index.num, date_year) %>% # normalises two numeric variables
    recipes::step_dummy(all_nominal(), one_hot = TRUE) # one hot encodes nominal variables
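To verify what the recipe actually produces, it can be prepped and baked against the training split; a quick check along these lines:

# preview the engineered features on the training data
recipe_spec %>%
    recipes::prep() %>%
    recipes::bake(new_data = NULL) %>%
    dplyr::glimpse()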

Create the Machine Learning Models

In this section we create seven models in total. Six will be used for forecasting (an XGBoost model, a Boosted Prophet time series model, a Random Forest model, a Multivariate Adaptive Regression Spline (MARS) model, a Neural Net model, and a Support Vector Machine model), while the seventh exists to show how much better our models are than “guessing”. This comparison involves generating what is called a “Naive” forecast, which simply carries the last known value forward. The accuracy metrics from the Naive forecast can then be compared with those from our more sophisticated models to validate their accuracy, and ultimately their utility. I started doing this earlier this year on the back of advice from a colleague: it is a simple litmus test that your models are good and, more importantly, an excellent way of illustrating their practical utility to non-technical audiences.

Show code
### Naive Model ----
wkfl_fit_naive <- workflows::workflow() %>%
    workflows::add_model(
        spec = modeltime::naive_reg() %>% 
            parsnip::set_engine("naive")
    ) %>%
    workflows::add_recipe(recipe_spec)


### XGBoost ----
wkfl_fit_xgboost <- workflows::workflow() %>%
    workflows::add_model(
        spec = parsnip::boost_tree(mode = "regression") %>% 
            parsnip::set_engine("xgboost")
    ) %>%
    workflows::add_recipe(recipe_spec %>% 
                              recipes::update_role(date, new_role = "indicator"))


### Prophet Boost ----
wkfl_fit_prophet_boost <- workflows::workflow() %>%
    workflows::add_model(
        spec = modeltime::prophet_boost(
            seasonality_daily  = FALSE,
            seasonality_weekly = FALSE,
            seasonality_yearly = FALSE
        ) %>%
            parsnip::set_engine("prophet_xgboost")
    ) %>%
    workflows::add_recipe(recipe_spec)


### Random Forest ----
wkfl_fit_rf <- workflows::workflow() %>%
    workflows::add_model(
        spec = parsnip::rand_forest(mode = "regression") %>% 
            parsnip::set_engine("ranger")
    ) %>%
    workflows::add_recipe(recipe_spec %>% 
                              recipes::update_role(date, new_role = "indicator"))


### MARS ----
wkfl_fit_mars <- workflows::workflow() %>%
    workflows::add_model(
        spec = parsnip::mars(mode = "regression") %>% 
            parsnip::set_engine("earth")
    ) %>%
    workflows::add_recipe(recipe_spec %>% 
                              recipes::update_role(date, new_role = "indicator"))


### Neural Net ----
wkfl_fit_nnet <- workflows::workflow() %>%
    workflows::add_model(
        spec = parsnip::mlp(mode = "regression") %>% 
            parsnip::set_engine("nnet")
    ) %>%
    workflows::add_recipe(recipe_spec %>% 
                              recipes::update_role(date, new_role = "indicator"))


### SVM ----
wkfl_fit_svm <- workflows::workflow() %>%
    workflows::add_model(
        spec = parsnip::svm_rbf(mode = "regression") %>% 
            parsnip::set_engine("kernlab")
    ) %>%
    workflows::add_recipe(recipe_spec %>% 
                              recipes::update_role(date, new_role = "indicator"))

Evaluate Model Performance

After adding the ML models to a Modeltime table, we assess their accuracy against the test dataset. Below is a table showing the six accuracy metrics that come standard with the Modeltime package. With the exception of rsq, the general rule of thumb is that smaller numbers are better. Keep in mind that we log transformed the customer contact volumes at the outset, which shrinks both their absolute values and the range of values in each time series; practically, this means the accuracy metrics are lower than they would be on the raw volumes. That said, these default models perform well even without tuning, which is further evident when comparing them to the Naive forecast: judging by rmse alone, our models are between three and five times more accurate than a Naive forecast for each customer contact channel (e.g., web, face2face, etc.).

Show code
# Modeltime Table
nested_modeltime_tbl <- modeltime::modeltime_nested_fit(
    # Nested data
    nested_data = nested_data_tbl,

    # Add workflows
    model_list = list(
        wkfl_fit_naive,
        wkfl_fit_xgboost,
        wkfl_fit_prophet_boost,
        wkfl_fit_rf,
        wkfl_fit_mars,
        wkfl_fit_nnet,
        wkfl_fit_svm
        )
)


nested_modeltime_tbl %>%
    modeltime::extract_nested_test_accuracy() %>%
    modeltime::table_modeltime_accuracy(
        .round_digits = 2,
        .show_sortable = TRUE,
        .interactive = TRUE,
        defaultPageSize = 7,
        bordered = TRUE,
        striped = TRUE,
        compact = TRUE,
        defaultColDef = reactable::colDef(width = 69),
        style = list(fontFamily = "Work Sans, sans-serif", fontSize = "13px"),
        columns = list(
            .model_id   = reactable::colDef(show = FALSE),
            .type = reactable::colDef(show = FALSE),
            .model_desc = reactable::colDef(width = 220)
                       )
    )
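The “three to five times” claim can be quantified directly from the accuracy table by dividing each channel’s Naive rmse by the rmse of its best non-Naive model; a sketch, assuming the Naive model is labelled “NAIVE” in the .model_desc column:

# ratio of Naive rmse to best model rmse, per channel
nested_modeltime_tbl %>%
    modeltime::extract_nested_test_accuracy() %>%
    dplyr::group_by(id) %>%
    dplyr::summarise(
        naive_rmse = rmse[stringr::str_detect(.model_desc, "NAIVE")],
        best_rmse  = min(rmse[!stringr::str_detect(.model_desc, "NAIVE")], na.rm = TRUE),
        ratio      = round(naive_rmse / best_rmse, 1)
    )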
Show code
test_forecasts_tbl <- nested_modeltime_tbl %>%
    modeltime::extract_nested_test_forecast() %>%
    dplyr::group_by(id) %>%
    tidyr::nest() %>%
    dplyr::mutate(
                .plot_forecast = purrr::map(data, ~modeltime::plot_modeltime_forecast(.data = .x,
                                                                                      .conf_interval_show = FALSE,
                                                                                      .legend_max_width = 15,
                                                                                      .plotly_slider = TRUE
                                                                                      ))
                ) %>%
    dplyr::ungroup()

xaringanExtra::use_panelset()

We can also visualise the forecasts from each model for each customer contact channel. The visualisations are interactive (zoom using the slider at the bottom, and click on the legend to add or remove models), letting us inspect how well the models perform relative to one another and to the actual data (i.e., the test dataset). This is another great way of showing business stakeholders how well our models perform, thereby building greater confidence in, and ideally use of, the model forecasts.

email

Show code
 test_forecasts_tbl$.plot_forecast[[1]] 

face2face

Show code
 test_forecasts_tbl$.plot_forecast[[2]] 

mail

Show code
 test_forecasts_tbl$.plot_forecast[[3]] 

online

Show code
 test_forecasts_tbl$.plot_forecast[[4]] 

voice

Show code
 test_forecasts_tbl$.plot_forecast[[5]] 

web

Show code
 test_forecasts_tbl$.plot_forecast[[6]]