Having previously worked on local population forecasts for a variety of reasons and appreciating the importance of this type of effort for planning and decision making, we developed a Machine Learning (ML) model for predicting future populations based on historic census information. The predictions of this model are available to Seer subscribers as an explorable dataset. The ultimate test of any ML model is its performance on new data, and we look forward to following up with an analysis of our accuracy in predicting the 2021 census populations.
The Census is upon us
The art of data-driven leadership is in extracting maximal information from the available data to make optimal decisions for the future. Our work at Seer is about helping communities to better understand themselves, and to access and make best use of the information they need to tell stories and make their case for funding and change.
Census information is foundational to those efforts, and our most recent census is now close to 5 years old. Communities are relying on their population estimates from 2016 to plan services for their young parents, their students, their working age and elderly populations of 2021.
As 2021 is an Australian census year, Australians will be asked once again to fill out and return a census form designed to capture a data snapshot of the Australian population. The imminence of the 2021 census means that the predictions made by demographers (and data scientists) about how our communities evolve between censuses will soon have hard figures against which they can be compared.
Machine Learning for population forecasting
Population forecasting is a long-studied objective within the discipline of demography. The forces of birth, death, ageing and migration are simple concepts that interact in complex ways to shape the evolution of a population. Traditional methods for population forecasting involve estimating the effects of these forces through empirical modelling.
Machine Learning (ML) has become the new default approach to many tasks involving forecasting and prediction, and the nascent ML discipline has been among the most rapidly evolving fields in both academia and industry. We thought it would be fun to develop an ML model that would predict future populations based on historic census information alone. Rather than construct an empirical model with specific inputs for the forces that drive population change, and a sensible a priori formula for assembling them into a population, an ML model would learn about birth, death, ageing and migration (or something like them) from historic population data.
We used historic population counts by age and sex from the census years of 2001, 2006, 2011, and 2016. The model input, X, contains population counts from two consecutive census years (X0 and X1), and the target output, Y, is the population in the next census year. From the available data, we construct two historic training examples (X0, X1, Y) for each location: (2001, 2006, 2011) and (2006, 2011, 2016).
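As a rough illustration of how training examples like these might be assembled, the sketch below slides a window of three consecutive census years over a table of counts. The function name, the array layout (one row per area, one column per age/sex bin) and the synthetic data are all assumptions made for the example, not a description of our actual pipeline.

```python
import numpy as np

CENSUS_YEARS = [2001, 2006, 2011, 2016]

def build_training_examples(counts_by_year):
    """counts_by_year maps census year -> array of shape (n_areas, n_bins)."""
    inputs, targets = [], []
    # Slide a window of three consecutive census years over the data.
    for y0, y1, y2 in zip(CENSUS_YEARS, CENSUS_YEARS[1:], CENSUS_YEARS[2:]):
        # X: the two earlier censuses side by side; Y: the census that follows.
        inputs.append(np.concatenate([counts_by_year[y0], counts_by_year[y1]], axis=1))
        targets.append(counts_by_year[y2])
    return np.vstack(inputs), np.vstack(targets)

# Synthetic stand-in data (not census figures): 3 areas, 4 age/sex bins.
rng = np.random.default_rng(0)
counts = {year: rng.integers(0, 500, size=(3, 4)).astype(float) for year in CENSUS_YEARS}

X, Y = build_training_examples(counts)
print(X.shape, Y.shape)  # (6, 8) (6, 4)
```

With four census years there are two windows, so each area contributes two rows, which is why the number of training records is roughly twice the number of areas.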
The location type chosen was the Statistical Area Level 2 (SA2), which typically represents a resident population of between 5,000 and 20,000 people. Smaller area types (e.g. SA1) exhibited a high degree of unpredictable variability from census to census. Larger area types (e.g. SA3) resulted in too few training examples. Even so, using SA2 areas yields only around 4,000 records.
Our model architecture is illustrated in Figure 1, and the rationale is as follows.
Population counts are numeric values that cannot be negative, so the model should perform regression.
The input and output will be 1-dimensional vectors of population counts by age and sex. A densely connected feed-forward architecture is a natural fit. The constraint that predictions cannot be negative lends itself nicely to the use of Rectified Linear Unit (ReLU) activations.
A simple form of this model would be a single layer applying a linear transformation to the input. Each output would be a linear combination of the inputs with learned coefficients that encode the way future population counts tend to depend on historic population counts.
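To make the single-layer idea concrete, here is a minimal NumPy sketch of one dense layer followed by a ReLU. It assumes the ReLU is applied directly to the layer output and that the two input censuses are stored side by side. The weights are random and untrained, and all names and sizes are illustrative only.

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: clips negative values to zero.
    return np.maximum(z, 0.0)

def single_layer_predict(X, W, b):
    # One dense layer: a linear combination of the inputs, then ReLU.
    return relu(X @ W + b)

n_out = 4          # age/sex bins per census (toy size)
n_in = 2 * n_out   # two consecutive censuses side by side

rng = np.random.default_rng(1)
X = rng.integers(0, 500, size=(5, n_in)).astype(float)  # synthetic counts

# Randomly initialised (untrained) coefficients, for illustration only.
W = rng.normal(scale=0.1, size=(n_in, n_out))
b = np.zeros(n_out)
pred = single_layer_predict(X, W, b)

# Special case: coefficients that simply copy the most recent census.
W_copy = np.vstack([np.zeros((n_out, n_out)), np.eye(n_out)])
persistence = single_layer_predict(X, W_copy, b)
```

The second set of coefficients shows that "the future looks like the present" is itself one point in the space of models this layer can represent; training searches for coefficients that do better than that.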
The small size of the dataset means that over-fitting is a risk with even simple model architectures. We experimented with a variety of model architectures involving combinations of dense layers and additive residual ‘skip’ connections, halting model training when model performance on a hold-out test dataset stopped improving. We found that the simple model involving a single layer exhibited the best test set performance compared with more complex model architectures. Finally, we trained 10 of these simple models in parallel, and averaged their individual predictions.
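The two ideas in this paragraph, halting training when hold-out performance stops improving and averaging several independently trained models, can be sketched as follows. This is a simplified stand-in rather than our training code: it fits a plain linear map (no ReLU) with gradient descent on synthetic, standardised data, and the learning rate, patience and function names are assumptions chosen for the example.

```python
import numpy as np

def train_with_early_stopping(X_tr, Y_tr, X_hold, Y_hold, seed,
                              lr=0.01, patience=5, max_epochs=500):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X_tr.shape[1], Y_tr.shape[1]))
    best_W, best_loss, stale = W.copy(), np.inf, 0
    for _ in range(max_epochs):
        # Gradient of the squared error, averaged over training examples.
        grad = 2.0 * X_tr.T @ (X_tr @ W - Y_tr) / len(X_tr)
        W -= lr * grad
        hold_loss = np.mean((X_hold @ W - Y_hold) ** 2)
        if hold_loss < best_loss:
            best_W, best_loss, stale = W.copy(), hold_loss, 0
        else:
            stale += 1
            if stale >= patience:  # hold-out loss has stopped improving
                break
    return best_W

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Y = X @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(200, 3))
X_tr, Y_tr, X_hold, Y_hold = X[:150], Y[:150], X[150:], Y[150:]

# Train 10 models from different random starts and average their predictions.
weights = [train_with_early_stopping(X_tr, Y_tr, X_hold, Y_hold, seed=s) for s in range(10)]
member_preds = np.stack([X_hold @ W for W in weights])
ensemble_pred = member_preds.mean(axis=0)

member_mse = [np.mean((p - Y_hold) ** 2) for p in member_preds]
ensemble_mse = np.mean((ensemble_pred - Y_hold) ** 2)
```

Because squared error is convex, the MSE of the averaged prediction can never exceed the average MSE of the individual members, which is one reason averaging is a cheap safeguard on small datasets.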
Each model outputs population counts by age (0 to 79, then ‘80-85’ and ‘85+’) and sex (Male, Female); we call these the ‘raw’ outputs. From these raw outputs, totals are constructed for comparison with the totals available in the census information; we call these the ‘expanded’ outputs.
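A sketch of how expanded outputs might be derived from raw outputs is shown below. The ordering of the raw vector (all male bins, then all female bins) and the particular totals computed are assumptions for illustration; the text above does not specify which totals were used.

```python
import numpy as np

N_AGE_BINS = 82  # single years 0..79 plus the two oldest age groups

def expand_outputs(raw):
    """raw: array of shape (n_areas, 2 * N_AGE_BINS), assumed males then females."""
    male, female = raw[:, :N_AGE_BINS], raw[:, N_AGE_BINS:]
    persons_by_age = male + female
    return {
        "male_total": male.sum(axis=1),
        "female_total": female.sum(axis=1),
        "persons_by_age": persons_by_age,
        "persons_total": persons_by_age.sum(axis=1),
    }

# Toy input: two areas with one person in every age/sex bin.
raw = np.ones((2, 2 * N_AGE_BINS))
expanded = expand_outputs(raw)
print(expanded["persons_total"])  # [164. 164.]
```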
ML models are trained to minimise ‘loss’ but deciding what ‘good enough’ looks like is tricky – classification models are more readily evaluated, and performance benchmarks more easily set and understood. We selected a relatively simple performance benchmark for our model.
The input to the model consists of information from two previous census years. The tendency for future populations to resemble historic populations (over small periods of time) suggests that these present and historic populations can serve as simple benchmark predictions – our model predictions should be better than simply using the present population as the prediction of the future population. Call this our benchmark or ‘dummy’ model.
The dummy model gives a Mean Square Error (MSE) that is more than twice the MSE of our real model predictions for the raw model output, and more than four times the MSE of our real model predictions for the expanded outputs.
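The comparison can be sketched in a few lines. The numbers below are toy values invented purely to show the calculation, not results from our model, and the assumption that the most recent census occupies the last columns of the input is ours for the example.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def dummy_predict(X, n_out):
    # Persistence benchmark: the latest census (last n_out columns) is the forecast.
    return X[:, -n_out:]

# Toy example: 2 areas, 2 bins; each row of X is [earlier census, latest census].
X = np.array([[100.0, 50.0, 110.0, 55.0],
              [200.0, 80.0, 190.0, 78.0]])
Y = np.array([[120.0, 60.0],
              [180.0, 76.0]])
model_pred = np.array([[118.0, 59.0],
                       [182.0, 77.0]])  # hypothetical model output

dummy_mse = mse(dummy_predict(X, 2), Y)
model_mse = mse(model_pred, Y)
print(dummy_mse, model_mse)  # 57.25 2.5
```

A ratio of dummy MSE to model MSE above 1 means the model is adding information beyond "no change"; the same comparison can be run on the raw outputs or on the expanded totals.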
We can get an intuitive understanding of the performance of a model like this by inspecting some example predictions, presented in Figure 2.
The model demonstrates an ability to forecast age-specific population growth and decline. In some cases the prediction resembles a simple linear extrapolation of the historic populations, particularly for younger age ranges. This often occurs in areas with large young adult populations (20-40 years), where the peak does not appear to shift over time, indicating that this population is expected to be replenished by migration.
In other cases, particularly in older age ranges, the model predictions suggest a population ageing in place. This is characterised by a rightward step of the population mass of approximately 5 years, matching the interval between censuses.
Where to from here
The ultimate test of any ML model is its performance on new data, and the newest data is just about to be collected (the 2021 Census). Having published the predictions of our model for 2021, we will be waiting excitedly for the opportunity to compare our predictions with the true 2021 census counts when they become available.
Seer Plus subscribers can explore the Seer Population Forecasts on the platform here.
If you don’t yet have a Seer login, create your free subscription here.
Contact us if you would like to discuss how our Population Forecast model could help your organisation.
Co-founder & Chief Data Scientist
Feel free to email me with any questions