Advanced Predictive Analytics for Managing Overnight Shelter Capacity

Unifying BigQuery, DBT, H2O and APIs for Social Good

Lucas O
4 min read · Nov 8, 2023

Motivation

My motivation for this project comes from a desire to bridge the gap between academic exercises and real-world applications in data science. While my previous projects relied on static data in CSV files, I recognize the need to evolve beyond that static framework. Moving forward, my goal is to build and maintain projects using data pipelines that can handle the ebb and flow of real-world data that changes and updates over time. This project is a step towards that ambition.

Also note that this article is an excerpt. For a detailed write-up, check out this link to my personal portfolio website.

Introduction

In the dynamic urban landscape of Toronto, overnight shelters represent more than just a temporary haven for the homeless. These facilities are a vital component of the city’s social support network, providing safety and stability for individuals and families facing the harsh realities of urban life. As an essential service, these shelters embody the collective effort to uphold the welfare of all citizens, ensuring that the city’s commitment to caring for its most vulnerable members is not just a promise, but a reality.

Problem Statement

The challenge lies in effectively forecasting the occupancy rates of these shelters. Accurate predictions are not merely statistical triumphs; they are an important tool in the complex machinery of social support, determining the operational efficacy and resource optimization of these critical infrastructures. The implications of such forecasts extend beyond the confines of the shelters themselves, influencing policy-making, urban planning, and emergency response initiatives. The margin for error is narrow: underestimate, and we risk leaving individuals or families exposed to the elements; overestimate, and we divert precious resources that could be utilized more effectively elsewhere. The quest for a reliable prediction model is not just a technical endeavor; it is a moral imperative to ensure that the well-being of those in need is safeguarded.

Solution Strategy

The proposed solution leverages machine learning to analyze historical shelter occupancy data and generate accurate forecasts of future overnight shelter occupancy. The approach is to train machine learning models that identify patterns and trends in past data, considering factors such as location, day of the week, month of the year, and weather conditions. By feeding the models historical data, they learn to anticipate future demand with a considerable degree of accuracy. This predictive capability is enhanced by incorporating advanced algorithms and ensemble methods that can adapt to the changing dynamics of urban life. The end goal is a robust model that helps shelter administrators anticipate nightly occupancy rates and manage their operations with unprecedented foresight and efficiency.

Solution Implementation

Project Workflow

Data Collection and Management

  • Utilize the Toronto Open Data API to collect historical occupancy data from multiple overnight shelters (a minimal collection sketch follows this list).
  • Collect historical daily weather data from the AccuWeather API to augment the dataset for modeling.
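For illustration, the sketch below pulls the shelter data with the opendatatoronto package, the standard R client for the portal's CKAN API. The search term and the choice of resource are assumptions for illustration, not necessarily the exact dataset identifiers used in this project.

```r
# Minimal sketch: pull shelter occupancy data from Toronto Open Data.
# The search term and resource selection below are illustrative assumptions.
library(opendatatoronto)
library(dplyr)

# Locate the shelter occupancy dataset on the portal
shelter_pkg <- search_packages("Daily Shelter Overnight Service Occupancy")

# Download the first CSV resource attached to that package
occupancy_raw <- shelter_pkg |>
  head(1) |>
  list_package_resources() |>
  filter(format == "CSV") |>
  head(1) |>
  get_resource()

# Historical daily weather is collected separately from the AccuWeather API
# via authenticated HTTP requests (e.g. with httr) and joined on date.
```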

Data Storage (BigQuery)

Store the collected data securely in Google BigQuery, ensuring the database scales with the inflow of new data and is optimized for complex analytical queries.
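As a rough sketch of this step, the snippet below appends a fresh pull to a BigQuery table using the bigrquery package; the project, dataset, table, and credential file names are placeholders, not the ones used in this pipeline.

```r
# Minimal sketch: append newly collected data to BigQuery with bigrquery.
# Project, dataset, table, and key file names are placeholders.
library(bigrquery)

bq_auth(path = "service-account-key.json")  # authenticate with a service account

occupancy_tbl <- bq_table("my-gcp-project", "shelter_data", "raw_occupancy")

bq_table_upload(
  occupancy_tbl,
  values = occupancy_raw,               # data frame from the collection step
  write_disposition = "WRITE_APPEND"    # keep history as new data arrives
)
```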

Data Transformations (DBT)

Utilize DBT (Data Build Tool) to perform and manage necessary data transformations, ensuring that the data is clean, structured, and ready for model consumption.

Example of DBT Lineage Graph

For details on the various DBT models used in this project, check out the detailed write-up here.

Machine Learning & Model Development (H2O AutoML)

  • Utilize H2O AutoML, an automated machine learning platform, to build and train predictive models.
  • Evaluate the various models and their performance, tuning hyperparameters and selecting the best-performing model for deployment (a minimal sketch follows this list).
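A stripped-down version of that step with the h2o R package might look like the following; the target column, predictor names, and AutoML settings are illustrative assumptions rather than the project's exact configuration.

```r
# Minimal sketch: train candidate models with H2O AutoML.
# Column names and settings are illustrative assumptions.
library(h2o)
h2o.init()

# `training_data` is assumed to be the modeling table produced by the DBT layer
train_hex <- as.h2o(training_data)

aml <- h2o.automl(
  y = "occupancy_rate",                                  # assumed target column
  x = c("location_id", "day_of_week", "month",
        "mean_temperature", "total_precipitation"),      # assumed predictors
  training_frame = train_hex,
  max_models = 20,
  seed = 42
)

h2o.get_leaderboard(aml)     # compare candidate models
best_model <- aml@leader     # best performer, kept for deployment
```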

Deployment (Shiny)

  • Deploy the trained machine learning model via a Shiny app (a stripped-down sketch follows this list).
  • Implement robust app features that allow for real-time updates and the monitoring of model performance over time.
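To give a sense of the shape of the app, here is a heavily simplified sketch; the inputs, the shelter_locations vector, and the prediction logic are placeholders standing in for the real app's components, and best_model refers to the H2O leader from the previous step.

```r
# Heavily simplified sketch of the Shiny app; inputs and prediction
# logic are placeholders, not the deployed app's actual components.
library(shiny)
library(h2o)

# `shelter_locations` is an assumed character vector of shelter names;
# assumes h2o.init() has been run and `best_model` is available.
ui <- fluidPage(
  titlePanel("Shelter Occupancy Forecast"),
  selectInput("location", "Shelter location", choices = shelter_locations),
  dateInput("date", "Forecast date"),
  tableOutput("forecast")
)

server <- function(input, output, session) {
  output$forecast <- renderTable({
    new_data <- as.h2o(data.frame(
      location_id = input$location,
      day_of_week = weekdays(input$date),
      month       = months(input$date)
    ))
    as.data.frame(h2o.predict(best_model, new_data))
  })
}

shinyApp(ui, server)
```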

Note: For this first phase, I focused on only a handful of locations (about 40). I hope to scale the solution to ALL locations in the next phase.

Shiny App

You can interact with the Shiny app here.

Reproducible Code

  • Link to R project code repo.
  • Link to DBT project code repo.

If you find the code repositories beneficial, please consider giving them a star.

Next Steps

There are still a few things left to complete to enhance this project:

  • Accuracy Tracking: Include statistical measures like MAE and RMSE of predicted values vs. actuals (computed as in the sketch after this list).
  • Scale modeling and predictions to all locations.
  • Better model/experiment tracking with MLflow.
  • Enhance the Shiny application with the use of modules.
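For reference, those two accuracy measures are straightforward to compute once predictions are joined back to observed occupancy; the scored data frame and its column names below are assumed for illustration.

```r
# Planned accuracy tracking: MAE and RMSE of predictions vs. actuals.
# `scored` is an assumed data frame with one row per location-date.
mae  <- mean(abs(scored$actual_occupancy - scored$predicted_occupancy))
rmse <- sqrt(mean((scored$actual_occupancy - scored$predicted_occupancy)^2))
```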

Finally, the current process for creating predictions and updating the BigQuery database depends on scheduled scripts running on my local machine; I aim to transition this process to a cloud platform.

Written by Lucas O

Analytics professional, passionate about using data to solve business problems. Interested in Marketing Analytics, AB Testing and Causal Inference.
