Predicting Hourly Bike Rentals
Many major cities worldwide are embracing self-service bike sharing programs for several compelling reasons:
- Promoting Eco-Friendly Transportation: These programs encourage people to explore alternative means of transportation, leading to reduced auto traffic and lower levels of pollution.
- Boosting Tourism: Bike sharing initiatives attract tourists, providing them with a convenient and eco-friendly way to explore the city’s attractions.
- Enhancing Physical Health: By encouraging physical activity, bike sharing programs contribute to improved public health and well-being.
Undoubtedly, such bike sharing programs offer numerous benefits. Therefore, it is crucial for cities to ensure the availability of these bikes at the right locations and times.
In this project, my goal is to develop a predictive model for hourly bike rental demand using a dataset sourced from the UCI Machine Learning Repository. While I previously worked with this dataset at the beginning of my data science learning journey and faced challenges, I have gained valuable experience since then. Now, I am determined to revisit this project and improve on my previous attempts.
The dataset is accompanied by a research paper by Sathishkumar V. E. and Yongyun Cho, which I found insightful. My aim is to achieve accuracy metrics similar to or better than those presented in the paper. Though my initial attempts fell short, I am ready to take on this challenge once again with renewed knowledge and determination. Let’s go!
NOTE: As I mentioned earlier, my goal is to obtain accuracy metrics close to or better than the metrics in the research paper. I do not tie the outcome of this analysis to any real-world scenario. In the real world, the goal of building machine learning models goes beyond good accuracy metrics.
Let’s start with understanding the data.
Data Understanding / Exploratory Data Analysis
This dataset has 14 features and 8,760 observations. See the feature descriptions below:
- Date: year-month-day
- Rented Bike Count: count of bikes rented at each hour (this is our outcome feature)
- Hour: hour of the day
- Temperature: Celsius
- Humidity: %
- Windspeed: m/s
- Visibility: units of 10 m
- Dew point temperature: Celsius
- Solar radiation: MJ/m²
- Rainfall: mm
- Snowfall: cm
- Seasons: Winter, Spring, Summer, Autumn
- Holiday: Holiday / No holiday
- Functional Day: NoFunc (non-functional hours), Fun (functional hours)
The feature functional_day has two unique values, Yes and No. No means bike rentals were closed, so rented_count for those hours is always 0. I excluded these rows (functional_day = No), as I am only interested in making predictions for hours when bike rentals are open. A sketch of this filter appears below. The plots in the following sections help give a sense of direction for feature selection and feature engineering prior to modeling.
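As a minimal sketch (assuming the raw data has been loaded into a placeholder data frame `bike_raw` with cleaned, snake_case column names), the filter looks like this:

```r
library(dplyr)

# Keep only hours when the rental system was operating;
# non-functional hours always have rented_count == 0.
bike_df <- bike_raw %>%
  filter(functional_day == "Yes")
```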
1. Distribution of Bike Rentals Overall
Observation: The hourly count of bike rentals is highly right-skewed, with the majority of values in the 0 to 2000 range. We see very few values above 3000.
2. Distribution of Bike Rentals by Season and Hour of Day
Observation:
- Throughout the day, distinct peak hours are evident, particularly between 4 pm and 8 pm, as well as between 7 am and 9 am. The overall peak hour is 6 pm.
- Bike rentals surge during the summer, autumn, and spring seasons, which together account for the majority of rentals. Understandably, the winter season records the lowest number of bike rentals.
Overall, these patterns in hourly and seasonal rentals indicate that these features could hold good predictive power. We could also explore creating a “peak hour influence” feature to better capture the additional pattern seen at peak hours, as sketched below.
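As a rough illustration of that idea (the cut-points come from the peaks observed above; `bike_df` and the derived column name are placeholders, not a final feature definition):

```r
library(dplyr)

# Flag the morning (7-9 am) and evening (4-8 pm) peaks seen in the EDA.
bike_df <- bike_df %>%
  mutate(peak_hour = case_when(
    hour %in% 7:9   ~ "am_peak",
    hour %in% 16:20 ~ "pm_peak",
    TRUE            ~ "off_peak"
  ))
```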
3. Distribution of Bike Rentals by Season and Day of Week
Observation: During the Autumn season, Wednesdays and Saturdays show the highest average bike rentals, while in the Summer season, Wednesdays and Fridays record the highest average rentals. These day-specific patterns across seasons highlight the importance of including day-of-week features in the model. By incorporating categorical variables representing weekdays, such as “Wednesday,” “Friday,” and “Saturday,” the model can capture day-specific rental patterns, with these variables acting as indicators for days with higher rental demand.
4. Bike Rentals vs Temperature
Observation: The scatter plot above shows a clear increase in the number of bike rentals as the seasons change from winter to summer. Additionally, we notice the number of bike rentals increasing as the temperature increases, further highlighting how weather and seasonality affect bike rentals.
5. Correlation of Features
Observation: The correlation heatmap reveals a strong positive correlation of 0.56 between rented_count and temp, which is logical as more people are outdoors in warmer weather. Furthermore, rented_count also shows a reasonably good positive correlation with hour and dew_point. For linear models, high correlations between predictors (e.g., dew_point and temp with a correlation of 0.91) can be problematic, but tree-based models help mitigate this concern.
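For reference, the correlation matrix behind a heatmap like this can be computed on the numeric columns alone; a minimal sketch, assuming the placeholder data frame `bike_df` and the cleaned column names used in this write-up:

```r
library(dplyr)

# Pairwise Pearson correlations between the numeric features.
cor_mat <- bike_df %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

round(cor_mat["rented_count", ], 2)  # correlations with the outcome
```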
Feature Engineering
From the EDA above, we saw evident variation in rented_count by hour of day, day of week, season, etc. Since the dataset already includes features for hour of day and season, the main focus of feature engineering in this project was to create additional time-series features from the date column, such as day of week, day of month, day of year, week of the month, week of the year, and quarter of the year. Additionally, I included a feature to capture whether the hour of day was am or pm. A sketch of this step appears below.
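A hedged sketch of this preprocessing as a recipes specification (`bike_df`, `rec`, and the column names are placeholders; my actual code lives in the repo linked at the end):

```r
library(tidymodels)

# Derive calendar features from the date column, plus an am/pm flag from
# the hour column. step_date() covers day of week, day of year, week of
# year, month, and quarter; the rest come from step_mutate().
rec <- recipe(rented_count ~ ., data = bike_df) %>%
  step_date(date, features = c("dow", "doy", "week", "month", "quarter")) %>%
  step_mutate(
    day_of_month = lubridate::mday(date),
    am_pm        = factor(ifelse(hour < 12, "am", "pm"))
  ) %>%
  step_rm(date) %>%
  step_dummy(all_nominal_predictors())
```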
Modeling Approach
As mentioned earlier, I chose tree-based models for their simplicity and effectiveness in capturing non-linear patterns in the data. These models partition the predictor space into regions and make predictions based on the mean or mode of the response within each region. For this project, I used Random Forest, Xgboost, and Cubist models.
The standard modeling process included the following steps:
- Data Splitting — Creating training and test sets.
- 10-Fold Cross-Validation — Ensuring robust model evaluation.
- Preprocessing Pipelines — Preparing the data for modeling.
- Model Specifications — Defining the tree-based model setups.
- Hyper-Parameter Tuning (2 rounds) — Optimizing model performance.
- Metrics Evaluation — Assessing model accuracy and performance.
- Refitting — Refitting the best model on the full training set and evaluating it on the held-out test set.
Overall, the approach focused on simplicity, interpretability, and accurate predictions using tree-based models.
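A condensed sketch of that pipeline in Tidymodels, reusing the recipe `rec` from the feature engineering section (all object names are placeholders, and the Xgboost specification stands in for the analogous Random Forest and Cubist setups):

```r
library(tidymodels)

set.seed(123)
split <- initial_split(bike_df, prop = 0.8)  # train/test split
train <- training(split)
folds <- vfold_cv(train, v = 10)             # 10-fold cross-validation

# Tunable Xgboost specification.
xgb_spec <- boost_tree(trees = tune(), min_n = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(xgb_spec)

# Round 1: tune over an automatically generated grid, scored on the folds.
xgb_res <- tune_grid(
  xgb_wf,
  resamples = folds,
  grid      = 25,
  metrics   = metric_set(rmse, rsq, mae)
)
```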
Hyper-Parameter Tuning Round 1
After the first round of tuning, the Xgboost and Cubist models performed better than the Random Forest (in terms of R-squared and RMSE). Recall that my goal was to get as close as possible to the test accuracy metrics in the research paper. See below:
In the paper, the Cubist model performed best on both the training and test sets, with an RMSE of 70.76 on the training set and 139.64 on the test set. In comparison, below is a table of the training-set metrics I obtained after my first round of tuning:
We can see that our models do not perform as well on the training set.
Hyper-Parameter Tuning Round 2
After round 1 of tuning, I noticed additional opportunities to improve accuracy metrics for the Xgboost model by limiting the trees, learn_rate, and min_n parameters to specific ranges that correspond to lower RMSE:
The plot above shows the hyper-parameters for the Xgboost model plotted against accuracy metrics. Notice a slight reduction in RMSE for the parameters in the following ranges:
- trees: lower RMSE in the 1000–2000 range.
- learn_rate: lower RMSE in the -2.0 to -1.0 range (on the log10 scale).
- min_n: lower RMSE in the 20–30 range.
The other parameters do not really show any ranges with a clear reduction in RMSE.
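That narrowed search can be expressed by updating the dials parameter ranges before re-tuning. Below is a sketch under the same placeholder names as above; note that dials treats learn_rate on the log10 scale, so -2.0 to -1.0 corresponds to roughly 0.01 to 0.1:

```r
library(tidymodels)

# Restrict the Xgboost search space to the promising ranges from round 1.
xgb_params <- extract_parameter_set_dials(xgb_wf) %>%
  update(
    trees      = trees(c(1000, 2000)),
    learn_rate = learn_rate(c(-2, -1)),  # log10 scale: ~0.01 to 0.1
    min_n      = min_n(c(20, 30))
  )

# Round 2: same resamples and metrics, narrower search space.
xgb_res2 <- tune_grid(
  xgb_wf,
  resamples  = folds,
  grid       = 25,
  param_info = xgb_params,
  metrics    = metric_set(rmse, rsq, mae)
)
```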
I also followed the same strategy for the Cubist model, limiting the committees and neighbors parameters to specific ranges.
Both models showed improved accuracy metrics (on training data) after the second round of tuning, with Xgboost showing a much larger improvement (percent improvement shown in green) than the Cubist model. However, compared with the MAE of 40.59 and RMSE of 70.76 achieved by the Cubist model in the research paper (on training data), our models don’t quite match up.
Evaluating Performance on Test Set
On test data, however, Xgboost performed best, with an MAE of 71.72 and an RMSE of 123.52. These metrics are slightly better than the test-set metrics in the research paper, which reported an MAE of 78.45 and an RMSE of 139.64.
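With the placeholder objects from the earlier sketches, the final fit and test-set evaluation might look like this:

```r
library(tidymodels)

# Finalize the workflow with the best round-2 parameters, fit it on the
# full training set, and evaluate once on the held-out test set.
best_xgb  <- select_best(xgb_res2, metric = "rmse")
final_wf  <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- last_fit(final_wf, split, metrics = metric_set(rmse, rsq, mae))

collect_metrics(final_fit)  # test-set RMSE, R-squared, and MAE
```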
Variable Importance
Variable Importance (VIP) helps us understand the relative influence of different predictors (features) on the outcome or target feature. VIP aims to answer questions like:
- Which features have the most significant impact on the target variable?
- How does the inclusion or exclusion of a particular feature affect the model’s performance?
- Which features can be used to simplify the model while retaining its predictive power?
Looking at the VIP plot for the Xgboost model below, we can see that temp and hour are the most important features for predicting rented_count.
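A plot like this can be produced with the vip package from the fitted workflow; a minimal sketch using the placeholder `final_fit` from the test-set evaluation above:

```r
library(tidymodels)
library(vip)

# Pull the fitted parsnip model out of the final workflow and rank
# features by their contribution to the Xgboost ensemble.
final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```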
Conclusion
While this project was all about chasing accuracy, real-world use cases of machine learning involve much more than just accuracy metrics. Additionally, by engineering more features and trying additional tuning strategies, I’m sure these metrics can be improved further, though I will save that for another day. Finally, note that at the time of this analysis, I was focused on using the Tidymodels framework. Better results could probably be achieved in less time with more automated ML tools like H2O and PyCaret.
Reproducible Code
Source code for this project can be found in my GitHub repo. See section 7 in that file for additional code to refit the Xgboost model on the full dataset and make predictions.