Overview

This project analyzes hotel booking records to identify what drives cancellations and how the average daily rate (ADR) varies, and builds predictive models to support operational decisions such as overbooking and customer targeting.


Dataset & Prep


Exploratory Analysis (ADR / Pricing Patterns)

ADR summary statistics (before outlier removal):

ADR Distribution
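
As a rough illustration of the summary-statistics step, the sketch below describes ADR before and after filtering outliers. The column name `adr`, the toy values, and the 1.5 x IQR rule are all assumptions made for this example; the outlier method actually used in the project is not stated here.

```python
import pandas as pd

# Toy stand-in for the bookings table; `adr` is an assumed column name
# and the values are made up for illustration.
bookings = pd.DataFrame(
    {"adr": [60.0, 75.0, 80.0, 95.0, 110.0, 120.0, 135.0, 5400.0]}
)

# Summary statistics before any filtering.
summary_before = bookings["adr"].describe()

# Assumed outlier rule: keep values within 1.5 * IQR of the quartiles.
q1, q3 = bookings["adr"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = bookings["adr"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = bookings[in_range]

print(summary_before)
print(cleaned["adr"].describe())
```

Comparing the two `describe()` outputs shows how much a single extreme rate can distort the mean and standard deviation.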

Seasonality (ADR by month / year): Monthly ADR by Year

ADR differs by hotel type: Monthly ADR by Hotel Type
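
The monthly views above can be approximated with two pivot tables. This is a minimal sketch on made-up data; the column names (`hotel`, `arrival_year`, `arrival_month`, `adr`) and the hotel labels are assumptions, not the project's confirmed schema.

```python
import pandas as pd

# Hypothetical bookings; names and values are illustrative only.
bookings = pd.DataFrame(
    {
        "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
                  "Resort Hotel", "City Hotel", "Resort Hotel"],
        "arrival_year": [2016, 2016, 2016, 2016, 2017, 2017],
        "arrival_month": [1, 1, 8, 8, 8, 8],
        "adr": [80.0, 50.0, 120.0, 180.0, 130.0, 200.0],
    }
)

# Mean ADR per month, one column per year (seasonality view).
adr_by_month_year = bookings.pivot_table(
    index="arrival_month", columns="arrival_year", values="adr", aggfunc="mean"
)

# Mean ADR per month, one column per hotel type.
adr_by_month_hotel = bookings.pivot_table(
    index="arrival_month", columns="hotel", values="adr", aggfunc="mean"
)

print(adr_by_month_year)
print(adr_by_month_hotel)
```

Month and year combinations with no bookings appear as NaN, which line plots typically render as gaps.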

Correlation overview: Correlation Heatmap

ADR differs across customer types: ADR by Customer Type


Cancellation Drivers (EDA)

Cancellation rate increases with lead time: Cancellation by Lead Time

More special requests → lower cancellation probability: Cancellation by Special Requests
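
Both cancellation views reduce to a grouped mean of the cancellation flag. The sketch below assumes columns named `lead_time`, `special_requests`, and `is_canceled`, and uses arbitrary lead-time bins; the data is invented so that the output mirrors the direction of the findings, not their magnitude.

```python
import pandas as pd

# Hypothetical bookings; names, bins, and values are assumptions.
bookings = pd.DataFrame(
    {
        "lead_time": [3, 10, 45, 60, 120, 200, 250, 400],
        "special_requests": [2, 1, 1, 0, 0, 0, 1, 0],
        "is_canceled": [0, 0, 0, 1, 1, 1, 1, 1],
    }
)

# Bucket lead time (in days) into ranges; include_lowest keeps lead_time == 0.
lead_bins = pd.cut(
    bookings["lead_time"],
    bins=[0, 30, 90, 180, 365, float("inf")],
    labels=["0-30", "31-90", "91-180", "181-365", "365+"],
    include_lowest=True,
)

# The mean of a 0/1 flag is the cancellation rate within each group.
rate_by_lead = bookings.groupby(lead_bins, observed=True)["is_canceled"].mean()
rate_by_requests = bookings.groupby("special_requests")["is_canceled"].mean()

print(rate_by_lead)
print(rate_by_requests)
```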


Predictive Modeling: Cancellation Prediction

We trained multiple classifiers and evaluated on the test set:

Test Accuracy

ROC-AUC

ROC curve comparison: ROC Curve

Confusion matrices: Confusion Matrices
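
For readers who want to reproduce this kind of comparison, here is a minimal sketch using scikit-learn on synthetic data. It is not the project's training code: the dataset comes from `make_classification`, the hyperparameters are arbitrary, and only two model families are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the bookings features.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are placeholders, not tuned values.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)                # hard labels for accuracy
    proba = model.predict_proba(X_test)[:, 1]   # positive-class scores for ROC-AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }

for name, metrics in results.items():
    print(name, metrics["accuracy"], metrics["roc_auc"])
    print(metrics["confusion_matrix"])
```

Accuracy depends on the 0.5 decision threshold, while ROC-AUC is computed from the predicted probabilities, so the two metrics can rank models differently.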


Explainability (Top Features)

Across models, the strongest signals include:

Logistic (top features): Top Features - Logistic

Random Forest (top features): Top Features - RF
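
One common way to produce such rankings is to sort logistic regression coefficients by absolute value and to sort the random forest's impurity-based importances. The sketch below assumes that approach; it runs on synthetic data with placeholder feature names and may differ from how the project computed its rankings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize so logistic coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
logit = LogisticRegression(max_iter=1000).fit(X_scaled, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank by absolute coefficient (logistic) and impurity importance (forest).
logit_rank = np.argsort(-np.abs(logit.coef_[0]))
forest_rank = np.argsort(-forest.feature_importances_)

top_logit = [feature_names[i] for i in logit_rank[:3]]
top_forest = [feature_names[i] for i in forest_rank[:3]]

print("Logistic top features:", top_logit)
print("Random forest top features:", top_forest)
```

The two scores measure different things (linear effect size versus reduction in impurity), so overlap between the lists is informative but not guaranteed.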


Discussion

This project demonstrates that hotel cancellations are not random events, but are strongly driven by a small set of behavioral and operational signals.

Key Drivers of Cancellation Risk

Across multiple models (Logistic Regression, Random Forest, and XGBoost), we observed consistent agreement on the most predictive features:

These findings suggest that cancellation behavior is heavily influenced by booking commitment (deposit policy), historical customer behavior, and engagement indicators such as special requests.

Because the models agree on which features matter most, the results are easier to interpret and more credible as a basis for business decisions.

Model Performance Trade-offs

While Logistic Regression provides a strong interpretable baseline, ensemble models achieved substantially better predictive power:

This highlights the value of non-linear models in capturing complex interactions between booking features, especially across heterogeneous customer segments.

Pricing Patterns (ADR Insights)

In addition to cancellations, our exploratory analysis revealed clear seasonality and structural differences in ADR:

These insights can support revenue management and dynamic pricing decisions beyond cancellation prevention.


Conclusion

We developed an end-to-end analytics pipeline for discovering hotel booking patterns, combining:

Our best-performing model (Random Forest) achieved:

This enables hotels to identify high-risk bookings early, improve overbooking strategies, and optimize marketing and pricing policies.
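
To make the operational use concrete, the sketch below flags high-risk bookings and estimates expected cancellations from predicted probabilities. The booking IDs, the probabilities, and the 0.5 threshold are hypothetical; a real overbooking policy would also need to account for the cost of turning guests away.

```python
# Hypothetical model output: booking ID -> predicted cancellation probability.
predicted_cancel_prob = {
    "B001": 0.82,
    "B002": 0.10,
    "B003": 0.55,
    "B004": 0.05,
    "B005": 0.48,
}

# Assumed cutoff for flagging a booking as high risk.
RISK_THRESHOLD = 0.5

high_risk = sorted(
    booking_id
    for booking_id, prob in predicted_cancel_prob.items()
    if prob >= RISK_THRESHOLD
)

# If the probabilities are well calibrated, their sum is the expected
# number of cancellations among these bookings.
expected_cancellations = sum(predicted_cancel_prob.values())

print("High-risk bookings:", high_risk)
print("Expected cancellations:", round(expected_cancellations, 2))
```

The expected-cancellations figure is only meaningful when the model's probabilities are calibrated, which is worth checking before using it to size an overbooking buffer.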


Limitations & Future Work

Despite strong results, several limitations remain:

  1. Temporal feature constraints
    The dataset does not record how bookings change over time after they are made; such update histories could further improve cancellation forecasting.

  2. Model generalization
    Results may vary across regions or hotel chains; future work could validate robustness on newer datasets.

  3. Advanced modeling opportunities
    Deep learning approaches or sequential customer-behavior models may capture richer booking trajectories.

Future extensions include incorporating real-time booking changes and extending the pipeline to forecast ADR with regression models.


Business Value

Cover Image Credit: me, my friend Harper's dog Charlie