NYC Airbnb Price Predictor
What makes an Airbnb listing expensive in New York City? Is it the borough, the room type, how close you are to Times Square, or something less obvious? This project tackles that question using the NYC Airbnb Open Data - a dataset of over 100,000 listings across all five boroughs.
The result is a two-model price intelligence system: a regression model that predicts the exact nightly price of a listing, and a classification model that predicts whether a listing falls into the Budget, Mid-Range, or Premium tier.
The Data
The dataset was sourced from Kaggle (arianazmoudeh/airbnbopendata) and required significant cleaning before it was usable. Starting from 102,599 raw rows, the pipeline handled dollar-sign string parsing, borough name typos, negative minimum nights, impossible availability values, duplicate listings, and ~15% missingness in review-related columns. The final cleaned dataset contains 68,803 listings and 19 features.
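The cleaning steps listed above could look roughly like the following. This is a minimal standard-library sketch, not the project's actual pipeline: the field names (`id`, `price`, `neighbourhood_group`, `minimum_nights`, `availability_365`), the typo map, and the choice to drop invalid rows rather than repair them are all assumptions made for illustration.

```python
def parse_price(raw):
    """Convert a string such as '$1,200 ' to 1200.0; return None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).replace("$", "").replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical typo map; extend it with whatever misspellings the raw data contains.
BOROUGH_FIXES = {"brookln": "Brooklyn", "manhatan": "Manhattan"}


def clean_rows(rows):
    """Return cleaned copies of `rows` (dicts), dropping invalid and duplicate listings."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        price = parse_price(row.get("price"))
        if price is None:
            continue  # missing or unparsable price
        nights = row.get("minimum_nights")
        if nights is None or nights < 1:
            continue  # negative or zero minimum nights
        availability = row.get("availability_365")
        if availability is None or not 0 <= availability <= 365:
            continue  # impossible availability value
        if row["id"] in seen_ids:
            continue  # duplicate listing
        seen_ids.add(row["id"])
        borough = row["neighbourhood_group"].strip()
        borough = BOROUGH_FIXES.get(borough.lower(), borough)
        cleaned.append({**row, "price": price, "neighbourhood_group": borough})
    return cleaned
```

Each rule is a separate early `continue`, which makes it easy to count how many rows each step removes when reconciling the raw and cleaned row totals.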
One notable finding during EDA: the price column is synthetically uniform between $50 and $1,200, which limits how well any model can predict it. This context is important when interpreting the R² scores reported in the regression results.
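To see why a uniform price column caps performance, consider a model that always predicts the mean price. For a uniform distribution on [a, b], the mean absolute error of that prediction is (b - a)/4 and its RMSE is (b - a)/sqrt(12). The sketch below evaluates both for the stated range; treating the column as exactly uniform on $50 to $1,200 is an idealization.

```python
import math

LOW, HIGH = 50.0, 1200.0  # price range reported in the EDA


def uniform_mean_predictor_errors(low, high):
    """MAE and RMSE of always predicting the mean of a uniform(low, high) target."""
    width = high - low
    mae = width / 4
    rmse = width / math.sqrt(12)
    return mae, rmse


mae, rmse = uniform_mean_predictor_errors(LOW, HIGH)
print(f"MAE  of mean predictor: ${mae:.2f}")
print(f"RMSE of mean predictor: ${rmse:.2f}")
```

Under that idealization the values come out at about $287.50 and $331.98, which is close to the MAE and RMSE reported for the baseline in the regression results. That is what an R² near zero looks like on such a column.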
Exploratory Data Analysis
Five research questions were explored to understand the structure of the NYC Airbnb market before modeling.
Question 1: How does price vary across NYC's five boroughs?
Question 2: Does room type significantly affect price?
Question 3: Is there a relationship between number of reviews and price?
Question 4: How does availability vary by borough and room type?
Question 5: Where are NYC Airbnb listings located geographically?
Key Findings
Question 1: Manhattan and Brooklyn account for the overwhelming majority of listings while the Bronx and Staten Island are niche markets with far fewer options. Price distributions across boroughs are surprisingly similar in shape, reflecting the synthetic nature of the price column.
Question 2: Entire home listings dominate the market and show the widest price spread. Hotel rooms, despite being rare, are priced comparably to entire apartments. Shared rooms make up a very small fraction of total supply.
Question 3: There is virtually no linear correlation between review count and price. Price is driven by structural factors like location and room type rather than by guest feedback or popularity.
Question 4: Availability distributions are bimodal across most boroughs - hosts are either almost always available or almost never available, with little in between. This suggests two distinct host behaviors: professional hosts and occasional hosts.
Question 5: The geographic map reveals the real shape of NYC's five boroughs. Entire home listings dominate Manhattan's core, while private rooms spread more evenly across Brooklyn and Queens, reflecting different traveler profiles.
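The "no linear correlation" statement in Question 3 refers to a Pearson coefficient near zero. As a reminder of what that measures, here is a small standard-library sketch. The sample values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: prices follow a U-shape over review counts.
reviews = [1, 2, 3, 4]
prices = [300, 100, 100, 300]
print(pearson_r(reviews, prices))
```

The made-up example has a clear pattern yet a coefficient of zero, which is a useful reminder that a near-zero Pearson value rules out a linear relationship only, not every relationship.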
Feature Engineering
Raw features alone explained almost none of the price variance. The following engineered features were created to give the models more signal:
- distance_from_times_square: straight-line distance in km from Times Square, derived from GPS coordinates
- is_manhattan: binary flag for Manhattan listings
- reviews_per_listing: total reviews divided by host listing count, a proxy for host engagement
- is_high_availability: binary flag for listings available more than 180 days per year
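The four engineered features can be sketched as follows. The input field names (`lat`, `long`, `neighbourhood_group`, `number_of_reviews`, `host_listings_count`, `availability_365`), the Times Square coordinates, and the zero-count guard are assumptions; the haversine (great-circle) formula is used as one plausible reading of "straight-line distance".

```python
import math

# Approximate Times Square coordinates (an assumption; the project's exact
# reference point is not stated).
TIMES_SQUARE = (40.7580, -73.9855)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def engineer_features(listing):
    """Return the four engineered features for one listing dict."""
    host_listings = listing["host_listings_count"]
    return {
        "distance_from_times_square": haversine_km(
            listing["lat"], listing["long"], *TIMES_SQUARE
        ),
        "is_manhattan": int(listing["neighbourhood_group"] == "Manhattan"),
        # Guard against a zero or missing host listing count (assumed handling).
        "reviews_per_listing": (
            listing["number_of_reviews"] / host_listings if host_listings else 0.0
        ),
        "is_high_availability": int(listing["availability_365"] > 180),
    }
```

At NYC's scale the difference between great-circle and flat-plane distance is negligible, so either would serve as the distance feature.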
Categorical features were encoded using Label Encoding for neighbourhood and One-Hot Encoding for room type, cancellation policy, and host identity verification.
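The two encodings behave quite differently, which the following standard-library sketch tries to make concrete. It mimics the idea of label and one-hot encoding rather than reproducing any particular library's encoder, and the sample values are illustrative only.

```python
def label_encode(values):
    """Map each distinct value to an integer in sorted order; return (codes, mapping)."""
    mapping = {value: index for index, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values, prefix):
    """Return one dict per value holding a 0/1 indicator for every category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Illustrative sample values, not rows from the dataset.
neighbourhoods = ["Harlem", "Astoria", "Harlem", "Bushwick"]
room_types = ["Private room", "Entire home/apt", "Private room", "Shared room"]

codes, mapping = label_encode(neighbourhoods)
print(codes, mapping)
print(one_hot_encode(room_types, "room_type")[0])
```

Label encoding keeps a high-cardinality column such as neighbourhood in a single feature, but the integers are arbitrary: tree-based models can split on them, while linear models will read them as an ordering.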
Clustering
KMeans clustering (k=10) was applied to listing coordinates, price, distance from Times Square, availability, and minimum nights. The choice of k and the resulting clusters were checked using the elbow method, silhouette scores, a geographic scatter plot, and a 2D PCA projection. Clusters align closely with NYC borough and neighbourhood boundaries, which indicates that the algorithm found real geographic structure in the data rather than random groupings. The cluster ID was then added as a new feature for both the regression and classification models. Because price is one of the clustering inputs, the cluster ID carries some information about the target, which is worth keeping in mind when reading the model scores below.
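To illustrate the mechanics behind the clustering step and the elbow check, here is a plain-Python version of Lloyd's algorithm, which is the iteration KMeans is based on. It is a teaching sketch on toy 2D points, not the library implementation the project presumably used; real inputs would also need to be standardized first, since coordinates and prices sit on very different scales.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))  # the "elbow" is where inertia stops dropping sharply
```

The cluster label returned for each point is what would be appended to the feature table as the cluster ID.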
Regression - Predict Nightly Price
Three models were trained and compared against a raw-feature baseline:
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline (no engineering) | $288.02 | $332.79 | ~0.00 |
| Linear Regression | $288.03 | $332.83 | ~0.00 |
| XGBoost | $210.40 | $258.48 | 0.397 |
| Random Forest (winner) | $210.24 | $258.25 | 0.398 |
Random Forest edged out XGBoost by a very narrow margin ($0.16 in MAE and 0.001 in R²), so the two are effectively tied. Linear Regression, despite using the full engineered feature set, stayed at near-zero R², essentially identical to the baseline. This suggests that whatever relationship exists between listing characteristics and nightly price is non-linear: tree-based models can pick it up, while a linear model cannot, no matter how many features it is given.
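For reference, the three metrics in the table above can be computed as in the sketch below. It is a standard-library illustration of the definitions, with made-up numbers rather than the project's predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up prices and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
print(regression_metrics(y_true, y_pred))

# Always predicting the mean of the true values gives R^2 = 0.
print(regression_metrics(y_true, [200.0, 200.0, 200.0]))
```

The second call shows why an R² near zero means "no better than predicting the average price", which is where the baseline and Linear Regression rows sit.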
Classification - Predict Price Tier
Price was binned into three tiers and three classifiers were compared:
| Model | Accuracy |
|---|---|
| Logistic Regression | 37% |
| KNN (k=5) | 44% |
| Random Forest (winner) | 53% |
All three models outperform random guessing (33% on balanced classes). Random Forest again wins, consistent with the regression results.
Price tiers:
- Budget: $0 to $400
- Mid-Range: $400 to $800
- Premium: $800 and above
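The binning can be sketched as below. The tier list does not say which tier a price of exactly $400 or $800 belongs to, so treating each lower bound as inclusive is an assumption. The second helper estimates how large each tier would be if prices were exactly uniform on $50 to $1,200, which is an idealization of the EDA finding.

```python
import math

# (lower bound, upper bound, tier name); lower bounds are treated as inclusive (assumed).
TIER_EDGES = [
    (0, 400, "Budget"),
    (400, 800, "Mid-Range"),
    (800, math.inf, "Premium"),
]


def price_tier(price):
    """Return the tier name for a nightly price."""
    for low, high, name in TIER_EDGES:
        if low <= price < high:
            return name
    raise ValueError(f"price must be a non-negative number, got {price!r}")


def uniform_tier_shares(low=50.0, high=1200.0):
    """Share of listings per tier if prices were exactly uniform on [low, high]."""
    width = high - low
    shares = {}
    for lower, upper, name in TIER_EDGES:
        overlap = max(0.0, min(high, upper) - max(low, lower))
        shares[name] = overlap / width
    return shares


print(price_tier(120), price_tier(400), price_tier(950))
print({name: round(share, 3) for name, share in uniform_tier_shares().items()})
```

Under that idealization the tiers hold roughly 30%, 35%, and 35% of listings, so always guessing the largest tier would score about 35%. That is a slightly higher bar than the 33% figure for perfectly balanced classes.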
How to Use
import pickle

# Regression - predict exact nightly price
with open('random_forest_airbnb.pkl', 'rb') as f:
    reg_model = pickle.load(f)
with open('scaler_airbnb.pkl', 'rb') as f:
    reg_scaler = pickle.load(f)

# your_features: one listing's feature values, in the same order used during training
price = reg_model.predict(reg_scaler.transform([your_features]))
print(f"Predicted price: ${price[0]:.2f}")

# Classification - predict price tier
with open('classification_winner.pkl', 'rb') as f:
    clf_model = pickle.load(f)
with open('scaler_clf.pkl', 'rb') as f:
    clf_scaler = pickle.load(f)

tier = clf_model.predict(clf_scaler.transform([your_features]))
print(f"Predicted tier: {tier[0]}")






