NYC Airbnb Price Predictor

What makes an Airbnb listing expensive in New York City? Is it the borough, the room type, how close you are to Times Square, or something less obvious? This project tackles that question using the NYC Airbnb Open Data - a dataset of over 100,000 listings across all five boroughs.

The result is a two-model price intelligence system: a regression model that predicts the exact nightly price of a listing, and a classification model that predicts whether a listing falls into the Budget, Mid-Range, or Premium tier.

The Data

The dataset was sourced from Kaggle (arianazmoudeh/airbnbopendata) and required significant cleaning before it was usable. Starting from 102,599 raw rows, the pipeline handled dollar-sign string parsing, borough name typos, negative minimum nights, impossible availability values, duplicate listings, and ~15% missingness in review-related columns. The final cleaned dataset contains 68,803 listings and 19 features.
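
A few of these cleaning steps can be sketched in plain Python. This is an illustrative sketch, not the project's actual pipeline: the column names (`id`, `price`, `borough`, `minimum_nights`, `availability_365`), the typo map, the choice to keep the magnitude of a negative minimum-nights value, and the choice to clip availability to 0-365 are all assumptions.

```python
import re

# Hypothetical typo map; the misspellings listed here are illustrative.
BOROUGH_FIXES = {"brookln": "Brooklyn", "manhatan": "Manhattan"}


def parse_price(raw):
    """Turn a string such as '$1,200 ' into 1200.0; return None if unparseable."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        return None


def clean_rows(rows):
    """Clean an iterable of listing dicts, dropping duplicates and unpriced rows."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        price = parse_price(row.get("price"))
        if price is None or row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        borough = str(row["borough"]).strip()
        cleaned.append({
            "id": row["id"],
            "price": price,
            "borough": BOROUGH_FIXES.get(borough.lower(), borough),
            # Assumption: a negative value is a sign error, so keep its magnitude.
            "minimum_nights": abs(int(row["minimum_nights"])),
            # Assumption: availability is clipped to the 0-365 day range.
            "availability_365": min(max(int(row["availability_365"]), 0), 365),
        })
    return cleaned


raw_rows = [
    {"id": 1, "price": "$1,200 ", "borough": "brookln",
     "minimum_nights": -3, "availability_365": 400},
    {"id": 1, "price": "$1,200 ", "borough": "brookln",
     "minimum_nights": -3, "availability_365": 400},  # duplicate listing
    {"id": 2, "price": None, "borough": "Queens",
     "minimum_nights": 2, "availability_365": 10},    # missing price
]
print(clean_rows(raw_rows))
```

Only the first row survives here: the second is a duplicate ID and the third has no price.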

One notable finding during EDA: the price column is synthetic and approximately uniformly distributed between $50 and $1,200, which limits how well any model can predict it. This context is important when interpreting the R² scores reported below.

Exploratory Data Analysis

Five research questions were explored to understand the structure of the NYC Airbnb market before modeling.

Question 1: How does price vary across NYC's five boroughs?

Question 2: Does room type significantly affect price?

Question 3: Is there a relationship between number of reviews and price?

Question 4: How does availability vary by borough and room type?

Question 5: Where are NYC Airbnb listings located geographically?

Key Findings

Question 1: Manhattan and Brooklyn account for the overwhelming majority of listings while the Bronx and Staten Island are niche markets with far fewer options. Price distributions across boroughs are surprisingly similar in shape, reflecting the synthetic nature of the price column.

Question 2: Entire home listings dominate the market and show the widest price spread. Hotel rooms, despite being rare, are priced comparably to entire apartments. Shared rooms make up a very small fraction of total supply.

Question 3: There is virtually no linear correlation between review count and price. Whatever price signal exists in this dataset is more likely to come from structural factors such as location and room type than from guest feedback or popularity.

Question 4: Availability distributions are bimodal across most boroughs - hosts are either almost always available or almost never available, with little in between. This suggests two distinct host behaviors: professional hosts and occasional hosts.

Question 5: Plotting listings by latitude and longitude traces the outline of NYC's five boroughs. Entire home listings dominate Manhattan's core, while private rooms spread more evenly across Brooklyn and Queens, reflecting different traveler profiles.

Feature Engineering

Raw features alone explained almost none of the price variance. The following engineered features were created to give the models more signal:

distance_from_times_square - straight-line distance in km from Times Square, derived from GPS coordinates
is_manhattan - binary flag for Manhattan listings
reviews_per_listing - total reviews divided by host listing count, a proxy for host engagement
is_high_availability - binary flag for listings available more than 180 days per year
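
The four engineered features above can be sketched as follows. This is a minimal illustration under stated assumptions: the input keys (`lat`, `long`, `borough`, `number_of_reviews`, `host_listings_count`, `availability_365`) are hypothetical column names, the Times Square coordinates are approximate, and the haversine (great-circle) formula stands in for "straight-line distance", which is a close match at city scale.

```python
import math

# Approximate Times Square coordinates (assumed reference point).
TIMES_SQUARE = (40.7580, -73.9855)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def engineer_features(listing):
    """Return the four engineered features for one listing dict."""
    host_listings = listing["host_listings_count"]
    return {
        "distance_from_times_square": haversine_km(
            listing["lat"], listing["long"], *TIMES_SQUARE
        ),
        "is_manhattan": int(listing["borough"] == "Manhattan"),
        # Guard against division by zero when the host listing count is 0 or missing.
        "reviews_per_listing": (
            listing["number_of_reviews"] / host_listings if host_listings else 0.0
        ),
        "is_high_availability": int(listing["availability_365"] > 180),
    }


# Hypothetical listing a few blocks south of Times Square.
example = {
    "lat": 40.7484, "long": -73.9857, "borough": "Manhattan",
    "number_of_reviews": 20, "host_listings_count": 4, "availability_365": 200,
}
print(engineer_features(example))
```

For the example listing, the distance comes out at roughly one kilometre, and both binary flags are set.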

Categorical features were encoded using Label Encoding for neighbourhood and One-Hot Encoding for room type, cancellation policy, and host identity verification.
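
To illustrate the difference between the two encodings, here is a small plain-Python sketch. The helper names `label_encode` and `one_hot_encode` and the sample values are hypothetical; in practice a library encoder would likely be used instead.

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values, prefix):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [int(v == cat) for v in values] for cat in categories
    }


# Illustrative sample values, not taken from the dataset.
neighbourhoods = ["Harlem", "Astoria", "Harlem", "Williamsburg"]
room_types = ["Private room", "Entire home/apt", "Private room", "Shared room"]

codes, mapping = label_encode(neighbourhoods)
print(codes, mapping)
print(one_hot_encode(room_types, "room_type"))
```

Label encoding keeps a high-cardinality column such as neighbourhood to a single feature, at the cost of imposing an arbitrary numeric order. One-hot encoding avoids that order but adds one column per category, which is manageable for low-cardinality columns such as room type.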

Clustering

KMeans clustering (k=10) was applied to listing coordinates, price, distance from Times Square, availability, and minimum nights. The choice of k was checked with the elbow method and silhouette scores, and the resulting clusters were inspected with a geographic scatter plot and a 2D PCA projection. Clusters align closely with NYC borough and neighbourhood boundaries, suggesting that the algorithm found real geographic structure in the data rather than arbitrary groupings. The cluster ID was then added as a new feature for both the regression and classification models.
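
The idea of clustering listings and appending the cluster ID as a feature can be sketched with a from-scratch version of Lloyd's k-means algorithm. This is a toy illustration, not the project's implementation: it uses k=2 on six made-up rows of (latitude, longitude, price, availability), a deterministic farthest-first initialisation instead of random starts, and z-score scaling before clustering, all of which are assumptions.

```python
import statistics


def standardize(rows):
    """Z-score each column so that no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid divide-by-zero
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-first initialisation; returns (labels, centroids)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(statistics.fmean(col) for col in zip(*members))
    return labels, centroids


# Made-up rows: (latitude, longitude, price, availability_365).
rows = [
    (40.76, -73.98, 900.0, 300),
    (40.75, -73.99, 950.0, 320),
    (40.77, -73.97, 880.0, 310),
    (40.65, -73.80, 120.0, 20),
    (40.64, -73.79, 150.0, 30),
    (40.66, -73.81, 100.0, 10),
]
labels, _ = kmeans(standardize(rows), k=2)
rows_with_cluster = [row + (label,) for row, label in zip(rows, labels)]
print(rows_with_cluster)
```

One design point worth noting: when price is among the clustering inputs, the cluster ID carries some information about price, which matters if that ID is later used as a feature to predict price.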

Regression - Predict Nightly Price

Three models were trained and compared against a raw-feature baseline:

Model	MAE	RMSE	R²
Baseline (no engineering)	$288.02	$332.79	~0.00
Linear Regression	$288.03	$332.83	~0.00
XGBoost	$210.40	$258.48	0.397
Random Forest (winner)	$210.24	$258.25	0.398

Random Forest won by a narrow margin over XGBoost. Linear Regression, despite using the full engineered feature set, scored a near-zero R², essentially identical to the baseline. This suggests that whatever relationship exists between the features and nightly price is non-linear: tree-based models can pick it up, while a linear model cannot, no matter how many features it is given. Because the cluster ID feature comes from a KMeans fit that included price, part of the tree models' gain may come from price information carried by that feature.
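
The baseline figures in the table are close to what a constant prediction would score on a uniform price column. For prices uniform on [a, b], always predicting the mean gives MAE = (b - a) / 4 and RMSE = (b - a) / sqrt(12); with a = 50 and b = 1200 that is about $287.50 and $331.98, and R² = 0. The sketch below checks this numerically, using an evenly spaced grid as a stand-in for the real price column (an assumption) and hand-written metric functions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Evenly spaced stand-in for a price column that is uniform on [50, 1200].
low, high, n = 50.0, 1200.0, 10_000
prices = [low + (high - low) * (i + 0.5) / n for i in range(n)]

# A model with no usable signal can do no better than predicting the mean.
mean_price = sum(prices) / n
constant_pred = [mean_price] * n

print(f"MAE  = {mae(prices, constant_pred):.2f}")
print(f"RMSE = {rmse(prices, constant_pred):.2f}")
print(f"R2   = {r2(prices, constant_pred):.3f}")
```

Read this way, the baseline and Linear Regression rows are behaving like a mean predictor, and the tree models' MAE of about $210 is the improvement over that floor.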

Classification - Predict Price Tier

Price was binned into three tiers and three classifiers were compared:

Model	Accuracy
Logistic Regression	37%
KNN (k=5)	44%
Random Forest (winner)	53%

All three models outperform random guessing (about 33% for three roughly balanced classes). Random Forest again wins, consistent with the regression results.

Price tiers:

Budget: $0 to $400
Mid-Range: $400 to $800
Premium: $800 and above
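
The tier assignment can be sketched as a simple binning function. The tier list does not say which tier a price of exactly $400 or $800 belongs to, so this sketch assumes the higher tier. It also works out the class shares implied by a price column uniform on $50 to $1,200, which is an idealisation of the data.

```python
from bisect import bisect_right

TIER_EDGES = [400, 800]  # tier boundaries in dollars, taken from the list above
TIER_NAMES = ["Budget", "Mid-Range", "Premium"]


def price_tier(price):
    """Assign a tier; a price exactly on an edge goes to the higher tier (assumption)."""
    return TIER_NAMES[bisect_right(TIER_EDGES, price)]


# Expected share of each tier if prices were exactly uniform on [50, 1200].
low, high = 50, 1200
bounds = [low] + TIER_EDGES + [high]
shares = {
    name: (hi - lo) / (high - low)
    for name, lo, hi in zip(TIER_NAMES, bounds, bounds[1:])
}

for price in (120, 400, 799.99, 950):
    print(price, price_tier(price))
print({name: round(share, 3) for name, share in shares.items()})
```

Under that idealisation the shares are about 30.4% Budget, 34.8% Mid-Range, and 34.8% Premium, so always guessing the largest tier would score roughly 35%, slightly above the 33% figure for uniform random guessing.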

How to Use

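import pickle

# Only unpickle files from a source you trust: loading a pickle can run arbitrary code.
# your_features is a placeholder (not defined here): one listing's feature values,
# in the same order and encoding that the models were trained on.

# Regression - predict exact nightly price
with open('random_forest_airbnb.pkl', 'rb') as f:
    reg_model = pickle.load(f)
with open('scaler_airbnb.pkl', 'rb') as f:
    reg_scaler = pickle.load(f)

# transform() expects a 2D input, hence the outer list around a single listing
price = reg_model.predict(reg_scaler.transform([your_features]))
print(f"Predicted price: ${price[0]:.2f}")

# Classification - predict price tier
with open('classification_winner.pkl', 'rb') as f:
    clf_model = pickle.load(f)
with open('scaler_clf.pkl', 'rb') as f:
    clf_scaler = pickle.load(f)

tier = clf_model.predict(clf_scaler.transform([your_features]))
print(f"Predicted tier: {tier[0]}")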
