🏠 Creating a model to predict housing prices using Python [Kaggle Beginner]

Predicting house prices is an important problem for the real estate industry and financial institutions. This article shows how to build a house price prediction model in Python using the dataset from Kaggle's "House Prices: Advanced Regression Techniques" competition. It walks step by step from data preprocessing to model evaluation, so that beginners can follow along.

Data Acquisition and Summary

First, download the data from the Kaggle "House Prices: Advanced Regression Techniques" competition page. The dataset contains training data (train.csv) and test data (test.csv). The training data contains the features of each house together with its sale price (SalePrice); the test data contains the features only.

import pandas as pd

# Read the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Check the shape of the data
print(f"Training data shape: {train.shape}")
print(f"Test data shape: {test.shape}")

The training data contains 1,460 samples and the test data contains 1,459. Each sample describes one house through features such as area, age, and number of rooms.

Data Preprocessing

Next, we perform data preprocessing, which includes handling missing values, converting data types, etc. 

Checking for missing values

# Check the number of missing values per column
missing_values = train.isnull().sum()
missing_values = missing_values[missing_values > 0]
print(missing_values)

Features with missing values need appropriate handling. For example, numerical features are typically imputed with the median, and categorical features with the mode.
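The median/mode rule described above can be sketched as a small helper. This is a minimal illustration rather than the exact approach of the notebooks this article is based on: the function name impute_missing and the toy values are assumptions made for the example, and only the column names come from the dataset.

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    result = df.copy()
    for col in result.columns:
        has_gap = result[col].isnull().any()
        all_missing = result[col].isnull().all()
        if not has_gap or all_missing:
            continue  # nothing to fill, or nothing to compute a statistic from
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].median())
        else:
            result[col] = result[col].fillna(result[col].mode().iloc[0])
    return result

# Toy data standing in for a few columns of train.csv (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, 70.0],
    "MasVnrType": ["BrkFace", "Stone", None, "BrkFace"],
})
filled = impute_missing(toy)
print(filled)
```

When the same idea is applied to the test data, it is usually safer to reuse the medians and modes computed on the training data, so that both sets are filled consistently.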

Data Type Conversion

Some numeric features are really categorical. For example, 'MSSubClass' is a numeric code identifying the building class, so it should be treated as categorical data rather than as a quantity.

# Convert 'MSSubClass' to categorical type
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)
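To see why the conversion matters, the sketch below one-hot encodes a converted column with pandas' get_dummies, so that each building class becomes its own indicator column instead of a number with an implied order. The toy DataFrame is made up for illustration, and one-hot encoding is shown here as one possible follow-up step, not something the original walkthrough prescribes.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are illustrative only
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "LotArea": [8450, 9600, 11250],
})

# Treat the building-class code as a label rather than a quantity
toy["MSSubClass"] = toy["MSSubClass"].astype(str)

# One-hot encode the label: one indicator column per building class
encoded = pd.get_dummies(toy, columns=["MSSubClass"])
print(encoded.columns.tolist())
```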

Feature Selection and Engineering

To improve model performance, it helps to select informative features and to create new ones.

Checking the correlation coefficient

First, check the correlation coefficient between each feature and the objective variable (SalePrice). 

import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation coefficients for the numerical columns only
# (numeric_only=True skips string columns such as the converted 'MSSubClass')
corr_matrix = train.corr(numeric_only=True)

# Display the features whose correlation with SalePrice exceeds 0.5 in absolute value
top_corr_features = corr_matrix.index[abs(corr_matrix["SalePrice"]) > 0.5]
plt.figure(figsize=(10, 10))
sns.heatmap(train[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()

In this way, features that have a strong correlation with the objective variable can be identified.
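A heatmap is convenient for a visual check, but a sorted list is often easier to read when choosing features. The sketch below ranks numeric features by the absolute value of their correlation with SalePrice. The toy data is made up, and the 0.5 threshold simply mirrors the one used in the heatmap code.

```python
import pandas as pd

# Toy stand-in for train.csv; the numbers are made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "Neighborhood": ["A", "B", "A", "C", "B"],
    "SalePrice": [100000, 160000, 190000, 260000, 300000],
})

# Correlation of every numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)

# Keep the features whose absolute correlation exceeds the chosen threshold
selected = target_corr[target_corr.abs() > 0.5].index.tolist()
print(selected)
```

Correlation only captures linear relationships, so a feature with a low coefficient is not necessarily useless; the ranking is a starting point rather than a final selection.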

Creating new features

For example, we can create a new feature representing the total floor area of the house by summing the basement area (TotalBsmtSF), the first-floor area (1stFlrSF), and the second-floor area (2ndFlrSF).

# Create feature for total area
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']

Building and evaluating the model

Once preprocessing and feature engineering are complete, we build and evaluate a model. Here we use a linear regression model.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the objective variable and features.
# LinearRegression accepts only numeric input without missing values, so as a
# simple baseline we keep the numeric columns and fill gaps with the median.
X = train.drop(['SalePrice', 'Id'], axis=1).select_dtypes(include='number')
X = X.fillna(X.median())
y = train['SalePrice']

# Split into training and validation data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using validation data
y_pred = model.predict(X_valid)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"RMSE: {rmse}")
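Linear regression needs numeric input with no missing values, so categorical columns have to be encoded before they can be used. One possible way to do this is a scikit-learn Pipeline that imputes and one-hot encodes inside the model itself. The sketch below is an assumption-laden illustration, not the method of the original notebooks: it runs on synthetic data, and the column lists numeric_cols and categorical_cols are placeholders you would replace with the real columns of train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for train.csv (made-up values, not real Kaggle data)
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(800, 3000, n)
subclass = rng.choice(["20", "60", "120"], n)
price = 50_000 + 90 * area + np.where(subclass == "60", 20_000, 0) + rng.normal(0, 5_000, n)
toy = pd.DataFrame({"GrLivArea": area, "MSSubClass": subclass, "SalePrice": price})
toy.loc[::25, "GrLivArea"] = np.nan  # inject a few missing values

numeric_cols = ["GrLivArea"]        # placeholder list of numeric features
categorical_cols = ["MSSubClass"]   # placeholder list of categorical features

# Median imputation for numbers; mode imputation plus one-hot encoding for categories
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

X = toy[numeric_cols + categorical_cols]
y = toy["SalePrice"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Imputation statistics and categories are learned from the training split only
pipeline.fit(X_train, y_train)
valid_pred = pipeline.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, valid_pred))
print(f"Validation RMSE on toy data: {rmse:.0f}")
```

Because the preprocessing lives inside the pipeline, calling pipeline.predict on new data applies the same imputation and encoding automatically, which helps keep the training and test features consistent.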

RMSE (Root Mean Squared Error) is the square root of the mean of the squared differences between the predicted and actual values. It expresses the typical prediction error in the same units as the target, so a lower value means a more accurate model.
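To make the definition concrete, here is a small hand-rolled version of the metric applied to three made-up prices. The helper name rmse is chosen for this example. The second call computes the same metric on log prices, which, according to the competition's evaluation description, is the form used for the leaderboard score; it penalizes relative rather than absolute errors.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices for illustration
actual = [100_000, 200_000, 300_000]
predicted = [110_000, 190_000, 330_000]

# Errors are 10k, -10k and 30k, so RMSE = sqrt((100 + 100 + 900) / 3) thousand
print(round(rmse(actual, predicted)))  # 19149

# The same metric on log prices measures relative error
log_rmse = rmse(np.log(actual), np.log(predicted))
print(round(log_rmse, 4))
```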

Prediction and submission file creation

Finally, we will make predictions on the test data and create a file to submit to Kaggle.

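# Predict with test data, using the same feature columns as in training.
# Missing numeric values are filled with the medians of the training features.
X_test = test[X.columns]
X_test = X_test.fillna(X.median(numeric_only=True))
test_predictions = model.predict(X_test)

# Create submission file
submission = pd.DataFrame({
    'Id': test['Id'],
    'SalePrice': test_predictions
})

# Save as CSV file
submission.to_csv('submission.csv', index=False)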

You can evaluate the performance of your model by uploading this CSV file to the Kaggle competition page.
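Before uploading, it can be worth running a quick sanity check on the file. The helper below is a hypothetical sketch (check_submission is not part of Kaggle or any library) that looks for a few common problems. The check for non-positive prices is included because a plain linear regression can occasionally predict negative values, which do not make sense as sale prices.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids) -> list:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if submission["Id"].tolist() != list(expected_ids):
        problems.append("Id column does not match the test set")
    if submission["SalePrice"].isnull().any():
        problems.append("SalePrice contains missing values")
    if (submission["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Toy submissions with illustrative Ids and prices
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, 180000.0]})
bad = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, -5.0]})
print(check_submission(good, [1461, 1462]))
print(check_submission(bad, [1461, 1462]))
```

With the real data, the call would look like check_submission(submission, test['Id']).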

Summary

In this article, we explained how to build a house price prediction model in Python using data from Kaggle's "House Prices: Advanced Regression Techniques" competition. Working through the full sequence of steps (data preprocessing, feature engineering, model building and evaluation, and creating the submission file) gives you an understanding of the basic machine learning workflow. As a next step, you can aim to improve prediction accuracy by trying more advanced models such as XGBoost and LightGBM.

*This article is based on the following Kaggle notebooks:

House Prices Solution for Beginners!! 

A Detailed Regression Guide with House-pricing
