House Price Prediction
Data Science Regression Project
House prices constitute a major part of one’s yearly expenditure. Most financial advisors agree that people should spend no more than 28 percent of their gross monthly income on housing expenses.
In a city like Lagos, Nigeria, for example, many people complain that rents are on the high side and would prefer to buy their own houses to avoid paying rent. Homeowners also complain about the high cost of running a house.
For this data science project, I was not able to get enough data on Lagos housing. However, I found a fantastic dataset on Kaggle for Bengaluru housing.
Bengaluru (also called Bangalore) is the capital of India’s southern Karnataka state. The center of India’s high-tech industry, the city is also known for its parks and nightlife. Lagos, Nigeria is also home to Nigeria’s high-tech industry.
Knowing the price of a house before buying or renting helps one make better financial plans and avoid running into unnecessary debt. Are houses overpriced? Is there a direct correlation between the number of bedrooms and house prices in an urban area? Which factors have the greatest impact on house prices? These questions and others are what I plan to use data science to answer.
The dataset for this analysis was obtained from Kaggle. The data is stored in CSV format and is structured, organized in rows and columns.
This dataset covers only Bengaluru housing and does not necessarily reflect what happens in other cities.
Data Cleaning and Manipulation
Python is the tool I have chosen for this project. The pandas library provides efficient data cleaning tools and quick visualizations for gaining insights.
The first task undertaken was loading the required libraries.
I downloaded the dataset from Kaggle. This CSV file contains 13,320 rows and 9 columns. The columns are area_type, availability, location, size, society, total_sqft, bath, balcony and price.
Digging into the area_type column, we observe that most of the listings are of the super built-up type, typical of a high-tech region.
A few columns in this dataset are not needed for the project, so we will proceed to drop them.
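The write-up does not list the dropped columns, but they can be inferred: of the nine original columns, only location, size, total_sqft, bath and price are used later. The sketch below assumes the other four are the ones dropped. The frame names df1 and df2 and all row values are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame with the nine column names from the dataset.
# The row values are invented, not taken from the real data.
df1 = pd.DataFrame({
    "area_type": ["Super built-up Area", "Plot Area"],
    "availability": ["Ready To Move", "Ready To Move"],
    "location": ["Whitefield", "Hebbal"],
    "size": ["2 BHK", "4 Bedroom"],
    "society": [None, None],
    "total_sqft": ["1200", "2600"],
    "bath": [2.0, 4.0],
    "balcony": [1.0, 2.0],
    "price": [60.0, 180.0],
})

# Assumption: these four columns are the ones not needed, because
# they never appear again in the analysis.
df2 = df1.drop(columns=["area_type", "availability", "society", "balcony"])
print(df2.columns.tolist())
```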
Going forward, we will check the dataset for missing values.
Since the rows with missing values are few (fewer than 100) compared with the 13,320 rows in the dataset, we will proceed to drop them.
After dropping these rows, the dataset is reduced from 13,320 to 13,246 rows.
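A minimal sketch of the missing-value check and the drop, using a small made-up frame in place of the real data. The names df2, df3 and missing_per_column are illustrative.

```python
import numpy as np
import pandas as pd

# Small invented frame with a few gaps, standing in for the real dataset.
df2 = pd.DataFrame({
    "location": ["Whitefield", None, "Hebbal"],
    "size": ["2 BHK", "3 BHK", None],
    "bath": [2.0, 3.0, np.nan],
    "price": [50.0, 95.0, 120.0],
})

# Count the missing values in each column.
missing_per_column = df2.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
df3 = df2.dropna()
print(len(df2), "rows before,", len(df3), "rows after")
```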
Exploring the size column reveals some inconsistencies in the labeling. Some values are in BHK, e.g. 2 BHK, 3 BHK, 4 BHK, and others are in bedrooms, such as 2 bedroom, 3 bedroom, 4 bedroom. Googling BHK helped me realize that it means bedroom, hall and kitchen, so 3 BHK, for example, effectively means the house has 3 bedrooms.
We will harmonize these labels by using a lambda function to return only the numeric value of the number of rooms. The result is stored in a new column, bhk.
The new column now has only numeric values which is good for analysis.
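One way to write that lambda is sketched below. It assumes the number always comes first and is separated from the label by a space, and that missing values have already been dropped. The sample values are taken from the labels mentioned above.

```python
import pandas as pd

df3 = pd.DataFrame({"size": ["2 BHK", "3 BHK", "4 bedroom"]})

# Keep only the leading number of each label, e.g. "3 BHK" -> 3.
df3["bhk"] = df3["size"].apply(lambda x: int(x.split(" ")[0]))
print(df3)
```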
df3.total_sqft.unique()
The total_sqft column has some inconsistencies as well. Some values are given in units other than square feet, and some are given as a range. Values given as a range are converted to the mean of that range. We will write a function that converts values to floats where possible and turns the rest into NaN.
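A conversion function along these lines could look like the sketch below. The name convert_sqft_to_num and the sample strings are assumptions made for illustration. A range is detected by splitting on a hyphen.

```python
import numpy as np

def convert_sqft_to_num(value):
    """Return value as a float, the mean of a 'low - high' range, or NaN."""
    tokens = str(value).split("-")
    if len(tokens) == 2:
        # Looks like a range: average the two ends if both are numbers.
        try:
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return np.nan
    try:
        return float(value)
    except (TypeError, ValueError):
        # Other units or free text cannot be converted.
        return np.nan

print(convert_sqft_to_num("1200"))
print(convert_sqft_to_num("2100 - 2850"))
print(convert_sqft_to_num("34.46Sq. Meter"))
```

Rows that come back as NaN can then be dropped or handled separately.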
df5 = df4.copy()
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
A new column is created to hold the price per square foot of each house. The prices are in lakh, and 1 lakh is 100,000 rupees, hence the multiplication by 100,000. The unit of the new column is rupees per square foot.
To reduce outliers in house size relative to the number of bedrooms, we will create a function to exclude apartments whose total area in sq ft is out of proportion to the number of bedrooms, using 300 sq ft per bedroom as the threshold.
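The sketch below shows one reading of this rule: keep only homes with at least 300 sq ft of total area per bedroom. The direction of the comparison is an assumption, since the text gives only the number 300. The function name and the sample rows are hypothetical.

```python
import pandas as pd

# Threshold from the text; the direction of the comparison is assumed.
MIN_SQFT_PER_BEDROOM = 300

def remove_small_per_bedroom(df, min_sqft=MIN_SQFT_PER_BEDROOM):
    """Keep rows whose total area per bedroom is at least min_sqft."""
    return df[df["total_sqft"] / df["bhk"] >= min_sqft]

# Invented rows: 600, about 167 and 300 sq ft per bedroom respectively.
df5 = pd.DataFrame({"total_sqft": [1200.0, 500.0, 900.0], "bhk": [2, 3, 3]})
df6 = remove_small_per_bedroom(df5)
print(df6)
```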
The data has a lot of outliers as shown in the histogram below.
A function was written to remove price per square foot outliers.
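The text does not say which rule the function applies. A common choice, assumed here, is to keep only rows whose price per square foot lies within one standard deviation of the mean for their location. The function name, the n_std parameter and the sample data are all illustrative.

```python
import pandas as pd

def remove_pps_outliers(df, n_std=1.0):
    """Keep rows within n_std standard deviations of their location's mean.

    Assumption: outliers are judged per location. A location with a single
    row has no standard deviation, so that row is dropped.
    """
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    lower = mean - n_std * std
    upper = mean + n_std * std
    mask = (df["price_per_sqft"] >= lower) & (df["price_per_sqft"] <= upper)
    return df[mask].reset_index(drop=True)

# Invented prices per sq ft: each location has one obvious outlier.
df7 = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "price_per_sqft": [4000.0, 4200.0, 4400.0, 12000.0,
                       6000.0, 6100.0, 6200.0, 2000.0],
})
df8 = remove_pps_outliers(df7)
print(df8)
```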
A scatter plot was made to show how the price per square foot varies with the total square foot area of 2 BHK and 3 BHK apartments.
We proceeded to remove BHK apartments whose price per square foot is less than the mean price per square foot of 1 BHK apartments. This reduced the number of rows to 7,329.
We also checked for apartments where the number of bathrooms exceeds the number of bedrooms (bhk) by 2 or more, and removed them from the dataset. This is another way to remove some of the outliers.
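The bathroom rule can be expressed as a single boolean filter. The sketch below uses invented rows, and the frame names df8 and df9 are illustrative.

```python
import pandas as pd

df8 = pd.DataFrame({"bhk": [2, 3, 4], "bath": [2.0, 5.0, 5.0]})

# Remove rows where bathrooms exceed bedrooms by 2 or more,
# i.e. keep rows with bath < bhk + 2.
df9 = df8[df8["bath"] < df8["bhk"] + 2]
print(df9)
```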
Build Machine Learning Model
Using One Hot Encoding for location.
A further two columns were dropped since they are not relevant for the model. The size column was dropped because it carries the same information as the bhk column but mixes strings and numbers. The price per square foot column was also dropped, as it was created mainly to help identify outliers and remove them. The dataset now has just 5 columns: location, total_sqft, bath, price and bhk.
The location values were converted into dummy variables.
The ‘other’ column was dropped.
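A sketch of the encoding step with pandas get_dummies, on a made-up frame. It assumes 'other' is one of the values in the location column. Dropping one dummy column is a common way to avoid redundant features: a row with zeros in every remaining location column then stands for 'other'. Whether that was the motivation here is an assumption.

```python
import pandas as pd

# Invented rows with the five remaining columns.
df10 = pd.DataFrame({
    "location": ["Whitefield", "other", "Hebbal"],
    "total_sqft": [1200.0, 900.0, 1500.0],
    "bath": [2.0, 2.0, 3.0],
    "price": [60.0, 45.0, 110.0],
    "bhk": [2, 2, 3],
})

# One indicator column per location value.
dummies = pd.get_dummies(df10["location"])

# Attach the indicators, leaving out 'other' and the original text column.
df11 = pd.concat(
    [df10.drop(columns="location"), dummies.drop(columns="other")],
    axis=1,
)
print(df11.columns.tolist())
```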
The model was built using linear regression.
The 5 iterations show a score above 77%. This is a good score.
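The write-up does not say how the 5 iterations were produced. One common approach, assumed here, is shuffle-split cross-validation with scikit-learn, where each iteration trains on a random 80% of the rows and scores on the remaining 20%. The data below is synthetic, so the scores it prints say nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data: price grows with area, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.05 * X[:, 0] + rng.normal(0, 5, size=200)

# Five random 80/20 train/test splits (the split sizes are an assumption).
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# For a regressor, the default score is R squared.
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores)
```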
Next we define a function to predict the price using the linear regression model.
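A prediction helper could be sketched as follows, assuming a scikit-learn LinearRegression trained on total_sqft, bath, bhk and one indicator column per location. The name predict_price, the training rows and the two location columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented training data with two one-hot location columns.
X = pd.DataFrame({
    "total_sqft": [1000.0, 1500.0, 1200.0, 2000.0],
    "bath": [2.0, 3.0, 2.0, 4.0],
    "bhk": [2, 3, 2, 3],
    "Hebbal": [1, 0, 0, 1],
    "Whitefield": [0, 1, 0, 0],
})
y = [50.0, 80.0, 55.0, 120.0]
model = LinearRegression().fit(X, y)

# Every column that is not a numeric feature is a location indicator.
location_columns = [c for c in X.columns if c not in ("total_sqft", "bath", "bhk")]

def predict_price(location, sqft, bath, bhk):
    """Build a one-row feature frame and return the predicted price in lakh."""
    row = pd.DataFrame(np.zeros((1, len(X.columns))), columns=X.columns)
    row.loc[0, ["total_sqft", "bath", "bhk"]] = [sqft, bath, bhk]
    # An unknown location leaves every indicator at zero.
    if location in location_columns:
        row.loc[0, location] = 1
    return float(model.predict(row)[0])

print(predict_price("Hebbal", 1000, 2, 2))
print(predict_price("Some Other Area", 1000, 2, 2))
```

Calling the function with a location the model has not seen leaves all indicator columns at zero, which matches how the dropped 'other' category is represented.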
A couple of tests were done for the model.
The full project can be accessed here.