# Housing Prices in Ames, Iowa: Kaggle’s Advanced Regression Competition


(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)
## Introduction
Residential real estate prices are fascinating… and frustrating. The homebuyer, the home-seller, the real estate agent, the economist, and the banker are all interested in housing prices, but they’re sufficiently subtle that no one person or industry has a complete understanding.
For our third overall project and first group project, we were assigned Kaggle’s Advanced Regression Techniques Competition. The goal, for the project and the original competition, was to predict housing prices in Ames, Iowa. While this particular competition is no longer active, the premise proved to be a veritable playground for testing our knowledge of data cleaning, exploratory data analysis, statistics, and, most importantly, machine learning. What follows is the combined work of Quentin Picard, Paul Ton, Kathryn Bryant, and Hans Lau.
## Data
As stated on the Kaggle competition description page, the data for this project was compiled by Dean De Cock for educational purposes, and it includes 79 predictor variables (house attributes) and one target variable (price). As a result of the educational nature of the competition, the data was pre-split into a training set and a test set; the two datasets were given as CSV files, each around 450 KB in size. Each of the predictor variables falls under one of the following categories:
lot/land variables
location variables
age variables
basement variables
roof variables
garage variables
kitchen variables
room/bathroom variables
utilities variables
appearance variables
external features (pools, porches, etc.) variables
The specifics of the 79 predictor variables are omitted for brevity but can be found in the data_description.txt file on the competition website. The target variable was Sale Price, given in US dollars.
## Project Overview (Process)
The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear, and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit, as this was the metric used by Kaggle to evaluate submitted models.
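
To make the metric concrete, here is a minimal sketch of RMSE computed on log prices. The function name `rmse_log` and the three example prices are made up for illustration; they are not taken from the project.

```python
import math

def rmse_log(actual_prices, predicted_prices):
    """Root mean squared error between log(actual) and log(predicted) prices."""
    if len(actual_prices) != len(predicted_prices):
        raise ValueError("inputs must have the same length")
    squared_errors = [
        (math.log(pred) - math.log(act)) ** 2
        for act, pred in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices in US dollars.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 320000.0]
score = rmse_log(actual, predicted)
```

Because the error is taken on the log scale, a 5% miss on a cheap house counts about the same as a 5% miss on an expensive one.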
Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R, but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing with curriculum than by anything else.
Throughout the project our group wanted to mimic a real-world machine learning project as much as possible, which meant that although we were given both a training set and a test set, we opted to treat the given test set as if it were “future” data. As a result, we further split the Kaggle training data into a private training set and a private testing set, with an 80/20 split, respectively. This allowed us to evaluate models in two ways before predicting on the Kaggle test data: with RMSE of predictions made on the private test set and with cross validation RMSE of the entire training set.
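
A private 80/20 split of this kind can be sketched as follows. The team did this step in R; the version below is plain Python, and the function name `private_split`, the seed, and the toy rows are assumptions for illustration only.

```python
import random

def private_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split rows into (private_train, private_test)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Toy stand-in for the Kaggle training data: 100 rows with only an Id.
kaggle_train = [{"Id": i} for i in range(1, 101)]
private_train, private_test = private_split(kaggle_train)
```

Fixing the seed matters in a group project, since every teammate then evaluates models against the same private test rows.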
Given the above choices, the process for training and evaluating each individual model was broken down as follows:
1. Grid search to tune hyperparameters (if applicable) to private train set
2. Fit model to private train set with tuned parameters
3. Predict on private test set; if RMSE okay, proceed.
4. Fit new model to Kaggle train set with private train hyperparameters
5. Cross-validate on Kaggle train set; if CV RMSE okay, proceed.
6. Predict on Kaggle test set
7. Submit predictions to Kaggle
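
The first five steps can be sketched end to end with a deliberately tiny stand-in model. Everything below is an assumption made for illustration: the one-feature ridge regression without intercept, the toy data, the penalty grid, and the five interleaved folds. None of it is the team's actual code or models.

```python
import math
import random

def fit_ridge(xs, ys, penalty):
    """One-feature ridge regression without intercept (closed-form slope)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def rmse(xs, ys, slope):
    """RMSE of the predictions slope * x against the targets ys."""
    return math.sqrt(sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def cv_rmse(xs, ys, penalty, folds=5):
    """Average held-out RMSE over interleaved folds."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        held = [i for i in range(len(xs)) if i % folds == k]
        slope = fit_ridge([xs[i] for i in train], [ys[i] for i in train], penalty)
        scores.append(rmse([xs[i] for i in held], [ys[i] for i in held], slope))
    return sum(scores) / folds

# Toy "Kaggle train" data (made up): y is roughly 2 * x plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

# Private 80/20 split of the toy training data.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Step 1: grid search over the penalty using the private train set only.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_penalty = min(grid, key=lambda p: cv_rmse(train_x, train_y, p))
# Step 2: fit on the private train set with the tuned penalty.
private_model = fit_ridge(train_x, train_y, best_penalty)
# Step 3: sanity check on the private test set.
private_test_rmse = rmse(test_x, test_y, private_model)
# Step 4: refit on the full train set, reusing the private-train penalty.
final_model = fit_ridge(xs, ys, best_penalty)
# Step 5: cross-validate on the full train set.
full_cv_rmse = cv_rmse(xs, ys, best_penalty)
```

Steps 6 and 7 (predicting on the Kaggle test set and submitting) are omitted because they depend on the competition files.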
Computing RMSE of predictions made on the private test set served as a group sanity check: if anything was amiss with a model, we found out at this point and could correct for it before proceeding. The decision in step 4 to fit a new model to the Kaggle train set using the private train hyperparameters found in step 1 was one with which we never felt completely at ease; we were trying to minimize overfitting to the Kaggle training set, but also recognized that by not re-tuning on the full training set we weren’t using all the information available. One benefit of this choice was that we never saw any major discrepancies between our own cross validation scores in step 5 and the Kaggle scores from step 7. With more time, we would have further investigated whether keeping the private train parameters or re-tuning for the final models was more beneficial.
The last noteworthy choice we made in our overall process was to feed differently-cleaned datasets to our linear-based models and our tree-based models. In linear-based models (linear, ridge, LASSO, elastic net, spline), prediction values are continuous and sensitive to outliers, so we opted to sacrifice information about “rare” houses in favor of gaining better predictions on “common” houses. We also wanted to minimize the likelihood of rare levels only showing up in our test data and/or causing columns of all 0’s in dummified data. We executed this tradeoff by releveling any nominal categorical variables that contained extremely rare classes, where “rare” was defined to be “occurring in less than 1% of the observations.”
For example, the Heating variable had six levels, but four of them combined accounted for only about 1% of the observations; so, we combined these four levels into a new ‘other’ level so that releveled Heating had only three levels, all accounting for 1% or more of the observations. Depending on the variable, we either created an ‘other’ level as in the example or we grouped rare levels into existing levels according to level similarity in the variable documentation.
We opted not to relevel data fed into our tree-based models because tree predictions are more robust to outliers and rare classes; trees can separate rare observations from others through splits, which prevents common observation predictions from being distorted by rare observations in fitting.
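
The releveling rule used for the linear-model data can be sketched as below. The team did this in R; this is a plain-Python illustration. The level names are the Heating codes from the data description, but the counts are invented so that four levels fall under the 1% threshold, and `relevel_rare` is a hypothetical helper name.

```python
from collections import Counter

def relevel_rare(values, min_share=0.01, other_label="Other"):
    """Replace every level seen in less than `min_share` of rows with `other_label`."""
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if n / total < min_share}
    return [other_label if v in rare else v for v in values]

# Invented counts (1,000 rows): two common levels and four rare ones.
heating = (
    ["GasA"] * 960 + ["GasW"] * 30 + ["Grav"] * 5
    + ["Wall"] * 3 + ["OthW"] * 1 + ["Floor"] * 1
)
releveled = relevel_rare(heating)
remaining_levels = sorted(set(releveled))
```

This sketch covers only the "create an 'other' level" variant; merging a rare level into a similar existing level would need a hand-written mapping per variable.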
## Data Cleaning and EDA
We were fortunate with the Kaggle data in that it came to us relatively clean. The only basic cleaning tasks were to correct typos in levels of categorical variables, specify numeric or categorical variables in R, and rename variables beginning with numbers to satisfy R’s variable name requirements. There were a number of “quality” and “condition” variables that had levels of Poor, Fair, Typical/Average, Good, and Excellent, which we label encoded as integers 1-5 to preserve their inherent ordinality. For nominal categorical variables, we used one-hot encoding from the ‘vtreat’ package in R. (Most machine learning algorithms in Python require all variables to have numeric values, and although R’s machine learning algorithms can handle nominal categorical variables in non-numeric/string form, R’s computations effectively use one-hot encoding under the hood in these cases.)
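
The two encodings can be illustrated with a short sketch. The team used R and ‘vtreat’ for this; the plain-Python helpers below (`encode_quality`, `one_hot`) are hypothetical stand-ins, and the sketch assumes the quality codes Po, Fa, TA, Gd, Ex from the data description with no missing values.

```python
# Ordered quality codes mapped to integers 1-5 (Poor ... Excellent).
QUALITY_LEVELS = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(values, mapping=QUALITY_LEVELS):
    """Label-encode an ordinal quality/condition column, preserving order."""
    return [mapping[v] for v in values]

def one_hot(values):
    """One-hot encode a nominal column as a list of {level: 0 or 1} dicts."""
    levels = sorted(set(values))
    return [{level: int(v == level) for level in levels} for v in values]

kitchen_quality = encode_quality(["TA", "Ex", "Po", "Gd"])
roof_style = one_hot(["Gable", "Hip", "Gable"])
```

Integer codes suit ordered levels because the distance between codes carries meaning; nominal levels have no order, so each gets its own 0/1 column instead.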
After taking care of these basic cleaning issues, we needed to address missingness in our data. Below is a plot that shows the variables containing missing values and the degree to which missingness occurs in each (shown in yellow):
For all variables except Garage Year Built and Lot Frontage, we performed basic mode imputation. Mode imputation was chosen for simplicity and because it could be done with both categorical and numerical data. Of course, mode imputation has the negative side effect of artificially decreasing variance within the affected variable and therefore would not be appropriate for variables with higher degrees of missingness. It was for this reason, in fact, that we approached missingness differently for Garage Year Built and Lot Frontage.
For missing values in Garage Year Built, we imputed the Year Built for the house. We justified this because most garages are built at the same time as the house, and houses without garages get no penalty or benefit by having the Garage Year Built equal to the Year Built for the house.
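
Both rules are simple enough to show in a few lines. This is a plain-Python sketch rather than the team's R code; the helper names and the example values are assumptions, and missing values are represented as `None`.

```python
from collections import Counter

def impute_mode(values):
    """Fill missing entries (None) with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_garage_year(garage_years, house_years):
    """Fill a missing garage year with the year the house itself was built."""
    return [
        house if garage is None else garage
        for garage, house in zip(garage_years, house_years)
    ]

electrical = impute_mode(["SBrkr", None, "SBrkr", "FuseA"])
garage_year = impute_garage_year([1990, None, 2005], [1985, 2003, 2005])
```

Mode imputation works on strings and numbers alike, which is the convenience the text mentions, but every filled row gets the same value, which is why it shrinks the variable's variance.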
Lot Frontage, the variable with the greatest degree of missingness, was researched and explored in order to arrive at an imputation strategy. Lot Frontage was defined to be the linear feet of street connected to the property. Given that most properties were either regularly or only slightly irregularly shaped according to Lot Shape, we deduced that most properties would have Lot Frontage values that were correlated with Lot Area values. However, since length is measured in units and area is measured in square units, we found it most appropriate to relate log(Lot Frontage) with log(Lot Area), so as to get a linear relationship between the two. See plot below.
Thus, we imputed missing values for Lot Frontage by fitting a linear model of log(Lot Frontage) regressed onto log(Lot Area), predicting on the missing values for Lot Frontage using Lot Area, and then exponentiating (inverse log-ing) the result.
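
The reasoning is that if frontage scales like a power of area (frontage ≈ c · area^k, with k = 0.5 for a square lot), then log(frontage) = log(c) + k · log(area), which is a straight line. A minimal sketch of the imputation follows; the team did this in R, and the helper names and the made-up square lots below are assumptions for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_lot_frontage(frontages, areas):
    """Fill missing frontages (None) from a log-log regression on lot area."""
    known = [(math.log(a), math.log(f)) for f, a in zip(frontages, areas) if f is not None]
    intercept, slope = fit_line([x for x, _ in known], [y for _, y in known])
    return [
        math.exp(intercept + slope * math.log(a)) if f is None else f
        for f, a in zip(frontages, areas)
    ]

# Made-up square lots: frontage = sqrt(area), so the log-log slope is 0.5.
areas = [3600.0, 6400.0, 8100.0, 10000.0, 14400.0]
frontages = [60.0, 80.0, None, 100.0, 120.0]
filled = impute_lot_frontage(frontages, areas)
```

On real lots the points scatter around the line, so the imputed value is an estimate of typical frontage for that area, not an exact recovery.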
The next step in EDA after finding and addressing missingness was to look at outliers. Looking at boxplots of both Sale Price and log(Sale Price), we saw that there were quite a few outliers, where ‘outliers’ are mathematically defined to be observations lying more than 1.5*IQR (Interquartile Range) above the third quartile or below the first quartile.
Although removing the outliers may have improved our predictions for average-priced homes, we were hesitant to do so due to the relatively small size of our sample (about 1,460 observations in the Kaggle training set). We felt that removing any outliers without further justification than them simply being outliers would likely jeopardize our predictions for houses in the extremes.
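
The 1.5*IQR rule can be written out directly. This is a sketch with made-up prices; the helper name `iqr_outliers` is hypothetical, and the quartiles use Python's "inclusive" method, which may differ slightly from the hinges an R boxplot draws.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sale prices in US dollars, with one very expensive house.
prices = [120000, 135000, 150000, 160000, 175000, 190000, 210000, 755000]
flagged = iqr_outliers(prices)
```

The rule only flags candidates; as the text notes, whether to drop them is a separate judgment.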
Since outliers would have the most impact on the fit of linear-based models, we further investigated outliers by training a basic multiple linear regression model on the Kaggle training set with all observations included; we then looked at the resulting influence and studentized residuals plots:
From these, we saw that there were only two observations that could justifiably be removed: observation 1299 and observation 251. These were both beyond or on the lines representing Cook’s Distance, meaning that as individual observations they had a significant impact on the regression formula; as such, we removed these two observations from consideration.
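
To show what Cook's distance measures, here is a sketch for the simplest case of a single predictor, using D_i = (e_i² / (p · s²)) · h_i / (1 − h_i)², where e_i is the residual, h_i the leverage, p the number of fitted parameters, and s² the residual mean square. The single-predictor setup and the toy data are simplifying assumptions; the team's model was a multiple regression fitted in R.

```python
def cooks_distances(xs, ys):
    """Cook's distance of each point in a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    p = 2  # fitted parameters: intercept and slope
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point in a one-predictor regression.
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return [
        (e * e / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]

# Toy data: nine points on the line y = 2x, plus one far-out point off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]
distances = cooks_distances(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
```

A point needs both a large residual and high leverage to score highly, which is why only a couple of observations stand out even when many are boxplot outliers. A common rule of thumb treats values around 0.5 to 1 or above as worth inspecting.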
The last bit of preprocessing we did was dealing with multicollinearity. This, as for outliers, was cleaning done mostly for the benefit of linear-based models; in fact, it was done only for vanilla multiple linear regression, since regularization in ridge, LASSO, and elastic net models deals with collinearity by construction. To eliminate collinearity issues, we used the findLinearCombos() function from the ‘caret’ package in R on our dummified data. This function identified linear combinations between predictor variables and allowed us to easily drop linearly dependent variables.
## Feature Engineering
For feature engineering we took a two-pronged approach: we used anecdotal data/personal knowledge and we did research.
Garage interaction: Garage Quality * Number of Cars Garage Holds.
If a home has a really great or really poor garage, the impact of that quality component on price will be amplified by the size of the garage.
Total number of bathrooms: Full Bath + Half Bath + Basement Full Bath + Basement Half Bath.
In our experience, houses are often listed in terms of total number of bedrooms and total number of bathrooms. Our data had total number of bedrooms, but lacked total number of bathrooms.
Average room size: Above-Ground Living Area / Total Number of Rooms Above Ground.
“Open concept” homes have been gaining popularity and homes with large rooms have always been popular, and with the provided variables we believe average room size might address both of these trends.
Bathroom to bedroom ratio: (Full Bath + Half Bath) / Number of Bedrooms Above Ground.
The number of bathrooms desired in a house depends on the number of bedrooms. A home with one bedroom will not be viewed nearly as negatively for having only one bathroom as would a house with three bedrooms.
Comparative size of living area: Above-Ground Living Area / mean(Above-Ground Living Area).
This variable attempts to capture house size as it directly compares to other houses, but in a manner different from mere number of rooms.
Landscape-ability interaction: Lot Shape * Land Contour.
Landscaped homes tend to sell for more (see this article) and the ability for a lot to be easily landscaped is, in part, determined by the shape of the lot (regular, slightly irregular, irregular, very irregular) and the land contour (level, banked, hillside, depressed). Certain combinations may be workable (regular shape, hillside) while other combinations (very irregular shape, hillside) may make landscaping difficult and/or expensive.
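
The six engineered features listed above can be sketched as one function over a single house record. The input keys follow the competition's column names; the output feature names, the example house, and the mean living area of 1,500 are made up. Two further assumptions: Garage Quality is already encoded as an integer 1-5, and the Lot Shape * Land Contour interaction is represented as a combined category. The team's R implementation may have differed on both.

```python
def add_features(house, mean_living_area):
    """Return a copy of `house` with the six engineered features added."""
    baths_above_ground = house["FullBath"] + house["HalfBath"]
    bedrooms = house["BedroomAbvGr"]
    features = dict(house)
    features["GarageInteraction"] = house["GarageQual"] * house["GarageCars"]
    features["TotalBaths"] = (
        baths_above_ground + house["BsmtFullBath"] + house["BsmtHalfBath"]
    )
    features["AvgRoomSize"] = house["GrLivArea"] / house["TotRmsAbvGrd"]
    # Assumption: houses with zero bedrooms get no ratio rather than an error.
    features["BathToBedRatio"] = baths_above_ground / bedrooms if bedrooms else None
    features["RelativeLivArea"] = house["GrLivArea"] / mean_living_area
    features["LandscapeInteraction"] = f'{house["LotShape"]}_{house["LandContour"]}'
    return features

# Made-up example house.
house = {
    "GarageQual": 3, "GarageCars": 2,
    "FullBath": 2, "HalfBath": 1, "BsmtFullBath": 1, "BsmtHalfBath": 0,
    "GrLivArea": 1800.0, "TotRmsAbvGrd": 8, "BedroomAbvGr": 3,
    "LotShape": "Reg", "LandContour": "Lvl",
}
engineered = add_features(house, mean_living_area=1500.0)
```

In practice the mean living area should be computed on the training rows only and then reused for the test rows, so that no test information leaks into the feature.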
Of the six features that we added, only “landscape-ability” resulted from research; we either already had the important industry variables in the data (total above-ground square footage or neighborhood, for example) or we did not have the information we needed to create the desired variables. Additional features we would have liked to add include geolocation data for proximity to schools and shopping centers, quality of nearby schools, and specific rooms/house features remodeled (if applicable).
One step that we glossed over a bit was extensive adding and removing of features. We noticed with just a few trials that removing existing features seemed to negatively impact the performance of our models and that the models improved when we added features. Given that the regularized linear models would give preference to more important features via larger coefficients and tree-based models would give preference to more important features by splitting preferences, we opted not to spend too much time manually adding and removing features. Essentially, we allowed our models to select the most predictive features themselves.
## Modeling
For our individual models, we trained the following:
Multiple linear regression
Ridge regression
LASSO regression
Elastic net regression
Spline regression
Basic decision tree
Random forest
Gradient boosted tree
XGBoosted tree
Overall our models consistently had RMSE values between .115 and .145. As stated earlier, our cross validation scores were usually very close to our scores on Kaggle. To our surprise, our linear-based models did very well compared to our tree-based models. Even the much-hyped XGBoost model was at best on par with our linear-based spline and elastic net models, and our random forest models tended to be much worse. An abbreviated table of our results is shown below:

| | Elastic Net | Spline | Random Forest | Gradient Boost | XGBoost | Ensemble |
|---|---|---|---|---|---|---|
| Cross Validation | 0.12157 | 0.11181 | 0.14171 | 0.12491 | 0.12282 | 0.11227 |
| Held-out Test | 0.11762 | 0.11398 | 0.13834 | 0.11403 | 0.11485 | n/a |
| Kaggle | 0.12107 | 0.11796 | n/a | n/a | n/a | 0.11710 |

As we endeavored to find a final, “best” model via ensembling and stacking, we followed the advice of our instructor (and successful Kaggler) Zeyu by attempting to combine models with different strengths and weaknesses.
We took the predictions from each of our best base models (Spline, Gradient Boost, XGBoost) as the features to use for the next level. Our best result came from a weighted average of the three, with weights determined by a grid search. The blend of 76% Spline, 14.5% Gradient Boost, and 9.5% XGBoost gave us a 0.11227 RMSE on our training set and 0.11710 on Kaggle.
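
A weighted-average blend with grid-searched weights can be sketched as follows. The predictions and targets below are invented log prices, the 0.05 grid step is an assumption, and the search is written for exactly three models; the team's actual grid and code are not given in the article.

```python
import math

def rmse(preds, targets):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

def blend(weights, prediction_sets):
    """Weighted average of several models' predictions, row by row."""
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*prediction_sets)]

def grid_search_weights(prediction_sets, targets, step=0.05):
    """Try every non-negative weight triple on the grid that sums to 1."""
    steps = round(1 / step)
    best_score, best_weights = None, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            weights = (i * step, j * step, (steps - i - j) * step)
            score = rmse(blend(weights, prediction_sets), targets)
            if best_score is None or score < best_score:
                best_score, best_weights = score, weights
    return best_score, best_weights

# Invented log(Sale Price) targets and three models' predictions.
targets = [12.0, 11.5, 12.4, 11.9, 12.2]
spline = [12.04, 11.47, 12.43, 11.86, 12.21]
gradient_boost = [11.93, 11.58, 12.31, 11.97, 12.12]
xgboost_preds = [12.08, 11.41, 12.49, 11.83, 12.27]
models = [spline, gradient_boost, xgboost_preds]

best_score, best_weights = grid_search_weights(models, targets)
# The article's reported weights applied to the same toy predictions.
article_blend = blend((0.76, 0.145, 0.095), models)
```

Because the grid includes the corners (1, 0, 0), (0, 1, 0), and (0, 0, 1), the best blend can never score worse than the best single model on the data used for the search; whether that gain holds on new data is a separate question.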
We also experimented with using a second level meta-model to do the stacking, but with a linear meta-model, the coefficients were highly unstable because the predictions were so closely related to each other, and with a gradient boost meta-model, we were unable to beat our best base model alone.
## Conclusions
As mentioned above, we were surprised by the strength of our various linear models in comparison to our tree models. We suspect this had a lot to do with the data itself, in that Sale Price (or rather, log(Sale Price)) likely has a relatively linear relationship with the predictor variables. This highlights one of the most important takeaways from this project: that linear models have a real place in machine learning.
In situations like this, where the performance of a simpler model and a more complex model is similar, the better interpretability and ease of training of the simpler model may also prove to be deciding factors in model choice. In most cases, stacking multiple base-layer models into a second layer is impractical, despite performing slightly better.