Computing Housing Prices Using Tensorflow

Matthew MacFarquhar
3 min read · Oct 25, 2022


In this tutorial, we will create a DNN (Deep Neural Network) to estimate the prices of houses in Washington State.

The Data

I got the data from Kaggle. It is a pretty clean dataset with 18 columns.

import pandas as pd

df = pd.read_csv('housing_data.csv')
df.head()

First, we read in the data and display the first few rows. Looking at the columns, a few of the string inputs (date, street) are almost unique across the dataset. When we encode the data, we take a one-hot encoding approach for the string fields (using pd.get_dummies), so if the street column has 1000 different entries we turn it into 1000 columns, each holding a 1 if the house is on that street and a 0 if it is not. That would blow up our input set, so we drop the columns that would cause this feature ballooning (along with country, which adds nothing for a dataset limited to Washington State).

# One-hot encode the remaining string columns; price is the prediction target
X = pd.get_dummies(df.drop(["price", "date", "country", "street"], axis=1))
y = df["price"]
X.head()
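If you want to see which columns would blow up under one-hot encoding before dropping them, a quick cardinality check (my own small addition, not part of the original walkthrough) makes the decision obvious:

# Count distinct values per column; high-cardinality string columns
# like street and date are the ones that would balloon into thousands
# of one-hot columns.
print(df.nunique().sort_values(ascending=False))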

The Model

We will create a DNN with one normalization layer, three hidden dense layers, and a single output layer.

Normalization

We normalize inputs so that the model can train more effectively and does not have to fight the scale differences between features. For example, number of bedrooms and sqft should start out with the same importance, but since sqft values are far larger than bedroom counts, sqft would dominate at initialization and the model would have to unlearn that bias.

from tensorflow.keras.layers import Normalization

# Learn each feature's mean and variance from the training data
normalizer = Normalization()
normalizer.adapt(X)

We normalize a column by taking each value, subtracting the column's mean, and dividing by the column's standard deviation. (There are a few normalization methods, but TensorFlow's Normalization layer uses the one described above, commonly known as standardization or z-score normalization.)
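As a quick sanity check, here is a small sketch of my own (not from the original post) that reproduces the standardization by hand for one column; I am assuming the Kaggle dataset names its square-footage column sqft_living:

import numpy as np

# Assumed column name from the Kaggle housing dataset
sqft = np.array(df["sqft_living"], dtype="float32")

# Standardization (z-score): subtract the column mean, divide by the standard deviation
z = (sqft - sqft.mean()) / sqft.std()

print(z[:5])  # should closely match the Normalization layer's output for this feature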

The Structure

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(normalizer)
model.add(Dense(units=16, activation='relu', input_dim=len(X.columns)))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=1))
model.summary()

We start with the normalization layer, then stack three hidden layers with 16, 32, and 16 neurons, followed by a single-neuron output layer that produces the price prediction.

Training & Evaluation

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_absolute_error')

We set our optimizer to TensorFlow's Adam optimizer, and our loss metric is mean absolute error, i.e. |predicted value - actual value| averaged over the examples.
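To make that loss concrete, here is a tiny illustration of my own (not from the original post) of what mean absolute error computes:

import numpy as np

# Three made-up houses: actual prices vs. model predictions
y_true = np.array([300000.0, 450000.0, 600000.0])
y_pred = np.array([320000.0, 440000.0, 650000.0])

# Mean absolute error: average of the absolute differences
mae = np.mean(np.abs(y_pred - y_true))
print(mae)  # ~26666.67, i.e. the model is off by roughly $27k per house on average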

history = model.fit(X, y, epochs=200, batch_size=16, validation_split=0.1)

We then fit the model with a 90–10 train-validation split and store the history.

import matplotlib.pyplot as plt

def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 1000000])
    plt.xlabel('Epoch')
    plt.ylabel('Error [Price]')
    plt.legend()
    plt.grid(True)

plot_loss(history)

After plotting the loss history, you should see a chart of the training and validation loss over the epochs, something like the image below.

As you can see, our training loss went down a decent amount, but our validation loss plateaued quite a bit higher. To improve the generalization of the model, I would add some dropout to the layers, and I would also try to extract some meaning from the street and date columns (to capture things like neighborhood and time of sale) to improve accuracy.
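As a sketch of that first suggestion (my own illustration, not code from the original post), a dropout-regularized version of the same network could look like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

regularized_model = Sequential()
regularized_model.add(normalizer)
regularized_model.add(Dense(units=16, activation='relu'))
regularized_model.add(Dropout(0.2))  # randomly zero out 20% of activations during training
regularized_model.add(Dense(units=32, activation='relu'))
regularized_model.add(Dropout(0.2))
regularized_model.add(Dense(units=16, activation='relu'))
regularized_model.add(Dense(units=1))
regularized_model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_absolute_error')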
