This is a more involved Keras example covering regression. It uses a good-sized dataset from Kaggle, and requires a bit of data cleansing before we can build out the model. The model we end up with isn't perfect and would benefit from more tuning or further dataset alterations, but it's a good example nonetheless. More information below.

## Keras / Tensorflow Regression - Example¶

Here we're going to attempt to utilize Keras/Tensorflow to predict the price of homes based upon a set of features.

The data being used comes from Kaggle.

### Imports¶

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

### Data Exploration and Cleansing¶

```
df = pd.read_csv('DATA/kc_house_data.csv')
```

Since we're going to predict prices, we can start with a quick distribution plot. The vast majority of prices sit around 500k, with a small number of outliers stretching all the way out past 7m.

```
plt.figure(figsize=(15,6))
sns.histplot(df['price'], kde=True)  #distplot is deprecated in recent seaborn releases
```

One thing we may want to do is get rid of these outliers (at least to an extent). These will certainly affect our model and may skew results. Sorting by price (and from the visual above) we can see that ~3.5m may be a logical cutoff for keeping data.

```
df.sort_values('price',ascending=False).head(20)
```

If we take this concept, we can find out what percentage of homes sit above the 3.5m mark.

```
#Original
df.shape
```

```
#At or below the 3.5m cutoff
df[df['price'] <= 3500000].shape
```

The homes above this cutoff account for less than one percent of the dataset, so we can drop them entirely.
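The arithmetic behind that claim, sketched on a toy frame (in the notebook, the two `.shape` calls above give the real counts):

```python
import pandas as pd

# Toy stand-in for the real df loaded from kc_house_data.csv
df = pd.DataFrame({'price': [300_000, 500_000, 450_000, 4_000_000, 600_000]})

total = len(df)
above_cutoff = len(df[df['price'] > 3_500_000])
pct_above = 100 * above_cutoff / total
print(pct_above)  # 20.0 on this toy frame; roughly 1% on the full dataset
```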

```
#Kick out outliers
df = df[df['price'] <= 3500000]
```

We do have a date column in the DataFrame, but it'll probably make more sense to convert it into month/year columns to allow for better analysis.

```
#Convert string to date
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].apply(lambda date : date.year)
df['month'] = df['date'].apply(lambda date : date.month)
```
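The `.apply` calls above work fine, but pandas can do the same extraction vectorized through the `.dt` accessor. A small sketch with made-up dates in the same `20141013T000000` format the CSV uses:

```python
import pandas as pd

# Toy dates; the real column comes from kc_house_data.csv
dates = pd.to_datetime(pd.Series(['20141013T000000', '20150225T000000']))

# .dt gives vectorized access to datetime components, no lambda needed
years = dates.dt.year
months = dates.dt.month
print(list(years), list(months))  # [2014, 2015] [10, 2]
```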

```
#Drop since we no longer need the original field
df = df.drop('date',axis=1)
```

We can also drop a few other fields we won't need. ID serves no purpose in our model and can be thrown away. Zipcode isn't in a useful format for predictive analysis and would need to be transformed in some way; in reading up on this dataset, folks seem to agree that the zipcode data is bad or incorrect, so I'm going to cut my losses and toss it as well.

```
df = df.drop('id',axis=1)
df = df.drop('zipcode',axis=1)
```

```
df.head()
```

### Build Model¶

```
#Separate features from label
X = df.drop('price',axis=1)
y = df['price']
```

```
from sklearn.model_selection import train_test_split
```

```
#Perform splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```

#### Perform Scaling¶

Since the value ranges vary widely across fields, it's best to scale the data first.

```
from sklearn.preprocessing import MinMaxScaler
```

```
scaler = MinMaxScaler()
```

```
#Fit on the training set only, then transform both sets
#(fitting on the test set would leak test information into the scaling)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
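A tiny sketch of why the fit/transform split matters: the scaler learns min/max from the training data only, and the test set is mapped with those same statistics (toy values here, not the housing data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [6.0]])

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # learns min=1, max=5 from train
X_test_s = scaler.transform(X_test)        # reuses the train min/max
print(X_test_s.ravel())  # [0.25 1.25] -- test values can fall outside [0, 1]
```

Had we refit on the test set, both sets would be squeezed into [0, 1] independently and the two feature spaces would no longer be comparable.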

#### Fit to Neural Network¶

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
```

```
#Base the network size on the number of feature columns
X_train.shape
```

```
model = Sequential()
#Make it 19 neurons since we have 19 cols
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.25)) #Randomly disable 25% of neurons each pass (value between 0 and 1, where 1 = 100%) to help avoid overfitting
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
```

```
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=25)
```

To see how our model actually performs (and whether we're overfitting), we pass the test values via the validation_data parameter so we can compare predictions against actuals at each epoch. A batch_size of 128 keeps the updates manageable (smaller batches generally help reduce overfitting). We've also applied the 25% dropout on the hidden layers, and an early stop in case overfitting is detected.

```
model.fit(x=X_train, y=y_train.values,
          validation_data=(X_test, y_test.values),
          batch_size=128, epochs=400, callbacks=[early_stop])
```

Compare the losses of the training vs. test dataset to see if we're overfitting.

If you see the orange line (val_loss) climb toward the end, it means you're overfitting: the loss on your validation data is growing while the training loss keeps dropping.

With the plot below, we do see val_loss start to rise slightly after roughly 200 epochs, which could indicate a little bit of overfitting. The early stopping callback caught this and kept us from running the full 400 epochs.

```
losses = pd.DataFrame(model.history.history)
losses.plot()
```

#### Predict and Evaluate¶

```
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
```

```
#Grab predictions from the testing set
predictions = model.predict(X_test)
```

```
#MAE
mean_absolute_error(y_test,predictions)
```

```
#Mean price
df['price'].mean()
```

The Mean Absolute Error indicates we're off by an average of ~158k against an average home price of ~536k, meaning we're off by a little under 30%. I'm guessing we're still seeing too much interference from the expensive outliers that remain below the 3.5m cutoff, so let's continue.
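The rough arithmetic behind that 30% figure, using approximate values pulled from the outputs above:

```python
mae = 158_000          # approximate mean absolute error from above
mean_price = 536_000   # approximate mean sale price from above
error_pct = 100 * mae / mean_price
print(round(error_pct, 1))  # 29.5, i.e. a little under 30%
```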

```
#RMSE (square root of the MSE, so it's in price units)
np.sqrt(mean_squared_error(y_test,predictions))
```

```
#Best possible score is 1.0
#How much variance is explained by our model?
explained_variance_score(y_test,predictions)
```

You can see we're only explaining approximately 64% of the variance in the data. We can plot the predictions against the actual values to see visually how well we're doing.

```
plt.figure(figsize=(12,6))
plt.scatter(y_test,predictions)
plt.plot(y_test,y_test,'r')
```

We're actually doing a good job on predictions for prices below 1.5m or so. The problem seems to be with the more expensive houses.
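One way to quantify that observation is to bucket the test homes by price and compare the mean absolute error per band. A sketch on toy stand-ins for `y_test` and `predictions` (the real arrays come from the model above):

```python
import numpy as np
import pandas as pd

# Toy stand-ins, chosen so the expensive band errs more -- mirroring
# what the scatter plot above suggests about the real data
y_test = pd.Series([300_000, 900_000, 1_600_000, 2_500_000])
predictions = np.array([310_000, 950_000, 1_300_000, 1_800_000])

errors = pd.DataFrame({'actual': y_test,
                       'abs_err': (y_test - predictions).abs()})
errors['band'] = pd.cut(errors['actual'],
                        bins=[0, 1_500_000, 4_000_000],
                        labels=['<=1.5m', '>1.5m'])
band_mae = errors.groupby('band', observed=True)['abs_err'].mean()
print(band_mae)  # much larger mean error in the >1.5m band
```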

### Predict New Home Prices¶

So how can we predict the price of a brand-new home? In the example below we take the first house in the DataFrame and remove its price. Then we scale the data and predict using our trained model, so we can compare the predicted price against the actual.

```
#Assign the first house in the DataFrame to a new DataFrame
single_house = df.drop('price',axis=1).iloc[0]
```

```
#Scale the values AND reshape the results (so the shape matches the expected shape for the model)
single_house = scaler.transform(single_house.values.reshape(-1,19))
```

```
#Predict the price
model.predict(single_house)
```

```
#Check the real price
df.head(1)
```

We're overshooting the price a bit here: we predict ~280k, but the house actually sold for 222k. We'd need to spend some more time tweaking the model to get things to come out better. It takes a while to get this right; this result came after trying 4-5 different models, and we're getting much closer to the actual price.
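For reference, the size of that overshoot as a percentage of the actual price (approximate figures from above):

```python
predicted = 280_000   # approximate model prediction from above
actual = 222_000      # actual sale price of the first house
overshoot_pct = 100 * (predicted - actual) / actual
print(round(overshoot_pct, 1))  # 26.1, i.e. ~26% over the actual price
```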