Predicting airline passengers using LSTM and TensorFlow

In this article, we will go over what an RNN is, what an LSTM is, and what a GRU is. We will then use an LSTM to train a model on flight data to predict future flight trends.

Before we build our network, we need to understand what an RNN is, what problems it can be used on, and how it differs from a regular DNN (deep neural network). RNN stands for Recurrent Neural Network, where “recurrent” means “occurring repeatedly”. These networks apply a single layer over and over again instead of propagating data through successively deeper layers (as in a DNN).

In the diagram above, we show that we are interacting with the same layer, just feeding more data in over time. X goes into the model, operations are performed, and the result is sent both to the output and to V, which is then fed back into the layer alongside the next X value in the time series.

RNNs are great for sequential data, where the previously consumed data is very important for predicting an output. For example, when predicting stock prices, imagine we fed the sequence $70, $80, $90 into an RNN. We would expect the output to be about $100, based on the upward trend. Now imagine we had the sequence $110, $100, $90. When we feed that into an RNN, we would expect it to give us a value like $80. The RNN is more powerful than the DNN in this case because it lets us remember past inputs. If we fed the values one at a time into a DNN, on the other hand, there would be no ability to remember previous inputs, so the two $90 inputs would generate the same output regardless of the trend.
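To make the "memory" idea concrete, here is a minimal sketch of a one-unit RNN in NumPy. The weights w_x and w_h are hand-picked for illustration (nothing is trained), and the prices are divided by 100 so that tanh does not saturate. Both are assumptions of this sketch, not part of the model we build later.

```python
import numpy as np

def rnn_final_state(sequence, w_x=0.5, w_h=0.8):
    """Run a one-unit RNN over a sequence and return the final hidden state.

    w_x and w_h are illustrative, hand-picked weights (not trained values).
    """
    h = 0.0  # the hidden state starts empty
    for x in sequence:
        # the new state mixes the current input with the previous state
        h = np.tanh(w_x * x + w_h * h)
    return h

# Prices scaled down by 100 so tanh stays in its responsive range
rising = [0.70, 0.80, 0.90]
falling = [1.10, 1.00, 0.90]

h_rising = rnn_final_state(rising)
h_falling = rnn_final_state(falling)

# Both sequences end on the same input (0.90), yet the final states differ
print(h_rising, h_falling)
```

The last input is identical in both runs, but the hidden state carries the history forward, so the two final states are different. A plain feed-forward layer that only sees 0.90 would have no way to tell the two cases apart.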

The problem with a basic RNN is the Vanishing Gradient Problem: the further back a layer is, the less the model can adjust it, because gradients get smaller and smaller as they are propagated back through the model. An RNN unrolled through time can have very many layers, one per time step, so 50 time steps behave essentially like 50 layers.
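A small numerical sketch shows how quickly this happens. It assumes a scalar RNN of the form h(t) = tanh(w_h * h(t-1) + x) with an illustrative weight w_h = 0.9 and a constant input. By the chain rule, the gradient of the last state with respect to the first is a product of one factor per time step, and every factor is smaller than 1.

```python
import numpy as np

def gradient_through_time(num_steps, w_h=0.9, x=0.5):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1} + x).

    w_h and x are illustrative values chosen for this sketch.
    """
    h = 0.0
    grad = 1.0
    for _ in range(num_steps):
        h = np.tanh(w_h * h + x)
        # chain rule: each step contributes a factor w_h * (1 - tanh^2)
        grad *= w_h * (1.0 - h ** 2)
    return grad

for steps in (1, 5, 10, 50):
    print(steps, gradient_through_time(steps))
```

Each extra time step multiplies the gradient by another factor below 1, so by 50 steps almost nothing is left to update the earliest steps with.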

To combat the Vanishing Gradient Problem, we introduce the GRU and LSTM.

The GRU has four components: the Update Gate, the Reset Gate, the Current Memory Gate, and the Final Memory Gate. Strictly speaking, only the first two are gates; the last two are the candidate state and the new hidden state.

Update Gate z(t)

This tells us how much of the data from previous time steps should be passed along to the future. It applies the sigmoid function to a weighted sum of the input data x(t) and the output from the last time step h(t-1).

Reset Gate r(t)

This tells us how much of the previous data we should forget. It is also obtained by applying the sigmoid function to a weighted sum of the input data x(t) and the output from the last time step h(t-1), using its own set of weights.

Current Memory Gate h_bar(t)

This gate tells us how much of the previous info and the new info makes its way into the model’s memory. We compute it by applying tanh to the weighted sum of the new input x(t) and the product of the Reset Gate r(t) with the previous memory h(t-1).

Final Memory Gate h(t)

This gate tells us what memory state to pass into the next time step of the model. We compute it by multiplying the Current Memory Gate’s output h_bar(t) by 1 - z(t) and adding the last time step’s memory h(t-1) multiplied by z(t), where z(t) is the Update Gate’s output.
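The four GRU quantities can be sketched as a single time step in NumPy. This is an illustration under stated assumptions, not a library implementation. It uses the common convention in which z(t) is the update gate and r(t) is the reset gate. The weight names (W_z, U_z, b_z and so on) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step. p is a dict of weights (hypothetical names)."""
    # update gate z(t): how much of the old state to keep
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    # reset gate r(t): how much of the old state feeds the candidate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    # current memory h_bar(t): candidate state built from x(t) and the reset state
    h_bar = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # final memory h(t): blend of the candidate and the previous state
    return (1.0 - z) * h_bar + z * h_prev

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8  # illustrative sizes
p = {}
for gate in ("z", "r", "h"):
    p[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    p[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    p[f"b_{gate}"] = np.zeros(n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(3, n_in)):  # three time steps of random input
    h = gru_step(x_t, h, p)
print(h.shape)
```

Because h(t) is a blend of h(t-1) and a tanh output, the state can be carried forward almost unchanged when z(t) is close to 1, which is what helps gradients survive over many time steps.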

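LSTMs are a lot like GRUs, except they have an additional cell state that they pass on through the network. The counterpart of the update gate is the input gate i(t), the counterpart of the reset gate is the forget gate f(t), the current memory gate is C_bar(t), and the final gate is h(t).

The two new quantities, o(t) and C(t), manage the cell state, which gets passed through to subsequent iterations of the network: C(t) is the cell state itself, and the output gate o(t) controls how much of it is exposed as h(t).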
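Here is a minimal sketch of one LSTM time step in NumPy, using the names i(t), f(t), C_bar(t), o(t), C(t) and h(t) from the text. The stacking order of the four weight blocks (i, f, C_bar, o) and the random weights are assumptions of this sketch. The sizes (5 inputs, 8 units) are chosen to mirror the layer we build later.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U and b stack four blocks in the order i, f, C_bar, o."""
    n = h_prev.shape[0]
    a = W @ x_t + U @ h_prev + b      # all four pre-activations at once
    i = sigmoid(a[0 * n:1 * n])       # input gate i(t)
    f = sigmoid(a[1 * n:2 * n])       # forget gate f(t)
    c_bar = np.tanh(a[2 * n:3 * n])   # candidate cell state C_bar(t)
    o = sigmoid(a[3 * n:4 * n])       # output gate o(t)
    c = f * c_prev + i * c_bar        # new cell state C(t)
    h = o * np.tanh(c)                # new hidden state h(t)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(3, n_in)):  # three time steps of random input
    h, c = lstm_step(x_t, h, c, W, U, b)

n_weights = W.size + U.size + b.size
print(h.shape, c.shape, n_weights)
```

With 5 inputs and 8 units, the three arrays hold 448 numbers in total, which is the parameter count a Keras LSTM layer of that size reports.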


The data and the code can be found at this repo.

First, let’s read and plot the data.

import pandas as pd
df = pd.read_csv('airline-passengers.csv', usecols=[1])  # load only column 1 (the passenger counts)

Next, we will normalize the data (always a good idea). In this example we use a MinMax scaler to fit everything between 0 and 1. The highest value gets 1, the lowest gets 0, and everything else is linearly interpolated in between.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
df = scaler.fit_transform(df)  # df is now a NumPy array of shape (n_rows, 1)
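For intuition, min-max scaling to the range 0 to 1 can be written by hand in a few lines. The sketch below uses made-up passenger values and hypothetical helper names (min_max_scale, min_max_invert). It only illustrates the arithmetic and is not a replacement for the scaler above.

```python
import numpy as np

def min_max_scale(values):
    """Scale a 1-D array to [0, 1]; also return the min and max that were used."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

def min_max_invert(scaled, lo, hi):
    """Undo min_max_scale and return values in the original units."""
    return scaled * (hi - lo) + lo

passengers = np.array([100.0, 150.0, 200.0, 300.0])  # made-up values
scaled, lo, hi = min_max_scale(passengers)
restored = min_max_invert(scaled, lo, hi)

print(scaled)    # [0.   0.25 0.5  1.  ]
print(restored)  # back to the original values
```

The inverse step matters later on: predictions come out of the model in the 0 to 1 range and have to be mapped back before they can be compared with real passenger counts.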

Next we will split the data into train and test data.

train_size = int(len(df) * 0.7)
train, test = df[0:train_size], df[train_size:]

Now we will format our data. Each sample will have look_back input features (X) and one output (Y).

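import numpy as np

def create_dataset(dataset, look_back=1):
    """Turn a series into windows of look_back inputs (X) and the next value (Y)."""
    X, Y = [], []
    for i in range(len(dataset) - look_back - 1):
        X.append(dataset[i:(i + look_back), 0])  # values at t-look_back .. t-1
        Y.append(dataset[i + look_back, 0])      # value at t
    return np.array(X), np.array(Y)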

In this case we will use a look_back of 5, so an X entry will hold the values at t-5 through t-1, and the Y entry will hold the value at t. The more look_back data you give the model, the better it will fit the training data, although past a point that just overfits the model, so look_back requires some hyper-parameter tuning.
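To see what the windowing produces, we can run the same logic on a toy series. The sketch below repeats create_dataset so that it runs on its own, and uses the made-up values 0 through 9 with a look_back of 3.

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    # same windowing logic as the article's create_dataset
    X, Y = [], []
    for i in range(len(dataset) - look_back - 1):
        X.append(dataset[i:(i + look_back), 0])
        Y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(Y)

toy = np.arange(10, dtype=float).reshape(-1, 1)  # values 0..9 as a single column
X, Y = create_dataset(toy, look_back=3)

print(X.shape, Y.shape)  # (6, 3) (6,)
print(X[0], Y[0])        # [0. 1. 2.] 3.0
print(X[-1], Y[-1])      # [5. 6. 7.] 8.0
```

Note that the final value (9) never appears as a target: the loop bound of len(dataset) - look_back - 1 stops one window early. The plotting offsets used later account for this.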

look_back = 5
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
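The reshape is needed because a Keras LSTM layer expects three-dimensional input of the form (samples, time steps, features). The toy sketch below, using a made-up array, shows the layout used here: each window becomes a single time step with look_back features. An alternative layout, (samples, look_back, 1), would instead present the window as look_back time steps of one feature each. It is shown only for comparison.

```python
import numpy as np

samples, look_back = 4, 5
X = np.arange(samples * look_back, dtype=float).reshape(samples, look_back)

# layout used in this article: one time step, look_back features
X_one_step = np.reshape(X, (X.shape[0], 1, X.shape[1]))

# alternative layout (for comparison only): look_back time steps, one feature
X_many_steps = np.reshape(X, (X.shape[0], X.shape[1], 1))

print(X.shape, X_one_step.shape, X_many_steps.shape)  # (4, 5) (4, 1, 5) (4, 5, 1)
```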


Finally we get to make our model. We will have one LSTM layer followed by two Dense layers.

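model = Sequential()
model.add(LSTM(8, input_shape=(1, look_back)))
model.add(Dense(8))  # size taken from the summary; activation not shown, Keras default assumed
model.add(Dense(1))  # single output: the predicted value at t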

Below is the summary printed by model.summary().

Model: "sequential"
Layer (type)        Output Shape    Param #
lstm (LSTM)         (None, 8)       448
dense (Dense)       (None, 8)       72
dense_1 (Dense)     (None, 1)       9
Total params: 529
Trainable params: 529
Non-trainable params: 0
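The parameter counts in the summary can be checked by hand. An LSTM layer has four weight blocks (three gates plus the candidate state), each with input weights, recurrent weights, and a bias. A Dense layer has one weight per input per unit, plus one bias per unit. The helper names below are hypothetical and exist only for this check.

```python
def lstm_params(units, input_dim):
    # four blocks, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # one weight per input per unit, plus one bias per unit
    return units * input_dim + units

look_back = 5
counts = [
    lstm_params(8, look_back),  # lstm
    dense_params(8, 8),         # dense
    dense_params(1, 8),         # dense_1
]
print(counts, sum(counts))  # [448, 72, 9] 529
```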


Now we will compile and train the model.

from tensorflow.keras.optimizers import Adam
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))
history = model.fit(trainX, trainY, epochs=100, batch_size=1)

Then we plot the training loss.

import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='loss')
plt.ylim([0, 0.01])
plt.ylabel('Error [MSE, scaled passengers]')
plt.legend()
plt.show()


Now we can go and make some predictions!

from sklearn.metrics import mean_squared_error
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = np.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = np.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

We made some predictions and computed the RMSE (Root Mean Squared Error). We obtain it by summing the squared differences between each prediction and the ground truth, dividing by the number of data points, and finally taking the square root. Because we inverted the scaling first, the scores are in the same units as the original data.

Train Score: 24.27 RMSE
Test Score: 73.31 RMSE
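As a quick check of the formula, RMSE can be computed by hand in NumPy. The numbers below are made up for illustration and are unrelated to the scores above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

actual = [100, 200, 300]     # made-up ground truth
predicted = [110, 190, 320]  # made-up predictions

# errors are -10, 10, -20; squares are 100, 100, 400; their mean is 200
print(rmse(actual, predicted))  # sqrt(200), about 14.14
```

Squaring before averaging means that a few large misses weigh more heavily than many small ones, which is worth keeping in mind when comparing the train and test scores.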

Now we can show our predictions overlaid on the ground truth.

# shift train predictions for plotting
trainPredictPlot = np.empty_like(df)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = np.empty_like(df)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(df)-1, :] = testPredict
# plot baseline and predictions



I am a software engineer working for Amazon living in SF/NYC.
