Sequence classification is a basic problem in natural language processing. This mini project illustrates the basics of sequence classification with an LSTM model. Such a model can be used, for example, to detect spam comments or reviews on the internet.

Packages:

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
import tensorflow as tf
import keras
from string import punctuation

from sklearn.model_selection import train_test_split

Data Introduction

Human-generated sentences:

  • 169389 observations
  • Examples: “Grease Monkey’s current market”; “Schlumberger’s management shift, asset restructuring”; “expansion without inflationary instability,” he said

Machine-generated sentences:

  • 500 observations
  • Examples: ‘Np^g tj5vQ key NKVZl31 ZV’; ‘EcN !d7g moTL!3c* e^n qsG page l0u’; ‘@@rvbv 5r gYXWL police nVV8 RZD.fV&2n Jc0 EQ2iX’; ‘pZ80yue ^ difference 8Z8Z i VhK,Tn Mqj!RpIy’

Task: classify each input sentence/sequence into one of the two classes

Baseline accuracy: 0.9971

  • This dataset is imbalanced. The baseline accuracy is the accuracy obtained by classifying every data point as human: 169389 / (169389 + 500) ≈ 0.9971.
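As a sanity check, the baseline follows directly from the counts above:

# Baseline: always predict the majority (human) class
human, machine = 169389, 500
baseline_accuracy = human / (human + machine)
print(round(baseline_accuracy, 4))  # 0.9971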

Method and Models

Sequence classification model: Long Short-Term Memory (LSTM)

  • Tokenizer encoding (character level)
  • Embedding layer
  • 30-unit LSTM cell
  • Train/test split: 85% / 15% (see the sketch after this list)
  • Loss: binary cross-entropy
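A minimal sketch of the 85%/15% split; stratification is an assumption on my part, to keep the rare machine class represented in both splits:

# Split the encoded sentences X and the 0/1 labels y (1 = machine-generated)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)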


Things to pay attention to

  • The training set must include every character that appears in the test set; otherwise the model encounters out-of-vocabulary characters at test time.

  • This is an imbalanced dataset: the machine class has only 500 observations. Bootstrapping (resampling with replacement) might help with predicting the machine class.

Model Enhancement Approaches

  • Dropout rate of 0.2 to avoid overfitting
  • Learning rate starts at 0.0006, with a learning-rate scheduler (factor 0.25, patience 1)
  • Remove punctuation inside the sentences
  • Bootstrap the machine-generated data in the training set to mitigate the class imbalance (sketches after this list)
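Minimal sketches of these enhancements; helper names such as strip_punctuation and bootstrap_minority are illustrative, not from the original code:

from keras.callbacks import ReduceLROnPlateau

# Remove punctuation before encoding (one experiment variant)
def strip_punctuation(sentences):
    table = str.maketrans("", "", punctuation)
    return [s.translate(table) for s in sentences]

# Bootstrap (resample with replacement) the rare machine class in the
# training set; assumes X_train/y_train are padded numpy arrays
def bootstrap_minority(X_train, y_train, n_copies):
    machine_idx = np.where(y_train == 1)[0]
    extra = np.random.choice(machine_idx, size=n_copies, replace=True)
    return (np.concatenate([X_train, X_train[extra]]),
            np.concatenate([y_train, y_train[extra]]))

# Shrink the learning rate by a factor of 0.25 when the validation
# loss plateaus for one epoch
lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.25, patience=1)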

Character-level Encoding

def encode(X):
    # Character-level tokenizer, fit on the entire corpus so the test
    # split cannot contain characters the tokenizer has never seen
    tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
    tokenizer.fit_on_texts(X)
    # Ragged list of integer sequences; padded to a fixed length later
    encoded_X = tokenizer.texts_to_sequences(X)
    return encoded_X, tokenizer
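Usage sketch: encode first, then confirm the character vocabulary fits the Embedding layer's input dimension (100 in build_model below); the +1 accounts for the reserved padding index 0. The name `sentences` is assumed for the raw text list:

encoded_X, tokenizer = encode(sentences)
vocab_size = len(tokenizer.word_index) + 1
assert vocab_size <= 100  # must not exceed the Embedding input_dim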

Model

# truncate and pad input sequences to a fixed length
max_length = 150
X_train = sequence.pad_sequences(X_train, maxlen=max_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_length)

embedding_vector_length = 32

# Build LSTM Model
def build_model():
    model = Sequential()
    # input_dim=100 assumes the character vocabulary has fewer than
    # 100 indices (see the vocab_size check above)
    model.add(Embedding(100, embedding_vector_length, input_length=max_length))
    model.add(LSTM(30, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=0.0006),
                  metrics=['accuracy'])
    return model
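A sketch of how the pieces fit together; epochs and batch_size are assumptions, since the original values are not given:

model = build_model()
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=10, batch_size=64,
                    callbacks=[lr_scheduler])  # ReduceLROnPlateau from above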

Result

Normal LSTM

  • Validation (test) accuracy: 0.999882
  • Overall accuracy on the entire dataset: 0.999765

Removing all punctuation

  • Validation (test) accuracy: 0.998784
  • Less accurate than training on the dataset with punctuation kept

With Bootstrapping

  • Validation (test) accuracy: 0.999882
  • Overall accuracy on the entire dataset: 0.999935
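The overall figures can be reproduced by evaluating on the full padded dataset; X_all and y_all are assumed names for the concatenated data:

loss, acc = model.evaluate(X_all, y_all, verbose=0)
print('overall accuracy: %.6f' % acc)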

Conclusion

  • Dropout significantly reduced overfitting
  • Keeping punctuation actually helps improve model accuracy
  • Bootstrapping increases the overall accuracy
  • The data points that were predicted incorrectly contain punctuation inside words and are not formal in sentence structure, which makes them difficult for the model to classify (see the inspection sketch below)
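A sketch for inspecting those misclassified sentences, using a 0.5 threshold on the sigmoid output; `sentences`, X_all, and y_all are assumed names as above:

probs = model.predict(X_all).flatten()
preds = (probs > 0.5).astype(int)
for i in np.where(preds != y_all)[0]:
    print(y_all[i], sentences[i])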