Sequence classification is a basic problem in natural language processing. This mini project illustrates the basics of sequence classification with an LSTM model. Such a model can be used, for example, to detect spam comments or reviews on the internet.

Packages:

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
import tensorflow as tf
import keras
from string import punctuation

from sklearn.model_selection import train_test_split

Data Introduction

Human-generated sentences:

  • 169389 observations
  • Examples: “Grease Monkey’s current market”; “Schlumberger’s management shift, asset restructuring”; “expansion without inflationary instability,” he said

Machine-generated sentences:

  • 500 observations
  • Examples: ‘Np^g tj5vQ key NKVZl31 ZV’; ‘EcN !d7g moTL!3c* e^n qsG page l0u’; ‘@@rvbv 5r gYXWL police nVV8 RZD.fV&2n Jc0 EQ2iX’; ‘pZ80yue ^ difference 8Z8Z i VhK,Tn Mqj!RpIy’

Task: classify each input sentence/sequence into one of the two classes

Baseline accuracy: 0.9971

  • This dataset is imbalanced. The baseline accuracy is the accuracy obtained by classifying every data point as human: 169389 / (169389 + 500) ≈ 0.9971.
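As a sanity check, the baseline follows directly from the counts above:

# Baseline: always predict the majority (human) class
human, machine = 169389, 500
baseline_accuracy = human / (human + machine)
print(round(baseline_accuracy, 4))  # 0.9971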

Method and Models

Sequence classification model: Long Short-Term Memory (LSTM)

  • Tokenizer encoding (character level)
  • Embedding layer
  • 30-unit LSTM cell
  • Train/test split: 85% / 15% (see the sketch after this list)
  • Loss: binary cross-entropy
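A minimal sketch of the 85%/15% split; stratification is an assumption on my part, to keep the rare machine class represented in both splits:

# Split the encoded sentences X and the 0/1 labels y (1 = machine-generated)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)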


Things to pay attention to

  • The training set must include every character that appears in the test set; otherwise the model encounters out-of-vocabulary characters at test time.

  • This is an imbalanced dataset: the machine class has only 500 observations. Bootstrapping (resampling with replacement) might help with predicting the machine class.

Model Enhancement Approaches

  • Dropout rate of 0.2 to avoid overfitting
  • Learning rate starts at 0.0006, with a learning-rate scheduler (factor 0.25, patience 1)
  • Remove punctuation inside the sentences
  • Bootstrap the machine-generated data in the training set to mitigate the class imbalance (sketches after this list)
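Minimal sketches of these enhancements; helper names such as strip_punctuation and bootstrap_minority are illustrative, not from the original code:

from keras.callbacks import ReduceLROnPlateau

# Remove punctuation before encoding (one experiment variant)
def strip_punctuation(sentences):
    table = str.maketrans("", "", punctuation)
    return [s.translate(table) for s in sentences]

# Bootstrap (resample with replacement) the rare machine class in the
# training set; assumes X_train/y_train are padded numpy arrays
def bootstrap_minority(X_train, y_train, n_copies):
    machine_idx = np.where(y_train == 1)[0]
    extra = np.random.choice(machine_idx, size=n_copies, replace=True)
    return (np.concatenate([X_train, X_train[extra]]),
            np.concatenate([y_train, y_train[extra]]))

# Shrink the learning rate by a factor of 0.25 when the validation
# loss plateaus for one epoch
lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.25, patience=1)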

Character-level Encoding

def encode(X):
    # Character-level tokenizer, fit on the entire corpus so the test
    # split cannot contain characters the tokenizer has never seen
    tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
    tokenizer.fit_on_texts(X)
    # Ragged list of integer sequences; padded to a fixed length later
    encoded_X = tokenizer.texts_to_sequences(X)
    return encoded_X, tokenizer
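Usage sketch: encode first, then confirm the character vocabulary fits the Embedding layer's input dimension (100 in build_model below); the +1 accounts for the reserved padding index 0. The name `sentences` is assumed for the raw text list:

encoded_X, tokenizer = encode(sentences)
vocab_size = len(tokenizer.word_index) + 1
assert vocab_size <= 100  # must not exceed the Embedding input_dim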

Model

# truncate and pad input sequences to a fixed length
max_length = 150
X_train = sequence.pad_sequences(X_train, maxlen=max_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_length)

embedding_vector_length = 32

# Build LSTM Model
def build_model():
    model = Sequential()
    # input_dim=100 assumes the character vocabulary has fewer than
    # 100 indices (see the vocab_size check above)
    model.add(Embedding(100, embedding_vector_length, input_length=max_length))
    model.add(LSTM(30, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(lr=0.0006),
                  metrics=['accuracy'])
    return model
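A sketch of how the pieces fit together; epochs and batch_size are assumptions, since the original values are not given:

model = build_model()
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=10, batch_size=64,
                    callbacks=[lr_scheduler])  # ReduceLROnPlateau from above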

Result

Normal LSTM

  • Validation (test) accuracy: 0.999882
  • Overall accuracy on the entire dataset: 0.999765

Removing all punctuation

  • Validation (test) accuracy: 0.998784
  • Less accurate than training on the dataset with punctuation kept

With Bootstrapping

  • Validation (test) accuracy: 0.999882
  • Overall accuracy on the entire dataset: 0.999935
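The overall figures can be reproduced by evaluating on the full padded dataset; X_all and y_all are assumed names for the concatenated data:

loss, acc = model.evaluate(X_all, y_all, verbose=0)
print('overall accuracy: %.6f' % acc)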

Conclusion

  • Dropout significantly reduced overfitting
  • Keeping punctuation actually helps improve model accuracy
  • Bootstrapping increases the overall accuracy
  • The data points that were predicted incorrectly contain punctuation inside words and are not formal in sentence structure, which makes them difficult for the model to classify (see the inspection sketch below)
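A sketch for inspecting those misclassified sentences, using a 0.5 threshold on the sigmoid output; `sentences`, X_all, and y_all are assumed names as above:

probs = model.predict(X_all).flatten()
preds = (probs > 0.5).astype(int)
for i in np.where(preds != y_all)[0]:
    print(y_all[i], sentences[i])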