Using #AI in #Cryptanalysis
SETUP
First, we set some helper variables that define the project.
ROOT_DIR = 'C:\\Users\\Andrew\\Documents\\Work\\string-prediction\\Final' # The home location of this project
SEED = 42 # The seed to ensure consistency across experiments, not used in production
INPUT_SIZE = 24 # A helper variable to define the size of the input
OUTPUT_SIZE = 17 # A helper variable to define the size of the output
Next, we load the data from the CSV file. We also cache the content of the CSV file in “pickle” format: CSV files take a few seconds to load, whereas pickle files load almost instantaneously, so to keep the developer sane we use the cache during development.
import pandas as pd # pandas is a commonly used data processing library, used here to load the csv initially
DATA_FILE = '{}/{}'.format(ROOT_DIR, 'RealWorkload.csv') # the filepath of the csv file
DATA_PICKLE = '{}/{}'.format(ROOT_DIR, 'data.pickle') # the cached version of the csv file for faster loading in development
data = pd.read_csv(DATA_FILE, header=None, names=['input', 'output']) # use pandas to read the input
data = data.dropna() # drop any rows with null values just in case
data.to_pickle(DATA_PICKLE) # cache csv file in a "pickle" format for faster loading in development
# data = pd.read_pickle(DATA_PICKLE) # loading csv file in pickle format, not used here
data.head() # display the first five rows in the console
DATA PREPARATION
We then transform the data into x for the input features and y for the output predictions. We could have continued to use pandas as our data structure of choice, but we don’t need its advanced features, so we simply load the data into numpy arrays, which are much more lightweight.
import numpy as np # numpy is a utility mathematical library
x = np.zeros((len(data), INPUT_SIZE)) # construct an empty container that will hold the input values
for i, row in enumerate(data['input'].values): # iterate through the input values
    arr = row.split('|') # split the string by the '|' delimiter
    for j, value in enumerate(arr): # iterate through each value in a row
        x[i, j] = int(value) # set the appropriate value in the container to the value in the input as an integer
y = data['output'].values # simply copy over the output values into the y container
# note that even though we construct the x and y containers separately,
# their rows are guaranteed to stay in sync: the first x element corresponds to the first y element
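The same transformation can also be written more compactly; a minimal sketch of a vectorized alternative (assuming every row contains exactly INPUT_SIZE values):
x_alt = np.array([row.split('|') for row in data['input'].values], dtype=int) # parse all rows in one expression
# x_alt should match x exactly; the explicit loop above is kept for clarity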
We split the data into three portions.
- Test – used for final testing, completely unseen by the model
- Validation – used for tuning the model hyperparameters, functions as a “pretend” test set
- Train – the data used to train the model
The key difference between the test set and validation set is the fact that the models will be tuned for the validation set but not the test set.
from sklearn.model_selection import train_test_split # sklearn is a commonly used data science library
SIZE = 100000 # we define our working-set size
x_work, y_work, x_test, y_test = x[:SIZE], y[:SIZE], x[SIZE:], y[SIZE:] # we split the data into a working set and test set
x_train, x_val, y_train, y_val = train_test_split(x_work, y_work, test_size=0.2, random_state=SEED)
# we then further split the working set into a training and validation using the SEED variable to
# control randomness during development, the validation set is 20% of the working set
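As a quick sanity check (an addition, not part of the original pipeline), we can confirm the sizes of the three splits:
print('train: {}, validation: {}, test: {}'.format(len(x_train), len(x_val), len(x_test))) # validation should be 20% of the working set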
We then build and train our models. We use a random forest classifier to predict each character of the output sequence; through experimentation this was the best-performing approach.
At a high level, a random forest classifier builds decision trees by constructing a tree of decision points based on the input data; the decision points are adjusted during training. Multiple trees are built, and their combined result is the prediction.
For more information see here: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
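To make “their combined result is the prediction” concrete, here is a minimal sketch on toy data (an illustration, not part of this project): each fitted tree makes its own prediction, and the forest aggregates them (scikit-learn averages the trees’ probability estimates, which usually matches the majority of their votes).
from sklearn.ensemble import RandomForestClassifier # same classifier we use below
import numpy as np
toy_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # four toy samples
toy_y = np.array([0, 0, 1, 1]) # toy labels
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(toy_x, toy_y)
votes = [tree.predict([[1, 0]])[0] for tree in forest.estimators_] # each tree's individual prediction
print(votes, '->', forest.predict([[1, 0]])[0]) # the forest's combined prediction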
Further models we experimented with include a simple feedforward network, a recurrent network, and an autoencoder neural network; however, the random forest was the best performing by a large margin.
from sklearn.ensemble import RandomForestClassifier
# a random forest classifier is an ensemble of decision trees and
# consistently performs well on tabular data like ours
models = []
for i in range(OUTPUT_SIZE):
target = list(map(lambda v: v[i], y_train)) # we fetch the appropriate target digits
model = RandomForestClassifier(random_state=SEED, n_estimators=10) # we use 100 estimators in our random forest setup
model.fit(x_train, target)
models += [model]
print('finished with model #{}'.format(i))
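Since each model predicts one character position, a full output string is assembled by concatenating the per-position predictions; a minimal sketch, using the first validation row as an example input:
sample = x_val[0].reshape(1, -1) # one input row, reshaped to a batch of size 1
prediction = ''.join(str(models[i].predict(sample)[0]) for i in range(OUTPUT_SIZE)) # one character per model
print(prediction) # the 17-character predicted output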
EVALUATION
Now we evaluate the models to see how they perform and how well they generalize.
from sklearn.metrics import accuracy_score # a utility helper function that evaluates models
import matplotlib.pyplot as plt # import plotting library
def measure(x, y):
    output = []
    for i in range(OUTPUT_SIZE): # we iterate through each character in the output string
        target = list(map(lambda v: v[i], y)) # we fetch the appropriate expected target
        predicted = models[i].predict(x) # we use our trained model to predict the character
        accuracy = accuracy_score(target, predicted) # we check the accuracy of the predictions against the expected
        output += [accuracy * 100]
    return output
def graph_results(results, title):
    plt.bar(np.arange(OUTPUT_SIZE), results, align='center', alpha=0.5)
    plt.xticks(np.arange(OUTPUT_SIZE), np.arange(1, OUTPUT_SIZE + 1))
    plt.ylabel('accuracy %')
    plt.xlabel('character in output')
    plt.title(title)
    plt.show()
train_results = measure(x_train, y_train)
val_results = measure(x_val, y_val)
test_results = measure(x_test, y_test)
Now we graph the results. The training set attains near-perfect performance, which is to be expected with tree-based models. The validation and test sets are consistent with each other, showing that the first 8 digits and the 11th digit are the most accurate. The 9th digit, the security digit, is the least accurate, as expected. Everything after the 9th digit (with the exception of digit 11) also shows drastically reduced accuracy.
graph_results(train_results, 'train accuracy')
graph_results(val_results, 'validation accuracy')
graph_results(test_results, 'test accuracy')
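Per-character accuracy can overstate end-to-end performance, so as an extra check (an addition, not part of the original evaluation) we can measure exact-match accuracy over the full 17-character string on the test set:
preds = np.stack([models[i].predict(x_test) for i in range(OUTPUT_SIZE)], axis=1) # (rows, 17) predicted characters
truth = np.array([list(s) for s in y_test]) # (rows, 17) expected characters
exact = (preds == truth).all(axis=1).mean() * 100 # a row counts only if all 17 characters match
print('exact-match accuracy: {:.2f}%'.format(exact))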
USAGE
Now we build our production models by training on the entire dataset and serializing the result to file.
models = []
for i in range(OUTPUT_SIZE):
    target = list(map(lambda v: v[i], y)) # we fetch the appropriate target characters
    model = RandomForestClassifier(random_state=SEED, n_estimators=10) # we use 10 estimators in our random forest setup
    model.fit(x, target)
    models += [model]
    print('finished with model #{}'.format(i))
import pickle
for i in range(OUTPUT_SIZE):
    with open('{}/model_{}.pickle'.format(ROOT_DIR, i), 'wb') as f: # open the output file, closing it when done
        pickle.dump(models[i], f) # serialize the model to pickle format
    print('finished {}'.format(i))
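For reference, loading the models back is the mirror image; a minimal sketch of the deserialization (the actual loading code lives in “server.py” and may differ):
loaded_models = []
for i in range(OUTPUT_SIZE):
    with open('{}/model_{}.pickle'.format(ROOT_DIR, i), 'rb') as f: # open the serialized model file
        loaded_models += [pickle.load(f)] # deserialize the model for position i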
DEPLOYMENT
After serializing the models, they can be used by our server. We use a Flask server that loads the models and responds to requests with predictions. The code lives in “server.py” and should be fairly self-explanatory.
To run the server, make sure Python is installed from either:
- standard distribution – https://www.python.org/downloads/
- anaconda distribution – https://www.anaconda.com/distribution/ – recommended
After installing Python:
- Open the command line program – cmd on Windows, Terminal on macOS
- Navigate to the server folder
- Run the install file – note this may take some time depending on the computer specifications and may look like the program has frozen, please allow it to finish
  a. For macOS: run “./install.sh”
  b. For Windows: run “install”
- Run the run file
  a. For macOS: run “./run.sh”
  b. For Windows: run “run”
- After the server starts, a process will run to load the models. The console output will report the progress, and the final message “ALL MODELS LOADED” indicates the server is ready to use.
Now a Flask server will be available at “http://localhost:5000”. This server can be queried by any REST client; for testing, I recommend Postman or Restlet Client.
The following endpoints are supported:
- /test – a test endpoint that responds with a test message
- /info – responds with the test probabilities of each character in the output, which can be used to estimate confidence
- / – accepts a parameter “text” with 24 signed integers separated by “|” and returns the predicted 17-character output sequence
For example: http://localhost:5000?text=-2|-72|-11|-2|18|100|-69|15|93|120|15|-97|-35|52|85|-114|53|-123|-1|-101|-38|125|-100|113
Will return: 1D8HB58D04F177301
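The endpoint can also be queried programmatically; a minimal sketch using the requests library (assuming the server is running locally on port 5000):
import requests # a commonly used HTTP client library
values = [-2, -72, -11, -2, 18, 100, -69, 15, 93, 120, 15, -97, -35, 52, 85, -114, 53, -123, -1, -101, -38, 125, -100, 113]
response = requests.get('http://localhost:5000', params={'text': '|'.join(map(str, values))}) # build the "text" query parameter
print(response.text) # expected: 1D8HB58D04F177301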