Using #AI in #Cryptanalysis
SETUP
First, we set some helper variables that define the project.
ROOT_DIR = 'C:\\Users\\Andrew\\Documents\\Work\\string-prediction\\Final' # The home location of this project
SEED = 42 # The seed to ensure consistency across experiments, not used in production
INPUT_SIZE = 24 # A helper variable to define the size of the input
OUTPUT_SIZE = 17 # A helper variable to define the size of the output
Next, we load the data from the CSV file. We also cache the content of the CSV file in “pickle” format: CSV files take a few seconds to load, whereas pickle files load almost instantaneously, so to keep the developer sane we use the cache during development.
import pandas as pd # pandas is a commonly used data processing library, used here to load the csv initially
DATA_FILE = '{}/{}'.format(ROOT_DIR, 'RealWorkload.csv') # the filepath of the csv file
DATA_PICKLE = '{}/{}'.format(ROOT_DIR, 'data.pickle') # the cached version of the csv file for faster loading in development
data = pd.read_csv(DATA_FILE, header=None, names=['input', 'output']) # use pandas to read the input
data = data.dropna() # drop any rows with null values just in case
data.to_pickle(DATA_PICKLE) # cache csv file in a "pickle" format for faster loading in development
# data = pd.read_pickle(DATA_PICKLE) # loading csv file in pickle format, not used here
data.head() # display the first five rows in the console
DATA PREPARATION
We then transform the data into x for the input features and y for the output predictions. We could have continued to use pandas as our data structure of choice, but we don’t need its advanced features, so we simply load the data into numpy arrays, which are much more lightweight.
import numpy as np # numpy is a utility mathematical library
x = np.zeros((len(data), INPUT_SIZE)) # construct an empty container that will hold the input values
for i, row in enumerate(data['input'].values): # iterate through the input values
    arr = row.split('|') # split the string by the '|' delimiter
    for j, value in enumerate(arr): # iterate through each value in a row
        x[i, j] = int(value) # set the appropriate value in the container to the value in the input as an integer
y = data['output'].values # simply copy over the output values into the y container
# note that even though we construct the x and y containers separately,
# their rows are guaranteed to stay in sync: the first x element corresponds to the first y element
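The same transformation can also be written more compactly; a minimal sketch of a vectorized alternative (assuming every row contains exactly INPUT_SIZE values):
x_alt = np.array([row.split('|') for row in data['input'].values], dtype=int) # parse all rows in one expression
# x_alt should match x exactly; the explicit loop above is kept for clarity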
We split the data into three portions.
- Test – used for final testing, completely unseen by the model
- Validation – used for tuning the model hyperparameters, functions as a “pretend” test set
- Train – the data used to train the model
The key difference between the test set and validation set is the fact that the models will be tuned for the validation set but not the test set.
from sklearn.model_selection import train_test_split # sklearn is a commonly used data science library
SIZE = 100000 # we define our working-set size
x_work, y_work, x_test, y_test = x[:SIZE], y[:SIZE], x[SIZE:], y[SIZE:] # we split the data into a working set and test set
x_train, x_val, y_train, y_val = train_test_split(x_work, y_work, test_size=0.2, random_state=SEED)
# we then further split the working set into a training and validation using the SEED variable to
# control randomness during development, the validation set is 20% of the working set
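As a quick sanity check (an addition, not part of the original pipeline), we can confirm the sizes of the three splits:
print('train: {}, validation: {}, test: {}'.format(len(x_train), len(x_val), len(x_test))) # validation should be 20% of the working set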
We then build and train our models. We use a random forest classifier to predict each character of the output sequence; through experimentation this was the best-performing approach.
At a high level, a random forest classifier builds decision trees by constructing a tree of decision points based on the input data; the decision points are adjusted during training. Multiple trees are built, and their combined result is the prediction.
For more information see here: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
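To make “their combined result is the prediction” concrete, here is a minimal sketch on toy data (an illustration, not part of this project): each fitted tree makes its own prediction, and the forest aggregates them (scikit-learn averages the trees’ probability estimates, which usually matches the majority of their votes).
from sklearn.ensemble import RandomForestClassifier # same classifier we use below
import numpy as np
toy_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # four toy samples
toy_y = np.array([0, 0, 1, 1]) # toy labels
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(toy_x, toy_y)
votes = [tree.predict([[1, 0]])[0] for tree in forest.estimators_] # each tree's individual prediction
print(votes, '->', forest.predict([[1, 0]])[0]) # the forest's combined prediction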
Further models we experimented with include a simple feedforward network, a recurrent network, and an autoencoder neural network; however, the random forest was the best performing by a large margin.
from sklearn.ensemble import RandomForestClassifier
# a random forest classifier is an ensemble of decision trees and
# consistently performs well on tabular data like ours
models = []
for i in range(OUTPUT_SIZE):
target = list(map(lambda v: v[i], y_train)) # we fetch the appropriate target digits
model = RandomForestClassifier(random_state=SEED, n_estimators=10) # we use 100 estimators in our random forest setup
model.fit(x_train, target)
models += [model]
print('finished with model #{}'.format(i))
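Since each model predicts one character position, a full output string is assembled by concatenating the per-position predictions; a minimal sketch, using the first validation row as an example input:
sample = x_val[0].reshape(1, -1) # one input row, reshaped to a batch of size 1
prediction = ''.join(str(models[i].predict(sample)[0]) for i in range(OUTPUT_SIZE)) # one character per model
print(prediction) # the 17-character predicted output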
EVALUATION
Now we evaluate the models to see how they perform and how well they generalize.
from sklearn.metrics import accuracy_score # a utility helper function that evaluates models
import matplotlib.pyplot as plt # import plotting library
def measure(x, y):
    output = []
    for i in range(OUTPUT_SIZE): # we iterate through each character in the output string
        target = list(map(lambda v: v[i], y)) # we fetch the appropriate expected target
        predicted = models[i].predict(x) # we use our trained model to predict the character
        accuracy = accuracy_score(target, predicted) # we check the accuracy of the predictions against the expected
        output += [accuracy * 100]
    return output
def graph_results(results, title):
    plt.bar(np.arange(OUTPUT_SIZE), results, align='center', alpha=0.5)
    plt.xticks(np.arange(OUTPUT_SIZE), np.arange(1, OUTPUT_SIZE + 1))
    plt.ylabel('accuracy %')
    plt.xlabel('character in output')
    plt.title(title)
    plt.show()
train_results = measure(x_train, y_train)
val_results = measure(x_val, y_val)
test_results = measure(x_test, y_test)
Now we graph the results. The training set attains near-perfect performance, which is to be expected with tree-based models. The validation and test sets are consistent with each other, showing that the first 8 digits and the 11th digit are the most accurate. The 9th digit, the security digit, is the least accurate, as expected. Everything after the 9th digit (with the exception of digit 11) also shows drastically reduced accuracy.
graph_results(train_results, 'train accuracy')
graph_results(val_results, 'validation accuracy')
graph_results(test_results, 'test accuracy')
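Per-character accuracy can overstate end-to-end performance, so as an extra check (an addition, not part of the original evaluation) we can measure exact-match accuracy over the full 17-character string on the test set:
preds = np.stack([models[i].predict(x_test) for i in range(OUTPUT_SIZE)], axis=1) # (rows, 17) predicted characters
truth = np.array([list(s) for s in y_test]) # (rows, 17) expected characters
exact = (preds == truth).all(axis=1).mean() * 100 # a row counts only if all 17 characters match
print('exact-match accuracy: {:.2f}%'.format(exact))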
USAGE
Now we build our production models by training on the entire dataset and serializing the result to file.
models = []
for i in range(OUTPUT_SIZE):
    target = list(map(lambda v: v[i], y)) # we fetch the appropriate target characters
    model = RandomForestClassifier(random_state=SEED, n_estimators=10) # we use 10 estimators in our random forest setup
    model.fit(x, target)
    models += [model]
    print('finished with model #{}'.format(i))
import pickle
for i in range(OUTPUT_SIZE):
    with open('{}/model_{}.pickle'.format(ROOT_DIR, i), 'wb') as f: # open the output file, closing it when done
        pickle.dump(models[i], f) # serialize the model to pickle format
    print('finished {}'.format(i))
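For reference, loading the models back is the mirror image; a minimal sketch of the deserialization (the actual loading code lives in “server.py” and may differ):
loaded_models = []
for i in range(OUTPUT_SIZE):
    with open('{}/model_{}.pickle'.format(ROOT_DIR, i), 'rb') as f: # open the serialized model file
        loaded_models += [pickle.load(f)] # deserialize the model for position i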
DEPLOYMENT
After serializing the models, they can be used by our server. We use a Flask server that loads the models and responds to requests with predictions. The code lives in “server.py” and should be fairly self-explanatory.
To run the server, make sure Python is installed from either:
- standard distribution – https://www.python.org/downloads/
- anaconda distribution – https://www.anaconda.com/distribution/ – recommended
After installing Python:
- Open the command line program – cmd on Windows, Terminal on macOS
- Navigate to the server folder
- Run the install file – note this may take some time depending on the computer specifications and may look like the program has frozen, please allow it to finish
  a. For macOS: run “./install.sh”
  b. For Windows: run “install”
- Run the run file
  a. For macOS: run “./run.sh”
  b. For Windows: run “run”
- After the server starts, a process will run to load the models. The console output will report the progress, and the final message “ALL MODELS LOADED” indicates the server is ready to use.
Now a Flask server will be available at “http://localhost:5000”. This server can be queried by any REST client; for testing, I recommend Postman or Restlet Client.
The following endpoints are supported:
- /test – a test endpoint that responds with a test message
- /info – responds with the test probabilities of each character in the output, which can be used to estimate confidence
- / – accepts a parameter “text” with 24 signed integers separated by “|” and returns the predicted 17-character output sequence
For example: http://localhost:5000?text=-2|-72|-11|-2|18|100|-69|15|93|120|15|-97|-35|52|85|-114|53|-123|-1|-101|-38|125|-100|113
Will return: 1D8HB58D04F177301
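The endpoint can also be queried programmatically; a minimal sketch using the requests library (assuming the server is running locally on port 5000):
import requests # a commonly used HTTP client library
values = [-2, -72, -11, -2, 18, 100, -69, 15, 93, 120, 15, -97, -35, 52, 85, -114, 53, -123, -1, -101, -38, 125, -100, 113]
response = requests.get('http://localhost:5000', params={'text': '|'.join(map(str, values))}) # build the "text" query parameter
print(response.text) # expected: 1D8HB58D04F177301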