Our group uses the Neurosky MindWave, a $99 wireless EEG with a single electrode placed roughly on Fp2. This sensor has terrible spatial resolution (most EEG caps have over 60 sensors) and poor signal quality (the electrode is not precisely placed, and doesn't use gel to help conduct the signal). But, despite the limitations of our sensing, our machine learning-based approach to analyzing our data achieves up to 97% accuracy distinguishing between subjects, and 95%-100% accuracy classifying between mental tasks.

In this tutorial, I'll walk you through building and testing an SVM that takes an EEG reading and predicts who, among a group of people, the brain scan came from.

This tutorial elucidates our classification strategies. More broadly, it shows how to set up, train and test a support vector machine in Python.

At the end, I include a script that takes an EEG reading as input and guesses what mental task the subject was doing. That snippet should show you how to adapt the main tutorial to other cases. I also include some helpful links for further exploration.


I'm going to assume you've already done some data collection. In our group, we took ten-second recordings of 15 people doing various mental tasks (e.g., thinking about moving a finger or imagining a song). Each participant performed ten trials. I recommend collecting at least ten trials per participant, as you'll need enough trials for both training and testing sets when doing cross-validation (more about this below).

I'm also going to assume you've already pre-processed your data. If you haven't done that, here's my tutorial on pre-processing Neurosky MindWave data. Preprocessing is just as crucial as classification!

Setting up the kitchen

OK, let's go! Install (or convince your friendly sysadmin to install) the brilliant scikit-learn library.

Before we can start classifying anything, we need to get our data into a format that suits our SVM.

Loading your feature vectors from the disk

First we need to read the vectors from disk (JSON files) into memory.

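import json, math, numpy

def loadVector(task,subject,trial):
    # open JSON file (our layout: data/json/[task]/[subject]/[trial].json)
    path = 'data/json/%s/%s/%s.json' % (task, subject, trial)
    with open(path) as f:
        preprocessed_data = json.load(f)
    V = preprocessed_data['power_spectra']
    # change all NaN values to 0
    return [0 if math.isnan(v) else v for v in V['vector']]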

Notice the last line! ~Don't feed your SVMs null values.~ Use zero instead.
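To make the NaN handling concrete, here's a minimal, self-contained sketch. It assumes each JSON file has the shape {"power_spectra": {"vector": [...]}}, which is what the loader above reads. The helper names clean_vector and load_clean_vector, and the temporary file, are mine and only there for illustration.

```python
import json
import math
import os
import tempfile

def clean_vector(values):
    """Replace NaN entries with 0 so the SVM never sees them."""
    return [0 if math.isnan(v) else v for v in values]

def load_clean_vector(path):
    """Read one pre-processed recording and return its cleaned feature vector."""
    with open(path) as f:
        data = json.load(f)
    return clean_vector(data['power_spectra']['vector'])

# Write a tiny fake recording to a temporary file (assumed layout, made-up values).
fake = {'power_spectra': {'vector': [0.5, float('nan'), 2.0]}}
tmp_path = os.path.join(tempfile.mkdtemp(), '0.json')
with open(tmp_path, 'w') as f:
    json.dump(fake, f)  # Python's json module writes NaN as the literal NaN

cleaned = load_clean_vector(tmp_path)
print(cleaned)
```

The NaN in the middle of the vector comes back as 0, and the other values are untouched.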

Managing vectors in memory

You need 2 things to train a classifier:

  • features - which hold the attributes you're interested in classifying

  • labels - which communicate to the SVM what category each feature (or feature vector) falls into.

In our case, features are a row vector of pre-processed EEG readings, and labels represent the subject to whom each reading belongs.

~hint~: we use integers as labels! No problem in our case, since our subjects are identified by number as it is. (Recent versions of scikit-learn will also accept strings as labels, but integers are the safe choice.)

Canonically, feature vectors are represented by X and label vectors are represented by y.

Here's an example.

# items in X and y are related by their index in the array
def assembleTrainingData(tasks,subjects,maxTrial):
    X = []
    y = []
    for task in tasks:
        for subject in subjects:
            for i in range(maxTrial):
                v = loadVector(task, subject, i)
                X.append(v)
                # label each vector with an integer for its subject
                y.append(subjects.index(subject))
    return numpy.array(X), y

(For whatever it's worth, our file structure is data/json/[task]/[subject]/[trial].json.)
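If you'd like to see what X and y end up looking like without touching any files, here's a small sketch with made-up vectors. fake_vector is a hypothetical stand-in for loadVector, and the vector length of 4 is arbitrary.

```python
import numpy

def fake_vector(task, subject, trial):
    # Hypothetical stand-in for loadVector: returns a short made-up vector.
    return [float(len(task)), float(subject), float(trial), 1.0]

tasks = ['base', 'color']
subjects = [0, 1, 2]      # integer subject ids double as labels
max_trial = 3

X, y = [], []
for task in tasks:
    for subject in subjects:
        for trial in range(max_trial):
            X.append(fake_vector(task, subject, trial))  # one row per recording
            y.append(subject)                            # label for that same row

X = numpy.array(X)
print(X.shape)   # (number of recordings, features per recording)
print(len(y))    # one label per recording
```

Row i of X and item i of y always describe the same recording, which is all the classifier needs.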

Training a classifier

Now we can start working on our classifier.

At the top of your file, add:

from sklearn import svm

Now,

tasks = ['base', 'color', 'eye','finger', 'pass', 'song', 'sport']  
subjects = ['subject0','subject1', 'subject2']

# get X and y
X,y = assembleTrainingData(tasks,subjects,9)

# create the SVM classifier
clf = svm.LinearSVC()

# train the classifier on our data
clf.fit(X, y)

Now the classifier's all trained! ~So easy.~ All we have to do is give it a data point it hasn't seen before (e.g., clf.predict(Z), where Z is a novel feature vector) and see whether or not the classifier predicts the label correctly!
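Here's a minimal sketch of that train-then-predict loop on made-up data, so you can run it without any EEG files. The two clusters are invented stand-ins for two subjects. One thing to watch: predict expects a 2-D array (one row per vector), even when you only have a single vector.

```python
import numpy
from sklearn import svm

rng = numpy.random.RandomState(0)

# Two made-up "subjects": vectors clustered around -5 and +5 (not real EEG).
X0 = rng.normal(loc=-5.0, scale=1.0, size=(20, 4))
X1 = rng.normal(loc=5.0, scale=1.0, size=(20, 4))
X = numpy.vstack([X0, X1])
y = [0] * 20 + [1] * 20

clf = svm.LinearSVC()
clf.fit(X, y)

# A new vector the classifier has never seen, sitting near subject 1's cluster.
Z = numpy.array([[5.0, 4.5, 5.5, 5.0]])
prediction = clf.predict(Z)
print(prediction)
```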

Not actually that easy: cross-validation

The above code shows the basics of how to train a classifier and how to use it for predicting the labels of unknown data. But understanding how well your classifier works is a bit more involved than all that.

If you want to get a good idea of how well your classifier performs, you're going to need to perform cross-validation.

Essentially, cross-validation takes one slice of the data to train with, and withholds another slice of the data to test the classifier on after training.

Generally, we repeat this process a number of times, with different slices of the data on each run. Then we look at the mean and standard deviation of classification accuracies across all runs to get an idea of how well our classifier works.

~golden rule~: Never test on your training set! That is, don't train your SVM on some data and then have your SVM do predictions about that same piece of data afterward. It won't be very informative.
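To see why, here's a sketch on pure noise, where the labels have nothing to do with the features and there is nothing real to learn. The data is made up, and I use scikit-learn's train_test_split for the hold-out, which the rest of this tutorial doesn't rely on.

```python
import numpy
from sklearn import svm
from sklearn.model_selection import train_test_split

rng = numpy.random.RandomState(0)

# Random features with labels that are unrelated to them.
X = rng.normal(size=(60, 200))
y = numpy.array([0, 1] * 30)

# Hold out half of the data; stratify keeps both classes in each half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

clf = svm.LinearSVC()
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # scored on data it has already seen
test_acc = clf.score(X_test, y_test)     # scored on held-out data
print(train_acc, test_acc)
```

The score on the training data looks impressive only because the classifier has memorized it; the held-out score is the honest one.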

sklearn's model_selection module

Scikit-learn has an incredibly convenient module for cross-validation. In current versions it is called model_selection (older releases called it cross_validation, which has since been removed):


from sklearn import model_selection


scores = model_selection.cross_val_score(clf, X, y, cv=7)

Now you can

print(scores.mean(), scores.std())

So, what happened here? Well, we did 7 "folds" of cross-validation (that is, sliced the data 7 different ways to create a number of different training and testing sets), then we found the mean and standard deviation of all seven rounds. With the mean and std, we can get a general sense for how well our classifier is able to distinguish vectors in our dataset.
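If you want to see the folds themselves, here's a sketch that does the same thing by hand on made-up data. It uses StratifiedKFold from sklearn.model_selection, which is the splitter scikit-learn picks when you pass an integer cv for a classifier; the three clusters are invented stand-ins for three subjects.

```python
import numpy
from sklearn import svm
from sklearn.model_selection import StratifiedKFold

rng = numpy.random.RandomState(0)

# Made-up data: 3 "subjects", 14 vectors each, each clustered around its own center.
centers = 5.0 * numpy.eye(3, 4)
X = numpy.vstack([rng.normal(loc=c, scale=1.0, size=(14, 4)) for c in centers])
y = numpy.array([0] * 14 + [1] * 14 + [2] * 14)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=7).split(X, y):
    clf = svm.LinearSVC()
    clf.fit(X[train_idx], y[train_idx])                  # train on 6 of the 7 slices
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out slice

scores = numpy.array(scores)
print(len(scores), scores.mean(), scores.std())
```

Every vector gets used for testing exactly once, and never in the same round it was trained on.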

Scikit-learn's cross-validation tools are quite robust - you can read more about them here.

What's LinearSVC? Why not some other classifier? LinearSVC is a wrapper for the ultra-fast liblinear developed by Chih-Jen Lin et al. at National Taiwan University. It's written in C, and it's the most performant classifier I know of for large datasets. Besides, it gets classification accuracy as good as or better than more computationally expensive, nonlinear SVMs. Check out Lotte et al. 2007 for more on that.

Bringing it all together

Let's start with this file, trainingtools.py:

import json, math, numpy
from sklearn import svm, model_selection

# (loadVector, defined earlier, goes in this file too)

def assembleCorpus(tasks,subjects,maxTrial):
    X = []
    y = []
    for task in tasks:
        for subject in subjects:
            for i in range(maxTrial):
                v = loadVector(task, subject, i)
                X.append(v)
                # label each vector with an integer for its subject
                y.append(subjects.index(subject))
    return numpy.array(X),y

# returns [mean, std] of the classification accuracies
def test_subj_classification(tasks, subjects):
    # get the relevant vectors
    X,y = assembleCorpus(tasks,subjects,9)
    # build an svm with a linear kernel
    clf = svm.LinearSVC()
    # split the data, fit a model, and compute the score
    # 7 consecutive times, holding out a different slice each time
    scores = model_selection.cross_val_score(clf, X, y, cv=7)
    return [scores.mean(), scores.std()]

This is a basic framework for (a) training a classifier on a subject's EEG data using some array of tasks for the training set and (b) performing cross-validation on that classifier.

Asking questions

Let's say our question is,

"Given an EEG reading, how well can we predict which user we collected the reading from?"

A couple of obvious answers come to mind:

  • It depends how many users we want to distinguish between! Distinguishing between 2 people will probably be easier than distinguishing between 15.

  • It depends on what tasks we put in the training set! Some tasks may be more "distinctive" between subjects than others.

With that in mind, a reasonable thing to do might be:

  • try every combination of tasks, of every size (i.e., each task on its own, every combination of two tasks, of three tasks, &c .. up to all tasks put together).

  • try different splits and sizes of the participant pool (e.g., two participants, three participants, six participants, all 15 participants, &c).
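As a quick sanity check on how many training sets the first idea implies, here's a small sketch that enumerates every non-empty set of our seven tasks. The task names are the ones used throughout this tutorial; task_sets is just my name for the result.

```python
import itertools

all_tasks = ['base', 'color', 'eye', 'finger', 'pass', 'song', 'sport']

# Every non-empty set of tasks: sizes 1, 2, ..., 7.
task_sets = []
for size in range(1, len(all_tasks) + 1):
    task_sets.extend(itertools.combinations(all_tasks, size))

print(len(task_sets))   # 2**7 - 1 = 127 task sets
```

That's 127 classifiers to cross-validate per subject pool, which is still quick with LinearSVC.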

Now let's express that in code:

import csv, itertools
import trainingtools

if __name__ == '__main__':

    all_tasks = ['base', 'color', 'eye','finger', 'pass', 'song', 'sport']
    subjects = ['subject0','subject1', 'subject2', 'subject3']

    # for starters, let's try to distinguish between all subjects.
    # let's train the classifier on one task at a time and see which task is best.
    tasks = list(all_tasks)

    # now let's do it on every combination of 2 to 6 tasks
    for i in range(len(all_tasks)-2):
        tasks.extend(itertools.combinations(all_tasks,i+2 ))

    # and finally let's try it with all the tasks in our training set
    tasks.append(all_tasks)

    # an array to hold each row of results for the eventual csv
    all_results = []

    # note that "task" here may be a single task or a collection of tasks
    for task in tasks:

        taskname = str(task).replace(',','/')

        if isinstance(task,str):
            task = [task]

        # results is an array [mean,std]
        results = trainingtools.test_subj_classification(task,subjects[0:7])
        print('~~~~~ ',results)
        # keep one row per task combination: name, mean, std
        all_results.append([taskname] + results)

    # assemble a csv w all these results
    with open('../results/subjid_results_2.csv','w', newline='') as f:
        csvwriter = csv.writer(f, delimiter=',',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for result_array in all_results:
            csvwriter.writerow(result_array)

    print('all done')

This script

  1. Generates every possible combination of tasks

  2. Goes through each combination, and

  3. Trains a classifier on the given range of subjects (in this example, up to the first seven subjects in the list, specified by test_subj_classification(task,subjects[0:7]))

  4. Outputs a CSV with mean and standard dev of classification accuracy for each task combination

To see how accuracy changes given the number of subjects between which we try to distinguish, we simply need to change [0:7] to something else and re-run the script.
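If you'd rather not edit the slice by hand each time, here's a sketch of a loop over pool sizes. accuracy_by_pool_size is a hypothetical helper of mine, and score_fn is assumed to behave like trainingtools.test_subj_classification (takes tasks and subjects, returns [mean, std]). The dummy scorer only exists so the sketch runs without data; it returns chance level, not a real result.

```python
def accuracy_by_pool_size(tasks, subjects, pool_sizes, score_fn):
    """Run score_fn on the first n subjects for each n in pool_sizes."""
    rows = []
    for n in pool_sizes:
        mean, std = score_fn(tasks, subjects[:n])
        rows.append([n, mean, std])
    return rows

# Placeholder scorer (NOT a real classifier): returns chance-level accuracy,
# 1 / number of subjects, so the loop can be run without any EEG files.
def dummy_score(tasks, subjects):
    return [1.0 / len(subjects), 0.0]

subjects = ['subject0', 'subject1', 'subject2', 'subject3']
rows = accuracy_by_pool_size(['base'], subjects, [2, 3, 4], dummy_score)
for row in rows:
    print(row)
```

In a real run you would pass trainingtools.test_subj_classification in place of dummy_score and write the rows to a CSV as before.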

Running time is about a minute on my machine - pretty fast.

Back matter

Here I have some further tips on exploring the wide world of machine learning, plus an extra example on task identification.

Increasing classification accuracy

Using machine learning is largely a process of iterative testing and tweaking. The details are outside the scope of this tutorial, but I've collected some useful links for tweaking your classifier using sklearn:
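To give a flavor of what that tweaking can look like in code, here's a sketch that tries a few values of LinearSVC's regularization parameter C with scikit-learn's GridSearchCV. The data is made up, and the candidate values for C are arbitrary picks of mine, not recommendations.

```python
import numpy
from sklearn import svm
from sklearn.model_selection import GridSearchCV

rng = numpy.random.RandomState(0)

# Made-up two-class data (not real EEG), 21 vectors per class.
X = numpy.vstack([rng.normal(loc=-1.0, size=(21, 4)),
                  rng.normal(loc=1.0, size=(21, 4))])
y = numpy.array([0] * 21 + [1] * 21)

# Score each candidate value of C with 7-fold cross-validation.
search = GridSearchCV(svm.LinearSVC(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=7)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Keep the golden rule in mind here too: a score you used to pick C is no longer a clean estimate of how the classifier does on unseen data.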

A final example - task identification

Given someone's brain scan, can we tell what mental task the person is doing?

This script goes through every pair of tasks and, for each subject, cross-validates SVM performance on distinguishing between those two tasks:

import csv, itertools
import trainingtools

if __name__ == '__main__':

    tasks = ['base', 'color', 'eye','finger', 'pass', 'song', 'sport']
    subjects = ['subject0','subject1', 'subject2', 'subject3']

    # get all pairs of tasks
    task_combos = itertools.combinations(tasks,2)

    all_results = []
    # test classification accuracy on every task pair
    for task_combo in task_combos:

        # start a results row for the task pair, first item "(task/task)"
        subject_results = []
        subject_results.append(str(task_combo).replace(',', '/'))

        # within every subject
        for subject in subjects:
            # get average accuracy for each subject
            # NOTE: classify_tasks_for_subject is a hypothetical helper, not shown
            # in trainingtools.py above; it should label vectors by task and
            # return [mean, std], like test_subj_classification does
            mean, std = trainingtools.classify_tasks_for_subject(task_combo, subject)
            subject_results.append(mean)
            print(subject, subject_results[-1])

        all_results.append(subject_results)

    # assemble a csv w all these results
    with open('results_200.csv','w', newline='') as f:
        csvwriter = csv.writer(f, delimiter=',',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for result_array in all_results:
            csvwriter.writerow(result_array)

    print('all done')
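The piece that changes for task identification is the labeling: vectors are labeled by task instead of by subject. Here's a minimal sketch of what such a function could look like. The name classify_tasks_for_subject and the load argument are my own assumptions (load is expected to behave like loadVector), and fake_load produces made-up vectors so the sketch runs without EEG files.

```python
import numpy
from sklearn import svm
from sklearn.model_selection import cross_val_score

def classify_tasks_for_subject(tasks, subject, load, max_trial=9, folds=7):
    """Cross-validate how well an SVM tells `tasks` apart for one subject.

    load(task, subject, trial) is assumed to return one feature vector.
    """
    X, y = [], []
    for label, task in enumerate(tasks):
        for trial in range(max_trial):
            X.append(load(task, subject, trial))
            y.append(label)          # the label is now the task, not the subject
    scores = cross_val_score(svm.LinearSVC(), numpy.array(X), y, cv=folds)
    return [scores.mean(), scores.std()]

# Made-up loader: 'song' vectors cluster around +5, everything else around -5.
rng = numpy.random.RandomState(0)
def fake_load(task, subject, trial):
    center = 5.0 if task == 'song' else -5.0
    return rng.normal(loc=center, scale=1.0, size=4)

mean, std = classify_tasks_for_subject(('finger', 'song'), 'subject0', fake_load)
print(mean, std)
```

With real data you would pass loadVector as load and call this once per subject and task pair, as in the script above.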