EEG data are highly variable, both between people and within individuals over time. Machine learning is an obvious (perhaps the only) way to build a workable, flexible application.

You can only build a good application with machine learning once you've built good feature vectors. If your feature vectors aren't easy to distinguish, no algorithm, no matter how fancy, will help you. And, as it turns out, you don't even need fancy algorithms for EEG - the classification problems behind EEG-based brain-computer interfaces seem to be linearly separable, so a simple, off-the-shelf SVM with a linear kernel should serve you perfectly well. No need for crazy recurrent neural networks or other nonlinear solutions. All you need to do is generate good features for your EEG data.

In this tutorial, I walk you through pre-processing your raw EEG data to make useful feature vectors for training a classifier. For simplicity, I focus on a single-channel EEG, in this case the Neurosky MindWave, and provide some sample data for you to work with. The resulting feature vectors will be ready to feed into an SVM - and, for more information on that, you can go straight from this tutorial to my other tutorial on building a simple BCI with a Neurosky Mindwave.

Let's begin!

Record some samples

First, you'll need to record some data. The key here is to come up with categorical examples on which you can train your classifier. So, "focused" versus "zoning out", "thinking about moving left" versus "thinking about moving right", &c.

For this tutorial, I've recorded some sample data to get you started. Each folder represents a different mental gesture. For each mental gesture, I've recorded 3 different trials (labeled 0, 1, 2), and, in each trial, 10 seconds of data. I've split each 10-second trial into half-second recordings.

Why are the recordings JSON files? JSON is everywhere. It's easy to read, easy to send over the wire, and there are libraries for parsing it in every language. Unlike CSVs, querying the data is as easy as referring to a key by its name (compare that to referencing a CSV column by its numerical index, which is hard to read and prone to human error). If you're new to JSON, it's worth reading up on.

Compute power spectra

If you open one of those JSON files, you'll see that there's a field called raw_readings with an array of numbers in it. The Neurosky device advertises a sampling rate of 512 Hz, so each of our half-second recordings should contain about 256 of these numbers.
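
If you want to sanity-check that yourself, it only takes a few lines to peek at one of the recordings (the file path below is just a placeholder; point it at any JSON file from the sample data):

import json

# load one half-second recording; the path here is only an example
with open("sample_data/focused/0/reading_0.json") as f:
    reading = json.load(f)

raw = reading["raw_readings"]
print(len(raw))  # roughly 256 samples (512 Hz * 0.5 s)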

Let's make those raw readings a little bit more interpretable by turning them into a power spectrum:

import numpy as np

def pSpectrum(vector):

    # FFT of the raw signal; the squared magnitude gives the power at each frequency
    A = np.fft.fft(vector)
    ps = np.abs(A)**2

    # the spectrum of a real signal is mirrored, so keep only the first half
    # (note the integer division - plain / breaks indexing in Python 3)
    ps = ps[:len(ps)//2]

    return ps

Don't worry about the math too much (though do look into the fast Fourier transform if you're interested and unfamiliar). The point here is that we're seeing which frequencies are present in the signal we're getting. Remember that EEG is traditionally analyzed using frequency bands - although we take a slightly different approach here (more on that later), the frequency components are where we look for informative signal.

Now we can get a power spectrum for each reading in each trial:

pspectra = [pSpectrum(v) for v in get_readings(gesture, trial)]

where get_readings is a function that parses the JSON files for a given mental gesture and trial number and returns a list of their raw readings.
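
I haven't shown get_readings itself, but here's roughly what it could look like, assuming the folder layout described above (one folder per gesture, one sub-folder per trial, one JSON file per half-second recording); treat the paths and names as placeholders for your own data:

import glob
import json
import os

def get_readings(gesture, trial):
    # collect every half-second recording for this gesture and trial,
    # keeping just the raw_readings array from each JSON file
    pattern = os.path.join(gesture, str(trial), "*.json")
    readings = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            readings.append(json.load(f)["raw_readings"])
    return readings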

Creating the feature vectors

We have a power spectrum for each reading in our mental gesture recording. How can we make these feature vectors even more useful?

  1. Let's compress multiple power spectra into a single, averaged power spectrum. Our feature vectors should represent the amount of time we want to use in our final application. The gestures in my sample should be distinguishable at lengths of 2 seconds (this is just my wild guess), so let's average four half-second recordings together.

  2. Let's take the resulting, averaged power spectrum, and sample this power spectrum at various points. Do we really need every point in our power spectrum to distinguish tasks? Probably not. In fact, the more compressed our feature vector, the better our SVM will perform, so long as we manage to pick points that are sufficiently informative. (More points could mean more noise, which will result in worse classification accuracy).

Using the functions defined in this utility library, let's build a single feature vector from four half-second readings (about 256 raw values each, i.e. 2 seconds of data from the Neurosky):

# average four half-second readings into a single binned power spectrum
avgPSpec = brainlib.makeFeatureVector(four_readings, 100)

makeFeatureVector first turns the four raw readings into power spectra, averages them, and then bins the averaged spectrum into 100 points.
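
If you're curious what that looks like under the hood, here's a rough sketch of the same idea in plain numpy, assuming a simple element-wise mean and uniform binning (the actual makeFeatureVector in the utility library may average and bin differently):

import numpy as np

def make_feature_vector(readings, n_bins):
    # turn each raw reading into a power spectrum, then average them element-wise
    spectra = np.array([pSpectrum(r) for r in readings])
    avg_spectrum = spectra.mean(axis=0)

    # compress the averaged spectrum to n_bins values by averaging within each bin
    bins = np.array_split(avg_spectrum, n_bins)
    return np.array([b.mean() for b in bins])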

Now, all we need to do is perform this step for every 2-second chunk of each trial. If we do this across all three trials, we're left with 15 feature vectors per mental gesture (five 2-second chunks per 10-second trial, times three trials). Each feature vector represents 2 seconds of recorded EEG data, and those vectors are ready to feed into an SVM for classification. (For details on that process, see my tutorial on how to use an SVM to build a simple brain-computer interface.)
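
Putting it all together, a loop along these lines would produce those vectors for one gesture (variable names here are just for illustration, and I'm assuming the get_readings sketch from earlier):

feature_vectors = []
for trial in range(3):
    readings = get_readings(gesture, trial)      # 20 half-second readings per 10-second trial
    for i in range(0, len(readings), 4):         # step through in 2-second (4-reading) chunks
        chunk = readings[i:i + 4]
        if len(chunk) == 4:
            feature_vectors.append(brainlib.makeFeatureVector(chunk, 100))
# 3 trials x 5 chunks each = 15 feature vectors for this gesture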

A note on meaning

Did you notice that our feature vectors represent the entire frequency band? If you did, good looking out. If you didn't, take a closer look at the getXOutput function.

If you're really attentive, you'll notice that we could have just used the EEG_POWER field that the Neurosky SDK gives us, in which each number represents one of the traditional EEG frequency bands (low/high alpha, beta, etc). Why didn't we just do that? Well, we don't want to make strong assumptions about where in the frequency spectrum the useful signal lies. While it's possible that the traditional power bands do contain useful signal, it's also possible that useful signal will come from outside these bands.

EEG devices don't just pick up on EEG - they detect whatever electrical activity reaches the electrodes, and EEG is a relatively tiny signal compared to, say, EMG. Imagine if your main "tell" turns out to involve muscular movement in the forehead! Or some combination of muscle movement and EEG band activity. Although this isn't "traditional" BCI, unless you're the most hard-line of EEG purists, you probably don't want to discard these frequencies. Meanwhile, known sources of noise that are traditionally filtered out (e.g. the 60 Hz hum of the electrical grid here in North America) take care of themselves: because those parts of the signal never correlate with any label, the classifier effectively learns to ignore them.

Isn't this elegant? It leaves room for patterns to emerge without you, the developer, needing to form specific a priori hypotheses about the signal. And, if you do take a look at your signals down the line (as you should), you can always open them up, see which frequencies look distinctive, and sample the X-output accordingly.

Play around!

Now, go experiment! You can use the signals I provided, or record some of your own. As you go, ask yourself the following questions, and try playing around to find the answers:

  1. How many power spectra should you average across? For your application, does 10 seconds seem more realistic? 30 seconds? An hour?

  2. Where in the frequency spectrum do you want to focus? I sample across the whole thing for starters, but there's a world of possibilities here. One thing I've found to work well is sampling at logarithmically-spaced points in the averaged spectrum (see the sketch after this list).

  3. What kind of average do you want to take of the multiple power spectra? Here, I've made a vector from the 3rd percentile of the elements across the power spectra, but you can really do anything. I've included a few options in that utility library, including the straight numerical mean (avgPowerSpectrum) and a lower-percentile reduction. Try these, or make your own!
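
To give you a concrete starting point for questions 2 and 3, here's a small numpy sketch of logarithmically-spaced sampling and a percentile-style reduction; the function names and parameters are mine, not the utility library's:

import numpy as np

def log_sampled(spectrum, n_points):
    # pick indices spaced logarithmically across the spectrum, so low
    # frequencies get sampled more densely (duplicate indices are dropped,
    # so you may end up with slightly fewer than n_points values)
    idx = np.unique(np.geomspace(1, len(spectrum) - 1, n_points).astype(int))
    return np.asarray(spectrum)[idx]

def percentile_spectrum(spectra, q):
    # instead of the mean, reduce the stacked spectra to their q-th percentile
    return np.percentile(np.array(spectra), q, axis=0)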

I have two big recommendations as far as technique is concerned:

  • Open up your data and look at it. Look at your data at various points in your pre-processing pipeline. This is the best way to see where you can make optimizations, and where you might be going wrong in producing your feature vectors.

  • Train and test a classifier right away. LightTable is an awesome tool for this, though many people use IPython notebooks to similar effect. What's important is embracing the iterative, test-and-test-again nature of machine learning development. Try to make it easy to test how different feature vectors affect classification accuracy; a bare-bones sketch follows below. See my tutorial on using an SVM with Neurosky data for more details on how to estimate classifier accuracy using cross-validation.
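
As that bare-bones starting point, something like this (assuming scikit-learn, with X as your stacked feature vectors and y as the matching gesture labels, both hypothetical names here) gives you a quick accuracy estimate:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X: one row per feature vector; y: the gesture label for each row
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())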

Good luck! If you build something cool, or have any questions, get in touch at ffff@berkeley.edu