MIRI Seminar on Data Streams, Spring 2014
Lab 2: Introduction to MOA
R. Gavaldà
==========================================

Download MOA and start it following the instructions at
http://moa.cms.waikato.ac.nz/getting-started/
The MOA manual may be useful too.

We will use only the Classification tab.

Let us first try the task "Measure Stream Speed" (MeasureStreamSpeed). Have a look at the different stream generators:

 - ArffFileStream: reads from an existing ARFF file, processing instances in sequence.
 - RandomTreeGenerator: generates a random decision tree, then uses it to label instances.
 - HyperplaneGenerator: simulates a hyperplane that oscillates and is used as a linear classifier. You can change the number of attributes, the number of attributes that actually vary, the speed of change, and (sigma) the probability with which it changes the direction of oscillation.
 - RandomRBFGenerator: simulates a number of Gaussians, each of which is randomly assigned a class. There is also a RandomRBFGeneratorDrift, which lets us introduce drift: the centroids walk around.

Note that there is also a task "Write Stream To ARFF", which generates a stream and writes it to an ARFF file.

Still within MeasureStreamSpeed, we can generate more complex streams with gradual or sudden changes by concatenating two or more streams. See the ConceptDriftStream stream generator: it lets us concatenate a base stream and a drift stream (both to be chosen). You can choose the central position p at which the change occurs, and the window w over which the change occurs (i.e., the change starts at roughly p-w and is essentially complete at p+w). The two distributions are merged using a sigmoid; as an alternative to the window, you can use the parameter alpha, the slope of the sigmoid at its middle point. You can merge more than two streams recursively, by letting the drift stream be itself a ConceptDriftStream.

A variant, ConceptDriftRealStream, allows merging streams with different numbers of attributes. Not all classifiers necessarily support this.
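
To make the sigmoid merge used by the concept drift stream more concrete, here is a minimal Python sketch. It is not MOA code. It assumes a logistic curve that equals 0.5 at position p and has slope 1/w there (hence the factor 4); the function names and the toy streams are made up for illustration.

```python
import itertools
import math
import random


def drift_probability(t, position, width):
    """Probability that instance number t is drawn from the drift stream.

    Assumption: a logistic curve with value 0.5 at `position` and slope
    1/width there (hence the factor 4).
    """
    x = -4.0 * (t - position) / width
    # Clamp the exponent so math.exp cannot overflow far from the change point.
    x = max(-700.0, min(700.0, x))
    return 1.0 / (1.0 + math.exp(x))


def merged_stream(base, drift, position, width, n, seed=1):
    """Yield n items, each taken from `base` or `drift` according to the sigmoid."""
    rng = random.Random(seed)
    for t in range(n):
        use_drift = rng.random() < drift_probability(t, position, width)
        yield next(drift if use_drift else base)


# Toy usage: two constant "streams", change centered at instance 500.
labels = list(merged_stream(itertools.repeat("old"), itertools.repeat("new"),
                            position=500, width=10, n=1000))
```

With a small width the switch from "old" to "new" is nearly sudden; with a large width, instances from both streams are interleaved for a long stretch around p, which is what a gradual change looks like to a classifier.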
Still in the Classification tab, check out the following tasks:

 - LearnModel: learns a classifier model from a stream and exports it to a file (as in WEKA, it is a serialized Java object).
 - EvaluateModel: reads a classifier from a file and evaluates it on a stream.
 - EvaluatePrequential: builds a classifier and evaluates it on the same stream. Each stream element is first used to test the classifier, then to train it.

Let's stay in this last one. Now the bottom part of the screen comes alive: you can plot the evolution of accuracy, the kappa statistic, RAM-hours (see next theory session), time, and memory used. Memory is probably in MB.

Prepare a stream with drift. Let the first part be a RandomTreeGenerator (so it's an easy task for decision tree classifiers) and the second one a HyperplaneGenerator (so it's easy for linear classifiers). Ask for a few million instances of each (for a total of 4-10 million), and a few tens of attributes in both streams (say 50, and say 25 of them do drift). Now it will take a few minutes to run a classifier on this stream.

 - Try the MajorityClass classifier. Its accuracy will be about 1/(number of classes), because classes are equiprobable in these generators.
 - Try the Perceptron algorithm (a linear classifier that adapts to change, but of course is only good for linear problems). It should have relatively low accuracy in the first part, then higher accuracy in the second.
 - Try a plain HoeffdingTree. This is the VFDT explained in class. It will do well in the first part, then worse in the second (and you'll see how its error oscillates as the hyperplane rotates).
 - Try a HoeffdingAdaptiveTree. It will still be affected by the change, but will do better on the second part, as it adapts more quickly both to the change and to the hyperplane rotations. Note, on the other hand, that it is slower and uses more memory (in my runs, from approx. 30 MB to approx. 50 MB).

----

You do not have to deliver anything for this session. Just learn.
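
For reference, the test-then-train idea behind EvaluatePrequential can be sketched in a few lines of Python. This is only an illustration, not MOA's implementation: the class and function names are invented here, the baseline mimics a majority-class learner, and accuracy is accumulated over the whole stream, which may differ from the evaluator MOA uses for its plots.

```python
from collections import Counter


class MajorityClass:
    """Baseline that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first training example.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner, sample_every=1000):
    """Test-then-train loop; returns cumulative accuracy at each sample point."""
    correct = seen = 0
    curve = []
    for x, y in stream:
        if learner.predict(x) == y:   # 1. test on the new instance
            correct += 1
        learner.learn(x, y)           # 2. only then train on it
        seen += 1
        if seen % sample_every == 0:
            curve.append(correct / seen)
    return curve


# Toy usage: a tiny hand-made stream of (attributes, label) pairs.
toy_stream = [(None, "a"), (None, "a"), (None, "a"), (None, "b")]
curve = prequential_accuracy(toy_stream, MajorityClass(), sample_every=2)
```

Because every instance is used for testing before the learner has seen its label, no separate hold-out set is needed, and the accuracy curve shows directly how a classifier reacts when the stream changes.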