Photos, ramblings, whatever

Wednesday, October 6, 2010

Long time

I haven't posted on my blog for almost a year now.
My girlfriend is just back from a business trip to the Philippines, so I thought I'd share one of her pictures:

She's the one on the left

Friday, January 1, 2010

New year: 2010

Yesterday was a good one. For one thing the date looked like a binary number (010110). For another thing I had a good day at the beach with good friends.
Unfortunately the rousie work isn't working so well: I'm on the company's third gang, and unfortunately the shearers on the third gang are almost as unreliable as the weather, meaning I have so far only had one day of work. That hasn't stopped me from being fairly busy. I've also painted my grandmother's laundry.
Below are some of the photos I took on New Years Day.



A very good friend
My sister




Thursday, December 10, 2009

Shearing

On Tuesday I had a trial as a rousie for a shearing company, and yesterday was my first day of paid work as a rousie. Today I have the day off because bad weather was expected. For those who don't know what a rousie is: we take the fleece from the shearing stage, throw it onto the wool table (that's harder than it sounds but I got used to it quite quickly), skirt the fleeces (remove the daggy bits from the edges), and roll them up and put them in the press. We also keep the floor tidy. The hours are long and the work is hard but it's rather fun and the pay is good.
On another note: I just beat my computer at Chinese chess!

It was set to use an opening book and think 6 moves ahead. I had previously had 2 stalemates in a row with it set to think 5 moves ahead.

Friday, December 4, 2009

Marks

I just got the last of my marks back from the second semester.
  • CHIN102 (Chinese language 1B): A
  • MATH161 (Discrete Mathematics & Logic): A
  • COMP206 (Program and Data Structures): C+
I unfortunately didn't pass COMP311 due to having forgotten how to write a good academic essay. Something I must re-learn during the summer holiday.

Tuesday, December 1, 2009

Flowers in the Tararua range

On Sunday I accompanied a group of people who went into the forest in the vicinity of Holdsworth Lodge looking for Orchids. We didn't see any that were actually in flower (or at least I didn't) but I did take some nice pictures of various other things.
I'm not sure what this one is.

Lupin

A pretty little fern.

Birch mistletoe. It's an extremely rare plant and in danger of becoming extinct. Mostly because possums think it's chocolate, but also because it doesn't spread very fast.

Another shot of the birch mistletoe.

Thursday, November 26, 2009

That code in my last post

Turns out that code in my last post worked pretty well despite having a couple of bugs which I would have expected to make the whole thing not work. In fact, now they're fixed, it doesn't even work much better due to the low probability of any mutations causing the model organisms to take advantage of this afternoon's bugfixes.

As well as the bugfixes, I've now set it up so the virtual organisms get saved to the disk. This does make it more practical as a spamfilter, but I'll want to change the save location when I start integrating it with an emailer.

Here is the code. It still works basically as described in the previous post. But now the virtual organisms can actually do arithmetic. However, from what I can figure out, if I start the program from scratch (including deleting all previous virtual organisms), the ones which try to use the arithmetic unit get out-competed by the ones which do everything just by shifting round chunks of data. Some of them even manage to return different values even though their code doesn't contain a return instruction. I suspect most of them do this by putting the zero they get fed at the end of each file into the destination value of an instruction that gets executed later, turning it into a return instruction.

Anyway, here's the updated code:
using GLib;

namespace tta {
 public static delegate void sexecF(short value, void* target);
 public struct setWordFunc {
  sexecF exec;
  void* target;
 }
 public static delegate short gexecF(void * target);
 public struct getWordFunc {
  gexecF exec;
  void * target;
 }
 
 public class VM : GLib.Object {
  private setWordFunc[] outputMaps;
  private getWordFunc[] inputMaps;
  private short[] bytes;
  private int pc = 1;
  private bool state = false;
  public int timeout { get; set; }
  
  public VM(int size) {
   bytes = new short[size];
   outputMaps = new setWordFunc[size];
   inputMaps = new getWordFunc[size];
   timeout = -1; // if negative, leave alone. If positive, decriment each cycle until 0,
     // then halt.
  }
  
  public VM.array(short[] ram) {
   bytes = ram;
   outputMaps = new setWordFunc[bytes.length];
   inputMaps = new getWordFunc[bytes.length];
  }
  
  public void mapInput(getWordFunc input, int address) {
   inputMaps[address % inputMaps.length] = input;
  }
  
  public void mapOutput(setWordFunc output, int address) {
   outputMaps[address % outputMaps.length] = output;
  }
  
  public short getWord(int address) {
   int a = address % bytes.length;
   a = a < 0 ? -a : a;
   if(inputMaps[a].exec != null) {
    return inputMaps[a].exec(inputMaps[a].target);
   } else {
    return bytes[a];
   }
  }
  
  public void setWord(int address, short value) {
   int b = address % bytes.length;
   b = b < 0 ? -b : b;
   if(outputMaps[b].exec != null) {
    outputMaps[b].exec(value, outputMaps[b].target);
   } else {
    bytes[b] = value;
   }
  }
  
  public int size { get { return bytes.length; } }
  
  public static void jump(short value, void* target) {
   ((VM)target).pc = value - (((VM)target).state? 2 : 0); // - 2 so advancing will bring it to the right place
  }
  public static void halt(short value, void* target) {
   ((VM)target).state = false;
  }
  
  public void run() {
   state = true;
   while(state == true && timeout != 0) {
    execute(getWord(pc), getWord(pc + 1));
    pc = (pc + 2) % bytes.length;
    if(timeout > 0) {
     timeout--;
    }
   }
  }
  
  public void execute(int src, int dest) {
   setWord(dest, getWord(src));
  }
 }
 
 public class ALU : GLib.Object {
  private short a = 0;
  private short b = 0;
  
  public static void setOperandA(short value, void* target) {
   ((ALU)target).a = value;
  }
  
  public static void setOperandB(short value, void* target) {
   ((ALU)target).b = value;
  }
  
  public static short add(void* target) {
   var x = (ALU)target;
   return x.a + x.b;
  }
  
  public static short subtract(void* target) {
   var x = (ALU)target;
   return x.a - x.b;
  }
  
  public static short multiply(void* target) {
   var x = (ALU)target;
   return x.a * x.b;
  }
  
  public static short modulus(void* target) {
   var x = (ALU)target;
   return x.a % x.b;
  }
  
  public static short divide(void * target) {
   var x = (ALU)target;
   return ( x.b == 0 ) ? 0 : ( x.a / x.b);
  }
  
  public static short gt(void* target) {
   var x = (ALU)target;
   return (short)(x.b > x.a);
  }
  
  public static short eq(void* target) {
   var x = (ALU)target;
   return (short)(x.b == x.a);
  }
 }
 
}

namespace evospam {
 class Bot : GLib.Object {
  public int score { get; set; }
  private short[] dna = new short[512];
  
  public Bot() {
   for(int i = 0; i < 512; i++) {
    dna[i] = (short)(GLib.Random.next_int() % short.MAX);
   }
  }
  
  public Bot.load(string filename) {
   try {
    string temp;
    ulong length;
    FileUtils.get_contents(filename, out temp, out length);
    GLib.Memory.copy(dna, temp, (length > 1023) ? 1023 : length);
   } catch ( FileError e ) {
    // unloaded or terminally corrupted bots will be replaced as the population evolves.
   }
  }
  
  public void save(string filename) {
   try {
    FileUtils.set_contents(filename, (string)dna, 1024);
   } catch(FileError e) {
    // don't worry. Be happy.
   }
  }
  
  public Bot.mutate(Bot parent) {
   for(int i = 0; i < 512; i++){
    if(GLib.Random.next_double() > 0.05) {
     dna[i] = parent.dna[i];
    } else {
     dna[i] = (short)(GLib.Random.next_int() % short.MAX);
    }
   }
  }
  
  public Bot.crossover(Bot mother, Bot father) {
   int point = (int)(GLib.Random.next_int() % 512);
   int point2 = (int)(GLib.Random.next_int() % (512 - point)) + point;
   for(int i = 0; i < point; i++){
    if(GLib.Random.next_double() > 0.05) {
     dna[i] = mother.dna[i];
    } else {
     dna[i] = (short)(GLib.Random.next_int() % short.MAX);
    }
   }
   for(int i = point; i < point2; i++){
    if(GLib.Random.next_double() > 0.05) {
     dna[i] = father.dna[i];
    } else {
     dna[i] = (short)(GLib.Random.next_int() % short.MAX);
    }
   }
   for(int i = point2; i < 512; i++){
    if(GLib.Random.next_double() > 0.05) {
     dna[i] = mother.dna[i];
    } else {
     dna[i] = (short)(GLib.Random.next_int() % short.MAX);
    }
   }
  }
  
  private static struct retParam {
   short* retptr;
   tta.VM haltTgt;
  }
  public static void ret(short value, void* target) {
   retParam* a = (retParam*)target;
   *(a->retptr) = value;
   tta.VM.halt(value, a->haltTgt);
  }
  
  public static short readStream(void* target) {
   unowned GLib.FileStream io = (GLib.FileStream)target;
   if(!io.eof()) {
    return (short)io.getc();
   } else {
    return 0;
   }
  }
  
  public short run(GLib.FileStream message) {
   var tmp = new short[512];
   var alu = new tta.ALU();
   GLib.Memory.copy(tmp, dna, 1023);
   var vm = new tta.VM.array(tmp);
   vm.mapOutput({vm.jump, vm}, 503);
   vm.mapOutput({alu.setOperandA, alu}, 504);
   vm.mapOutput({alu.setOperandB, alu}, 505);
   vm.mapInput({alu.add, alu}, 506);
   vm.mapInput({alu.subtract, alu}, 507);
   vm.mapInput({alu.multiply, alu}, 508);
   vm.mapInput({alu.divide, alu}, 509);
   vm.mapInput({alu.gt, alu}, 510);
   vm.mapInput({alu.eq, alu}, 511);
   short r = 0;
   retParam xt = {&r, vm};
   vm.mapOutput({ret, &xt}, 0);
   vm.mapInput({readStream, message}, 0);
   vm.timeout = 2048;
   vm.run();
   return r;
  }
 }
}

int evaluate(evospam.Bot b) {
try{
 GLib.Dir ham = GLib.Dir.open("ham");
 GLib.Dir spam = GLib.Dir.open("spam");
 int[] score = {0, 0};
 int finalscore;
 for(int c = 0; c < 2; c++) {
  weak GLib.Dir cd = c == 0 ? ham : spam;
  string filename = cd.read_name();
  while(filename != null) {
   int guess = b.run(FileStream.open("%s/%s".printf(c == 0 ? "ham" : "spam", filename), "r"));
   int s = guess - 16383 * c;
   s = s < 0 ? -s : s;
   if(s < (2000)) {
    score[c]++;
   }
   filename = cd.read_name();
  }
 }
 finalscore = score[0] < score[1] ? score[0] : score[1];
 b.score = finalscore;
 return finalscore;
} catch(GLib.FileError e) { stderr.printf("Cannot open directory\n"); return 0; }
}

int main(string[] args) {
 evospam.Bot[] bots = new evospam.Bot[300];
 int top_score = 0;
 for(int i = 0; i < 300; i++) {
  string fn = "bots/%d.bot".printf(i);
  if(FileUtils.test(fn, FileTest.EXISTS)) {
   bots[i] = new evospam.Bot.load(fn);
  } else {
   bots[i] = new evospam.Bot();
  }
  int score = evaluate(bots[i]);
  top_score = score > top_score ? score : top_score;
 }
 stdout.printf("Dummy run top score = %d\n", top_score);
 while(top_score < 32) {
  int lowest = 0;
  int lowscore = bots[0].score;
  for(int i = 1; i < 300; i++) {
   if(bots[i].score < lowscore || (bots[i].score == lowscore && GLib.Random.next_double() < 0.03)) {
    lowscore = bots[i].score;
    lowest = i;
   }
  }
  if(top_score == 0) {
   bots[lowest] = new evospam.Bot();
  } else {
   if(GLib.Random.next_double() > 0.9) {
    bots[lowest] = new evospam.Bot.crossover(bots[GLib.Random.next_int() % 300], bots[GLib.Random.next_int() % 300]);
   } else {
    bots[lowest] = new evospam.Bot.mutate(bots[GLib.Random.next_int() % 300]);
   }
  }
  lowscore = evaluate(bots[lowest]);
  bots[lowest].save("bots/%d.bot".printf(lowest));
  if(lowscore > top_score) {
   stdout.printf("Highscore: %d; index: %d\n", lowscore, lowest);
   top_score = lowscore;
  }
 }
 return 0;
}



Ah, Finally found out how to get indentation to appear in HTML!

Labels:

Wednesday, November 25, 2009

evolution of spam filters

One of the many projects I'm coding right now is a spam filter. Well it's not just a spam filter: I'm pretty sure I'll reuse parts of it in some pretty diverse other things. Without further ado I'll share the prototype (in Vala)...
using GLib;

namespace tta {
public static delegate void sexecF(short value, void* target);
public struct setWordFunc {
sexecF exec;
void* target;
}
public static delegate short gexecF(void * target);
public struct getWordFunc {
gexecF exec;
void * target;
}

public class VM : GLib.Object {
private setWordFunc[] outputMaps;
private getWordFunc[] inputMaps;
private short[] bytes;
private int pc = 1;
private bool state = false;
public int timeout { get; set; }

public VM(int size) {
bytes = new short[size];
outputMaps = new setWordFunc[size];
inputMaps = new getWordFunc[size];
timeout = -1; // if negative, leave alone. If positive, decriment each cycle until 0,
// then halt.
}

public VM.array(short[] ram) {
bytes = ram;
outputMaps = new setWordFunc[bytes.length];
inputMaps = new getWordFunc[bytes.length];
}

public void mapInput(getWordFunc input, int address) {
inputMaps[address % inputMaps.length] = input;
}

public void mapOutput(setWordFunc output, int address) {
outputMaps[address % outputMaps.length] = output;
}

public short getWord(int address) {
int a = address % bytes.length;
a = a < b =" address" b =" b" pc =" value" state =" false;" pc =" 1;" state =" true;" state ="=" i =" getWord(pc);">> 8) & 0x00FF, i & 0x00FF);
pc = (pc + 1) % bytes.length;
if(timeout > 0) {
timeout--;
}
}
}

public void execute(int src, int dest) {
setWord(dest, getWord(src));
}
}

public class ALU : GLib.Object {
private short a = 0;
private short b = 0;

public static void setOperandA(short value, void* target) {
((ALU)target).a = value;
}

public static void setOperandB(short value, void* target) {
((ALU)target).b = value;
}

public static short add(void* target) {
var x = (ALU)target;
return x.a + x.b;
}

public static short subtract(void* target) {
var x = (ALU)target;
return x.a + x.b;
}

public static short multiply(void* target) {
var x = (ALU)target;
return x.a * x.b;
}

public static short modulus(void* target) {
var x = (ALU)target;
return x.a % x.b;
}

public static short divide(void * target) {
var x = (ALU)target;
return ( x.b == 0 ) ? 0 : ( x.a / x.b);
}

public static short gt(void* target) {
var x = (ALU)target;
return (short)(x.b > x.a);
}

public static short eq(void* target) {
var x = (ALU)target;
return (short)(x.b == x.a);
}
}

}

namespace evospam {
class Bot : GLib.Object {
public int score { get; set; }
private short[] dna = new short[512];

public Bot() {
for(int i = 0; i < i =" 0;"> 0.1) {
dna[i] = parent.dna[i];
} else {
dna[i] = (short)(GLib.Random.next_int() % short.MAX);
}
}
}

private static struct retParam {
short* retptr;
tta.VM haltTgt;
}
public static void ret(short value, void* target) {
retParam* a = (retParam*)target;
*(a->retptr) = value;
tta.VM.halt(value, a->haltTgt);
}

public static short readStream(void* target) {
unowned GLib.FileStream io = (GLib.FileStream)target;
if(!io.eof()) {
return (short)io.getc();
} else {
return 0;
}
}

public short run(GLib.FileStream message) {
var tmp = new short[512];
var alu = new tta.ALU();
GLib.Memory.copy(tmp, dna, 1023);
var vm = new tta.VM.array(tmp);
vm.mapInput({alu.add, alu}, 506);
vm.mapInput({alu.subtract, alu}, 507);
vm.mapInput({alu.multiply, alu}, 508);
vm.mapInput({alu.divide, alu}, 509);
vm.mapInput({alu.gt, alu}, 510);
vm.mapInput({alu.eq, alu}, 511);
short r = 0;
retParam xt = {&r, vm};
vm.mapOutput({ret, &xt}, 0);
vm.mapInput({readStream, message}, 0);
vm.timeout = 4000000;
vm.run();
return r;
}
}
}

int evaluate(evospam.Bot b) {
try{
GLib.Dir ham = GLib.Dir.open("ham");
GLib.Dir spam = GLib.Dir.open("spam");
int[] score = {0, 0};
int finalscore;
for(int c = 0; c < cd =" c" filename =" cd.read_name();" guess =" b.run(FileStream.open(" c ="=" s =" guess" s =" s" filename =" cd.read_name();" finalscore =" score[0]" score =" finalscore;" bots =" new" top_score =" 0;" i =" 0;" score =" evaluate(bots[i]);" top_score =" score"> top_score ? score : top_score;
}
stdout.printf("Dummy run top score = %d\n", top_score);
int counter = 0;
int overallCounter = 0;
while(top_score < lowest =" 0;" lowscore =" bots[0].score;" i =" 1;" score ="=" lowscore =" bots[i].score;" lowest =" i;">= 300) {
stdout.printf("...(lowest this round: %d)\n", lowscore);
counter = 0;
}
if(top_score == 0) {
bots[lowest] = new evospam.Bot();
} else {
bots[lowest] = new evospam.Bot.mutate(bots[GLib.Random.next_int() % 300]);
}
lowscore = evaluate(bots[lowest]);

if(lowscore > top_score) {
stdout.printf("Highscore: %d\n", lowscore);
top_score = lowscore;
counter = 0;
}
counter++;
overallCounter++;
}
stdout.printf("Generations after initialization: %d\n", overallCounter);
return 0;
}




And here is the output from a fairly typical run:

Dummy run top score = 0
Highscore: 3
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
Highscore: 4
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
Highscore: 7
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 1)
...(lowest this round: 0)
...(lowest this round: 0)
Highscore: 8
...(lowest this round: 0)
...(lowest this round: 0)
Highscore: 13
...(lowest this round: 3)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 2)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
Highscore: 16
...(lowest this round: 4)
...(lowest this round: 4)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 7)
...(lowest this round: 7)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 10)
...(lowest this round: 0)
...(lowest this round: 10)
Highscore: 17
...(lowest this round: 9)
...(lowest this round: 10)
Highscore: 18
...(lowest this round: 10)
...(lowest this round: 0)
...(lowest this round: 3)
...(lowest this round: 13)
...(lowest this round: 13)
Highscore: 19
...(lowest this round: 10)
...(lowest this round: 0)
...(lowest this round: 0)
...(lowest this round: 15)
...(lowest this round: 16)
...(lowest this round: 3)
...(lowest this round: 13)
Highscore: 23
...(lowest this round: 0)
...(lowest this round: 15)
...(lowest this round: 0)
...(lowest this round: 10)




There is a possibility I will post this without waiting for the program to complete.

Before I go, I'll discuss briefly how it works and some of its features. The code in the tta namespace is actually a computer in its own right, although one which only exists in software and the mind of anyone reading the code. Its first unusual property is that it halts as soon as it outputs anything. That's because it's only being run in order to produce one number (which is then used to decide whether it thinks the input is a spam message). It's also rather unusual as a computer in that it only operates on 16-bit words, there are no bitwise operations and no pure shift operators. In spite of these limitations it's actually Turing-complete, and if the restriction of one output only is removed it can be used to compute anything that will fit in its registers (or anything at all if programmed to use external memory).
Its other unusual feature is that it only has one operation, so there is no need to waste any bits specifying which one is to be used for each instruction! The instruction is "copy". It's 16 bits wide, the first 8 bits are the register ID for the original, and the remaining 8 bits are the register ID for the copy destination. Most registers simply store the data, but one is used for reading the file and returning the evaluation, and some others are used to control the arithmetic logic unit. This kind of system is called a transport triggered architecture. I thought of it before I read about it, but I read about it before I started coding this one.

Enough about that: the evolutionary algorithm is pretty standard. It basically works by continually throwing out whichever candidate program appears to be the worst solution and replacing it with a random variation of a randomly selected superior. Some of the new random variations don't work at all, so they're replaced almost at once. Some work better than any of the others, and have a tendency to fill the entire population with their descendants until something even better comes along.

The fitness function used here is actually the most original part of the algorithm. It works as follows: count the number of real emails that it classifies as real email. Then count the number of spam emails it classifies as spam. The score is the lowest number of these. My first approach was to simply count the number of correctly classified emails, but that worked rather poorly because the programs that classified either everything as spam or everything as real email had a huge advantage over the ones that actually read the email.

As for the final performance: the best score I've seen so far is 25. The samples it's working with are 32 emails each of spam and genuine, so that 25 score has a minimum of 78% accuracy. The overall accuracy is probably better because the score is only based on the aspect of the task it performs worst at.
Admittedly: the trials so far only measure performance in classifying 64 messages. For all I know it will fall flat in a real-life situation, but the only way to find that out is to test it.

Labels: