Because I was free to choose my implementation, and since I love working with Ruby on Rails and Ruby in general, I decided to write the required scripts in plain Ruby, with the help of a few gems.
My first step was finding a gem that would allow me to index the files so they could be searched for keywords. Ferret is a nice little gem that allows indexing of text fragments. It was relatively easy for me to insert the contents of these XML files instead of the text fragments. Below is a snippet of code, written in pure Ruby, that creates a new index with Ferret, stores the content of each file in it, and then moves the file to an output directory depending on whether the predicted category was correct or not:
require 'rubygems'
require 'ferret'
require 'ankusa'
require 'ankusa/file_system_storage'
require 'rexml/document'
require 'fileutils'
include Ferret

file_list = Dir['input/*.xml']
unless file_list.empty?
  puts 'The following XML files were discovered in the input directory:'
  file_list.each_with_index { |f, i| puts "\t#{i}. #{File.basename f}" }
  puts 'Press enter to continue with the script...'
  gets

  # Naive Bayes classifier whose training data is stored in a file on disk
  storage = Ankusa::FileSystemStorage.new 'core/data/storage.bin'
  c = Ankusa::NaiveBayesClassifier.new(storage)
  # Train both categories on an empty string before classifying anything
  c.train :recipe, ""
  c.train :unknown, ""

  # Ferret index stored under core/data/index
  ferret = Ferret::Index::Index.new :default_input_field => :content,
                                    :id_field => :content,
                                    :path => 'core/data/index'

  puts 'Processing files:'
  file_list.each do |file|
    # The block form closes the file handle before the file is moved
    xml = File.open(file, 'r') { |io| REXML::Document.new(io) }
    content = xml.root.elements["normalize-space(//body/section/paragraph/text())"].to_s
    title = xml.root.elements['//body/title/text()'].to_s

    # Let the classifier guess, then ask the user to confirm the guess
    category = c.classify(content)
    puts "#{File.basename file}"
    puts "Category :: #{category}"
    puts "Is this correct? [y/n]"
    reply = gets.strip.eql?('y')

    # Train on the category the user confirmed (or implied by answering 'n')
    c.train :unknown, content if (!reply and category.eql? :recipe)
    c.train :unknown, content if (reply and category.eql? :unknown)
    c.train :recipe, content if (!reply and category.eql? :unknown)
    c.train :recipe, content if (reply and category.eql? :recipe)
    c.untrain :recipe, content if (!reply and category.eql? :recipe)
    # Note: this cancels the :recipe training two lines up;
    # untraining :unknown may have been the intent here
    c.untrain :recipe, content if (!reply and category.eql? :unknown)

    # Add the document to the search index
    ferret << {:content => content, :title => title, :filename => File.basename(file)}

    # Move the file to the directory of its confirmed category
    if ((category.eql?(:recipe) and reply) or (category.eql?(:unknown) and !reply))
      FileUtils.move(file, 'output/recipes')
    end
    if ((category.eql?(:unknown) and reply) or (category.eql?(:recipe) and !reply))
      FileUtils.move(file, 'output/unknown')
    end
    puts " #{File.basename file} processed..."
    # Persist the updated training data after every file
    storage.save
  end
else
  puts 'No XML files discovered inside the input directory.'
end
I know the display might not be ideal on the blog, but bear with it.
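The train and move conditions in the script all follow one rule: if the guess is confirmed, the file belongs to the predicted category, and if it is rejected, the file belongs to the other one. The sketch below shows that rule in isolation. The helper names `confirmed_category` and `destination_for` and the `CATEGORIES` constant are made up for illustration and are not part of the original script, Ferret, or Ankusa.

```ruby
# Hypothetical helpers, not part of the original script.
CATEGORIES = [:recipe, :unknown].freeze

# Returns the category a file actually belongs to, given the classifier's
# guess and whether the user answered 'y' to "Is this correct?".
def confirmed_category(predicted, confirmed)
  unless CATEGORIES.include?(predicted)
    raise ArgumentError, "unexpected category: #{predicted.inspect}"
  end
  return predicted if confirmed
  # With only two categories, a rejected guess implies the other one.
  (CATEGORIES - [predicted]).first
end

# Returns the output directory used by the script for a given category.
def destination_for(category)
  category == :recipe ? 'output/recipes' : 'output/unknown'
end
```

With helpers like these, the loop body could compute `actual = confirmed_category(category, reply)` once and then train on `actual` and move the file to `destination_for(actual)`, instead of repeating the four combinations of guess and answer.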
To allow the classification, I first relied on the 'Classifier' gem, which can be found on rubygems.org. However, this gem contains a really primitive implementation of a Bayes classifier, and in the end I was forced to abandon it because it did not meet my requirements.
Then I stumbled across the Ankusa gem, which performed nicely for what I needed to finish the task at hand.
However, this gem contained a flaw in the code. When certain words were analysed by the gem, a log(0) calculation occurred, crashing the Ruby script. If you're not a math genius: the logarithm of 0 is minus infinity, which is not that easy to work with.
To solve this problem, I created a branch on GitHub that contains my fix and all other changes already implemented by other people that have worked on the Ankusa gem.
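To make the log(0) problem concrete, here is a minimal sketch in plain Ruby. It is neither the Ankusa code nor the actual fix from the branch; the word counts and method names are invented for illustration, and add-one (Laplace) smoothing is only one common way to keep a naive Bayes classifier away from log(0).

```ruby
# Hypothetical word counts for a single category (not taken from Ankusa).
counts = { 'flour' => 3, 'sugar' => 1 }
total = counts.values.sum
# Assumed vocabulary: flour, sugar, and one word never seen in this category.
vocabulary_size = 3

# Unsmoothed log probability: a word that was never seen has probability 0,
# and Math.log(0.0) evaluates to -Infinity in Ruby.
def raw_log_prob(counts, total, word)
  Math.log(counts.fetch(word, 0).to_f / total)
end

# Add-one smoothing: every word gets a count of at least 1, so the
# probability never reaches 0 and the logarithm stays finite.
def smoothed_log_prob(counts, total, vocabulary_size, word)
  Math.log((counts.fetch(word, 0) + 1).to_f / (total + vocabulary_size))
end

unseen_raw = raw_log_prob(counts, total, 'butter')
unseen_smoothed = smoothed_log_prob(counts, total, vocabulary_size, 'butter')

puts "raw:      #{unseen_raw}"
puts "smoothed: #{unseen_smoothed}"
# Once one term is -Infinity, the sum over all words is -Infinity as well.
puts "sum:      #{unseen_raw + raw_log_prob(counts, total, 'flour')}"
```

The point of the sketch is that a single unseen word is enough to drag a whole log-likelihood sum to minus infinity, and any later arithmetic on that value (subtracting two infinities, converting to an integer) tends to go wrong.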
Now, I have a complete working script that indexes, analyzes and classifies XML files.