Archive | April, 2012

Intro into AI Natural Language Processing Programming Solutions

4 Apr

As thousands students around the world I took Intro into AI class offered online for free by Standford University. It was an eye opening experience. To paraphrase somebody’s quote: “Any advance technology seems like magic to uninformed”. The class did the excellent to lift the veil over the magic of AI by covering fundamental concepts of Bayes Networks, Propositional Logic, Classical Planning, Markov Decision Processes, Particle Filters, Game Theory, Natural Language Processing (NLP) and many others.

But as a hands on programmer I got tired of pen and paper number crunching (I did program some trivial calculations) so optional NLP programming exercises scratched a coding itch. Solutions (in Clojure, of course) are available at github.com/vitalyper/aiclass. Readme has more details not to be repeated here. Here are some thoughts after solving more challenging problem 2 – restore two characters shredded text.

|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|”A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|” |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |

Hypothesis 1
Any solution would require use of some external data. I used 2500+ most frequently used English words but bigram probability distribution compiled from sufficiently large corpus would work better. This echoes theme from NLP lectures – solutions to many NLP problems require collection of large amount of data. For example, spell checkers are trivial to implement if you have good set of misspelled words.

Observation 1
Brute force is a reasonable approach when you have access to a decent hardware. I didn’t so I had to limit permutations in steps 1 and 2. Parallelizable brute force approach would be much simpler and slower.

Observation 2
Applicability to other languages.
As it stands solution could be applicable to other languages with the following properties: 1) word based 2) left-to-right direction 3) same punctuation set. The most used words set would have to change of course. Adjusting solution to trigrams would requie some additional work and refactoring.

Advertisements