Archive | Software RSS feed for this section

Intro into AI Natural Language Processing Programming Solutions

4 Apr

As thousands students around the world I took Intro into AI class offered online for free by Standford University. It was an eye opening experience. To paraphrase somebody’s quote: “Any advance technology seems like magic to uninformed”. The class did the excellent to lift the veil over the magic of AI by covering fundamental concepts of Bayes Networks, Propositional Logic, Classical Planning, Markov Decision Processes, Particle Filters, Game Theory, Natural Language Processing (NLP) and many others.

But as a hands on programmer I got tired of pen and paper number crunching (I did program some trivial calculations) so optional NLP programming exercises scratched a coding itch. Solutions (in Clojure, of course) are available at github.com/vitalyper/aiclass. Readme has more details not to be repeated here. Here are some thoughts after solving more challenging problem 2 – restore two characters shredded text.

|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|”A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|” |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |

Hypothesis 1
Any solution would require use of some external data. I used 2500+ most frequently used English words but bigram probability distribution compiled from sufficiently large corpus would work better. This echoes theme from NLP lectures – solutions to many NLP problems require collection of large amount of data. For example, spell checkers are trivial to implement if you have good set of misspelled words.

Observation 1
Brute force is a reasonable approach when you have access to a decent hardware. I didn’t so I had to limit permutations in steps 1 and 2. Parallelizable brute force approach would be much simpler and slower.

Observation 2
Applicability to other languages.
As it stands solution could be applicable to other languages with the following properties: 1) word based 2) left-to-right direction 3) same punctuation set. The most used words set would have to change of course. Adjusting solution to trigrams would requie some additional work and refactoring.

Advertisements

Clojure Conj 2011 Recap

17 Nov

Having attended first Conj last year I was looking forward to one more Clojure deep dive. And I was not disappointed once again. Similar to the last year the conference took place in Raleigh, NC but this time hotel was right in the middle of downtown with walking distance access to restaurants, shops, etc. All sessions were recorded, look forward to upcoming videos. Here is my attempt at quick recap with some personal comments.

Day 1, Thursday, Nov 10th

  • Session 1, Ideas Rant, Stuart Sierra
    Stuart, with occasion inspired blue hair, started the conference with rant-like collection of ideas/suggestions. Among them were: using clojure read/print for serialization, how to implement your own clojure structures for drop-in replacement for closure ones. He also spent some time talking about java.util.concurrent goodies as an option for cases where Clojure provided state management primitives don’t fit the bill. Somewhat mixed message IMO, given the clojure’s goal to hide much of the Java. Stuart also mentioned clojure.test.generative, input test generation library inspired by Haskell’s QuickCheck.
  • Session 2, Concurrent Stream Processing, David McNeil, Revelytix, Inc.
    One of two very good session on “Clojure in the real World”. Problem is to process SPARQL federated queries against multiple SQL datasources, aggregate and combine results. Solution was inspired by unix pipe processing.  s-expressions via custom DSL were compiled into tree of pipe connected nodes, where node could represent query or sub-query. Aggregation and post processing of the query results was done via fork/join pool. Few interesting notes:
    Fork/join is well suited only for cpu-bound work so any blocking io was done prior when fetching results.
    To allow for large streams system monitored virtual and transparently switched to memory buffered pipes.
  • Session 3, Clojail, Anthony Grimes
    Anthony quite possibly is the youngest member of Clojure community, he  is 17. Avid cat lower and Stuart Sierra’s groupie  with occasion inspired green hair. Main message : code is dangerous, but dangerous is good except there few cases when it is not: IRC channel, learning, etc. where sandboxing provides an answer. Reuse JVM sandboxing where possible. Sandboxing dynamic Clojure features is quite hard, Clojail is an attempt to achieve this. It has good customization and default security profile.
  • Session 4, ClojureScript, Chris Houser
    Chris presented some low-level details of ClojureScript (CLJS) from data types -> protocols -> macros -> compiler. CLS vector is copy-on-write wrapper over js array. 2 types of map implementations: ObjMap (when keys are string-like), HashMap when keys are objects. Protocols can extend foreign class into CLJS. Chris warned about potential implementation clash if you don’t own protocol and data structure.
  • Session 5, Striving To Make Things Simple and Fast, Phil Bagwell, Typesafe
    Phil made low level data structure talk bearable. He started on showing how easy it is to make Scala data structure code parallel via recently added support for parallel collections. Then he delved into details on solving the problem of developing fast immutable vector. It was fascinating to see his description of how work progressed step by step. In the end it took team 18 months to come up with performing solution. For more details see RRB-Trees: Efficient Immutable Vectors
  • Session 6, Intro to Logic Programming with Clojure, Ambrose Bonnaire-Sergeant
    Logic programming was one of the main themes of the Conj. Rich Hickey and Clojure/Core expressed a strong interest in this area. But in terms of the presentation, Ambrose came up short. He is young, sharp student from Australia but his delivery lacked conviction, coherence and deep understanding of the subject matter. After a while many attendees faded away into typing. IMO, this area would have been better covered by David Nolan, the author of core.logic but he was presenting next day.
  • Session 7, Hacking The Human Genome, Arnoldo J. Muller-Molina
    A definite hit, an excellent example of power of Clojure in the real world. Arnoldo used Clojure and similarity search to discover the rest of “landing sites” sequences of DNA of which only 10% have been known.
  • Session 8, Ousterhout’s Dichotomy Isn’t (Part 2 of the Simplicity/Power/Focus Series), Stuart Halloway
    Another rant like session from “Stue”. No many notes on this one, I must have been mentally fatigued plus there was very little to take away.
  • After Session, Scheme’s miniKarnen, William Byrd and Daniel Friedman
    These guys stole the day 1 show, they rock. Although I came up few minutes late, I was able to see them in the act, going through Scheme code, changing and trying things on the fly, making statements like “We can make our program to run backwards, you do want to run your programs backwards, don’t you?” Checkout their “Schemer Series” book The Reasoned Schemer

Day 2, Friday, Nov 11th

  • Session 1, Master Plan for Clojure Enterprise Domination, Neal Ford, ThoughtWorks
    Neal discussed how to sneak in Clojure into enterprise. He shared his experience on well-known CTO prospective of bringing new programming language into a mix. According to him Clojure can help solve enterprise problems in the areas of: RESTful integration, ORM, and rule engines. Neal proposed “cat burglary” approach which worked fine for Rails/Cucumber and Groovy/Gradle. Interesting thought on perceived rivalry with Scala – at this point Scala is Clojure’s friend and the enemy is Java’s status quo. When selling Clojure to management one should start in the following order: another language for JVM -> functional -> dynamic. At the end he appealed for wrapping everything in order to build bridges to other technologies.
  • Session 2, Predicate Dispatch, David Nolen
    David is a great example of young hacker generation drinking from a Clojure’s fire hose. His talk centered around a finding a better way of function dispatch. Although Clojure’s multimethods dispatch on more than first argument, they are essentially closed, the dispatch function is hardwired – potential extenders cannot add new dispatch conditions not already defined by the original dispatch function. Similarly, pattern matching favored in some functional languages exhibits the same problem. Predicate dispatch generalizes traditional method dispatch mechanisms by permitting arbitrary predicates to control method applicability and by using logical implication between predicates as the overriding relationship. As of now predicate dispatch is not in Clojure but David showed that it is plausible and could be added in the near future.
  • Session 3, Extending Javascripts from Clojure, Kevin Lynagh
    One of the least informative sessions – a lot of fluff not enough stuff. Kevin reiterated Javascript’s shortcomings: global vars, lack of modules, etc. and showed how ClojureScript overcomes them, nothing new here.
  • Session 4, From Linear to Incremental, Christophe Grand
    Christophe took on an interesting problem of incremental parsing/compilation in order to prevent full parsing/recompilation for small code changes. I wish I could give you more details but Christophe’s heavy accent diminished my note talking capability. The only other interesting detail I noted is that Christophe used technique of stack transplantation via diff memoization.
  • Session 5, Logs as Data, Mark McGranaghan, Heroku
    Mark presentation centered around Clojure implementation of central distributed logging at Heroku. Don’t treat logging as lines in the text file but as logging events with multiple uses: log aggregation, system status, system performance, etc. Parse log lines and convert them to Clojure map to be forwarded to logger processors.
  • Session 6, Modeling the world Bayesian network in Clojure, Chas Emerick
    This presentation covered Bayesian network (BN) from AI realm and it’s applicability to problem domains of process control, decision systems, and prediction. Chas mentioned that he used BN in his own business in document processing system to learn about different document formats. Chas announced that he will release probability library “raposo” soon.
  • Session 7, Clojure and Adroid, Daniel Gomez
    Daniel summed up the state of Clojure development for Android platform. He started with describing high level architecture of Dalvik VM and how it is different from JVM. Through rigorous testing and profiling Daniel showed that Clojure lags in the areas of code size and startup time. His presentation sparked a discussion on how to minimize Clojure’s runtime footprint and startup time.
  • Session 8, Keynote, Rich Hickey
    No rant-like talk from Rich this time. His presentation centered around future work stream for Clojure/Core team. Future areas of interest are: making Clojure leaner with different build targets, unification with ClojureScript, elusive goal of ClojureInClojure, getting rid of dependency on ASM byte code manipulation library, InvokeDynamic support in Java 7, leveraging core.logic for type checking and predicate dispatch, parallelism with algoritms not collections, transients and pods, accumulators and extensible reader.

Day 3, Saturday, Nov 12th

  • Session 1, Cascalog, Nathan Martz, Twitter
    Nathan’s presentation centered around anaylyzing Twitter’s Big Data with Clojure custom DSL on top of Hadoop’s MapReduce. I immediately draw parallel with earlier David’s McNeil’s presentation. Both share similar architecture but differ in query DSL and in task running backend component – one uses fork/join while other Hadoop. Cascalog supports different type of predicates: filters, aggregators, generators, and user defined functions. Joins in Cascalog are implicit and subqueries are supported. Nathan ran trough a lot of examples from trivial ones to more advanced.
  • Session 2, Functional Data Structures, Daniel Spiwek
    Daniel’s presentation was most enthusiastic by far. My first thought was – how many cups of coffee he had in the morning? But other people whom seen him present before pointed out that this is his style. Two groups of data structures: sequential and associative. Provide immutability and use structural sharing for performance. Implementation goal is to minimize copying and maximize sharing. Amortization as a way of distribute cost of infrequent slow operation. The latest state of the art is hybrid approach: combine sequential and associative characteristics. Pioneered by Rich Hickey and then ported to other languages. CPU locality of reference and JVM heap locality are the underlying low level reasons why they are fast enough.
  • Session 3, Macronomicon, Michael Fogus, Clojure/Core
    Another rant-like talk, somewhat more coherent and useful, centered primarily on the subject of macro use and abuse. Fogus echoed Christophe Grand’s presentation from last year on suggested macro use cases: creating binding forms, control flow and syntactic sugar when warranted. Drew attention to his work in progress on pre/post condition library Trammel. At the end Fogus showed how macros could be used to shift runtime calculation to a compile time.
  • Session 4, Lightning Talks, Ten minutes/Various Presenters
    CyclicTime – jodatime like library for sane date computations.
    Immutant – application server for clojure on top of Jboss using JRuby.
    UberLisp – lisp1 interpreter for the Arduino. The best by far, whittingly presented, auidence bursted into applauses several times.
    Korma– Sugar coated DSL on top of SQL.
  • Session 5, Performance in the Wild, Craig Andera, Clojure/Core
    Good presentation on how to optimize Clojure in a real-life project. Many common sense, but useful nevertheless techniques:
    Establishing performance baseline early in development cycle helps to catch deceptively small changes that could effect performance.
    Performance is not a scalar metric.
    Be aware of assumptions it a data model.
    Measure distribution of latency under most likely use case/s.
    Optimization loop: bechmark -> analyze -> recommend -> optimize; done when performance targets are met.
    Don’t make mistake of equating number of threads with concurrent clients.
    Tools used: jmeter plugin to chart the results, YourKit for profiling.
  • Session 6, Programming Music With Overtone, Sam Aaron
    Skipped this one to catch the flight.

Overall, it was a great experience. I talked to few dozen of people about Clojure. Similarly to the last year there were quite a few at playing/evaluating/trying to introduce stage. But I did see signs of wider adoption. Met few guys who took a stance “No mas Java” and found 100% Clojure jobs after a while. Met one guy from London who works at a bank where IT director had been bitten by Clojure bug and was pushing it from the top. Clojure developer community is quite diverse with three main backgrounds: young (mostly Ruby) hackers, old time Lispers, and Java converts. Relevance (Clojure/Core) does an admirable job of evolving language and ecosystem.

Project Euler Clojure Problem 003

13 Jan

After discovering Project Euler Clojure in my previous post I took a stab at problem 003 – “Find the largest prime factor of the composite number 600851475143”. In the spirit of true learning I did not look at any posted solutions while writing mine.

The first bright idea was to use lazy sequence primes from clojure.contrib.lazy-seqs in reverse order and simply working down divide target number by decreasing prime number looking for zero remainder. Well, it turns out that since primes is infinite simple filtering doesn’t work. Below enters infinite loop and should eventually run out of memory, I think. I just killed the process.

user=> (use '[clojure.contrib.lazy-seqs :only (primes)])
nil
user=> (filter #(= 0 (rem 600851475143 %)) primes)
; never came back...

After pondering some time I decided to abandon “primes in reverse order approach”. If I can not realize infinite lazy sequence, I can look for specific item using nth. Fast forward some time and here is the final solution.


 1 (use '[clojure.contrib.lazy-seqs :only (primes)])
 2 
 3 (defn max-fctr [target]
 4   (loop [nbr target, idx 0, fctrs []]
 5     (let [cp (nth primes idx), rmndr (rem nbr cp)]
 6       (if (= 0 rmndr)
 7           (recur (/ nbr cp) 0 (conj fctrs cp))
 8         (if (> cp nbr)
 9           (last fctrs)
10           (recur nbr (inc idx) fctrs))))))
11 (max-fctr 600851475143)

On line 3 I setup explicit recursion point. On line 5 I capture current prime (cp) from primes and compute reminder (rmndr). If reminder is zero we set target number to a larger factor as an optimization, reset primes index to zero and add smaller factor to an accumulating vector of primes. Otherwise, to terminate recursion we compare current prime to our target number and if it is greater return last found factor. If current prime is less, we recur with incremented index. On my laptop above runs in about 20 msecs.

Well it is time to check how above compares to other solutions at project euler’s site… Overall, not bad – my solution is on a shorter side and only the 2nd one that takes advantage of lazy primes. Please note that I did not compare performance of other solutions. The best solution IMHO is by mtgred followed by davirus which looses because it relies on external functions (eg. prime?). Interestingly the best solution has approach similar to mine and both shorter and faster (runs in about 800 microseconds on my laptop). Let me reproduce it here (I made it slightly more readable).


(defn prime-factors [n]
  (let [(some #(if (= 0 (rem n %)) %) primes)]
    (if (= f n) 
      #{f} 
      (conj (prime-factors (/ n f)) f))))
(apply max (prime-factors 600851475143)

mtgred uses clever trick with implicit recursion which dispenses with loop/recur and accumulating vector of factors. Also, using some instead of nth eliminates need for index and, my guess, makes it faster. Since unordered set is used #{f} it requires (apply max …).

Clojure Zipper Over Nested Vector

23 Nov

One of my coworkers posted a programming challenge of solving “triangle puzzle”.

Puzzle

Given above, start at the root then move done, examine left and right nodes, choose the max, then go down again. Please note that this is not a breadth first find max path problem seen at projecteuler-18. There is no backtracking, if we go left or right, higher possible sum could be potentially missed. For example, if we replace 8 in 3rd row and 5 in 4th row with 9, the result will still be the same – 27 and not 29.

Solution

clojure.zip has a build in support of turning nested vector into a zipper. So, the first step is to express our puzzle as a nested vector.

(require '[clojure.zip :as z])

(def nv [5 
         [9 
          [4 
           [0] [7]] 
          [6 
           [7] [1]]]
         [6 
          [6 
           [7] [1]]
          [8
           [1] [5]]]])

And here is the solution.

 1 (defn left-node [node-loc]
 2   (-> node-loc z/right z/down))
 3
 4 (defn right-node [node-loc]
 5   (-> node-loc z/right z/right z/down))
 6
 7 (defn sum-path [rt-loc, cf]
 8   (loop [rs (-> rt-loc z/down z/node) 
 9         cn (-> rt-loc z/down)]
10           (if-not (z/right cn)
11             rs
12             (let [lv (z/node (left-node cn))
13                   rv (z/node (right-node cn))]
14               (if (cf lv rv) 
15                 (recur (+ rs lv) (left-node cn))
16                 (recur (+ rs rv) (right-node cn)))))))

Pretty evident what is going on here. We use two utility functions to get left and right node. sum-path accepts starting node and compare function, depth first walks the zipper comparing left and right node values.

(def zv (z/vector-zip nv))
; max path - returns 27
(sum-path zv >)
; min path - return 18
(sum-path zv <)

Clojure Macro Average Time

22 Nov

In my previous post on binary search I wanted to evaluate performance of two implementations. Clojure provides dotimes and time. Wouldn’t it be nice if we can string them together, parse output from time and calculate average execution time. Checking source of time and poking around I discovered with-out-str macro. It is quite simple. It just dynamically rebinds default output stream *out* to java.io.StringWriter. Having this the rest is piece of cake.

 1 (use '[clojure.contrib.str-utils :only (re-split)])
 2 (defmacro time-avg
 3   "Captures time output, parses it and calculates average.
 4   Modeled after with-out-str. Example:
 5   (time-avg
 6      (dotimes [_ 5] (time (.run #(Thread/sleep (rand 100))))))"
 7   [& body]
 8   `(let [s# (java.io.StringWriter.)]
 9      (binding [*out* s#]
10        ~@body)
11      (let [strng# (str s#)
12            lns# (re-split #"\n" strng#)
13            flt-regex# #"\d+\.\d+"
14            lns-flts# (filter #(re-seq flt-regex# %) lns#)
15            flts# (map 
16              #(Float/parseFloat (first (re-seq flt-regex# %))) 
17              lns-flts#)
18            sum# (reduce + flts#)]
19        (println strng#)
20        (if (seq flts#)
21          (println "\"Average of" (count flts#) "run/s is" 
22               (/ sum# (count flts#)) "msecs\"")))))

Starting on line 11 we parse output using regexes, look for doubles, sum them up and print calculated average. The only thing I stumbled on was using suffix # in macro variable names.

Clojure code with Vim in WordPress

31 Oct

As of now WordPress doesn’t support pretty print of clojure code.
But fear not… To embed nicely formatted line-numbered code into default WordPress blog using vim.

  1. Open your file in vim.
    :set number
  2. Visually select code.
    :TOhtml
  3. Select html between

    <body></body>
  4. Paste it in your blog (in HTML view) inside

    <pre></pre>
  5. Don’t switch to Visual view. Save/update your post.

That is it. Here is a sample.

2011-01-13 Update
Default format of TOhtml changed in recent version of Vim (7.3.46 for Windows). It uses css styles from page header element. To get inline styles:

   :let g:html_use_css = 0

Clojure Binary Search

20 Oct

Embarking on “killing two birds with one stone” (play with clojure and brush up on algorithms) adventure, here is my humble implementation of binary search. Without consulting outside sources after about 10 hours in total (my train commute is about an hour each way) the best I could do was:

 1 (defn indexed [coll] (map vector (iterate inc 0) coll))
 2 (defn 
 3   #^{:test (fn [] (tst bin-srch-indx))}
 4   bin-srch-indx [coll el]
 5   (loop [vc-indx (vec (indexed coll))]
 6     (let [sz (count vc-indx), mi (quot sz 2),
 7          ce (get vc-indx mi), [idx v] ce]
 8       (cond
 9         (< el (first coll)) nil
10         (> el (last coll)) nil
11         (and (= mi 0) (or (> el v) (< el v))) nil
12         (= el v) idx
13         (> el v) (recur (subvec vc-indx mi sz))
14         (< el v) (recur (subvec vc-indx 0 mi))))))

indexed on line 1 is an interesting by itself. I came across it in Programming Clojure by Stuart Halloway. It creates new lazy sequence by calling vector with pair of first, then second, etc. items in (iterate (inc 0)) and passed in collection until latter collection is exhausted. bin-srch-indx converts input sequence into a vector, uses cond to catch edge cases and finally uses clojure’s recur and subvec to half the searched collection. On line 3 there is an example of closure to hook in clojure’s build-in test facility. We will talk about test function later.

Unsatisfied with above, I thought there might be a better algorithm for binary search – time to enlist help of Wikipedia. Here is a faster version which doesn’t use subvec but manipulates boundaries.

 1 (defn 
 2   #^{:test (fn [] (tst bin-srch))}
 3   bin-srch [coll el]
 4     (let [v (vec coll)]
 5       (loop [li (int 0), ri (int (count v))]
 6         (if (= (get v 0) el)
 7           0
 8           (let [p (int (/ (- ri li) 2))]
 9             (if
10               (> p 0)
11                 (let [next-p (int (+ li p)),
12                      ce (get v next-p)]
13                   (cond
14                     (= ce el) next-p
15                     (> ce el) (recur li next-p)
16                     (< ce el) (recur next-p ri)))
17               nil))))))

Being a diligent developer, I included test function to prove that code works.

  [f]
    (let 
      [ub1 99999
      ub2 (+ ub1 1)
      tc1 (range 10 ub1 1)
      tc2 (conj tc1 ub2)
      tc3 [11 13 15]
      pli 1
      pui 98
      dc 500
      pcol (take 100 (drop dc primes))]
        (assert (= nil (f tc1 1)))
        (assert (= nil (f tc1 (+ ub2 1))))
        (for [i (range 10 (+ ub1 1) 1)] 
          (assert (= (- i 10) (f tc1 i))))
        (for [i (range 10 (+ ub2 1) 1)] 
          (assert (= (- i 10) (f tc2 i))))
        (assert (= nil (f tc3 14)))
        (assert (= pli
           (f pcol (first (take 1 
              (drop (+ dc pli) primes))))))
        (assert (= pui
           (f pcol (first (take 1 
              (drop (+ dc pui) primes))))))))

Above sets up some test data and using assert verifies results. So, in your friendly REPL you can try:

user=> (use '[clojure.contrib.lazy-seqs :only (primes)])
nil
user=> (test bin-srch)
:ok
user=> (test bin-srch-indx)
:ok
user=> (bin-srch (take 100 primes) 11)
4

Also, here is a complete file bin-srch.clj. Since WordPress upload attachment gods don’t like clj extension, I had to use doc file extension. By the way, I don’t think it would hard to distinguish between text and binary files during upload and allow former. Let’s pray together: Hare WordPress upload attachment gods…:-)

In the next post we will take look how, with a help of clojure macro, we can measure running time of two implementations above.