Tag Archives: Clojure

Intro into AI Natural Language Processing Programming Solutions

4 Apr

As thousands students around the world I took Intro into AI class offered online for free by Standford University. It was an eye opening experience. To paraphrase somebody’s quote: “Any advance technology seems like magic to uninformed”. The class did the excellent to lift the veil over the magic of AI by covering fundamental concepts of Bayes Networks, Propositional Logic, Classical Planning, Markov Decision Processes, Particle Filters, Game Theory, Natural Language Processing (NLP) and many others.

But as a hands on programmer I got tired of pen and paper number crunching (I did program some trivial calculations) so optional NLP programming exercises scratched a coding itch. Solutions (in Clojure, of course) are available at github.com/vitalyper/aiclass. Readme has more details not to be repeated here. Here are some thoughts after solving more challenging problem 2 – restore two characters shredded text.

|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|”A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|” |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |

Hypothesis 1
Any solution would require use of some external data. I used 2500+ most frequently used English words but bigram probability distribution compiled from sufficiently large corpus would work better. This echoes theme from NLP lectures – solutions to many NLP problems require collection of large amount of data. For example, spell checkers are trivial to implement if you have good set of misspelled words.

Observation 1
Brute force is a reasonable approach when you have access to a decent hardware. I didn’t so I had to limit permutations in steps 1 and 2. Parallelizable brute force approach would be much simpler and slower.

Observation 2
Applicability to other languages.
As it stands solution could be applicable to other languages with the following properties: 1) word based 2) left-to-right direction 3) same punctuation set. The most used words set would have to change of course. Adjusting solution to trigrams would requie some additional work and refactoring.

Clojure Conj 2011 Recap

17 Nov

Having attended first Conj last year I was looking forward to one more Clojure deep dive. And I was not disappointed once again. Similar to the last year the conference took place in Raleigh, NC but this time hotel was right in the middle of downtown with walking distance access to restaurants, shops, etc. All sessions were recorded, look forward to upcoming videos. Here is my attempt at quick recap with some personal comments.

Day 1, Thursday, Nov 10th

  • Session 1, Ideas Rant, Stuart Sierra
    Stuart, with occasion inspired blue hair, started the conference with rant-like collection of ideas/suggestions. Among them were: using clojure read/print for serialization, how to implement your own clojure structures for drop-in replacement for closure ones. He also spent some time talking about java.util.concurrent goodies as an option for cases where Clojure provided state management primitives don’t fit the bill. Somewhat mixed message IMO, given the clojure’s goal to hide much of the Java. Stuart also mentioned clojure.test.generative, input test generation library inspired by Haskell’s QuickCheck.
  • Session 2, Concurrent Stream Processing, David McNeil, Revelytix, Inc.
    One of two very good session on “Clojure in the real World”. Problem is to process SPARQL federated queries against multiple SQL datasources, aggregate and combine results. Solution was inspired by unix pipe processing.  s-expressions via custom DSL were compiled into tree of pipe connected nodes, where node could represent query or sub-query. Aggregation and post processing of the query results was done via fork/join pool. Few interesting notes:
    Fork/join is well suited only for cpu-bound work so any blocking io was done prior when fetching results.
    To allow for large streams system monitored virtual and transparently switched to memory buffered pipes.
  • Session 3, Clojail, Anthony Grimes
    Anthony quite possibly is the youngest member of Clojure community, he  is 17. Avid cat lower and Stuart Sierra’s groupie  with occasion inspired green hair. Main message : code is dangerous, but dangerous is good except there few cases when it is not: IRC channel, learning, etc. where sandboxing provides an answer. Reuse JVM sandboxing where possible. Sandboxing dynamic Clojure features is quite hard, Clojail is an attempt to achieve this. It has good customization and default security profile.
  • Session 4, ClojureScript, Chris Houser
    Chris presented some low-level details of ClojureScript (CLJS) from data types -> protocols -> macros -> compiler. CLS vector is copy-on-write wrapper over js array. 2 types of map implementations: ObjMap (when keys are string-like), HashMap when keys are objects. Protocols can extend foreign class into CLJS. Chris warned about potential implementation clash if you don’t own protocol and data structure.
  • Session 5, Striving To Make Things Simple and Fast, Phil Bagwell, Typesafe
    Phil made low level data structure talk bearable. He started on showing how easy it is to make Scala data structure code parallel via recently added support for parallel collections. Then he delved into details on solving the problem of developing fast immutable vector. It was fascinating to see his description of how work progressed step by step. In the end it took team 18 months to come up with performing solution. For more details see RRB-Trees: Efficient Immutable Vectors
  • Session 6, Intro to Logic Programming with Clojure, Ambrose Bonnaire-Sergeant
    Logic programming was one of the main themes of the Conj. Rich Hickey and Clojure/Core expressed a strong interest in this area. But in terms of the presentation, Ambrose came up short. He is young, sharp student from Australia but his delivery lacked conviction, coherence and deep understanding of the subject matter. After a while many attendees faded away into typing. IMO, this area would have been better covered by David Nolan, the author of core.logic but he was presenting next day.
  • Session 7, Hacking The Human Genome, Arnoldo J. Muller-Molina
    A definite hit, an excellent example of power of Clojure in the real world. Arnoldo used Clojure and similarity search to discover the rest of “landing sites” sequences of DNA of which only 10% have been known.
  • Session 8, Ousterhout’s Dichotomy Isn’t (Part 2 of the Simplicity/Power/Focus Series), Stuart Halloway
    Another rant like session from “Stue”. No many notes on this one, I must have been mentally fatigued plus there was very little to take away.
  • After Session, Scheme’s miniKarnen, William Byrd and Daniel Friedman
    These guys stole the day 1 show, they rock. Although I came up few minutes late, I was able to see them in the act, going through Scheme code, changing and trying things on the fly, making statements like “We can make our program to run backwards, you do want to run your programs backwards, don’t you?” Checkout their “Schemer Series” book The Reasoned Schemer

Day 2, Friday, Nov 11th

  • Session 1, Master Plan for Clojure Enterprise Domination, Neal Ford, ThoughtWorks
    Neal discussed how to sneak in Clojure into enterprise. He shared his experience on well-known CTO prospective of bringing new programming language into a mix. According to him Clojure can help solve enterprise problems in the areas of: RESTful integration, ORM, and rule engines. Neal proposed “cat burglary” approach which worked fine for Rails/Cucumber and Groovy/Gradle. Interesting thought on perceived rivalry with Scala – at this point Scala is Clojure’s friend and the enemy is Java’s status quo. When selling Clojure to management one should start in the following order: another language for JVM -> functional -> dynamic. At the end he appealed for wrapping everything in order to build bridges to other technologies.
  • Session 2, Predicate Dispatch, David Nolen
    David is a great example of young hacker generation drinking from a Clojure’s fire hose. His talk centered around a finding a better way of function dispatch. Although Clojure’s multimethods dispatch on more than first argument, they are essentially closed, the dispatch function is hardwired – potential extenders cannot add new dispatch conditions not already defined by the original dispatch function. Similarly, pattern matching favored in some functional languages exhibits the same problem. Predicate dispatch generalizes traditional method dispatch mechanisms by permitting arbitrary predicates to control method applicability and by using logical implication between predicates as the overriding relationship. As of now predicate dispatch is not in Clojure but David showed that it is plausible and could be added in the near future.
  • Session 3, Extending Javascripts from Clojure, Kevin Lynagh
    One of the least informative sessions – a lot of fluff not enough stuff. Kevin reiterated Javascript’s shortcomings: global vars, lack of modules, etc. and showed how ClojureScript overcomes them, nothing new here.
  • Session 4, From Linear to Incremental, Christophe Grand
    Christophe took on an interesting problem of incremental parsing/compilation in order to prevent full parsing/recompilation for small code changes. I wish I could give you more details but Christophe’s heavy accent diminished my note talking capability. The only other interesting detail I noted is that Christophe used technique of stack transplantation via diff memoization.
  • Session 5, Logs as Data, Mark McGranaghan, Heroku
    Mark presentation centered around Clojure implementation of central distributed logging at Heroku. Don’t treat logging as lines in the text file but as logging events with multiple uses: log aggregation, system status, system performance, etc. Parse log lines and convert them to Clojure map to be forwarded to logger processors.
  • Session 6, Modeling the world Bayesian network in Clojure, Chas Emerick
    This presentation covered Bayesian network (BN) from AI realm and it’s applicability to problem domains of process control, decision systems, and prediction. Chas mentioned that he used BN in his own business in document processing system to learn about different document formats. Chas announced that he will release probability library “raposo” soon.
  • Session 7, Clojure and Adroid, Daniel Gomez
    Daniel summed up the state of Clojure development for Android platform. He started with describing high level architecture of Dalvik VM and how it is different from JVM. Through rigorous testing and profiling Daniel showed that Clojure lags in the areas of code size and startup time. His presentation sparked a discussion on how to minimize Clojure’s runtime footprint and startup time.
  • Session 8, Keynote, Rich Hickey
    No rant-like talk from Rich this time. His presentation centered around future work stream for Clojure/Core team. Future areas of interest are: making Clojure leaner with different build targets, unification with ClojureScript, elusive goal of ClojureInClojure, getting rid of dependency on ASM byte code manipulation library, InvokeDynamic support in Java 7, leveraging core.logic for type checking and predicate dispatch, parallelism with algoritms not collections, transients and pods, accumulators and extensible reader.

Day 3, Saturday, Nov 12th

  • Session 1, Cascalog, Nathan Martz, Twitter
    Nathan’s presentation centered around anaylyzing Twitter’s Big Data with Clojure custom DSL on top of Hadoop’s MapReduce. I immediately draw parallel with earlier David’s McNeil’s presentation. Both share similar architecture but differ in query DSL and in task running backend component – one uses fork/join while other Hadoop. Cascalog supports different type of predicates: filters, aggregators, generators, and user defined functions. Joins in Cascalog are implicit and subqueries are supported. Nathan ran trough a lot of examples from trivial ones to more advanced.
  • Session 2, Functional Data Structures, Daniel Spiwek
    Daniel’s presentation was most enthusiastic by far. My first thought was – how many cups of coffee he had in the morning? But other people whom seen him present before pointed out that this is his style. Two groups of data structures: sequential and associative. Provide immutability and use structural sharing for performance. Implementation goal is to minimize copying and maximize sharing. Amortization as a way of distribute cost of infrequent slow operation. The latest state of the art is hybrid approach: combine sequential and associative characteristics. Pioneered by Rich Hickey and then ported to other languages. CPU locality of reference and JVM heap locality are the underlying low level reasons why they are fast enough.
  • Session 3, Macronomicon, Michael Fogus, Clojure/Core
    Another rant-like talk, somewhat more coherent and useful, centered primarily on the subject of macro use and abuse. Fogus echoed Christophe Grand’s presentation from last year on suggested macro use cases: creating binding forms, control flow and syntactic sugar when warranted. Drew attention to his work in progress on pre/post condition library Trammel. At the end Fogus showed how macros could be used to shift runtime calculation to a compile time.
  • Session 4, Lightning Talks, Ten minutes/Various Presenters
    CyclicTime – jodatime like library for sane date computations.
    Immutant – application server for clojure on top of Jboss using JRuby.
    UberLisp – lisp1 interpreter for the Arduino. The best by far, whittingly presented, auidence bursted into applauses several times.
    Korma– Sugar coated DSL on top of SQL.
  • Session 5, Performance in the Wild, Craig Andera, Clojure/Core
    Good presentation on how to optimize Clojure in a real-life project. Many common sense, but useful nevertheless techniques:
    Establishing performance baseline early in development cycle helps to catch deceptively small changes that could effect performance.
    Performance is not a scalar metric.
    Be aware of assumptions it a data model.
    Measure distribution of latency under most likely use case/s.
    Optimization loop: bechmark -> analyze -> recommend -> optimize; done when performance targets are met.
    Don’t make mistake of equating number of threads with concurrent clients.
    Tools used: jmeter plugin to chart the results, YourKit for profiling.
  • Session 6, Programming Music With Overtone, Sam Aaron
    Skipped this one to catch the flight.

Overall, it was a great experience. I talked to few dozen of people about Clojure. Similarly to the last year there were quite a few at playing/evaluating/trying to introduce stage. But I did see signs of wider adoption. Met few guys who took a stance “No mas Java” and found 100% Clojure jobs after a while. Met one guy from London who works at a bank where IT director had been bitten by Clojure bug and was pushing it from the top. Clojure developer community is quite diverse with three main backgrounds: young (mostly Ruby) hackers, old time Lispers, and Java converts. Relevance (Clojure/Core) does an admirable job of evolving language and ecosystem.

Clojure Zipper Over Nested Vector

23 Nov

One of my coworkers posted a programming challenge of solving “triangle puzzle”.

Puzzle

Given above, start at the root then move done, examine left and right nodes, choose the max, then go down again. Please note that this is not a breadth first find max path problem seen at projecteuler-18. There is no backtracking, if we go left or right, higher possible sum could be potentially missed. For example, if we replace 8 in 3rd row and 5 in 4th row with 9, the result will still be the same – 27 and not 29.

Solution

clojure.zip has a build in support of turning nested vector into a zipper. So, the first step is to express our puzzle as a nested vector.

(require '[clojure.zip :as z])

(def nv [5 
         [9 
          [4 
           [0] [7]] 
          [6 
           [7] [1]]]
         [6 
          [6 
           [7] [1]]
          [8
           [1] [5]]]])

And here is the solution.

 1 (defn left-node [node-loc]
 2   (-> node-loc z/right z/down))
 3
 4 (defn right-node [node-loc]
 5   (-> node-loc z/right z/right z/down))
 6
 7 (defn sum-path [rt-loc, cf]
 8   (loop [rs (-> rt-loc z/down z/node) 
 9         cn (-> rt-loc z/down)]
10           (if-not (z/right cn)
11             rs
12             (let [lv (z/node (left-node cn))
13                   rv (z/node (right-node cn))]
14               (if (cf lv rv) 
15                 (recur (+ rs lv) (left-node cn))
16                 (recur (+ rs rv) (right-node cn)))))))

Pretty evident what is going on here. We use two utility functions to get left and right node. sum-path accepts starting node and compare function, depth first walks the zipper comparing left and right node values.

(def zv (z/vector-zip nv))
; max path - returns 27
(sum-path zv >)
; min path - return 18
(sum-path zv <)

Clojure Macro Average Time

22 Nov

In my previous post on binary search I wanted to evaluate performance of two implementations. Clojure provides dotimes and time. Wouldn’t it be nice if we can string them together, parse output from time and calculate average execution time. Checking source of time and poking around I discovered with-out-str macro. It is quite simple. It just dynamically rebinds default output stream *out* to java.io.StringWriter. Having this the rest is piece of cake.

 1 (use '[clojure.contrib.str-utils :only (re-split)])
 2 (defmacro time-avg
 3   "Captures time output, parses it and calculates average.
 4   Modeled after with-out-str. Example:
 5   (time-avg
 6      (dotimes [_ 5] (time (.run #(Thread/sleep (rand 100))))))"
 7   [& body]
 8   `(let [s# (java.io.StringWriter.)]
 9      (binding [*out* s#]
10        ~@body)
11      (let [strng# (str s#)
12            lns# (re-split #"\n" strng#)
13            flt-regex# #"\d+\.\d+"
14            lns-flts# (filter #(re-seq flt-regex# %) lns#)
15            flts# (map 
16              #(Float/parseFloat (first (re-seq flt-regex# %))) 
17              lns-flts#)
18            sum# (reduce + flts#)]
19        (println strng#)
20        (if (seq flts#)
21          (println "\"Average of" (count flts#) "run/s is" 
22               (/ sum# (count flts#)) "msecs\"")))))

Starting on line 11 we parse output using regexes, look for doubles, sum them up and print calculated average. The only thing I stumbled on was using suffix # in macro variable names.

Clojure code with Vim in WordPress

31 Oct

As of now WordPress doesn’t support pretty print of clojure code.
But fear not… To embed nicely formatted line-numbered code into default WordPress blog using vim.

  1. Open your file in vim.
    :set number
  2. Visually select code.
    :TOhtml
  3. Select html between

    <body></body>
  4. Paste it in your blog (in HTML view) inside

    <pre></pre>
  5. Don’t switch to Visual view. Save/update your post.

That is it. Here is a sample.

2011-01-13 Update
Default format of TOhtml changed in recent version of Vim (7.3.46 for Windows). It uses css styles from page header element. To get inline styles:

   :let g:html_use_css = 0

Clojure Binary Search

20 Oct

Embarking on “killing two birds with one stone” (play with clojure and brush up on algorithms) adventure, here is my humble implementation of binary search. Without consulting outside sources after about 10 hours in total (my train commute is about an hour each way) the best I could do was:

 1 (defn indexed [coll] (map vector (iterate inc 0) coll))
 2 (defn 
 3   #^{:test (fn [] (tst bin-srch-indx))}
 4   bin-srch-indx [coll el]
 5   (loop [vc-indx (vec (indexed coll))]
 6     (let [sz (count vc-indx), mi (quot sz 2),
 7          ce (get vc-indx mi), [idx v] ce]
 8       (cond
 9         (< el (first coll)) nil
10         (> el (last coll)) nil
11         (and (= mi 0) (or (> el v) (< el v))) nil
12         (= el v) idx
13         (> el v) (recur (subvec vc-indx mi sz))
14         (< el v) (recur (subvec vc-indx 0 mi))))))

indexed on line 1 is an interesting by itself. I came across it in Programming Clojure by Stuart Halloway. It creates new lazy sequence by calling vector with pair of first, then second, etc. items in (iterate (inc 0)) and passed in collection until latter collection is exhausted. bin-srch-indx converts input sequence into a vector, uses cond to catch edge cases and finally uses clojure’s recur and subvec to half the searched collection. On line 3 there is an example of closure to hook in clojure’s build-in test facility. We will talk about test function later.

Unsatisfied with above, I thought there might be a better algorithm for binary search – time to enlist help of Wikipedia. Here is a faster version which doesn’t use subvec but manipulates boundaries.

 1 (defn 
 2   #^{:test (fn [] (tst bin-srch))}
 3   bin-srch [coll el]
 4     (let [v (vec coll)]
 5       (loop [li (int 0), ri (int (count v))]
 6         (if (= (get v 0) el)
 7           0
 8           (let [p (int (/ (- ri li) 2))]
 9             (if
10               (> p 0)
11                 (let [next-p (int (+ li p)),
12                      ce (get v next-p)]
13                   (cond
14                     (= ce el) next-p
15                     (> ce el) (recur li next-p)
16                     (< ce el) (recur next-p ri)))
17               nil))))))

Being a diligent developer, I included test function to prove that code works.

  [f]
    (let 
      [ub1 99999
      ub2 (+ ub1 1)
      tc1 (range 10 ub1 1)
      tc2 (conj tc1 ub2)
      tc3 [11 13 15]
      pli 1
      pui 98
      dc 500
      pcol (take 100 (drop dc primes))]
        (assert (= nil (f tc1 1)))
        (assert (= nil (f tc1 (+ ub2 1))))
        (for [i (range 10 (+ ub1 1) 1)] 
          (assert (= (- i 10) (f tc1 i))))
        (for [i (range 10 (+ ub2 1) 1)] 
          (assert (= (- i 10) (f tc2 i))))
        (assert (= nil (f tc3 14)))
        (assert (= pli
           (f pcol (first (take 1 
              (drop (+ dc pli) primes))))))
        (assert (= pui
           (f pcol (first (take 1 
              (drop (+ dc pui) primes))))))))

Above sets up some test data and using assert verifies results. So, in your friendly REPL you can try:

user=> (use '[clojure.contrib.lazy-seqs :only (primes)])
nil
user=> (test bin-srch)
:ok
user=> (test bin-srch-indx)
:ok
user=> (bin-srch (take 100 primes) 11)
4

Also, here is a complete file bin-srch.clj. Since WordPress upload attachment gods don’t like clj extension, I had to use doc file extension. By the way, I don’t think it would hard to distinguish between text and binary files during upload and allow former. Let’s pray together: Hare WordPress upload attachment gods…:-)

In the next post we will take look how, with a help of clojure macro, we can measure running time of two implementations above.

Why Clojure?

27 May
Tools and languages do matter. Functional programming matters as well. To see why read why functional programming matters . Recently decades old ideas started slipping into Java land. Initially Scala, then Clojure gave Java programmers a choice of radically new languages running on top of solid and time-tested JVM. I briefly played with Scala, nothing resembling a real project, but when Clojure came out I knew it is time to get my feet wet. So, why Clojure? Here are the reasons in priority order.


1. Seamless interoperability with Java.
Calling Java from Clojure is easy via syntactic sugar coating. Any Java library is available right way. Clojure collections implement Java collection interfaces and are thread safe. Clojure functions implement Runnable and Callable. As a consequence, selling Clojure to your hairy-pointed boss becomes easier. It goes something like: functional languages are cool and are cutting edge at the moment, Clojure runs in JVM and can easily reuse our existing code base, etc. Let’s use it for non-mission critical components first.


2. State is immutable by default.
Writing complex concurrent programs in Java is hard even with Java 1.5 high level concurrency API. In contrast to Java where mutable state is a default and norm, Clojure defaults to immutable state. There are four flavors of explicit mutation API. Since one can’t get away from having a state and side-effects in any nontrivial program this leads to a design decision of splitting a program into immutable and mutable parts. Obviously, it will make sense to maximize the former and minimize the latter. Immutable part tested in REPL and/or by Clojure/Clojure-contrib test facilities is guaranteed to work every time in most complex call chains. This is a HUGE. To parallel this to Java let’s imagine that you only have to write unit tests. No need for integration and even regression tests – your unit tests assert that functions inputs and outputs behave as expected so combining them should work as well.


3. Macro facility extends language to your needs.
No more waiting for language babysitters to add this or that. You need a language construct to solve your particular need – use a macro. This also gives you a competitive advantage over others who use more rigid and verbose OO languages. That is what Paul Graham meant in his revenge of the nerds, a must read.


4. Other goodies: lazy-sequences, multi methods, software transaction memory, REPL.
Let me mention a drawback as well. As other dynamic languages Clojure lacks  strong typing.  It is dynamic language with support for type hints via metadata. Type system purists would argue that this is a crucial flaw. I agree to a degree. Yes, compile time checking is important but no sane person would run a Clojure without testing functions/parts in REPL. Plus, if you are writing unit tests, I do, you have another level of defense.
In the next post I will get my feet wet, writing a binary search in Clojure.