But as a hands-on programmer I got tired of pen-and-paper number crunching (I did program some trivial calculations), so the optional NLP programming exercises scratched a coding itch. Solutions (in Clojure, of course) are available at github.com/vitalyper/aiclass. The readme has more details that I won't repeat here. Here are some thoughts after solving the more challenging problem 2: restoring text that was shredded into two-character-wide strips.
|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|”A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|” |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |
Hypothesis 1
Any solution would require some external data. I used the 2500+ most frequently used English words, but a bigram probability distribution compiled from a sufficiently large corpus would work better. This echoes a theme from the NLP lectures: solutions to many NLP problems require collecting a large amount of data. For example, spell checkers are trivial to implement if you have a good set of misspelled words.
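To make the role of the word list concrete, here is a minimal sketch of how a candidate reconstruction might be scored against it. This is not the code from the repository: the tiny common-words set and the score-line function are hypothetical stand-ins, and the real list has 2500+ entries.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the real list of frequent English words.
(def common-words #{"the" "of" "is" "this" "a" "to"})

(defn score-line
  "Counts how many whitespace-separated tokens of line are known words.
  Case is ignored and non-letters are stripped before the lookup."
  [line]
  (->> (str/split (str/lower-case line) #"\s+")
       (map #(str/replace % #"[^a-z]" ""))
       (filter common-words)
       count))

(score-line "This is the way")   ; three tokens are in the set
(score-line "hT sii ht eawy")    ; none are
```

A column ordering that produces more known words scores higher, which gives a search something to maximize.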
Observation 1
Brute force is a reasonable approach when you have access to decent hardware. I didn’t, so I had to limit permutations in steps 1 and 2. A parallelizable brute-force approach would be much simpler, though slower.
Observation 2
Applicability to other languages.
As it stands, the solution could be applied to other languages with the following properties: 1) word based, 2) left-to-right direction, 3) same punctuation set. The set of most used words would have to change, of course. Adjusting the solution to trigrams would require some additional work and refactoring.
Overall, it was a great experience. I talked to a few dozen people about Clojure. As last year, quite a few were at the playing/evaluating/trying-to-introduce stage. But I did see signs of wider adoption. I met a few guys who took a “No mas Java” stance and found 100% Clojure jobs after a while. I met one guy from London who works at a bank where the IT director had been bitten by the Clojure bug and was pushing it from the top. The Clojure developer community is quite diverse, with three main backgrounds: young (mostly Ruby) hackers, old-time Lispers, and Java converts. Relevance (Clojure/Core) does an admirable job of evolving the language and ecosystem.
The first bright idea was to use the lazy sequence primes from clojure.contrib.lazy-seqs in reverse order and simply work down, dividing the target number by decreasing primes and looking for a zero remainder. Well, it turns out that since primes is infinite, simple filtering doesn’t work. The expression below enters an infinite loop and should eventually run out of memory, I think. I just killed the process.
user=> (use '[clojure.contrib.lazy-seqs :only (primes)])
nil
user=> (filter #(= 0 (rem 600851475143 %)) primes)
; never came back...
After pondering for some time I decided to abandon the “primes in reverse order” approach. If I cannot realize an infinite lazy sequence, I can still look for a specific item using nth. Fast forward some time and here is the final solution.
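A small sketch of why this works: filter itself is lazy, and it is the REPL trying to print the whole result that never finishes. Operations that stop early are safe on an infinite sequence. Since clojure.contrib is not always available, the block below uses my-primes, a hypothetical and deliberately naive stand-in for primes.

```clojure
;; Naive primality test, good enough for a demonstration.
(defn prime? [n]
  (and (> n 1)
       (not-any? #(zero? (rem n %))
                 (take-while #(<= (* % %) n) (iterate inc 2)))))

;; Hypothetical stand-in for clojure.contrib.lazy-seqs/primes (infinite, lazy).
(def my-primes (filter prime? (iterate inc 2)))

;; Each of these realizes only as much of the sequence as it needs.
(nth my-primes 4)                                          ; fifth prime
(take-while #(< % 20) my-primes)                           ; primes below 20
(first (filter #(zero? (rem 600851475143 %)) my-primes))   ; smallest prime factor
```

The last expression stops at the first match, whereas printing the full filter result keeps scanning for factors that do not exist.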
 1 (use '[clojure.contrib.lazy-seqs :only (primes)])
 2
 3 (defn max-fctr [target]
 4   (loop [nbr target, idx 0, fctrs []]
 5     (let [cp (nth primes idx), rmndr (rem nbr cp)]
 6       (if (= 0 rmndr)
 7         (recur (/ nbr cp) 0 (conj fctrs cp))
 8         (if (> cp nbr)
 9           (last fctrs)
10           (recur nbr (inc idx) fctrs))))))
11 (max-fctr 600851475143)
On line 4 I set up an explicit recursion point. On line 5 I capture the current prime (cp) from primes and compute the remainder (rmndr). If the remainder is zero, we replace the target number with the quotient (the remaining cofactor) as an optimization, reset the primes index to zero, and add the factor just found to an accumulating vector of primes. Otherwise, to terminate the recursion, we compare the current prime to the target number and, if it is greater, return the last factor found. If the current prime is not greater, we recur with an incremented index. On my laptop the above runs in about 20 msecs.
Well, it is time to check how the above compares to other solutions at Project Euler’s site… Overall, not bad: my solution is on the shorter side and only the 2nd one that takes advantage of lazy primes. Please note that I did not compare the performance of the other solutions. The best solution IMHO is by mtgred, followed by davirus, which loses because it relies on external functions (e.g. prime?). Interestingly, the best solution takes an approach similar to mine and is both shorter and faster (it runs in about 800 microseconds on my laptop). Let me reproduce it here (I made it slightly more readable).
(defn prime-factors [n]
  ;; f is the smallest prime that divides n
  (let [f (some #(if (= 0 (rem n %)) %) primes)]
    (if (= f n)
      #{f}
      (conj (prime-factors (/ n f)) f))))
(apply max (prime-factors 600851475143))
mtgred uses a clever trick: plain self-recursion, which dispenses with loop/recur and the accumulating vector of factors. Also, using some instead of nth eliminates the need for an index and, my guess, makes it faster. Since an unordered set, #{f}, is used, it requires (apply max …).
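The two building blocks of that solution can be tried in isolation. This is only an illustration on small made-up inputs, not part of either solution.

```clojure
;; some returns the first truthy value produced by the function, not the
;; element that matched, which is why the function returns % itself.
(some #(if (even? %) %) [1 3 4 6])   ; => 4
(some #(if (even? %) %) [1 3 5])     ; => nil

;; Factors are collected by conj-ing onto a set; max picks the largest.
(apply max (conj #{7} 3 5))          ; => 7
```

Because some stops at the first match, it is safe to call on the infinite primes sequence as long as a match is guaranteed to exist.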
Given the above, start at the root, then move down: examine the left and right nodes, choose the max, then go down again. Please note that this is not the breadth-first, find-the-max-path problem seen in projecteuler-18. There is no backtracking: once we go left or right, a higher possible sum could be missed. For example, if we replace 8 in the 3rd row and 5 in the 4th row with 9, the result will still be the same: 27 and not 29.
clojure.zip has built-in support for turning a nested vector into a zipper. So, the first step is to express our puzzle as a nested vector.
(require '[clojure.zip :as z])
(def nv [5
         [9
          [4
           [0] [7]]
          [6
           [7] [1]]]
         [6
          [6
           [7] [1]]
          [8
           [1] [5]]]])
And here is the solution.
(defn left-node [node-loc]
  (-> node-loc z/right z/down))

(defn right-node [node-loc]
  (-> node-loc z/right z/right z/down))

(defn sum-path [rt-loc, cf]
  (loop [rs (-> rt-loc z/down z/node)
         cn (-> rt-loc z/down)]
    (if-not (z/right cn)
      rs
      (let [lv (z/node (left-node cn))
            rv (z/node (right-node cn))]
        (if (cf lv rv)
          (recur (+ rs lv) (left-node cn))
          (recur (+ rs rv) (right-node cn)))))))
It is pretty evident what is going on here. We use two utility functions to get the left and right nodes. sum-path accepts a starting location and a compare function, then walks down the zipper, at each level comparing the left and right node values and following the one the compare function picks.
(def zv (z/vector-zip nv))
; max path - returns 27
(sum-path zv >)
; min path - returns 18
(sum-path zv <)
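The navigation steps behind left-node and right-node are easier to follow on a smaller tree. The sketch below uses a hypothetical three-node vector, small, in the same [value left-subtree right-subtree] shape as nv.

```clojure
(require '[clojure.zip :as z])

;; Hypothetical miniature tree: root 1 with children 2 and 3.
(def small (z/vector-zip [1 [2] [3]]))

(-> small z/down z/node)                          ; => 1, the root value
(-> small z/down z/right z/node)                  ; => [2], the left subtree
(-> small z/down z/right z/down z/node)           ; => 2, the left child value
(-> small z/down z/right z/right z/down z/node)   ; => 3, the right child value
(-> small z/down z/right z/down z/right)          ; => nil, a leaf has no sibling
```

The nil in the last line is what the (z/right cn) check in sum-path relies on to detect that the walk has reached a leaf.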
 1 (use '[clojure.contrib.str-utils :only (re-split)])
 2 (defmacro time-avg
 3   "Captures time output, parses it and calculates average.
 4   Modeled after with-out-str. Example:
 5   (time-avg
 6   (dotimes [_ 5] (time (.run #(Thread/sleep (rand 100))))))"
 7   [& body]
 8   `(let [s# (java.io.StringWriter.)]
 9      (binding [*out* s#]
10        ~@body)
11      (let [strng# (str s#)
12            lns# (re-split #"\n" strng#)
13            flt-regex# #"\d+\.\d+"
14            lns-flts# (filter #(re-seq flt-regex# %) lns#)
15            flts# (map
16                    #(Float/parseFloat (first (re-seq flt-regex# %)))
17                    lns-flts#)
18            sum# (reduce + flts#)]
19        (println strng#)
20        (if (seq flts#)
21          (println "\"Average of" (count flts#) "run/s is"
22                   (/ sum# (count flts#)) "msecs\"")))))
Starting on line 11 we parse the captured output using regexes, look for floating-point numbers, sum them up and print the calculated average. The only thing I stumbled on was using the # suffix in macro variable names.
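For anyone who stumbles on the same thing, here is a minimal sketch of what the # suffix does inside a syntax-quoted form. The square-of macro is a made-up example, unrelated to time-avg.

```clojure
(defmacro square-of
  "Evaluates expr once and multiplies the result by itself."
  [expr]
  `(let [v# ~expr]      ; v# becomes a unique generated symbol in the expansion
     (* v# v#)))

(square-of (+ 1 2))     ; => 9

;; Shows the generated name; the exact symbol differs on every expansion.
(macroexpand-1 '(square-of 3))
```

Without the suffix, syntax-quote would namespace-qualify the plain symbol v, and let refuses to bind a qualified name, so the macro would not compile at its call site.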
<body></body>
<pre></pre>
That is it. Here is a sample.
2011-01-13 Update
The default format of TOhtml changed in a recent version of Vim (7.3.46 for Windows). It now uses CSS styles from the page header element. To get inline styles:
:let g:html_use_css = 0
 1 (defn indexed [coll] (map vector (iterate inc 0) coll))
 2 (defn
 3   #^{:test (fn [] (tst bin-srch-indx))}
 4   bin-srch-indx [coll el]
 5   (loop [vc-indx (vec (indexed coll))]
 6     (let [sz (count vc-indx), mi (quot sz 2),
 7           ce (get vc-indx mi), [idx v] ce]
 8       (cond
 9         (< el (first coll)) nil
10         (> el (last coll)) nil
11         (and (= mi 0) (or (> el v) (< el v))) nil
12         (= el v) idx
13         (> el v) (recur (subvec vc-indx mi sz))
14         (< el v) (recur (subvec vc-indx 0 mi))))))
indexed on line 1 is interesting by itself. I came across it in Programming Clojure by Stuart Halloway. It creates a new lazy sequence by calling vector on pairs of the first, then second, etc. items of (iterate inc 0) and the passed-in collection until the latter is exhausted. bin-srch-indx converts the input sequence into a vector, uses cond to catch edge cases and finally uses Clojure’s recur and subvec to halve the searched collection. On line 3 there is an example of an anonymous function in metadata used to hook into Clojure’s built-in test facility. We will talk about the test function later.
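A quick REPL-style illustration of what indexed produces, on a made-up three-element vector. The second expression uses map-indexed from clojure.core, which yields the same pairs.

```clojure
(defn indexed [coll] (map vector (iterate inc 0) coll))

;; map stops when the shorter (finite) collection runs out,
;; so the infinite (iterate inc 0) is never fully realized.
(indexed [:a :b :c])              ; => ([0 :a] [1 :b] [2 :c])

;; Same pairs via clojure.core.
(map-indexed vector [:a :b :c])   ; => ([0 :a] [1 :b] [2 :c])
```

Keeping the original index next to each value is what lets bin-srch-indx report a position in the input even after subvec has cut the vector down.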
Unsatisfied with the above, I thought there might be a better algorithm for binary search, so it was time to enlist the help of Wikipedia. Here is a faster version which doesn’t use subvec but manipulates boundaries instead.
(defn
  #^{:test (fn [] (tst bin-srch))}
  bin-srch [coll el]
  (let [v (vec coll)]
    (loop [li (int 0), ri (int (count v))]
      (if (= (get v 0) el)
        0
        (let [p (int (/ (- ri li) 2))]
          (if
            (> p 0)
            (let [next-p (int (+ li p)),
                  ce (get v next-p)]
              (cond
                (= ce el) next-p
                (> ce el) (recur li next-p)
                (< ce el) (recur next-p ri)))
            nil))))))
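One way to sanity-check a hand-written binary search is to compare it against the JDK's. The sketch below wraps java.util.Collections/binarySearch in a hypothetical helper, jdk-bin-srch, that follows the same convention as the functions above (index when found, nil otherwise). It is a cross-check, not a replacement for the exercise.

```clojure
(defn jdk-bin-srch
  "Hypothetical helper: index of el in the sorted coll, or nil when absent.
  Collections/binarySearch returns a negative number for a missing key."
  [coll el]
  (let [i (java.util.Collections/binarySearch (vec coll) el)]
    (when-not (neg? i) i)))

(jdk-bin-srch [2 3 5 7 11 13] 11)   ; => 4
(jdk-bin-srch [11 13 15] 14)        ; => nil
```

This works because Clojure vectors implement java.util.List, and the input is assumed to be sorted in ascending order.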
Being a diligent developer, I included a test function to prove that the code works.
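(defn tst
  [f]
  (let
    [ub1 99999
     ub2 (+ ub1 1)
     tc1 (range 10 ub1 1)
     tc2 (conj tc1 ub2)
     tc3 [11 13 15]
     pli 1
     pui 98
     dc 500
     pcol (take 100 (drop dc primes))]
    (assert (= nil (f tc1 1)))
    (assert (= nil (f tc1 (+ ub2 1))))
    ;; NOTE: for is lazy and its result is discarded here, so the assertions
    ;; in the next two forms are never evaluated; doseq would force them.
    ;; The expected indices would then need rechecking, since range excludes
    ;; its end and conj on a seq prepends.
    (for [i (range 10 (+ ub1 1) 1)]
      (assert (= (- i 10) (f tc1 i))))
    (for [i (range 10 (+ ub2 1) 1)]
      (assert (= (- i 10) (f tc2 i))))
    (assert (= nil (f tc3 14)))
    (assert (= pli
               (f pcol (first (take 1
                                    (drop (+ dc pli) primes))))))
    (assert (= pui
               (f pcol (first (take 1
                                    (drop (+ dc pui) primes))))))))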
The above sets up some test data and verifies results using assert. So, in your friendly REPL you can try:
user=> (use '[clojure.contrib.lazy-seqs :only (primes)])
nil
user=> (test bin-srch)
:ok
user=> (test bin-srch-indx)
:ok
user=> (bin-srch (take 100 primes) 11)
4
Also, here is the complete file bin-srch.clj. Since the WordPress upload attachment gods don’t like the clj extension, I had to use the doc file extension. By the way, I don’t think it would be hard to distinguish between text and binary files during upload and allow the former. Let’s pray together: Hare WordPress upload attachment gods…:-)
In the next post we will take a look at how, with the help of a Clojure macro, we can measure the running time of the two implementations above.