Parkour is a Clojure library for writing Hadoop MapReduce programs. The most recent release of Parkour1 includes new features designed to integrate Parkour and Hadoop with the typical REPL-based Clojure workflow. Parkour can now lend a Clojure REPL most of the interactive Hadoop features of the Pig or Hive shells, but backed by the full power of the Clojure programming language.

This post will walk through setting up a Clojure REPL connected to a live Hadoop cluster (running via Amazon’s EMR) and interactively developing a MapReduce program to run on the cluster.

Problem

We need a problem to solve with MapReduce. Treading new data-analysis ground isn’t the goal here, so for this post we’re just going to write a Parkour version of a program to identify decade-wise trending terms in the Google Books n-grams corpus. To make things a bit more interesting, we’ll attempt to improve on the original algorithm so that terms which first appear in a given decade can also trend.

Preliminaries

This post assumes you either already have fully-configured access to a local Hadoop cluster or know how to launch a cluster on Amazon EMR.

If launching a cluster on EMR, you’ll need to run your REPL process such that the local system Hadoop version matches the EMR cluster Hadoop version and has network access to the cluster services. The easiest way to do this is to run everything on the EMR cluster master node. This is less convenient for actual development than other approaches (e.g. configuring Hadoop to use SOCKS), but involves less setup, and so is what we’ll assume for this post.

Parkour supports the current stable (plus EMR-supported) versions of both Hadoop 1 and Hadoop 2. For this post’s EMR cluster we’ll be using Amazon Hadoop 2.2.0, but Amazon Hadoop 1.0.2 should work just as well.

Create a project

First, install Leiningen2 and create a new project:

$ lein new app trending-terms
Generating a project called trending-terms based on the 'app' template.

Then update the project file to include Parkour and our version of Hadoop:

(defproject trending-terms "0.1.0-SNAPSHOT"
  :description "Decade-wise trending terms in the Google Books n-grams corpus."
  :url "http://github.com/llasram/trending-terms"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :global-vars {*warn-on-reflection* true}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [com.damballa/parkour "0.5.4"]
                 [org.apache.avro/avro "1.7.5"]
                 [org.apache.avro/avro-mapred "1.7.5"
                  :classifier "hadoop2"]
                 [transduce/transduce "0.1.1"]]
  :exclusions [org.apache.hadoop/hadoop-core
               org.apache.hadoop/hadoop-common
               org.apache.hadoop/hadoop-hdfs
               org.slf4j/slf4j-api org.slf4j/slf4j-log4j12 log4j
               org.apache.avro/avro
               org.apache.avro/avro-mapred
               org.apache.avro/avro-ipc]
  :profiles {:provided
             {:dependencies [[org.apache.hadoop/hadoop-client "2.2.0"]
                             [org.apache.hadoop/hadoop-common "2.2.0"]
                             [org.slf4j/slf4j-api "1.6.1"]
                             [org.slf4j/slf4j-log4j12 "1.6.1"]
                             [log4j "1.2.17"]]}
             :test {:resource-paths ["test-resources"]}
             :aot {:aot :all, :compile-path "target/aot/classes"}
             :uberjar [:aot]
             :jobjar [:aot]})

There’s currently some incidental complexity to the project file (the :exclusions and logging-related dependencies), but just roll with it for now.

Launch a REPL

In order to launch a cluster-connected REPL, we’ll want to use the lein-hadoop-cluster Leiningen plugin. We’ll also want the Alembic library for some of Parkour’s REPL support functionality. We could add these directly to the project file, but they’re pretty orthogonal to any given individual project, so we’ll just add them to the :user profile in our ~/.lein/profiles.clj:

{:user
 {:plugins [[lein-hadoop-cluster "0.1.2"]]
  :dependencies [[alembic "0.2.1"]]}}

With those changes made, we can actually launch the REPL from within the project directory:

$ lein hadoop-repl

Then (optional, but suggested) connect to the REPL process from our editor.

Examine the data

Let’s get started writing code, in src/trending_terms/core.clj. First, the namespace preliminaries:

(ns trending-terms.core
  (:require [clojure.string :as str]
            [clojure.core.reducers :as r]
            [transduce.reducers :as tr]
            [abracad.avro :as avro]
            [parkour (conf :as conf) (fs :as fs) (mapreduce :as mr)
             ,       (graph :as pg) (reducers :as pr)]
            [parkour.io (seqf :as seqf) (avro :as mra) (dux :as dux)
             ,          (sample :as sample)]))

Then we can define some functions to access the n-gram corpus:

(def ngram-base-url
  "Base URL for Google Books n-gram dataset."
  "s3://datasets.elasticmapreduce/ngrams/books/20090715")

(defn ngram-url
  "URL for Google Books `n`-gram dataset."
  [n] (fs/uri ngram-base-url "eng-us-all" (str n "gram") "data"))

(defn ngram-dseq
  "Distributed sequence for Google Books `n`-grams."
  [n] (seqf/dseq (ngram-url n)))

With these functions in hand, we can start exploring the data:

trending-terms.core> (->> (ngram-dseq 1) (r/take 2) (into []))
[[1 "#\t1584\t6\t6\t1"] [2 "#\t1596\t1\t1\t1"]]
trending-terms.core> (->> (ngram-dseq 1) (sample/dseq)
                          (r/take 2) (into []))
[[265721405 "APPROPRIATE\t1999\t492\t470\t325"]
 [265721406 "APPROPRIATE\t2000\t793\t723\t375"]]
trending-terms.core> (->> (ngram-dseq 1) (sample/dseq {:seed 2})
                          (r/take 2) (into []))
[[141968715 "poseen\t2007\t10\t10\t8"]
 [141968716 "poseen\t2008\t13\t13\t10"]]

The cluster-connected REPL gives us direct, live access to the actual data we want to analyze. The sample/dseq function lets us select small samples from that data, as randomly as Hadoop allows while remaining efficient.

Write a function

With direct access to the raw records our jobs get from Hadoop, we can easily begin writing functions to operate on those records:

(defn parse-record
  "Parse text-line 1-gram record `rec` into ((gram, decade), n) tuple."
  [rec]
  (let [[gram year n] (str/split rec #"\t")
        gram (str/lower-case gram)
        year (Long/parseLong year)
        n (Long/parseLong n)
        decade (-> year (quot 10) (* 10))]
    [[gram decade] n]))

(defn select-record?
  "True iff argument record should be included in analysis."
  [[[gram year]]]
  (and (<= 1890 year)
       (re-matches #"^[a-z+'-]+$" gram)))

And doing some quick REPL-testing:

trending-terms.core> (->> (ngram-dseq 1) (sample/dseq {:seed 1}) (r/map second)
                          (r/map parse-record)
                          (r/take 2) (into []))
[[["appropriate" 1990] 492] [["appropriate" 2000] 793]]
trending-terms.core> (->> (ngram-dseq 1) (r/map second)
                          (r/map parse-record)
                          (r/filter select-record?)
                          (r/take 2) (into []))
[[["a+" 1890] 1] [["a+" 1890] 2]]

We can glue these functions together into the first map task function we’ll need for our jobs:

(defn normalized-m
  "Parse, normalize, and filter 1-gram records."
  {::mr/source-as :vals, ::mr/sink-as :keyvals}
  [records]
  (->> records
       (r/map parse-record)
       (r/filter select-record?)
       (pr/reduce-by first (pr/mjuxt pr/arg1 +) [nil 0])))

Because Parkour task functions are just functions, we can test that in the REPL as well:

trending-terms.core> (->> (ngram-dseq 1) (sample/dseq {:seed 1}) (r/map second)
                          normalized-m (r/take 3) (into []))
[[["appropriate" 1990] 492]
 [["appropriate" 2000] 7242]
 [["appropriated" 1890] 59]]

As we develop the functions composing our program, we can iterate rapidly by immediately seeing the effect of calling our in-development functions on real data.

Write some jobs

Writing the rest of the jobs is a simple matter of programming. We’ll largely follow the original Hive version, but we will attempt to add the ability for entirely new terms to trend by applying Laplace smoothing to the occurrence counts.
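
As a point of reference before diving into the job code, here is the score we are after, written as plain functions (a sketch of my own, not part of the jobs themselves): add-one (Laplace) smoothing of the per-decade counts, with the 1-gram vocabulary size folded into each decade’s total, so that a term absent from the prior decade still gets a finite, comparable ratio.

(defn smoothed-freq
  "Laplace-smoothed frequency of a term counted `n` times out of `total`
occurrences in a decade, given `nwords` distinct terms overall."
  [n total nwords]
  (/ (inc n) (+ total nwords)))

(defn trend-score
  "Ratio of a term's smoothed frequency in one decade to its smoothed
frequency in the previous decade."
  [n-now total-now n-prev total-prev nwords]
  (/ (smoothed-freq n-now total-now nwords)
     (smoothed-freq n-prev total-prev nwords)))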

Parkour allows us to optimize the entire process down to just two jobs (and no reduce-side joins). The same optimization is possible via the base Java Hadoop interfaces, but you’d rarely bother: it’d be too fiddly in the raw APIs, and it’s completely impossible in most higher-level frameworks. In Clojure with Parkour, however, it’s relatively straightforward.

But! – today’s main, REPL-based excitement comes after. So, without further commentary, the rest of the code:

(defn nth0-p
  "Partitioning function for first item of key tuple."
  ^long [k v ^long n] (-> k (nth 0) hash (mod n)))

(defn normalized-r
  "Collect: tuples of 1-gram and occurrence counts by decade; total 1-gram
occurrence counts by decade; and counter of total number of 1-grams."
  {::mr/source-as :keyvalgroups,
   ::mr/sink-as (dux/named-keyvals :totals)}
  [input]
  (let [nwords (.getCounter mr/*context* "normalized" "nwords")
        fnil+ (fnil + 0)]
    (->> input
         (r/map (fn [[[gram decade] ns]]
                  [gram [decade (reduce + 0 ns)]]))
         (pr/reduce-by first (pr/mjuxt pr/arg1 conj) [nil {}])
         (reduce (fn [totals [gram counts]]
                   ;; count each distinct 1-gram; `normalized-j` reads this
                   ;; counter back as the vocabulary size for smoothing
                   (.increment nwords 1)
                   (dux/write mr/*context* :counts gram (seq counts))
                   (merge-with + totals counts))
                 {})
         (seq))))

(defn normalized-j
  "Run job accumulating maps of decade-wise raw occurrence counts per 1-gram
and map of total decade-wise Laplace-smoothed occurrence counts."
  [conf workdir ngrams]
  (let [counts-path (fs/path workdir "counts")
        totals-path (fs/path workdir "totals")
        pkey-as (avro/tuple-schema ['string 'long])
        counts-as {:type 'array, :items (avro/tuple-schema ['long 'long])}
        [counts totals]
        , (-> (pg/input ngrams)
              (pg/map #'normalized-m)
              (pg/partition (mra/shuffle pkey-as 'long) #'nth0-p)
              (pg/reduce #'normalized-r)
              (pg/output :counts (mra/dsink ['string counts-as] counts-path)
                         :totals (mra/dsink ['long 'long] totals-path))
              (pg/execute conf "normalized"))
        gramsc (-> (mr/counters-map totals)
                   (get-in ["normalized" "nwords"])
                   double)
        fnil+ (fnil + gramsc)
        totals (reduce (fn [m [d n]]
                         (update-in m [d] fnil+ n))
                       {} totals)]
    [counts totals]))

(defn trending-m
  "Transform decade-wise 1-gram occurrence counts into negated ratios of
occurrence frequencies in adjacent decades."
  {::mr/source-as :keyvals, ::mr/sink-as :keyvals}
  [totals input]
  (r/mapcat (fn [[gram counts]]
              (let [counts (into {} counts)
                    ratios (reduce-kv (fn [m dy nd]
                                        (let [ng (inc (counts dy 0))]
                                          (assoc m dy (/ ng nd))))
                                      {} totals)]
                (->> (seq ratios)
                     (r/map (fn [[dy r]]
                              (let [r' (ratios (- dy 10))]
                                (if (and r' (< 0.000001 r))
                                  [[dy (- (/ r r'))] gram]))))
                     (r/remove nil?))))
            input))

(defn trending-r
  "Select top `n` 1-grams per decade by negated occurrence frequency ratios."
  {::mr/source-as :keyvalgroups, ::mr/sink-as :keyvals}
  [n input]
  (r/map (fn [[[decade] grams]]
           [decade (into [] (r/take n grams))])
         input))

(defn trending-j
  "Run job selecting trending terms per decade."
  [conf workdir topn counts totals]
  (let [ratio-as (avro/tuple-schema ['long 'double])
        ratio+g-as (avro/grouping-schema 1 ratio-as)
        grams-array {:type 'array, :items 'string}
        trending-path (fs/path workdir "trending")
        [trending]
        , (-> (pg/input counts)
              (pg/map #'trending-m totals)
              (pg/partition (mra/shuffle ratio-as 'string ratio+g-as)
                            #'nth0-p)
              (pg/reduce #'trending-r topn)
              (pg/output (mra/dsink ['long grams-array] trending-path))
              (pg/execute conf "trending"))]
    (into (sorted-map) trending)))

(defn trending-terms
  "Calculate the top `topn` trending 1-grams per decade from Google Books 1-gram
corpus dseq `ngrams`.  Writes job outputs under `workdir` and configures jobs
using Hadoop configuration `conf`.  Returns map of initial decade years to
vectors of trending terms."
  [conf workdir topn ngrams]
  (let [[counts totals] (normalized-j conf workdir ngrams)]
    (trending-j conf workdir topn counts totals)))

Once we start writing our job, we can use Parkour’s “mixed mode” job execution to iterate. Mixed mode allows us to run the job in the REPL process, but on the same live sampled data we were examining before:

trending-terms.core> (->> (ngram-dseq 1) (sample/dseq {:seed 1, :splits 20})
                          (trending-terms (conf/local-mr!) "file:tmp/tt/0" 5))
{1900 ["deglet" "warroad" "delicado" "erostrato" "warad"],
 1910 ["esbly" "wallonie" "wallstein" "dehmel's" "dellion"],
 1920 ["ernestino" "walska" "wandis" "wandke" "watasenia"],
 1930 ["delacorte" "phytosociological" "priuatly" "delber" "escapism"],
 1940 ["phthalates" "phylic" "espiner" "degrease" "wallonie"],
 1950 ["demokos" "wanotan" "ersine" "dekatron" "physicalistically"],
 1960 ["warain" "propeking" "warschaw" "demecarium" "pikiran"],
 1970 ["prioritize" "walshok" "waterboard" "demogrants" "delisted"],
 1980 ["warsl" "watasi" "proarrhythmic" "walonick" "procurved"],
 1990 ["wanglie" "procedores" "printlnc" "dejanews" "eslamboli"],
 2000 ["wardriving" "erlestoke" "deleterole" "deleteq" "erius"]}

This is the moral equivalent of Pig’s ILLUSTRATE command. Parkour lacks the rigid execution model which allows Pig’s ILLUSTRATE to e.g. synthesize data for joins, but the simplicity of the approach means any combination of “remote” jobs and local processing just works, without surprises.

As with sampling input for individual functions, mixed mode job execution supports rapid development iteration on real data.

Write a test

REPL iteration gets results quickly, but once the code works, a real test gives us confidence that it will continue to work and that the results are what we actually expect. So let’s write a test for our job graph:

(ns trending-terms.core-test
  (:require [clojure.test :refer :all]
            [trending-terms.core :as tt]
            [parkour (fs :as fs) (conf :as conf)]
            [parkour.io (dsink :as dsink) (seqf :as seqf)]
            [parkour.test-helpers :as th])
  (:import [org.apache.hadoop.io Text LongWritable]))

(def n1grams-records
  ;; omitted from blog post for clarity
  )

(deftest test-basic
  (th/with-config
    (let [workdir (doto "tmp/work/basic" fs/path-delete)
          inpath (fs/path workdir "ngrams")
          ngrams (dsink/with-dseq (seqf/dsink [LongWritable Text] inpath)
                   n1grams-records)
          trending (tt/trending-terms (th/config) workdir 1 ngrams)]
      (is (= {1950 ["legal"],
              1960 ["assembly"],
              1970 ["astoria"],
              2000 ["prostate"]}
             trending)))))

The Parkour th/config function and th/with-config macro allow us to run code using a purely local-mode Hadoop configuration, even in a process where the default configuration points to a live cluster. Just like we were able to REPL-test jobs using mixed mode, we can now run our actual tests in-REPL in full local mode:

trending-terms.core-test> (run-tests)

Testing trending-terms.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.
{:type :summary, :pass 1, :test 1, :error 0, :fail 0}

Success!

Launch a job

Once our program is developed and tested, it’s time to run it on the full dataset. Normally this would involve leaving the REPL to build a job JAR and deploy it somehow, but Parkour allows us to do this directly from the REPL too:

trending-terms.core> (require '[parkour.repl :refer [launch!]])
trending-terms.core> (def *results
                       (future (->> (ngram-dseq 1)
                                    (launch! {"mapred.reduce.tasks" 8}
                                             trending-terms "tt/0" 5))))
#<Var@3a23a4ec: #<Future@55f5a074: :pending>>
trending-terms.core> (realized? *results)
false

(We run the jobs in a future to place them in a background thread, and thus not tie up the REPL while the jobs are running.)

Parkour uses hugoduncan’s Alembic library to load and interact with a full in-process (but isolated) instance of Leiningen. Using this Leiningen instance, Parkour just builds your job JAR and assembles your dependencies exactly as specified by your Leiningen project file.

Once the job finishes, time for some results:

trending-terms.core> (realized? *results)
true
trending-terms.core> @*results
{1900 ["strether" "fluidextract" "thutmose" "adrenalin" "lekythoi"],
 1910 ["orthotype" "britling" "salvarsan" "pacifist" "boches"],
 1920 ["liliom" "bacteriophage" "prohack" "vanzetti" "erlend"],
 1930 ["vridar" "samghin" "mulan" "nazis" "goebbels"],
 1940 ["psia" "plutonium" "luftwaffe" "darlan" "beachhead"],
 1950 ["lopatkin" "rooscvelt" "fluoridation" "jacy" "desegregation"],
 1960 ["vietcong" "synanon" "tshombe" "lumumba" "psychedelic"],
 1970 ["mdhr" "sexist" "sexism" "biofeedback" "counterculture"],
 1980 ["affit" "autocad" "dbase" "neob" "garion"],
 1990 ["activex" "photoshop" "javascript" "netscape" "toolbars"],
 2000 ["cengage" "eventargs" "itunes" "podcast" "wsdl"]}

How trendy! It looks like our smoothing function has added more noise from rare terms3, but the basics are there for the tweaking.

The complete example trending-terms project is on Github, if you want to try experimenting with it (in a live REPL!) yourself.

Parting thoughts

Thanks to rfarrjr for awesome discussion around these features, and to ztellman and ahobson for advocating their value and offering specific suggestions. I was honestly skeptical at first that this sort of REPL integration could be made useful, from past experience trying to make a live-cluster Cascalog REPL work. But now that these features exist, I’m not sure how I wrote MapReduce programs without them.

So head over to the Parkour project and get started!

1 Version 0.5.4, at the time of writing.

2 On the EMR Hadoop cluster master node.

3 Especially OCR errors.

(< Ruby Python Clojure)

November 25, 2012

It’s been a week, but I’m still revved up from attending Clojure/conj. The energy, dedication, and intelligence in the Clojure community all continue to impress me, and definitely influence my choice of Clojure as my preferred programming language. Of course the language itself remains the most significant factor, and that’s something I’ve been meaning to blog about for some time.

My language of choice has gone from C to Perl1 to Python to Ruby to Python again, and then finally to Clojure. Each transition has been motivated by a perceived increase in practical expressiveness – an ability to communicate both effect and intent more directly and efficiently, both to the computer and to other humans. C to Perl provided garbage collection, first-class collections, and CPAN. Perl to Python provided real objects, comprehensible semantics, and a syntax that did not resemble line-noise. Python to Ruby provided blocks, open classes, and pervasive metaprogramming.

Ruby back to Python takes a bit more explaining. In so many ways the languages are so similar as to be almost indistinguishable.

One difference is in the communities surrounding the languages. Ruby has come a long, long way from the days when most of its documentation was only available in Japanese. But – there are differing cultural norms between the two communities on how best to balance documentation, stable interfaces, implementation quality, and the introduction of new features. I feel that the Ruby community pervasively emphasizes the latter above the others to a greater extent than I prefer. The Python language, standard library, and third-party libraries tend to be better documented and more stable than their Ruby counterparts. Not universally or to an unjustifiable extent – just sufficiently that I see a distinction, and prefer Python.

The other most significant aspect is that I believe Python APIs tend to be more value-oriented than their equivalent Ruby APIs. This one is a little difficult to explain...

Ruby more closely follows the canonical OO model by having a concept of “messages.” All method invocations and function calls are actually deliveries of messages to objects. When you see a line of Ruby code like:

invoke_method arg

All you know is that the implicit self at that particular line is going to receive the invoke_method message with the argument arg. You first need to unwind the evaluation state to figure out what object self is at that point. Then you need to figure out how it will actually handle the message, which may be via a class instance method, an included module method, or method_missing. Messages themselves aren’t first-class or introspectable, so there’s no way to ask “if I send the invoke_method message here, who exactly will respond?”

Python in contrast has no implicit scopes and models everything in terms of attribute access and function-calling2. When you see a line of Python code like:

object.method(arg)

You know there is an in-scope, introspectable value named object at this line. That value is asked via the attribute-access protocol for its method member, which yields another introspectable value. That attribute value is invoked via the function call protocol with the argument arg. At every step of the way you have a concrete, introspectable value backed by code you can find and follow.

In my experience these differences aren’t just a curiosity, but directly impact the way one most concisely expresses code in the two languages. In Ruby libraries it is a very common idiom to provide methods invoked within a class definition or instance_eval’d block which build state by modifying the implicit self. The moral analog of this in Python (metaclasses) is relatively rare, and most Python libraries provide interfaces in terms of concrete values. For a good example of what I’m talking about, look at the difference between the ruby-ffi and Python ctypes libraries.

And then finally, Clojure.

For the few years I’ve been using it, I’ve found that Clojure maximizes practical expressiveness in three ways which beat all other languages I’ve tried. It has everything you’d expect from a language like Python or Ruby, but then adds these things on top.

First there’s the obvious: macros. Ruby and Python provide metaprogramming facilities which allow programmatically generating any constructs one could produce directly through code. But doing so is not always going to be as legible or compact as simply writing out the desired result, nor will it necessarily bear a clear relation to that directly-expressed version. With Clojure macros, you instead have the ability to do arbitrary in-code code-generation, using the same functions as for general-purpose code, and acting effectively like executable structural templates for the expanded code. In Clojure, there never needs to be any boilerplate – even if macros aren’t always the best method for eliminating code repetition, they are a method which can be used to eliminate any code repetition.
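
As a minimal illustration of what I mean (my own toy example, not from any particular library): a macro that defines several vars at once, written with the same ordinary sequence functions used everywhere else in Clojure.

(defmacro defs
  "Define several vars at once: (defs a 1, b 2) expands into two `def` forms."
  [& name-value-pairs]
  `(do ~@(for [[name value] (partition 2 name-value-pairs)]
           `(def ~name ~value))))

(defs answer 42
      greeting "hello")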

Second is the language’s functional style and culture. Without the language itself enforcing functional purity, the style of the Clojure standard library and the norms of the Clojure community encourage it. This gives what I see as the best of both worlds: the ability to write and reason about most code as pure functions operating on immutable values, while escaping to side effects and mutation when necessary without any excess ceremony. This makes it even more value-oriented than Python, with almost all execution happening in terms of values which are not only concrete and visible, but also immutable.
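
A tiny sketch of what that looks like in practice (names are mine): pure transformation of immutable values by default, with explicit but ceremony-free mutation when you actually want it.

;; Pure, value-oriented by default; the original map is untouched:
(update-in {:hits {"clojure" 1}} [:hits "clojure"] inc)
;;=> {:hits {"clojure" 2}}

;; Explicit, low-ceremony mutation when it's genuinely needed:
(def hit-counts (atom {}))
(swap! hit-counts update-in ["clojure"] (fnil inc 0))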

Third is Clojure’s definition as an explicitly hosted language, and specifically the JVM as the best-supported hosting platform. I was a bit resistant at first to getting deeply acquainted with the JVM and (oh, the humanity!) Java, but I now see it as a significant benefit. Part of practical expressiveness is not needing to write code incidental to the application domain. The vast ecosystem of Java libraries and the ease with which Leiningen allows access to them significantly cuts down on incidental code. The JVM itself of course is rock-solid, fast3, and universal. From a practical perspective, it’s a clear win for almost all applications.

Anyway, enough blogging – time to go write some Clojure!

1 In college >10 years ago, so cut me a little slack.

2 Which I believe you could think of as being messages, but I’m not sure that really helps much.

3 Well, after boot.

Without comment

April 5, 2012

Update: Ok, I was wrong! Or at least I greatly over-stated my case. People have shown me lots of great examples of when comments-qua-comments can be useful1. I still think that much of the time, the time spent writing explanatory comments would be better spent just making the implementation clearer. But code can’t always perfectly capture intent or the influence of external factors. Consider my incipient dogmatism rescinded.

I am on the record at my current office saying that I prefer to work on code-bases without comments. I do my best to follow this preference in the code I write myself, which occasionally provokes some, er – comment. I can see why this might be controversial, so I’d like to explain.

Not comments

First off, some things that aren’t “comments” in the sense I mean; this:

(defn clojure-function
  "Calculates a result from arguments `args`."
  [& args] ... result)

Or this:

def python_function(*args):
    """Calculates a result from arguments `args`."""
    ...
    return result

Or even this:

/**
 * Calculates a result from arguments.
 *
 * @param args  the arguments to use in the calculation
 */
AbstractInterfaceAdaptorProxy
javaMethod(FlyweightFacadeFactory args...) {
    ...
    return result;
}

The first two examples are obviously not comments – they’re strings. To be precise, they’re docstrings. The third example uses Java’s syntax for comments, but only because Java doesn’t have docstrings. All three are pieces of interface documentation which are consumed in a structured way by an automated documentation system. The syntax of comments provides a convenient way to bolt on structured in-line interface documentation for languages which don’t have built-in support for it, but it’s hardly what comments are “for”; otherwise languages with real docstrings wouldn’t have a separate syntax for comments.

Also not comments: the input to systems like docco and marginalia. These sorts of systems use the syntax of comments to bolt on support for producing comprehensive, structured implementation documentation. They only use the syntax of comments because no one aside from Donald Knuth seems able to make writing the implementation in-line in the documentation actually work2. Turning the documentation/implementation relationship inside-out shows use of the “comment” syntax as an artifact of convenience.

Comments, a taxonomy

Ok, so what does that leave as “actually comments”?

Implementation-repetition

# Append a dot to the end
some_string += "."

I can read just fine, thanks!

From the peanut gallery

# Wow, what a kludge
object.send(:private_method)

Well, then why are you doing it?

Completely wrong

# Only show users specials on Sunday
return false unless username =~ VALID_USERNAME_RE

Looks like someone changed the code without making sure the comments still matched!

Right?

# Increment by 2 to account for leap-seconds since 2004
seconds += 3

Just close enough to create the potential for confusion. Has there been another leap-second, or is seconds being incremented for a completely different reason?

What I tell you three times is true

# BACKEND-192: Implement secondary sort so that we can track preferences
#   on a per-user basis.
operation.sort_key = [:overall_rating, :user_preference]

So now there are three different explanations of what the code is doing: the high-level description linked to changing requirements in the issue tracking system, the historical implementation description in the associated commit message, and... this comment, which isn’t linked to changing requirements or to the history of changes. Unless you look at the ticket to see what’s changed, or look at the commit log to see the change in context. The comment in the code adds absolutely nothing.

Right, but...

# Increment by 3 to account for leap-seconds since 2004
seconds += 3

The comment is right, and usefully explains what’s happening semantically, but why have a comment when you could just make the implementation read as clearly without one?:

LEAP_SECONDS_SINCE_2004 = 3
...
seconds += LEAP_SECONDS_SINCE_2004

If you need to explain why you’re adding in the leap-seconds, you can add a semantically-named function/method which performs the operation. I strongly believe that in most situations it’s possible to make the code itself just as clear as any comment could make it.
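
In Clojure, for instance, the same idea might look something like this (a sketch; the names are hypothetical):

(def leap-seconds-since-2004 3)

(defn add-leap-seconds
  "Account for the leap-seconds added since 2004."
  [seconds]
  (+ seconds leap-seconds-since-2004))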

Explaining the horror

# Warning: massive kludge, but we can’t fix it until we re-implement
# the primary business logic, which is currently written in a dialect
# of REXX invented by Jim, who quit yesterday.
object.send(:private_method)

Hey, that’s actually useful!

It also indicates a code-base I’d really prefer not to work on if I can avoid it, and is also the kind of comment I certainly never want to feel the need to write myself.

Conclusion

So that’s why I prefer there be no comments: the only real use I see for comments-qua-comments is unstructured documentation of the reasons behind specific implementation warts. If I can help it, I prefer the code-bases I work on not have such warts. QED.

Comments?

1 Erik Peterson’s list in the comments on this post is a pretty good summary.

2 I know that there are other literate programming practitioners out there, but I must confess to never having tried it myself. Maybe some day.

Restoring Features

April 5, 2012

Couldn’t sleep, worked on blog instead.

I managed to get comments up with Disqus yesterday. Now I’ve got Jekyll rendering posts using Emacs muse-mode, just like my old blogging system did. Among other things, that means that I can use Emacs as my syntax-highlighting engine again:

(defn with-starts?
  "Does string s begin with the provided prefix?"
  {:inline (fn [prefix s & to]
             `(let [^String s# ~s, ^String prefix# ~prefix]
                (.startsWith s# prefix# ~@(when (seq to) [`(int ~@to)]))))
   :inline-arities #{2 3}}
  ([prefix s] (.startsWith ^String s ^String prefix))
  ([prefix s to] (.startsWith ^String s ^String prefix (int to))))

And I once again have footnotes1 which jump over into the side margin. I had to convert my old footnote-mangling code to jQuery from Prototype (yes, it was that old). But hey – it works now!

And I even added back an Atom feed and the archived posts page.

I think that’s actually it. Blog once again fully armed and operational. Not too shabby for a few hours of insomnia. Zzz...

1 So yes, actually sidenotes when everything works properly.

Back up; backup!

April 4, 2012

My now very former hosting provider “lost” my blog VPS, and I lost everything on it. My automated backup process had apparently stopped working when I’d shuffled some paths around. Oops. Thankfully the Wayback machine archived my blog for me, so I’ll be able to get back most of it. My previous process and custom+creaky blog engine were too complicated anyway – let’s give a try to a static site (generated with Jekyll).

I’ll be restoring functionality and old blog posts as I go, but I wanted to get something back online. It’s felt odd not having a Web presence – like I was no longer really there on the Internet, or some sort of Internet ghost. Spooooooky.

Anyway, I’m glad to be back.

I lost the original version of this post. C’est la vie. The gist is that Clojure’s standard case macro somewhat surprisingly (to me) does not support arbitrary expressions as test conditions – only read-time constants. It does this because its primary use, from the Clojure-implementation perspective, is to support fast dispatch on symbol/keyword values.

Other people have blogged about this and provided their own less-surprising implementations, including cemerick.

I didn’t like any of the ones I found elsewhere, so here’s mine:

(defmacro case-expr
  "Like case, but only supports individual test expressions, which are
evaluated at macro-expansion time."
  [e & clauses]
  `(case ~e
     ~@(concat
        (mapcat (fn [[test result]]
                  [(eval `(let [test# ~test] test#)) result])
                (partition 2 clauses))
        (when (odd? (count clauses))
          (list (last clauses))))))
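
For example (a usage sketch of my own), the test expressions collapse to constants when the macro expands:

(defn describe [n]
  (case-expr n
    (* 6 7)  :answer
    (inc 12) :bakers-dozen
    :something-else))

(describe 42) ;=> :answer
(describe 13) ;=> :bakers-dozen
(describe 99) ;=> :something-else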

Enjoy!

With Amazon.com not selling Macmillan books at the moment, now might seem like a good time to go to the source and buy ebooks directly from Macmillan. And they even have multiple formats available!: the “Adobe Digital Edition” format and the “Adobe eReader” format.

Wait, what?

According to their Ebooks information and help page, “Adobe eReader” books enable you “to read high-fidelity ebooks alongside other PDF files. Only this reader software displays ebooks with the pictures, graphics, and rich fonts you’ve come to expect from printed books.” While “Adobe Digital Editions” is “Adobe’s reader designed for eBooks” and “uses a format based on the Open Publishing Standard with the extension .epub,” but “ADE will also display your PDF files in a double-page, single page, or fit-to-width view — or you can specify your own custom fit.” Both formats have software download links, which both redirect to Adobe’s current Digital Editions page.

So. Er. I think “Adobe eReader” is PDF and “Adobe Digital Editions” is EPUB. Format proliferation is bad enough without making format identification more difficult than necessary.

The vicious cycle of piracy?

February 3, 2010

A few years ago I discovered that Wizards of the Coast had started selling PDF versions of pretty much the entire catalog of TSR-published original D&D, AD&D, and AD&D 2nd Ed material, all at quite reasonable prices. I’ve been fond of the Planescape setting ever since I was first introduced to it, and I impulsively bought the majority of the AD&D 2nd Ed Planescape manuals. You know, in case I ever get suddenly transported back to the 90’s or something. I later lost those precious, fully-paid-for bits in a hard drive crash, and didn’t bother redownloading them because of reasons which seemed reasonable at the time. I mean, the vendor I bought them from will surely let me download them again whenever I decide to. What could possibly go wrong?

Fast forward to the present. Finally setting up a reasonable backup scheme jogged my memory of previously lost bits, and I decided to try downloading new copies of those RPG manuals. And… the vendor still exists… I’m able to log in… they have my complete order history… they have download links!… and – no. Apparently WotC pulled the plug, stopping all e-book sales of both their current and out-of-print material, including re-downloads of already-sold titles.

Of course, a quick search turned up Rapidshare-hosted copies of all the books I’d purchased, which I felt no scruples about downloading. But chicken or egg – are the books so easily available because WotC removed them from legitimate channels? Or was the pull in the first place a response to widespread piracy? Either way, I don’t see how WotC is benefiting.

Creative Reformation

February 3, 2010

Faruk Ates responds to Mark Pilgrim’s Tinker’s Sunset by trivializing tinkering and claiming that easy-to-use devices like the iPad are fostering a Creative Revolution:

The simple matter is that these guys are old, and they grew up in an age where tinkering was the only possible course of action if you wanted to use the latest and greatest technology to its fullest potential. The Mac, in 1984, shifted that paradigm of creativity and creation towards average consumers a little. The iPhone and iPad are shifting it even further towards consumers, away from the tinkerers of old, the small little “elite” that excludes the vast majority of people.

He may be right about fostering creativity, but he’s missing the point. Making the iPad accessible to non-tinkerers and making it untinkerable are completely orthogonal.

Imagine that the iPad worked exactly as Apple has already presented, but it also had a “tinker” switch. When on, this switch allowed users to run applications not signed by Apple, with appropriately dire warnings. Problem solved, without impacting the typical user experience.

“Tinkering” is easy to trivialize, but doing so ignores what its prevention represents technically. What we’re talking about on the iPad is an impenetrable cryptographic shield which gives Apple absolute control over what code is allowed to run. Apple, not users, determines what applications are appropriate. Apple is free to censor not only content which fails to meet their technical standards, but also content which conflicts with their business interests or which they deem to be “obscene.” No matter how light the shackles, on the iPad (and iPhone) you are not free.

On the flip side, like Pilgrim I do see “tinkering” as valuable in itself. Software stacks more than any other engineered systems are inherently knowable, and one can learn from them. Fully free systems (in Stallman’s sense) are the most knowable and most instructive by virtue of the source code for every component being there for the asking. With a cryptographically shielded platform like the iPad this is impossible, and the system is unknowable and there is nothing to be learned even for the “elite.”

For Christmas my girlfriend gave me a series of increasingly difficult wooden-block puzzles, yielding up each one only as I solved the previous. She’d show me the assembled puzzle — to mock me, I assume — then dump it disassembled into my lap1. The first one was a quickly-solved freebie, but the remaining three were pretty difficult. Difficult enough even that I decided to let computers do the boring work and wrote programs to solve them. Ah, the joys of being a software engineer!

And if you think this is “cheating,” I feel the final results from my “bonus round” (at the end of this post) validate the approach.

Round 1 was the King Snake. This puzzle is a 4x4x4 cube composed of 64 linearly connected unit-cubes. The strand of unit-cubes is divided into 46 segments of 2-4 cubes each. Each segment shares a joint cube with its previous segment and freely rotates to form any right angle with that segment. A naive estimation of the problem space is greater than 1e27 (4 right-angles to the power of 46 segments). However, most moves eliminate 25-75% of the remaining problem space by either running into already-filled space or leaving the allowed 4x4x4 grid. Counting on the constraints quickly reducing the problem space, I initially solved this one with a pretty naive depth-first search written in Python.

The Python program took about 5 hours to run, and was designed to find only one solution. But hey! — it worked and found a solution. I later revisited this puzzle with what I learned from solving the others, but I’ll get to that at the end.

Round 2 was the Shipper’s Dilemma Z. This puzzle is a 5x5x5 cube composed of 25 identical, 5-unit-cube, vaguely Z-shaped pieces. For this one I did some research, hoping something about tessellation would help. Alas, even though the pieces are all identical, it appears that this is still an (NP-complete) packing problem. There are 960 positions a piece could occupy within the space, which for 25 pieces yields a naive complexity estimate of greater than 1e492. Ouch. It seemed much less likely in this case that the basic problem constraints would help much, as a naive depth-first search would prune only a few possibilities with each move. I wrote a simple first-try in Python anyway. It got absolutely nowhere, which led me to decide to try something different, something “clever.” And thus down a rabbit-hole I went.

I’ll leave out most of the details, but I wasted a lot of holiday time trying to solve this puzzle with something akin to dynamic programming. I’d build groups of 3 pieces “attached” to a vertex, combine those into groups of 6 pieces attached to a quadrant, combine those into groups of 12 attached to a side, then combine those into groups of 24 which (for solutions) would (theoretically) form the full cube with one piece-shaped hole somewhere in the middle. It was a terrible, terrible idea. The computational complexity of each step varied wildly as I changed my methods for forming groups and what assumptions I made about the properties of solution-participating groups. At one point the execution-time was lagging just enough that I re-wrote the solution-grinding code in C. Later the complexity reached a point where it wasn’t feasible to run at home, but I could run it on Amazon’s Elastic MapReduce. I did learn how to use Hadoop, EC2, and EMR, but — long story short — none of it yielded a solution. I eventually climbed out of the rabbit-hole and went back to a depth-first search, but this time with a much better conceptualization of the problem.

Fast C primitives help3, but the only real way to solve a problem of this sort is by exponentially reducing the search space. The first step is to avoid working on any “unsolvable” states. Many patterns of piece-placement leave gaps which no subsequent piece can fill. I test for this by filling in a copy of the board state with all the pieces which could fit, even if those pieces themselves overlap. If the cube isn’t completely filled, then the initial board state cannot lead to a solution.
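
In outline (a sketch in Clojure rather than the original Python/C, with hypothetical data shapes), the check looks something like this, where `placements` holds every remaining piece placement as a set of cell indices and `filled` is the set of already-occupied cells:

(defn possibly-solvable?
  "False if the remaining placements, even allowed to overlap one another,
cannot cover every empty cell of the `n-cells` board."
  [placements n-cells filled]
  (let [coverable (->> placements
                       (remove #(some filled %))
                       (reduce into filled))]
    (every? coverable (range n-cells))))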

Another obvious step for any sort of space-pattern problem is to eliminate all rotations and reflections which result in other states in the same symmetry group. For a cubical puzzle like this one, eliminating symmetries reduces the state-space by a factor of 48. In this case, I did so by initially calculating all symmetrically unique states which have 8 pieces placed, one in each corner. This also has the nice side-effect of splitting the search-space into parallelizable segments, although that turned out to be unnecessary.

The final step necessary to explore all the “interesting” states in a reasonable amount of time is to eliminate the exploration of duplicate states. Just naively iterating over piece combinations will result in trying both piece A then B and piece B then A, even though they result in exploring the same board states. I iterated over a couple approaches to this, but eventually hit upon simply ensuring that each position in the puzzle is filled in a fixed sequence. This minimizes the number of options available for each placement and doesn’t require any explicit book-keeping to short-circuit previously-explored board states.
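
The core of that search, again sketched in Clojure with hypothetical data shapes: always fill the lowest-numbered empty cell, so every board state is reached by exactly one ordering of piece placements. Here `placements-by-cell` maps each cell index to the placements (sets of cell indices) that cover it.

(defn solutions
  "Depth-first search filling cells in a fixed order; returns a seq of
solutions, each a vector of the placements used."
  [placements-by-cell n-cells filled placed]
  (if-let [cell (first (remove filled (range n-cells)))]
    (mapcat (fn [placement]
              (when (not-any? filled placement)
                (solutions placements-by-cell n-cells
                           (into filled placement)
                           (conj placed placement))))
            (placements-by-cell cell))
    [placed]))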

Final running-time: 2.5 minutes to generate all four solutions.

Round 3 was the RAMU OCTAHEDRON4. This one is kind of an irregular decahedron5 represented as a 5x5x5 cube with the 4 unit-cubes at each vertex removed. It splits into 8 very irregularly-shaped pieces. There are also two small wooden spheres which occupy unfilled space within the cube and “lock” it by preventing motion of the “key” piece unless the spheres are shifted into particular positions. The site calls the Ramu Octahedron their “most difficult puzzle” and claims that only one person has solved it without reference to the solution. The difficulties of this puzzle are three-fold: the irregularity of the pieces makes conceptualizing their spatial placement difficult; many of the pieces can only be placed by combinations of separate “insertion” then “locking” motions; and once the puzzle is assembled, one must determine how to maneuver the spheres within the unfilled internal space to allow re-disassembly.

Fortunately the piece-irregularity holds little difficulty for a computer. After laboriously entering the shapes of each piece, I was able to re-use code I’d written for the previous puzzle to generate all the interesting piece rotations and translations6 and their various combinations. Once that gave me the solution spatial arrangement, it required a bit of trial-and-error to figure out a working piece order and insert/lock sequences, but I was able to manage it by hand. Figuring out the necessary unlocking rotations was also easy enough, at least after I added a map of the unfilled internal space to the solution.

Final running-time: 1 second to generate the single solution.

Bonus round, back to the King Snake. I decided to come back to this puzzle with what I’d learned from solving the others and try to generate all its solutions in a reasonable amount of time. I didn’t have any new ideas about how to frame the puzzle, but I did have some new implementation tools: pre-generating and indexing by position all legal piece arrangements, and representing puzzle states as bit-fields. These two together allow for a blazing-fast solution.

Final running-time: 30 seconds to generate all four solutions7.

If you own this puzzle, you may at this point be thinking “Wait, what?” The marketing copy and provided instructions for the King Snake claim there are only two solutions. But nope, four. The two extra solutions are very similar to each other, but are quite distinct from the two “official” solutions, and none are simply symmetries of the others.

“Cheating” — hah!

Puzzle-solving code available for your edification.

1 Best Christmas present ever, seriously.

2 For comparison, chess apparently has between 1e43 and 1e50 legal board states.

3 Most actions on the puzzle-state are just a few bitwise operations.

4 Which I really like putting in all caps, for some reason.

5 Yeah, the name confuses me too.

6 But not reflections, what with the three-dimensional limitation and all.

7 FYI, I number from the opposite end than the provided solutions.