<h1>platypope.org / blog / syndicate</h1>
<p>Marshall Bockrath-Vandegrift (llasram@gmail.com), <a href="http://blog.platypope.org/">blog.platypope.org</a></p>
<h2>Interactive Hadoop with Parkour (2014-02-08)</h2>
<p><a href="https://github.com/damballa/parkour">Parkour</a> is a Clojure library for writing Hadoop MapReduce programs. The most
recent release of Parkour<a class="footref" href="#fn.interactive-hadoop-with-parkour.1">1</a> includes new features designed to integrate Parkour
and Hadoop with the typical REPL-based Clojure workflow. Parkour can now lend a
Clojure REPL most of the interactive Hadoop features of the Pig or Hive shells,
but backed by the full power of the Clojure programming language.</p>
<p>This post will walk through setting up a Clojure REPL connected to a live Hadoop
cluster (running via Amazon’s EMR) and interactively developing a MapReduce
program to run on the cluster.</p>
<h3>Problem</h3>
<p class="first">We need a problem to solve with MapReduce. Treading new data-analysis ground
isn’t the goal here, so for this post we’re just going to write a Parkour
version of a program to identify <a href="http://aws.amazon.com/articles/5249664154115844">decade-wise trending terms</a> in the
<a href="http://aws.amazon.com/datasets/8172056142375670">Google Books n-grams corpus</a>. To make things a bit interesting, we will make an
attempt at improving on the original algorithm to include terms which first
appear in a given decade.</p>
<h3>Preliminaries</h3>
<p class="first">This post assumes you either already have fully-configured access to a local
Hadoop cluster or know how to launch a cluster on Amazon EMR.</p>
<p>If launching a cluster on EMR, you’ll need to run your REPL process such that
the local system Hadoop version matches the EMR cluster Hadoop version and has
network access to the cluster services. The easiest way to do this is to run
everything on the EMR cluster master node. This is less convenient for actual
development than other approaches (e.g. configuring Hadoop to use SOCKS), but
involves less setup, and so is what we’ll assume for this post.</p>
<p>Parkour supports the current stable (plus EMR-supported) versions of both Hadoop
1 and Hadoop 2. For this post’s EMR cluster we’ll be using Amazon Hadoop 2.2.0,
but Amazon Hadoop 1.0.2 should work just as well.</p>
<h3>Create a project</h3>
<p class="first">First, install <a href="http://leiningen.org/">Leiningen</a><a class="footref" href="#fn.interactive-hadoop-with-parkour.2">2</a> and create a new project:</p>
<pre class="example">$ lein new app trending-terms
Generating a project called trending-terms based on the 'app' template.
</pre>
<p>Then update the project file to include Parkour and our version of Hadoop:</p>
<pre class="example">(<span class="keyword">defproject</span> <span class="function-name">trending-terms</span> <span class="string">"0.1.0-SNAPSHOT"</span>
<span class="constant">:description</span> <span class="string">"Decade-wise trending terms in the Google Books n-grams corpus."</span>
<span class="constant">:url</span> <span class="string">"http://github.com/llasram/trending-terms"</span>
<span class="constant">:license</span> {<span class="constant">:name</span> <span class="string">"Eclipse Public License"</span>
<span class="constant">:url</span> <span class="string">"http://www.eclipse.org/legal/epl-v10.html"</span>}
<span class="constant">:global-vars</span> {*warn-on-reflection* true}
<span class="constant">:dependencies</span> [[org.clojure/clojure <span class="string">"1.5.1"</span>]
[com.damballa/parkour <span class="string">"0.5.4"</span>]
[org.apache.avro/avro <span class="string">"1.7.5"</span>]
[org.apache.avro/avro-mapred <span class="string">"1.7.5"</span>
<span class="constant">:classifier</span> <span class="string">"hadoop2"</span>]
[transduce/transduce <span class="string">"0.1.1"</span>]]
<span class="constant">:exclusions</span> [org.apache.hadoop/hadoop-core
org.apache.hadoop/hadoop-common
org.apache.hadoop/hadoop-hdfs
org.slf4j/slf4j-api org.slf4j/slf4j-log4j12 log4j
org.apache.avro/avro
org.apache.avro/avro-mapred
org.apache.avro/avro-ipc]
<span class="constant">:profiles</span> {<span class="constant">:provided</span>
{<span class="constant">:dependencies</span> [[org.apache.hadoop/hadoop-client <span class="string">"2.2.0"</span>]
[org.apache.hadoop/hadoop-common <span class="string">"2.2.0"</span>]
[org.slf4j/slf4j-api <span class="string">"1.6.1"</span>]
[org.slf4j/slf4j-log4j12 <span class="string">"1.6.1"</span>]
[log4j <span class="string">"1.2.17"</span>]]}
<span class="constant">:test</span> {<span class="constant">:resource-paths</span> [<span class="string">"test-resources"</span>]}
<span class="constant">:aot</span> {<span class="constant">:aot</span> <span class="constant">:all</span>, <span class="constant">:compile-path</span> <span class="string">"target/aot/classes"</span>}
<span class="constant">:uberjar</span> [<span class="constant">:aot</span>]
<span class="constant">:jobjar</span> [<span class="constant">:aot</span>]})
</pre>
<p>There’s currently some incidental complexity to the project file (the
<code>:exclusions</code> and logging-related dependencies), but just roll with it for now.</p>
<h3>Launch a REPL</h3>
<p class="first">In order to launch a cluster-connected REPL, we’ll want to use the
<a href="https://github.com/llasram/lein-hadoop-cluster">lein-hadoop-cluster</a> Leiningen plugin. We’ll also want the <a href="https://github.com/pallet/alembic">Alembic</a> library for
some of Parkour’s REPL support functionality. We could add these directly to
the project file, but they’re pretty orthogonal to any given individual project,
so we’ll just add them to the <code>:user</code> profile in our <code>~/.lein/profiles.clj</code>:</p>
<pre class="example">{<span class="constant">:user</span>
{<span class="constant">:plugins</span> [[lein-hadoop-cluster <span class="string">"0.1.2"</span>]]
<span class="constant">:dependencies</span> [[alembic <span class="string">"0.2.1"</span>]]}}
</pre>
<p>With those changes made, we can actually launch the REPL from within the project
directory:</p>
<pre class="example">$ lein hadoop-repl
</pre>
<p>Then (optional, but suggested) we can connect to the REPL process from our editor.</p>
<h3>Examine the data</h3>
<p class="first">Let’s get started writing code, in <code>src/trending_terms/core.clj</code>. First, the
namespace preliminaries:</p>
<pre class="example">(<span class="keyword">ns</span> trending-terms.core
(<span class="constant">:require</span> [clojure.string <span class="constant">:as</span> str]
[clojure.core.reducers <span class="constant">:as</span> r]
[transduce.reducers <span class="constant">:as</span> tr]
[abracad.avro <span class="constant">:as</span> avro]
[parkour (conf <span class="constant">:as</span> conf) (fs <span class="constant">:as</span> fs) (mapreduce <span class="constant">:as</span> mr)
, (graph <span class="constant">:as</span> pg) (reducers <span class="constant">:as</span> pr)]
[parkour.io (seqf <span class="constant">:as</span> seqf) (avro <span class="constant">:as</span> mra) (dux <span class="constant">:as</span> dux)
, (sample <span class="constant">:as</span> sample)]))
</pre>
<p>Then we can define some functions to access the n-gram corpus:</p>
<pre class="example">(<span class="keyword">def</span> <span class="function-name">ngram-base-url</span>
<span class="string">"Base URL for Google Books n-gram dataset."</span>
<span class="string">"s3://datasets.elasticmapreduce/ngrams/books/20090715"</span>)
(<span class="keyword">defn</span> <span class="function-name">ngram-url</span>
<span class="doc">"URL for Google Books `n`-gram dataset."</span>
[n] (fs/uri ngram-base-url <span class="string">"eng-us-all"</span> (<span class="builtin">str</span> n <span class="string">"gram"</span>) <span class="string">"data"</span>))
(<span class="keyword">defn</span> <span class="function-name">ngram-dseq</span>
<span class="doc">"Distributed sequence for Google Books `n`-grams."</span>
[n] (seqf/dseq (ngram-url n)))
</pre>
<p>With these functions in hand, we can start exploring the data:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (r/take 2) (<span class="builtin">into</span> []))
[[1 <span class="string">"#\t1584\t6\t6\t1"</span>] [2 <span class="string">"#\t1596\t1\t1\t1"</span>]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq)
(r/take 2) (<span class="builtin">into</span> []))
[[265721405 <span class="string">"APPROPRIATE\t1999\t492\t470\t325"</span>]
[265721406 <span class="string">"APPROPRIATE\t2000\t793\t723\t375"</span>]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 2})
(r/take 2) (<span class="builtin">into</span> []))
[[141968715 <span class="string">"poseen\t2007\t10\t10\t8"</span>]
[141968716 <span class="string">"poseen\t2008\t13\t13\t10"</span>]]
</pre>
<p>The cluster-connected REPL gives us direct live access to the actual data we
want to analyze. The <code>sample/dseq</code> function allows us to select small samples
from that data – as random as Hadoop can make them while remaining efficient.</p>
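<p>The options map lets us pin the sample down. For example – anticipating the
mixed-mode run later in this post – a fixed <code>:seed</code> makes the sample reproducible,
and <code>:splits</code> caps the number of input splits it draws from:</p>
<pre class="example">;; A reproducible sample drawn from at most 20 input splits.
(->> (ngram-dseq 1)
     (sample/dseq {:seed 1, :splits 20})
     (r/take 2) (into []))
</pre>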
<h3>Write a function</h3>
<p class="first">With direct access to the raw records our jobs get from Hadoop, we can easily
begin writing functions to operate on those records:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">parse-record</span>
<span class="doc">"Parse text-line 1-gram record `rec` into ((gram, decade), n) tuple."</span>
[rec]
(<span class="keyword">let</span> [[gram year n] (str/split rec #<span class="string">"\t"</span>)
gram (str/lower-case gram)
year (<span class="preprocessor">Long/parseLong</span> year)
n (<span class="preprocessor">Long/parseLong</span> n)
decade (<span class="keyword">-></span> year (<span class="builtin">quot</span> 10) (<span class="builtin">*</span> 10))]
[[gram decade] n]))
(<span class="keyword">defn</span> <span class="function-name">select-record?</span>
<span class="doc">"True iff argument record should be included in analysis."</span>
[[[gram year]]]
(<span class="keyword">and</span> (<span class="builtin"><=</span> 1890 year)
(<span class="builtin">re-matches</span> #<span class="string">"^[a-z+'-]+$"</span> gram)))
</pre>
<p>And doing some quick REPL-testing:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1}) (r/map second)
(r/map parse-record)
(r/take 2) (<span class="builtin">into</span> []))
[[[<span class="string">"appropriate"</span> 1990] 492] [[<span class="string">"appropriate"</span> 2000] 793]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (r/map second)
(r/map parse-record)
(r/filter select-record?)
(r/take 2) (<span class="builtin">into</span> []))
[[[<span class="string">"a+"</span> 1890] 1] [[<span class="string">"a+"</span> 1890] 2]]
</pre>
<p>We can glue these functions together into the first map task function we’ll
need for our jobs:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">normalized-m</span>
<span class="doc">"Parse, normalize, and filter 1-gram records."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:vals</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[records]
(<span class="keyword">->></span> records
(r/map parse-record)
(r/filter select-record?)
(pr/reduce-by first (pr/mjuxt pr/arg1 +) [nil 0])))
</pre>
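<p>The <code>pr/reduce-by</code> step merges runs of consecutive records sharing a key. As a
rough, non-streaming model in plain Clojure of what it computes here (a sketch of
the semantics, not Parkour’s implementation):</p>
<pre class="example">;; Sum occurrence counts for consecutive records sharing a key.
(->> [[["a+" 1890] 1] [["a+" 1890] 2] [["a+" 1900] 5]]
     (partition-by first)
     (map (fn [recs]
            [(ffirst recs) (reduce + (map second recs))])))
;;=> ([["a+" 1890] 3] [["a+" 1900] 5])
</pre>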
<p>Because Parkour task functions are just functions, we can test that in the REPL
as well:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1}) (r/map second)
normalized-m (r/take 3) (<span class="builtin">into</span> []))
[[[<span class="string">"appropriate"</span> 1990] 492]
[[<span class="string">"appropriate"</span> 2000] 7242]
[[<span class="string">"appropriated"</span> 1890] 59]]
</pre>
<p>As we develop the functions composing our program, we can iterate rapidly by
immediately seeing the effect of calling our in-development functions on real
data.</p>
<h3>Write some jobs</h3>
<p class="first">Writing the rest of the jobs is a simple matter of programming. We’ll largely
follow the original Hive version, but we will attempt to add the ability for
entirely new terms to trend by applying Laplace smoothing to the occurrence
counts.</p>
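<p>To make the smoothing concrete before diving into the jobs: each per-decade
gram count is incremented by one, and each decade total by the number of distinct
1-grams, so a brand-new term still gets a finite trend ratio. A toy sketch (the
names here are illustrative, not from the jobs below):</p>
<pre class="example">;; Add-one (Laplace) smoothing: a gram with zero occurrences in a
;; decade still gets a small non-zero frequency.
(defn smoothed-freq
  [gram-count decade-total n-grams]
  (/ (inc gram-count) (+ decade-total n-grams)))

(smoothed-freq 0 1000000 500000)   ;=> 1/1500000, instead of 0
</pre>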
<p>Parkour allows us to optimize the entire process down to just two jobs (and no
reduce-side joins). This is possible via the base Java Hadoop interfaces too, but
you’d rarely <em>bother</em> to do it there, because it’d be too fiddly in the raw APIs;
in most higher-level frameworks it’s completely impossible. In Clojure with
Parkour, however, it’s relatively straightforward.</p>
<p>But! – today’s main, REPL-based excitement comes after. So, without further
commentary, the rest of the code:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">nth0-p</span>
<span class="doc">"Partitioning function for first item of key tuple."</span>
<span class="preprocessor">^</span><span class="type">long</span> [k v <span class="preprocessor">^</span><span class="type">long</span> n] (<span class="keyword">-></span> k (<span class="builtin">nth</span> 0) hash (<span class="builtin">mod</span> n)))
(<span class="keyword">defn</span> <span class="function-name">normalized-r</span>
<span class="doc">"Collect: tuples of 1-gram and occurrence counts by decade; total 1-gram
occurrence counts by decade; and counter of total number of 1-grams."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvalgroups</span>,
<span class="constant">::mr/sink-as</span> (dux/named-keyvals <span class="constant">:totals</span>)}
[input]
(<span class="keyword">let</span> [nwords (<span class="preprocessor">.getCounter</span> mr/*context* <span class="string">"normalized"</span> <span class="string">"nwords"</span>)
fnil+ (fnil + 0)]
(<span class="keyword">->></span> input
(r/map (<span class="keyword">fn</span> [[[gram decade] ns]]
[gram [decade (<span class="builtin">reduce</span> + 0 ns)]]))
(pr/reduce-by first (pr/mjuxt pr/arg1 conj) [nil {}])
(<span class="builtin">reduce</span> (<span class="keyword">fn</span> [totals [gram counts]]
(dux/write mr/*context* <span class="constant">:counts</span> gram (<span class="builtin">seq</span> counts))
(<span class="builtin">merge-with</span> + totals counts))
{})
(<span class="builtin">seq</span>))))
(<span class="keyword">defn</span> <span class="function-name">normalized-j</span>
<span class="doc">"Run job accumulating maps of decade-wise raw occurrence counts per 1-gram
and map of total decade-wise Laplace-smoothed occurrence counts."</span>
[conf workdir ngrams]
(<span class="keyword">let</span> [counts-path (fs/<span class="builtin">path</span> workdir <span class="string">"counts"</span>)
totals-path (fs/<span class="builtin">path</span> workdir <span class="string">"totals"</span>)
pkey-as (avro/tuple-schema ['string 'long])
counts-as {<span class="constant">:type</span> 'array, <span class="constant">:items</span> (avro/tuple-schema ['long 'long])}
[counts totals]
, (<span class="keyword">-></span> (pg/input ngrams)
(pg/map #'normalized-m)
(pg/partition (mra/shuffle pkey-as 'long) #'nth0-p)
(pg/reduce #'normalized-r)
(pg/output <span class="constant">:counts</span> (mra/dsink ['string counts-as] counts-path)
<span class="constant">:totals</span> (mra/dsink ['long 'long] totals-path))
(pg/execute conf <span class="string">"normalized"</span>))
gramsc (<span class="keyword">-></span> (mr/counters-map totals)
(<span class="builtin">get-in</span> [<span class="string">"normalized"</span> <span class="string">"nwords"</span>])
double)
fnil+ (fnil + gramsc)
totals (<span class="builtin">reduce</span> (<span class="keyword">fn</span> [m [d n]]
(<span class="builtin">update-in</span> m [d] fnil+ n))
{} totals)]
[counts totals]))
(<span class="keyword">defn</span> <span class="function-name">trending-m</span>
<span class="doc">"Transform decade-wise 1-gram occurrence counts into negated ratios of
occurrence frequencies in adjacent decades."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvals</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[totals input]
(r/mapcat (<span class="keyword">fn</span> [[gram counts]]
(<span class="keyword">let</span> [counts (<span class="builtin">into</span> {} counts)
ratios (<span class="builtin">reduce-kv</span> (<span class="keyword">fn</span> [m dy nd]
(<span class="keyword">let</span> [ng (<span class="builtin">inc</span> (counts dy 0))]
(<span class="builtin">assoc</span> m dy (<span class="builtin">/</span> ng nd))))
{} totals)]
(<span class="keyword">->></span> (<span class="builtin">seq</span> ratios)
(r/map (<span class="keyword">fn</span> [[dy r]]
(<span class="keyword">let</span> [r' (ratios (<span class="builtin">-</span> dy 10))]
(<span class="keyword">if</span> (<span class="keyword">and</span> r' (<span class="builtin"><</span> 0.000001 r))
[[dy (<span class="builtin">-</span> (<span class="builtin">/</span> r r'))] gram]))))
(r/<span class="builtin">remove</span> nil?))))
input))
(<span class="keyword">defn</span> <span class="function-name">trending-r</span>
<span class="doc">"Select top `n` 1-grams per decade by negated occurrence frequency ratios."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvalgroups</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[n input]
(r/map (<span class="keyword">fn</span> [[[decade] grams]]
[decade (<span class="builtin">into</span> [] (r/take n grams))])
input))
(<span class="keyword">defn</span> <span class="function-name">trending-j</span>
<span class="doc">"Run job selecting trending terms per decade."</span>
[conf workdir topn counts totals]
(<span class="keyword">let</span> [ratio-as (avro/tuple-schema ['long 'double])
ratio+g-as (avro/grouping-schema 1 ratio-as)
grams-array {<span class="constant">:type</span> 'array, <span class="constant">:items</span> 'string}
trending-path (fs/<span class="builtin">path</span> workdir <span class="string">"trending"</span>)
[trending]
, (<span class="keyword">-></span> (pg/input counts)
(pg/map #'trending-m totals)
(pg/partition (mra/shuffle ratio-as 'string ratio+g-as)
#'nth0-p)
(pg/reduce #'trending-r topn)
(pg/output (mra/dsink ['long grams-array] trending-path))
(pg/execute conf <span class="string">"trending"</span>))]
(<span class="builtin">into</span> (<span class="builtin">sorted-map</span>) trending)))
(<span class="keyword">defn</span> <span class="function-name">trending-terms</span>
<span class="doc">"Calculate the top `topn` trending 1-grams per decade from Google Books 1-gram
corpus dseq `ngrams`. Writes job outputs under `workdir` and configure jobs
using Hadoop configuration `conf`. Returns map of initial decade years to
vectors of trending terms."</span>
[conf workdir topn ngrams]
(<span class="keyword">let</span> [[counts totals] (normalized-j conf workdir ngrams)]
(trending-j conf workdir topn counts totals)))
</pre>
<p>Once we start writing our job, we can use Parkour’s “mixed mode” job execution
to iterate. Mixed mode allows us to run the job in the REPL process, but on the
same live sampled data we were examining before:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1, <span class="constant">:splits</span> 20})
(trending-terms (conf/local-mr!) <span class="string">"file:tmp/tt/0"</span> 5))
{1900 [<span class="string">"deglet"</span> <span class="string">"warroad"</span> <span class="string">"delicado"</span> <span class="string">"erostrato"</span> <span class="string">"warad"</span>],
1910 [<span class="string">"esbly"</span> <span class="string">"wallonie"</span> <span class="string">"wallstein"</span> <span class="string">"dehmel's"</span> <span class="string">"dellion"</span>],
1920 [<span class="string">"ernestino"</span> <span class="string">"walska"</span> <span class="string">"wandis"</span> <span class="string">"wandke"</span> <span class="string">"watasenia"</span>],
1930 [<span class="string">"delacorte"</span> <span class="string">"phytosociological"</span> <span class="string">"priuatly"</span> <span class="string">"delber"</span> <span class="string">"escapism"</span>],
1940 [<span class="string">"phthalates"</span> <span class="string">"phylic"</span> <span class="string">"espiner"</span> <span class="string">"degrease"</span> <span class="string">"wallonie"</span>],
1950 [<span class="string">"demokos"</span> <span class="string">"wanotan"</span> <span class="string">"ersine"</span> <span class="string">"dekatron"</span> <span class="string">"physicalistically"</span>],
1960 [<span class="string">"warain"</span> <span class="string">"propeking"</span> <span class="string">"warschaw"</span> <span class="string">"demecarium"</span> <span class="string">"pikiran"</span>],
1970 [<span class="string">"prioritize"</span> <span class="string">"walshok"</span> <span class="string">"waterboard"</span> <span class="string">"demogrants"</span> <span class="string">"delisted"</span>],
1980 [<span class="string">"warsl"</span> <span class="string">"watasi"</span> <span class="string">"proarrhythmic"</span> <span class="string">"walonick"</span> <span class="string">"procurved"</span>],
1990 [<span class="string">"wanglie"</span> <span class="string">"procedores"</span> <span class="string">"printlnc"</span> <span class="string">"dejanews"</span> <span class="string">"eslamboli"</span>],
2000 [<span class="string">"wardriving"</span> <span class="string">"erlestoke"</span> <span class="string">"deleterole"</span> <span class="string">"deleteq"</span> <span class="string">"erius"</span>]}
</pre>
<p>This is the moral equivalent of Pig’s <code>ILLUSTRATE</code> command. Parkour lacks the
rigid execution model which allows Pig’s <code>ILLUSTRATE</code> to e.g. synthesize data for
joins, but the simplicity of the approach means any combination of “remote” jobs
and local processing just works, without surprises.</p>
<p>As with sampling input for individual functions, mixed mode job execution
supports rapid development iteration on real data.</p>
<h3>Write a test</h3>
<p class="first">REPL iteration gets results quickly, but once code works, a real test allows us
to have confidence that it will continue to work, and the results are what we
actually expect. So let’s write a test for our job graph:</p>
<pre class="example">(<span class="keyword">ns</span> trending-terms.core-test
(<span class="constant">:require</span> [clojure.test <span class="constant">:refer</span> <span class="constant">:all</span>]
[trending-terms.core <span class="constant">:as</span> tt]
[parkour (fs <span class="constant">:as</span> fs) (conf <span class="constant">:as</span> conf)]
[parkour.io (dsink <span class="constant">:as</span> dsink) (seqf <span class="constant">:as</span> seqf)]
[parkour.test-helpers <span class="constant">:as</span> th])
(<span class="constant">:import</span> [org.apache.hadoop.io <span class="preprocessor">Text</span> <span class="preprocessor">LongWritable</span>]))
(<span class="keyword">def</span> <span class="function-name">n1grams-records</span>
<span class="comment-delimiter">;; </span><span class="comment">omitted from blog post for clarity
</span> )
(<span class="keyword">deftest</span> <span class="function-name">test-basic</span>
(th/with-config
(<span class="keyword">let</span> [workdir (<span class="keyword">doto</span> <span class="string">"tmp/work/basic"</span> fs/path-delete)
inpath (fs/<span class="builtin">path</span> workdir <span class="string">"ngrams"</span>)
ngrams (dsink/with-dseq (seqf/dsink [<span class="preprocessor">LongWritable</span> <span class="preprocessor">Text</span>] inpath)
n1grams-records)
trending (tt/trending-terms (th/config) workdir 1 ngrams)]
(<span class="builtin">is</span> (<span class="builtin">=</span> {1950 [<span class="string">"legal"</span>],
1960 [<span class="string">"assembly"</span>],
1970 [<span class="string">"astoria"</span>],
2000 [<span class="string">"prostate"</span>]}
trending)))))
</pre>
<p>The Parkour <code>th/config</code> function and <code>th/with-config</code> macro allow us to run code
using a purely local-mode Hadoop configuration, even in a process where the
default configuration points to a live cluster. Just like we were able to
REPL-test jobs using mixed mode, we can now run our actual tests in-REPL in full
local mode:</p>
<pre class="example">trending-terms.core-test> (<span class="builtin">run-tests</span>)
<span class="preprocessor">Testing</span> trending-terms.core-test
<span class="preprocessor">Ran</span> 1 tests containing 1 assertions.
0 failures, 0 errors.
{<span class="constant">:type</span> <span class="constant">:summary</span>, <span class="constant">:pass</span> 1, <span class="constant">:test</span> 1, <span class="constant">:error</span> 0, <span class="constant">:fail</span> 0}
</pre>
<p>Success!</p>
<h3>Launch a job</h3>
<p class="first">Once our program is developed and tested, it’s time to run it on the full
dataset. Normally this would involve leaving the REPL to build a job JAR and
deploy it somehow, but Parkour allows us to do this directly from the REPL too:</p>
<pre class="example">trending-terms.core> (<span class="builtin">require</span> '[parkour.repl <span class="constant">:refer</span> [launch!]])
trending-terms.core> (<span class="keyword">def</span> <span class="function-name">*results</span>
(<span class="builtin">future</span> (<span class="keyword">->></span> (ngram-dseq 1)
(launch! {<span class="string">"mapred.reduce.tasks"</span> 8}
trending-terms <span class="string">"tt/0"</span> 5))))
#<Var@3a23a4ec: #<Future@55f5a074: <span class="constant">:pending>></span>
trending-terms.core> (realized? *results)
false
</pre>
<p>(We run the jobs in a future to place them in a background thread, and thus
avoid tying up the REPL while the jobs are running.)</p>
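<p>Because the result is an ordinary Clojure future, we can also poll it with a
timeout instead of blocking – plain Clojure, nothing Parkour-specific:</p>
<pre class="example">;; Block for at most one second, yielding :still-running if the
;; jobs haven't finished yet.
(deref *results 1000 :still-running)
</pre>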
<p>Parkour uses hugoduncan’s <a href="https://github.com/pallet/alembic">Alembic</a> library to load and interact with a full
in-process (but isolated) instance of Leiningen. Using this Leiningen instance,
Parkour just builds your job JAR and assembles your dependencies exactly as
specified by your Leiningen project file.</p>
<p>Once the job finishes, time for some results:</p>
<pre class="example">trending-terms.core> (realized? *results)
true
trending-terms.core> @*results
{1900 [<span class="string">"strether"</span> <span class="string">"fluidextract"</span> <span class="string">"thutmose"</span> <span class="string">"adrenalin"</span> <span class="string">"lekythoi"</span>],
1910 [<span class="string">"orthotype"</span> <span class="string">"britling"</span> <span class="string">"salvarsan"</span> <span class="string">"pacifist"</span> <span class="string">"boches"</span>],
1920 [<span class="string">"liliom"</span> <span class="string">"bacteriophage"</span> <span class="string">"prohack"</span> <span class="string">"vanzetti"</span> <span class="string">"erlend"</span>],
1930 [<span class="string">"vridar"</span> <span class="string">"samghin"</span> <span class="string">"mulan"</span> <span class="string">"nazis"</span> <span class="string">"goebbels"</span>],
1940 [<span class="string">"psia"</span> <span class="string">"plutonium"</span> <span class="string">"luftwaffe"</span> <span class="string">"darlan"</span> <span class="string">"beachhead"</span>],
1950 [<span class="string">"lopatkin"</span> <span class="string">"rooscvelt"</span> <span class="string">"fluoridation"</span> <span class="string">"jacy"</span> <span class="string">"desegregation"</span>],
1960 [<span class="string">"vietcong"</span> <span class="string">"synanon"</span> <span class="string">"tshombe"</span> <span class="string">"lumumba"</span> <span class="string">"psychedelic"</span>],
1970 [<span class="string">"mdhr"</span> <span class="string">"sexist"</span> <span class="string">"sexism"</span> <span class="string">"biofeedback"</span> <span class="string">"counterculture"</span>],
1980 [<span class="string">"affit"</span> <span class="string">"autocad"</span> <span class="string">"dbase"</span> <span class="string">"neob"</span> <span class="string">"garion"</span>],
1990 [<span class="string">"activex"</span> <span class="string">"photoshop"</span> <span class="string">"javascript"</span> <span class="string">"netscape"</span> <span class="string">"toolbars"</span>],
2000 [<span class="string">"cengage"</span> <span class="string">"eventargs"</span> <span class="string">"itunes"</span> <span class="string">"podcast"</span> <span class="string">"wsdl"</span>]}
</pre>
<p>How trendy! It looks like our smoothing function has added more noise from rare
terms<a class="footref" href="#fn.interactive-hadoop-with-parkour.3">3</a>, but the basics are there for the tweaking.</p>
<p>The complete example <a href="https://github.com/llasram/trending-terms">trending-terms</a> project is on Github, if you want to try
experimenting with it (in a live REPL!) yourself.</p>
<h3>Parting thoughts</h3>
<p class="first">Thanks to <a href="https://github.com/rfarrjr">rfarrjr</a> for awesome discussion around these features, and to <a href="https://github.com/ztellman">ztellman</a>
and <a href="https://github.com/ahobson">ahobson</a> for prompting their value and for specific suggestions. I was
honestly skeptical at first that this sort of REPL integration could be made
useful, from past experience trying to make a live-cluster Cascalog REPL work.
But now that these features exist, I’m not sure how I wrote MapReduce programs
without them.</p>
<p>So head over to the <a href="https://github.com/damballa/parkour#parkour">Parkour project</a> and get started!</p>
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.1" id="fn.interactive-hadoop-with-parkour.1">1</a> Version 0.5.4, at the time of writing.
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.2" id="fn.interactive-hadoop-with-parkour.2">2</a> On the EMR Hadoop cluster master node.
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.3" id="fn.interactive-hadoop-with-parkour.3">3</a> Especially OCR errors.</p>
<h2>(&lt; Ruby Python Clojure) (2012-11-25)</h2>
<p>It’s been a week, but I’m still revved up from attending <a href="http://clojure-conj.org/">Clojure/conj</a>.
The energy, dedication, and intelligence in the Clojure community all
continue to impress me, and definitely influence my choice of Clojure
as my preferred programming language. Of course the language itself
remains the most significant factor, and that’s something I’ve been
meaning to blog about for some time.</p>
<p>My language of choice has gone from C to Perl<a class="footref" href="#fn.ruby-python-clojure.1">1</a> to Python to Ruby to
Python again, and then finally to Clojure. Each transition has been
motivated by a perceived increase in practical expressiveness – an
ability to communicate both effect and intent more directly and
efficiently, both to the computer and to other humans. C to Perl
provided garbage collection, first-class collections, and CPAN. Perl
to Python provided real objects, comprehensible semantics, and a
syntax that did not resemble line-noise. Python to Ruby provided
blocks, open classes, and pervasive metaprogramming.</p>
<p>Ruby back to Python takes a bit more explaining. In so many ways the
languages are so similar as to be almost indistinguishable.</p>
<p>One difference is in the communities surrounding the languages. Ruby
has come a long, long way from the days when most of its documentation
was only available in Japanese. But – there are differing cultural
norms between the two communities on how best to balance
documentation, stable interfaces, implementation quality, and the
introduction of new features. I feel that the Ruby community
pervasively emphasizes the latter above the others to a greater extent
than I prefer. The Python language, standard library, and third-party
libraries tend to be better documented and more stable than their Ruby
counterparts. Not universally or to an unjustifiable extent – just
sufficiently that I see a distinction, and prefer Python.</p>
<p>The other most significant aspect is that I believe Python APIs tend
to be more value-oriented than their equivalent Ruby APIs. This one
is a little difficult to explain...</p>
<p>Ruby more closely follows the canonical OO model by having a concept
of “messages.” All method invocations and function calls are actually
deliveries of messages to objects. When you see a line of Ruby code
like:</p>
<pre class="example">invoke_method arg
</pre>
<p>All you know is that the implicit <code>self</code> at that particular line is
going to receive the <code>invoke_method</code> message with the argument <code>arg</code>. You
first need to unwind the evaluation state to figure what object <code>self</code>
is at that point. Then you need to figure out how it will actually
handle the message, which may be via a class instance method, an
included module method, or <code>method_missing</code>. Messages themselves aren’t
first-class or introspectable, so there’s no way to ask “if I send the
<code>invoke_method</code> message here, who exactly will respond?”</p>
<p>Python in contrast has no implicit scopes and models everything in
terms of attribute access and function-calling<a class="footref" href="#fn.ruby-python-clojure.2">2</a>. When you see a
line of Python code like:</p>
<pre class="example"><span class="builtin">object</span>.method(arg)
</pre>
<p>You know there is an in-scope, introspectable value named <code>object</code> at
this line. That value is asked via the attribute-access protocol for
its <code>method</code> member, which yields another introspectable value. That
attribute value is invoked via the function call protocol with the
argument <code>arg</code>. At every step of the way you have a concrete,
introspectable value backed by code you can find and follow.</p>
<p>In my experience these differences aren’t just a curiosity, but
directly impact the way one most concisely expresses code in the two
languages. In Ruby libraries it is a very common idiom to provide
methods invoked within a class definition or <code>instance_eval</code>’d block
which build state by modifying the implicit self. The moral analog of
this in Python (metaclasses) is relatively rare, and most Python
libraries provide interfaces in terms of concrete values. For a good
example of what I’m talking about, look at the difference between the
<a href="https://github.com/ffi/ffi">ruby-ffi</a> and Python <a href="http://docs.python.org/2/library/ctypes.html">ctypes</a> libraries.</p>
<p>And then finally, Clojure.</p>
<p>For the few years I’ve been using it, I’ve found that Clojure
maximizes practical expressiveness in three ways which beat all other
languages I’ve tried. It has everything you’d expect from a language
like Python or Ruby, but then adds these things on top.</p>
<p>First there’s the obvious: macros. Ruby and Python provide
metaprogramming facilities which allow programmatically generating any
constructs one could produce directly through code. But doing so is
not always going to be as legible or compact as simply writing out the
desired result, nor will it necessarily bear a clear relation to that
directly-expressed version. With Clojure macros, you instead have the
ability to do arbitrary in-code code-generation, using the same
functions as for general-purpose code, and acting effectively like
executable structural templates for the expanded code. In Clojure,
there never needs to be any boilerplate – even if macros aren’t always
the <em>best</em> method for eliminating code repetition, they are a method
which can be used to eliminate <em>any</em> code repetition.</p>
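<p>As a small, hypothetical taste of what that looks like – a macro that stamps
out accessor functions for a set of keys, eliminating the per-key boilerplate
(a sketch, not code from any real library):</p>
<pre class="example">(defmacro defaccessors
  "Define a getter function for each keyword in `ks`."
  [& ks]
  `(do ~@(for [k ks]
           `(defn ~(symbol (name k)) [m#] (get m# ~k)))))

;; (defaccessors :host :port) expands into
;; (do (defn host [m] (get m :host)) (defn port [m] (get m :port)))
</pre>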
<p>Second is the language’s functional style and culture. Without the
language itself enforcing functional purity, the style of the Clojure
standard library and the norms of the Clojure community encourage it.
This gives what I see as the best of both worlds: the ability to write
and reason about most code as pure functions operating on immutable
values, while escaping to side effects and mutation when necessary
without any excess ceremony. This makes it even more value-oriented
than Python, with almost all execution happening in terms of values
which are not only concrete and visible, but also immutable.</p>
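<p>A tiny illustration of that division, in plain Clojure: pure transformation of
immutable values by default, with an atom when mutable state is genuinely needed:</p>
<pre class="example">;; Pure by default: update-in returns a new immutable map.
(update-in {:hits 0} [:hits] inc)   ;=> {:hits 1}

;; Uncceremonious mutation when needed: an atom plus swap!.
(def counter (atom 0))
(swap! counter inc)                 ;=> 1
</pre>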
<p>Third is Clojure’s definition as an explicitly hosted language, and
specifically the JVM as the best-supported hosting platform. I was a
bit resistant at first to getting deeply acquainted with the JVM and
(oh, the humanity!) Java, but I now see it as a significant benefit.
Part of practical expressiveness is not needing to write code
incidental to the application domain. The vast ecosystem of Java
libraries and the ease with which <a href="https://github.com/technomancy/leiningen">Leiningen</a> allows access to them
significantly cuts down on incidental code. The JVM itself of course
is rock-solid, fast<a class="footref" href="#fn.ruby-python-clojure.3">3</a>, and universal. From a practical perspective,
it’s a clear win for almost all applications.</p>
<p>Anyway, enough blogging – time to go write some Clojure!</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.1" id="fn.ruby-python-clojure.1">1</a> In college >10 years ago, so cut me a little slack.</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.2" id="fn.ruby-python-clojure.2">2</a> Which I believe you could think of as being messages, but I’m not
sure that really helps much.</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.3" id="fn.ruby-python-clojure.3">3</a> Well, after boot.</p>
<h2>Without comment (2012-04-05)</h2>
<p><strong>Update</strong>: Ok, I was wrong! Or at least I greatly over-stated my case. People have
shown me lots of great examples of when comments-qua-comments can be useful<a class="footref" href="#fn.without-comment.1">1</a>.
I still think that <em>much of the time</em>, the time spent writing explanatory
comments would be better spent just making the implementation clearer. But
code can’t always perfectly capture intent or the influence of external
factors. Consider my incipient dogmatism rescinded.</p>
<p>I am on the record at my current office saying that I prefer to work on
code-bases without comments. I do my best to follow this preference in the
code I write myself, which occasionally provokes some, er – comment. I can see
why this might be controversial, so I’d like to explain.</p>
<h3>Not comments</h3>
<p class="first">First off, some things that aren’t “comments” in the sense I mean; this:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">clojure-function</span>
<span class="doc">"Calculates a result from arguments `args`."</span>
[& args] ... result)
</pre>
<p>Or this:</p>
<pre class="example"><span class="keyword">def</span> <span class="function-name">python_function</span>(*args):
<span class="string">"""Calculates a result from arguments `args`."""</span>
...
<span class="keyword">return</span> result
</pre>
<p>Or even this:</p>
<pre class="example"><span class="doc">/**
* Calculates a result from arguments.
*
* </span><span class="doc"><span class="constant">@param</span></span><span class="doc"> args the arguments to use in the calculation
*/</span>
<span class="type">AbstractInterfaceAdaptorProxy</span>
<span class="function-name">javaMethod</span>(<span class="type">FlyweightFacadeFactory</span> <span class="variable-name">args</span>...) {
...
<span class="keyword">return</span> result;
}
</pre>
<p>The first two examples are obviously not comments – they’re strings. To be
precise, they’re <a href="http://en.wikipedia.org/wiki/Docstring">docstrings</a>. The third example uses Java’s syntax for
comments, but only because Java doesn’t have docstrings. All three are pieces
of <em>interface documentation</em> which are consumed in a structured way by an
automated documentation system. The syntax of comments provides a convenient
way to bolt on structured in-line interface documentation for languages which
don’t have built-in support for it, but it’s hardly what comments are “for”;
otherwise languages with real docstrings wouldn’t have a separate syntax for
comments.</p>
<p>Also not comments: the input to systems like <a href="http://jashkenas.github.com/docco/">docco</a> and <a href="http://fogus.me/fun/marginalia/">marginalia</a>. These sorts
of systems use the syntax of comments to bolt on support for producing
comprehensive, structured <em>implementation</em> documentation. They only use the
syntax of comments because no one aside from <a href="http://en.wikipedia.org/wiki/Literate_programming">Donald Knuth</a> seems to be able to
make it work to write the implementation in-line in the documentation<a class="footref" href="#fn.without-comment.2">2</a>.
Turning the documentation/implementation relationship inside-out shows use of
the “comment” syntax as an artifact of convenience.</p>
<h3>Comments, a taxonomy</h3>
<p class="first">Ok, so what does that leave as “actually comments”?</p>
<h4>Implementation-repetition</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Append a dot to the end
</span>some_string += <span class="string">"."</span>
</pre>
<p>I can read just fine, thanks!</p>
<h4>From the peanut gallery</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Wow, what a kludge
</span>object.send(<span class="constant">:private_method</span>)
</pre>
<p>Well, then why are you doing it?</p>
<h4>Completely wrong</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Only show users specials on Sunday
</span><span class="keyword">return</span> <span class="variable-name">false</span> <span class="keyword">unless</span> username =~ <span class="type">VALID_USERNAME_RE</span>
</pre>
<p>Looks like someone changed the code without making sure the comments still
matched!</p>
<h4>Right?</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Increment by 2 to account for leap-seconds since 2004
</span>seconds += 3
</pre>
<p>Just close enough to create the potential for confusion. Has there been
another leap-second, or is <code>seconds</code> being incremented for a completely different
reason?</p>
<h4>What I tell you three times is true</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">BACKEND-192: Implement secondary sort so that we can track preferences
</span><span class="comment-delimiter"># </span><span class="comment">on a per-user basis.
</span>operation.sort_key = [<span class="constant">:overall_rating</span>, <span class="constant">:user_preference</span>]
</pre>
<p>So now there are three different explanations of what the code is doing: the
high-level description linked to changing requirements in the issue tracking
system, the historical implementation description in the associated commit
message, and... this comment, which isn’t linked to changing requirements or to
the history of changes. Unless you look at the ticket to see what’s changed,
or look at the commit log to see the change in context. The comment in the
code adds absolutely nothing.</p>
<h4>Right, but...</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Increment by 3 to account for leap-seconds since 2004
</span>seconds += 3
</pre>
<p>The comment is right, and usefully explains what’s happening semantically – but
why use a comment when you could just make the implementation read as clearly
without one?</p>
<pre class="example"><span class="type">LEAP_SECONDS_SINCE_2004</span> = 3
...
seconds += <span class="type">LEAP_SECONDS_SINCE_2004</span>
</pre>
<p>If you need to explain <em>why</em> you’re adding in the leap-seconds, you can add a
semantically-named function/method which performs the operation. I strongly
believe that in most situations it’s possible to make the code itself just as
clear as any comment could make it.</p>
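<p>For instance – a hypothetical sketch of that refactoring, in Clojure rather
than the Ruby of the examples above – pushing both the value and the reason into
named code:</p>
<pre class="example">(def leap-seconds-since-2004 3)

(defn adjust-for-leap-seconds
  "Account for leap-seconds accumulated since 2004."
  [seconds]
  (+ seconds leap-seconds-since-2004))
</pre>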
<h4>Explaining the horror</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Warning: massive kludge, but we can’t fix it until we re-implement
</span><span class="comment-delimiter"># </span><span class="comment">the primary business logic, which is currently written in a dialect
</span><span class="comment-delimiter"># </span><span class="comment">of REXX invented by Jim, who quit yesterday.
</span>object.send(<span class="constant">:private_method</span>)
</pre>
<p>Hey, that’s actually useful!</p>
<p>It also indicates a code-base I’d <em>really</em> prefer <em>not</em> to work on if I can avoid
it, and is also the kind of comment I certainly never <em>want</em> to feel the need to
write myself.</p>
<h3>Conclusion</h3>
<p class="first">So that’s why I prefer there be no comments: the only real use I see for
comments-qua-comments is unstructured documentation of the reasons behind
specific implementation warts. If I can help it, I prefer the code-bases I
work on not have such warts. QED.</p>
<p>Comments?</p>
<p class="footnote"><a class="footnum" name="fn.without-comment.1" id="fn.without-comment.1">1</a> Erik Peterson’s list in the comments on this post is a pretty good summary.</p>
<p class="footnote"><a class="footnum" name="fn.without-comment.2" id="fn.without-comment.2">2</a> I know that there are other literate programming practitioners out there,
but I must confess to never having tried it myself. Maybe some day.</p>
<h2>Restoring Features (2012-04-05)</h2>
<p>Couldn’t sleep, worked on blog instead.</p>
<p>I managed to get comments up with <a href="http://disqus.com/">Disqus</a> yesterday. Now I’ve got Jekyll
rendering posts using Emacs <code>muse-mode</code>, just like my old blogging system did.
Among other things, that means that I can use Emacs as my syntax-highlighting
engine again:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">with-starts?</span>
<span class="doc">"Does string s begin with the provided prefix?"</span>
{<span class="constant">:inline</span> (<span class="keyword">fn</span> [prefix s & to]
`(<span class="keyword">let</span> [<span class="preprocessor">^</span><span class="type">String</span> s# ~s, <span class="preprocessor">^</span><span class="type">String</span> prefix# ~prefix]
(<span class="preprocessor">.startsWith</span> s# prefix# ~@(<span class="keyword">when</span> (<span class="builtin">seq</span> to) [`(<span class="builtin">int</span> ~@to)]))))
<span class="constant">:inline-arities</span> #{2 3}}
([prefix s] (<span class="preprocessor">.startsWith</span> <span class="preprocessor">^</span><span class="type">String</span> s <span class="preprocessor">^</span><span class="type">String</span> prefix))
([prefix s to] (<span class="preprocessor">.startsWith</span> <span class="preprocessor">^</span><span class="type">String</span> s <span class="preprocessor">^</span><span class="type">String</span> prefix (<span class="builtin">int</span> to))))
</pre>
<p>And I once again have footnotes<a class="footref" href="#fn.restore-features.1">1</a> which jump over into the side margin. I had
to convert my old footnote-mangling code to jQuery from Prototype (yes, it was
<em>that</em> old). But hey – it works now!</p>
<p>And I even added back an <a href="/syndicate">Atom feed</a> and the <a href="/archive">archived posts</a> page.</p>
<p>I think that’s actually it. Blog once again fully armed and operational. Not
too shabby for a few hours of insomnia. Zzz...</p>
<p class="footnote"><a class="footnum" name="fn.restore-features.1" id="fn.restore-features.1">1</a> So yes, actually sidenotes when everything works properly.</p>
<h2>Back up; backup! (2012-04-04)</h2>
<p>My now <em>very</em> former hosting provider “lost” my blog VPS, and I lost everything on it. My automated backup process had apparently stopped working when I’d shuffled some paths around. Oops. Thankfully the Wayback machine archived my blog for me, so I’ll be able to get back most of it. My previous process and custom+creaky blog engine was too complicated anyway – let’s give a try to a static site (generated with Jekyll).</p>
<p>I’ll be restoring functionality and old blog posts as I go, but I wanted to get <em>something</em> back online. It’s felt odd not having a Web presence – like I was no longer really there on the Internet, or some sort of Internet ghost. Spooooooky.</p>
<p>Anyway, I’m glad to be back.</p>
<h2>Clojure `case' is not for you (2011-10-15)</h2>
<p>I lost the original version of this post. C’est la vie. The gist is that
Clojure’s standard <code>case</code> macro somewhat surprisingly (to me) does not support
arbitrary expressions as test conditions – only read-time constants. It does
this because its primary use, from the Clojure-implementation perspective, is to
support fast dispatch on symbol/keyword values.</p>
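<p>A quick demonstration of the surprise (the var here is purely illustrative):</p>
<pre class="example">(def two 2)
(case 2
  two :matched
  :no-match)
;;=> :no-match ; `two' is treated as the literal symbol, not its value
</pre>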
<p>Other people have blogged about this and provided their own less-surprising
implementations, including <a href="http://cemerick.com/2010/08/03/enhancing-clojures-case-to-evaluate-dispatch-values/">cemerick</a>.</p>
<p>I didn’t like any of the ones I found elsewhere, so here’s mine:</p>
<pre class="example">(<span class="keyword">defmacro</span> <span class="function-name">case-expr</span>
<span class="doc">"Like case, but only supports individual test expressions, which are
evaluated at macro-expansion time."</span>
[e & clauses]
`(<span class="keyword">case</span> ~e
~@(<span class="builtin">concat</span>
(<span class="builtin">mapcat</span> (<span class="keyword">fn</span> [[test result]]
[(<span class="builtin">eval</span> `(<span class="keyword">let</span> [test# ~test] test#)) result])
(<span class="builtin">partition</span> 2 clauses))
(<span class="keyword">when</span> (<span class="builtin">odd?</span> (<span class="builtin">count</span> clauses))
(<span class="builtin">list</span> (<span class="builtin">last</span> clauses))))))
</pre>
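<p>A hypothetical usage sketch – the test expressions are evaluated once, when
the macro expands:</p>
<pre class="example">(let [status 200]
  (case-expr status
    (+ 100 100) :ok
    404         :not-found
    :unknown))
;;=> :ok, because (+ 100 100) was already 200 at macro-expansion time
</pre>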
<p>Enjoy!</p>
<h2>Confuse the customer... and win? (2010-02-04)</h2>
<p>With <a href='http://whatever.scalzi.com/2010/02/01/all-the-many-ways-amazon-so-very-failed-the-weekend/'>Amazon.com not selling Macmillan books</a> at the moment, now might seem like a good time to go to the source and <a href='http://www.panmacmillan.com/Categories/EBooks/?SideNav=CategoryNav&SubjectID=55&Imprint='>buy ebooks directly from Macmillan</a>. And they even have multiple formats available: the “Adobe Digital Edition” format and the “Adobe eReader” format.</p>
<p>Wait, what?</p>
<p>According to their <a href='http://www.panmacmillan.com/Categories/Ebooks/displayPage.asp?PageTitle=Ebooks%20information%20and%20help'>Ebooks information and help page</a>, “Adobe eReader” books enable you “to read high-fidelity ebooks alongside other PDF files. Only this reader software displays ebooks with the pictures, graphics, and rich fonts you’ve come to expect from printed books.” While “Adobe Digital Editions” is “Adobe’s reader designed for eBooks” and “uses a format based on the Open Publishing Standard with the extension .epub,” but “ADE will also display your PDF files in a double-page, single page, or fit-to-width view — or you can specify your own custom fit.” Both formats have software download links, which both redirect to Adobe’s current Digital Editions page.</p>
<p>So. Er. I think “Adobe eReader” is PDF and “Adobe Digital Editions” is EPUB. Format proliferation is bad enough without making format identification more difficult than necessary.</p>
<h2>The vicious cycle of piracy? (2010-02-03)</h2>
<p>A few years ago I discovered that Wizards of the Coast had started selling PDF versions of pretty much the entire catalog of TSR-published original D&D, AD&D, and AD&D 2nd Ed material, all at quite reasonable prices. I’ve been fond of the <a href='http://en.wikipedia.org/wiki/Planescape'>Planescape</a> setting ever since I was first introduced to it, and I impulsively bought the majority of the AD&D 2nd Ed Planescape manuals. You know, in case I ever get suddenly transported back to the ’90s or something. I later lost those precious, fully-paid-for bits in a hard drive crash, and didn’t bother redownloading them again for reasons which seemed reasonable at the time. I mean, the vendor I bought them from will surely let me download them again whenever I decide to. What could possibly go wrong?</p>
<p>Fast forward to the present. Finally setting up a reasonable backup scheme jogged my memory of previously lost bits, and I decided to try downloading new copies of those RPG manuals. And… <a href='http://rpg.drivethrustuff.com/'>the vendor</a> still exists… I’m able to log in… they have my complete order history… they have download links!… and – no. Apparently WotC pulled the plug, stopping all e-book sales of both their current and out-of-print material, including re-downloads of already-sold titles.</p>
<p>Of course, a quick search turned up Rapidshare-hosted copies of all the books I’d purchased, which I felt no scruples about downloading. But chicken or egg – are the books so easily available because WotC removed them from legitimate channels? Or was the pull in the first place a response to widespread piracy? Either way, I don’t see how WotC is benefiting.</p>
<h2>Creative Reformation (2010-02-03)</h2>
<p>Faruk Ates responds to Mark Pilgrim’s <a href="http://diveintomark.org/archives/2010/01/29/tinkerers-sunset">Tinker’s Sunset</a> by trivializing tinkering
and claiming that the ease-to-use devices like the iPad are fostering a
<a href="http://web.archive.org/web/20100807231527/http://farukat.es/journal/2010/02/390-the-creative-revolution">Creative Revolution</a>:</p>
<blockquote>
<p class="quoted">The simple matter is that these guys are old, and they grew up in an age
where tinkering was the only possible course of action if you wanted to use
the latest and greatest technology to its fullest potential. The Mac, in
1984, shifted that paradigm of creativity and creation towards average
consumers a little. The iPhone and iPad are shifting it even further
towards consumers, away from the tinkerers of old, the small little “elite”
that excludes the vast majority of people.</p>
</blockquote>
<p>He may be right about fostering creativity, but he’s missing the point. Making
the iPad accessible to non-tinkerers and making it untinkerable are completely
orthogonal.</p>
<p>Imagine that the iPad worked exactly as Apple has already presented, but it also
had a “tinker” switch. When on, this switch allowed users to run applications
not signed by Apple, with appropriately dire warnings. Problem solved, without
impacting typical user experience.</p>
<p>“Tinkering” is easy to trivialize, but doing so ignores what its prevention
represents technically. What we’re talking about on the iPad is an impenetrable
cryptographic shield which gives Apple absolute control over what code is
allowed to run. Apple, not users, determines what applications are
appropriate. Apple is free to censor not only content which fails to meet their
technical standards, but also content which conflicts with their business
interests or which they deem “obscene.” No matter how light the shackles, on
the iPad (and iPhone) you are not free.
<p>On the flip side, like Pilgrim I do see “tinkering” as valuable in
itself. Software stacks are inherently more knowable than any other engineered
systems, and one can learn from them. Fully free systems (in Stallman’s sense)
are the most knowable and most instructive, by virtue of the source code for
every component being there for the asking. On a cryptographically shielded
platform like the iPad this is impossible: the system is unknowable, and there
is nothing to be learned, even for the “elite.”</p>
<h3><a href="http://blog.platypope.org/2010/1/7/solving-puzzles-with-computer-science">Solving puzzles with (computer) science</a> (2010-01-07)</h3>
<p>For Christmas my girlfriend gave me a series of increasingly difficult
wooden-block puzzles, yielding up each one only as I solved the previous. She’d
show me the assembled puzzle — to mock me, I assume — then dump it disassembled
into my lap<a class="footref" href="#fn.solving-puzzles-with-computer-science.1">1</a>. The first one was a quickly-solved freebie, but the remaining
three were pretty difficult. Difficult enough even that I decided to let
computers do the boring work and wrote programs to solve them. Ah, the joys of
being a software engineer!</p>
<p>And if you think this is “cheating,” I feel the final results from my “bonus
round” (at the end of this post) validate the approach.</p>
<p>Round 1 was the <a href="http://paxpuzzle.com/king-snake-medium-p-265.html">King Snake</a>. This puzzle is a 4x4x4 cube composed of 64 linearly
connected unit-cubes. The strand of unit-cubes is divided into 46 segments of
2-4 cubes each. Each segment shares a joint cube with its previous segment and
freely rotates to form any right angle with that segment. A naive estimate of
the problem space is greater than 1e27 (4 possible right angles raised to the
power of 46 segments). However, most moves eliminate 25-75% of the remaining problem space
by either running into already-filled space or leaving the allowed 4x4x4
grid. Counting on the constraints quickly reducing the problem space, I
initially solved this one with a pretty naive depth-first search written in
Python.</p>
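<p>For the curious, the naive search looks something like the following sketch
(a reconstruction for illustration, not the original program; the segment
lengths here are made up, and the real puzzle has 46 of them):</p>
<pre class="example">N = 4
DIRS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
SEGMENTS = [3, 3, 2, 4, 2]  # hypothetical segment lengths, in cubes

def solve(pos, filled, segments, prev_dir):
    """Fold the remaining segments into the NxNxN grid, depth-first.
    Returns the list of segment directions, or None if stuck."""
    if not segments:
        return []
    for d in DIRS:
        # Consecutive segments must meet at a right angle, so skip
        # directions parallel or anti-parallel to the previous segment.
        if prev_dir in (d, tuple(-c for c in d)):
            continue
        # The joint cube is shared, so a length-L segment adds L-1 new cubes.
        x, y, z = pos
        cubes = []
        for _ in range(segments[0] - 1):
            x, y, z = x + d[0], y + d[1], z + d[2]
            if not (0 <= x < N and 0 <= y < N and 0 <= z < N) or (x, y, z) in filled:
                break  # left the grid or hit filled space: prune this branch
            cubes.append((x, y, z))
        else:
            rest = solve((x, y, z), filled | set(cubes), segments[1:], d)
            if rest is not None:
                return [d] + rest
    return None

start = (0, 0, 0)
print(solve(start, {start}, SEGMENTS, None))
</pre>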
<p>The Python program took about 5 hours to run, and was designed to stop at the
first solution. But hey! — it worked. I later revisited this
puzzle with what I learned from solving the others, but I’ll get to that at the
end.</p>
<p>Round 2 was the <a href="http://www.creativecrafthouse.com/index.php?main_page=product_info&products_id=278">Shipper’s Dilemma Z</a>. This puzzle is a 5x5x5 cube composed of 25
identical, 5-unit-cube, vaguely Z-shaped pieces. For this one I did some
research, hoping something about tessellation would help. Alas, even though the
pieces are all identical, it appears that this is still an (NP-complete)
packing problem. There are 960 positions a piece could occupy within the space,
which for 25 pieces yields a naive complexity estimate of greater than
1e49<a class="footref" href="#fn.solving-puzzles-with-computer-science.2">2</a>. Ouch. It seemed much less likely in this case that the basic problem
constraints would help much, as a naive depth-first search would prune only a
few possibilities with each move. I wrote a simple first attempt in Python
anyway. It got absolutely nowhere, which led me to decide to try something
different, something “clever.” And thus down a rabbit-hole I went.</p>
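<p>As a quick sanity check on that estimate (my arithmetic, not part of the
original analysis): choosing 25 of the 960 possible placements, ignoring
overlap between them, gives</p>
<pre class="example">from math import comb

# Pieces are identical, so order doesn't matter: 960 choose 25.
print(f"{comb(960, 25):.2e}")  # => about 1.7e+49
</pre>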
<p>I’ll leave out most of the details, but I wasted a lot of holiday time trying
to solve this puzzle with something akin to dynamic programming. I’d build
groups of 3 pieces “attached” to a vertex, combine those into groups of 6
pieces attached to a quadrant, combine those into groups of 12 attached to a
side, then combine those into groups of 24 which (for solutions) would
(theoretically) form the full cube with one piece-shaped hole somewhere in the
middle. It was a terrible, terrible idea. The computational complexity of each
step varied wildly as I changed my methods for forming groups and what
assumptions I made about the properties of solution-participating groups. At
one point the execution-time was lagging just enough that I re-wrote the
solution-grinding code in C. Later the complexity landed just past what was
feasible to run at home, but within what I could run on <a href="http://aws.amazon.com/elasticmapreduce/">Amazon’s Elastic MapReduce</a>. I
did learn how to use Hadoop, EC2, and EMR, but — long story short — none of it
yielded a solution. I eventually climbed out of the rabbit-hole and went back
to a depth-first search, but this time with a much better conceptualization of
the problem.</p>
<p>Fast C primitives help<a class="footref" href="#fn.solving-puzzles-with-computer-science.3">3</a>, but the only real way to solve a problem of this sort
is by exponentially reducing the search space. The first step is to avoid
working on any “unsolvable” states. Many patterns of piece-placement leave gaps
which no subsequent piece can fill. I test for this by filling in a copy of the
board state with all the pieces which could fit, even if those pieces
themselves overlap. If the cube isn’t completely filled, then the initial board
state cannot lead to a solution.</p>
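<p>In code, the test looks something like this sketch, assuming board states
and piece placements are represented as bitmasks over the 125 cells of the
5x5x5 cube, with the 960 legal placements precomputed elsewhere:</p>
<pre class="example">FULL = (1 << 125) - 1  # all 125 cells filled

def feasible(state, placements):
    """Prune states with unfillable gaps. `placements` is the precomputed
    list of all 960 legal piece placements, one bitmask each."""
    cover = state
    for p in placements:
        if p & state == 0:  # piece still fits: no overlap with filled cells
            cover |= p      # overlap *between* candidate pieces is fine here
    return cover == FULL    # an uncoverable cell means a dead end
</pre>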
<p>Another obvious step for any sort of space-pattern problem is to eliminate all
rotations and reflections which result in other states in the same symmetry
group. For a cubical puzzle like this one, eliminating symmetries reduces the
state-space by a <a href="http://www.ams.org/featurecolumn/archive/cubes7.html">factor of 48</a>. In this case, I did so by initially calculating
all symmetrically unique states which have 8 pieces placed, one in each
corner. This also has the nice side-effect of splitting the search-space into
parallelizable segments, although that turned out to be unnecessary.</p>
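<p>Generating the 48 symmetries is pleasantly mechanical: they are just the 6
permutations of the coordinate axes crossed with the 8 patterns of axis
reflections. A sketch of the idea (mine, not the original code):</p>
<pre class="example">from itertools import permutations, product

def symmetries(n=5):
    """Yield the 48 symmetries of the n-cube's cell grid as functions on
    (x, y, z) cells: 6 axis permutations x 8 reflection patterns."""
    for perm in permutations(range(3)):
        for flips in product((False, True), repeat=3):
            def xform(cell, perm=perm, flips=flips):
                c = tuple(cell[i] for i in perm)
                return tuple((n - 1 - v) if f else v for v, f in zip(c, flips))
            yield xform

def canonical(cells, n=5):
    """Pick one representative of a state's symmetry class, so that
    symmetric duplicates compare equal and can be discarded."""
    return min((frozenset(f(c) for c in cells) for f in symmetries(n)),
               key=sorted)
</pre>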
<p>The final step necessary to explore all the “interesting” states in a
reasonable amount of time is to eliminate the exploration of duplicate
states. Just naively iterating over piece combinations will result in trying
both piece A then B and piece B then A, even though they result in exploring
the same board states. I iterated over a couple of approaches to this, but
eventually hit upon simply ensuring that each position in the puzzle is filled
in a fixed sequence. This minimizes the number of options available for each
placement and doesn’t require any explicit book-keeping to short-circuit
previously-explored board states.</p>
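<p>Concretely (again as a sketch, with the same bitmask representation as
above, and assuming <code>by_cell[i]</code> holds the precomputed placements
covering cell <code>i</code>): always branch on the lowest-numbered empty
cell, so each reachable board state has exactly one placement order.</p>
<pre class="example">def search(state, path, by_cell, solutions, full=(1 << 125) - 1):
    """Depth-first search that always fills the lowest empty cell next."""
    if state == full:
        solutions.append(list(path))
        return
    cell = ((~state) & (state + 1)).bit_length() - 1  # index of lowest zero bit
    for p in by_cell[cell]:
        if p & state == 0:  # placement doesn't collide with filled cells
            path.append(p)
            search(state | p, path, by_cell, solutions, full)
            path.pop()
</pre>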
<p>Final running-time: 2.5 minutes to generate <a href="http://files.platypope.org/shipz-solution.txt">all four solutions</a>.</p>
<p>Round 3 was the <a href="http://paxpuzzle.com/ramu-octahedron-p-347.html">RAMU OCTAHEDRON</a><a class="footref" href="#fn.solving-puzzles-with-computer-science.4">4</a>. This one is kind of an irregular
decahedron<a class="footref" href="#fn.solving-puzzles-with-computer-science.5">5</a> represented as a 5x5x5 cube with the 4 unit-cubes at each vertex
removed. It splits into 8 very irregularly-shaped pieces. There are also two
small wooden spheres which occupy unfilled space within the cube and “lock” it
by preventing motion of the “key” piece unless the spheres are shifted into
particular positions. The site calls the Ramu Octahedron their “most difficult
puzzle” and claims that only one person has solved it without reference to the
solution. The difficulties of this puzzle are three-fold: the irregularity of
the pieces makes conceptualizing their spatial placement difficult; many of the
pieces can only be placed by combinations of separate “insertion” then
“locking” motions; and once the puzzle is assembled, one must determine how to
maneuver the spheres within the unfilled internal space to allow
re-disassembly.</p>
<p>Fortunately the piece-irregularity poses little difficulty for a
computer. After laboriously entering the shape of each piece, I was able to
re-use code I’d written for the previous puzzle to generate all the interesting
piece rotations and translations<a class="footref" href="#fn.solving-puzzles-with-computer-science.6">6</a> and their various combinations. Once that
gave me the solution’s spatial arrangement, it required a bit of trial-and-error
to figure out a working piece order and insert/lock sequences, but I was able
to manage it by hand. Figuring out the necessary unlocking rotations was also
easy enough, at least after I added a map of the unfilled internal space to the
solution.</p>
<p>Final running-time: 1 second to generate the <a href="http://files.platypope.org/ramu-solution.txt">single solution</a>.</p>
<p>Bonus round, back to the King Snake. I decided to come back to this puzzle with
what I’d learned from solving the others and try to generate all its solutions
in a reasonable amount of time. I didn’t have any new ideas about how to frame
the puzzle, but I did have some new implementation tools: pre-generating and
indexing by position all legal piece arrangements, and representing puzzle
states as bit-fields. These two together allow for a blazing-fast solution.</p>
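<p>Sketched concretely (an illustration of the structure, not the original
code): each of the 64 cells maps to one bit, so a whole board state fits in a
single integer, and every legal segment placement becomes a precomputed
mask.</p>
<pre class="example">N = 4
DIRS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]

def bit(x, y, z):
    return 1 << (x + N * (y + N * z))  # one bit per cell of the 4x4x4 grid

def segment_mask(start, d, length):
    """Mask of the length-1 cubes a segment adds past its joint cube, plus
    the segment's end cell; None if the segment leaves the grid."""
    mask, (x, y, z) = 0, start
    for _ in range(length - 1):
        x, y, z = x + d[0], y + d[1], z + d[2]
        if not (0 <= x < N and 0 <= y < N and 0 <= z < N):
            return None
        mask |= bit(x, y, z)
    return mask, (x, y, z)

# Index every legal placement by (joint cell, direction, length). The DFS
# then reduces to integer tests: `mask & filled == 0` means the segment fits.
CELLS = [(x, y, z) for x in range(N) for y in range(N) for z in range(N)]
PLACEMENTS = {(s, d, l): m
              for s in CELLS for d in DIRS for l in (2, 3, 4)
              for m in [segment_mask(s, d, l)] if m is not None}
</pre>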
<p>Final running-time: 30 seconds to generate all <a href="http://files.platypope.org/snake-solution.txt">four solutions</a><a class="footref" href="#fn.solving-puzzles-with-computer-science.7">7</a>.</p>
<p>If you own this puzzle, you may at this point be thinking “Wait, what?” The
marketing copy and provided instructions for the King Snake claim there are
only two solutions. But nope, four. The two extra solutions are very similar
to each other, but are quite distinct from the two “official” solutions, and
none are simply symmetries of the others.</p>
<p>“Cheating” — hah!</p>
<p><a href="https://github.com/llasram/puzzles">Puzzle-solving code</a> available for your edification.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.1" id="fn.solving-puzzles-with-computer-science.1">1</a> Best Christmas present ever, seriously.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.2" id="fn.solving-puzzles-with-computer-science.2">2</a> For comparison, chess apparently has between 1e43 and 1e50 legal board states.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.3" id="fn.solving-puzzles-with-computer-science.3">3</a> Most actions on a the puzzle-state are just a few bitwise operations.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.4" id="fn.solving-puzzles-with-computer-science.4">4</a> Which I really like putting in all caps, for some reason.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.5" id="fn.solving-puzzles-with-computer-science.5">5</a> Yeah, the name confuses me too.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.6" id="fn.solving-puzzles-with-computer-science.6">6</a> But not reflections, what with three dimensional limitation and all.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.7" id="fn.solving-puzzles-with-computer-science.7">7</a> FYI, I number from the opposite end than the provided solutions.</p>
<h3><a href="http://blog.platypope.org/2009/10/5/language-implementation-patterns">Language Implementation Patterns</a> (2009-10-05)</h3>
<p>Possible lesson: don’t get upset with a book for not being a completely
different book the author hasn’t written yet, but will later.</p>
<p>A few months ago I bought <em><a href="http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference">The Definitive ANTLR Reference</a></em> by Terence
Parr<a class="footref" href="#fn.language-implementation-patterns.1">1</a>. The book’s subtitle is “Building Domain-Specific Languages,” so I was
rather disappointed when I found out that it has absolutely no information on
building domain-specific languages — or any other applications — using
ANTLR. It’s a great ANTLR reference, but doesn’t have any examples of using
ANTLR to do anything other than just parse (and emit and handle parsing
errors). Without any practical examples, it took a bit of head-scratching on my
part to realize how to even e.g. make my Z-code assembler available to my
ANTLR-generated AST walker.</p>
<p>But today I learned that Terence Parr has also written another book titled
<em><a href="http://www.pragprog.com/titles/tpdsl/language-implementation-patterns">Language Implementation Patterns</a></em>. I haven’t had a chance to read much of it
yet, but it looks like exactly what I wanted in the first place — a guide to
actually writing various sorts of applications which involve parsing languages,
mostly using ANTLR-generated parsers. This book has the oddly-similar subtitle
“Create Your Own Domain-Specific and General Programming Languages,” which
seems rather more apt here. In any case, I’m looking forward to it.</p>
<p>And on a side note, Parr’s publisher, <a href="http://www.pragprog.com/">the Pragmatic Bookshelf</a>, kicks ass. Their
e-book deployment is even better than O’Reilly’s. While both provide PDF,
Mobipocket, and EPUB versions for perpetual re-download, the PragProgs also (a)
offer almost their entire catalog as e-books, and (b) make e-book editions
available as “<a href="http://www.pragprog.com/frequently-asked-questions/beta-books">beta books</a>” several months prior to the print release date. In
fact, Language Implementation Patterns is currently only available in an e-book
beta version. But available it is, for delicious pre-final-draft reading by the
adventurous.</p>
<p class="footnote"><a class="footnum" name="fn.language-implementation-patterns.1" id="fn.language-implementation-patterns.1">1</a> Primary author of ANTLR, so you can see the draw.</p>
<h3><a href="http://blog.platypope.org/2009/9/29/zmforth">ZmForth!</a> (2009-09-29)</h3>
<p>Some time ago I decided to fix one of the holes in my software engineering
knowledge and learn about writing compilers. My undergraduate curriculum
provided an overview of parsing and some general programming language design
issues, but nothing at all on code generation or optimization. I bought a few
books on compilers, started familiarizing myself with <a href="http://www.antlr.org/">ANTLR</a>, then became
completely and utterly side-tracked by the programming language Forth.</p>
<p>Forth is a strange little language. Chances are that unless you’re an astronomer
or have mucked around with Sun’s OpenBoot, you’ve never even heard of it; or if
you have heard of it, you’ve never used it. Forth saw its heyday during the
80’s and has largely faded away with the 8- and 16-bit processors to which it
was most suited. But boy was it suited to those processors — if I ever needed
to develop code to run on a 16-bit processor with less than 128k of RAM, Forth
is probably the language I’d turn to.</p>
<p>Forth’s key property is that it’s a purely stack-based language. Forth
functions — or “words,” as Forth calls them — do not take arguments explicitly,
but instead implicitly via a system-wide parameter stack. There aren’t any
local variables — instead the language provides a rich set of
stack-manipulation primitives like <code>DUP</code><a class="footref" href="#fn.zmforth.1">1</a>, <code>SWAP</code><a class="footref" href="#fn.zmforth.2">2</a>, and <code>ROT</code><a class="footref" href="#fn.zmforth.3">3</a> to facilitate
juggling the top handful of stack items to provide the correct parameters to
each function call. Even control structures are implemented with Forth-level
function calls!<a class="footref" href="#fn.zmforth.4">4</a> For example, here’s my implementation of <code>IF</code> / <code>THEN</code>:</p>
<pre class="example"><span class="keyword">: </span><span class="function-name">if </span>compile ?branch here <span class="constant">0 </span>, <span class="keyword">; immediate</span>
<span class="keyword">: </span><span class="function-name">then </span>here swap ! <span class="keyword">; immediate</span>
</pre>
<p>This means that a Forth program consists of nothing more than a list of
functions to call in sequence, a feature which the language exploits in two
ways.</p>
<p>First, to simplify syntax. Just as a Forth program abstractly consists of
nothing more than a list of function “words,” the source code of a Forth
program consists of just lists of those words’ human-readable names separated
by spaces. Adding the ability to switch the Forth system between executing each
word as parsed and appending it into a new definition allows the system to act
as both interpreter and compiler in one. That’s a full interactive interpreter
and compiler in under 8k.</p>
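<p>A toy model of that dual-mode loop, in Python purely for illustration (this
shows the state-switching idea, not ZmForth’s actual implementation):</p>
<pre class="example">def outer(source, words, stack):
    """Toy Forth text interpreter. `words` maps names to (fn, immediate?)
    pairs, where each fn takes (stack, state). One loop acts as both
    interpreter and compiler, switched by state['compiling']."""
    state = {"compiling": False, "current": []}
    for name in source.split():
        fn, immediate = words[name]
        if state["compiling"] and not immediate:
            state["current"].append(fn)  # compiling: just record the call
        else:
            fn(stack, state)             # interpreting, or an immediate word
</pre>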
<p>Second, to simplify implementation. There are “traditional” optimizing
compilers for Forth, but that isn’t the most common or most obvious approach to
Forth compilation, especially in-target. More frequently, Forth systems will
use an approach known as “<a href="http://en.wikipedia.org/wiki/Threaded_code">threaded code</a>.” This has nothing to do with
multithreaded execution, but instead refers to the technique of generating code
which consists only of calls to other functions. The more specific
techniques of so-called “direct” and “indirect” threaded code drop even the
calls themselves<a class="footref" href="#fn.zmforth.5">5</a> and instead encode only the function addresses. A list of
addresses can’t be executed directly, so this necessitates an “inner interpreter”
to load and call each in sequence, but this interpreter need only be a few
machine instructions long, and on some architectures imposes no additional
overhead over directly-expressed function calls. In a Forth system using this
approach, compiling a function call into a new definition literally consists of
just appending the address of the function to call to the end of the definition
in progress. It just can’t get any simpler than that.</p>
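<p>Here is what an indirect-threaded inner interpreter looks like, again
modeled in Python for illustration (my sketch, not ZmForth’s actual Z-code): a
compiled word is just a list whose slots hold either primitives or other word
bodies.</p>
<pre class="example">def inner(word, stack):
    """Run a threaded-code word: walk its list of "addresses", executing
    primitives directly and "calling" secondaries via a return stack."""
    ip, rstack = 0, []
    while True:
        if ip == len(word):           # end of word body: return to caller
            if not rstack:
                return
            word, ip = rstack.pop()
            continue
        target, ip = word[ip], ip + 1
        if callable(target):          # primitive: a few "machine instructions"
            target(stack)
        else:                         # secondary: push return info, jump in
            rstack.append((word, ip))
            word, ip = target, 0

dup = lambda s: s.append(s[-1])
star = lambda s: s.append(s.pop() * s.pop())
square = [dup, star]        # : square dup * ;
cube = [dup, square, star]  # : cube dup square * ;

s = [7]
inner(cube, s)
print(s)  # => [343]
</pre>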
<p>And because it’s all so simple, you can implement it yourself! Or rather, I can
lose the thin thread of sanity and decide to implement one myself. There are
existing F/OSS Forth systems out there for pretty much every environment under
the sun, which provide plenty of examples to turn to for inspiration, but also
mean that the whole Forth thing is pretty well done. This needn’t be a
hindrance to implementation-as-a-learning-exercise, but I wanted to contribute
something new. I happened to think of the <a href="http://en.wikipedia.org/wiki/Z-machine">Z-machine</a>, and lo-and-behold,
although there is a <a href="http://www.ifwiki.org/index.php/Lists_and_Lists">Z-machine Scheme implementation</a>, there was no Z-machine
Forth.</p>
<p>Until now! Ladies and gentlemen, I present to you <a href="https://github.com/llasram/zmforth">ZmForth!</a>, an ANS Forth
implementation for the Z-machine. It passes the woefully incomplete ANS Forth
test suite I found and runs existing Forth programs that don’t depend on file
I/O. It plays Tetris, performs 32-bit arithmetic and unsigned comparisons, and
provides exceptions, a compiler, and a de-compiler. All of this in a 16-bit
virtual machine with only 64k of addressable memory. The whole project was one
giant rabbit hole, but I did have to write <a href="https://github.com/llasram/zmforth/blob/master/zas.py">my own Z-code assembler</a> to provide the
directives I wanted, which is at least a step in the direction of what I initially
planned to work on.</p>
<p>I had a good time working on it, and I hope someone else may be at least half as
entertained by it as I was.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.1" id="fn.zmforth.1">1</a> Duplicate the stop stack item.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.2" id="fn.zmforth.2">2</a> Swap the top two stack items.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.3" id="fn.zmforth.3">3</a> Rotate the third stack item to the top.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.4" id="fn.zmforth.4">4</a> In fact, in my system all the control structures are implemented in Forth.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.5" id="fn.zmforth.5">5</a> The call instructions, that is.</p>
<h3><a href="http://blog.platypope.org/2009/7/2/creator-vs-reader-and-the-adobe-epub-monopoly">Creator vs. reader and the Adobe EPUB monopoly</a> (2009-07-02)</h3>
<p>I’ve been doing a fair bit more reading on my Android phone, although recently
I’ve switched from <a href="http://www.fbreader.org/FBReaderJ/">FBReaderJ</a> to <a href="http://www.aldiko.com/">Aldiko</a>. Each new release of FBReaderJ has
gotten better, but Aldiko includes an actual CSS-based renderer<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.1">1</a> and
presently provides a much smoother reading experience<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.2">2</a>. I feel a little
guilty using a piece of commercial software when a free-as-in-freedom solution
exists, but not yet guilty enough to try hacking on FBReaderJ.</p>
<p>Guilt aside, reading more on a smaller screen has had me thinking again about
the tension between book creator and reader in e-book formatting and
layout. EPUB on the mobile phone highlights this tension more than e-ink
devices do, if for no other reason than that a phone screen resembles a
book page even less than an e-ink screen does. One conclusion I’ve come to is that it’s
somewhat unfortunate that Adobe has thus far been the biggest contributor to
EPUB as a commercial e-book format.</p>
<p>On the one hand, someone had to do it, and it’s good that someone has done
it. Even the DRM thing, to some degree – most publishers aren’t ready to do
without it, and at least Adobe had the good grace to use an
<a href="http://i-u2665-cabbages.blogspot.com/2009/02/circumventing-adobe-adept-drm-for-epub.html">easily circumventable system</a>. The major concern I have is with some aspects of
the apparent mindset behind Adobe Digital Editions.</p>
<p>At present, Digital Editions is the EPUB viewer to beat – I have no idea what
the actual usage figures look like, but it’s the only viewer one can legally
use for all those commercially-sold Adobe-DRMed EPUB books, so one has to
imagine that DE commands the lion’s share of the market. Adobe represents
Digital Editions as being more than just an EPUB viewer. The advertising copy
on <a href="http://www.adobe.com/products/digitaleditions/">the DE web site</a> touts it as “offer[ing] an engaging way to view and manage
eBooks and other digital publications.” To this end, DE supports not only EPUB,
but also the document format much more central to Adobe’s business – PDF. And
because PDF is so much more important to Adobe, DE caters to PDF at the expense
of EPUB.</p>
<p>It’s no surprise to the average e-book enthusiast that PDF’s fixed-page nature
makes it a poor e-book format. The most obvious reason is that PDF files can’t
be cleanly reflowed to alternative page sizes. A perhaps less obvious corollary
is that PDF leaves little room for user control of basic formatting parameters
such as font size, line height, text alignment, and paragraph marking. But via
one or more chains of causality, because PDF rendering does not allow user
control of these properties, Adobe DE doesn’t allow setting them for EPUB
either. This yields an EPUB viewer where the reader has control over only the
font size and page size. And even then only partial control! – DE will not
rescale any size a book specifies as an absolute size, and DE allows books to
provide “page template” files which control how the available screen area is
divided into text regions and body columns.</p>
<p>The most generous interpretation of this decision is interface consistency. As
long as Adobe is attempting to present PDF as an e-book format and provide a
viewer which handles “digital publications” regardless of format, it only
complicates that viewer’s interface to provide options which apply to some
formats but not others. Somewhat less charitably, Adobe’s focus on PDF may have
led to an unconscious bias toward creator control of document formatting, to
the extent of perhaps not even considering placing control beyond font size
(a.k.a. “zoom”) in the readers’ hands. And way over on the conspiracy-theory
side of the fence, perhaps it represents a conscious decision on the part of
Adobe to limit the usefulness (and adoption) of EPUB by making it only a small
improvement over PDF even for the documents most suited to reflowable
formatting.</p>
<p>So for whatever reason, the most popular EPUB viewer on the market has
limitations which severely restrict the usefulness of the format. But EPUB is
an open standard, which means other, better viewers are free to compete with
Digital Editions and supplant it, right? Except they aren’t really, because of
DRM.</p>
<p>As I mentioned above, it seems that most publishers are not yet ready to do
without DRM, despite the lesson of the music industry. The overwhelming
majority of commercially-sold books are encumbered with DRM, and all of the
DRM-encumbered EPUB books sold use Adobe’s ADEPT DRM. It would be technically
possible for a competitor to begin offering a different EPUB DRM scheme, but I
can only imagine the degree of confusion mutually incompatible “EPUB” books
would cause among average consumers. So any successful EPUB viewer device or
application needs to license the ADEPT DRM technology from Adobe.</p>
<p>To facilitate this, Adobe has begun offering the “<a href="http://www.adobe.com/devnet/readermobile/">Adobe Reader Mobile 9 SDK</a>.”
The SDK is available by license agreement only<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.4">4</a>, so we can but speculate from
the marketing copy on the capabilities and interfaces it provides. The
“features” list in the SDK FAQ focuses entirely on features of the “Reader
Mobile document rendering engine,” suggesting that the SDK provides primarily
(or only) a rendering engine. If this is the case, then Adobe is not expecting – or
potentially allowing – other vendors to write competing ADEPT-compatible EPUB
renderers. Instead, all the available and announced EPUB reader apps/devices
using the Adobe SDK<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.5">5</a> will simply repackage the Adobe renderer and the paucity
of options for user control it provides.</p>
<p>One example in support of this theory is Amazon’s PDF support in the Kindle
DX. Although not widely advertised, <a href="http://blogs.adobe.com/billmccoy/2009/05/amazon_others_l.html">Amazon is apparently using the Adobe SDK</a>,
just integrating only the PDF renderer. Notably missing in the DX’s PDF support
vs. all the Kindles’ Mobipocket support is the ability to add annotations to
documents. It seems to me that this would be a “must have” feature, not only
for parity with Mobipocket support, but also for the target market as a device
for textbooks and technical documents. To me the most obvious explanation for
this feature’s lack is that there isn’t an easy way to add it while still using
the SDK to allow rendering of DRMed PDFs. There are plenty of F/OSS PDF renderers
to which Amazon could have (comparatively) easily added annotation support,
which suggests that the Adobe SDK’s DRM support is an implicit part of the
included renderer, and that SDK licensees cannot use the SDK to read DRMed
documents independently of the renderer.</p>
<p>Another interesting angle is to compare the Adobe approach with how other book
formats/viewers handle the creator-reader tension.</p>
<p>The desktop version of MSReader allows setting only the font size, but a
significant number of users seem to regard it as still the best desktop e-book
viewer available. A major component of this seems to be very well-chosen
defaults for properties like line-height, and the infrequency with which books
alter those properties. I have seen much more mixed reactions to the Mobile
version, perhaps because on the smaller screens of mobile devices, tuning the
line-height, margins, etc. to an individually comfortable size is much more
important. Perhaps a commenter could fill me in?</p>
<p>The various Mobipocket viewers support differing assortments of user-controlled
properties. Most support setting font-size, line-height, and paragraph
alignment. Interestingly, Mobipocket’s treatment of these properties
demonstrates all three possible resolutions of the creator-reader formatting
tension. Line-height may be set by the reader, but not by the book creator –
the format simply provides no way to specify it. The font-size may be set by
the book creator, but only in terms of a size relative to the reader-selected
base size. And paragraph alignment may be specified by either the book creator
or reader, but the creator’s setting overrides the reader’s when specified.</p>
<p>Correspondingly, paragraph alignment is the subject of one of Mobipocket’s most
forceful formatting recommendations: “alignment must NOT be set if it is not
strictly needed.” In contrast, the EPUB specification documents contain little
resembling formatting guidelines (beyond an admonition against using absolute
positioning). The one purely formatting recommendation in Adobe’s “EPUB Best
Practices Guide” is “use spacing that looks more like a book,” suggesting
including CSS rules to eliminate default space around block-level elements. It
does contain some sensible recommendations e.g. against using tags in most
contexts, but otherwise the “Guide” documents Adobe’s EPUB extensions and DE’s
quirks more than actual “best practices.” Which means that EPUB combines the
most extensive e-book formatting capabilities with the fewest guidelines for
producing actually readable books. And this is something of a challenge for
authors of EPUB viewers.</p>
<p>Coming full circle to the beginning of this post, the Aldiko EPUB viewer does
allow the user to set some basic formatting properties, including margins,
font, font-size, and line-height<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.6">6</a>. The first three work seamlessly, but
line-height somewhat less so. Aldiko handles books which don’t specify their
own line-height fine, but any book-specified line-height “wins.” Which isn’t
intended as a slight on Aldiko – this is a difficult problem to solve.</p>
<p>Something like page margin can really only be set via CSS in one “most
sensible” way<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.7">7</a>, and thus is easy to override coherently and consistently in a
viewer. Font-size is potentially trickier, but the most obvious ways of
specifying font sizes (relative sizes and the CSS named absolute sizes) make a
simple solution fairly straightforward. Properties like line-height
unfortunately lack such a solution – all the allowed relative values for the
CSS <code>line-height</code> property are interpreted as relative to the <code>font-size</code>, not the
<code>line-height</code> of the parent element. This means that the <code>font-size</code> solution of
“change the base and respect subsequent relative changes” doesn’t work for
<code>line-height</code>. Without doing some sort of layout analysis, all a viewer can do is
either ignore all book-specified line heights or respect all book-specified
line heights.</p>
<p>Solution? I’m not sure. One possible solution would be for book producers and
viewer authors to agree on guidelines which allow viewers to consistently
override some set of formatting properties. Most of these could be fairly
simple, like Mobipocket’s “alignment must not be set” rule. Another solution
would be for e-book reader apps to do some sort of pre-rendering layout
analysis which allows them to automatically produce a per-book user
stylesheet. I like hands-off technological solutions, but I’m not sure how
feasible that will be on mobile devices.</p>
<p>Other ideas?</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.1" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.1">1</a> To my surprise, a good enough one to handle the <code>max-width</code> property on
images.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.2" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.2">2</a> Quite literally, in the case of the page-turn animation.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.4" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.4">4</a> Probably to stop people from reverse-engineering the DRM system.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.5" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.5">5</a> The ones I’m currently aware of: the Sony Reader, Lexcycle Stanza, the
Bookeen Cybook, and the Elonex ebook.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.6" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.6">6</a> Not paragraph alignment yet, but via e-mail the author has said it’ll
probably be added soon.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.7" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.7">7</a> One could set left and right margins on every block level element, but
hopefully no one actually does that.</p>