<h1>platypope.org / blog / syndicate</h1>
<p>Marshall Bockrath-Vandegrift (llasram@gmail.com), <a href="http://blog.platypope.org/">blog.platypope.org</a></p>
<h2>Interactive Hadoop with Parkour (2014-02-08)</h2>
<p><a href="https://github.com/damballa/parkour">Parkour</a> is a Clojure library for writing Hadoop MapReduce programs. The most
recent release of Parkour<a class="footref" href="#fn.interactive-hadoop-with-parkour.1">1</a> includes new features designed to integrate Parkour
and Hadoop with the typical REPL-based Clojure workflow. Parkour can now lend a
Clojure REPL most of the interactive Hadoop features of the Pig or Hive shells,
but backed by the full power of the Clojure programming language.</p>
<p>This post will walk through setting up a Clojure REPL connected to a live Hadoop
cluster (running via Amazon’s EMR) and interactively developing a MapReduce
program to run on the cluster.</p>
<h3>Problem</h3>
<p class="first">We need a problem to solve with MapReduce. Treading new data-analysis ground
isn’t the goal here, so for this post we’re just going to write a Parkour
version of a program to identify <a href="http://aws.amazon.com/articles/5249664154115844">decade-wise trending terms</a> in the
<a href="http://aws.amazon.com/datasets/8172056142375670">Google Books n-grams corpus</a>. To make things a bit interesting, we will make an
attempt at improving on the original algorithm to include terms which first
appear in a given decade.</p>
<h3>Preliminaries</h3>
<p class="first">This post assumes you either already have fully-configured access to a local
Hadoop cluster or know how to launch a cluster on Amazon EMR.</p>
<p>If launching a cluster on EMR, you’ll need to run your REPL process such that
the local system Hadoop version matches the EMR cluster Hadoop version and has
network access to the cluster services. The easiest way to do this is to run
everything on the EMR cluster master node. This is less convenient for actual
development than other approaches (e.g. configuring Hadoop to use SOCKS), but
involves less setup, and so is what we’ll assume for this post.</p>
<p>Parkour supports the current stable (plus EMR-supported) versions of both Hadoop
1 and Hadoop 2. For this post’s EMR cluster we’ll be using Amazon Hadoop 2.2.0,
but Amazon Hadoop 1.0.2 should work just as well.</p>
<h3>Create a project</h3>
<p class="first">First, install <a href="http://leiningen.org/">Leiningen</a><a class="footref" href="#fn.interactive-hadoop-with-parkour.2">2</a> and create a new project:</p>
<pre class="example">$ lein new app trending-terms
Generating a project called trending-terms based on the 'app' template.
</pre>
<p>Then update the project file to include Parkour and our version of Hadoop:</p>
<pre class="example">(<span class="keyword">defproject</span> <span class="function-name">trending-terms</span> <span class="string">"0.1.0-SNAPSHOT"</span>
<span class="constant">:description</span> <span class="string">"Decade-wise trending terms in the Google Books n-grams corpus."</span>
<span class="constant">:url</span> <span class="string">"http://github.com/llasram/trending-terms"</span>
<span class="constant">:license</span> {<span class="constant">:name</span> <span class="string">"Eclipse Public License"</span>
<span class="constant">:url</span> <span class="string">"http://www.eclipse.org/legal/epl-v10.html"</span>}
<span class="constant">:global-vars</span> {*warn-on-reflection* true}
<span class="constant">:dependencies</span> [[org.clojure/clojure <span class="string">"1.5.1"</span>]
[com.damballa/parkour <span class="string">"0.5.4"</span>]
[org.apache.avro/avro <span class="string">"1.7.5"</span>]
[org.apache.avro/avro-mapred <span class="string">"1.7.5"</span>
<span class="constant">:classifier</span> <span class="string">"hadoop2"</span>]
[transduce/transduce <span class="string">"0.1.1"</span>]]
<span class="constant">:exclusions</span> [org.apache.hadoop/hadoop-core
org.apache.hadoop/hadoop-common
org.apache.hadoop/hadoop-hdfs
org.slf4j/slf4j-api org.slf4j/slf4j-log4j12 log4j
org.apache.avro/avro
org.apache.avro/avro-mapred
org.apache.avro/avro-ipc]
<span class="constant">:profiles</span> {<span class="constant">:provided</span>
{<span class="constant">:dependencies</span> [[org.apache.hadoop/hadoop-client <span class="string">"2.2.0"</span>]
[org.apache.hadoop/hadoop-common <span class="string">"2.2.0"</span>]
[org.slf4j/slf4j-api <span class="string">"1.6.1"</span>]
[org.slf4j/slf4j-log4j12 <span class="string">"1.6.1"</span>]
[log4j <span class="string">"1.2.17"</span>]]}
<span class="constant">:test</span> {<span class="constant">:resource-paths</span> [<span class="string">"test-resources"</span>]}
<span class="constant">:aot</span> {<span class="constant">:aot</span> <span class="constant">:all</span>, <span class="constant">:compile-path</span> <span class="string">"target/aot/classes"</span>}
<span class="constant">:uberjar</span> [<span class="constant">:aot</span>]
<span class="constant">:jobjar</span> [<span class="constant">:aot</span>]})
</pre>
<p>There’s currently some incidental complexity to the project file (the
<code>:exclusions</code> and logging-related dependencies), but just roll with it for now.</p>
<h3>Launch a REPL</h3>
<p class="first">In order to launch a cluster-connected REPL, we’ll want to use the
<a href="https://github.com/llasram/lein-hadoop-cluster">lein-hadoop-cluster</a> Leiningen plugin. We’ll also want the <a href="https://github.com/pallet/alembic">Alembic</a> library for
some of Parkour’s REPL support functionality. We could add these directly to
the project file, but they’re pretty orthogonal to any given individual project,
so we’ll just add them to the <code>:user</code> profile in our <code>~/.lein/profiles.clj</code>:</p>
<pre class="example">{<span class="constant">:user</span>
{<span class="constant">:plugins</span> [[lein-hadoop-cluster <span class="string">"0.1.2"</span>]]
<span class="constant">:dependencies</span> [[alembic <span class="string">"0.2.1"</span>]]}}
</pre>
<p>With those changes made, we can actually launch the REPL from within the project
directory:</p>
<pre class="example">$ lein hadoop-repl
</pre>
<p>Then (optional, but suggested) we can connect to the REPL process from our editor.</p>
<h3>Examine the data</h3>
<p class="first">Let’s get started writing code, in <code>src/trending_terms/core.clj</code>. First, the
namespace preliminaries:</p>
<pre class="example">(<span class="keyword">ns</span> trending-terms.core
(<span class="constant">:require</span> [clojure.string <span class="constant">:as</span> str]
[clojure.core.reducers <span class="constant">:as</span> r]
[transduce.reducers <span class="constant">:as</span> tr]
[abracad.avro <span class="constant">:as</span> avro]
[parkour (conf <span class="constant">:as</span> conf) (fs <span class="constant">:as</span> fs) (mapreduce <span class="constant">:as</span> mr)
, (graph <span class="constant">:as</span> pg) (reducers <span class="constant">:as</span> pr)]
[parkour.io (seqf <span class="constant">:as</span> seqf) (avro <span class="constant">:as</span> mra) (dux <span class="constant">:as</span> dux)
, (sample <span class="constant">:as</span> sample)]))
</pre>
<p>Then we can define some functions to access the n-gram corpus:</p>
<pre class="example">(<span class="keyword">def</span> <span class="function-name">ngram-base-url</span>
<span class="string">"Base URL for Google Books n-gram dataset."</span>
<span class="string">"s3://datasets.elasticmapreduce/ngrams/books/20090715"</span>)
(<span class="keyword">defn</span> <span class="function-name">ngram-url</span>
<span class="doc">"URL for Google Books `n`-gram dataset."</span>
[n] (fs/uri ngram-base-url <span class="string">"eng-us-all"</span> (<span class="builtin">str</span> n <span class="string">"gram"</span>) <span class="string">"data"</span>))
(<span class="keyword">defn</span> <span class="function-name">ngram-dseq</span>
<span class="doc">"Distributed sequence for Google Books `n`-grams."</span>
[n] (seqf/dseq (ngram-url n)))
</pre>
<p>With these functions in hand, we can start exploring the data:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (r/take 2) (<span class="builtin">into</span> []))
[[1 <span class="string">"#\t1584\t6\t6\t1"</span>] [2 <span class="string">"#\t1596\t1\t1\t1"</span>]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq)
(r/take 2) (<span class="builtin">into</span> []))
[[265721405 <span class="string">"APPROPRIATE\t1999\t492\t470\t325"</span>]
[265721406 <span class="string">"APPROPRIATE\t2000\t793\t723\t375"</span>]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 2})
(r/take 2) (<span class="builtin">into</span> []))
[[141968715 <span class="string">"poseen\t2007\t10\t10\t8"</span>]
[141968716 <span class="string">"poseen\t2008\t13\t13\t10"</span>]]
</pre>
<p>The cluster-connected REPL gives us direct live access to the actual data we
want to analyze. The <code>sample/dseq</code> function allows us to select small samples
from that data – as random as Hadoop can make them while remaining efficient.</p>
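<p>The options map lets us pin the sample down. For example – anticipating the
mixed-mode run later in this post – a fixed <code>:seed</code> makes the sample reproducible,
and <code>:splits</code> caps the number of input splits it draws from:</p>
<pre class="example">;; A reproducible sample drawn from at most 20 input splits.
(->> (ngram-dseq 1)
     (sample/dseq {:seed 1, :splits 20})
     (r/take 2) (into []))
</pre>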
<h3>Write a function</h3>
<p class="first">With direct access to the raw records our jobs get from Hadoop, we can easily
begin writing functions to operate on those records:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">parse-record</span>
<span class="doc">"Parse text-line 1-gram record `rec` into ((gram, decade), n) tuple."</span>
[rec]
(<span class="keyword">let</span> [[gram year n] (str/split rec #<span class="string">"\t"</span>)
gram (str/lower-case gram)
year (<span class="preprocessor">Long/parseLong</span> year)
n (<span class="preprocessor">Long/parseLong</span> n)
decade (<span class="keyword">-></span> year (<span class="builtin">quot</span> 10) (<span class="builtin">*</span> 10))]
[[gram decade] n]))
(<span class="keyword">defn</span> <span class="function-name">select-record?</span>
<span class="doc">"True iff argument record should be included in analysis."</span>
[[[gram year]]]
(<span class="keyword">and</span> (<span class="builtin"><=</span> 1890 year)
(<span class="builtin">re-matches</span> #<span class="string">"^[a-z+'-]+$"</span> gram)))
</pre>
<p>And doing some quick REPL-testing:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1}) (r/map second)
(r/map parse-record)
(r/take 2) (<span class="builtin">into</span> []))
[[[<span class="string">"appropriate"</span> 1990] 492] [[<span class="string">"appropriate"</span> 2000] 793]]
trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (r/map second)
(r/map parse-record)
(r/filter select-record?)
(r/take 2) (<span class="builtin">into</span> []))
[[[<span class="string">"a+"</span> 1890] 1] [[<span class="string">"a+"</span> 1890] 2]]
</pre>
<p>We can glue these functions together into the first map task function we’ll
need for our jobs:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">normalized-m</span>
<span class="doc">"Parse, normalize, and filter 1-gram records."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:vals</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[records]
(<span class="keyword">->></span> records
(r/map parse-record)
(r/filter select-record?)
(pr/reduce-by first (pr/mjuxt pr/arg1 +) [nil 0])))
</pre>
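<p>The <code>pr/reduce-by</code> step merges runs of consecutive records sharing a key. As a
rough, non-streaming model in plain Clojure of what it computes here (a sketch of
the semantics, not Parkour’s implementation):</p>
<pre class="example">;; Sum occurrence counts for consecutive records sharing a key.
(->> [[["a+" 1890] 1] [["a+" 1890] 2] [["a+" 1900] 5]]
     (partition-by first)
     (map (fn [recs]
            [(ffirst recs) (reduce + (map second recs))])))
;;=> ([["a+" 1890] 3] [["a+" 1900] 5])
</pre>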
<p>Because Parkour task functions are just functions, we can test that in the REPL
as well:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1}) (r/map second)
normalized-m (r/take 3) (<span class="builtin">into</span> []))
[[[<span class="string">"appropriate"</span> 1990] 492]
[[<span class="string">"appropriate"</span> 2000] 7242]
[[<span class="string">"appropriated"</span> 1890] 59]]
</pre>
<p>As we develop the functions composing our program, we can iterate rapidly by
immediately seeing the effect of calling our in-development functions on real
data.</p>
<h3>Write some jobs</h3>
<p class="first">Writing the rest of the jobs is a simple matter of programming. We’ll largely
follow the original Hive version, but we will attempt to add the ability for
entirely new terms to trend by applying Laplace smoothing to the occurrence
counts.</p>
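<p>To make the smoothing concrete before diving into the jobs: each per-decade
gram count is incremented by one, and each decade total by the number of distinct
1-grams, so a brand-new term still gets a finite trend ratio. A toy sketch (the
names here are illustrative, not from the jobs below):</p>
<pre class="example">;; Add-one (Laplace) smoothing: a gram with zero occurrences in a
;; decade still gets a small non-zero frequency.
(defn smoothed-freq
  [gram-count decade-total n-grams]
  (/ (inc gram-count) (+ decade-total n-grams)))

(smoothed-freq 0 1000000 500000)   ;=> 1/1500000, instead of 0
</pre>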
<p>Parkour allows us to optimize the entire process down to just two jobs (and no
reduce-side joins). This is possible via the base Java Hadoop interfaces too, but
you’d rarely <em>bother</em> to do it there, because it’d be too fiddly in the raw APIs;
in most higher-level frameworks it’s completely impossible. In Clojure with
Parkour, however, it’s relatively straightforward.</p>
<p>But! – today’s main, REPL-based excitement comes after. So, without further
commentary, the rest of the code:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">nth0-p</span>
<span class="doc">"Partitioning function for first item of key tuple."</span>
<span class="preprocessor">^</span><span class="type">long</span> [k v <span class="preprocessor">^</span><span class="type">long</span> n] (<span class="keyword">-></span> k (<span class="builtin">nth</span> 0) hash (<span class="builtin">mod</span> n)))
(<span class="keyword">defn</span> <span class="function-name">normalized-r</span>
<span class="doc">"Collect: tuples of 1-gram and occurrence counts by decade; total 1-gram
occurrence counts by decade; and counter of total number of 1-grams."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvalgroups</span>,
<span class="constant">::mr/sink-as</span> (dux/named-keyvals <span class="constant">:totals</span>)}
[input]
(<span class="keyword">let</span> [nwords (<span class="preprocessor">.getCounter</span> mr/*context* <span class="string">"normalized"</span> <span class="string">"nwords"</span>)
fnil+ (fnil + 0)]
(<span class="keyword">->></span> input
(r/map (<span class="keyword">fn</span> [[[gram decade] ns]]
[gram [decade (<span class="builtin">reduce</span> + 0 ns)]]))
(pr/reduce-by first (pr/mjuxt pr/arg1 conj) [nil {}])
(<span class="builtin">reduce</span> (<span class="keyword">fn</span> [totals [gram counts]]
(dux/write mr/*context* <span class="constant">:counts</span> gram (<span class="builtin">seq</span> counts))
(<span class="builtin">merge-with</span> + totals counts))
{})
(<span class="builtin">seq</span>))))
(<span class="keyword">defn</span> <span class="function-name">normalized-j</span>
<span class="doc">"Run job accumulating maps of decade-wise raw occurrence counts per 1-gram
and map of total decade-wise Laplace-smoothed occurrence counts."</span>
[conf workdir ngrams]
(<span class="keyword">let</span> [counts-path (fs/<span class="builtin">path</span> workdir <span class="string">"counts"</span>)
totals-path (fs/<span class="builtin">path</span> workdir <span class="string">"totals"</span>)
pkey-as (avro/tuple-schema ['string 'long])
counts-as {<span class="constant">:type</span> 'array, <span class="constant">:items</span> (avro/tuple-schema ['long 'long])}
[counts totals]
, (<span class="keyword">-></span> (pg/input ngrams)
(pg/map #'normalized-m)
(pg/partition (mra/shuffle pkey-as 'long) #'nth0-p)
(pg/reduce #'normalized-r)
(pg/output <span class="constant">:counts</span> (mra/dsink ['string counts-as] counts-path)
<span class="constant">:totals</span> (mra/dsink ['long 'long] totals-path))
(pg/execute conf <span class="string">"normalized"</span>))
gramsc (<span class="keyword">-></span> (mr/counters-map totals)
(<span class="builtin">get-in</span> [<span class="string">"normalized"</span> <span class="string">"nwords"</span>])
double)
fnil+ (fnil + gramsc)
totals (<span class="builtin">reduce</span> (<span class="keyword">fn</span> [m [d n]]
(<span class="builtin">update-in</span> m [d] fnil+ n))
{} totals)]
[counts totals]))
(<span class="keyword">defn</span> <span class="function-name">trending-m</span>
<span class="doc">"Transform decade-wise 1-gram occurrence counts into negated ratios of
occurrence frequencies in adjacent decades."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvals</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[totals input]
(r/mapcat (<span class="keyword">fn</span> [[gram counts]]
(<span class="keyword">let</span> [counts (<span class="builtin">into</span> {} counts)
ratios (<span class="builtin">reduce-kv</span> (<span class="keyword">fn</span> [m dy nd]
(<span class="keyword">let</span> [ng (<span class="builtin">inc</span> (counts dy 0))]
(<span class="builtin">assoc</span> m dy (<span class="builtin">/</span> ng nd))))
{} totals)]
(<span class="keyword">->></span> (<span class="builtin">seq</span> ratios)
(r/map (<span class="keyword">fn</span> [[dy r]]
(<span class="keyword">let</span> [r' (ratios (<span class="builtin">-</span> dy 10))]
(<span class="keyword">if</span> (<span class="keyword">and</span> r' (<span class="builtin"><</span> 0.000001 r))
[[dy (<span class="builtin">-</span> (<span class="builtin">/</span> r r'))] gram]))))
(r/<span class="builtin">remove</span> nil?))))
input))
(<span class="keyword">defn</span> <span class="function-name">trending-r</span>
<span class="doc">"Select top `n` 1-grams per decade by negated occurrence frequency ratios."</span>
{<span class="constant">::mr/source-as</span> <span class="constant">:keyvalgroups</span>, <span class="constant">::mr/sink-as</span> <span class="constant">:keyvals</span>}
[n input]
(r/map (<span class="keyword">fn</span> [[[decade] grams]]
[decade (<span class="builtin">into</span> [] (r/take n grams))])
input))
(<span class="keyword">defn</span> <span class="function-name">trending-j</span>
<span class="doc">"Run job selecting trending terms per decade."</span>
[conf workdir topn counts totals]
(<span class="keyword">let</span> [ratio-as (avro/tuple-schema ['long 'double])
ratio+g-as (avro/grouping-schema 1 ratio-as)
grams-array {<span class="constant">:type</span> 'array, <span class="constant">:items</span> 'string}
trending-path (fs/<span class="builtin">path</span> workdir <span class="string">"trending"</span>)
[trending]
, (<span class="keyword">-></span> (pg/input counts)
(pg/map #'trending-m totals)
(pg/partition (mra/shuffle ratio-as 'string ratio+g-as)
#'nth0-p)
(pg/reduce #'trending-r topn)
(pg/output (mra/dsink ['long grams-array] trending-path))
(pg/execute conf <span class="string">"trending"</span>))]
(<span class="builtin">into</span> (<span class="builtin">sorted-map</span>) trending)))
(<span class="keyword">defn</span> <span class="function-name">trending-terms</span>
<span class="doc">"Calculate the top `topn` trending 1-grams per decade from Google Books 1-gram
corpus dseq `ngrams`. Writes job outputs under `workdir` and configure jobs
using Hadoop configuration `conf`. Returns map of initial decade years to
vectors of trending terms."</span>
[conf workdir topn ngrams]
(<span class="keyword">let</span> [[counts totals] (normalized-j conf workdir ngrams)]
(trending-j conf workdir topn counts totals)))
</pre>
<p>Once we start writing our job, we can use Parkour’s “mixed mode” job execution
to iterate. Mixed mode allows us to run the job in the REPL process, but on the
same live sampled data we were examining before:</p>
<pre class="example">trending-terms.core> (<span class="keyword">->></span> (ngram-dseq 1) (sample/dseq {<span class="constant">:seed</span> 1, <span class="constant">:splits</span> 20})
(trending-terms (conf/local-mr!) <span class="string">"file:tmp/tt/0"</span> 5))
{1900 [<span class="string">"deglet"</span> <span class="string">"warroad"</span> <span class="string">"delicado"</span> <span class="string">"erostrato"</span> <span class="string">"warad"</span>],
1910 [<span class="string">"esbly"</span> <span class="string">"wallonie"</span> <span class="string">"wallstein"</span> <span class="string">"dehmel's"</span> <span class="string">"dellion"</span>],
1920 [<span class="string">"ernestino"</span> <span class="string">"walska"</span> <span class="string">"wandis"</span> <span class="string">"wandke"</span> <span class="string">"watasenia"</span>],
1930 [<span class="string">"delacorte"</span> <span class="string">"phytosociological"</span> <span class="string">"priuatly"</span> <span class="string">"delber"</span> <span class="string">"escapism"</span>],
1940 [<span class="string">"phthalates"</span> <span class="string">"phylic"</span> <span class="string">"espiner"</span> <span class="string">"degrease"</span> <span class="string">"wallonie"</span>],
1950 [<span class="string">"demokos"</span> <span class="string">"wanotan"</span> <span class="string">"ersine"</span> <span class="string">"dekatron"</span> <span class="string">"physicalistically"</span>],
1960 [<span class="string">"warain"</span> <span class="string">"propeking"</span> <span class="string">"warschaw"</span> <span class="string">"demecarium"</span> <span class="string">"pikiran"</span>],
1970 [<span class="string">"prioritize"</span> <span class="string">"walshok"</span> <span class="string">"waterboard"</span> <span class="string">"demogrants"</span> <span class="string">"delisted"</span>],
1980 [<span class="string">"warsl"</span> <span class="string">"watasi"</span> <span class="string">"proarrhythmic"</span> <span class="string">"walonick"</span> <span class="string">"procurved"</span>],
1990 [<span class="string">"wanglie"</span> <span class="string">"procedores"</span> <span class="string">"printlnc"</span> <span class="string">"dejanews"</span> <span class="string">"eslamboli"</span>],
2000 [<span class="string">"wardriving"</span> <span class="string">"erlestoke"</span> <span class="string">"deleterole"</span> <span class="string">"deleteq"</span> <span class="string">"erius"</span>]}
</pre>
<p>This is the moral equivalent of Pig’s <code>ILLUSTRATE</code> command. Parkour lacks the
rigid execution model which allows Pig’s <code>ILLUSTRATE</code> to e.g. synthesize data for
joins, but the simplicity of the approach means any combination of “remote” jobs
and local processing just works, without surprises.</p>
<p>As with sampling input for individual functions, mixed mode job execution
supports rapid development iteration on real data.</p>
<h3>Write a test</h3>
<p class="first">REPL iteration gets results quickly, but once code works, a real test allows us
to have confidence that it will continue to work, and the results are what we
actually expect. So let’s write a test for our job graph:</p>
<pre class="example">(<span class="keyword">ns</span> trending-terms.core-test
(<span class="constant">:require</span> [clojure.test <span class="constant">:refer</span> <span class="constant">:all</span>]
[trending-terms.core <span class="constant">:as</span> tt]
[parkour (fs <span class="constant">:as</span> fs) (conf <span class="constant">:as</span> conf)]
[parkour.io (dsink <span class="constant">:as</span> dsink) (seqf <span class="constant">:as</span> seqf)]
[parkour.test-helpers <span class="constant">:as</span> th])
(<span class="constant">:import</span> [org.apache.hadoop.io <span class="preprocessor">Text</span> <span class="preprocessor">LongWritable</span>]))
(<span class="keyword">def</span> <span class="function-name">n1grams-records</span>
<span class="comment-delimiter">;; </span><span class="comment">omitted from blog post for clarity
</span> )
(<span class="keyword">deftest</span> <span class="function-name">test-basic</span>
(th/with-config
(<span class="keyword">let</span> [workdir (<span class="keyword">doto</span> <span class="string">"tmp/work/basic"</span> fs/path-delete)
inpath (fs/<span class="builtin">path</span> workdir <span class="string">"ngrams"</span>)
ngrams (dsink/with-dseq (seqf/dsink [<span class="preprocessor">LongWritable</span> <span class="preprocessor">Text</span>] inpath)
n1grams-records)
trending (tt/trending-terms (th/config) workdir 1 ngrams)]
(<span class="builtin">is</span> (<span class="builtin">=</span> {1950 [<span class="string">"legal"</span>],
1960 [<span class="string">"assembly"</span>],
1970 [<span class="string">"astoria"</span>],
2000 [<span class="string">"prostate"</span>]}
trending)))))
</pre>
<p>The Parkour <code>th/config</code> function and <code>th/with-config</code> macro allow us to run code
using a purely local-mode Hadoop configuration, even in a process where the
default configuration points to a live cluster. Just like we were able to
REPL-test jobs using mixed mode, we can now run our actual tests in-REPL in full
local mode:</p>
<pre class="example">trending-terms.core-test> (<span class="builtin">run-tests</span>)
<span class="preprocessor">Testing</span> trending-terms.core-test
<span class="preprocessor">Ran</span> 1 tests containing 1 assertions.
0 failures, 0 errors.
{<span class="constant">:type</span> <span class="constant">:summary</span>, <span class="constant">:pass</span> 1, <span class="constant">:test</span> 1, <span class="constant">:error</span> 0, <span class="constant">:fail</span> 0}
</pre>
<p>Success!</p>
<h3>Launch a job</h3>
<p class="first">Once our program is developed and tested, it’s time to run it on the full
dataset. Normally this would involve leaving the REPL to build a job JAR and
deploy it somehow, but Parkour allows us to do this directly from the REPL too:</p>
<pre class="example">trending-terms.core> (<span class="builtin">require</span> '[parkour.repl <span class="constant">:refer</span> [launch!]])
trending-terms.core> (<span class="keyword">def</span> <span class="function-name">*results</span>
(<span class="builtin">future</span> (<span class="keyword">->></span> (ngram-dseq 1)
(launch! {<span class="string">"mapred.reduce.tasks"</span> 8}
trending-terms <span class="string">"tt/0"</span> 5))))
#<Var@3a23a4ec: #<Future@55f5a074: <span class="constant">:pending>></span>
trending-terms.core> (realized? *results)
false
</pre>
<p>(We run the jobs in a future to place them in a background thread, and thus
avoid tying up the REPL while the jobs are running.)</p>
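<p>Because the result is an ordinary Clojure future, we can also poll it with a
timeout instead of blocking – plain Clojure, nothing Parkour-specific:</p>
<pre class="example">;; Block for at most one second, yielding :still-running if the
;; jobs haven't finished yet.
(deref *results 1000 :still-running)
</pre>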
<p>Parkour uses hugoduncan’s <a href="https://github.com/pallet/alembic">Alembic</a> library to load and interact with a full
in-process (but isolated) instance of Leiningen. Using this Leiningen instance,
Parkour just builds your job JAR and assembles your dependencies exactly as
specified by your Leiningen project file.</p>
<p>Once the job finishes, time for some results:</p>
<pre class="example">trending-terms.core> (realized? *results)
true
trending-terms.core> @*results
{1900 [<span class="string">"strether"</span> <span class="string">"fluidextract"</span> <span class="string">"thutmose"</span> <span class="string">"adrenalin"</span> <span class="string">"lekythoi"</span>],
1910 [<span class="string">"orthotype"</span> <span class="string">"britling"</span> <span class="string">"salvarsan"</span> <span class="string">"pacifist"</span> <span class="string">"boches"</span>],
1920 [<span class="string">"liliom"</span> <span class="string">"bacteriophage"</span> <span class="string">"prohack"</span> <span class="string">"vanzetti"</span> <span class="string">"erlend"</span>],
1930 [<span class="string">"vridar"</span> <span class="string">"samghin"</span> <span class="string">"mulan"</span> <span class="string">"nazis"</span> <span class="string">"goebbels"</span>],
1940 [<span class="string">"psia"</span> <span class="string">"plutonium"</span> <span class="string">"luftwaffe"</span> <span class="string">"darlan"</span> <span class="string">"beachhead"</span>],
1950 [<span class="string">"lopatkin"</span> <span class="string">"rooscvelt"</span> <span class="string">"fluoridation"</span> <span class="string">"jacy"</span> <span class="string">"desegregation"</span>],
1960 [<span class="string">"vietcong"</span> <span class="string">"synanon"</span> <span class="string">"tshombe"</span> <span class="string">"lumumba"</span> <span class="string">"psychedelic"</span>],
1970 [<span class="string">"mdhr"</span> <span class="string">"sexist"</span> <span class="string">"sexism"</span> <span class="string">"biofeedback"</span> <span class="string">"counterculture"</span>],
1980 [<span class="string">"affit"</span> <span class="string">"autocad"</span> <span class="string">"dbase"</span> <span class="string">"neob"</span> <span class="string">"garion"</span>],
1990 [<span class="string">"activex"</span> <span class="string">"photoshop"</span> <span class="string">"javascript"</span> <span class="string">"netscape"</span> <span class="string">"toolbars"</span>],
2000 [<span class="string">"cengage"</span> <span class="string">"eventargs"</span> <span class="string">"itunes"</span> <span class="string">"podcast"</span> <span class="string">"wsdl"</span>]}
</pre>
<p>How trendy! It looks like our smoothing function has added more noise from rare
terms<a class="footref" href="#fn.interactive-hadoop-with-parkour.3">3</a>, but the basics are there for the tweaking.</p>
<p>The complete example <a href="https://github.com/llasram/trending-terms">trending-terms</a> project is on Github, if you want to try
experimenting with it (in a live REPL!) yourself.</p>
<h3>Parting thoughts</h3>
<p class="first">Thanks to <a href="https://github.com/rfarrjr">rfarrjr</a> for awesome discussion around these features, and to <a href="https://github.com/ztellman">ztellman</a>
and <a href="https://github.com/ahobson">ahobson</a> for prompting their value and for specific suggestions. I was
honestly skeptical at first that this sort of REPL integration could be made
useful, from past experience trying to make a live-cluster Cascalog REPL work.
But now that these features exist, I’m not sure how I wrote MapReduce programs
without them.</p>
<p>So head over to the <a href="https://github.com/damballa/parkour#parkour">Parkour project</a> and get started!</p>
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.1" id="fn.interactive-hadoop-with-parkour.1">1</a> Version 0.5.4, at the time of writing.
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.2" id="fn.interactive-hadoop-with-parkour.2">2</a> On the EMR Hadoop cluster master node.
<p class="footnote"><a class="footnum" name="fn.interactive-hadoop-with-parkour.3" id="fn.interactive-hadoop-with-parkour.3">3</a> Especially OCR errors.</p>
<h2>(&lt; Ruby Python Clojure) (2012-11-25)</h2>
<p>It’s been a week, but I’m still revved up from attending <a href="http://clojure-conj.org/">Clojure/conj</a>.
The energy, dedication, and intelligence in the Clojure community all
continue to impress me, and definitely influence my choice of Clojure
as my preferred programming language. Of course the language itself
remains the most significant factor, and that’s something I’ve been
meaning to blog about for some time.</p>
<p>My language of choice has gone from C to Perl<a class="footref" href="#fn.ruby-python-clojure.1">1</a> to Python to Ruby to
Python again, and then finally to Clojure. Each transition has been
motivated by a perceived increase in practical expressiveness – an
ability to communicate both effect and intent more directly and
efficiently, both to the computer and to other humans. C to Perl
provided garbage collection, first-class collections, and CPAN. Perl
to Python provided real objects, comprehensible semantics, and a
syntax that did not resemble line-noise. Python to Ruby provided
blocks, open classes, and pervasive metaprogramming.</p>
<p>Ruby back to Python takes a bit more explaining. In so many ways the
languages are so similar as to be almost indistinguishable.</p>
<p>One difference is in the communities surrounding the languages. Ruby
has come a long, long way from the days when most of its documentation
was only available in Japanese. But – there are differing cultural
norms between the two communities on how best to balance
documentation, stable interfaces, implementation quality, and the
introduction of new features. I feel that the Ruby community
pervasively emphasizes the latter above the others to a greater extent
than I prefer. The Python language, standard library, and third-party
libraries tend to be better documented and more stable than their Ruby
counterparts. Not universally or to an unjustifiable extent – just
sufficiently that I see a distinction, and prefer Python.</p>
<p>The other most significant aspect is that I believe Python APIs tend
to be more value-oriented than their equivalent Ruby APIs. This one
is a little difficult to explain...</p>
<p>Ruby more closely follows the canonical OO model by having a concept
of “messages.” All method invocations and function calls are actually
deliveries of messages to objects. When you see a line of Ruby code
like:</p>
<pre class="example">invoke_method arg
</pre>
<p>All you know is that the implicit <code>self</code> at that particular line is
going to receive the <code>invoke_method</code> message with the argument <code>arg</code>. You
first need to unwind the evaluation state to figure what object <code>self</code>
is at that point. Then you need to figure out how it will actually
handle the message, which may be via a class instance method, an
included module method, or <code>method_missing</code>. Messages themselves aren’t
first-class or introspectable, so there’s no way to ask “if I send the
<code>invoke_method</code> message here, who exactly will respond?”</p>
<p>Python in contrast has no implicit scopes and models everything in
terms of attribute access and function-calling<a class="footref" href="#fn.ruby-python-clojure.2">2</a>. When you see a
line of Python code like:</p>
<pre class="example"><span class="builtin">object</span>.method(arg)
</pre>
<p>You know there is an in-scope, introspectable value named <code>object</code> at
this line. That value is asked via the attribute-access protocol for
its <code>method</code> member, which yields another introspectable value. That
attribute value is invoked via the function call protocol with the
argument <code>arg</code>. At every step of the way you have a concrete,
introspectable value backed by code you can find and follow.</p>
<p>In my experience these differences aren’t just a curiosity, but
directly impact the way one most concisely expresses code in the two
languages. In Ruby libraries it is a very common idiom to provide
methods invoked within a class definition or <code>instance_eval</code>’d block
which build state by modifying the implicit self. The moral analog of
this in Python (metaclasses) is relatively rare, and most Python
libraries provide interfaces in terms of concrete values. For a good
example of what I’m talking about, look at the difference between the
<a href="https://github.com/ffi/ffi">ruby-ffi</a> and Python <a href="http://docs.python.org/2/library/ctypes.html">ctypes</a> libraries.</p>
<p>And then finally, Clojure.</p>
<p>For the few years I’ve been using it, I’ve found that Clojure
maximizes practical expressiveness in three ways which beat all other
languages I’ve tried. It has everything you’d expect from a language
like Python or Ruby, but then adds these things on top.</p>
<p>First there’s the obvious: macros. Ruby and Python provide
metaprogramming facilities which allow programmatically generating any
constructs one could produce directly through code. But doing so is
not always going to be as legible or compact as simply writing out the
desired result, nor will it necessarily bear a clear relation to that
directly-expressed version. With Clojure macros, you instead have the
ability to do arbitrary in-code code-generation, using the same
functions as for general-purpose code, and acting effectively like
executable structural templates for the expanded code. In Clojure,
there never needs to be any boilerplate – even if macros aren’t always
the <em>best</em> method for eliminating code repetition, they are a method
which can be used to eliminate <em>any</em> code repetition.</p>
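<p>As a small, hypothetical taste of what that looks like – a macro that stamps
out accessor functions for a set of keys, eliminating the per-key boilerplate
(a sketch, not code from any real library):</p>
<pre class="example">(defmacro defaccessors
  "Define a getter function for each keyword in `ks`."
  [& ks]
  `(do ~@(for [k ks]
           `(defn ~(symbol (name k)) [m#] (get m# ~k)))))

;; (defaccessors :host :port) expands into
;; (do (defn host [m] (get m :host)) (defn port [m] (get m :port)))
</pre>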
<p>Second is the language’s functional style and culture. Without the
language itself enforcing functional purity, the style of the Clojure
standard library and the norms of the Clojure community encourage it.
This gives what I see as the best of both worlds: the ability to write
and reason about most code as pure functions operating on immutable
values, while escaping to side effects and mutation when necessary
without any excess ceremony. This makes it even more value-oriented
than Python, with almost all execution happening in terms of values
which are not only concrete and visible, but also immutable.</p>
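<p>A tiny illustration of that division, in plain Clojure: pure transformation of
immutable values by default, with an atom when mutable state is genuinely needed:</p>
<pre class="example">;; Pure by default: update-in returns a new immutable map.
(update-in {:hits 0} [:hits] inc)   ;=> {:hits 1}

;; Uncceremonious mutation when needed: an atom plus swap!.
(def counter (atom 0))
(swap! counter inc)                 ;=> 1
</pre>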
<p>Third is Clojure’s definition as an explicitly hosted language, and
specifically the JVM as the best-supported hosting platform. I was a
bit resistant at first to getting deeply acquainted with the JVM and
(oh, the humanity!) Java, but I now see it as a significant benefit.
Part of practical expressiveness is not needing to write code
incidental to the application domain. The vast ecosystem of Java
libraries and the ease with which <a href="https://github.com/technomancy/leiningen">Leiningen</a> allows access to them
significantly cuts down on incidental code. The JVM itself of course
is rock-solid, fast<a class="footref" href="#fn.ruby-python-clojure.3">3</a>, and universal. From a practical perspective,
it’s a clear win for almost all applications.</p>
<p>Anyway, enough blogging – time to go write some Clojure!</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.1" id="fn.ruby-python-clojure.1">1</a> In college >10 years ago, so cut me a little slack.</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.2" id="fn.ruby-python-clojure.2">2</a> Which I believe you could think of as being messages, but I’m not
sure that really helps much.</p>
<p class="footnote"><a class="footnum" name="fn.ruby-python-clojure.3" id="fn.ruby-python-clojure.3">3</a> Well, after boot.</p>
<h2>Without comment (2012-04-05)</h2>
<p><strong>Update</strong>: Ok, I was wrong! Or at least I greatly over-stated my case. People have
shown me lots of great examples of when comments-qua-comments can be useful<a class="footref" href="#fn.without-comment.1">1</a>.
I still think that <em>much of the time</em>, the time spent writing explanatory
comments would be better spent just making the implementation clearer. But
code can’t always perfectly capture intent or the influence of external
factors. Consider my incipient dogmatism rescinded.</p>
<p>I am on the record at my current office saying that I prefer to work on
code-bases without comments. I do my best to follow this preference in the
code I write myself, which occasionally provokes some, er – comment. I can see
why this might be controversial, so I’d like to explain.</p>
<h3>Not comments</h3>
<p class="first">First off, some things that aren’t “comments” in the sense I mean; this:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">clojure-function</span>
<span class="doc">"Calculates a result from arguments `args`."</span>
[& args] ... result)
</pre>
<p>Or this:</p>
<pre class="example"><span class="keyword">def</span> <span class="function-name">python_function</span>(*args):
<span class="string">"""Calculates a result from arguments `args`."""</span>
...
<span class="keyword">return</span> result
</pre>
<p>Or even this:</p>
<pre class="example"><span class="doc">/**
* Calculates a result from arguments.
*
* </span><span class="doc"><span class="constant">@param</span></span><span class="doc"> args the arguments to use in the calculation
*/</span>
<span class="type">AbstractInterfaceAdaptorProxy</span>
<span class="function-name">javaMethod</span>(<span class="type">FlyweightFacadeFactory</span> <span class="variable-name">args</span>...) {
...
<span class="keyword">return</span> result;
}
</pre>
<p>The first two examples are obviously not comments – they’re strings. To be
precise, they’re <a href="http://en.wikipedia.org/wiki/Docstring">docstrings</a>. The third example uses Java’s syntax for
comments, but only because Java doesn’t have docstrings. All three are pieces
of <em>interface documentation</em> which are consumed in a structured way by an
automated documentation system. The syntax of comments provides a convenient
way to bolt on structured in-line interface documentation for languages which
don’t have built-in support for it, but it’s hardly what comments are “for”;
otherwise languages with real docstrings wouldn’t have a separate syntax for
comments.</p>
<p>Also not comments: the input to systems like <a href="http://jashkenas.github.com/docco/">docco</a> and <a href="http://fogus.me/fun/marginalia/">marginalia</a>. These sorts
of systems use the syntax of comments to bolt on support for producing
comprehensive, structured <em>implementation</em> documentation. They only use the
syntax of comments because no one aside from <a href="http://en.wikipedia.org/wiki/Literate_programming">Donald Knuth</a> seems to be able to
make it work to write the implementation in-line in the documentation<a class="footref" href="#fn.without-comment.2">2</a>.
Turning the documentation/implementation relationship inside-out shows use of
the “comment” syntax as an artifact of convenience.</p>
<h3>Comments, a taxonomy</h3>
<p class="first">Ok, so what does that leave as “actually comments”?</p>
<h4>Implementation-repetition</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Append a dot to the end
</span>some_string += <span class="string">"."</span>
</pre>
<p>I can read just fine, thanks!</p>
<h4>From the peanut gallery</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Wow, what a kludge
</span>object.send(<span class="constant">:private_method</span>)
</pre>
<p>Well, then why are you doing it?</p>
<h4>Completely wrong</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Only show users specials on Sunday
</span><span class="keyword">return</span> <span class="variable-name">false</span> <span class="keyword">unless</span> username =~ <span class="type">VALID_USERNAME_RE</span>
</pre>
<p>Looks like someone changed the code without making sure the comments still
matched!</p>
<h4>Right?</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Increment by 2 to account for leap-seconds since 2004
</span>seconds += 3
</pre>
<p>Just close enough to create the potential for confusion. Has there been
another leap-second, or is <code>seconds</code> being incremented for a completely different
reason?</p>
<h4>What I tell you three times is true</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">BACKEND-192: Implement secondary sort so that we can track preferences
</span><span class="comment-delimiter"># </span><span class="comment">on a per-user basis.
</span>operation.sort_key = [<span class="constant">:overall_rating</span>, <span class="constant">:user_preference</span>]
</pre>
<p>So now there are three different explanations of what the code is doing: the
high-level description linked to changing requirements in the issue tracking
system, the historical implementation description in the associated commit
message, and... this comment, which isn’t linked to changing requirements or to
the history of changes. Unless you look at the ticket to see what’s changed,
or look at the commit log to see the change in context. The comment in the
code adds absolutely nothing.</p>
<h4>Right, but...</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Increment by 3 to account for leap-seconds since 2004
</span>seconds += 3
</pre>
<p>The comment is right, and usefully explains what’s happening semantically – but
why use a comment when you could just make the implementation read as clearly
without one?</p>
<pre class="example"><span class="type">LEAP_SECONDS_SINCE_2004</span> = 3
...
seconds += <span class="type">LEAP_SECONDS_SINCE_2004</span>
</pre>
<p>If you need to explain <em>why</em> you’re adding in the leap-seconds, you can add a
semantically-named function/method which performs the operation. I strongly
believe that in most situations it’s possible to make the code itself just as
clear as any comment could make it.</p>
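<p>For instance – a hypothetical sketch of that refactoring, in Clojure rather
than the Ruby of the examples above – pushing both the value and the reason into
named code:</p>
<pre class="example">(def leap-seconds-since-2004 3)

(defn adjust-for-leap-seconds
  "Account for leap-seconds accumulated since 2004."
  [seconds]
  (+ seconds leap-seconds-since-2004))
</pre>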
<h4>Explaining the horror</h4>
<pre class="example"><span class="comment-delimiter"># </span><span class="comment">Warning: massive kludge, but we can’t fix it until we re-implement
</span><span class="comment-delimiter"># </span><span class="comment">the primary business logic, which is currently written in a dialect
</span><span class="comment-delimiter"># </span><span class="comment">of REXX invented by Jim, who quit yesterday.
</span>object.send(<span class="constant">:private_method</span>)
</pre>
<p>Hey, that’s actually useful!</p>
<p>It also indicates a code-base I’d <em>really</em> prefer <em>not</em> to work on if I can avoid
it, and is also the kind of comment I certainly never <em>want</em> to feel the need to
write myself.</p>
<h3>Conclusion</h3>
<p class="first">So that’s why I prefer there be no comments: the only real use I see for
comments-qua-comments is unstructured documentation of the reasons behind
specific implementation warts. If I can help it, I prefer the code-bases I
work on not have such warts. QED.</p>
<p>Comments?</p>
<p class="footnote"><a class="footnum" name="fn.without-comment.1" id="fn.without-comment.1">1</a> Erik Peterson’s list in the comments on this post is a pretty good summary.</p>
<p class="footnote"><a class="footnum" name="fn.without-comment.2" id="fn.without-comment.2">2</a> I know that there are other literate programming practitioners out there,
but I must confess to never having tried it myself. Maybe some day.</p>
<h2>Restoring Features (2012-04-05)</h2>
<p>Couldn’t sleep, worked on blog instead.</p>
<p>I managed to get comments up with <a href="http://disqus.com/">Disqus</a> yesterday. Now I’ve got Jekyll
rendering posts using Emacs <code>muse-mode</code>, just like my old blogging system did.
Among other things, that means that I can use Emacs as my syntax-highlighting
engine again:</p>
<pre class="example">(<span class="keyword">defn</span> <span class="function-name">with-starts?</span>
<span class="doc">"Does string s begin with the provided prefix?"</span>
{<span class="constant">:inline</span> (<span class="keyword">fn</span> [prefix s & to]
`(<span class="keyword">let</span> [<span class="preprocessor">^</span><span class="type">String</span> s# ~s, <span class="preprocessor">^</span><span class="type">String</span> prefix# ~prefix]
(<span class="preprocessor">.startsWith</span> s# prefix# ~@(<span class="keyword">when</span> (<span class="builtin">seq</span> to) [`(<span class="builtin">int</span> ~@to)]))))
<span class="constant">:inline-arities</span> #{2 3}}
([prefix s] (<span class="preprocessor">.startsWith</span> <span class="preprocessor">^</span><span class="type">String</span> s <span class="preprocessor">^</span><span class="type">String</span> prefix))
([prefix s to] (<span class="preprocessor">.startsWith</span> <span class="preprocessor">^</span><span class="type">String</span> s <span class="preprocessor">^</span><span class="type">String</span> prefix (<span class="builtin">int</span> to))))
</pre>
<p>And I once again have footnotes<a class="footref" href="#fn.restore-features.1">1</a> which jump over into the side margin. I had
to convert my old footnote-mangling code to jQuery from Prototype (yes, it was
<em>that</em> old). But hey – it works now!</p>
<p>And I even added back an <a href="/syndicate">Atom feed</a> and the <a href="/archive">archived posts</a> page.</p>
<p>I think that’s actually it. Blog once again fully armed and operational. Not
too shabby for a few hours of insomnia. Zzz...</p>
<p class="footnote"><a class="footnum" name="fn.restore-features.1" id="fn.restore-features.1">1</a> So yes, actually sidenotes when everything works properly.</p>
<h2>Back up; backup! (2012-04-04)</h2>
<p>My now <em>very</em> former hosting provider “lost” my blog VPS, and I lost everything on it. My automated backup process had apparently stopped working when I’d shuffled some paths around. Oops. Thankfully the Wayback machine archived my blog for me, so I’ll be able to get back most of it. My previous process and custom+creaky blog engine was too complicated anyway – let’s give a try to a static site (generated with Jekyll).</p>
<p>I’ll be restoring functionality and old blog posts as I go, but I wanted to get <em>something</em> back online. It’s felt odd not having a Web presence – like I was no longer really there on the Internet, or some sort of Internet ghost. Spooooooky.</p>
<p>Anyway, I’m glad to be back.</p>
<h2>Clojure `case' is not for you (2011-10-15)</h2>
<p>I lost the original version of this post. C’est la vie. The gist is that
Clojure’s standard <code>case</code> macro somewhat surprisingly (to me) does not support
arbitrary expressions as test conditions – only read-time constants. It does
this because its primary use, from the Clojure-implementation perspective, is to
support fast dispatch on symbol/keyword values.</p>
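<p>A quick demonstration of the surprise (the var here is purely illustrative):</p>
<pre class="example">(def two 2)
(case 2
  two :matched
  :no-match)
;;=> :no-match ; `two' is treated as the literal symbol, not its value
</pre>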
<p>Other people have blogged about this and provided their own less-surprising
implementations, including <a href="http://cemerick.com/2010/08/03/enhancing-clojures-case-to-evaluate-dispatch-values/">cemerick</a>.</p>
<p>I didn’t like any of the ones I found elsewhere, so here’s mine:</p>
<pre class="example">(<span class="keyword">defmacro</span> <span class="function-name">case-expr</span>
<span class="doc">"Like case, but only supports individual test expressions, which are
evaluated at macro-expansion time."</span>
[e & clauses]
`(<span class="keyword">case</span> ~e
~@(<span class="builtin">concat</span>
(<span class="builtin">mapcat</span> (<span class="keyword">fn</span> [[test result]]
[(<span class="builtin">eval</span> `(<span class="keyword">let</span> [test# ~test] test#)) result])
(<span class="builtin">partition</span> 2 clauses))
(<span class="keyword">when</span> (<span class="builtin">odd?</span> (<span class="builtin">count</span> clauses))
(<span class="builtin">list</span> (<span class="builtin">last</span> clauses))))))
</pre>
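<p>A hypothetical usage sketch – the test expressions are evaluated once, when
the macro expands:</p>
<pre class="example">(let [status 200]
  (case-expr status
    (+ 100 100) :ok
    404         :not-found
    :unknown))
;;=> :ok, because (+ 100 100) was already 200 at macro-expansion time
</pre>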
<p>Enjoy!</p>
<h2>Confuse the customer... and win? (2010-02-04)</h2>
<p>With <a href='http://whatever.scalzi.com/2010/02/01/all-the-many-ways-amazon-so-very-failed-the-weekend/'>Amazon.com not selling Macmillan books</a> at the moment, now might seem like a good time to go to the source and <a href='http://www.panmacmillan.com/Categories/EBooks/?SideNav=CategoryNav&SubjectID=55&Imprint='>buy ebooks directly from Macmillan</a>. And they even have multiple formats available: the “Adobe Digital Edition” format and the “Adobe eReader” format.</p>
<p>Wait, what?</p>
<p>According to their <a href='http://www.panmacmillan.com/Categories/Ebooks/displayPage.asp?PageTitle=Ebooks%20information%20and%20help'>Ebooks information and help page</a>, “Adobe eReader” books enable you “to read high-fidelity ebooks alongside other PDF files. Only this reader software displays ebooks with the pictures, graphics, and rich fonts you’ve come to expect from printed books.” While “Adobe Digital Editions” is “Adobe’s reader designed for eBooks” and “uses a format based on the Open Publishing Standard with the extension .epub,” but “ADE will also display your PDF files in a double-page, single page, or fit-to-width view — or you can specify your own custom fit.” Both formats have software download links, which both redirect to Adobe’s current Digital Editions page.</p>
<p>So. Er. I think “Adobe eReader” is PDF and “Adobe Digital Editions” is EPUB. Format proliferation is bad enough without making format identification more difficult than necessary.</p>
<h2>The vicious cycle of piracy? (2010-02-03)</h2>
<p>A few years ago I discovered that Wizards of the Coast had started selling PDF versions of pretty much the entire catalog of TSR-published original D&D, AD&D, and AD&D 2nd Ed material, all at quite reasonable prices. I’ve been fond of the <a href='http://en.wikipedia.org/wiki/Planescape'>Planescape</a> setting ever since I was first introduced to it, and I impulsively bought the majority of the AD&D 2nd Ed Planescape manuals. You know, in case I ever get suddenly transported back to the ’90s or something. I later lost those precious, fully-paid-for bits in a hard drive crash, and didn’t bother redownloading them again for reasons which seemed reasonable at the time. I mean, the vendor I bought them from will surely let me download them again whenever I decide to. What could possibly go wrong?</p>
<p>Fast forward to the present. Finally setting up a reasonable backup scheme jogged my memory of previously lost bits, and I decided to try downloading new copies of those RPG manuals. And… <a href='http://rpg.drivethrustuff.com/'>the vendor</a> still exists… I’m able to log in… they have my complete order history… they have download links!… and – no. Apparently WotC pulled the plug, stopping all e-book sales of both their current and out-of-print material, including re-downloads of already-sold titles.</p>
<p>Of course, a quick search turned up Rapidshare-hosted copies of all the books I’d purchased, which I felt no scruples about downloading. But chicken or egg – are the books so easily available because WotC removed them from legitimate channels? Or was the pull in the first place a response to widespread piracy? Either way, I don’t see how WotC is benefiting.</p>
<h2>Creative Reformation (2010-02-03)</h2>
<p>Faruk Ates responds to Mark Pilgrim’s <a href="http://diveintomark.org/archives/2010/01/29/tinkerers-sunset">Tinker’s Sunset</a> by trivializing tinkering
and claiming that the ease-to-use devices like the iPad are fostering a
<a href="http://web.archive.org/web/20100807231527/http://farukat.es/journal/2010/02/390-the-creative-revolution">Creative Revolution</a>:</p>
<blockquote>
<p class="quoted">The simple matter is that these guys are old, and they grew up in an age
where tinkering was the only possible course of action if you wanted to use
the latest and greatest technology to its fullest potential. The Mac, in
1984, shifted that paradigm of creativity and creation towards average
consumers a little. The iPhone and iPad are shifting it even further
towards consumers, away from the tinkerers of old, the small little “elite”
that excludes the vast majority of people.</p>
</blockquote>
<p>He may be right about fostering creativity, but he’s missing the point. Making
the iPad accessible to non-tinkerers and making it untinkerable are completely
orthogonal.</p>
<p>Imagine that the iPad worked exactly as Apple has already presented, but it also
had a “tinker” switch. When on, this switch allowed users to run applications
not signed by Apple, with appropriately dire warnings. Problem solved, without
impacting typical user experience.</p>
<p>“Tinkering” is easy to trivialize, but doing so ignores what its prevention
represents technically. What we’re talking about on the iPad is an impenetrable
cryptographic shield which gives Apple absolute control over what code is
allowed to run. Apple, not users, determines what applications are
appropriate. Apple is free to censor not only content which fails to meet their
technical standards, but also content which conflicts with their business
interests or which they deem “obscene.” No matter how light the shackles, on
the iPad (and iPhone) you are not free.
<p>On the flip side, like Pilgrim I do see “tinkering” as valuable in
itself. Software stacks are inherently more knowable than any other engineered
systems, and one can learn from them. Fully free systems (in Stallman’s sense)
are the most knowable and most instructive, by virtue of the source code for
every component being there for the asking. On a cryptographically shielded
platform like the iPad this is impossible: the system is unknowable, and there
is nothing to be learned, even for the “elite.”</p>
<h3><a href="http://blog.platypope.org/2010/1/7/solving-puzzles-with-computer-science">Solving puzzles with (computer) science</a> (2010-01-07)</h3>
<p>For Christmas my girlfriend gave me a series of increasingly difficult
wooden-block puzzles, yielding up each one only as I solved the previous. She’d
show me the assembled puzzle — to mock me, I assume — then dump it disassembled
into my lap<a class="footref" href="#fn.solving-puzzles-with-computer-science.1">1</a>. The first one was a quickly-solved freebie, but the remaining
three were pretty difficult. Difficult enough even that I decided to let
computers do the boring work and wrote programs to solve them. Ah, the joys of
being a software engineer!</p>
<p>And if you think this is “cheating,” I feel the final results from my “bonus
round” (at the end of this post) validate the approach.</p>
<p>Round 1 was the <a href="http://paxpuzzle.com/king-snake-medium-p-265.html">King Snake</a>. This puzzle is a 4x4x4 cube composed of 64 linearly
connected unit-cubes. The strand of unit-cubes is divided into 46 segments of
2-4 cubes each. Each segment shares a joint cube with its previous segment and
freely rotates to form any right angle with that segment. A naive estimate of
the problem space is greater than 1e27 (4 possible right angles raised to the
power of 46 segments). However, most moves eliminate 25-75% of the remaining problem space
by either running into already-filled space or leaving the allowed 4x4x4
grid. Counting on the constraints quickly reducing the problem space, I
initially solved this one with a pretty naive depth-first search written in
Python.</p>
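<p>For the curious, the naive search looks something like the following sketch
(a reconstruction for illustration, not the original program; the segment
lengths here are made up, and the real puzzle has 46 of them):</p>
<pre class="example">N = 4
DIRS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
SEGMENTS = [3, 3, 2, 4, 2]  # hypothetical segment lengths, in cubes

def solve(pos, filled, segments, prev_dir):
    """Fold the remaining segments into the NxNxN grid, depth-first.
    Returns the list of segment directions, or None if stuck."""
    if not segments:
        return []
    for d in DIRS:
        # Consecutive segments must meet at a right angle, so skip
        # directions parallel or anti-parallel to the previous segment.
        if prev_dir in (d, tuple(-c for c in d)):
            continue
        # The joint cube is shared, so a length-L segment adds L-1 new cubes.
        x, y, z = pos
        cubes = []
        for _ in range(segments[0] - 1):
            x, y, z = x + d[0], y + d[1], z + d[2]
            if not (0 <= x < N and 0 <= y < N and 0 <= z < N) or (x, y, z) in filled:
                break  # left the grid or hit filled space: prune this branch
            cubes.append((x, y, z))
        else:
            rest = solve((x, y, z), filled | set(cubes), segments[1:], d)
            if rest is not None:
                return [d] + rest
    return None

start = (0, 0, 0)
print(solve(start, {start}, SEGMENTS, None))
</pre>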
<p>The Python program took about 5 hours to run, and was designed to stop at the
first solution. But hey! — it worked. I later revisited this
puzzle with what I learned from solving the others, but I’ll get to that at the
end.</p>
<p>Round 2 was the <a href="http://www.creativecrafthouse.com/index.php?main_page=product_info&products_id=278">Shipper’s Dilemma Z</a>. This puzzle is a 5x5x5 cube composed of 25
identical, 5-unit-cube, vaguely Z-shaped pieces. For this one I did some
research, hoping something about tessellation would help. Alas, even though the
pieces are all identical, it appears that this is still an (NP-complete)
packing problem. There are 960 positions a piece could occupy within the space,
which for 25 pieces yields a naive complexity estimate of greater than
1e49<a class="footref" href="#fn.solving-puzzles-with-computer-science.2">2</a>. Ouch. It seemed much less likely in this case that the basic problem
constraints would help much, as a naive depth-first search would prune only a
few possibilities with each move. I wrote a simple first attempt in Python
anyway. It got absolutely nowhere, which led me to decide to try something
different, something “clever.” And thus down a rabbit-hole I went.</p>
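<p>As a quick sanity check on that estimate (my arithmetic, not part of the
original analysis): choosing 25 of the 960 possible placements, ignoring
overlap between them, gives</p>
<pre class="example">from math import comb

# Pieces are identical, so order doesn't matter: 960 choose 25.
print(f"{comb(960, 25):.2e}")  # => about 1.7e+49
</pre>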
<p>I’ll leave out most of the details, but I wasted a lot of holiday time trying
to solve this puzzle with something akin to dynamic programming. I’d build
groups of 3 pieces “attached” to a vertex, combine those into groups of 6
pieces attached to a quadrant, combine those into groups of 12 attached to a
side, then combine those into groups of 24 which (for solutions) would
(theoretically) form the full cube with one piece-shaped hole somewhere in the
middle. It was a terrible, terrible idea. The computational complexity of each
step varied wildly as I changed my methods for forming groups and what
assumptions I made about the properties of solution-participating groups. At
one point the execution-time was lagging just enough that I re-wrote the
solution-grinding code in C. Later the complexity landed just past what was
feasible to run at home, but within what I could run on <a href="http://aws.amazon.com/elasticmapreduce/">Amazon’s Elastic MapReduce</a>. I
did learn how to use Hadoop, EC2, and EMR, but — long story short — none of it
yielded a solution. I eventually climbed out of the rabbit-hole and went back
to a depth-first search, but this time with a much better conceptualization of
the problem.</p>
<p>Fast C primitives help<a class="footref" href="#fn.solving-puzzles-with-computer-science.3">3</a>, but the only real way to solve a problem of this sort
is by exponentially reducing the search space. The first step is to avoid
working on any “unsolvable” states. Many patterns of piece-placement leave gaps
which no subsequent piece can fill. I test for this by filling in a copy of the
board state with all the pieces which could fit, even if those pieces
themselves overlap. If the cube isn’t completely filled, then the initial board
state cannot lead to a solution.</p>
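<p>In code, the test looks something like this sketch, assuming board states
and piece placements are represented as bitmasks over the 125 cells of the
5x5x5 cube, with the 960 legal placements precomputed elsewhere:</p>
<pre class="example">FULL = (1 << 125) - 1  # all 125 cells filled

def feasible(state, placements):
    """Prune states with unfillable gaps. `placements` is the precomputed
    list of all 960 legal piece placements, one bitmask each."""
    cover = state
    for p in placements:
        if p & state == 0:  # piece still fits: no overlap with filled cells
            cover |= p      # overlap *between* candidate pieces is fine here
    return cover == FULL    # an uncoverable cell means a dead end
</pre>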
<p>Another obvious step for any sort of space-pattern problem is to eliminate all
rotations and reflections which result in other states in the same symmetry
group. For a cubical puzzle like this one, eliminating symmetries reduces the
state-space by a <a href="http://www.ams.org/featurecolumn/archive/cubes7.html">factor of 48</a>. In this case, I did so by initially calculating
all symmetrically unique states which have 8 pieces placed, one in each
corner. This also has the nice side-effect of splitting the search-space into
parallelizable segments, although that turned out to be unnecessary.</p>
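<p>Generating the 48 symmetries is pleasantly mechanical: they are just the 6
permutations of the coordinate axes crossed with the 8 patterns of axis
reflections. A sketch of the idea (mine, not the original code):</p>
<pre class="example">from itertools import permutations, product

def symmetries(n=5):
    """Yield the 48 symmetries of the n-cube's cell grid as functions on
    (x, y, z) cells: 6 axis permutations x 8 reflection patterns."""
    for perm in permutations(range(3)):
        for flips in product((False, True), repeat=3):
            def xform(cell, perm=perm, flips=flips):
                c = tuple(cell[i] for i in perm)
                return tuple((n - 1 - v) if f else v for v, f in zip(c, flips))
            yield xform

def canonical(cells, n=5):
    """Pick one representative of a state's symmetry class, so that
    symmetric duplicates compare equal and can be discarded."""
    return min((frozenset(f(c) for c in cells) for f in symmetries(n)),
               key=sorted)
</pre>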
<p>The final step necessary to explore all the “interesting” states in a
reasonable amount of time is to eliminate the exploration of duplicate
states. Just naively iterating over piece combinations will result in trying
both piece A then B and piece B then A, even though they result in exploring
the same board states. I iterated over a couple of approaches to this, but
eventually hit upon simply ensuring that each position in the puzzle is filled
in a fixed sequence. This minimizes the number of options available for each
placement and doesn’t require any explicit book-keeping to short-circuit
previously-explored board states.</p>
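<p>Concretely (again as a sketch, with the same bitmask representation as
above, and assuming <code>by_cell[i]</code> holds the precomputed placements
covering cell <code>i</code>): always branch on the lowest-numbered empty
cell, so each reachable board state has exactly one placement order.</p>
<pre class="example">def search(state, path, by_cell, solutions, full=(1 << 125) - 1):
    """Depth-first search that always fills the lowest empty cell next."""
    if state == full:
        solutions.append(list(path))
        return
    cell = ((~state) & (state + 1)).bit_length() - 1  # index of lowest zero bit
    for p in by_cell[cell]:
        if p & state == 0:  # placement doesn't collide with filled cells
            path.append(p)
            search(state | p, path, by_cell, solutions, full)
            path.pop()
</pre>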
<p>Final running-time: 2.5 minutes to generate <a href="http://files.platypope.org/shipz-solution.txt">all four solutions</a>.</p>
<p>Round 3 was the <a href="http://paxpuzzle.com/ramu-octahedron-p-347.html">RAMU OCTAHEDRON</a><a class="footref" href="#fn.solving-puzzles-with-computer-science.4">4</a>. This one is kind of an irregular
decahedron<a class="footref" href="#fn.solving-puzzles-with-computer-science.5">5</a> represented as a 5x5x5 cube with the 4 unit-cubes at each vertex
removed. It splits into 8 very irregularly-shaped pieces. There are also two
small wooden spheres which occupy unfilled space within the cube and “lock” it
by preventing motion of the “key” piece unless the spheres are shifted into
particular positions. The site calls the Ramu Octahedron their “most difficult
puzzle” and claims that only one person has solved it without reference to the
solution. The difficulties of this puzzle are three-fold: the irregularity of
the pieces makes conceptualizing their spatial placement difficult; many of the
pieces can only be placed by combinations of separate “insertion” then
“locking” motions; and once the puzzle is assembled, one must determine how to
maneuver the spheres within the unfilled internal space to allow
re-disassembly.</p>
<p>Fortunately the piece-irregularity poses little difficulty for a
computer. After laboriously entering the shape of each piece, I was able to
re-use code I’d written for the previous puzzle to generate all the interesting
piece rotations and translations<a class="footref" href="#fn.solving-puzzles-with-computer-science.6">6</a> and their various combinations. Once that
gave me the solution’s spatial arrangement, it required a bit of trial-and-error
to figure out a working piece order and insert/lock sequences, but I was able
to manage it by hand. Figuring out the necessary unlocking rotations was also
easy enough, at least after I added a map of the unfilled internal space to the
solution.</p>
<p>Final running-time: 1 second to generate the <a href="http://files.platypope.org/ramu-solution.txt">single solution</a>.</p>
<p>Bonus round, back to the King Snake. I decided to come back to this puzzle with
what I’d learned from solving the others and try to generate all its solutions
in a reasonable amount of time. I didn’t have any new ideas about how to frame
the puzzle, but I did have some new implementation tools: pre-generating and
indexing by position all legal piece arrangements, and representing puzzle
states as bit-fields. These two together allow for a blazing-fast solution.</p>
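<p>Sketched concretely (an illustration of the structure, not the original
code): each of the 64 cells maps to one bit, so a whole board state fits in a
single integer, and every legal segment placement becomes a precomputed
mask.</p>
<pre class="example">N = 4
DIRS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]

def bit(x, y, z):
    return 1 << (x + N * (y + N * z))  # one bit per cell of the 4x4x4 grid

def segment_mask(start, d, length):
    """Mask of the length-1 cubes a segment adds past its joint cube, plus
    the segment's end cell; None if the segment leaves the grid."""
    mask, (x, y, z) = 0, start
    for _ in range(length - 1):
        x, y, z = x + d[0], y + d[1], z + d[2]
        if not (0 <= x < N and 0 <= y < N and 0 <= z < N):
            return None
        mask |= bit(x, y, z)
    return mask, (x, y, z)

# Index every legal placement by (joint cell, direction, length). The DFS
# then reduces to integer tests: `mask & filled == 0` means the segment fits.
CELLS = [(x, y, z) for x in range(N) for y in range(N) for z in range(N)]
PLACEMENTS = {(s, d, l): m
              for s in CELLS for d in DIRS for l in (2, 3, 4)
              for m in [segment_mask(s, d, l)] if m is not None}
</pre>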
<p>Final running-time: 30 seconds to generate all <a href="http://files.platypope.org/snake-solution.txt">four solutions</a><a class="footref" href="#fn.solving-puzzles-with-computer-science.7">7</a>.</p>
<p>If you own this puzzle, you may at this point be thinking “Wait, what?” The
marketing copy and provided instructions for the King Snake claim there are
only two solutions. But nope, four. The two extra solutions are very similar
to each other, but are quite distinct from the two “official” solutions, and
none are simply symmetries of the others.</p>
<p>“Cheating” — hah!</p>
<p><a href="https://github.com/llasram/puzzles">Puzzle-solving code</a> available for your edification.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.1" id="fn.solving-puzzles-with-computer-science.1">1</a> Best Christmas present ever, seriously.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.2" id="fn.solving-puzzles-with-computer-science.2">2</a> For comparison, chess apparently has between 1e43 and 1e50 legal board states.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.3" id="fn.solving-puzzles-with-computer-science.3">3</a> Most actions on a the puzzle-state are just a few bitwise operations.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.4" id="fn.solving-puzzles-with-computer-science.4">4</a> Which I really like putting in all caps, for some reason.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.5" id="fn.solving-puzzles-with-computer-science.5">5</a> Yeah, the name confuses me too.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.6" id="fn.solving-puzzles-with-computer-science.6">6</a> But not reflections, what with three dimensional limitation and all.</p>
<p class="footnote"><a class="footnum" name="fn.solving-puzzles-with-computer-science.7" id="fn.solving-puzzles-with-computer-science.7">7</a> FYI, I number from the opposite end than the provided solutions.</p>
<h3><a href="http://blog.platypope.org/2009/10/5/language-implementation-patterns">Language Implementation Patterns</a> (2009-10-05)</h3>
<p>Possible lesson: don’t get upset with a book for not being a completely
different book the author hasn’t written yet, but will later.</p>
<p>A few months ago I bought <em><a href="http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference">The Definitive ANTLR Reference</a></em> by Terence
Parr<a class="footref" href="#fn.language-implementation-patterns.1">1</a>. The book’s subtitle is “Building Domain-Specific Languages,” so I was
rather disappointed when I found out that it has absolutely no information on
building domain-specific languages — or any other applications — using
ANTLR. It’s a great ANTLR reference, but doesn’t have any examples of using
ANTLR to do anything other than just parse (and emit and handle parsing
errors). Without any practical examples, it took a bit of head-scratching on my
part to realize how to even e.g. make my Z-code assembler available to my
ANTLR-generated AST walker.</p>
<p>But today I learned that Terence Parr has also written another book titled
<em><a href="http://www.pragprog.com/titles/tpdsl/language-implementation-patterns">Language Implementation Patterns</a></em>. I haven’t had a chance to read much of it
yet, but it looks like exactly what I wanted in the first place — a guide to
actually writing various sorts of applications which involve parsing languages,
mostly using ANTLR-generated parsers. This book has the oddly-similar subtitle
“Create Your Own Domain-Specific and General Programming Languages,” which
seems rather more apt here. In any case, I’m looking forward to it.</p>
<p>And on a side note, Parr’s publisher, <a href="http://www.pragprog.com/">the Pragmatic Bookshelf</a>, kicks ass. Their
e-book deployment is even better than O’Reilly’s. While both provide PDF,
Mobipocket, and EPUB versions for perpetual re-download, the PragProgs also (a)
offer almost their entire catalog as e-books, and (b) make e-book editions
available as “<a href="http://www.pragprog.com/frequently-asked-questions/beta-books">beta books</a>” several months prior to the print release date. In
fact, Language Implementation Patterns is currently only available in an e-book
beta version. But available it is, for delicious pre-final-draft reading by the
adventurous.</p>
<p class="footnote"><a class="footnum" name="fn.language-implementation-patterns.1" id="fn.language-implementation-patterns.1">1</a> Primary author of ANTLR, so you can see the draw.</p>
<h3><a href="http://blog.platypope.org/2009/9/29/zmforth">ZmForth!</a> (2009-09-29)</h3>
<p>Some time ago I decided to fix one of the holes in my software engineering
knowledge and learn about writing compilers. My undergraduate curriculum
provided an overview of parsing and some general programming language design
issues, but nothing at all on code generation or optimization. I bought a few
books on compilers, started familiarizing myself with <a href="http://www.antlr.org/">ANTLR</a>, then became
completely and utterly side-tracked by the programming language Forth.</p>
<p>Forth is a strange little language. Chances are that unless you’re an astronomer
or have mucked around with Sun’s OpenBoot, you’ve never even heard of it; or if
you have heard of it, you’ve never used it. Forth saw its heyday during the
80’s and has largely faded away with the 8- and 16-bit processors to which it
was most suited. But boy was it suited to those processors — if I ever needed
to develop code to run on a 16-bit processor with less than 128k of RAM, Forth
is probably the language I’d turn to.</p>
<p>Forth’s key property is that it’s a purely stack-based language. Forth
functions — or “words,” as Forth calls them — do not take arguments explicitly,
but instead implicitly via a system-wide parameter stack. There aren’t any
local variables — instead the language provides a rich set of
stack-manipulation primitives like <code>DUP</code><a class="footref" href="#fn.zmforth.1">1</a>, <code>SWAP</code><a class="footref" href="#fn.zmforth.2">2</a>, and <code>ROT</code><a class="footref" href="#fn.zmforth.3">3</a> to facilitate
juggling the top handful of stack items to provide the correct parameters to
each function call. Even control structures are implemented with Forth-level
function calls!<a class="footref" href="#fn.zmforth.4">4</a> For example, here’s my implementation of <code>IF</code> / <code>THEN</code>:</p>
<pre class="example"><span class="keyword">: </span><span class="function-name">if </span>compile ?branch here <span class="constant">0 </span>, <span class="keyword">; immediate</span>
<span class="keyword">: </span><span class="function-name">then </span>here swap ! <span class="keyword">; immediate</span>
</pre>
<p>This means that a Forth program consists of nothing more than a list of
functions to call in sequence, a feature which the language exploits in two
ways.</p>
<p>First, to simplify syntax. Just as a Forth program abstractly consists of
nothing more than a list of function “words,” the source code of a Forth
program consists of just lists of those words’ human-readable names separated
by spaces. Adding the ability to switch the Forth system between executing each
word as parsed and appending it into a new definition allows the system to act
as both interpreter and compiler in one. That’s a full interactive interpreter
and compiler in under 8k.</p>
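<p>A toy model of that dual-mode loop, in Python purely for illustration (this
shows the state-switching idea, not ZmForth’s actual implementation):</p>
<pre class="example">def outer(source, words, stack):
    """Toy Forth text interpreter. `words` maps names to (fn, immediate?)
    pairs, where each fn takes (stack, state). One loop acts as both
    interpreter and compiler, switched by state['compiling']."""
    state = {"compiling": False, "current": []}
    for name in source.split():
        fn, immediate = words[name]
        if state["compiling"] and not immediate:
            state["current"].append(fn)  # compiling: just record the call
        else:
            fn(stack, state)             # interpreting, or an immediate word
</pre>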
<p>Second, to simplify implementation. There are “traditional” optimizing
compilers for Forth, but that isn’t the most common or most obvious approach to
Forth compilation, especially in-target. More frequently, Forth systems will
use an approach known as “<a href="http://en.wikipedia.org/wiki/Threaded_code">threaded code</a>.” This has nothing to do with
multithreaded execution, but instead refers to the technique of generating code
which consists only of calls to other functions. The more specific
techniques of so-called “direct” and “indirect” threaded code drop even the
calls themselves<a class="footref" href="#fn.zmforth.5">5</a> and instead encode only the function addresses. A list of
addresses can’t be executed directly, so this necessitates an “inner interpreter”
to load and call each in sequence, but this interpreter need only be a few
machine instructions long, and on some architectures imposes no additional
overhead over directly-expressed function calls. In a Forth system using this
approach, compiling a function call into a new definition literally consists of
just appending the address of the function to call to the end of the definition
in progress. It just can’t get any simpler than that.</p>
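<p>Here is what an indirect-threaded inner interpreter looks like, again
modeled in Python for illustration (my sketch, not ZmForth’s actual Z-code): a
compiled word is just a list whose slots hold either primitives or other word
bodies.</p>
<pre class="example">def inner(word, stack):
    """Run a threaded-code word: walk its list of "addresses", executing
    primitives directly and "calling" secondaries via a return stack."""
    ip, rstack = 0, []
    while True:
        if ip == len(word):           # end of word body: return to caller
            if not rstack:
                return
            word, ip = rstack.pop()
            continue
        target, ip = word[ip], ip + 1
        if callable(target):          # primitive: a few "machine instructions"
            target(stack)
        else:                         # secondary: push return info, jump in
            rstack.append((word, ip))
            word, ip = target, 0

dup = lambda s: s.append(s[-1])
star = lambda s: s.append(s.pop() * s.pop())
square = [dup, star]        # : square dup * ;
cube = [dup, square, star]  # : cube dup square * ;

s = [7]
inner(cube, s)
print(s)  # => [343]
</pre>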
<p>And because it’s all so simple, you can implement it yourself! Or rather, I can
lose the thin thread of sanity and decide to implement one myself. There are
existing F/OSS Forth systems out there for pretty much every environment under
the sun, which provide plenty of examples to turn to for inspiration, but also
mean that the whole Forth thing is pretty well done. This needn’t be a
hindrance to implementation-as-a-learning-exercise, but I wanted to contribute
something new. I happened to think of the <a href="http://en.wikipedia.org/wiki/Z-machine">Z-machine</a>, and lo-and-behold,
although there is a <a href="http://www.ifwiki.org/index.php/Lists_and_Lists">Z-machine Scheme implementation</a>, there was no Z-machine
Forth.</p>
<p>Until now! Ladies and gentlemen, I present to you <a href="https://github.com/llasram/zmforth">ZmForth!</a>, an ANS Forth
implementation for the Z-machine. It passes the woefully incomplete ANS Forth
test suite I found and runs existing Forth programs that don’t depend on file
I/O. It plays Tetris, performs 32-bit arithmetic and unsigned comparisons, and
provides exceptions, a compiler, and a de-compiler. All of this in a 16-bit
virtual machine with only 64k of addressable memory. The whole project was one
giant rabbit hole, but I did have to write <a href="https://github.com/llasram/zmforth/blob/master/zas.py">my own Z-code assembler</a> to provide the
directives I wanted, which is at least a step in the direction of what I initially
planned to work on.</p>
<p>I had a good time working on it, and I hope someone else may be at least half as
entertained by it as I was.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.1" id="fn.zmforth.1">1</a> Duplicate the stop stack item.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.2" id="fn.zmforth.2">2</a> Swap the top two stack items.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.3" id="fn.zmforth.3">3</a> Rotate the third stack item to the top.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.4" id="fn.zmforth.4">4</a> In fact, in my system all the control structures are implemented in Forth.</p>
<p class="footnote"><a class="footnum" name="fn.zmforth.5" id="fn.zmforth.5">5</a> The call instructions, that is.</p>
<h3><a href="http://blog.platypope.org/2009/7/2/creator-vs-reader-and-the-adobe-epub-monopoly">Creator vs. reader and the Adobe EPUB monopoly</a> (2009-07-02)</h3>
<p>I’ve been doing a fair bit more reading on my Android phone, although recently
I’ve switched from <a href="http://www.fbreader.org/FBReaderJ/">FBReaderJ</a> to <a href="http://www.aldiko.com/">Aldiko</a>. Each new release of FBReaderJ has
gotten better, but Aldiko includes an actual CSS-based renderer<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.1">1</a> and
presently provides a much smoother reading experience<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.2">2</a>. I feel a little
guilty using a piece of commercial software when a free-as-in-freedom solution
exists, but not yet guilty enough to try hacking on FBReaderJ.</p>
<p>Guilt aside, reading more on a smaller screen has had me thinking again about
the tension between book creator and reader in e-book formatting and
layout. EPUB on the mobile phone highlights this tension more than e-ink
devices do, if for no other reason than that a phone screen resembles a
book page even less than an e-ink screen does. One conclusion I’ve come to is that it’s
somewhat unfortunate that Adobe has thus far been the biggest contributor to
EPUB as a commercial e-book format.</p>
<p>On the one hand, someone had to do it, and it’s good that someone has done
it. Even the DRM thing, to some degree – most publishers aren’t ready to do
without it, and at least Adobe had the good grace to use an
<a href="http://i-u2665-cabbages.blogspot.com/2009/02/circumventing-adobe-adept-drm-for-epub.html">easily circumventable system</a>. The major concern I have is with some aspects of
the apparent mindset behind Adobe Digital Editions.</p>
<p>At present, Digital Editions is the EPUB viewer to beat – I have no idea what
the actual usage figures look like, but it’s the only viewer one can legally
use for all those commercially-sold Adobe-DRMed EPUB books, so one has to
imagine that DE commands the lion’s share of the market. Adobe represents
Digital Editions as being more than just an EPUB viewer. The advertising copy
on <a href="http://www.adobe.com/products/digitaleditions/">the DE web site</a> touts it as “offer[ing] an engaging way to view and manage
eBooks and other digital publications.” To this end, DE supports not only EPUB,
but also the document format much more central to Adobe’s business – PDF. And
because PDF is so much more important to Adobe, DE caters to PDF at the expense
of EPUB.</p>
<p>It’s no surprise to the average e-book enthusiast that PDF’s fixed-page nature
makes it a poor e-book format. The most obvious reason is that PDF files can’t
be cleanly reflowed to alternative page sizes. A perhaps less obvious corollary
is that PDF leaves little room for user control of basic formatting parameters
such as font size, line height, text alignment, and paragraph marking. But via
one or more chains of causality, because PDF rendering does not allow user
control of these properties, Adobe DE doesn’t allow setting them for EPUB
either. This yields an EPUB viewer where the reader has control over only the
font size and page size. And even then only partial control! – DE will not
rescale any size a book specifies as an absolute size, and DE allows books to
provide “page template” files which control how the available screen area is
divided into text regions and body columns.</p>
<p>The most generous interpretation of this decision is interface consistency. As
long as Adobe is attempting to present PDF as an e-book format and provide a
viewer which handles “digital publications” regardless of format, it only
complicates that viewer’s interface to provide options which apply to some
formats but not others. Somewhat less charitably, Adobe’s focus on PDF may have
led to an unconscious bias toward creator control of document formatting, to
the extent of perhaps not even considering placing control beyond font size
(a.k.a. “zoom”) in the readers’ hands. And way over on the conspiracy-theory
side of the fence, perhaps it represents a conscious decision on the part of
Adobe to limit the usefulness (and adoption) of EPUB by making it only a small
improvement over PDF even for the documents most suited to reflowable
formatting.</p>
<p>So for whatever reason, the most popular EPUB viewer on the market has
limitations which severely restrict the usefulness of the format. But EPUB is
an open standard, which means other, better viewers are free to compete with
Digital Editions and supplant it, right? Except they aren’t really, because of
DRM.</p>
<p>As I mentioned above, it seems that most publishers are not yet ready to do
without DRM, despite the lesson of the music industry. The overwhelming
majority of commercially-sold books are encumbered with DRM, and all of the
DRM-encumbered EPUB books sold use Adobe’s ADEPT DRM. It would be technically
possible for a competitor to begin offering a different EPUB DRM scheme, but I
can only imagine the degree of confusion mutually incompatible “EPUB” books
would cause among average consumers. So any successful EPUB viewer device or
application needs to license the ADEPT DRM technology from Adobe.</p>
<p>To facilitate this, Adobe has begun offering the “<a href="http://www.adobe.com/devnet/readermobile/">Adobe Reader Mobile 9 SDK</a>.”
The SDK is available by license agreement only<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.4">4</a>, so we can but speculate from
the marketing copy on the capabilities and interfaces it provides. The
“features” list in the SDK FAQ focuses entirely on features of the “Reader
Mobile document rendering engine,” suggesting that the SDK provides primarily
(or only) a rendering engine. If this is the case, then Adobe is not expecting – or
potentially allowing – other vendors to write competing ADEPT-compatible EPUB
renderers. Instead, all the available and announced EPUB reader apps/devices
using the Adobe SDK<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.5">5</a> will simply repackage the Adobe renderer and the paucity
of options for user control it provides.</p>
<p>One example in support of this theory is Amazon’s PDF support in the Kindle
DX. Although not widely advertised, <a href="http://blogs.adobe.com/billmccoy/2009/05/amazon_others_l.html">Amazon is apparently using the Adobe SDK</a>,
just integrating only the PDF renderer. Notably missing in the DX’s PDF support
vs. all the Kindles’ Mobipocket support is the ability to add annotations to
documents. It seems to me that this would be a “must have” feature, not only
for parity with Mobipocket support, but also for the target market as a device
for textbooks and technical documents. To me the most obvious explanation for
this feature’s lack is that there isn’t an easy way to add it while still using
the SDK to allow rendering of DRMed PDFs. There are plenty of F/OSS PDF renderers
to which Amazon could have (comparatively) easily added annotation support,
which suggests that the Adobe SDK’s DRM support is an implicit part of the
included renderer, and that SDK licensees cannot use the SDK to read DRMed
documents independently of the renderer.</p>
<p>Another interesting angle is to compare the Adobe approach with how other book
formats/viewers handle the creator-reader tension.</p>
<p>The desktop version of MSReader allows setting only the font size, but a
significant number of users seem to regard it as still the best desktop e-book
viewer available. A major component of this seems to be very well-chosen
defaults for properties like line-height, and the infrequency with which books
alter those properties. I have seen much more mixed reactions to the Mobile
version, perhaps because on the smaller screens of mobile devices, tuning the
line-height, margins, etc. to an individually comfortable size is much more
important. Perhaps a commenter could fill me in?</p>
<p>The various Mobipocket viewers support differing assortments of user-controlled
properties. Most support setting font-size, line-height, and paragraph
alignment. Interestingly, Mobipocket’s treatment of these properties
demonstrates all three possible resolutions of the creator-reader formatting
tension. Line-height may be set by the reader, but not by the book creator –
the format simply provides no way to specify it. The font-size may be set by
the book creator, but only in terms of a size relative to the reader-selected
base size. And paragraph alignment may be specified by either the book creator
or reader, but the creator’s setting overrides the reader’s when specified.</p>
<p>Correspondingly, paragraph alignment is the subject of one of Mobipocket’s most
forceful formatting recommendations: “alignment must NOT be set if it is not
strictly needed.” In contrast, the EPUB specification documents contain little
resembling formatting guidelines (beyond an admonition against using absolute
positioning). The one purely formatting recommendation in Adobe’s “EPUB Best
Practices Guide” is “use spacing that looks more like a book,” suggesting
including CSS rules to eliminate default space around block-level elements. It
does contain some sensible recommendations e.g. against using tags in most
contexts, but otherwise the “Guide” documents Adobe’s EPUB extensions and DE’s
quirks more than actual “best practices.” Which means that EPUB combines the
most extensive e-book formatting capabilities with the fewest guidelines for
producing actually readable books. And this is something of a challenge for
authors of EPUB viewers.</p>
<p>Coming full circle to the beginning of this post, the Aldiko EPUB viewer does
allow the user to set some basic formatting properties, including margins,
font, font-size, and line-height<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.6">6</a>. The first three work seamlessly, but
line-height somewhat less so. Aldiko handles books which don’t specify their
own line-height fine, but any book-specified line-height “wins.” Which isn’t
intended as a slight on Aldiko – this is a difficult problem to solve.</p>
<p>Something like page margin can really only be set via CSS in one “most
sensible” way<a class="footref" href="#fn.creator-vs-reader-and-the-adobe-epub-monopoly.7">7</a>, and thus is easy to override coherently and consistently in a
viewer. Font-size is potentially trickier, but the most obvious ways of
specifying font sizes (relative sizes and the CSS named absolute sizes) make a
simple solution fairly straightforward. Properties like line-height
unfortunately lack such a solution – all the allowed relative values for the
CSS <code>line-height</code> property are interpreted as relative to the <code>font-size</code>, not the
<code>line-height</code> of the parent element. This means that the <code>font-size</code> solution of
“change the base and respect subsequent relative changes” doesn’t work for
<code>line-height</code>. Without doing some sort of layout analysis, all a viewer can do is
either ignore all book-specified line heights or respect all book-specified
line heights.</p>
<p>Solution? I’m not sure. One possible solution would be for book producers and
viewer authors to agree on guidelines which allow viewers to consistently
override some set of formatting properties. Most of these could be fairly
simple, like Mobipocket’s “alignment must not be set” rule. Another solution
would be for e-book reader apps to do some sort of pre-rendering layout
analysis which allows them to automatically produce a per-book user
stylesheet. I like hands-off technological solutions, but I’m not sure how
feasible that will be on mobile devices.</p>
<p>Other ideas?</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.1" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.1">1</a> To my surprise, a good enough one to handle the <code>max-width</code> property on
images.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.2" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.2">2</a> Quite literally, in the case of the page-turn animation.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.4" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.4">4</a> Probably to stop people from reverse-engineering the DRM system.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.5" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.5">5</a> The ones I’m currently aware of: the Sony Reader, Lexcycle Stanza, the
Bookeen Cybook, and the Elonex ebook.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.6" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.6">6</a> Not paragraph alignment yet, but via e-mail the author has said it’ll
probably be added soon.</p>
<p class="footnote"><a class="footnum" name="fn.creator-vs-reader-and-the-adobe-epub-monopoly.7" id="fn.creator-vs-reader-and-the-adobe-epub-monopoly.7">7</a> One could set left and right margins on every block level element, but
hopefully no one actually does that.</p>