Improving Clojure Start Time

Problem

Slow Clojure boot time is frequently one of the highest ranked issues in the annual State of Clojure survey. A separate poll showed that many people are experiencing slow boot times in a variety of circumstances - not just when trying to use Clojure for "scripting", but also when starting a REPL or running a Clojure program in a variety of ways.

Goals

In Jan, 2016 an informal poll collected startup time data from ~100 users. This data was incomplete and anecdotal but the following summarizes the essence of what was reported:

Starting a REPL with lein repl(most frequent reported use case)
- 50% - reported 2-10 second start times (expected: 1-2 s)
- 30% - reported 20-30 second start times (expected: 2-5 s)
- 20% - reported 60+ second start times (expected: 2-15 s)
Starting Clojure program with lein run
- 50% - reported 2-6 seconds (expected: <1 s)
- rest - reported 10-60 seconds
Starting Clojure program with java command
- 50% reported 2-5 seconds (expected: <1 s)
- rest - reported 10-60 seconds (expected: 3-5 s)

There are a variety of different use cases where we would like to see progress:

Reduce clojure.core startup overhead- this helps everyone, but would be most felt in starting the REPL or running small, short-lived programs
- Given the 94 ms of Java startup time, we will never compete directly with interpreted langs
- Currently the Clojure boot time is about 640 ms - a good goal would be 400 ms (in addition to the Java startup time)
Reduce tooling startup time(lein/boot startup)
- While this has nothing to do with Clojure core, it is nevertheless the biggest component of start time for small projects
- lein repl should start in 1 second - if we can get JVM/Clojure overhead to < 0.5 s, then this means lein repl has to happen in 0.5 s
- Starting nrepl is the biggest factor in the startup time
- Impoving startup time helps twice as lein repl launches a second JVM
Reduce time per-ns and per-var to improve large program start time

While there is fixed overhead, the majority of the costs scale with the number of namespaces and number of vars
The per-var cost is a key factor contributing to startup time for larger apps which load a significant number of vars
- Lazy vars have the biggest potential here, but also optimizing the process of creating and binding vars
- We need to cut this cost to the realm of 0.1 ms/var instead

Raw startups for other dynamic langs (20-30 ms):

Command	Time (s)
ruby -e 0	0.031
python -c 0	0.019
node -e 0	0.027

Timings

The startup time that people experience can be broken into several pieces:

JVM startup
Clojure runtime startup, biggest pieces are:
- clojure.lang.RT
- clojure.core namespace
Tooling startup - tools like Leiningen or Boot add significant overhead
- Starting nrepl server
- Connecting to nrepl with client (like reply)
- Initializing middleware, most critically clojure-complete, which scans jars
Program load time

Each namespace you load in starting your program has a cost:
- Compilation (if source)
- Var loading (per var)
- Class loading (per function)
Whatever actual logic you run at startup time

How does this break down for an example program?

Timings done with:

Java 1.8 - Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
Clojure 1.8.0
Leiningen 2.6.1
Boot 2.5.5
Macbook Pro - Quad core 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3, SSD
Times reported are best of 5 runs

ID	Command	Program	Time (seconds)	Classes loaded
J-NIL	java -cp java Hello	Java empty main	0.09	423
C-NIL	java -cp $CLOJURE18 clojure.main -e nil	Clojure eval nil	0.73	1996
C-REPL	echo \| time java -cp $CLOJURE18 clojure.main	Clojure REPL start/stop	0.89	2389
C-RUN	java -cp $CLOJURE18:src clojure.main -e "(require 'hello.main) (hello.main/hi)"	Run function in a ns	0.80	2078
C-RUN-2	java -cp $CLOJURE18:src clojure.main -e "(require 'hello.main2) (hello.main2/hi)"	main2 loads 2nd ns with 100 defns	0.87	2184
C-AOT	java -cp $CLOJURE18:target/classes hello.core	Run empty fn with AOT	0.74	1997

Diffing these we can estimate these costs by diffing the times above:

JVM startup = 90 ms (423 classes) - we can take this as a minimum bar
Clojure runtime startup = 640 ms (1573 classes) - C-NIL vs J-NIL
Clojure REPL startup = 160 ms (393 classes) - C-REPL vs C-NIL
Loading 1 namespace with 1 defn = 70 ms (82 classes) - C-RUN vs C-NIL
Requiring namespace containing 100 defns = 70 ms (106 classes) - C-RUN-2 vs C-RUN

If we AOT compile a Clojure namespace and invoke the compiled form directly, we see a reduction of 60 ms and 81 classes. These programs are too simple to evaluate the impact of JIT vs AOT.

Leiningen overhead

We can also look at the lein repl overhead for the commands above.

ID	Command	Program	Time (seconds)	Classes loaded	Vs C- (sec)	Vs C- (classes)
L-REPL	echo \| time lein repl	Clojure REPL start/stop	4.54	3383	0.89	2389
L-RUN	lein run -m hello.main/hi	Run function in a ns	2.50	2104	0.80	2078
L-RUN-2	lein run -m hello.main2/hi	main2 loads 2nd ns with 100 defns	2.54	2306	0.87	2184
L-AOT	lein run -m hello.core	run AOT fn	2.51	2185	0.74	1997

When running Leiningen repl, we see 3.65 sec of additional time and almost 1000 additional classes loaded.

When running Leiningen run, we see about 1.8 s of additional time added.

Both of these commands are using nrepl. Looking closer at the nrepl startup, we see that it takes:

1.6 s to start the nrepl server (this involves launching a second JVM)
0.2 s to start the nrepl client

These cannot currently be parallelized do to race conditions in nrepl-ack (according to the code).

There are other options for starting lein repl faster with fast trampoline. You can use this as follows (all times are after initial run which caches classpath etc):

ID	Command	Time (seconds)	Vs L- (sec)
LFT-REPL	LEIN_FAST_TRAMPOLINE=y echo \| lein trampoline run -m clojure.main Note: using Clojure REPL, not lein repl	2.42	4.54
LFT-RUN	LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.main/hi	0.81	2.50
LFT-RUN-2	LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.main2/hi	0.83	2.54
LFT-AOT	LEIN_FAST_TRAMPOLINE=y lein trampoline run -m hello.core	0.80	2.51

These run times are essentially as fast as not using lein at all after the first invocation. However, most people don't do this because this is more complicated.

clojure.core loading

clojure.core defines all of the core vars in the Clojure language. clojure.core is actually split up across a number of files, many of which are loaded from core.clj. We can put some numbers on the time to load each part:

part	ms
core, pt 1	216
core_proxy	8
core_print	23
genclass	12
core_deftype	15
core/protocols	24
gvec	10
instant	21
uuid	6
core, pt 2	79
core, pt 3 - datareaders	8

gvec is initialized during load but is not actually used by anything during load - this startup could potentially happen in parallel or lazily. This would save at most 10 ms.

instant and uuid are fairly independent chunks of code that don't need to be loaded until the data readers are read at the very end of core - they could potentially be loaded in parallel as well. This would save at most 21 ms though.

Clearly though the majority of these times correspond to loading each var in core - there is a per-var cost on the order of a few tenths of a ms.

RT loading

clojure.lang.RT is the other major component of Clojure runtime load time. How does it break down?

TODO

Per-var load time

It is common for people to report server load times in the order of 10s of seconds with a total load time over a minute. While there is likely little to do with respect to an application's logic, there is a per-var cost (due to definition, initialization, and classloading). Could dramatic improvements to per-var loading make significant changes to load time? Lazy var loading is one approach to this, but there are potentially other improvements that could help as well (optimizing Var.bindRoot or even using something lighter weight than vars).

Examining an example program

This section analyzes a larger "real" program, the Luminus Guestbook example app (forked here to maintain a stable build with some mods). This is a fairly typical Clojure web app using many of the most popular libraries for its implementation. The app was modified to System/exit as soon as the main starts so we are primarily measuring boot time.

1) What are some timings for various ways to run the guestbook app?

ID	Command	Description	Time (s)	Classes loaded
GB-C-JIT	time java -cp `cat cp` clojure.main -m guestbook.core	no lein, no aot	9.57 s	10494
GB-C-AOT	time java -jar target/guestbook.jar	no lein, aot	4.67 s	7344
GB-C-AOT-DL	time java -jar target/guestbook.jar	no lein, aot %2B direct linking	4.59 s
GB-L-JIT	time lein run	lein, no aot	13.43 s	10550
GB-L-AOT	time lein run	lein, aot	6.98 s	7482
GB-L-REPL	echo \| time lein repl	repl	13.99 s	10508

2) For non-AOT, is reading/compilation a dominant factor?

Yes - about half the time is spent in compilation (based on comparison to AOT).

3) Do lazy vars improve AOT times?

As tested, this affects both Clojure jar itself and the guestbook uberjar.

ID	Command	Description	Time (s)	Classes loaded
GB-C-AOT-LAZY	time java -jar target/guestbook.jar	no lein, aot, lazy	3.88 s (17% improved)	4440
GB-L-AOT-LAZY	time lein run	lein, aot, lazy	6.51 s (7% improved)	4611

Yes - lazy vars improve startup time 10-20%.

4) What is the Lein overhead?

The lein run overhead with AOT was 2.31 s, slightly bigger than we saw in L-AOT.

The lein run overhead with no AOT was 3.86 s, bigger than we saw in L-RUN and L-RUN-2 (overhead was about the same there).

5) For Lein, is some kind of classpath calculation caching worth doing?

Lein fast trampoline will cache the command startup and avoid the first lein execution.

Lein run AOT %2B LEIN_FAST_TRAMPOLINE: 7.00 s (about the same) - surprised this wasn't a couple seconds faster

Lein repl %2B LEIN_FAST_TRAMPOLINE: 8.58 s (39% improved) - why is this so much better than the lein run?

6) Why are lein repl times so much bigger than lein repl on a simple project and for lein run?

TODO

7) How big an impact is AOT compiling lein dependencies (like tools.nrepl)?

TODO

8) How big an impact is AOT compiling project dependencies?

TODO

Techniques for faster loads

This section discusses techniques available now to reduce load times.

JVM Args

The following JVM startup args may improve startup performance (note that some of these have important tradeoffs though):

-client -XX:%2BTieredCompilation -XX:TieredStopAtLevel=1
- These settings use the client compiler and use one the 1st stage compiler to favor faster start time.
- These settings prevent JIT compilation to get the highest level of performance over time, so you should only use these settings when running a REPL or short-running apps. Long-running server programs should not use these settings.
-Xverify:none
- Suppresses the bytecode verifier (which speeds classloading).
- Enabling this option introduces a security risk from malicious bytecode, so should be carefully considered.
-XX:%2BAggressiveOpts

This option enables optimizations that are not yet enabled by default but will be in the next version of the JDK. Generally this option improves performance for any program and is safe to use.

You should also consider your heap settings. If you can anticipate the max heap you expect to use then setting the min and max heap to that value will allow the JVM to do a single memory allocation rather than growing the heap during startup. Some experimentation may be required to determine the optimal values for this. For example: -Xms256m -Xmx256m

AOT compilation

When starting a Clojure program, any Clojure code that is loaded from a source file (.clj or .cljc) must be compiled. The Clojure compiler is fast, but it's a significant source of load time. Consider AOT compilation any time you expect to start the same instance of a program many times (without changing the source).

Most build tools (lein, boot, etc) provide the ability to compile your Clojure source ahead-of-time (AOT) to .class files. AOT is transitive (all source namespaces needed will be compiled) however distributing AOT compiled libraries that contain compiled versions of dependencies is a bad idea and will create version problems for downstream consumers - for this reason, most libraries are distributed as source and this is recommended. It is typically best to AOT compile when you build a final application (when creating an uberjar or war for deployment, for example).

The default Clojure jar is distributed in AOT compiled form.

Eliding metadata

When compiling, each function is compiled into a class. Metadata (like the docstring) is included into that compiled class file and stored as a constant in the constant pool. Metadata like the docstring is typically not used except when developing at the REPL. The Clojure compiler can be instructed to elide that meta, removing it from the compiled class. This reduces class size and improves classloading.

Using a Clojure jar compiled with metadata elided reduces load time 5-15 ms. For large AOT'ed Clojure apps with a lot of docstrings, there could be some impact as well, but it's estimated that the impact is small.

Direct linking

Var invocation involves looking up a var, retrieving it's function, and invoking it. Direct linking shortcuts this by compiling into a direct static invocation of the function class. When direct linking is enabled, the Clojure compiler is enable to avoid initializing many of the vars, so the class size is reduced and classloading is improved. Since Clojure 1.8, the default clojure jar is compiled with direct linking.

Small improvements

This section describes smaller fixes or enhancements that can be done to reduce load time. Most of these are linked to tickets with patches.

Delay socket server loading

In 1.8, new code was added to runtime startup to check for socket server Java system properties and start a socket server for each one found. However, this code is causing several namespaces to be loaded (clojure.core.server, clojure.edn) even when no socket servers are defined (the common case).

A patch to address this issue is available at CLJ-1891. Applying the patch reduces Clojure core start time about 20 ms.

Reduce `refer` overhead

Each time a new namespace is loaded, it will "refer" external vars into the namespace. This happens for clojure.core automatically and will also happen due to the use of `use` or `require` with `refer`. The current implementation of refer is not very efficient - it builds large intermediate maps, traverses all vars even when only a few are refer'ed, and cas'es each new var into the namespace individually. The patch at CLJ-1730 address the worst of these issues while still taking a relatively conservative approach in the changes. This reduces the cost of refer-clojure (done on the load of every new ns) by about 50% and the time for :refer :only by as much as 90%. However this is only a small percentage of typical load times which are dominated by var initialization and classloading.

Check elide-meta in add-doc-and-meta

The add-doc-and-meta function is used to attach docs later in clojure.core load, but this macro does not currently check whether the doc meta will be elided. A check in the macro could turn these into no-ops, saving some time.

Optimize Var.bindRoot()

This is called close to 1000 times on clojure.core start - it could potentially be optimized wrt watches, validation, and meta (optimize alterMeta wrt clearing macro flag).

Reduce RT and clojure.core load time

There are likely a number of changes that could be made in RT (and clojure.core) initialization to reduce load times:

When creating symbols and keywords, call the two arg fn with nil namespace, rather than the single arg form (which must analyze the string)

Symbol.intern("foo") should be Symbol.intern(null, "foo")

Intern common strings like: clojure.core, column, arglists, x, tag, &, coll, etc
When loading auto-imported java.lang classes, don't load the classes, just get the unloaded class instance
Remove {:static true} meta - no longer used
Remove {:added "1.0"} and make that the default "added" assumption.
Parallelize loading of gvec, instant, and uuid - these are mutually exclusive and don't interact much with the rest of core.

These need further testing to determine whether they are worth doing.

Big Improvements

This section discusses larger changes that could have a more significant impact on start time.

Reduce reader time

For non-AOT (common during dev/repl), reading is a significant time factor.

Reduce compile time

For non-AOT (common during dev/repl), compiling is a significant time factor.

Lazy vars

Most of this time is taken in loading and initializing vars, largely functions. The time is spent in classloading the function under the var, loading metadata from the constant pool, and initializing the vars.

Many of these vars are not actually needed to start a program - clojure.core in Clojure 1.8 has 725 interned vars but a large number of them are unused for most programs at startup time. Delaying the loading of these vars until they are needed would yield significant performance gains. Attached to this page is a patch (lazyvars.diff) adapted from Rich's prior work on the fastload branch of Clojure.

ID	Before patch (s)	Before Classes	After patch (s)	After classes
C-NIL	0.73	1996	0.53 (-27%)	1175 (-41%)
C-REPL	0.89	2389	0.64 (-28%)	1350 (-43%)
C-RUN	0.80	2078	0.60 (-25%)	1271 (-39%)
C-RUN-2	0.87	2184	0.68 (-22%)	1378 (-37%)
C-AOT	0.74	1997	0.53 (-28%)	1163 (-42%)
L-REPL	4.54	3383	4.52 (-0.4%)	2454 (-27%)
L-RUN	2.50	2104	2.45 (-2.0%)	1307 (-38%)
L-RUN-2	2.54	2306	2.50 (-1.6%)	1414 (-39%)
L-AOT	2.51	2185	2.44 (-2.8%)	1375 (-37%)

With lazily loaded vars we are seeing a significant reduction in both time and classes loaded for the Clojure runtimes but only slight improvements in the Leiningen runtimes (even though we see similar class reduction). This implies that while lazy vars do make a significant difference in Clojure start time, those gains are dwarfed by other parts of Leiningen start time.

One downside of lazy vars is that the JIT is not as good at inlining through the lazy var check which makes var indirection slower after the lazy var has been forced. One open question is whether invokedynamic changes this, possibly allowing this to be the default.

John Rose on using indy for startup

Improving Clojure Start Time

Problem

Goals

Timings

How does this break down for an example program?

Leiningen overhead

clojure.core loading

RT loading

Per-var load time

Examining an example program

Techniques for faster loads

JVM Args

AOT compilation

Eliding metadata

Direct linking

Small improvements

Delay socket server loading

Reduce `refer` overhead

Check elide-meta in add-doc-and-meta

Optimize Var.bindRoot()

Reduce RT and clojure.core load time

Big Improvements

Reduce reader time

Reduce compile time

Lazy vars

Parallel namespace loading

4 Comments

Kevin Downey

Kevin Downey

Kevin Downey

Kevin Downey