Bug Du Jour: CDH5 Upgrade

We upgraded our Hadoop cluster to YARN/CDH5 last weekend, which brought along the usual flurry of “oops, gotta fix this” commits as various services had hiccups, and in many cases refused altogether to do anything useful. Last week Tom sent me my favorite message: “I just want this to work” (seriously, it’s awesome to get these because you never know what kind of bug it is).

The problem manifested itself as a dreaded classfile verification error:

java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; 
at java.lang.ClassLoader.defineClass1(Native Method) 
at java.lang.ClassLoader.defineClass(ClassLoader.java:788) 
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
at java.net.URLClassLoader.defineClass(URLClassLoader.java:447) 
at java.net.URLClassLoader.access$100(URLClassLoader.java:71) 
at java.net.URLClassLoader$1.run(URLClassLoader.java:361) 
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
at java.security.AccessController.doPrivileged(Native Method) 
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) 
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
at java.lang.Class.getDeclaredConstructors0(Native Method) 
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2483) 
at java.lang.Class.getConstructor0(Class.java:2793) 
at java.lang.Class.getConstructor(Class.java:1708) 
at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62) 
at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36) 
at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49) 
at org.apache.hadoop.yarn.api.records.ContainerId.toApplicationAttemptId(ContainerId.java:244) 
at org.apache.hadoop.yarn.api.records.ContainerId.fromString(ContainerId.java:225) 
at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:178) 
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1406)

No user code anywhere in the trace, which is bad. I googled it and found this thread, which mentioned a Protocol Buffers version conflict. That makes sense: the error says a YARN protobuf class overrides a method that the loaded protobuf runtime declares final, so if a method’s modifiers change from one version to the next and you’ve got nondeterministic dependency inclusion, then you’d see stuff like this.

No problem; CDH5 uses protobuf 2.5.0, so let’s see which non-2.5.0 version my jar has:

$ lein deps :tree 2>&1 | grep protobuf
   [com.google.protobuf/protobuf-java "2.5.0"]
$

Ah. OK, so maybe it’s some stray classpath entry on the cluster machines? Let’s see what happens there (SSH access to the Hadoop worker nodes is a lifesaver):

$ ssh spencer@datanode
datanode$ locate protobuf | grep jar$
/usr/lib/avro/avro-protobuf-1.7.6-cdh5.4.0.jar
/usr/lib/avro/avro-protobuf.jar
/usr/lib/hadoop/parquet-protobuf.jar
/usr/lib/hadoop/lib/protobuf-java-2.5.0.jar
/usr/lib/hadoop-hdfs/lib/protobuf-java-2.5.0.jar
/usr/lib/hadoop-mapreduce/protobuf-java-2.5.0.jar
/usr/lib/hadoop-mapreduce/lib/protobuf-java-2.5.0.jar
/usr/lib/hadoop-yarn/lib/protobuf-java-2.5.0.jar
/usr/lib/hbase-0.94.0/lib/protobuf-java-2.4.0a.jar
/usr/lib/hbase-0.94.3/lib/protobuf-java-2.4.0a.jar
/usr/lib/parquet/parquet-protobuf.jar
/usr/lib/parquet/lib/parquet-protobuf-1.5.0-cdh5.4.0.jar
/usr/lib/parquet/lib/protobuf-java-2.5.0.jar
/var/dcache/prod/prod_config/runlib_cacheFE/lib/protobuf-java-2.5.0.jar
datanode$

We’ve definitely got some candidates for problems here; any of the 2.4.0a jars might be on the classpath. I dug through the YARN job logs to find out, but it turned out that none of the conflicting jars were included.
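
Eyeballing a long classpath for the wrong jar gets old quickly. As a rough sketch (not the method used above), a few lines of Python can flag any protobuf-java jar on a classpath string whose version isn't the expected one. The function name, the `expected` default, and the sample classpath are made up for illustration. The check only looks at file names, so it won't expand wildcard entries or notice protobuf classes bundled inside some other jar.

```python
import os
import re

# Matches file names such as protobuf-java-2.4.0a.jar and captures the version.
PROTOBUF_JAR = re.compile(r"^protobuf-java-(?P<version>\d[\w.]*)\.jar$")


def conflicting_protobuf_jars(classpath, expected="2.5.0", sep=":"):
    """Return (path, version) pairs for protobuf-java jars that are not `expected`.

    Hypothetical helper: `classpath` is a plain separator-joined string,
    for example one copied out of a container launch log.
    """
    conflicts = []
    for entry in classpath.split(sep):
        match = PROTOBUF_JAR.match(os.path.basename(entry))
        if match and match.group("version") != expected:
            conflicts.append((entry, match.group("version")))
    return conflicts


if __name__ == "__main__":
    # Illustrative classpath assembled from two of the paths in the listing above.
    sample = (
        "/usr/lib/hadoop/lib/protobuf-java-2.5.0.jar:"
        "/usr/lib/hbase-0.94.0/lib/protobuf-java-2.4.0a.jar"
    )
    for path, version in conflicting_protobuf_jars(sample):
        print(version, path)
```

An empty result only rules out stray protobuf-java jars by name; it says nothing about what is packed inside the application jar itself.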

Next I started looking at the uberjar itself. The problem shouldn’t be there, but you never know:

$ lein clean; lein uberjar
$ vim target/the-app.jar
the-app.jar:
...
META-INF/maven/com.google.protobuf/
META-INF/maven/com.google.protobuf/protobuf-java/
META-INF/maven/com.google.protobuf/protobuf-java/pom.xml        <- opened this
META-INF/maven/com.google.protobuf/protobuf-java/pom.properties
...

pom.xml:
...
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.4.0a</version>                                     <- uh oh!
  <packaging>jar</packaging>
...

So lein deps :tree was lying after all: it reported 2.5.0, but the pom.xml bundled into the uberjar said 2.4.0a. Rather than trying to address the underlying problem, I ended up manually excluding com.google.protobuf/protobuf-java from the dependency tree (I just added it to the :exclusions in project.clj), which didn’t cause any problems since v2.5.0 was already on the worker nodes’ classpath.
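
The check that finally exposed the problem, reading the Maven metadata bundled into the uberjar, can also be scripted. The sketch below is an illustration in Python, not part of the original debugging session. It assumes the jar carries the usual META-INF/maven/<group>/<artifact>/pom.properties files with groupId, artifactId, and version keys, and the function name is hypothetical.

```python
import zipfile


def embedded_maven_versions(jar_path):
    """Map 'groupId/artifactId' to the version found in each embedded pom.properties.

    Assumption: metadata lives under META-INF/maven/ in simple key=value form.
    """
    versions = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not (name.startswith("META-INF/maven/")
                    and name.endswith("/pom.properties")):
                continue
            props = {}
            text = jar.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                line = line.strip()
                # Skip blank lines and the comment header Maven writes.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
            if all(k in props for k in ("groupId", "artifactId", "version")):
                coordinate = props["groupId"] + "/" + props["artifactId"]
                versions[coordinate] = props["version"]
    return versions
```

Called on target/the-app.jar, looking up "com.google.protobuf/protobuf-java" in the result would surface the bundled version without opening the archive by hand. Treat it as a hint only: when an uberjar merges several jars, typically just one copy of each metadata file survives, and what ultimately matters is which class files got packed.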

This wasn’t the most interesting bug I’ve worked on (that honor currently goes to a truly insane Clojure (delay) bug involving locals clearing, something I’ll write up when I get a chance), but the unexpected duplicity from Leiningen added a great twist. If you enjoy devious build tools, life-changingly cool bugs, and other hard problems, then Factual might just have your ideal job. Learn more here.

- Spencer Tipping, Bug Exterminator
