Return to Blog

A Day in the Life of a Factual Engineer: Polygon Compression

In this series of blog posts, Factual’s blog team asked engineers to describe what a typical day looks like.

Chris Bleakley, our resident polygon and Lucene expert, had written meticulous documentation about the problem he was solving. The first paragraph read:

“Because search times are dominated by the cost of deserializing JTS objects from when using Lucene to store polygon data, queries that match many polygons and require ALL results to be compared against the original polygon are slow. Is it possible to make polygon loading and point-polygon intersections faster?”

We had spoken a while back about creating an open-source polygon geometry library, and this seemed like a good enough reason to do it. His current approach of delta-encoding and quantizing everything to 1cm resolution got a 10x space and ~30x time advantage over using JTS and WKT, and given that each of the ~3 million polygons were each relatively small (11 points on average) any further gains were likely going to come from compressing whole cities at a time. My goal for the day was to figure out whether this was possible, and if so how to do it.

First Step: Getting the Data into a Usable Form
I started by exporting the Postgres geometries to a WKT text file using Chris’ instructions:

[bash]$ psql -c "copy (select id, st_astext(the_geog) from prod.asset
           where source_id = 10) to STDOUT" 
  | gzip > polygons.gz
$ nfu polygons.gz | wc -l       # make sure we have everything
$ nfu -T1 polygons.gz           # take a look at one of them
376d29ce-0c62-4dd7-b386-fa501e5a5366    MULTIPOLYGON(((-118.42078343774
34.1143654686345,-118.420652634588 34.1143771084748,-118.42064208021
34.1142953947106,-118.420514149298 34.1143067764812,-118.420506448828

These were all simple polygons; that is, each consisted of a MULTIPOLYGON with exactly one ring, so extracting the points turned out to be really easy. I went ahead and delta-encoded the points at the same time:

[bash]$ nfu polygons.gz 
    -m 'row %1 =~ /(-?d+.d+ -?d+.d+)/g' 
    -m 'my $ref_lat = 0;
        my $ref_lng = 0;
        row map {
          my ($lng, $lat)   = split / /;
          my ($dlng, $dlat) = ($lng - $ref_lng, $lat - $ref_lat);
          ($ref_lat, $ref_lng) = ($lat, $lng);
        } @_' 
  | gzip > polygon-deltas.gz[/bash]

My initial version of this code had two mistakes that I didn’t notice for a while:

  1. I started by assuming WKT used a comma to separate lat/lng. It actually uses a space for this, and a comma to separate points. (And it stores numbers as lng lat instead of lat lng.
  2. Instead of $dlat,$dlng, I had written sprintf(“%f,%f”, $dlat, $dlng). What I didn’t realize is that sprintf truncates floats to six decimal places, which is about 10cm of error.

I also took a look at some of the polygon shapes to get a sense for the type of data we had to work with:

[bash]$ nfu polygon-deltas.gz -T10000 --pipe shuf -T1 
  -m 'map row(split /,/), @_' -s01p %l[/bash]

Second Step: Finding a Compression Limit
At this point I still didn’t know whether it was even possible to compress these polygons much better than Chris already could. His format used about 130 bytes per polygon, so in order to justify much effort I’d have to get below 60 bytes. (If the idea of designing a compression algorithm is unfamiliar, I wrote up a PDF a while back that explains the intuition I was using.)

First I looked at the distribution of latitude and longitude deltas to see what kind of histograms we were dealing with. We’re allowed to introduce up to 1cm of error, so I went ahead and pre-quantized all of the deltas accordingly (the s/,.*//r below is just a faster way of saying (split /,/)[0], which pulls the latitude from the $dlat,$dlng pairs we generated earlier):

[bash]$ nfu polygon-deltas.gz -m 'map s/,.*//r, @_[1..$#_]' -q0.000000001oc 
  | gzip > polygon-lat-deltas.gz
$ nfu polygon-lat-deltas.gz -f10p %l            # histogram
$ nfu polygon-lat-deltas.gz --entropy -f10p %l  # running entropy[/bash]

Latitude delta histogram:

(In hindsight, quantizing the deltas wasn’t strictly valid because I could have easily accumulated more than 1cm of error for any given point – though for this preliminary stuff it probably didn’t impact the results very much.)

The latitude distribution entropy converges to just over 16 bits, which is bad news; if the latitude distribution follows the same pattern, then we’re up to just over 32 bits/4 bytes per delta, which is an average of 44 bytes of deltas per polygon. We’ll also need to encode the number of points total and the starting location, which could easily bump the average size to 56 bytes (and this assumes a perfect encoding against the distribution, which involves storing the whole thing somewhere – yet another space factor to be considered).

My next approach was to find out if latitude and longitude deltas showed any kind of dependence in aggregate. Maybe we can do better than encoding each delta by itself:

[bash]$ nfu polygon-deltas.gz 
  -m '@_[1..$#_]' -F , -q01,0.000000001 
  -m 'row %0 + rand(0.9e-8), %1 + rand(0.9e-8)'    # explained below
  -T1000000p %d[/bash]

Progressively zooming in:

Clearly there’s a lot of dependence here, particularly in the form of rings and latitude/longitude-only offsets. This suggested to me that a polar encoding might be worthwhile (I had to modify nfu to add rect_polar):

[bash]$ nfu polygon-deltas.gz 
  -m '@_[1..$#_]' -F , -q01,0.000000001 
  -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' 
  -T1000000p %d[/bash]

Here theta is on the Y axis and rho on the X axis. The first thing I noticed looking at this is that the bands of rho values are wavy; this is most likely because we’re converting from degrees latitude and longitude, which aren’t equal units of distance. That’s something I should be correcting for, so I found a latitude/longitude distance calculator and got the following figures for Los Angeles:

reference point = 34.11436, -118.42078
+ 1° latitude   = 111km
+ 1° longitude  = 92km

Converting the deltas to km (which incidentally makes the quantization factor a true centimeter):

# note: the conversion math below seems backwards to me, and I never figured 
#out why.

$ nfu polygon-deltas.gz 
  -m '@_[1..$#_]' -F , -m 'row %0 / 92.0, %1 / 111.0' -q01,0.000000001 
  -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' 
  -T1000000p %d[/bash]

At this point I was kind of stuck. The distribution looked good and fairly compressible, but I wasn’t quite sure how to quantize the polar coordinates maximally while preserving error bounds. At some level it’s simple trigonometry, but delta errors accumulate and that complicates things.

Explanation for rand(0.9e-8):
In the nfu code above I’m adding a small random number to each point. This is a hack I sometimes use with quantized data to get a sense for correlation frequency; 0.9e-8 is 90% of the quantum, so we end up with visible cell boundaries and stochastic shading corresponding to each cell’s population. It becomes visible as an interference pattern if you zoom way in:

Third Step: Figuring Out What to Take Away From This
I didn’t have any very concrete results at the end of the day, but had some ideas about what a good compression algorithm might look like. I also strongly suspected that it was possible to get below 50 bytes/polygon. Before knocking off I saved a bunch of screenshots and nfu commands to my weekly notes on the wiki to review later on.

Working at Factual (for me, anyway) involves a lot of this kind of exploration, learning a bunch of less-than-definite stuff, and later revisiting a problem to write something concrete. I’ve had other work to do in the meantime, but hopefully before too long I’ll have time to come back to this and write some beercode to implement an awesome compression strategy.

– Spencer Tipping, Software Engineer