Monday 19 November 2007

Enumerating permutations using inversion vectors and factoradic numbering

Permutations are a fundamental concept in discrete mathematics, and crop up in numerous places in computer science. The set of all possible permutations of n items is usually denoted Sn (since the set forms a group under composition, known as the symmetric group of degree n). The number of permutations in this set is n!.

This apparently simple concept poses some unexpected challenges. For instance, the following useful operations are messy to compute via the brute-force way of enumerating successive permutations:
  • List all the permutations in Sn (in lexicographic order)
  • Find the i'th permutation in Sn
This would be straightforward if there were a way of effectively enumerating all permutations; i.e. a one-to-one mapping between Sn and the integers {1, 2, ..., n!}.

It turns out that this can be done using a pair of ingenious concepts known as the inversion vector and factoradic numbering. These can be used to provide an enumeration of Sn, conveniently in lexicographic order. (Inversion vectors are also known as inversion tables and Lehmer codes - the term vector seems to me to be most appropriate for the nature of the structure.)

A permutation can be written as a sequence of numbers

P = (p1 p2 ... pn) where pi is in {1, 2, ..., n}

An inversion is a pair (pi, pj) where i < j but pi > pj. We define di to be the number of inversions in the sequence where the first element is pi:

di = | { pj : i < j , pi > pj } |

Then the inversion vector for the permutation is the sequence

D = (d1, d2, ..., dn) (di is in the range {0, 1, ..., n-i})
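
As a concrete illustration, here's a minimal Java sketch of the forward mapping (the method name and the simple O(n^2) approach are mine, not taken from any particular reference):

    // Compute the inversion vector D for a permutation P of {1..n}:
    // d[i] counts the entries after position i that are smaller than perm[i].
    static int[] inversionVector(int[] perm) {
        int n = perm.length;
        int[] d = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (perm[i] > perm[j]) d[i]++;
            }
        }
        return d;
    }

    // e.g. inversionVector(new int[] {3, 1, 2}) gives [2, 0, 0]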

It is fairly straightforward to recover the original permutation from the inversion vector. This paper gives a definition of the forward and reverse mapping algorithms.
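
One simple way to do the reverse mapping (not necessarily the formulation used in the paper): at each position, pick the (di+1)-th smallest of the values not yet used. A small Java sketch, again with a made-up method name:

    import java.util.ArrayList;
    import java.util.List;

    // Recover the permutation from its inversion vector:
    // at position i, take the (d[i]+1)-th smallest value not yet used.
    static int[] fromInversionVector(int[] d) {
        int n = d.length;
        List<Integer> remaining = new ArrayList<Integer>();
        for (int v = 1; v <= n; v++) remaining.add(v);
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            perm[i] = remaining.remove(d[i]);   // remove by index d[i]
        }
        return perm;
    }

    // e.g. fromInversionVector(new int[] {2, 0, 0}) gives back [3, 1, 2]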

There are n! possible sequences D. They are in one-to-one correspondence with the set of permutations Sn, and their natural order gives a lexicographic ordering of Sn. So we're halfway to the enumeration that we are looking for.

Inversion vectors can in turn be mapped to the integers {0, 1, ..., n!-1} by using the concept of factoradic numbering. A factoradic number is made up of a sequence of digits where the place value of digit di is (n-i)!. This Wikipedia article explains the concept in detail.

Here's an example:

P = (4 6 2 5 3 1 8 7) is a permutation in S8.

Its inversion vector is L(P) = [3,4,1,2,1,0,1,0].

The factoradic representation of this is

3×7! + 4×6! + 1×5! + 2×4! + 1×3! + 0×2! + 1×1! + 0×0! = 18175
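
In code, converting an inversion vector to its factoradic value is a short accumulation. This is a hypothetical helper (not from any library); applied to the vector above it yields 18175:

    // Evaluate an inversion vector as a factoradic number:
    // the 0-based digit d[i] has place value (n-1-i)!.
    static long factoradicValue(int[] d) {
        int n = d.length;
        long value = 0;
        long placeValue = 1;           // 0! for the last digit
        for (int i = n - 1; i >= 0; i--) {
            value += d[i] * placeValue;
            placeValue *= (n - i);     // next place value up: (n-i)!
        }
        return value;
    }

    // factoradicValue(new int[] {3, 4, 1, 2, 1, 0, 1, 0}) returns 18175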

So inversion vectors and factoradic numbering provide an elegant way of enumerating permutations. But what does this have to do with geospatial processing? That will be a subject for another post...

Friday 16 November 2007

If programming languages attended computer conferences

If you're a language geek, this post by David Rupp is hilarious!

Thursday 15 November 2007

Fast polygon merging in JTS using Cascaded Union

A common requirement in spatial processing is to union a set of polygons together. Depending on the use case, the polygons may either be a coverage or be overlapping. Up until recently, there were two techniques for accomplishing this in JTS:
  • Iterated Union: Iterate over the polygons and union them one-by-one into a result polygon
  • Buffer Union: Collect the polygons into a GeometryCollection and run buffer(0) on the collection
The first one is straightforward, but quite inefficient for large input sets. The second technique is more efficient in many cases, but still tends to fail for large or complex data. (Both approaches are sketched in code below.)
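
For reference, here's roughly what the two existing approaches look like. This is a sketch assuming the com.vividsolutions.jts package names of that era; the method names are mine:

    import java.util.List;
    import com.vividsolutions.jts.geom.Geometry;
    import com.vividsolutions.jts.geom.GeometryFactory;

    // Iterated Union: fold the polygons one at a time into a running result.
    static Geometry iteratedUnion(List<Geometry> polys) {
        Geometry result = null;
        for (Geometry g : polys) {
            result = (result == null) ? g : result.union(g);
        }
        return result;
    }

    // Buffer Union: wrap everything in a GeometryCollection and buffer by 0.
    static Geometry bufferUnion(List<Geometry> polys, GeometryFactory factory) {
        Geometry coll = factory.createGeometryCollection(
            polys.toArray(new Geometry[polys.size()]));
        return coll.buffer(0.0);
    }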

Since this is such a common operation, I've been keen to find a faster and more robust approach. An alternative strategy is to use a concept I call Cascaded Union. The idea is to union small subsets of the input together, then union groups of the resulting polygons, and so on recursively until the final union of all areas is computed. This can be thought of as a post-order traversal of a tree, where the union is performed at each interior node. If the tree structure is determined by the spatial proximity of the input polygons, and there is overlap or adjacency in the input dataset, this algorithm can be quite efficient. This is because union operations higher in the tree are faster, since linework is "merged away" from the lower levels.

This picture shows the algorithm in action:


Handily, the spatial indexes already in JTS can be used to determine an appropriate tree structure. Currently the Sort-Tile-Recursive packed R-tree index is used - this seems to produce an effective tree (although I suspect that the algorithm is not very sensitive to the precise spatial tree used).
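
To give a flavour of the cascading idea, here is a simplified sketch. It is not the actual JTS implementation - it replaces the STR-tree with a plain sort on envelope X and a balanced binary split, but the post-order union structure is the same:

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import com.vividsolutions.jts.geom.Geometry;

    // Simplified cascaded union: order the polygons so neighbours in the list
    // tend to be close in space, then union them pairwise up a binary tree.
    static Geometry cascadedUnionSketch(List<Geometry> polys) {
        if (polys.isEmpty()) return null;
        Collections.sort(polys, new Comparator<Geometry>() {
            public int compare(Geometry a, Geometry b) {
                return Double.compare(a.getEnvelopeInternal().getMinX(),
                                      b.getEnvelopeInternal().getMinX());
            }
        });
        return unionRange(polys, 0, polys.size());
    }

    // Post-order traversal: union the left half, union the right half,
    // then union the two partial results.
    static Geometry unionRange(List<Geometry> polys, int lo, int hi) {
        if (hi - lo == 1) return polys.get(lo);
        int mid = (lo + hi) / 2;
        return unionRange(polys, lo, mid).union(unionRange(polys, mid, hi));
    }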

It turns out that Cascaded Union can be much more efficient than Iterated Union. On a test dataset with 30,000 small polygons with a high degree of overlap, the performance results were:

Cascaded Union: 20 sec
Iterated Union: 3 hours 40 min!

Not bad - well worth the effort of coding... Also, it's more robust than Buffer Union, and often much faster as well.

This will be provided via the Geometry.union() (Unary union) method in JTS 1.9.
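
If the API lands as described, using it should be as simple as collecting the polygons into a single geometry and calling union() with no arguments (variable names here are purely illustrative):

    // Hypothetical usage of the unary union method described above
    Geometry polygons = factory.createGeometryCollection(polygonArray);
    Geometry merged = polygons.union();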

Wednesday 14 November 2007

Implementations of Polygon Overlay algorithms

Occasionally people ask about the algorithms used in JTS's overlay operations. I don't have a good answer for this, since I basically made it up (in an informed kind of way), and haven't had time to document it to the level I'd like (thank goodness for self-documenting code... 8^)

I recently came across this web site from back in 1998, which is happily still around. It has a good list of implementations of what it calls Polygon Boolean operations. (This is the same thing as the JTS overlay operations - I chose to use a different term to avoid confusion with the spatial predicates). Perhaps some of these sites have done a better job of documentation! There are also some useful references to academic papers covering aspects of the issue.

Friday 9 November 2007

The bible of Spatial Indexing

The Bhagavad Gita of Multidimensional Access Methods...
The Koran of Quadtrees...

One of the people I met at ACM GIS 2007 was the conference chair, Hanan Samet. He's published a series of highly useful books on spatial indexing methods, which are certainly the most comprehensive and detailed in this field. His latest work is Foundations of Multidimensional and Metric Data Structures and is surely his magnum opus. It could also be called Everything you wanted to know about Spatial Indexes (but were afraid to code). Recommended reading for spatial geeks, especially if you're looking for some reading to tide you over during your next year off.

His first book on the subject, The Design and Analysis of Spatial Data Structures, is also worth looking at, especially since it's a bit more bite-sized. However, it's out of print, and expensive when you do find a copy (local plug: I found mine through ABE). If you really can't find a copy you might try contacting Hanan directly.

Hanan has a clear preference for quadtree structures over R-trees, but both approaches are covered well (including pseudo-code - yay!).

Conference Review: ACM GIS 2007

I just got back from ACM GIS 2007, over in Seattle. This is the first time I've attended this conference - or even known about it! It was a good year to go, since there was a lot of energy in the air due to this being the first year that the conference has been held as an independent event in its own right (which may be one reason I hadn't heard of it before). One reason for doing this is to catch the building geospatial wave - more on this below.

The conference was very well attended (about 200 people). The bulk of attendees seemed to be from US universities - but they hailed from all over the globe. Microsoft was there in force, as you might expect, and Google was very prominent too (a running joke at the conference was that every talk was given by, or connected to, someone at MS or Google). The omnipresent Michael Jones gave a good motivational talk about some of the technologies which are challenges for Google and would be fruitful research areas. Clearly the big guns of the GeoWeb are investing heavily in geospatial research, and this in turn is injecting a lot of cash and relevance into the field.

Some of the presentations which stood out for me were:
  • Using Constraint Programming to conflate address data to urban imagery. Constraint programming is a well-established "AI" technique with a lot of mature, effective algorithms, and it seems like a promising approach for other kinds of conflation problems as well (such as road network conflation)
  • A guy from Google talked about conflating vector road networks to ortho imagery (surprise, surprise - this is a big use case for them). His approach seemed to take place mostly in the raster domain (using edge detection and signal isolation), which I found interesting
  • A talk by Andrew Danner about building massive TINs and extracting drainage patterns from them. Having just tackled a very similar problem (in both domain and data size) I was interested to see his approach - which was a fair bit more clever. His group has a novel external memory algorithm for building TINs on huge datasets. This seems like an important focus for research, since data gathering bandwidth is outpacing RAM growth.
  • A couple of talks on approaches for compressing surface models and extracting varying levels of detail. Nice because they used some elegant mathematical concepts such as Morse-Smale codes.
There were lots of other talks - and a HUGE bonus was that the conference proceedings were provided to you at registration time (not 2 months later). So you could read ahead to get a good idea of what was coming up - or use downtime to get a more in-depth understanding of a previous talk.

Speaking of which, this is the first conference I've been to which had a single track of presentations. This made for pretty full days, and was probably only barely possible due to the size of this conference, but it certainly made for a more satisfying, less stressful, and more inclusive experience. No need to agonize over which one of 5 presentations might be the most interesting and then rush madly around repositioning during the 5-minute break between talks. And it gave you time to catch up on your reading if the subject wasn't of total interest.

Next year the conference is in Irvine, CA. The organizers are keen to get more industry participation, so don't be shy - check it out.