We have several projects that involve processing large geospatial datasets (geo-data) and displaying them on maps. These projects present some interesting technical challenges involving the storage, transfer and processing of geo-data. This post outlines some of the bigger challenges we have encountered and our corresponding solutions.
In the past we have used the GMap and OpenLayers libraries and their equivalent Drupal modules on our mapping projects. They are effective solutions when you have a small or even moderately sized collection of entities containing some simple geodata (points, lines, polygons) that you want to present as vector overlays on a map. Unfortunately they tend to fall apart fast when you attempt them with larger datasets. There are two main reasons for this:
Geospatial data can be large, particularly as we tend to encode it in text-based formats such as WKT or GeoJSON when we are sending it to a web browser. The larger the data, the longer it takes to transfer from server to client.
The information being sent is raw data which means that the client needs to parse and process the data before rendering it on the screen. The more data there is, the longer this takes.
Making things worse, the geo-data is often sent at the beginning of the HTML document (via Drupal.settings or similar). Most browsers will wait until they have downloaded and parsed this data before they begin to render the rest of the page, increasing the delay.
As a result of the above, it doesn't take much to have a serious negative impact on page load times and little more to actually crash your visitor's browser.
Heavy lifting server-side
A good solution to these issues is to process and render the geo-data as image tiles on the server. Tiles can then be cached and served to the client when requested and the data is only rendered whenever it is changed instead of each time the page is loaded. Bandwidth is also reduced as the image tiles are relatively consistent in size regardless of the complexity or amount of data used to produce them.
As a demonstration we have created two maps containing some sample road data:
- The first loads the data from an external GeoJSON file that is downloaded by the browser, parsed and rendered as a vector layer.
- The second map shows the same data but served as a set of tiles that have been generated on the server.
I recommend testing these examples in a variety of browsers as their performance varies on the different platforms - particularly for the first example.
There are several components involved in a server-side tile rendering pipeline. They can be loosely categorised under storage, rendering, and caching.
Geo-data can be stored in a variety of places and formats, each with its own advantages. Here are some that are common:
ESRI Shapefiles (commonly known as shapefiles) are a popular file format for storing and transferring geo-data. They consist of a .shp file and are often bundled in a zip file with a collection of other files containing related information.
WKT and GeoJSON are formats used to encode geospatial data in plain text, making them convenient to read and parse at the expense of increasing file size.
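To make the size trade-off concrete, here is a minimal sketch that encodes the same short line in both text formats and compares the result with a rough estimate of the raw binary size. The coordinates and the helper names `to_wkt` and `to_geojson` are made up for illustration.

```python
import json

# A short road segment as (longitude, latitude) pairs; the values are made up.
coords = [(174.7633, -36.8485), (174.7702, -36.8441), (174.7765, -36.8403)]


def to_wkt(points):
    """Encode the points as a WKT LINESTRING."""
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in points) + ")"


def to_geojson(points):
    """Encode the points as a GeoJSON LineString geometry."""
    geometry = {"type": "LineString", "coordinates": [list(p) for p in points]}
    return json.dumps(geometry)


wkt = to_wkt(coords)
geojson = to_geojson(coords)

# Rough lower bound for a binary encoding: two 64-bit floats per point,
# ignoring any header bytes a real binary format would add.
binary_size = len(coords) * 2 * 8

print(wkt)
print(geojson)
print(len(wkt), len(geojson), binary_size)
```

Both text encodings come out larger than the raw coordinate values, and the gap grows with the number of points and the precision of each coordinate.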
PostGIS is a spatial database extension to the PostgreSQL database management system. The relational database gives you the ability to index, query, and manipulate your data with SQL and an extensive API of geospatial functions.
In Drupal it's common to store your data in fields attached to entities using the Geofield module; however, the data is stored as WKT in a column of type LONGTEXT, and compared to PostGIS it is not very flexible.
We have therefore developed Sync PostGIS which allows site developers to flag entity types with geofields to have their data mirrored in a PostGIS database. The source data in Drupal's main database is retained and treated as best-copy, but all changes (insert, update and delete) are reflected in the PostGIS version. This gives us the ability to utilize PostGIS's rich geospatial features on our Drupal-managed geo-data!
Once we have our raw geo-data stored somewhere we need a method of converting it into the images that we will display on our maps. Mapnik is an excellent tool for the job.
TileMill is a desktop application for creating web maps. It is developed by Development Seed to complement their MapBox service. Powered by Mapnik and Node.js it allows users to define style rules using a CSS-like language called CartoCSS. With each change, the rules and data sources are passed to Mapnik and a preview map is rendered giving immediate feedback.
TileMill's main export option renders tiles and packages them in the MBTiles format. However, it can also generate a Mapnik XML stylesheet, which other applications can pass to Mapnik to render tiles.
MapBox has a great collection of resources to get you up and running with TileMill. I recommend starting with their crash course.
So far, we have resolved the bandwidth issues discussed at the beginning of this post by rendering our data into tiles on the server with Mapnik. This also relieves the visitor's web browser of the strain of processing large amounts of raw geo-data. However, generating tiles on the server is also a resource-intensive process; depending on the area and zoom levels you wish to cover, rendering a set of tiles at once can take anywhere from a few seconds to more than a week.
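The wide range in rendering time comes from how quickly the number of tiles grows with zoom. As a rough sketch, assuming the common web-map tiling scheme in which the whole world is one tile at zoom 0 and each tile splits into four at the next level, the arithmetic looks like this. The per-tile render time is a made-up figure used only to show the scale.

```python
def tiles_at_zoom(zoom):
    """Number of tiles covering the whole world at one zoom level (2^z by 2^z)."""
    return 4 ** zoom


def tiles_up_to_zoom(max_zoom):
    """Total tiles for all zoom levels from 0 to max_zoom inclusive."""
    return sum(tiles_at_zoom(z) for z in range(max_zoom + 1))


def render_time_days(tile_count, seconds_per_tile):
    """Time to render tile_count tiles one after another, in days."""
    return tile_count * seconds_per_tile / 86400


# Hypothetical average render time per tile, in seconds.
SECONDS_PER_TILE = 0.05

for max_zoom in (6, 10, 14):
    count = tiles_up_to_zoom(max_zoom)
    days = render_time_days(count, SECONDS_PER_TILE)
    print(f"zoom 0-{max_zoom}: {count} tiles, about {days:.2f} days")
```

Restricting the render to a smaller area reduces the count proportionally, but every extra zoom level still multiplies it by four.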
Obviously we don't want to be rendering tiles from scratch with every request. Instead it is much more efficient to cache the tiles somewhere after they have been rendered and serve requests directly from the cache, only resorting to rendering when a cached tile doesn't exist. There are many ways to cache tiles on your server. Here are some methods that we use:
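The render-on-miss idea can be sketched in a few lines. This is an illustration only, not how any particular tile server implements it: the `TileCache` class and `fake_render` function are hypothetical, and a plain dictionary stands in for whichever cache backend you choose.

```python
class TileCache:
    """Read-through cache: render a tile only when no cached copy exists."""

    def __init__(self, render):
        self._render = render  # callable (zoom, x, y) -> image bytes
        self._tiles = {}       # in-memory store keyed by (zoom, x, y)

    def get(self, zoom, x, y):
        key = (zoom, x, y)
        if key not in self._tiles:
            # Cache miss: do the expensive render once and keep the result.
            self._tiles[key] = self._render(zoom, x, y)
        return self._tiles[key]

    def invalidate(self, zoom, x, y):
        """Drop a cached tile, for example after its source data changes."""
        self._tiles.pop((zoom, x, y), None)


render_calls = []


def fake_render(zoom, x, y):
    """Stand-in for a real renderer; records each call and returns dummy bytes."""
    render_calls.append((zoom, x, y))
    return f"tile {zoom}/{x}/{y}".encode()


cache = TileCache(fake_render)
first = cache.get(3, 4, 2)   # rendered
second = cache.get(3, 4, 2)  # served from the cache
print(len(render_calls))
```

The second request never reaches the renderer. Invalidating a tile when its underlying data changes forces a fresh render on the next request.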
MBTiles is a file-format specification pioneered by Development Seed. It is essentially a SQLite database containing a whole set of rendered map tiles. Known as tilesets, these files are portable and lightweight and can be generated by TileMill. They are great for caching base layers or layers comprised of data that doesn't change frequently. However they require tiles to be rendered in advance, making them less useful for maps covering large areas and zoom levels, or data sources that often require updating.
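Because an MBTiles file is a SQLite database, it can be read with nothing more than a SQLite client. The sketch below builds a tiny in-memory database using the `metadata` and `tiles` tables described in the MBTiles specification, then looks up one tile. The tile contents are placeholder bytes, and the helper names are our own. Note that MBTiles numbers tile rows from the bottom (the TMS convention), so a row index from the more common top-down scheme has to be flipped.

```python
import sqlite3


def xyz_to_tms_row(zoom, y):
    """Convert a top-down (XYZ) row index to the bottom-up (TMS) row MBTiles uses."""
    return (2 ** zoom - 1) - y


def read_tile(conn, zoom, x, y):
    """Return the image bytes for an XYZ tile, or None if it is not in the tileset."""
    row = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (zoom, x, xyz_to_tms_row(zoom, y)),
    ).fetchone()
    return row[0] if row else None


# Build a throwaway tileset in memory instead of opening a real .mbtiles file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.execute(
    "CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, "
    "tile_row INTEGER, tile_data BLOB)"
)
conn.execute("INSERT INTO metadata VALUES (?, ?)", ("name", "example"))
conn.execute(
    "INSERT INTO tiles VALUES (?, ?, ?, ?)",
    (1, 0, xyz_to_tms_row(1, 0), b"fake-png-bytes"),  # placeholder image data
)

print(read_tile(conn, 1, 0, 0))
print(read_tile(conn, 1, 1, 1))
```

For a real tileset, replace `":memory:"` with the path to the file and skip the table creation.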
Map tiles are individual image files, usually 256x256 pixels in dimension and rendered in a compressed image format such as .png. In most situations storing them directly on a file system is satisfactory.
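When tiles are stored as individual files they are typically addressed by zoom level, column and row. As a sketch, assuming the widely used Web Mercator tiling scheme with rows numbered from the top (this post does not prescribe a scheme), a longitude and latitude can be mapped to a tile and a file path as follows. The `tile_path` layout and the coordinates are illustrative choices, not a requirement of any particular server.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a point, for Web Mercator XYZ tiles."""
    n = 2 ** zoom  # tiles per side at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edges still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_path(zoom, x, y, extension="png"):
    """Build a relative file path such as '12/4036/2564.png'."""
    return f"{zoom}/{x}/{y}.{extension}"


x, y = lonlat_to_tile(174.7633, -36.8485, 12)
print(tile_path(12, x, y))
```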
If you are expecting a lot of requests concurrently, you may want to avoid the file system and cache tiles in memory. Memcache or similar systems are made for this task.
There are plenty of options available for tile servers including TileCache, TileStache, TileLite, TileStream and Mod Tile. We have been using TileStache as it has an excellent balance of features and simplicity.
TileStache is a server application that handles requests for tiles and serves and caches tiles generated from Mapnik or other rendering providers. It's implemented in Python and designed to be extended with a solid plugin system.
Out of the box, its features include:
- Rendering Mapnik maps
- Serving MBTiles tilesets
- Caching tiles to file system, MBTiles, Memcache or Amazon S3
- Compositing 'layers' into single tilesets
The compositing feature in particular is very powerful. In TileStache's configuration you define a set of 'layers', each layer being a different tileset and effectively its own map. You can then define composite layers, which are new tilesets composed of other layers stacked on top of one another. This allows you to do things like combining a pre-rendered tileset stored in an MBTiles file with a tileset of features stored in PostGIS and serving them to your visitor's browser as one flat set of tiles.
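As a rough illustration, a configuration along these lines might look like the following. TileStache reads its configuration from a JSON file, so the sketch builds the structure in Python and prints it as JSON. The layer names (`base`, `roads`, `combined`) and all file paths are hypothetical, and the provider names and keys (`mbtiles`, `mapnik`, `sandwich` with its `stack` of `src` entries) reflect our understanding of TileStache's configuration format; check them against the documentation for your version before relying on them.

```python
import json

# Hypothetical TileStache configuration; names and paths are placeholders.
config = {
    # Cache rendered tiles on the file system.
    "cache": {"name": "Disk", "path": "/tmp/stache"},
    "layers": {
        # Pre-rendered base layer served from an MBTiles file.
        "base": {"provider": {"name": "mbtiles", "tileset": "base.mbtiles"}},
        # Layer rendered by Mapnik from a stylesheet whose data source
        # could be a PostGIS table.
        "roads": {"provider": {"name": "mapnik", "mapfile": "roads.xml"}},
        # Composite layer: stack the two layers above, bottom first.
        "combined": {
            "provider": {
                "name": "sandwich",
                "stack": [{"src": "base"}, {"src": "roads"}],
            }
        },
    },
}

config_text = json.dumps(config, indent=2)
print(config_text)
```

With a configuration like this, the map on the page would request tiles only from the `combined` layer and never needs to know that two sources sit behind it.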
The range of tools and techniques described provides plenty of flexibility when we are working on mapping projects. It is all achieved without wasting bandwidth or bogging down our visitors' machines with redundant computation.
Previously we had a strict upper limit on the amount of geo-data we could manage and serve, based on the limits of the network and our visitors' hardware. As evident in this final example, our challenge now is deciding how much data we can fit into our maps without sacrificing their readability.