Parsing and Analyzing OpenStreetMap Export Files

OpenStreetMap provides geo-coded information for download in the form of extracted .osm files. These are XML files with a defined schema, but the schema alone does not help much in analyzing them, because the simple structure <node><tag /></node> distinguishes its data only by the value of each tag’s k attribute:

  <node id="567053682" [...] lat="48.2066502" lon="16.3707086">
    <tag k="addr:postcode" v="1010"/>
    <tag k="addr:country" v="AT"/>
    <tag k="addr:street" v="Neuer Markt"/>
    <tag k="addr:city" v="Wien"/>
    <tag k="addr:housenumber" v="16"/>
  </node>

Therefore we first need to know which tag/@k values are used in the XML. Given the size of .osm files (the extract for Austria is currently 3 GB), a simple XmlDocument.Load() is bound to fail due to memory restrictions: both XmlDocument and XDocument are in-memory representations of the whole XML tree. See also my earlier comparison of XML access methods in .NET.

Fortunately, .NET also provides a way to access an XML document without loading it into memory completely, namely by streaming it with XmlReader, as illustrated in an MSDN sample.

Modified to return only nodes that contain at least one tag, the enumerator from this sample looks like this:

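// requires System.Collections.Generic, System.IO, System.Xml, System.Xml.Linq
static IEnumerable<XElement> GetNextNode(TextReader stringReader)
{
  using (XmlReader reader = XmlReader.Create(stringReader)) {
    while (reader.Read()) {
      switch (reader.NodeType) {
        case XmlNodeType.Element:
          if (reader.Name == "node") {
            // materialize only this <node> subtree, not the whole document
            XElement el = XElement.ReadFrom(reader) as XElement;
            if (el != null)
              // return only nodes that have at least one <tag> child
              if (el.Element("tag") != null)
                yield return el;
          }
          break;
      }
    }
  }
}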
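One caveat with this loop: XElement.ReadFrom() already moves the reader past the closing </node>, so the next Read() skips whatever node follows. In an extract with line breaks between elements that is only a whitespace node, but in XML without whitespace between elements every second <node> would be lost. The following is a minimal sketch of a variant that avoids this. The names OsmStreaming and StreamTaggedNodes and the sample XML are my own and not part of the MSDN sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class OsmStreaming
{
    // Hypothetical variant of GetNextNode: streams <node> elements that have
    // at least one <tag> child, whether or not whitespace separates elements.
    public static IEnumerable<XElement> StreamTaggedNodes(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();   // position on the root element
            reader.Read();            // step to the first node inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "node")
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on whatever follows it, so there is no Read() in this branch.
                    var el = (XElement)XNode.ReadFrom(reader);
                    if (el.Element("tag") != null)
                        yield return el;
                }
                else
                {
                    reader.Read();    // anything else: just advance
                }
            }
        }
    }

    public static void Main()
    {
        // Made-up sample without whitespace between elements.
        const string xml =
            "<osm><node id=\"1\"><tag k=\"a\" v=\"x\"/></node>" +
            "<node id=\"2\"><tag k=\"b\" v=\"y\"/></node><node id=\"3\"/></osm>";
        foreach (var el in StreamTaggedNodes(new StringReader(xml)))
            Console.WriteLine((string)el.Attribute("id"));
    }
}
```

For the sample in Main, nodes 1 and 2 are returned and node 3 is dropped because it has no tag. Only one <node> subtree is held in memory at a time, so the approach stays as memory-friendly as the original loop.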

The code that calls this enumerator counts the occurrences of each tag/@k value in a dictionary:

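var tagKeys = new Dictionary<string, int>();
using (var reader = new StreamReader(@"path\to\austria.osm", Encoding.UTF8)) {
  // counter and start time (presumably for progress output; unused in this excerpt)
  var i = 0;
  var dt = DateTime.Now;
  foreach (var el in GetNextNode(reader))  {
    // keys of the current node, needed for the combination checks further down
    var keys = new List<string>();

    foreach (var tag in el.Elements("tag")) {
      var ka = tag.Attribute("k");
      if (ka != null) {
        var key = ka.Value;
        keys.Add(key);
        // count the occurrences of each key across all nodes
        if (tagKeys.ContainsKey(key))
          tagKeys[key] = tagKeys[key] + 1;
        else
          tagKeys.Add(key, 1);
      }
    }
  }
}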

Finally, we can print each key together with its number of occurrences:

foreach (var k in tagKeys.Keys.OrderBy(kT => kT))
  Console.WriteLine("{0} ({1})", k, tagKeys[k]);
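Sorting by key is handy for browsing the full list. To get at the most frequent entries, the same dictionary can be ordered by count instead. This is a minimal sketch, assuming a helper named MostFrequent (my own name, not part of the code above) and using the addr: counts reported in this post as sample data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TagStatistics
{
    // Hypothetical helper: returns the n most frequent keys, highest count
    // first; keys with equal counts are ordered alphabetically.
    public static List<KeyValuePair<string, int>> MostFrequent(
        Dictionary<string, int> tagKeys, int n)
    {
        return tagKeys
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(n)
            .ToList();
    }

    public static void Main()
    {
        // The addr: counts from this post, used here as sample input.
        var tagKeys = new Dictionary<string, int> {
            { "addr:city", 44936 },
            { "addr:country", 36215 },
            { "addr:housenumber", 54034 },
            { "addr:postcode", 45284 },
            { "addr:street", 50469 }
        };
        foreach (var kv in MostFrequent(tagKeys, 3))
            Console.WriteLine("{0} ({1})", kv.Key, kv.Value);
    }
}
```

With this input, the top three are addr:housenumber, addr:street and addr:postcode, in that order.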

After I found the various addr: keys (the list shows the most frequent entries),

addr:city (44936)
addr:country (36215)
addr:housenumber (54034)
addr:postcode (45284)
addr:street (50469)

I wanted to know how often their combinations occurred:

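if (keys.Contains("addr:postcode") && keys.Contains("addr:city")) {
  postcodeCity++;                      // counter names are illustrative
  if (keys.Contains("addr:street")) {
    postcodeCityStreet++;
    if (keys.Contains("addr:housenumber")) postcodeCityStreetHousenumber++;
  }
}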

resulting in

postcode+city 42352 
postcode+city+street 39864 
postcode+city+street+housenumber 39338
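Because the checks are nested, each count is a subset of the one before it, which is why the numbers only decrease. A small self-contained sketch of this counting follows. The class name, the counter names and the four sample key lists are made up for illustration and are not taken from the Austria extract.

```csharp
using System;
using System.Collections.Generic;

static class AddressCombinations
{
    // Returns { postcode+city, postcode+city+street,
    //           postcode+city+street+housenumber } for the given nodes,
    // where each node is represented by the collection of its tag keys.
    public static int[] Count(IEnumerable<ICollection<string>> keysPerNode)
    {
        int postcodeCity = 0, withStreet = 0, withHousenumber = 0;
        foreach (var keys in keysPerNode)
        {
            if (keys.Contains("addr:postcode") && keys.Contains("addr:city"))
            {
                postcodeCity++;
                if (keys.Contains("addr:street"))
                {
                    withStreet++;
                    if (keys.Contains("addr:housenumber"))
                        withHousenumber++;
                }
            }
        }
        return new[] { postcodeCity, withStreet, withHousenumber };
    }

    public static void Main()
    {
        // Made-up key lists standing in for four nodes.
        var nodes = new List<List<string>> {
            new List<string> { "addr:postcode", "addr:city" },
            new List<string> { "addr:postcode", "addr:city", "addr:street" },
            new List<string> { "addr:postcode", "addr:city", "addr:street", "addr:housenumber" },
            new List<string> { "addr:city", "addr:street" }
        };
        var c = Count(nodes);
        Console.WriteLine("postcode+city {0}", c[0]);
        Console.WriteLine("postcode+city+street {0}", c[1]);
        Console.WriteLine("postcode+city+street+housenumber {0}", c[2]);
    }
}
```

For the four sample nodes the counts are 3, 2 and 1. The last node is not counted at all because it has no postcode.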
