It’s Data but is it GIS Data??

I’ve been in the situation recently of having to make web maps from datasets that people are sending me. These have been hotels, hospitals, nursing homes, etc… and I typically get an excel spreadsheet with some data in it with the instructions “Map these”. It always seems pretty straightforward until I open the dataset. UGH! Here’s the badly formatted pile of data that I now have to totally rework to get it into usable GIS data. So below are my ideas of things you should tell potential data providers about how “normal” data should be formatted if you want to turn it into “GIS Data”.

  1. Spreadsheets vs Databases.

There are major differences between these things and although spreadsheet data CAN be used in GIS, it requires proper formatting. There should be Columns, each representing a characteristic or bit of data about the features to be mapped; and Rows, each representing one feature. The first row should have a name for each column that is concise and doesn’t repeat. It also helps if you take out the spaces and special characters. For instance a column called “Last Reported / Known Physical Address” can become “Last_Address” or “Major Impact of Current Study” can become “Impact”. ArcGIS Anything will want to replace the spaces, truncate long field names, and choke on special charaters.

Then each row in the spreadsheet should represent a single feature to be mapped – with each column containing the information about that single feature.

The mistake I see most often is having multiple rows for one feature, for example:

Sidewinder Hotel:   323 Main St



                                   Texas, 75001

That should be a single row with six columns! VERY difficult to reformat this into the correct format. But you can copy/paste each set of values into a new page and use the Paste Special – Transpose option to swap columns for rows. This is OK for just a dozen or so features but you wouldn’t want to do it on hundreds of features.

Name                           Address            Region               City        State     Zip

Sidewinder Hotel      323 Main St     Downtown        Dallas    TX        75001

Next you can take out all the headers, title blocks, and logos in the spreadsheet. What I like to do is make a new page in the spreadsheet called GISData, then copy all of the columns and rows that I want from the page into this new sheet – but remember to use the Paste Special – Values option when you do this because any calculated formulas will be lost otherwise.

If I have a group that wants to put together data that will be mapped later, I usually send them a template spreadsheet (or help them build one) and let them work from that. Then at least they are starting with something close to what I want.

2. Addresses for Geocoding

I very often get a spreadsheet with addresses to geocode, and there are two major formats I can use. One is to have all of the information in a single field (or column) and the other is to have everything split out into separate fields. Too many time the data is a mish-mash of the two, with two or more pats in one field and the rest in several other fields. One fix for this is to make a new column for single address and build a formula concatenating the other fields together. Then Copy/Paste Special – Values into a new page. But if you can head this off at the start it will save a lot of time.

3. Incomplete Data

Not every field for every feature HAS to have a value, but there are some important ones that SHOULD. For instance, if you want to geocode the file then every row MUST have an address.  Feeding the geocoder the address of “Sheraton Hotel” won’t find much, or worse if you give it a partial address it might find it in another city or even state! Also, any field that you intend to use as the symbology field MUST have a value for each feature. If you want to color code by “Water Level” then every feature needs to have a value. Any null values will throw off the data and make the map look weird.

4. Too much data in one field

My worst pet peeve of data – putting too much into one field. For instance, I’ve seen the value in the “Accepts Pets” column be “YES, must show proof of vaccination, $75 deposit”. Then they want the map of hotels color coded by their pet policy “Yes” or “No”. When you set this up and the symbology creator picks all the unique values, there could be dozens when there should be two. It would be better to have a field for “Accepts_Pets” restricted to Yes or No, then an additional field for “Pet_Notes”. In fact, any field that will be used for symbolizing the data falls into this restriction.

5. Data consistency

It’s bad enough when you get data with a mixed bag of upper and lower case entries – and while this doesn’t affect the use of the data it keeps you from having a consistent look and feel. But what I hate is when different people edit the data and use different values in the fields. For instance the Yes/No field for pets should ALWAYS be either the fully spelled out YES and NO, or the abbreviated Y and N. Don’t mix and match because the symbology creator will see four unique values instead of two.

Bummer dude.

This can get worse if you have a field for Status and everyone types in their own interpretation of the current status … when what we want for the map might be “Open”, Closed”, “By Appointment”, “Out of Business”. In Excel you can build a picklist (or domain) and help keep the data consistent.

Granted not everyone is going to see data through my eyes, properly formatted for GIS. But if they were aware of how GIS data is used BEFORE they started creating it, maybe they’ll heed a few of these warnings and make our lives easier.