Representation and sorting of complex data

6 07 2008

Yesterday I saw a blog post by Ray Camden, showing an example of how to do a bubble sort on a two-dimensional array. I like the fact that this kind of sort is so easy to implement, with so few lines of code. As Ray pointed out, however, the bubble sort not suitable for sorting large sets of data. As an alternative, Ray suggests creating the array as a query, which of course can be sorted on multiple keys by using ColdFusion’s in-memory query capabilities.

Another alternative is to replace the two-dimensional array with an array of structures. Representing a set of objects as an array of structures is something I often choose to do, and in this blog post I will try to explain my thoughts on this matter.

In cases such as described by Ray, where the information is about objects which have one or more properties, arrays of structures result into more readable and maintable code than two-dimensional arrays. Compare, for instance, the following blocks of code:

<cfset persons = arrayNew(2)>
<cfset persons[1][1] = "Martijn">
<cfset persons[1][2] = 29>
<cfset persons[2][1] = "Franca">
<cfset persons[2][2] = 33>

<cfset persons = arrayNew(1)>
<cfset personData = structNew()>
<cfset personData.firstName = "Martijn">
<cfset personData.age = 29>
<cfset arrayAppend(persons, personData)>
<cfset personData = structNew()>
<cfset personData.firstName="Franca">
<cfset personData.age = 32>
<cfset arrayAppend(persons, personData)>

OK, so the second block of code is longer, but it also provides more information: from the first block of code we would have to guess that the number at position 2 of the second index is in fact the age of the person. In the second block of code, this is explicitly stated. Dumping the array would show all properties of the persons by name, so if we pass the data around in an application it will be instantly obvious what properties we are dealing with, even without looking at the code that created the array. This also applies to debugging your code: error messages like “Element firstName is undefined in a CFML structure referenced as part of an expression. ” are much easier to decode than “Element 1 is undefined in a Java object of type class coldfusion.runtime.Array. “

Creating the array like in the second block is also less error-prone: often we would write the code for creating one record, than copy-paste-edit to create the rest of the record. Using arrays of structures, we would only have to edit the property values. When using two-dimensional arrays we would also have to edit the record number (the first index) and if you are not far more accurate than I am you are going to end up skipping or duplicating a row number, or worse: editing the wrong index.

So what about presenting the data as a query object? You could do something like this:

<cfset rs = queryNew("firstName,age")>
<cfset queryAddRow(rs)>
<cfset querySetCell(rs,"firstName","Martijn")>
<cfset querySetCell(rs,"age",29)>
<cfset queryAddRow(rs)>
<cfset querySetCell(rs,"firstName","Franca")>
<cfset querySetCell(rs,"age",33)>

I am aware of two advantages of arrays of structures over queries. The first advantage is, that an array of structures can be represented in almost any programming language, whereas query object are very specific to ColdFusion. So if you are building a webservice that is to be consumed by many different types of client technology, you might want to pick structures over query objects.

The second advantage is, that – using an array – you can pass all data pertaining to one record in one reference, like this:

<cfset secondPerson = persons[2]>
… instead of
<cfset secondPerson.firstName = persons.firstName[2]>
<cfset secondPerson.age = persons.age[2]

It is easy to see that the code referring to the query representation would have to be updated if we were to add a third property to each person. The code referring to the array-of-structs representation would stay the same.

So what are the advantages of the query object representation? The most important one to me is that queries provide a very easy way to select records based on one or more properties: you can query them as if they are database tables, like this:

<cfquery dbtyp="query" name="selectedPerson">
SELECT * FROM persons WHERE age >=30
</cfquery>

Using the array-of-structs representation, you would have to loop through the entire set of records and determine for every record wether it should be selected. Obviously, this is more tedious to implement and will certainly perfom poorly when using large datsets.

The second advantage would be that data in queries can easily be sorted on multiple key by adding an ORDER BY clause the query-of-query SQL. To overcome this, I have written a UDF that will do an insertion sort on an array of structures on multiple keys. Sort order can be specified seperately per key. The code containing the UDF and a simple example can be downloaded here. Enjoy!