Representation and sorting of complex data

6 07 2008

Yesterday I saw a blog post by Ray Camden, showing an example of how to do a bubble sort on a two-dimensional array. I like the fact that this kind of sort is so easy to implement, with so few lines of code. As Ray pointed out, however, the bubble sort not suitable for sorting large sets of data. As an alternative, Ray suggests creating the array as a query, which of course can be sorted on multiple keys by using ColdFusion’s in-memory query capabilities.

Another alternative is to replace the two-dimensional array with an array of structures. Representing a set of objects as an array of structures is something I often choose to do, and in this blog post I will try to explain my thoughts on this matter.

In cases such as described by Ray, where the information is about objects which have one or more properties, arrays of structures result into more readable and maintable code than two-dimensional arrays. Compare, for instance, the following blocks of code:

<cfset persons = arrayNew(2)>
<cfset persons[1][1] = "Martijn">
<cfset persons[1][2] = 29>
<cfset persons[2][1] = "Franca">
<cfset persons[2][2] = 33>

<cfset persons = arrayNew(1)>
<cfset personData = structNew()>
<cfset personData.firstName = "Martijn">
<cfset personData.age = 29>
<cfset arrayAppend(persons, personData)>
<cfset personData = structNew()>
<cfset personData.firstName="Franca">
<cfset personData.age = 32>
<cfset arrayAppend(persons, personData)>

OK, so the second block of code is longer, but it also provides more information: from the first block of code we would have to guess that the number at position 2 of the second index is in fact the age of the person. In the second block of code, this is explicitly stated. Dumping the array would show all properties of the persons by name, so if we pass the data around in an application it will be instantly obvious what properties we are dealing with, even without looking at the code that created the array. This also applies to debugging your code: error messages like “Element firstName is undefined in a CFML structure referenced as part of an expression. ” are much easier to decode than “Element 1 is undefined in a Java object of type class coldfusion.runtime.Array. “

Creating the array like in the second block is also less error-prone: often we would write the code for creating one record, than copy-paste-edit to create the rest of the record. Using arrays of structures, we would only have to edit the property values. When using two-dimensional arrays we would also have to edit the record number (the first index) and if you are not far more accurate than I am you are going to end up skipping or duplicating a row number, or worse: editing the wrong index.

So what about presenting the data as a query object? You could do something like this:

<cfset rs = queryNew("firstName,age")>
<cfset queryAddRow(rs)>
<cfset querySetCell(rs,"firstName","Martijn")>
<cfset querySetCell(rs,"age",29)>
<cfset queryAddRow(rs)>
<cfset querySetCell(rs,"firstName","Franca")>
<cfset querySetCell(rs,"age",33)>

I am aware of two advantages of arrays of structures over queries. The first advantage is, that an array of structures can be represented in almost any programming language, whereas query object are very specific to ColdFusion. So if you are building a webservice that is to be consumed by many different types of client technology, you might want to pick structures over query objects.

The second advantage is, that – using an array – you can pass all data pertaining to one record in one reference, like this:

<cfset secondPerson = persons[2]>
… instead of
<cfset secondPerson.firstName = persons.firstName[2]>
<cfset secondPerson.age = persons.age[2]

It is easy to see that the code referring to the query representation would have to be updated if we were to add a third property to each person. The code referring to the array-of-structs representation would stay the same.

So what are the advantages of the query object representation? The most important one to me is that queries provide a very easy way to select records based on one or more properties: you can query them as if they are database tables, like this:

<cfquery dbtyp="query" name="selectedPerson">
SELECT * FROM persons WHERE age >=30
</cfquery>

Using the array-of-structs representation, you would have to loop through the entire set of records and determine for every record wether it should be selected. Obviously, this is more tedious to implement and will certainly perfom poorly when using large datsets.

The second advantage would be that data in queries can easily be sorted on multiple key by adding an ORDER BY clause the query-of-query SQL. To overcome this, I have written a UDF that will do an insertion sort on an array of structures on multiple keys. Sort order can be specified seperately per key. The code containing the UDF and a simple example can be downloaded here. Enjoy!


Actions

Information

5 responses

8 07 2008
Martijn van der Woud

Just a quick note: I posted the UDF at CFLIB.org. Now let’s see if it gets accepted ;)

9 07 2008
Martijn van der Woud

Turns out there is already a similar UDF CFLIB.org. It does not support sorting on multiple keys, however.

27 09 2008
A UDF for sorting arrays of beans (instantiated CFC’s), or arrays of structures « Martijn van der Woud’s Weblog

[...] (instantiated CFC’s), or arrays of structures 27 09 2008 A while a ago, I posted a discussion about different ways to represent complex data, and a UDF to sort arrays of structures on multiple [...]

20 04 2009
jun

Great post, thanks it helps a lot.

Just a question if you don’t mind, arrays of structure is easy to follow than that of two dimensional arrays but did you also consider the performance between the two. I can only guess that two dimensional array is a lot faster than arrays of structure. It’s just an opinion and I stand corrected.

Thanks again for the UDF

21 04 2009
martijnvanderwoud

Hi Jun,

I am not an expert on the matter, but I guess the difference in performance between two-dimensional arrays and arrays of structures will only be noticeable if you have a huge set of data. There would be a much larger performance penalty if you used an array of objects instead. This is because of the comparatively large object creation penaly in ColdFusion. Then again, using objects instead of primitive datatypes can be a very good idea from a design perspective.

Leave a comment