Remove duplicate list or array elements using ColdFusion and Java

Posted on October 22, 2008

I’ve been working on a new feature for my company’s main application. Basically I need to associate some data with a list of students. The list of students is generated from existing class rosters. Since students can be in more than one class, it’s possible the same student gets selected twice. So how can I quickly remove any duplicates and store only the unique list of students?

The User Interface

As students are selected from a class roster (name=”AVAILABLE_STUDENTS”), they are added to an increasing list of check boxes (name=”SELECTED_STUDENTS”). From the list of selected students, the user can uncheck anyone they don’t want to associate before submitting the form. As you know, when you select multiple check boxes of the same name, upon submission of the form the values of the checked boxes are converted to a comma-delimited list.

So let’s create a list of student IDs that contains duplicate numbers.

<cfset form.SELECTED_STUDENTS = "3,2,1,2,3,4,1,5,6,7,3,3,9,0,3,2,0,3" />

Let’s convert the list to an array so we can better see the unique values.

Unique elements in form.SELECTED_STUDENTS

<cfset studentArray = listToArray( form.SELECTED_STUDENTS ) />

<cfdump var="#studentArray#" label="studentArray" />
studentArray- array
1 3
2 2
3 1
4 2
5 3
6 4
7 1
8 5
9 6
10 7
11 3
12 3
13 9
14 0
15 3
16 2
17 0
18 3

A Pure ColdFusion Solution

Ben Nadel has a great post that discusses one approach to this problem.

  1. Create an empty struct.
  2. Loop over your list.
  3. Each element becomes a key in that struct.
  4. Duplicate elements overwrite the previous key of the same name.
  5. Use StructKeyList() to return the list of keys, which are the unique values from your list.

This works well, but in my opinion, there are a few downsides to this process:

1. The loop

I hate loops. They’re like the plague. You start with just one, then another shows up and pretty soon there’s a swarm of them chewing up your server’s resources like locusts in Egypt.

2. It’s not case-sensitive

This isn’t an issue since we’re dealing with ID numbers in this instance, but when you deal with strings and especially with UTF-8 character sets, case is a big deal.

3. It doesn’t keep the order of the original list

The order in which you store data in a database can be very important. There will be times you want to store and retrieve data in exactly the order you displayed on a form. The form I built allows the user to reorder check boxes using some jQuery magic, so stored the order of the Student IDs or other check box data can change.

4. Performance

I can only wonder how well this performs under load and with lists of unnatural length.

A couple of Java Solutions

In a previous entry, I talked about the fact that under the hood a ColdFusion array is really a Java Vector. I decided to take a look at how Java handles this problem and found a couple of less verbose solutions.

The Vector

Let’s take a look at the inheritance stack for a java Vector:

  • java.lang.Object
    • is extended by java.util.AbstractCollection
      • is extended by java.util.AbstractList
        • is extended by java.util.Vector

It also implements the Collection Interface, among others.

If you take a look at what else implements the Collection Interface, you’ll find HashSet and LinkedHashSet.

Both of these classes implement another Interface called Set. Set also extends Collection. From the documentation, a Set is “[a] collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.”

In case you don’t speak geek to this level, this means that a Set can be null or it consists of unique elements. In other words, it wouldn’t surprise me that under the hood of a ColdFusion struct lies an implementation of a Java Set.

The HashSet

Looking at the Constructor Summary of the HashSet, you’ll see that you can create an empty HashSet or that you can create a HashSet based on a Collection.

  1. HashSet(): Constructs a new, empty set; the backing HashMap instance has default initial capacity and load factor.
  2. HashSet(Collection c): Constructs a new set containing the elements in the specified collection.

So let’s create an empty HashSet using ColdFusion:

<cfset myHashSet = createObject("java", "java.util.HashSet") />

Now let’s call its constructor and pass in a Collection. Do we have a Collection handy?

. . . anyone?

. . . anyone?

It’s the ColdFusion Array (Java Vector) we created from the list at the top of the post.

<cfset myHashSet.init( studentArray ) />

Now let’s display myHashSet in plain text to see what we have.

Contents of myHashSet

<cfdump var="#myHashSet.toString()#" />

<!--- output --->
[3, 2, 1, 0, 7, 6, 5, 4, 9]

The square brackets ( [] ) show that toString() “returns a string representation of this Collection”. Technically we don’t have a proper list (note the space between each element). In order to get one, we have to manipulate the Collection with both Java and ColdFusion.

Creating a proper list

<!--- toArray() is a Java method --->
<cfset uniqueArray = myHashSet.toArray() />

<!--- arrayToList() is a ColdFusion method --->
<cfset uniqueList = arrayToList( uniqueArray ) />

<cfoutput>#uniqueList#</cfoutput>

<!--- output --->
3,2,1,0,7,6,5,4,9

So here we have a unique list of IDs from the original list. As with the pure CF solution, the elements are not in the same order as the original list. From the documentation, a HashSet “makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.”

If we want a unique list AND keep the elements in order, we need to instead use a LinkedHashSet.

The LinkedHashSet

“This implementation differs from HashSet in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is the order in which elements were inserted into the set (insertion-order).”

Here we go:

Creating a unique, correctly ordered list.

<cfset form.SELECTED_STUDENTS = "3,2,1,2,3,4,1,5,6,7,3,3,9,0,3,2,0,3" />

<cfset studentArray = listToArray( form.SELECTED_STUDENTS ) />

<cfset lhs = createObject("java", "java.util.LinkedHashSet") />

<cfset lhs.init( studentArray  ) />

<cfoutput>#arrayToList( lhs.toArray() )#</cfoutput>

<!--- output --->
3,2,1,4,5,6,7,9,0

or if you like:

ADD Version

<cfset form.SELECTED_STUDENTS = "3,2,1,2,3,4,1,5,6,7,3,3,9,0,3,2,0,3" />

<cfset lhs = createObject("java", "java.util.LinkedHashSet") />

<cfoutput>
#arrayToList( lhs.init( listToArray( form.SELECTED_STUDENTS )  ).toArray() )#
</cfoutput>

<!--- output --->
3,2,1,4,5,6,7,9,0

One last thing

Let’s say you want to replace the contents of studentArray with the unique elements you’ve created. To do this, we call on another couple of Java functions inherited by the Vector class.

Updating studentArray with its unique elements

<cfset form.SELECTED_STUDENTS = "3,2,1,2,3,4,1,5,6,7,3,3,9,0,3,2,0,3" />

<cfset studentArray = listToArray( form.SELECTED_STUDENTS ) />

<cfset lhs = createObject("java", "java.util.LinkedHashSet") />

<cfset lhs.init( studentArray  ) />

<!--- Clear out the original elements --->
<cfset studentArray.clear() />

<!--- Add all of the elements in lhs --->
<cfset studentArray.addAll( lhs ) />

<cfdump var="#studentArray#" label="Unique studentArray">
Unique fooArray - array
1 3
2 2
3 1
4 4
5 5
6 6
7 7
8 9
9 0

Summary

Both (all three?) solutions are completely valid approaches. It seems to me that the Java solution might be faster under load, but we would have to stress test an application to determine that. My point here is that most of us forget that ColdFusion is now built on top of Java and that Java has an enormous amount of functionality that can easily be leveraged in a ColdFusion application. The next time you’re looking for a solution that seems difficult with ColdFusion, take a look at the Java documentation and see if it has a proven solution.

About the Author
Adrian J. Moreno

Adrian is a CTO and solution architect specializing in software modernization. More information