Sorting Terabytes
August 6, 2010
Sorting is an essential human activity. Putting things in alphabetical order takes time while it's being done, but it has benefits later when you're looking for something. That's why books contain indices. Sorting also helps in decision making. When you have only a certain amount of money to spend in a month, you rank things according to importance and draw a line to separate the necessities from things that are lower on your list. Computers have made sorting easier, but at the same time they've allowed the accumulation of so much data that efficient methods of sorting large quantities of data are required.
Haven't we already found the "best" sorting method and discarded all the rest? Well, there's a problem here that involves the computer architecture used to do the sorting. Some sorts are better on some computers than on others. Some fast sorting methods require a lot of memory, but slower ones can sort your items "in place"; that is, you don't need any more memory than the memory in which the data already reside. Sometimes the dataset is so huge that it's stored in pieces, which presents a problem. Sometimes the dataset is so small that speed is not an issue - a blink of an eye is as good as a half blink - so the programmer uses the technique that involves fewer lines of code. I've used the least efficient sorting method, "bubble sort," on many occasions, since I can write the code off the top of my head (a sketch appears after the list below). Here's a short list of some sorting methods.[1]
- Bubble sort
- Insertion sort
- Shell sort
- Merge sort
- Heapsort
- Quicksort
- Counting sort
- Bucket sort
- Radix sort
- Distribution sort
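For the curious, here's the sort of bubble sort one can write from memory - which is precisely its appeal. It's a minimal Python sketch of the standard algorithm, sorting in place with O(n^2) comparisons:

def bubble_sort(items):
    """Sort a list in place by repeatedly swapping adjacent
    out-of-order elements. Slow (O(n^2) comparisons), but it
    needs no extra memory and is trivial to write from memory."""
    n = len(items)
    for i in range(n - 1):
        swapped = False
        # Each pass bubbles the largest remaining item to the end
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # no swaps means the list is already sorted
            break
    return items

print(bubble_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]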
The winner for most desktop applications is "Quicksort." I won't bore you with the details, which can be found elsewhere, but it's very good. In the past (before Wikipedia), Quicksort was "reinvented" many times over by the uninitiated who thought their names would go down in computer history. Quicksort was invented in 1960 by C. A. R. Hoare. As an example of necessity being the mother of invention, Hoare devised the technique to sort vocabulary words for a Russian-English machine translation program.
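The idea behind Hoare's algorithm is divide and conquer: pick a pivot element, partition the rest around it, and recurse on each partition. Here's a readable Python sketch of that idea; it copies sublists for clarity, whereas production implementations partition in place:

def quicksort(items):
    """Hoare's divide-and-conquer sort: partition around a pivot,
    then recursively sort each partition."""
    if len(items) <= 1:
        return items  # zero or one element is already sorted
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([3, 6, 1, 8, 2, 9]))  # [1, 2, 3, 6, 8, 9]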
There's a realm of sorting with which few are acquainted; namely, the sorting of terabytes of data. One terabyte is 1,000 gigabytes, or one million megabytes. That's 10^12 bytes of data, and it would be equivalent to the records of a total world census, if such existed; or a few seconds of data from the Large Hadron Collider. A terabyte of data is equivalent to about 40 single-layer Blu-ray discs. Looking ahead, some companies would like to sort petabytes (1,000 terabytes) of data. There's even a contest, of sorts, maintained at http://sortbenchmark.org, that has a goal of rapidly sorting terabytes of data organized as 100-byte records. This year's record, established by a team from the University of California, San Diego, was 1.014 terabytes in one minute. This same team also broke another record by sorting one trillion data records in 172 minutes.
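When a dataset is far larger than memory, the classic approach is an external merge sort: sort manageable chunks, spill each sorted run to disk, then stream-merge the runs. The Python sketch below is a toy illustration of that idea under the assumption that records are newline-free strings - it is not the UCSD team's actual TritonSort pipeline, which is a carefully balanced distributed system.

import heapq
import itertools
import tempfile

def external_sort(records, chunk_size=1_000_000):
    """Sort an iterable of string records too large for memory:
    sort fixed-size chunks, spill each to a temporary file, then
    lazily k-way merge the sorted runs with heapq.merge."""
    run_files = []
    it = iter(records)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+t")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        run_files.append(f)
    # heapq.merge keeps only one record per run in memory at a time
    return heapq.merge(*((line.rstrip("\n") for line in f) for f in run_files))

# Demo with toy data and tiny chunks:
words = ["pear", "apple", "plum", "fig", "cherry", "date"]
print(list(external_sort(words, chunk_size=2)))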
The sorting benchmark tests are called the Indy Minute Sort and the Indy Gray Sort. What's significant is that the UCSD team used just a quarter of the number of computers used by the previous record holder. They were able to do this by carefully balancing resources, such that no resources were unnecessarily idle. Another set of benchmarks at http://sortbenchmark.org is for minimal-energy sorting, something that's becoming important in today's energy-starved world. Said team member Amin Vahdat, Chair in Engineering in the Department of Computer Science and Engineering and director of the UCSD Center for Networked Systems, "If you are idling your processors or not using all your RAM, you're burning energy and losing efficiency."[2]
The UCSD system was relatively simple.[3] It was composed of 52 HP ProLiant DL380 G6 servers, each of which had two quad-core Intel Xeon E5520 2.27 GHz processors, 24 GB of RAM, and sixteen 2.5-inch, 500 GB, 7200 RPM SATA hard drives. Each server had a Myricom 10 Gbps network card, and all communications were via standard Ethernet, connected by a Cisco Nexus 5020 10 Gbps switch. Of course, all machines ran the Linux operating system (version 2.6.32.8) with the ext4 filesystem.
References:
1. Sorting Algorithms on Wikipedia.
2. Data sorting world record: 1 terabyte, 1 minute, Science Blog (July 27, 2010).
3. Alexander Rasmussen, Harsha V. Madhyastha, Radhika Niranjan Mysore, Michael Conley, Alexander Pucher, George Porter and Amin Vahdat, "TritonSort," PDF Report.
4. Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, Software and Systems Division, Information Technology Laboratory.
5. Sorting Algorithm Animations at sorting-algorithms.com.