

Wavelet-Based Clustering for Very Large Multidimensional Datasets

Gholam Sheikholeslami, Dantong Yu, Surojit Chatterjee and Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260, USA

This research is partially supported by an NSF CAREER grant IIS

Abstract

Clustering large multidimensional datasets is an important problem which tries to find the densely populated regions in the data space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to noise (outliers) and to the order of input data. In this article, we introduce a novel clustering approach based on wavelet transform which satisfies all the above requirements. Using the multi-resolution property of wavelet transform, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We demonstrate that wavelet-based clustering can be efficiently applied to both low-dimensional and high-dimensional datasets.

1 Introduction

There are many databases, such as financial, crystallography and corporate databases, where very large multi-dimensional datasets with numerical attributes exist. Clustering in data mining is the discovery of interesting patterns that may exist in the underlying data. Because of the large size of these databases, a primary requirement for clustering algorithms for data mining is efficiency. The clustering technique should be fast and scalable with the number of dimensions and the size of the input. Also, due to the diverse nature and characteristics of the sources of the data, the clusters may assume arbitrary shapes. They may be nested within one another, may have holes inside, or may possess concave shapes. The problem of handling arbitrary shapes in high dimensions is particularly complex. A good clustering approach should be unaffected by outliers (noise) and should detect them effectively. In addition, clustering algorithms should assume minimum domain knowledge, e.g., number of clusters or underlying probability distribution, and they should be insensitive to the order of the input data. Another desirable property for clustering algorithms is the ability to produce clusters at different levels of detail, which is termed the multiresolution property.

State of the Art

One category of clustering methods is partitioning algorithms. Partitioning algorithms construct a partition of a database of N objects into a set of K clusters. Usually they start with an initial partition and then use an iterative control strategy to optimize an objective function. There are mainly two approaches: i) the k-means algorithm, where each cluster is represented by the center of gravity of the cluster, and ii) the k-medoid algorithm, where each cluster is represented by one of the objects of the cluster located near the center. Ng and Han introduced CLARANS (Clustering Large Applications based on RANdomized Search), which is an improved k-medoid method [6]. This is the first method that introduces clustering techniques into spatial data mining problems.

The other category of clustering methods includes hierarchical algorithms, which create a hierarchical decomposition of the database. The hierarchical decomposition can be represented as a dendrogram. The algorithm iteratively splits the database into smaller subsets until some termination condition is satisfied. Hierarchical algorithms do not need K as an input parameter, which is an obvious advantage over partitioning algorithms. The disadvantage is that the termination condition has to be specified. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) uses a hierarchical data structure called CF-tree, which is a height-balanced tree that stores the clustering features [12]. BIRCH tries to produce the best clusters with the available resources.

Ester et al. [2] presented a clustering algorithm, DBSCAN, relying on a density-based notion of clusters. It is designed to discover clusters of arbitrary shapes. The key idea in DBSCAN is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. DBSCAN can separate the noise (outliers) and discover clusters of arbitrary shapes. CURE (Clustering Using Representatives) utilizes multiple representative points for each cluster that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction [3].

Recently a number of algorithms were presented which quantize the space into a finite number of cells and then do all operations on the quantized space. The main characteristic of these approaches is their fast processing time, which is typically independent of the number of data objects. They depend only on the number of cells in each dimension in the quantized space. Wang et al. proposed a STatistical INformation Grid-based method (STING) for spatial data mining [10]. They divide the spatial area into rectangular cells using a hierarchical structure. They store the statistical parameters of each numerical attribute of the objects within cells. In STING, the hierarchical representation of grid cells is used to search for queries or assign a new object to the clusters.

The CLIQUE clustering algorithm identifies dense clusters in subspaces of maximum dimensionality [1]. It argues that finding the clusters in the subspaces of the original space is more effective, because many dimensions can have noise or uniformly distributed values. Also, in high dimensional spaces, the average density of points anywhere in the data space is likely to be quite low. CLIQUE partitions the data space into cells. To approximate the density of the data points, it counts the number of points in each cell. The clusters are unions of connected high density cells within a subspace. It generates cluster descriptions in the form of DNF expressions. Note that while CLIQUE focuses on finding clusters embedded in subspaces of high dimensional data, our proposed approach, similar to BIRCH and DBSCAN, is designed to detect clusters in full dimensional space. Hence, in that regard the scope and goal of our method are different from those of CLIQUE.

We first introduced WaveCluster, which partitions the data space into cells and applies wavelet transform on them [8]. Using the multiresolution property of wavelets, WaveCluster can detect arbitrarily shaped clusters at different degrees of detail. Though this approach meets all the desirable properties of a good clustering technique, its performance in terms of required memory and time degrades as the dimensionality of the data increases. Thus, it requires modifications to better handle high dimensional datasets.
In this article, we generalize the concept of the wavelet-based clustering approach and provide approaches to demonstrate its usefulness in both low-dimensional and high-dimensional data spaces.

2 Wavelet-Based Clustering

We first discuss the relationship between multidimensional data and multidimensional signals and show how to use wavelet transform to detect the inherent relationships in the data. We propose to look at the multidimensional data space from a signal processing perspective. The collection of data in the multidimensional data space composes a d-dimensional signal. The high frequency parts of the signal correspond to the regions of the data space where there is a rapid change in the distribution of data, that is, the boundaries of clusters. The low frequency parts of the d-dimensional signal which have high amplitude correspond to the areas of the data space where the data are concentrated. For example, Figure 1 shows a 2-dimensional data space, where the two-dimensional data points have formed four clusters. Each row or column can be considered as a one-dimensional signal, so the whole data space will be a 2-dimensional signal. Boundaries and edges of the clusters constitute the high frequency parts of this 2-dimensional signal, whereas the clusters themselves correspond to the parts of the signal which have low frequency with high amplitude. When the number of data points is high, we can apply signal processing techniques to find the high frequency and low frequency parts of the d-dimensional signal representing the data, resulting in detecting the clusters. The key idea is to apply signal processing methods to transform the space and find the dense regions in the transformed space.

Figure 1: A sample 2-dimensional data space.

Wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands (for example, high frequency subband and low frequency subband). It is a type of signal representation that can give the frequency content of the signal at a particular instant of time by filtering. A one-dimensional signal $s$ can be filtered by convolving the filter coefficients $c_k$ with the signal values:

$$\hat{s}_i = \sum_{k=0}^{M-1} c_k \, s_{i+k-\frac{M}{2}}, \qquad (1)$$

where $M$ is the number of coefficients in the filter and $\hat{s}$ is the result of convolution. Wavelet transform provides us with a set of interesting filters. For example, Figure 2 shows the Cohen-Daubechies-Feauveau (2,2) biorthogonal wavelet [9].

Figure 2: Cohen-Daubechies-Feauveau (2,2) biorthogonal wavelet.

We now briefly review wavelet-based multi-resolution decomposition. More details can be found in Mallat's paper [5]. To have a multi-resolution representation of signals we can use discrete wavelet transform. We can compute a coarser approximation of the one-dimensional input signal $S_0$ by convolving it with the low pass filter $\tilde{H}$ and down sampling the signal by two [5]. All the discrete approximations $S_j$, $1 < j < J$ ($J$ is the maximum possible scale), can thus be computed from $S_0$ by repeating this process. Figure 3 illustrates the method.

Figure 3: Block diagram of multi-resolution wavelet transform.

$D_j$ denotes the difference between $S_j$ and $S_{j-1}$ and is called the detail signal at scale $j$. We can compute the detail signal $D_j$ by convolving $S_{j-1}$ with the high pass filter $\tilde{G}$ and returning every other sample of the output. The wavelet representation of a discrete signal $S_0$ can therefore be computed by successively decomposing $S_j$ into $S_{j+1}$ and $D_{j+1}$ for $0 \le j < J$. This representation provides information about signal approximation and detail signals at different scales.

We can easily generalize the wavelet model to a d-dimensional data space, in which the one-dimensional transform can be applied multiple times. For example, in a 2-dimensional data space, we can represent the data space as an image where each pixel of the image corresponds to one cell in the data space. Wavelet transform can be applied along the axes x and y. It decomposes an image into an average signal (LL) and three detail signals which are directionally sensitive: LH emphasizes the horizontal image features, HL the vertical features, and HH the diagonal features. Figure 4 shows the wavelet representation of the image in Figure 1 at three scales. At each level, LL is shown in the upper left quadrant, LH is shown in the upper right quadrant, HL is displayed in the lower left quadrant, and HH is in the lower right quadrant.

Figure 4: Multi-resolution wavelet representation at a) scale 1; b) scale 2; c) scale 3.
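The decomposition described above can be illustrated with a minimal sketch. It assumes the orthonormal Haar filter pair rather than the Cohen-Daubechies-Feauveau filter of Figure 2, because a 2-tap filter keeps the code short; the function names (`analyze`, `decompose_1d`, `decompose_2d`) are hypothetical and not taken from the paper.

```python
import math

# Orthonormal Haar analysis filters (an assumed choice for illustration;
# Figure 2 of the text shows a CDF(2,2) filter instead).
LOW = (1 / math.sqrt(2), 1 / math.sqrt(2))    # low-pass  (approximation)
HIGH = (1 / math.sqrt(2), -1 / math.sqrt(2))  # high-pass (detail)


def analyze(signal, taps):
    """Convolve with a 2-tap filter and keep every other output sample."""
    return [taps[0] * signal[i] + taps[1] * signal[i + 1]
            for i in range(0, len(signal) - 1, 2)]


def decompose_1d(s0, levels):
    """Return [(S_1, D_1), (S_2, D_2), ...] computed successively from S_0."""
    pyramid, approx = [], list(s0)
    for _ in range(levels):
        detail = analyze(approx, HIGH)   # D_{j+1} from S_j
        approx = analyze(approx, LOW)    # S_{j+1} from S_j
        pyramid.append((approx, detail))
    return pyramid


def decompose_2d(grid):
    """One level of the separable 2-D transform: returns LL, LH, HL, HH."""
    def rows(g, taps):
        return [analyze(r, taps) for r in g]

    def cols(g, taps):
        transposed = [list(c) for c in zip(*g)]
        return [list(r) for r in zip(*rows(transposed, taps))]

    lo, hi = rows(grid, LOW), rows(grid, HIGH)   # filter along x first
    return cols(lo, LOW), cols(lo, HIGH), cols(hi, LOW), cols(hi, HIGH)
```

On a signal such as `[0, 0, 4, 4, 4, 0, 0, 0]`, the only nonzero first-level detail coefficient sits at the pair that straddles the edge of the dense run, which is the sense in which cluster boundaries are the high frequency part of the signal. The LH/HL naming in `decompose_2d` follows the description in the text, although conventions differ between libraries.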

Useful Properties of Wavelet Transform in Clustering

The motivation for using wavelet transform and thereby finding connected components in the transformed space is drawn from the following observations.

Unsupervised Clustering: The hat-shaped filters (such as the one shown in Figure 2) emphasize regions where points cluster, but simultaneously tend to suppress weaker information in their boundary. Intuitively, dense regions in the original space act as attractors for the nearby points and at the same time as inhibitors for the points that are not close enough. This means clusters in the data automatically stand out and clear the regions around them, so that they become more distinct. It makes finding the connected components in the transformed space easier than in the original space. Figure 5 shows an example of a data space before and after the transform. This dataset contains 500,000 data points in the two clusters plus 25,000 randomly distributed noise points. As the figure shows, the clusters in the transformed space are more salient and thus easier to find.

Figure 5: a) Original data space; b) Transformed space.

Effective Removal of Noise: Noise data are the data that do not belong to any of the clusters, and usually their presence causes problems for the current clustering methods. Applying wavelet transform removes the noise in the original space, resulting in more accurate clusters. As we will show, we take advantage of the low-pass filters used in the wavelet transform to automatically remove the noise. Figure 5 shows that the majority of the noise data in the original space are removed after the transformation.

Multi-resolution: The multi-resolution property of wavelet transform can help in detecting the clusters at different levels of detail. As we showed, wavelet transform provides multiple levels of decomposition, which results in clusters at different scales from fine to coarse. The appropriate scale for choosing clusters can be decided based on the user's needs.

Cost Efficiency: Since applying wavelet transform is very fast, it makes our approach cost-effective. As will be shown later, clustering very large datasets takes only a few seconds. Using parallel processing we can get even faster responses.

WaveCluster Algorithm

Given a large set of data, the goal of the algorithm is to detect clusters and assign labels to the data based on the cluster they belong to. The four main steps of the WaveCluster algorithm are:

(1) Quantize the data space: Since we use discrete wavelet transform, the data space should be quantized before applying the transform. In quantization, each dimension $A_i$ in the d-dimensional data space is divided into $m_i$ intervals. If we assume that $m_i$ is equal to $m$ for all the dimensions, there would be $m^d$ cells in the data space. Then the corresponding cell for the data will be determined based on their attribute values. For each cell we count the number of data contained in it to represent the aggregation of the data. The number (or size) of these cells and the aggregation information in each cell are important issues that affect the performance of clustering. Because of the multi-resolution property of wavelet transform, we consider different cell sizes at different scales of the transform.

(2) Apply wavelet transform: Discrete wavelet transform is applied on the quantized data space. The d-dimensional space requires a d-dimensional wavelet transform. Based on the representation of the data space, we have different implementations for wavelet transform. Applying wavelet transform on the cells results in a new data space and hence new cells.

(3) Find the connected components at different scales: Given the set of new cells, WaveCluster then detects the connected components in the transformed data space. Each connected component is a set of cells in the transformed space and is considered as a cluster. Corresponding to each resolution $r$ of wavelet transform, there is a set of clusters $C_r$, where usually at the coarser resolutions the number of clusters is smaller. Each cluster $w$, $w \in C_r$, will have a cluster number.

(4) Map the data to clusters: WaveCluster labels the cells in each cluster in the transformed space with its cluster number. These clusters are in the transformed space and are based on wavelet coefficients. Thus, they cannot be directly used to define the clusters in the original space. WaveCluster makes a lookup table to map the cells in the transformed space to the cells in the original space. Each entry in the table specifies the relationship between one cell in the transformed space and the corresponding cell(s) of the original space. WaveCluster assigns the label of each cell in the original data space to all the data in that cell, and thus the clusters are determined.

Discussion

When the data are assigned to the cells of the quantized space at step 1 of the algorithm, the final content of the cells is independent of the order in which the objects are presented. Since WaveCluster processes these cells in the remaining steps, the algorithm is order insensitive with respect to input data.

WaveCluster finds the connected components in the average subband of the wavelet transformed space as the output clusters. As mentioned earlier, the average subband is constructed by convolving the low pass filter along each dimension and down sampling by two. So a wavelet transformed cell will be affected by the content of cells in the neighborhood covered by the filter. It means that the spatial relationships between neighboring cells will be preserved. The algorithm to find the connected components labels each cell of the transformed space with respect to the cluster that it belongs to. The label of each cell is determined based on the labels of its neighboring cells [4]. It does not make any assumptions about the shape of connected components and can find convex, concave, or nested connected components. Hence, WaveCluster can detect arbitrary shapes of clusters.

WaveCluster applies wavelet transform on the data space to generate multiple decomposition levels. Each time we consider a new decomposition level, we ignore some details in the average subband and effectively increase the size of a cell's neighborhood whose spatial relationship is considered. This results in sets of clusters with different degrees of detail after each decomposition level of wavelet transform. In other words, we will have multi-resolution clusters at different scales, from fine to coarse. In our approach, a user does not have to know the exact number of clusters.
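The four steps of the algorithm can be sketched end to end for the 2-dimensional case. This is a toy illustration under several assumptions that are not in the paper: the data lie in the unit square, the grid size `m` is even, a 2x2 block average stands in for a Haar-like low-pass filter with downsampling, cells are treated as part of a cluster when their value exceeds a hypothetical `threshold`, and neighbors are taken to be 8-connected. All function names are hypothetical.

```python
from collections import deque


def quantize(points, m, lo=0.0, hi=1.0):
    """Step 1: count the points that fall in each cell of an m x m grid."""
    grid = [[0] * m for _ in range(m)]
    cells = []
    for x, y in points:
        i = min(int((x - lo) / (hi - lo) * m), m - 1)
        j = min(int((y - lo) / (hi - lo) * m), m - 1)
        grid[i][j] += 1
        cells.append((i, j))          # remember each point's original cell
    return grid, cells


def average_subband(grid):
    """Step 2: Haar-like low-pass filtering plus downsampling by two."""
    h = len(grid) // 2
    return [[(grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
              + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(h)] for i in range(h)]


def connected_components(band, threshold):
    """Step 3: label 8-connected groups of cells whose value exceeds threshold."""
    n = len(band)
    labels, next_label = {}, 0
    for start in ((i, j) for i in range(n) for j in range(n)):
        if band[start[0]][start[1]] <= threshold or start in labels:
            continue
        next_label += 1
        labels[start] = next_label
        queue = deque([start])
        while queue:                  # breadth-first flood fill
            ci, cj = queue.popleft()
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (ci + di, cj + dj)
                    if (0 <= nb[0] < n and 0 <= nb[1] < n
                            and nb not in labels
                            and band[nb[0]][nb[1]] > threshold):
                        labels[nb] = next_label
                        queue.append(nb)
    return labels


def wavecluster_2d(points, m=8, threshold=0.5):
    """Run steps 1 to 4; returns one label per point (0 means noise)."""
    grid, cells = quantize(points, m)
    labels = connected_components(average_subband(grid), threshold)
    # Step 4: original cell (i, j) maps to transformed cell (i // 2, j // 2).
    return [labels.get((i // 2, j // 2), 0) for i, j in cells]
```

Because the labels depend only on the cell counts, shuffling `points` changes nothing except the order of the returned list, which mirrors the order-insensitivity argument in the discussion. An isolated point ends up in a transformed cell with a small averaged value and is therefore left unlabeled.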
However, a good estimate of the number of clusters helps in choosing the appropriate scale and the corresponding clusters.

One of the effects of applying a low pass filter on the feature space is the removal of noise. WaveCluster takes advantage of this property and removes the noise from the feature space automatically, without requiring extra processing time.

WaveCluster is a very fast method and, as our time complexity analysis will show, it performs very efficiently on very large databases. However, the performance of WaveCluster depends on the values of $m$ (number of intervals in each dimension) and $d$ (number of dimensions in the data space). In other words, quantization and dimensionality of the data space are two important issues in WaveCluster that we discuss below.

Quantization of the Space

All the grid-based approaches for clustering spatial data suffer from the Modifiable Areal Unit Problem (MAUP) addressed in [7]. The problem occurs in terms of scaling and aggregation. The problem of scaling is in selecting the appropriate size and number of cells to represent the data. Aggregation is the problem of summarizing the data contained in each cell. All the present grid-based algorithms suffer from these problems.

In general, when the quantization value $m$ is too low (very coarse quantization), more objects will be assigned to the same cell, and there is a higher probability that objects from different clusters belong to the same cell. We call this case the under-quantization problem. It results in merging of the clusters and mislabeling of their objects, so the quality of clustering decreases. In contrast, if the quantization value $m$ is too high (very fine quantization), each object will be in a separate cell which might be far from the other cells. We call this the over-quantization problem. Over-quantization can result in many unnecessary small clusters (that might be later removed as noise) and does not find the real clusters, so it will also decrease the quality of clustering.

Aggregation also plays a role in clustering and it depends on the kind of algorithm used for clustering. In STING each cell maintains a list of statistical attributes such as the number of objects in the cell, mean, standard deviation, min, max, and type of distribution of the values in the cell [10]. In CLIQUE, proposed by Agrawal et al., each cell is classified as dense or not based on the count value in each cell [1]. But none of these methods discusses the problems regarding aggregation. We argue that in this context, scaling is an inherent problem in what a human user can call a cluster, in other words, the definition of a cluster. As Openshaw and Taylor stated, it seems very unlikely that there will ever be either a purely statistical or mathematical solution for MAUP [7]. To have an optimal quantization, application domain information should be incorporated.
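The under-quantization and over-quantization cases described above can be made concrete with a small one-dimensional sketch. It is only an illustration: the data values and the helper name `occupied_runs` are hypothetical, and a "cluster" is simplified here to a run of adjacent nonempty cells.

```python
def occupied_runs(values, m, lo=0.0, hi=1.0):
    """Quantize 1-D values into m intervals and return runs of adjacent nonempty cells."""
    counts = [0] * m
    for v in values:
        counts[min(int((v - lo) / (hi - lo) * m), m - 1)] += 1
    runs, current = [], []
    for idx, c in enumerate(counts):
        if c > 0:
            current.append(idx)       # extend the current run of nonempty cells
        elif current:
            runs.append(current)      # an empty cell closes the run
            current = []
    if current:
        runs.append(current)
    return runs


# Two well separated groups of five points each (made-up data).
data = [0.201, 0.221, 0.241, 0.261, 0.281,
        0.601, 0.621, 0.641, 0.661, 0.681]

coarse = occupied_runs(data, m=2)     # under-quantization: the groups merge
medium = occupied_runs(data, m=10)    # the two groups are separated
fine = occupied_runs(data, m=200)     # over-quantization: every point is alone
```

With `m=2` the two groups fall into adjacent cells and appear as a single run, with `m=10` they form two runs, and with `m=200` each point occupies its own isolated cell, so ten tiny runs appear that a later noise-removal step would probably discard.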
While other existing grid-based clustering methods ignore this problem, WaveCluster has the advantage of producing clusters at multiple scales at the same time. This means that the results of WaveCluster implicitly reflect multiple quantizations of the space, resulting in multiple sets of clusters that can be selected based on the user's requirements. We may use a heuristic-based approach to experimentally find a good quantization. We can start with an over-quantized space and try to find reasonable clusters. If necessary, we then increase the size of the cells and repeat the process until we get some acceptable clusters. At this point, WaveCluster, using the multiresolution property of wavelet transform, can provide multiple sets of clusters at different scales.

Dimensionality of the Space

Assuming $m$ intervals in each of the $d$ dimensions of the data space, there would be a total of $K = m^d$ cells. Let the total number of data be $N$. Based on the dimensionality of the data space, we use two different representations for the data space. In the WaveCluster algorithm, steps 2 (applying wavelet transform) and 3 (finding connected components) will be different for these two cases, while steps 1 (quantization) and 4 (mapping) are the same.

For low-dimensional spaces, we represent the space using a multi-dimensional matrix. It is a fast method which is simple to implement. This method is appropriate for large databases when $N \ge K$. As an example, for a database with 1,000,000 objects, this condition holds when the number of dimensions $d$ is less than or equal to 6 and the number of intervals $m$ is 10, since then $K = m^d \le 10^6$. We present this approach in Section 3.

However, for a high number of dimensions we may have $N < K$, and the time and space complexity of the matrix representation grow exponentially with $d$. We use the sparseness of data in such spaces and represent the data space using a hash-based data structure. But applying wavelet transform and finding connected components on this representation are nontrivial problems, which we address in Section 4.

3 Clustering in Low-dimensional Space

Data Space Representation

For low-dimensional data spaces, we can represent the space using a multi-dimensional matrix. Each element of the matrix corresponds to one cell in the quantized space. It provides a simple and fast method to access the information of the neighboring cells of each cell. This information is required to apply wavelet transform or to find the connected components.

Applying Wavelet Transform

Applying wavelet transform on the multi-dimensional matrix is straightforward. Convolution with the filters can be easily done, resulting in subbands at different scales. In our experiments, we applied the three-level wavelet transforms Haar, Daubechies, and Cohen-Daubechies-Feauveau ((4,2) and (2,2)). Average subbands give approximations of the original data space at different scales, which help in finding clusters at different levels of detail.
For example, as shown in Figure 4, for a 2-dimensional data space, the subbands LL show the clusters at different scales.

Finding Connected Components

We use the algorithm in [4] to find the connected components in the 2-dimensional data space (image). The same concept can be generalized for higher dimensions. The label of each cell is specified based on the labels of its neighboring cells. The connected component analysis consists of scanning through the image once to find all the connected components, followed by an equivalence analysis to re-label the components. This takes care of components with holes and concave shapes. There are many well known algorithms for finding connected components in images, and we used the one mentioned in [4] for our purpose.

Examples

In Section 2, we showed how WaveCluster handles very large datasets (525,000 objects in the data presented in Figure 5) and how it can remove the noise. We also showed in Figure 4 how it can represent the clusters at different levels of detail. Data mining methods should be capable of handling arbitrarily shaped clusters. Figures 6-a and 6-b show a dataset and its clustering using WaveCluster. There are 2 arbitrarily shaped clusters in the original data, which are correctly detected. This result emphasizes the effectiveness of methods which do not assume the shape of the clusters a priori. Figure 6-c shows an example of a concave shape data distribution. Figure 6-d presents the clustering produced by WaveCluster. From these results, it is evident that WaveCluster is also very powerful in handling any type of sophisticated patterns.

Figure 6: a) Original space; b) WaveCluster results; c) Original space; d) WaveCluster results.

WaveCluster is a very fast method and most of its time is spent in reading the input data. For example, for the datasets presented in Figure 6 (with more than 200,000 objects), it only took 14.5 seconds to cluster (if we apply a 512x512 quantization). About 11 seconds of this time was spent in reading and quantization, and only 3.5 seconds was required for the real processing. We performed our experiments on a SUN SPARC workstation using a 168 MHz UltraSparc CPU with the SunOS operating system and 1024 MB memory.

Time Complexity

The time complexity of the first and last steps of the WaveCluster algorithm is $O(N)$, because they scan all the database objects. Assuming $m$ cells in each dimension of the feature space, there would be $K = m^d$ cells. The complexity of applying wavelet transform would be $O(K)$. To find the connected components, the required time is $O(K)$. Thus the time complexity of processing the data (without considering I/O), which is performed in steps 2 and 3, would in fact be $O(K)$. Since we assume that $N \ge K$, the overall time complexity of the algorithm is $O(N)$ [8].

While applying wavelet transform on each dimension of the data space, the required operations for each cell can be carried out independently of the other cells. Thus, using parallel processing can speed up transforming the space. The connected component analysis can also be sped up using parallel processing.

4 Clustering in High-dimensional Space

In high-dimensional space, it is expected that data will be sparse and most of the cells in the quantized space will be empty. An efficient way of storing only the nonempty cells in the quantized space is expected to drastically reduce the space complexity. We use a hash table approach to keep track of the nonempty cells only. The main idea is to efficiently represent high-dimensional data in limited memory and perform wavelet transform as well as connected component analysis on this representation. But performing a convolution operation such as wavelet transform on this representation is a nontrivial problem. We present an accumulative approach to calculate wavelet transform in high dimensional space.
Finally, we find k-connected components in high dimensional space by a depth-first search through the hash table.

Data Space Representation

In the quantized space, every cell $c_i$ has the form $\langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$, which is called the key or index for $c_i$, where $c_{ij} = [l_{ij}, h_{ij})$ is the right open interval in the partitioning of dimension $A_j$. The address of a cell in the hash table can be calculated by applying an appropriate hash function on the index of the cell. A hash table requires much less storage than a direct-address table, which was used in [8]. Specifically, the storage requirement can be reduced from $O(m^d)$ to $\Theta(N' d)$, where $N'$ is the number of nonempty cells in the quantized feature space. With hashing, a cell $c_i = \langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$ is stored in the hash bucket $h(c_i)$; that is, a hash function $h$ is used to compute the address for the cell $c_i$. Formally, the hash function $h$ maps the universe $U$ of $c_i = \langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$ into the entries in the hash table $T[0 \ldots n-1]$, where $n$ is the number of buckets in the hash table. That is,

$$h : U \to \{0, 1, \ldots, n-1\}.$$

Designing a hash function. In our approach, since hashing is performed frequently, the time spent on hashing directly affects efficiency. Also, both applying wavelet transform and finding connected components require the neighborhood information, that is, the ability to locate the neighbor cells of a given cell. However, hashing permits any element to be mapped into any of the hash table buckets. Thus, it introduces the problem of determining or locating the neighbors of a cell. Another issue is the collision problem, in which two or more cells may be hashed into the same bucket. Our goal in designing the hash function is to achieve efficiency, easy computing of neighbor cells, and minimal collision.

We now define our hash function. We randomly generated an integer matrix $A_{l \times d}$, where $l = \log_2 n$ and $n$ is the number of hash table buckets, as follows:

$$A_{l \times d} = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,d} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ r_{l,1} & r_{l,2} & \cdots & r_{l,d} \end{pmatrix}.$$

For each key $c_i = \langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$, the hash function $h(c_{i1}, c_{i2}, \ldots, c_{id})$ is:

$$h(c_{i1}, c_{i2}, \ldots, c_{id}) = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,d} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ r_{l,1} & r_{l,2} & \cdots & r_{l,d} \end{pmatrix} \bigodot \begin{pmatrix} c_{i1} \\ c_{i2} \\ \vdots \\ c_{id} \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_l \end{pmatrix}, \qquad (2)$$

where $\bigodot$ is equivalent to matrix multiplication in binary operations and is defined in [11].
The result $z = \langle z_1, z_2, \ldots, z_l \rangle$ will be a string of 0s and 1s, which is the address of the entry where cell $c_i$ is located in the hash table. In [11], we gave a detailed definition of the hash function. It can be proved that the hash function given in Equation 2 maps a cell into any hash bucket with equal probability, which reduces collision. We further resolve collision by extended queuing with each bucket.
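The exact definition of the operator in Equation 2 is given in [11] and is not reproduced here, so the following is only one plausible reading, stated as an assumption: each output bit $z_k$ is taken to be the parity of the bitwise AND between the entries of row $k$ and the key components, that is, a matrix product over GF(2) on the bit patterns. The function name `make_hash`, the `key_bits` width and the `seed` parameter are hypothetical.

```python
import random


def make_hash(d, n_buckets, key_bits=16, seed=0):
    """Build h: (c_1, ..., c_d) -> bucket address in [0, n_buckets)."""
    l = n_buckets.bit_length() - 1            # l = log2(n), n a power of two
    assert 1 << l == n_buckets, "n_buckets must be a power of two"
    rng = random.Random(seed)
    # A is an l x d matrix of random integers, one row per output bit.
    A = [[rng.getrandbits(key_bits) for _ in range(d)] for _ in range(l)]

    def h(cell):
        address = 0
        for row in A:
            bit = 0
            for r, c in zip(row, cell):
                # Assumed reading of the operator: GF(2) inner product
                # of the bit patterns of r and c (parity of r AND c).
                bit ^= bin(r & c).count("1") & 1
            address = (address << 1) | bit    # append z_k to the address
        return address

    return h


h = make_hash(d=3, n_buckets=16)
bucket = h((4, 7, 1))                          # address of cell <4, 7, 1>
```

Under this reading the map is linear over GF(2), so the address of a componentwise XOR of two keys is the XOR of their addresses. Whether the paper's function has the same property depends on the definition in [11].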

Calculating Wavelet Transform on Hashed Feature Space

The wavelet transform is applied to the hashed representation of the quantized space to generate a new hash table consisting of only the significant cells (see the definition later). By scanning through the hash table and convolving the filter given in Equation 2 with each cell and its neighbors, we generate new cells in the transformed space. For any nonempty cell $c_i = \langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$, the cells that contribute to its value in the transformed space along dimension $A_j$ are $c_k = \langle c_{i1}, c_{i2}, \ldots, c_{ij} + k, \ldots, c_{id} \rangle$, where $-\frac{M}{2} \le k \le \frac{M}{2}$ and $M$ is the width of the wavelet filter. All the cells stored in the hash table get new values after the wavelet transform is applied (see Figure 7). Also, because of the convolution operation in the wavelet transform, some of the previously empty cells become nonempty by receiving contributions from their neighboring cells. We call each potential nonempty cell a receiver and each old nonempty cell a contributor.

In the traditional implementation of the wavelet transform, each receiver knows which cells to ask for contributions. Thus, the algorithm scans through all the potential receivers. In a multidimensional array implementation, every cell is considered a potential receiver, so the algorithm has to scan through the entire space of cells. This should be avoided in the high-dimensional case because the number of cells grows exponentially with the dimension. In the hashed implementation, however, we only have information about the cells that are nonempty, so there are many potential receivers whose hashed location is unknown. Therefore, it is not possible to use traditional scanning algorithms directly on the hashed quantized feature space.

[Figure 7: Traditional approach of calculating wavelet transform (original table and transformed table).]

In our approach, each contributor knows its receivers, and the addresses of the receivers can be calculated by using the hash function in time $O(l)$. So, instead of receivers asking for values, contributors distribute values to receivers. Thus, it is sufficient to just scan through the contributors
which are already saved in the hash table. With slight modifications, we can rewrite Formula 1 as:

$$\hat{s}_{i+j} = c_{\frac{M}{2}-j}\, s_i + \sum_{k=0}^{\frac{M}{2}-j-1} c_k\, s_{i+j-\frac{M}{2}+k} + \sum_{k=\frac{M}{2}-j+1}^{M-1} c_k\, s_{i+j-\frac{M}{2}+k}, \qquad (3)$$

where $-M/2 \le j < M/2$. Using this formula while scanning the hash table, each old nonempty cell (contributor) is multiplied by a coefficient, and the result is accumulated into its receiver cells, which are hashed into a new table (see Figure 8).

[Figure 8: Accumulative approach of calculating wavelet transform (original hash table and transformed table).]

Because new nonempty cells are generated, the number of cells in the new hash table increases after the wavelet transform is applied. In many cases, a large number of the new nonempty cells have very small count values. Many of these low count values are expected to be caused by outliers rather than by actual clusters. Also, the actual cluster shapes are distorted at their surfaces because of the directionality of the convolution operation used in the wavelet transform. Removing low-count cells by applying a threshold to the count values effectively removes the majority of the outliers and helps preserve the original shape of the clusters. In addition, the reduction in the number of cells in the hash table is expected to improve the running time of the algorithm.

We define a significant cell as a cell whose count value is greater than a particular threshold. In the new hash table constructed after applying the wavelet transform to the original hash table, only the significant cells are stored. The threshold plays an important role in the quality of clustering and outlier removal. The details of determining the threshold can be found in [11].
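The contributor-driven accumulation and the thresholding step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a Python dict keyed by integer index tuples stands in for the hash table and hash function, that the filter is an odd-length list of taps centered on its middle entry, and that no downsampling is performed. The names `scatter_transform`, `taps`, `grid_size`, and `threshold` are hypothetical.

```python
from collections import defaultdict


def scatter_transform(cells, taps, grid_size, threshold):
    """Push each nonempty cell's value to its receivers, one dimension at a time.

    cells:     dict mapping integer index tuples to counts (nonempty cells only);
               the dict is an assumed stand-in for the paper's hash table.
    taps:      odd-length list of filter coefficients, centered on the middle tap.
    grid_size: number of cells per dimension; receivers outside the grid are dropped.
    threshold: only cells with a value strictly greater than this are kept.
    """
    if not cells:
        return {}
    half = len(taps) // 2
    dims = len(next(iter(cells)))
    current = dict(cells)
    for dim in range(dims):
        out = defaultdict(float)
        # Scan contributors only; empty cells are never visited.
        for cell, value in current.items():
            for offset in range(-half, half + 1):
                index = cell[dim] + offset
                if not 0 <= index < grid_size:
                    continue
                receiver = cell[:dim] + (index,) + cell[dim + 1:]
                out[receiver] += taps[half + offset] * value
        current = out
    # Keep only the significant cells.
    return {cell: v for cell, v in current.items() if v > threshold}


if __name__ == "__main__":
    table = {(1, 1): 4.0}
    print(scatter_transform(table, [0.25, 0.5, 0.25], grid_size=3, threshold=0.5))
```

The work per pass is proportional to the number of nonempty cells times the filter width, rather than to the total number of grid cells, which is the point of letting contributors distribute values.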

Finding Connected Components in Hash Table

The hash table is essentially a graph $G = (V, E)$, where $V = \{c_i \mid c_i \text{ is a significant cell in the transformed space}\}$ and $E = \{(c_1, c_2) \mid D(c_1, c_2) \le \varepsilon = 1\}$. Here, we define the distance $D$ as the city-block distance, or $\|l_1\|$ metric:

$$D_{\|l_1\|}(c_1, c_2) = \sum_{i=1}^{d} |c_{1i} - c_{2i}|.$$

There is an edge between two cells if and only if their indices differ by one in exactly one dimension. So every significant cell has at most $2d$ neighbors. Let $c_j = \langle c_{j1}, c_{j2}, \ldots, c_{jk}, \ldots, c_{jd} \rangle$ be a cell and let $c'_j = \langle c_{j1}, c_{j2}, \ldots, c'_{jk}, \ldots, c_{jd} \rangle$ be a neighbor of $c_j$, where $|c_{jk} - c'_{jk}| = 1$. The hashed index value of $c'_j$ can be computed using the following formula¹:

$$h(c'_j) = h(c_j) \oplus \begin{pmatrix} \#(c_{jk} \wedge r_{1,k}) \\ \#(c_{jk} \wedge r_{2,k}) \\ \vdots \\ \#(c_{jk} \wedge r_{d,k}) \end{pmatrix} \oplus \begin{pmatrix} \#(c'_{jk} \wedge r_{1,k}) \\ \#(c'_{jk} \wedge r_{2,k}) \\ \vdots \\ \#(c'_{jk} \wedge r_{d,k}) \end{pmatrix}, \qquad (4)$$

where the operator $\oplus$ is defined to be bitwise exclusive OR.

Thus, given the index of a cell, the bucket number of each neighboring cell can be computed by the hash function in Equation 4. The clusters are the connected components of graph $G$, which can be found by a depth-first-search algorithm. WaveCluster starts from the first bucket in the hash table, assigns it the first cluster number, and searches the cells it is connected to. It continues to scan the hash table until all the cells are visited.

Example

We use parallel coordinates to visualize the clusters in high-dimensional space. On the plane with xy-Cartesian coordinates, and starting on the y-axis, $d$ parallel vertical lines are placed equidistant from one another and perpendicular to the x-axis. They all have the same positive orientation as the y-axis. The values on each of the $d$ axes that correspond to an individual point in the data space are connected by line segments between successive vertical axes, resulting in a polygonal line. Figure 9-a,b show two points C(3, 3, 1) and D(2, 2, 3) in the 3-dimensional space and their parallel coordinates.
Figure 9-c shows the parallel coordinates for a point $P = (v_1, v_2, \ldots, v_{d-1}, v_d)$ in d-dimensional space.

[Figure 9: a) 3-D space; b) 3-D parallel coordinates; c) d-dimensional parallel coordinates.]

Figure 10-a shows the parallel coordinate representation of a 12-dimensional dataset. This dataset has 50,000 objects (including 10% noise data), which are grouped into 9 clusters. WaveCluster's results in Figure 10-b show that it has detected all 9 clusters correctly and has removed the noise data. The clusters are color-coded; however, some of the clusters (which are not neighbors) may have the same color. Our analysis showed that about 95% of the data were correctly clustered by WaveCluster.

[Figure 10: a) Original data space; b) Clustering results.]

¹ See [11] for the proof.
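The connected-component search that produces such clusterings can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code. A Python set of index tuples stands in for the hash table of significant cells, so set membership replaces the bucket lookup of Equation 4. The depth-first search is written iteratively with an explicit stack. The function name `label_clusters` is hypothetical.

```python
def label_clusters(significant_cells):
    """Map each significant cell (tuple of ints) to a cluster number starting at 1.

    Two cells are adjacent when their city-block distance is 1, that is, when
    they differ by one in exactly one dimension (at most 2d neighbors per cell).
    """
    cells = set(significant_cells)  # assumed stand-in for the hash table
    labels = {}
    next_label = 0
    # Sorting is only for deterministic cluster numbering in this sketch.
    for start in sorted(cells):
        if start in labels:
            continue
        next_label += 1
        labels[start] = next_label
        stack = [start]
        while stack:  # iterative depth-first search
            cell = stack.pop()
            for dim in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if neighbor in cells and neighbor not in labels:
                        labels[neighbor] = next_label
                        stack.append(neighbor)
    return labels


if __name__ == "__main__":
    demo = [(0, 0), (0, 1), (1, 1), (3, 3), (3, 4), (5, 5)]
    print(label_clusters(demo))
```

Each cell is pushed onto the stack at most once and inspects 2d candidate neighbors, so the search is linear in the number of significant cells for a fixed dimension d.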

Time Complexity

It can be proved that, by introducing the hash data structure to represent the dataset, we can cluster d-dimensional data with a time complexity of $O(N d \log n)$. A detailed analysis can be found in [11].

5 Conclusion

In this article, we presented the wavelet-based clustering approach for both low-dimensional and high-dimensional datasets. This grid-based approach applies the wavelet transform to the quantized feature space and then detects the dense regions in the transformed space. Applying the wavelet transform makes the clusters more distinct and salient in the transformed space and thus eases their detection. Using the multiresolution property of the wavelet transform, WaveCluster can detect clusters at different scales and levels of detail, which can be very useful in the user's applications. Moreover, applying the wavelet transform removes noise from the original feature space, so the approach handles outliers properly and finds more accurate clusters. Our approach does not make any assumption about the shape of clusters and can successfully detect arbitrarily shaped clusters, such as concave or nested clusters. It is also a very efficient method, which makes it especially attractive for very large databases. The approach is insensitive to the order in which the input data are processed. Current clustering techniques do not address these issues sufficiently, although considerable work has been done in addressing each issue separately. This approach is the first attempt to apply the properties of the wavelet transform to the clustering problem in spatial data mining. It is a clever, yet natural, application of wavelets with spectacular end results.

References

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 94-105, Seattle, WA.
[2] M. Ester, H. Kriegel, J. Sander, and X. Xu.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of 2nd International Conference on KDD.
[3] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 73-84, Seattle, WA.
[4] Berthold Klaus Paul Horn. Robot Vision. The MIT Press, fourth edition.
[5] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674-693, July.
[6] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th VLDB Conference, pages 144-155, Santiago, Chile.
[7] S. Openshaw and P. Taylor. Quantitative Geography: A British View, chapter The Modifiable Areal Unit Problem, pages 60-69. London: Routledge.
[8] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In Proceedings of the 24th VLDB Conference, pages 428-439, New York City, August.
[9] Greet Uytterhoeven, Dirk Roose, and Adhemar Bultheel. Wavelet transforms using lifting scheme. Technical Report ITA-Wavelets Report WP 1.1, Katholieke Universiteit Leuven, Department of Computer Science, Belgium, April.
[10] Wei Wang, Jiong Yang, and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of the 23rd VLDB Conference, pages 186-195, Athens, Greece.
[11] D. Yu, S. Chatterjee, G. Sheikholeslami, and A. Zhang. Efficiently detecting arbitrary shaped clusters in very large datasets with high dimensions. Technical Report 98-8, State University of New York at Buffalo, Department of Computer Science and Engineering, November.
[12] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada.