Channel: Question and Answer » k-means

Clustering Analysis for large data in R


I am trying to perform a clustering analysis on a CSV file with 50k+ rows and 10 columns. I tried k-means, hierarchical and model-based clustering methods; only k-means works because of the large data set, but it does not show obvious differentiation between the clusters. So I am wondering whether there is any other way to perform the clustering analysis better. Thanks in advance!

The data looks like this

Revenue  Employee  Longitude Latitude  LocalEmployee BooleanQuestions ...
1000     100       xxxx      xxxx      10
...                                                                   ...

Here is part of my code:

library(cluster)   # for clusplot()

mydata <- scale(mydata)   # standardise the columns

# Elbow plot: total within-cluster sum of squares for k = 1..15
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)
plot(1:15, wss, type = "b", main = "15 clusters",
     xlab = "number of clusters", ylab = "within-cluster sum of squares")

fit <- kmeans(mydata, 7)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

[clusplot output image]


K-means clustering algorithm on 1D data


I’m really confused about the steps for performing the k-means clustering algorithm on one-dimensional data. So suppose I have the following array of data, which should be clustered into two groups:

data = [40, 20, 30, 10, 22, 94, 66];

I have read the following site and it helped me get an idea of how to approach it, but I’m still a little unsure.
http://www.macwright.org/2012/09/16/k-means.html

My approach is:

  • I would first calculate the mean of the entire dataset.
  • Then I would calculate the Euclidean distance between each point and that mean.
  • Then I would split the points into two groups: one containing the points with the shortest distances to the mean, and the other containing the points that are not so close.

My question is: are these steps correct, and how would I perform k-means clustering on this dataset if k > 2? I feel like my thinking is flawed; any help would be greatly appreciated.
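For comparison, here is a minimal sketch of the standard Lloyd iteration on this 1D example: pick k starting centers, assign each point to its nearest center, recompute each center as the mean of its own cluster, and repeat. It is plain Python written purely for illustration (the function name is mine, not from the linked article):

import random

data = [40, 20, 30, 10, 22, 94, 66]

def kmeans_1d(points, k, iterations=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)               # initial centers: k distinct data points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                   # no change: converged
            break
        centers = new_centers
    return centers, clusters

print(kmeans_1d(data, k=2))

Note that the mean of the entire dataset never appears in the procedure; only the per-cluster means do, and the same two steps work unchanged for any k.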

Find the optimal number of clusters in a large dataset using R


I’ve got a dataset on which I did a PCA. I want to run k-means on the individuals’ coordinates on the first 5 principal components, so I have a 200000 x 5 matrix of coordinates. I’m looking for a way to determine the optimal number of clusters so that I can then run k-means on these coordinates using R. I found many methods for doing this in R (here is a list: Cluster analysis in R: determine the optimal number of clusters), but none of them has worked for me because my data is too large; I get an error like “negative length vectors are not allowed”. I really need help with this, because I shouldn’t be the one deciding how many clusters to use; I have to let the statistics decide. Thank you very much.
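One common workaround, sketched below in Python with scikit-learn purely as an illustration (it is not the asker’s R setup, and `coords` is a placeholder standing in for the 200000 x 5 matrix of PCA coordinates), is to evaluate the criterion on a random subsample, optionally with a mini-batch variant of k-means so that each candidate number of clusters stays cheap:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
coords = rng.normal(size=(200_000, 5))          # placeholder for the PCA coordinates

# Work on a random subsample so the index computations stay tractable.
subsample = coords[rng.choice(len(coords), size=20_000, replace=False)]

# Total within-cluster sum of squares for a range of k: look for an "elbow".
for k in range(2, 16):
    km = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0).fit(subsample)
    print(k, km.inertia_)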

k means clustering on sales geolocation data


I have geolocation data (lat and long) per customer per online purchase, and my end goal is to identify common locations per purchase per customer. (basically to see what people typically buy when they are at home, vs what they buy when at work etc)

As a start I wanted to group the latitude longitude pairs per customer into a ‘home’ set, a ‘work’ set etc, and then I can link the purchases to each area set.

So to cluster the data pairs (and ultimately define my ‘sets’), I had initially thought k-means clustering would help, but I have a different amount of geolocation data per general area per customer.
(What I mean is: for one customer I have (LATITUDE, LONGITUDE) = (-25.756124, 28.23253), call this ‘Location A’, plus 3 other pairs near ‘Location A’, and then at ‘Location B’ I will have 50 pairs around ‘Location B’. This is what makes me think that k-means clustering might not be the best choice.)

Can someone please send me on the right track?
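For what it’s worth, the per-customer grouping described above can be prototyped in a few lines. The sketch below (Python with pandas and scikit-learn, with hypothetical column names `customer_id`, `lat`, `lon`) runs a separate k-means per customer with a fixed number of location sets, which is exactly the assumption the question is doubting:

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2],
    "lat": [-25.7561, -25.7563, -25.7560, -25.8001, -26.1000, -26.1002, -26.0999],
    "lon": [28.2325, 28.2326, 28.2324, 28.3000, 28.0000, 28.0001, 27.9998],
})

labelled = []
for cust, grp in purchases.groupby("customer_id"):
    k = min(2, len(grp))                         # fixed number of location sets per customer
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(grp[["lat", "lon"]])
    labelled.append(grp.assign(location_set=km.labels_))

print(pd.concat(labelled))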

K-Means Clustering Not Working As Expected


I have a script that I’m testing in Python 3 with scikit-learn to cluster terms based on either words or character n-grams. Basically, it’s fed a list of training data with corresponding labels. For example:

Name            Label
mexican food    1
greek cuisine   1
hotel night     7
...
airfare         7

After I run the program I type in raw input which should be transformed and predicted. However, no matter what I put in, the program makes the same prediction. This happens even if I enter a term such as ‘mexican’, which appears only once in the training data and hence should be trivial to predict. Can anyone spot the issue?

from __future__ import print_function

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--analyzer",
              type='str', default='word',
              help="Which analyzer to use. Valid options are 'word' and 'char'")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

print(__doc__)
op.print_help()

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

opts.analyzer = opts.analyzer.lower()
assert opts.analyzer in ['word','char']

###############################################################################
# Read in the data
inputfile = '../data/dodcategories.csv'
data = np.loadtxt(inputfile,dtype=[('type','|S16'),('subID',np.int),('ID',np.int)],delimiter='\t',skiprows=0,unpack=True)
X = np.array([str(item,'utf-8').lower() for item in data[0]])
labels = np.array(data[1])
true_k = np.unique(labels).shape[0]


print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', non_negative=True,
                                   norm=None, ngram_range=(1, 10), binary=False, analyzer=opts.analyzer)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       non_negative=False, norm='l2',
                                       binary=False, ngram_range=(1, 10), analyzer=opts.analyzer)
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf, ngram_range=(1, 10),analyzer=opts.analyzer)
X = vectorizer.fit_transform(X)

print('------------------------------------------------')
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    lsa = make_pipeline(svd, Normalizer(copy=False))

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()


###############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X,labels)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, sample_size=1000))

print()

if not (opts.n_components or opts.use_hashing):
    #print("Top terms per cluster:")
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    #for i in range(true_k):
        #print("Cluster %d:" % i, end='')
        #for ind in order_centroids[i, :10]:
            #print(' %s' % terms[ind], end='')
        #print()

test = 'test';   
while test.lower() not in ['exit','',None]:
    test = input("Enter a category (Type exit to quit): ")
    X_test = [test.lower()]
    print("Test: {}".format(X_test))
    X_test = vectorizer.transform(X_test)
    print("Test: {}".format(X_test))
    result = km.predict(X_test)
    print("Result: {}".format(result))

Estimating number of clusters using Gap Statistics


Since my application works on streaming data, I chose to use BIRCH to create clusters. BIRCH doesn’t produce high-quality results on its own, so it requires a “global clustering” step to improve the output clusters. Global clustering is often performed using agglomerative clustering or k-means.

I am trying to use the BIRCH clustering results as input to the gap statistic in order to estimate the number of clusters (K), which would then be the input for k-means as the global step in BIRCH.

Instead of the whole dataset, I am feeding the gap statistic with the BIRCH subcluster centers as a new dataset. I am also testing this approach with the Pham method, which seems to give better results than the gap statistic.

One of the datasets I am using for testing is from the sklearn BIRCH example: 100K points around 100 centers. In Fig 1, the Pham method correctly guessed the number of clusters in this dataset (BIRCH produced 148 subclusters; the centers of those 148 subclusters were the input points for Pham).

Fig 1.

Using the gap statistic I always get K = 1 as a result. Following this post I tried changing the scale, but I am still unable to get good results. The results and the dataset are shown in Fig 2 (the dataset is again made up of the subcluster centers produced by BIRCH).

Fig 2.

Do you have any suggestions for how I can improve the results of the gap statistic?
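For reference, here is a minimal, self-contained sketch of the gap statistic itself (Python with scikit-learn; the function and the placeholder `centers` array are mine, not from the question) that can be pointed at the BIRCH subcluster centers. The reference datasets are drawn uniformly over the bounding box of the input, which is the simpler of the two reference choices in Tibshirani et al.:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, random_state=0):
    # Returns gap[k] and s[k] for k = 1..k_max.
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, sks = {}, {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        log_wk = np.log(km.inertia_)             # within-cluster dispersion of the real data
        # Dispersion of k-means on reference data drawn uniformly over the bounding box.
        ref_logs = []
        for _ in range(n_refs):
            ref = rng.uniform(mins, maxs, size=X.shape)
            ref_logs.append(np.log(KMeans(n_clusters=k, n_init=10,
                                          random_state=random_state).fit(ref).inertia_))
        ref_logs = np.array(ref_logs)
        gaps[k] = ref_logs.mean() - log_wk
        sks[k] = ref_logs.std() * np.sqrt(1 + 1 / n_refs)
    return gaps, sks

# Usage on the subcluster centers (here `centers` is just a placeholder array):
centers = np.random.default_rng(1).normal(size=(148, 2))
gaps, sks = gap_statistic(centers, k_max=8)
best_k = next((k for k in range(1, 8) if gaps[k] >= gaps[k + 1] - sks[k + 1]), 8)
print(gaps, best_k)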

Proof of convergence of k-means


For an assignment I’ve been asked to provide a proof that k-means converges in a finite number of steps.

This is what I wrote:

In the following, $C$ is a collection of all the cluster centres. Define an “energy” function
$$E(C)=\sum_{\mathbf{x}}\min_{i=1}^{k}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert ^{2}.$$
The energy function is nonnegative. We see that steps (2) and (3) of the algorithm both reduce the energy. Since the energy is bounded from below and is constantly being reduced, it must converge to a local minimum. Iteration can be stopped when $E(C)$ changes at a rate below a certain threshold.

Step 2 refers to the step which labels each data point by its closest cluster centre, and step 3 is the step where the centres are updated by taking a mean.

This is not sufficient to prove convergence in a finite number of steps. The energy can keep getting smaller, but that doesn’t rule out the possibility that the centre points jump around without the energy changing much. In other words, there might be multiple energy minima and the algorithm could keep jumping between them, no?
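For what it’s worth, the standard way to close this gap (sketched here as an aside, not part of the original question, and assuming ties are broken consistently) is to argue over assignments rather than over energy values alone. Each iteration is determined by the assignment of the $n$ points to the $k$ clusters, and there are at most $k^{n}$ such assignments. Neither step increases $E$, and if step (2) changes the assignment at all then $E$ strictly decreases, because some point moves to a strictly nearer centre; hence no assignment can ever recur. A sequence of pairwise distinct assignments drawn from a finite set must be finite, so the algorithm reaches a fixed assignment, and therefore terminates, after at most $k^{n}$ iterations.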

Using clustering for unsupervised classification (visualizing k-means cluster centers)


I know that the cluster centroid is the middle of a cluster. It’s a vector containing one number for each variable, where each number is the mean of a variable for the observations in that cluster.

I cluster my dataset (MNIST handwritten digits) using k-means into 3, 5 and 10 clusters. My question is: which characteristics of the data are captured by the centroids?

Plotting the centroids as images, I can see that with 3 clusters the centroids are not well defined. For example, the digits 3 and 7 overlap, as you can see in the image; the same thing happens with the digits 4 and 5.
With 10 clusters the centroids are better defined (as you can see in the image), but there are some repetitions for certain values (e.g. 2 centroids for 3 and 4). Why is this happening?

Centroid with 3 clusters

Centroid for 4 with 10 clusters (there are 2 similar centroids)
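As a side note, the “centroid as image” view described above can be reproduced with a short scikit-learn sketch. The loader and figure layout below are illustrative assumptions (the built-in 8x8 digits dataset stands in for full 28x28 MNIST):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data                                   # shape (n_samples, 64)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Each centroid is a 64-dimensional mean vector; reshape it back into an 8x8 image.
fig, axes = plt.subplots(1, 10, figsize=(12, 2))
for ax, center in zip(axes, km.cluster_centers_):
    ax.imshow(center.reshape(8, 8), cmap="gray")
    ax.axis("off")
plt.show()

Each picture is literally the pixel-wise average of the images assigned to that cluster, which is why visually similar digits can blur together in a single centroid when k is small.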


Why is K-Means++ SLOWER than random initialization K-Means?


K-Means is an iterative clustering method which randomly assigns initial centroids and shifts them to minimize the sum of squares. One problem is that, because the centroids are initially random, a bad starting position could cause the algorithm to converge at a local optimum.

K-Means++ was designed to combat this – It chooses the initial centroids using a weighted method which makes it more likely that points further away will be chosen as the initial centroids. The idea is that while initialization is more complex and will take longer, the centroids will be more accurate and thus fewer iterations are needed, hence it will reduce overall time. (Source)

In fact, the people who devised K-Means++ tested how fast it could cluster data, and found that it was twice as fast. (Source)

However, some basic tests in R show that K-Means++ requiring fewer iterations than K-Means does not make up for the extra time taken to initialize, even for normal sized datasets.

The test:

I tested it with datasets sized from 100 to a few thousand points. The data is named ‘comp’ in the code. If you want to test it, you can use whatever dataset you want.

K-Means

First, we do the clustering:

k <- kmeans(comp, 2, nstart=1, iter.max=100, algorithm = "Lloyd")

Next, the sum of squares can be added to a list

results <- k$withinss

Now, if we put the clustering in a loop which appends to the ‘results’ list on every pass, we can see how many times it loops in a given amount of time:

repeat {
  k <- kmeans(comp, 2, nstart=1, iter.max=10, algorithm = "Lloyd")
  results <- c(results, k$withinss)
}

(I did it in this inefficient way because I initially used this code to test accuracy, i.e. which method had the lower average total sum of squares.)

If we let the loop run for 60 seconds, we find that the list is 132,482 objects long (it looped 66,241 times, since each pass adds two objects to the list).

K-Means++

Now compare that with ++ initialisation.

#Set-up
library(LICORS)
k <- kmeanspp(comp, k = 2, start = "random", iter.max = 100, nstart = 1)

results <- k$withinss

#Loop
repeat {
  k <- kmeanspp(comp, k = 2, start = "random", iter.max = 100, nstart = 1)
  results <- c(results, k$withinss)
}

The ‘results’ list ended up having 22712 objects (it looped 11356 times).

K-Means was over 5 times faster than K-Means++, so this is clearly not just a measurement error. The ratio changes depending on the dataset I use for the test, but I’ve tried everything up to datasets with thousands of points, and the results consistently show that K-Means++ is slower.

My first thought was that maybe the package I used (LICORS) has inefficient code for performing k-means, but then I saw that LICORS actually uses the default kmeans function after the ++ initialization. In other words, everything was the same except for the method of initialization, and ++ was slower. (Another package for k-means++, called flexclust, which uses different code, was even slower!)

Perhaps the dataset needs to have tens of thousands of points before k-means++ is faster? In that case, it would be very misleading for every source I’ve seen to say that k-means++ is faster. Perhaps I’ve misunderstood something, or there’s something wrong with the test?
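For anyone who wants to cross-check this outside R, a rough equivalent in Python with scikit-learn is sketched below (illustrative only: scikit-learn’s implementation differs from kmeans/LICORS, so timings are not directly comparable, and `comp` is a placeholder dataset):

from time import time

import numpy as np
from sklearn.cluster import KMeans

comp = np.random.default_rng(0).normal(size=(2_000, 2))   # stand-in for the 'comp' data

def runs_per_minute(init):
    count, start = 0, time()
    while time() - start < 60:                   # same 60-second budget as the R test
        KMeans(n_clusters=2, init=init, n_init=1, max_iter=100).fit(comp)
        count += 1
    return count

print("random    :", runs_per_minute("random"))
print("k-means++ :", runs_per_minute("k-means++"))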

Can any experts here (such as Tim, who claims here that K-Means++ is faster) explain these results? Thanks for reading.

grouping a bunch of articles [closed]


I have 4 soccer teams: Barcelona, Real Madrid, Arsenal and Manchester United. I have to use a clustering algorithm (k-means) to group a bunch of articles according to these teams. I have to go through each article and search for key words like Spain, Messi, Ronaldo, etc., and then score them. How should I score these key words so that the clustering algorithm works properly?

A question on cosine similarity & k-means


I used the following code to perform clustering of a dataset in R.

distMatrix1 <- dist(sample2, method="cosine")
km<-kmeans(distMatrix1,3)

I have got some questions:

  1. When the distance matrix is created it is an N*N matrix; is it the average of each row that is fed to the kmeans function in R?
  2. How are the cluster centroids calculated in this case? Does the clustering happen using Euclidean distances, or using the cosine dot-product formula? (See the sketch after this list.)
  3. What is the significance of the clusters obtained? Do the entities which lie in the same cluster behave similarly?
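As a minimal sketch of the distinction raised in question 2 (Python rather than R, purely for illustration, with `sample2` replaced by random data): kmeans always minimises Euclidean distance on whatever matrix it is given, so feeding it a distance matrix means clustering the rows of that matrix as N-dimensional feature vectors. A cosine-style clustering is instead usually obtained by normalising the rows of the original data to unit length first, since squared Euclidean distance between unit vectors equals 2 * (1 - cosine similarity):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
sample2 = rng.random((100, 20))                   # stand-in for the asker's data

# (a) k-means straight on a cosine-distance matrix: each row of the N x N matrix
#     is treated as an N-dimensional feature vector under Euclidean distance.
unit = normalize(sample2)                         # rows scaled to unit length
dist_matrix = 1 - unit @ unit.T                   # pairwise cosine distances
km_dist = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dist_matrix)

# (b) "spherical k-means" style: run ordinary k-means on the unit-normalised rows.
km_cos = KMeans(n_clusters=3, n_init=10, random_state=0).fit(unit)

print(km_dist.labels_)
print(km_cos.labels_)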

Is there a situation when one would use L1 norm over L2 norm in k-means algorithm? [duplicate]

Clustering discrete dataset with strange metric using K-means


I have a dataset of n objects and a matrix A with their correlations.
So A[i][j] is the correlation of object i and object j, and I do not know anything more about them. My task is to cluster them by correlation (higher correlation = closer objects) using k-means (actually, I must implement a distributed version with MPI).
My problem is that I do not know how to correctly choose a new mean for the group of objects assigned to one ‘old’ mean.
According to the algorithm, I have to choose the point with the least within-cluster sum of squares (WCSS), but in my task I cannot create a new object to act as the mean, so I cannot choose such a point.
Will the algorithm still be correct enough if I use existing points with the least WCSS?
How would you suggest parallelizing it on a cluster (MPI)?
Thanks!
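For what it’s worth, the “restrict the centre to an existing object” update described above is essentially the medoid update used in k-medoids. Here is a serial sketch (Python; the conversion of correlations to distances as 1 - A is an added assumption, since the question does not specify one):

import numpy as np

rng = np.random.default_rng(0)

# Stand-in correlation matrix A for n objects (symmetric, ones on the diagonal).
n = 50
B = rng.random((n, n))
A = (B + B.T) / 2
np.fill_diagonal(A, 1.0)

D = 1.0 - A                                       # assumed distance: 1 - correlation

def medoid_update(D, members):
    # Return the member whose summed squared distance to the other members is smallest.
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin((sub ** 2).sum(axis=1)))]

# Example: recompute the centre of one cluster of object indices.
cluster = [0, 3, 7, 11, 19]
print(medoid_update(D, cluster))

Because each cluster’s update only needs the rows of D belonging to its own members, the per-cluster updates are independent and therefore straightforward to spread across MPI ranks.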

Proper dataset format for K-Means and DBSCAN clusterers


I’m trying to classify web traffic using clustering algorithms in my own C program, capturing packets with libpcap.
In this article, the K-Means, DBSCAN and AutoClass algorithms were used to classify web traffic.

I tested my dataset with different implementations of K-Means and DBSCAN, yet it is unclear to me how these two algorithms deal with data:

in K-Means from a C clustering library, the dataset is a matrix of double** where the rows are the points of observation and the columns are the features; this data struct suits me fine, in my code every row is a connection and every column is a feature (delay, or average dimension, or bps, …) of it;

in DBSCAN from a git repo the dataset is an int ** matrix;

in another DBSCAN repo there is no matrix of features like above, but a linked list of struct point with double x, y coordinates.

I’m confused about the dataset representation: in my program, every connection (row) has a number of features (columns), but this representation does not seem to match a linked list of points like in the second implementation of DBSCAN. Should I convert the K-Means dataset, implemented as a matrix, into a linked list of struct point for DBSCAN?

I know this seems like a CodeReview question, but I want to figure this out.

How to find the 1000 closest points to a centroid built from another matrix


I actually work on text mining. I am trying to find the 1000 closest documents (inside a corpus of 56000 documents) to a selected corpus of 150 documents. There are a lot of words in my dictionary. I computed the term-frequency matrix and the TF-IDF matrix.

My idea is to build a knn model (or a k-means model) based on cosine distance, and then to query that model to get the 1000 most relevant documents.

The problem is: how can I build a knn model without “negative examples”?
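One direct way to realise the idea described above, without any notion of negative examples, is to treat it as a nearest-neighbour ranking problem: average the 150 selected TF-IDF rows into a single centroid and rank every other document by cosine similarity to that centroid. A minimal sketch (Python with scikit-learn; the matrix names and sizes are placeholders):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
tfidf_corpus = rng.random((56_000, 300))          # stand-in for the 56000-document TF-IDF matrix
tfidf_selected = rng.random((150, 300))           # stand-in for the 150 selected documents

centroid = tfidf_selected.mean(axis=0, keepdims=True)        # one "average document"
scores = cosine_similarity(tfidf_corpus, centroid).ravel()   # similarity of every document to it
closest_1000 = np.argsort(scores)[::-1][:1000]               # indices of the 1000 best matches

print(closest_1000[:10])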


Standard reference for K-means [duplicate]

User segmentation by clustering with sparse data


Imagine that I have 100k users and 1k categories. For each user I know how much money they have spent in up to 5 categories. Obviously my data is very sparse.

Now I want to group users by the money they spend on different categories. This way, I could group together users who are ‘cheap’ in some certain categories and ‘snobby’ in some other categories.

After standardizing the values (by calculating the number of standard deviations by which each value deviates from its category mean), I tried k-means clustering, but as the number of iterations increases, one cluster keeps getting bigger while the others shrink to clusters containing only a few users.

How can I tackle this clustering-with-sparse-data problem? Any pointers, suggestions or ideas are appreciated.

K-means and reproducibility


I’m trying to find the optimal k-means clustering for a set of elements. For a particular k, k-means repeated several times does not always converge to the same clustering, due to the randomness in the initialization. This means that whatever internal performance statistic (a clustering criterion, such as Dunn’s index) I use to choose k is going to depend somewhat on when I run the algorithm, unless I use a seed. To decide whether the clustering is “stable” for a particular k (so that I can believe the performance statistic is representative), I calculate how often the resulting clusters for that k overlap from run to run:

$$ \frac{1}{n(r-1)} \sum_{j=2}^{r} \sum_{i=1}^{n} \frac{\left|C_{j-1}^{(i)} \cap C_{j}^{(i)}\right|}{\left|C_{j-1}^{(i)}\right|}$$

where $n$ is the number of elements I’m clustering, $r$ is the number of k-means runs (with fixed k), and $C_j^{(i)}$ is the cluster from the $j$th run containing the $i$th element.

However, this statistic is not from the literature, and I’d be curious to know how others have addressed this problem and if there are issues with my approach that haven’t occurred to me. Thanks.
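For concreteness, the statistic above can be computed directly from the label vectors of repeated runs. A short sketch (Python with scikit-learn; the data array is a placeholder), using the convention that $C_j^{(i)}$ is the set of elements sharing element $i$’s label in run $j$:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                     # placeholder data
k, r = 3, 10                                      # fixed k, number of repeated runs

runs = [KMeans(n_clusters=k, n_init=1).fit(X).labels_ for _ in range(r)]

def overlap(prev, curr):
    # Average over elements i of |C_prev(i) intersect C_curr(i)| / |C_prev(i)|.
    n = len(prev)
    total = 0.0
    for i in range(n):
        prev_members = prev == prev[i]            # elements in i's cluster on the previous run
        curr_members = curr == curr[i]            # elements in i's cluster on the current run
        total += (prev_members & curr_members).sum() / prev_members.sum()
    return total / n

stability = np.mean([overlap(runs[j - 1], runs[j]) for j in range(1, r)])
print(stability)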

Why is it that a larger 'k' value fails to converge but a smaller 'k' converges?


I’m doing clustering via GMM, which is initialized first by k-means.

I am using data matrices that cannot be classified as small by any standard: they are usually of size 15000 x 1800, where 15000 is the number of observations and 1800 the size of each observation.

My reading of GMM in speaker recognition papers suggested that with large data we need higher number of gaussians, and most of the papers were using 512 or 1024 gaussians for best convergence.

For my data I tried 128 and 256 Gaussians. I am using Matlab, but even after 300 iterations I get a warning that the method failed to converge. However, when I use a comparatively small value of k, such as 32, the method converges without any problems, usually in fewer than 100 iterations.

I cannot explain this.

R – How to fix NbClust error with error message: “The TSS matrix is indefinite. There must be too many missing values.”


I would like to know how I can use clustering methods in R (in this case, Kmeans) if I have an “unkind” input matrix (I get this error log:

The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.)

I could see that I might get this error if my matrix produces negative eigenvalues (as here: http://stackoverflow.com/questions/20669596/nbclust-package-error), but what I’m missing is the “next step” part. One suggestion was to “go back to the data”, but what should I do then? Is there a transformation or something else that might help? (I’m pretty new to R and to clustering in general…)

The data I’m using are the results of a survey (which I briefly transformed and scaled via the scale function in R), so I was wondering whether there are algorithms or methods I could use in order to go on with my analysis (I couldn’t find much help in the literature). Or, if you think this is unfixable or simply not the best approach, do you have any other suggestion for clustering my data? What I want to do is identify clusters of possible users/customers of some services, depending on their usual habits (e.g. if they use many social networks, they will be more likely to use chat/WhatsApp/an app to ask for bank account information; I have both the information about their social network usage and about the ways they communicate with a “bank assistant”).

The dataset consists of 994 rows and 103 columns. I don’t know if it helps, but the code is simply this:

Data2<- read.csv(...)
bDataScale <- scale(Data2)
nc <- NbClust(bDataScale, min.nc=2, max.nc=993, method="kmeans")

And I get:

Error in NbClust(bDataScale, min.nc = 2, max.nc = 993, method = “kmeans”) :
The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.

Thank you in advance for your help and any corrections,

Julia

P.S.: as one would expect, I get the same error with the unscaled matrix as well.
