k-means++ help!

A

almurph

Hi,

I'm trying to really understand the k-means++ algorithm before I do
an implementation of it in C#. However I'm a bit stuck and need some
assistance.
I'm following the instruction set on the Wikipedia website.
Specifically the section entitled "Initialization Algorithm".

I understand point 1 i.e. "Choose one center uniformly at random from
among the data points." That's easy, no worries there.

Point 2 is also straight-forward, just a calculation really: 2. "For
each data point x, compute D(x), the distance between x and the
nearest center that has already been chosen." That's okay, just a
calculation really over the set.

Point 3 is the one that I'm confused on. Point 3 states: "Add one new
data point at random as a new center, using a weighted probability
distribution where a point x is chosen with probability proportional
to D(x)^2. " This, I don't understand. Can anyone please explain it to
me, perhaps with a simple worked example would be great...

Thanks,
Al.
 
A

almurph

   I'm trying to really understand the k-means++ algorithm before Ido
an implementation of it in C#. However I'm a bit stuck and need some
assistance.
   I'm following the instruction set on the Wikipedia website.
Specifically the section entitled "Initialization Algorithm".
   I understand point 1 i.e. "Choose one center uniformly at randomfrom
among the data points." That's easy, no worries there.
   Point 2 is also straight-forward, just a calculation really: 2. "For
each data point x, compute D(x), the distance between x and the
nearest center that has already been chosen."  That's okay, just a
calculation really over the set.
   Point 3 is the one that I'm confused on. Point 3 states: "Add one new
data point at random as a new center, using a weighted probability
distribution where a point x is chosen with probability proportional
to D(x)^2. " This, I don't understand. Can anyone please explain it to
me, perhaps with a simple worked example would be great...

Not really a C# question (even though it's possible to answer in C#…see
below :) ), but what the heck…

They are explaining that once at least one cluster center has been
selected (the first one, you just pick randomly from the entire set),
you then need to select more centers until you have /k/ centers (i.e.
the whole thing is an initialization for yet another algorithm).

For all but the first center you pick, the first step in picking another
center is to calculate, for every point not already selected as a
cluster center, the square of the distance (i.e. D(x)^2 = dx^2 + dy^2
where dx and dy are the differences in the x and y coordinates
respectively) from that point to the cluster center point _nearest_ to
it that has already been picked.

Then, once that distance to the nearest center has been found and
calculated for every remaining point in the set, a new center is
selected randomly based not on a uniform distribution, but rather a
weighted distribution where the weighting is the D(x)^2 you calculated
for that point.

For the second center, all of the points will simply be weighted
according to their distance from that one first center you picked.  But
after that, the D(x)^2 for each point may be relative to different centers.

For example, if you've already got 2 cluster centers selected out of 10
points total and you're getting ready to select a 3rd cluster center,
then before you do that you need to calculate 16 different distances (or
rather, squares of distances): for each remaining point, one distance
for each cluster center already picked.  And for each remaining point,
you remember the smallest D(x)^2 that you calculated, and use that for
the weight.

The general idea is that, when choosing additional cluster center
points, points that are farther away from _any_ other cluster center are
weighted more heavily so that they are more likely to be selected.

Here's a simple, brute-force, unoptimized, uncompiled, untested
implementation in C#:

   class PointRecord
   {
     public Point Point { get; private set; }
     public double MinimumDistance { get; set; }

     public PointRecord(Point point)
     {
       Point = point;
       MinimumDistance = 1;
     }
   }

   List<Point> PickCenters(List<Point> inputData, int k)
   {
     Random random = new Random();
     List<Point> output = new List<Point>();
     List<PointRecord> data = new List<PointRecord>(
       inputData.Select(point => new PointRecord(point)));

     while (k-- > 0)
     {
       double distanceSum = UpdateDistances(data, output);
       PickNextCenter(data, output, random.NextDouble() * distanceSum);
     }
   }

   double UpdateDistances(List<PointRecord> data, List<Point> output)
   {
     if (output.Count == 0)
     {
       // The sum of the MinimumDistance values for the entire data
       // list is the count of that list, since they are initialized
       // to 1.
       return data.Count;
     }

     double sum = 0;

     foreach (PointRecord record in data)
     {
       record.MinimumDistance = double.MaxValue;

       foreach (Point point in output)
       {
         double dx = point.X - record.Point.X,
           dy = point.Y - record.Point.Y;
         double distance = dx * dx + dy * dy;

         if (distance < record.MinimumDistance)
         {
           record.MinimumDistance = distance;
         }
       }

       sum += record.MinimumDistance;
     }

     return sum;
   }

   void PickNextCenter(List<PointRecord> data,
     List<Point> output, double index)
   {
     for (int i = 0; i < data.Count; i++)
     {
       PointRecord record = data;

       if (index < record.MinimumDistance)
       {
         data.RemoveAt(i);
         output.Add(record.Point);
         return;
       }

       index -= record.MinimumDistance;
     }

     throw new Exception("Should never get here");
   }

Pete- Hide quoted text -

- Show quoted text -


Hi Pete,

Thanks for the reply. Let me see if I got this right. You said in
your explanation that "but rather a weighted distribution where the
weighting is the D(x)^2 you calculated for that point."
Now from what I can see you use the smallest distance and multiply it
by a random number. Am I correct? You then seem to add up the the
minimum distances and multiply said sum by the random number. This is
what I don't understand Pete, this multiply by the random number. I
though you would multiply the maximum distance, no?

Still a bit confused,
Al.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top