3926 Views 60 Replies Latest reply: Mar 1, 2009 by PhilipLeitch
PhilipLeitch

Feb 18, 2009 12:00 AM

Cluster Analysis - Agglomerative Nesting (AGNES)

AGGLOMERATIVE NESTING (AGNES) USING THE GROUP AVERAGE METHOD OF SOKAL AND MICHENER (1958)

As adapted from "L. Kaufman and P.J. Rousseeuw (1990), FINDING GROUPS IN DATA: AN INTRODUCTION TO CLUSTER ANALYSIS, New York: John Wiley." Chapter 5.
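For readers without the attached worksheet, the group-average agglomeration that AGNES performs can be sketched in plain Python. This is a minimal illustration, not the Mathcad implementation from the attachment; the function names and the choice of Euclidean distance are assumptions for the example.

```python
import math

def euclidean(a, b):
    # plain Euclidean distance between two observations
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agnes_average(data):
    """Agglomerative nesting with group-average linkage (Sokal & Michener):
    repeatedly merge the two clusters whose mean pairwise distance is
    smallest. Returns merge records (cluster_a, cluster_b, distance),
    with clusters as frozensets of row indices into `data`."""
    clusters = [frozenset([i]) for i in range(len(data))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # group average: mean of all between-cluster pair distances
                d = sum(euclidean(data[p], data[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] | clusters[j]])
    return merges
```

The list of merge records (which clusters joined, and at what distance) is exactly the information a dendrogram plots.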

For those who believe all statistics are lies - rest assured there is no significance testing here. The decision about what is a separate group and what is not is purely at the researcher's own discretion.

Philip
___________________
Correct answers don't require correct spelling.
Attachments:
  • ptc-1368288, Feb 18, 2009 (in response to PhilipLeitch)
    >No matter how good 11 was, it is outdated and superseded many times over <.
    ______________________________

    How many times have you read from the "Mathcad Power Users" that what worked in 11 no longer works in 13 or 14? You must have read it hundreds of times, and it has been reported to me directly many times that 13 fails reading 11.

    No more argument: if you know how to make it work in 11 and save as 2001i, or even 2000, you will have an extra possible contribution. Only reading it will save me precious time.

    jmG
      • A.Non, Feb 20, 2009 (in response to PhilipLeitch)
        On 2/19/2009 12:30:44 AM, pleitch wrote:

        >I don't know how to make it
        >work on 11.

        I fixed it so it works in version 11 and version 14. You had some statements in programs with an undefined variable on the rhs. Surprisingly, they worked in versions 13 and 14 because the variable was also on the lhs. Also, rows of an empty string throws an error in version 11, so I changed it to rows(0).

        I also added a few more distance metrics.

        AND....

        I figured that since you had done the work of writing the hierarchical clustering algorithm, which is something I have wanted in Mathcad but have been too lazy/busy to write, I would write the other essential part: a dendrogram drawing algorithm. It makes it a heck of a lot easier to see the grouping!

        Lastly, as I suspected, your calculation for AC is wrong. This is how it's supposed to be calculated:

        http://www.unesco.org/webworld/idams/advguide/Chapt7_1_4.htm

        You can check it here:

        http://www.wessa.net/rwasp_agglomerativehierarchicalclustering.wasp

        I ran the snake data through the web program and your sheet, and the dendrogram looks fine: all the distances match. The web value of AC is about 0.9 though, which is a lot more reasonable.

        Richard
      • ptc-1368288, Feb 18, 2009 (in response to PhilipLeitch)
        In the program COMPARISON,

        "Results" is red as it is not defined within the LHS function, presumably one of the dist(...) above. In Mathcad 11 it would have to be assigned a matrix variable name... next is rows(combined_ED) as an index: that does not work in 11 either. Too much work to convert to 11. What's wrong is the 13 style, which could probably be done 11 style. I understand that if you never enjoyed 11 or an earlier version, you can't figure how much more logical the programming style was.

        Interesting, but it does not go far enough.

        jmG
      • A.Non, Feb 19, 2009 (in response to PhilipLeitch)
        On 2/18/2009 8:57:09 PM, pleitch wrote:
        >It looks like there should be
        >three groups.

        I would argue there are four groups. You need to be very careful with data scaling.


        The Story of the Court Mathematician

        Once upon a time, in a land far, far away, there lived a King. The King had a very pretty daughter, but he also had a problem. His daughter was very fond of the numerous snakes that could be found in the palace grounds (she was pretty, but a little strange). Some of these snakes were poisonous and some were not, but nobody knew how to tell them apart. This worried the King greatly, because he did not want his daughter to be bitten by a poisonous snake before he could demand a huge dowry for her hand in marriage from the King of the much larger kingdom to the west. He knew he could not just get rid of all the snakes, because this would upset his beloved daughter, so he called in his most learned court scholars. When presented with the King's dilemma, the Court Mathematician promptly announced "there is a new method called 'cluster analysis' that I think may elucidate the problem". The King, not being nearly as learned as the wise mathematician, replied "I didn't know you could elucidate a snake, or why that would protect my daughter, but if you think it would help then your suggestion has my full support". The mathematician was a little perplexed by the King's answer, but was wise enough to know you did not question a king. The next day he set about making some measurements of the dead snakes he found in the palace grounds (being very wise, he realized that dead snakes, poisonous or otherwise, couldn't bite him). He measured the length and diameter of each snake he found, as well as the length of the fangs. When he had collected enough data, he plotted the three measurements on a graph. He made sure to use the same scale for each axis, because he didn't want to favor one measurement over another. This is what he saw:



        There were clearly two species of snake! The only remaining problem was to determine which species was poisonous. Being very wise, he realized that although he could only do this using a live snake, along with a disposable prisoner from the King's dungeon, he did not need to take unnecessary risks by catching more than one. It did not take long for the Court Mathematician to catch one of the larger snakes and determine that, unfortunately for the prisoner, it was poisonous. The next day the Court Mathematician took his findings to the King, who was immensely pleased. The King immediately ordered that all the larger snakes be captured and released over the border of the much smaller kingdom to the east (he did not like the King of the kingdom to the east, because many years before he had demanded a huge dowry for the hand in marriage of his very pretty daughter).

        Time passed happily, until one day the King's daughter was bitten by a snake and died. The King was furious, and summoned the Court Mathematician. "You told me that only the large snakes were poisonous, and now my daughter is dead. As a punishment that you will never forget you will be elucidated! Take him away!"

        Eventually the ex-Court Mathematician recovered enough from his punishment to investigate where he had gone wrong. After much study he solved the problem by inventing two new techniques for data analysis, which he presented in a very high-pitched voice at the next inter-kingdom symposium on applied mathematics. He called these new techniques "mean centering" and "variance scaling". When he applied these new techniques to his snake data, this is what he saw:



        There were three species of snake! One of the two smaller species was also poisonous! The other mathematicians were so impressed they gave him a major award with a nice engraved plaque he could hang on his wall. Alas, he could never have a son to inherit the plaque and be proud of his father's achievements.

        There are two morals to this story:
        1) If you want to continue to speak in a normal voice, and perhaps have children, do not anger kings.
        2) If you do not want to anger kings, scale your data correctly prior to analysis.
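The "mean centering" and "variance scaling" the story turns on amount to: subtract each measurement column's mean, then divide by its standard deviation (often called autoscaling). A minimal sketch, with illustrative names not taken from the worksheet:

```python
import statistics

def autoscale(column):
    """Mean-center and variance-scale one measurement column:
    subtract the column mean, then divide by its standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)   # sample standard deviation
    return [(x - mu) / sigma for x in column]

# Scale each measurement (length, diameter, fang length) separately,
# so that a large-valued variable no longer dominates the distances.
```

After autoscaling, every variable contributes on the same footing to the distance calculation, which is the point of the story.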

        Richard
        Attachments:
    • A.Non, Feb 19, 2009 (in response to PhilipLeitch)
      On 2/18/2009 7:52:41 PM, pleitch wrote:

      >The thread was a little off
      >with what it said. There are
      >two types of cluster analysis,
      >but they are Partitioning and
      >Hierarchical.

      Thanks for the clarification.

      >Within each
      >there are multiple approaches.
      >The approach I posted is
      >Hierarchical

      That helps. I haven't had time yet to really look at your worksheet in detail (too busy writing fairy stories!), but it is much easier to figure out what's going on when you know that. I would suggest even adding a comment to that effect at the top of the sheet somewhere.

      Richard
  • ptc-1368288, Feb 18, 2009 (in response to ptc-1368288)
    BTW, there was a project similar to what you might have done, but I can't retrieve it.

    jmG
      • A.Non, Feb 18, 2009 (in response to PhilipLeitch)
        On 2/18/2009 6:17:20 PM, pleitch wrote:
        >Thanks Jean.
        >
        >I realise there are projects
        >like I have done

        This might be what Jean was referring to:

        http://collab.mathsoft.com/read?27146,17e#27146

        As far as my final sentence of the thread goes, it's still on my to-do list :-)

        Now I can also add to my to-do list a comparison of your algorithm and k-means. At the current rate I should have that done within the next decade or so :-)

        Richard
  • PhilipOakley, Feb 18, 2009 (in response to PhilipLeitch)
    In V11.2a there is a bug on "Results" in the big bit of code. It is in the Test for Results==0 as Results is at that point undefined.

    It is big code ;-)

    Philip Oakley
    • A.Non, Feb 20, 2009 (in response to PhilipLeitch)
      On 2/19/2009 7:00:03 PM, pleitch wrote:

      >If you look at my work sheet
      >you will actually find that
      >the AC was very low,
      >indicating that there wasn't
      >much in the way of clustering
      >at all

      See my other post for what I think about the AC metric.



      >A PCA or a Factor analysis,
      >could be used to find a
      >scaling of the dimensions,

      I don't see how you can find the scaling using PCA. In fact, PCA is very dependent upon the scaling of the data.

      > or
      >a Factor Analysis could move
      >the data in Euclidean space to
      >a point that maximises the
      >effect of variables equally,
      >which would then be the
      >appropriate transformations
      >for the data to scale (I
      >haven't written the factor
      >analysis stuff yet).

      I have :-) PCA, anyway. There are more types of factor analysis than I care to think about, let alone write in Mathcad. I wrote the PCA stuff for myself though, so I would have to do some work on it before I was prepared to post it. PCA is implemented in the Data Analysis Extension Pack though, so if you get that you can save yourself some effort.

      >This may be why/how the PCA
      >has been found appropriate to
      >maximise the efficiency of the
      >k-mean approach?? I haven't
      >read into it yet.

      No. PCA just lets you reduce the number of variables, assuming the variables in the raw data are collinear. That can make it much easier for the k-means clustering to find the clusters.


      >So - when all is said and done,
      >what should the mathematician
      >have done?
      >
      >He should have gone to a
      >biologist (like me) and asked
      >what should be done.

      He couldn't. The Court Biologist was the Princess. I just forgot to mention that point ;-)

      >Assuming that this is back in
      >the days of alchemy when they
      >didn't make the
      >distinction....

      They didn't. Except perhaps the princess, but of course she wasn't asked.

      >They should have still asked a
      >biologist.

      I think the more important point is that if the King wanted a practical solution to a real world problem, he shouldn't have asked a mathematician :-)

      Richard
  • A.Non, Feb 20, 2009 (in response to PhilipLeitch)
    On 2/20/2009 8:00:47 AM, pleitch wrote:
    >I've looked at the data from
    >the k-mean approach and I
    >still don't see 4 groups.

    See below


    >However - it does draw a very
    >good point. Are these the
    >principal components associated
    >with the data?

    I have no idea, but I doubt it. It's just the example data that was with the Cluster 3.0 software, and nothing whatsoever is known about it. For all I know it's just made up.

    >Using the snake example, you
    >wouldn't take random
    >measurements of snakes and
    >then assume that they will be
    >related to how venomous they
    >are.

    Well, perhaps not in the snakes example, but there are plenty of examples where that is what you do. You measure what you can, and then try to correlate that with the known property.

    >Instead, you would attempt to
    >determine which variables are
    >principal in determining the
    >venom attribute (or degree of
    >venomousness). This may well
    >be done after measurements are
    >taken.

    Exactly. So you measure everything you can, then correlate afterward. We in the spectroscopy world do it all the time.

    >If there is no easy measure of
    >venom - other than killing
    >something/someone, then the
    >principal components would
    >logically be the ones that
    >are most useful in
    >distinguishing types of
    >snakes.

    No, not necessarily. Principal components are based solely on variance in the data. That variance may not correlate with the property you wish to measure though. As an example, take 2 species of snake. They both have about the same length, and that length is highly variable. That's one variable you measure. Now let's say you measure 10 other variables with high variance but little or no discriminatory power (diameter, etc). Finally, you measure color. The snakes have very similar colors, but the within-snake color variation is very small, so you can tell them apart using this. You now have 11 variables, 10 of them with high variance that tell you nothing about the snake species, and one with very low variance that does. The PCs will be dominated by the high variance variables, not the color, and will not help solve the problem. In fact it will make it much worse, because it will take the one useful variable you did have and mix it up in linear combinations with all the others.
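The snake argument above is easy to demonstrate numerically. The sketch below (illustrative code, not from the worksheet) finds the first principal-component loading vector by power iteration on the covariance matrix; with one high-variance but uninformative variable and one low-variance but informative one, the first PC is dominated almost entirely by the former:

```python
import math

def first_pc_loadings(rows):
    """First principal-component loading vector, via power iteration
    on the covariance matrix of the mean-centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    v = [1.0] * p
    for _ in range(200):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Six snakes: variable 0 is length (huge variance, no class information),
# variable 1 is colour (tiny variance, but it separates the two species).
snakes = [[-10, 0.0], [10, 0.0], [-10, 0.1], [10, 0.1], [0, 0.0], [0, 0.1]]
pc1 = first_pc_loadings(snakes)
# pc1 points almost entirely along the high-variance length axis,
# so projecting onto PC1 throws away the one discriminating variable.
```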

    >I have now attempted to make
    >several views of the data,
    >including a mean divided by
    >stdev. Even then I don't see
    >four groups.
    >
    >With this data I would happily
    >agree that when viewed from 2
    >of the three data axes, there
    >does indeed look like 4
    >groups. But under the third
    >these groups dissipate.

    You can't look at only 2 variables at a time. It's a 3 dimensional data set.

    > The
    >final group (the most extreme
    >one) is so unclustered that if
    >it is to be considered a
    >group, so must the first group
    >(the small one that arguably
    >could be considered two
    >groups).

    That's the one :-)

    However, I agree it's completely subjective, and since we know nothing about the data it could be any number of groups. Maybe it's just one badly sampled continuous distribution!


    >However - even with the data
    >transformation, the AC is so
    >low as to assume that there is
    >no clustering/grouping
    >occurring at all.

    I am wondering if you have the formula for AC correct. If you have, then I would consider it a rather useless metric, because even for the snake data it's only 0.19. The grouping in the snake data is obvious (for 2 groups, anyway).

    Richard
    • ptc-1368288, Feb 20, 2009 (in response to A.Non)
      You are right,

      The King's daughter preferred baby snakes, shorter on the meter stick. Do you mind if I pass that lovely story to my best friend in statistics as well?

      Jean
    • PhilipOakley, Feb 20, 2009 (in response to A.Non)
      On 2/20/2009 10:01:05 AM, rijackson wrote:
      >
      >No, not necessarily. Principal
      >components are based solely on variance
      >in the data. That variance may not
      >correlate with the property you wish to
      >measure though. As an example, take 2
      >species of snake. They both have about
      >the same length, and that length is
      >highly variable. That's one variable you
      >measure. Now let's say you measure 10
      >other variables with high variance but
      >little or no discriminatory power
      >(diameter, etc). Finally, you measure
      >color. The snakes have very similar
      >colors, but the within-snake color
      >variation is very small, so you can tell
      >them apart using this. You now have 11
      >variables, 10 of them with high variance
      >that tell you nothing about the snake
      >species, and one with very low variance
      >that does. The PCs will be dominated by
      >the high variance variables, not the
      >color, and will not help solve the
      >problem. In fact it will make it much
      >worse, because it will take the one
      >useful variable you did have and mix it
      >up in linear combinations with all the
      >others.
      >
      >Richard

      The techniques around PCA, such as SVD, etc., will identify the separate groups. In a multidimensional case one has to be cautious about misunderstandings (e.g. the separation axis may not be one of the dimensions). [This is noted for the general reader, rather than Richard who I believe already appreciates this].

      We need to be careful in the explanations about where the particular distinctions are in each approach and how, often, they are different ends of the same calculation. The PCA and SVD methods normally sort the components by various measures of size. Sometimes we want to start at the big end and sometimes the small end, depending on what we want to achieve.

      It is an optimisation problem. We are trying to optimise the separation between putative groups based on various flexible criteria...

      Philip Oakley
  • A.Non, Feb 20, 2009 (in response to ptc-1368288)
    On 2/20/2009 10:40:12 AM, jmG wrote:
    > Do you mind if I
    >pass that lovely story to my
    >best friend in statistics as
    >well ?

    You can pass it on to whoever you wish. Once I post something here I figure it's in the public domain anyway.

    Richard


  • A.Non, Feb 20, 2009 (in response to PhilipLeitch)
    On 2/20/2009 9:49:32 PM, pleitch wrote:
    >Thank you Richard
    >
    >"I am wondering if you have
    >the formula for AC correct.
    >
    >It matched the book that I got
    >it from but that DOES NOT mean
    >that I have it correct.

    "R" is the programming language designed and used by statisticians. I would be truly amazed if they had it wrong.

    >Anyway - my point about PCA -
    >and more specifically factor
    >analysis (next couple of
    >projects I'll be doing), is
    >that you can move the factor's
    >effects in euclidean space to
    >maximise the clustering..... I
    >think.

    With or without a priori information about the data? With a priori information about the data that's certainly true. Without it, I'm not so sure. You can extend PCA, though, to give much better classification. In my field a very successful algorithm is SIMCA:

    http://www.camo.com/resources/simca.html

    Beware though! I believe it is patented.

    Richard
      • A.Non, Feb 21, 2009 (in response to PhilipLeitch)
        On 2/21/2009 3:22:52 AM, pleitch wrote:
        >yech - patents are terrible
        >things.

        Only when they are other people's :-)

        >My background is Applied
        >Biology and Environmental
        >Science (double major BSc) -
        >so I did statistics (a lot of
        >statistics), but never delved
        >into clustering, PCA/Factor
        >analysis. I've more recently
        >completed an MBA - but that
        >was devoid of statistics.
        >
        >I've done two honours (year of
        >research), for both the BSc
        >and MBA and both were
        >statistics based. But again,
        >neither was based on this area
        >which is why I am teaching
        >myself. Same as Bayesian
        >statistics - I've dabbled
        >enough to know I don't know
        >enough so I'm about to order
        >some books on that.

        Well, my only formal training in statistics was when I was force fed a dose of it during my physics degree. That did not cover anything to do with multivariate analysis, cluster analysis, etc. I have learned it from books, by listening to a lot of talks at conferences, and by getting a lot of advice from colleagues who know more about it than I do. In my field people usually avoid the term "statistics" and use the term "chemometrics" instead. I believe the term was coined because the statistics-based algorithms are often applied to data, and/or applied in a way, that violates the statistical assumptions. So the results are not really statistical, but they work and are exceedingly useful.

        >R - is that a free software?
        >Do you have some links for it?
        >I have seen some books on it
        >but I've never heard about it.
        >I didn't know it was a
        >programming language until
        >your post just now.

        I don't know much about it myself. I had someone recommend it to me about a year ago. He seemed to think it was the best thing since sliced bread (which, as it happens, I hate), but then he was a statistician. I think it's sort of a statistical Matlab. You can download it here:

        http://www.r-project.org/

        Richard
        • A.Non, Feb 21, 2009 (in response to PhilipLeitch)
          On 2/21/2009 6:59:13 AM, pleitch wrote:
          >I think I found the flaw in
          >the AC.

          I think it's still wrong. If I feed the data into the web program I get AC=0.8877464975. That's high, which is what I would expect. I'm sure it's doing the same calculation, because this is the dendrogram it produces:



          The branches are not in the same order, but the groupings and (very importantly) the distances are identical. That's good, because it means that your program and the dendrogram routine are correct.

          When I read this:

          "Let d(i) denote the dissimilarity of object i to the first cluster it is merged with, divided by the dissimilarity of the merger in the last step of the algorithm.
          The agglomerative coefficient (AC) is defined as the average of all [ 1-d(i)]"

          it says to me that you have to calculate AC as you go through each step of the clustering. That's why I didn't attempt to fix it. It needs to be calculated in your main program, and I didn't want to mess with that. I figured you could probably do it in less time than it would take me to figure out where to begin.
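Reading the quoted definition that way, the coefficient can also be computed after the fact from a finished list of merges. A sketch, assuming a hypothetical record format of (cluster_a, cluster_b, distance) per merge, with clusters as frozensets of object indices (not the worksheet's data structures):

```python
def agglomerative_coefficient(merges):
    """Agglomerative coefficient: AC = mean over objects of 1 - d(i),
    where d(i) is the dissimilarity at which object i is first merged,
    divided by the dissimilarity of the final merge.
    `merges` is a list of (cluster_a, cluster_b, distance) records in
    merge order, clusters given as frozensets of object indices."""
    final = merges[-1][2]
    first = {}
    for a, b, d in merges:
        for obj in a | b:
            # an object's first appearance in any merge record is the
            # step at which its singleton cluster was absorbed
            first.setdefault(obj, d)
    return sum(1 - d / final for d in first.values()) / len(first)
```

Objects that join early (relative to the final merge height) contribute values near 1, so a strongly clustered data set gives a high AC.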

          >By the way - I LOVE the line
          >combinations - dendrogram.

          I wouldn't know what to do with a hierarchical clustering algorithm without it :-)

          Richard
          Attachments:
  • A.Non, Feb 20, 2009 (in response to A.Non)
    A minor bug fix.

    Richard
  • A.Non, Feb 22, 2009 (in response to PhilipLeitch)
    On 2/22/2009 4:50:24 AM, pleitch wrote:

    >This value looks much better,
    >but still isn't exact. I
    >would put money on the
    >calculations being slightly
    >different, leading to a small
    >difference in value to the R
    >version.

    You would have lost your money. It's a bug. Your indexing should start at 0, not 1 (Labels has no header row). Then the numbers match to within roundoff. I figured out how to get the distances from the web program too, and those also match to within roundoff (a rather more demanding comparison than just eyeballing dendrograms!). So numerically, we are now golden.

    I changed the functions slightly so that UPGMA takes the function name of the distance measure as a parameter. That makes it much easier to call it multiple times to compare the measures. I also added an autolabeling function, because I'm not into typing large vectors of arbitrary strings every time the data is changed.
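Passing the distance measure as a function parameter, as described above, looks like this in Python terms (illustrative names, not the worksheet's):

```python
import math

def euclidean(a, b):
    # straight-line distance between two observations
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # city-block distance, as an alternative metric
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_pairwise(cluster_a, cluster_b, dist):
    """Group-average dissimilarity between two clusters, with the
    distance measure `dist` passed in as a function parameter."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

Swapping metrics is then a one-argument change: `mean_pairwise(a, b, manhattan)` instead of `mean_pairwise(a, b, euclidean)`, which is what makes comparing measures cheap.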

    Richard
        • A.Non, Feb 24, 2009 (in response to PhilipLeitch)
          On 2/24/2009 12:55:14 AM, pleitch wrote:
          >Attached with Ward's
          >method.

          >Ward's method uses a
          >completely different approach,
          >so I had to re-program some
          >areas.

          What's the point of the "Null" distance? You can't use it with the average linkage (at least, not meaningfully), and Ward's method doesn't use the distances so it doesn't matter what the distance metric is.

          >What I am hoping to
          >achieve is a worksheet that I
          >can plug values in the top,
          >then using a radio button or
          >combo box, select the
          >appropriate method AND
          >distance calculation. So this
          >is still a work in
          >progress.

          Yes, that would be my goal too. I can add a bunch of that stuff, but I have a long trip coming up in a couple of days (3-4 weeks in Asia) and I have a lot to do before I go so I'm not sure when I'll get to it.

          Richard
  • A.Non, Feb 22, 2009 (in response to PhilipLeitch)
    On 2/22/2009 6:13:35 PM, pleitch wrote:

    >Thanks for the collaboration
    >with this.

    This one definitely works both ways. I also wanted hierarchical cluster analysis in Mathcad, and you have done much more than half the work.

    I think we need to add the calculation of AC to the UPGMA program too. It would make it much easier if it's called more than once with different parameters. I may do it tomorrow. I'll have UPGMA return a nested matrix with the current Groups matrix as one entry and AC as the other. That way it won't break the dendrogram routine :-)

    Richard

    • A.Non, Feb 23, 2009 (in response to PhilipLeitch)
      On 2/23/2009 6:43:24 PM, pleitch wrote:
      >RE Tag it on the end... Yeah -
      >that will work... but it will
      >be slower than what I
      >suggested.
      >
      >Probably won't notice much
      >unless the are lots and lots
      >of objects.

      But no slower than it is now. Thinking about it though, maybe the right thing to do is just encapsulate AC in its own function. Then if you don't need it, the execution time is zero :-)

      >Ward's method - also known as
      >ESS or Error Sum of Squares -
      >is one of the next couple I am
      >doing (literally implementing
      >now).

      I haven't heard it called that, but that's not surprising. If you want to check it's working correctly, that R-based web program also does Ward's method.

      >If you want to wait off - I'll
      >have it done soon.

      I am not one to unnecessarily duplicate work!

      Richard
  • A.Non, Feb 23, 2009 (in response to PhilipLeitch)
    On 2/23/2009 12:51:42 AM, pleitch wrote:
    >Although it would make sense
    >to include it - I think I'll
    >leave it separate. AC isn't
    >frequently noted when
    >calculating this measure. Out
    >of three text books I have,
    >only the book that ONLY deals
    >with "finding groups in data"
    >has it. Also - will it make
    >it quicker? I can't see that
    >the AC calculation will
    >actually take very long.

    No, it would just make it easier to call it multiple times. I am a great believer in encapsulating everything in a function, rather than writing out the same code more than once.

    >The AC can only be calculated at
    >the end - because it measures
    >the length of each item as
    >(1 - Proportional Distance). The
    >proportional distance is the
    >distance of the item's join to
    >the distance of the final
    >join. So you would need to
    >have a check on each
    >iteration to determine if one
    >of the items is "singular"; if
    >so, add it to a running total.
    >Then - at the end of the
    >expression, divide all the
    >"singular" distances by the
    >number of items, then by the
    >final distance...
    >
    >Assume n is the number of Items
    >
    >A minimum has been found:
    >? Is this a singular Item ?
    >If Yes
    >    TotalDistance <- (TotalDistance + (CurrentDistance/n))
    >? Is this the final join ? //Note - not mutually exclusive to last if.
    >If Yes
    >    AC <- (1-(TotalDistance/CurrentDistance))

    You are making it sound much more difficult than it is. I was just going to tag it on to the end of the routine, right before you return "Results."


    >I'll
    >probably be adding other
    >distance measures to this
    >shortly....

    In my experience what makes a much bigger difference to the clustering is the linkage, i.e. the UPGMA part. What you have is the average linkage, but there are lots of others: single, complete, and centroid are common. I have had a lot of success with Ward's method in the past.
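The difference between these linkages is only in how the between-cluster distance is defined; everything else in the agglomeration loop stays the same. A minimal sketch of three of them (illustrative code, not the worksheet's):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two observations
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Between-cluster distance under three common linkages.
    Clusters are lists of observations (lists of numbers)."""
    pairs = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # nearest pair of members
        return min(pairs)
    if method == "complete":    # furthest pair of members
        return max(pairs)
    if method == "average":     # group average (UPGMA), as in the worksheet
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage: " + method)
```

Single linkage tends to chain clusters together, complete linkage favors compact ones, and the group average sits between the two, which is why the choice of linkage usually changes the dendrogram more than the choice of metric does.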

    Richard
  • A.Non Diamond 9,901 posts since
    May 11, 2010
    Currently Being Moderated
    Feb 24, 2009 12:00 AM (in response to PhilipLeitch)
    Cluster Analysis - Agglomerative Nesting (AGNES)
    On 2/24/2009 1:09:40 AM, pleitch wrote:
    >Done - check my recent posts.
    >
    >Just to be clear, it isn't called ESS/Error sum of squares. However, it is based on this. It finds the Euclidean distances between the objects of the cluster and its centroid. So instead of determining distances to combine close objects, it is determining the next objects that would combine to create the next smallest group (i.e. the next tightest group of objects). That way the focus is on creating small groups - stepping out to larger groups - as opposed to joining the closest objects together.
    >
    >At least... this is my understanding.

    Sounds about right. Here's another description:

    "The previous algorithms merge the two groups which are most similar. Ward's Algorithm, however, tries to find as homogeneous groups as possible. This means that only two groups are merged which show the smallest growth in heterogeneity factor H. Instead of determining the spectral distance, the Ward�s Algorithm determines the growth of heterogeneity H."

    Richard
  • A.Non Diamond 9,901 posts since
    May 11, 2010
    Currently Being Moderated
    Feb 25, 2009 12:00 AM (in response to PhilipLeitch)
    Cluster Analysis - Agglomerative Nesting (AGNES)
    On 2/25/2009 7:35:21 AM, pleitch wrote:

    >Ward doesn't need it - but the existing procedure still called it. So to avoid re-writing the procedure I supplied a distance method that didn't calculate anything.

    But why not just ignore it? It doesn't matter what the distance method is. If you wanted some minor time saving, why not just set all the elements to 0?

    >If you are traveling to Australia, let me know - our paths might cross.

    No. I have to go to Bali first, then Beijing. It's all business, but I'll slot in some personal time in Bali (would be dumb not to!). I've been to Beijing so many times that's just another long business trip :-(

    Richard
    • A.Non Diamond 9,901 posts since
      May 11, 2010
      Currently Being Moderated
      Feb 26, 2009 12:00 AM (in response to PhilipLeitch)
      Cluster Analysis - Agglomerative Nesting (AGNES)
      On 2/26/2009 6:22:09 AM, pleitch wrote:

      >Sure thing.
      >
      >From my perspective this has been collaborating at its best.

      I agree. It's been very productive.

      Richard
      • ptc-1368288 Copper 15,155 posts since
        Nov 15, 2007
        Currently Being Moderated
        Feb 26, 2009 12:00 AM (in response to A.Non)
        Cluster Analysis - Agglomerative Nesting (AGNES)
        On 2/26/2009 8:39:23 AM, rijackson wrote:

        >On 2/26/2009 6:22:09 AM, pleitch wrote:
        >>Sure thing.
        >>
        >>From my perspective this has been collaborating at its best.
        >
        >I agree. It's been very productive.
        >
        >Richard
        ______________________

        Hopefully Mona catches that great piece of work and has PTC include it in a future DAEP.

        Yes, Philip: this forum is the "Mathcad Klondike", a good pit to dig. BTW, as you are in the medical fields, do you have the Delaunay diagram in your project plans? I tried it a long time ago but to no avail, because I was too much of a novice.

        Jean



  • A.Non Diamond 9,901 posts since
    May 11, 2010
    Currently Being Moderated
    Feb 26, 2009 12:00 AM (in response to PhilipLeitch)
    Cluster Analysis - Agglomerative Nesting (AGNES)
    On 2/26/2009 1:39:32 AM, pleitch wrote:

    >Latest Additions/Revisions:
    >Cleaned up some of the variables and text.
    >
    >Cleaned up Null distance to be zero (should have been this from the start).
    >
    >Added more methods:
    >Single Linkage (Nearest Neighbour)
    >Complete Linkage (Furthest Neighbour)
    >Centroid Method (Average)

    Excellent :-)


    >The updated bar graph is set to match the length shown on the web site (which is different to my book):
    >http://www.wessa.net/rwasp_agglomerativehierarchicalclustering.wasp#output
    >
    >However, the order of the bar graph is still out. The order is based on both distance and join path (the New_Labels variable later in the worksheet) - but I don't have the time/care factor to do that right now.

    I also noticed the differences in the graphs. If you compare the bar graph on the web site to the dendrogram on the web site, though, you'll notice they contain exactly the same information. The bar graph is a sort of "filled in" dendrogram. If I find time while I'm in Asia I might see if I can modify the dendrogram routine to also return the correct data for the banner plot (a.k.a. bar graph :-)). I also intend to convert the AC and dendrogram calculations to functions. Ideally, it should be possible to call all of these routines multiple times without having to copy code. Finally, a lot of stuff needs to be dropped into collapsed areas so it's easier to scroll up and down the worksheet, and of course we need list boxes, etc., to pick methods. I'll see what I can get done, but for sure nothing for the next week or so (until I get to Beijing).

    Richard
  • ptc-1368288 Copper 15,155 posts since
    Nov 15, 2007
    Currently Being Moderated
    Feb 27, 2009 12:00 AM (in response to PhilipLeitch)
    Cluster Analysis - Agglomerative Nesting (AGNES)
    On 2/27/2009 12:12:40 AM, pleitch wrote:

    >>BTW, as you are in the medical fields, do you have the Delaunay diagram in your project plans? I tried it a long time ago but to no avail, because I was too much of a novice.
    >
    >No. I've never dealt with that.
    >
    >Do you mean creating a triangle mesh, where... constraints...
    >
    >If so - then no... but here is a link that might be helpful:
    >http://www.cs.cmu.edu/~quake/triangle.html
    >
    >If not - what is it you are after?
    >
    >Ta.
    >Philip

    Interesting applets in there.

    http://www.diku.dk/hjemmesider/studerende/duff/Fortune/

    It sounds like a big project!

    jmG

