Central Tendency — measures

The three common measures of central tendency used in statistics are:

1. Mean
2. Median
3. Mode

There are, of course, other measures, as the Wikipedia page attests. However, the inspiration for this post was yet another of J.D. Cook's blog posts.

Note that all three of these, and the other measures too, obey the basic rules of measure theory.

The point is that the measure you choose to describe your central tendency is key, and it should be decided based on what you want to do with it. More precisely: what exactly do you want to optimize your process/setup/workflow for? Based on that, you'll have to choose the right measure. If you read the post above, you'll understand the following:

Note that even within "mean" there are multiple types of mean. For simplicity, I'll assume mean refers to the arithmetic mean within the context of this post.

  • Mean — The mean is a good choice when you want to minimize the variance (i.e. the squared distance, or the second statistical moment about the central tendency measure). That is to say, your optimization function is dominated by squared-distance terms (distance from the central tendency measure). Think of lowering mean squared error, and how it's used in straight-line fitting.
  • Median — The median is more useful if your optimization function has distance terms, but not squared ones. In effect, this is the choice when you want to minimize the absolute distance from the central tendency.
  • Midrange — The midrange is useful when your function looks like max(distance from central measure). A small sketch after this list illustrates all three cases.
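To make those three cases concrete, here is a minimal sketch (in Python, with made-up sample data, so purely illustrative): it scans a grid of candidate "centre" values and checks which one minimizes each loss. The squared loss is minimized near the mean, the absolute loss near the median, and the worst-case loss near the midrange.

```python
import numpy as np

# Hypothetical 1-D sample; any data works here.
data = np.array([1.0, 2.0, 2.5, 3.0, 10.0])

# Candidate values for the "centre" c, on a fine grid covering the data.
cs = np.linspace(data.min(), data.max(), 100001)

# Three loss functions, each evaluated for every candidate centre c.
sq_loss  = [np.sum((data - c) ** 2) for c in cs]   # sum of squared distances
abs_loss = [np.sum(np.abs(data - c)) for c in cs]  # sum of absolute distances
max_loss = [np.max(np.abs(data - c)) for c in cs]  # worst-case distance

print("squared loss minimized near ", cs[np.argmin(sq_loss)],  " mean     =", data.mean())
print("absolute loss minimized near", cs[np.argmin(abs_loss)], " median   =", np.median(data))
print("max loss minimized near     ", cs[np.argmin(max_loss)], " midrange =", (data.min() + data.max()) / 2)
```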

If most of that sounded too abstract, here's a practical application I can think of right away. Imagine you're doing performance testing and optimization of a small API you've built. I don't want to go into what kind of API it is or the technology behind it, so let's just assume you want to run it multiple times, calculate a measure of central tendency of the run times, and then try to improve the code's performance (with profiling, different libraries/data structures, whatever). So, which measure of central tendency should you pick?

  • Mean — Most engineers would pick the mean, and in a lot of cases it's enough. But think about it: it optimizes for the variance of the run/execution time, which is important and useful in most cases, but in some cases may not matter that much.
  • Midrange — An example is if your system is a small component of, say, a high-frequency trading platform, and the consumer of it has a timeout and fails if your API times out (i.e. your API is mission-critical; it simply cannot fail). Then you want to make sure that even in the worst case your program completes in time. If the worst-case run time is what you want to lower, you should pick the midrange. (Note this is still a trade-off, since you're not lowering the average/mean; it's a hard choice.)
  • Median — This is very similar to the mean, except it doesn't really care about the variance. If you pick the median, your optimized program will tend to have the best performance on the typical run/case/dataset, since outliers don't pull the median around.
  • Mode — Well, this is an interesting case. Think about it: even in the previous timeout example, this could be useful. Here it goes: suppose your API is not mission-critical (i.e. if it fails, the overall algorithm will just throw out that data term and proceed with the other data sources), and you want to maximize the number of times your program finishes within the timeout. That is, you're purely counting the number of times you finish/return a value within the timeout period; you don't care about the worst-case scenario. The sketch after this list computes all four summaries from a set of timings.
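As a quick illustration, here is a minimal sketch (Python, with fabricated run times; the timeout value is also made up) that computes the four candidate summaries from a list of timings, plus the "finished within timeout" count from the last bullet. Which one you optimize against depends on the scenario above.

```python
import statistics

# Hypothetical run times (milliseconds) from repeated calls to the API.
run_times_ms = [12.1, 11.8, 12.3, 11.9, 35.0, 12.0, 12.2, 11.7, 12.4, 12.1]

mean     = statistics.fmean(run_times_ms)
median   = statistics.median(run_times_ms)
mode     = statistics.mode(run_times_ms)                 # most frequent value
midrange = (min(run_times_ms) + max(run_times_ms)) / 2   # tracks the worst case

print(f"mean={mean:.2f}  median={median:.2f}  mode={mode}  midrange={midrange:.2f}")

# The "not mission-critical" view: how many runs beat a (made-up) timeout?
timeout_ms = 20.0
within = sum(t <= timeout_ms for t in run_times_ms)
print(f"{within}/{len(run_times_ms)} runs finished within the {timeout_ms} ms timeout")
```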

There are other measures too; the Wikipedia page on central tendency lists several more.

Additionally, you can take the mean of functions (non-negative ones too); see J.D. Cook's blog again.
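For the curious, the mean of a function over an interval [a, b] is usually taken as (1/(b - a)) times the integral of f over [a, b]. Here is a tiny numerical sketch of that idea (the function and interval are just examples):

```python
import numpy as np

def function_mean(f, a, b, n=100_000):
    """Approximate the mean value of f over [a, b]: (1/(b-a)) * integral of f."""
    xs = np.linspace(a, b, n)
    return f(xs).mean()  # average of uniform samples approximates the integral mean

# Example: the mean value of sin(x) over [0, pi] is 2/pi ~ 0.6366.
print(function_mean(np.sin, 0.0, np.pi))
```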

Squeeze theorem

To quote from "The Girl Next Door":
The first lesson of politics is "Always know whether the squeeze is worth the juice." Now, I was trying to finally make a genuine effort at understanding the Central Limit Theorem. Throughout my life (30 years), I have always been suspicious whenever statistics goes beyond the mean, median, mode, SD and variance (that is to say, whenever any statistic goes above the first and second moments). Part of it is because I never really learnt, or rather never paid enough attention to, the theorems involved in reasoning with distributions, enough to convince myself of them. Anyway, I figured the Central Limit Theorem would be a good place to start, and in the spirit of learning by teaching, I'm summarizing what I've learnt so far.

It started when I came across this post on HN; going through the comments and critique, I realized the demo is more of a special case. While I did get that specific example (and am sure of what the CLT says), I'm still unsure of why the Central Limit Theorem is true or how one formulates it in mathematical terms. It is important for me to understand those if I am ever to be able to question someone claiming some implication of the CLT. Anyway, I came across the squeeze theorem in one of the HN comments, and since it seems to be part of a proof of the CLT, I ended up reading about it; here's the result of that.

Anyway, enough story; let's move on. Here it is, straight from the Wikipedia page:

Assumptions:
There are three functions f, g, h defined on an interval I containing the point a.
a is a limit point of I.
f, g, h need not be defined at a itself, since it is only a limit point.

$g(x) \leq f(x) \leq h(x)$ for all $x$ in $I$ with $x \neq a$

$\lim_{x \to a} g(x) = \lim_{x \to a} h(x) = L$

To be proved:
$\lim_{x \to a} f(x) = L$
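Before getting to the proof, here is a standard worked example of the theorem in action (a textbook one, not from the original post). Since $-1 \leq \sin(1/x) \leq 1$ for every $x \neq 0$, multiplying through by $x^2$ gives

$-x^2 \leq x^2 \sin(1/x) \leq x^2$

Both bounding functions tend to $0$ as $x \to 0$, so by the squeeze theorem

$\lim_{x \to 0} x^2 \sin(1/x) = 0$,

even though $\sin(1/x)$ itself has no limit at $0$.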

Proof:

Limits:

I'll try to clarify what a limit is, as mathematically defined, hopefully without equations, in words only.
According to the Wikipedia page, saying that the limit of a function f(x) is some value L means that f(x) can be made as close as we like to L by making x sufficiently close to c.

Or, to write out the equation:
$\lim_{x \to c} f(x) = L$
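For completeness, here is the standard formal (epsilon-delta) version of that sentence; it isn't spelled out above, but it is the definition the Wikipedia page uses:

$\lim_{x \to c} f(x) = L$ means: for every $\varepsilon > 0$ there exists a $\delta > 0$ such that $|f(x) - L| < \varepsilon$ whenever $0 < |x - c| < \delta$.

In words: pick any tolerance $\varepsilon$ on the output, and you can find a tolerance $\delta$ on the input that keeps $f(x)$ within $\varepsilon$ of $L$ for every $x$ that close to $c$ (excluding $c$ itself).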

Harmonic Mean

This is a follow-up to the geometric mean post.

What exactly is the harmonic mean?
Well, to summarize the Wikipedia link, it is basically a way to average the rates of some objects.

Continuing with the laptop example, let's see how to compare the laptops in terms of best bang for the buck.

Once again, we have three attributes, and we divide each attribute value by the cost of the laptop. This gives us (rather approximately) how many GB/Rupee* we get.

Then we apply the formula for the harmonic mean: 3/(1/x1 + 1/x2 + 1/x3).
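Here is a minimal sketch of that calculation (Python; the per-rupee rates below are made up for illustration, the real numbers live in the linked spreadsheet):

```python
from statistics import harmonic_mean

# Hypothetical per-rupee rates for one laptop: CPU speed, disk space and RAM,
# each already divided by the laptop's price, as described above.
laptop_rates = [45.0, 0.02, 0.0003]  # made-up numbers, illustration only

# 3 / (1/x1 + 1/x2 + 1/x3), written out by hand...
manual = len(laptop_rates) / sum(1 / x for x in laptop_rates)

# ...and via the standard library, which gives the same value.
print(manual, harmonic_mean(laptop_rates))
```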

Just for the fun of argument, I threw in a Raspberry Pi 2 (plus the cost of a 32 GB SD card).
And of course** the Raspberry Pi 2 comes out on top in the harmonic mean (most bang for the buck) ranking.

Note how I divided the attributes by cost. I did that because the harmonic mean doesn't make sense when applied to values that are not rates (i.e., for the engineers, the units have to have a denominator).

Also note that the Raspberry Pi 2 is lower in both the arithmetic and geometric means of the raw attributes (CPU speed, disk space, RAM), but higher when it comes to value per price. That's one reason to use the harmonic mean of rates (per price, per time, etc.) when comparing similar purchases with multiple attributes/values to evaluate.

Now, so far these are all individual attributes that don't capture or evaluate other factors.

For example, Apple's Retina display technology, or for that matter CPU cache, AMD vs Intel processors, multithreading support, number of cores, etc.

All of these could be weighted, if you know how to weight them. Weighting them right would require some technical knowledge, and reading up reviews of products with those features in AnandTech's review/comparison posts.

* — If you look closely at the Excel sheet, I multiplied the GHz values by 1000 (i.e. converted them to MHz) to bring the numbers to a comparable level.

** — Of course, because it doesn't come with a monitor, keyboard or mouse. It is simply a bare single-board computer.

Geometric Mean

There is more than one type of average (or mean).

UPDATE: In fact, there's a generalized way of defining the mean. It's called the generalized mean, and all of the means below are special cases of it (a small sketch of it follows the list below).

The most famous are the three Pythagorean means:
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
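Since the update above mentions the generalized (power) mean, here is a small sketch of it (Python, using the standard definition): p = 1 gives the arithmetic mean, the limit p -> 0 gives the geometric mean, and p = -1 gives the harmonic mean.

```python
import math

def power_mean(xs, p):
    """Generalized (power) mean of positive numbers xs with exponent p."""
    n = len(xs)
    if p == 0:
        # The limiting case p -> 0 is the geometric mean.
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** p for x in xs) / n) ** (1 / p)

xs = [1.0, 4.0, 16.0]
print(power_mean(xs, 1))    # arithmetic mean: 7.0
print(power_mean(xs, 0))    # geometric mean:  4.0
print(power_mean(xs, -1))   # harmonic mean:   ~2.29
```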

In measure-theory terms, these are different ways to measure the central tendency of a distribution, and each is used in a different situation depending on the demands of the context.

The arithmetic mean is what most of us are taught in school and is the most widely used: add up the values and divide by the number of values.

What exactly is the geometric mean, and where and why is it useful?

The definition is rather simple:
to find the geometric mean of n numbers (positive ones), you multiply them all together and take the nth root of the resulting product. Now, why would it be useful, and what's the point of doing this? The restriction to positive numbers is not really a big difficulty in real life (in the worst case we can just shift/translate the origin for the axis with negative values).
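A minimal sketch of the definition (Python; computing it via logs to avoid overflow when multiplying many values is an implementation choice of mine, not something from the post):

```python
import math

def geometric_mean(xs):
    """nth root of the product of n positive numbers, computed via logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(geometric_mean([2.0, 8.0]))          # sqrt(2 * 8) = 4.0
print(geometric_mean([1.0, 10.0, 100.0]))  # cbrt(1000)  = 10.0
```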

OK, now imagine you have a standard graph whose axes have very different ranges; i.e., the x axis varies from 0 to 0.5 while the y axis varies from 0 to 100.
Now suppose you want to compare two (or more) objects/distributions, both of which have measures along x and y. You can plot points in different colours (for the different objects) against x and y, and then try to make a decision based on what you want to pick.

That sounds fine as long as there are very few (say 5-10) different objects to be compared. What if you have, say, 100 laptops and 10 features you want to compare them across?
Ah, now we're in real trouble. How do we know which ones are which among 100 colours? On top of that, you have 5 graphs (for the 10 features).

What we need is a way to combine these axes into one axis. Then we can go back to simple bar charts.

Here comes the geometric mean to the rescue. If you look at the definition, it multiplies the feature values, which gives us an area (if 2 features), a volume (if 3), or an n-dimensional volume.
We can't simply use this value because, as it stands, it is biased towards features that have a higher range of values;
i.e., in the previous example, the y axis, which ranged from 0 to 100, would simply wash out any differences in x.

So we take the (2nd or 3rd or nth) root of this value. In effect, we have normalized the axis ranges themselves.

Note: the cool part here is that we don't need to know anything about the actual range itself. The nature of the operations on the values (i.e., product and nth root) ensures the final geometric mean value is equalized.

For an example, I'll pick laptop CPU speed, hard disk size, and RAM; here's a link.
If you look at it closely, while in the examples I have picked all three Pythagorean means don't change the ordinality/ranking of the laptops being compared, the arithmetic mean gets dominated/boosted simply by raising the hard disk space.
The geometric mean, on the other hand, doesn't get raised (as much, or as easily) by raising the attribute with the higher values. The sketch below shows the effect with numbers.
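Here is that effect with made-up laptop specs (not the numbers in the linked sheet): doubling only the disk space nearly doubles the arithmetic mean, but moves the geometric mean by only a factor of 2^(1/3).

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical specs: (CPU GHz, disk GB, RAM GB).
laptop_a = [2.5, 500.0, 8.0]
laptop_b = [2.5, 1000.0, 8.0]  # identical, except double the disk space

print(arithmetic_mean(laptop_a), arithmetic_mean(laptop_b))  # ~170 -> ~337, dominated by disk
print(geometric_mean(laptop_a), geometric_mean(laptop_b))    # ~21.5 -> ~27.1, up only 2^(1/3)
```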

It's not really surprising, since the geometric mean is multiplicative (a product followed by a root) while the arithmetic mean is a linear function.

You can ignore the harmonic mean for now, as it's not at all clear what the common rate among the laptops would be. I'll later make another post/update detailing how the harmonic mean can be used for this case.

One case where the harmonic mean is used is in finding the F-score for comparing predictive algorithms, statistical tests, etc.
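For reference, the F1 score is the harmonic mean of precision and recall (standard definition, not from the post):

```python
def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A high precision cannot compensate for a low recall (or vice versa):
print(f1_score(0.9, 0.5))  # ~0.643
```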

UPDATE: Harmonic mean post is here.

UPDATE 2: One way to approach and/or defend against confusopoly is to choose a good measure to normalize the value of the features against, such as the geometric mean: https://softwaremechanic.wordpress.com/2016/07/18/geometric-mean/ . However, note that this still requires you to figure out which features are comparable and how meaningful and interchangeable they are. That's not trivial and needs deep domain expertise.