Regularization

Based on a small post found here.

One of the standard problems in ML with meta-modelling algorithms (algorithms that run multiple statistical models over the given data and identify the best-fitting one, e.g. random forests or the rarely practical genetic algorithms) is that they might favour overly complex models that overfit the given training data but perform poorly on live/test data.

The way these meta-modelling algorithms work is that they have an objective function (usually the RMS of the error of the stats/sub model on the data) and they pick the model that yields the lowest value of that objective function. So we can just add a complexity penalty (one obvious idea is the degree of the polynomial the model uses to fit, but how does that work when comparing against exponential functions?), and the objective function becomes RMS(Error) + Complexity_penalty(model).
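To make the idea concrete, here's a toy sketch (the function names, the degree-based penalty and the penalty weight are my own illustrative choices, not anything from the post): pick the polynomial degree that minimizes RMS error plus a complexity penalty.

import numpy as np

def rms_error(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def penalized_score(x, y, degree, penalty_weight=0.05):
    coeffs = np.polyfit(x, y, degree)        # fit a polynomial of the given degree
    y_pred = np.polyval(coeffs, x)
    return rms_error(y, y_pred) + penalty_weight * degree   # RMS(Error) + Complexity_penalty(model)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.1, size=x.size)   # the "true" relationship is just a line plus noise
best_degree = min(range(1, 10), key=lambda d: penalized_score(x, y, d))
print(best_degree)

On raw RMS alone, higher degrees always look at least as good on the training data; the penalty term is what pushes the choice back toward the simpler model.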

 

Now, depending on the right choice of error function and complexity penalty, this can find models that perform worse than more complex models on the training data, but perform better in the live scenario.

The idea of a complexity penalty itself is not new. I don't dare say ML borrowed it from scientific experimentation methods or some such, but the idea that a more complex theory or model should be penalized relative to a simpler one is very old. Here's a better-written post on it.

Related Post: https://softwaremechanic.wordpress.com/2016/08/12/bayesians-vs-frequentistsaka-sampling-theorists/

 

Statistical moments —

Inspired by this blog from the PayPal team. Moment is a physics concept (or at least I encountered it first in physics, but it looks like it has been generalized in math to apply to other fields).

If you followed that wikipedia math link above, you’ll know the formula for moment is
\mu_n = \int\limits_{-\infty}^{+\infty} (x-c)^n f(x)\, dx
where x — the value of the variable
n — order of the moment (aka nth moment, we’ll get to that shortly)
c — center or value around which to calculate the moment.

However, if you look at a few other pages and links, they ignore that part, c, and of course use the summation symbol.**

The reason they don't put 'c' up there is that they assume the moment is taken around the value 0. As we'll see below, this is well and good in some cases, but not always.

The other part, n, the order of the moment, is an interesting concept. It's just raising the value to the nth power. To begin with, if n is even, the negative signs caused by the differences go away, so everything adds up and the moment becomes a monotonically increasing function of the deviations.
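Here's a quick numpy sketch of the discrete (summation) version of the formula, with both the order n and the centre c exposed; the two distributions are just illustrative choices of mine:

import numpy as np

def nth_moment(sample, n, c=0.0):
    """n-th moment of the sample about the centre c (summation form of the integral above)."""
    sample = np.asarray(sample, dtype=float)
    return np.mean((sample - c) ** n)

rng = np.random.default_rng(1)
thin_tailed = rng.normal(size=100_000)            # normal distribution: thin tails
fat_tailed = rng.standard_t(df=3, size=100_000)   # Student-t with 3 degrees of freedom: fat tails

for name, sample in [("normal", thin_tailed), ("t(3)", fat_tailed)]:
    print(name,
          nth_moment(sample, 2, c=sample.mean()),   # 2nd moment about the mean, i.e. the variance
          nth_moment(sample, 4, c=sample.mean()))   # 4th moment: much larger (and unstable) for the fat tail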

I usually would argue that 'c' should be a measure of central tendency like the mean/median/mode, and that a sign of a fat-tailed (as opposed to thin-tailed) distribution is that the moments change wildly if you choose a different c.

The statsblog I linked above mentions something different.

Higher-order terms(above the 4th) are difficult to estimate and equally difficult to describe in layman’s terms. You’re unlikely to come across any of them in elementary stats. For example, the 5th order is a measure of the relative importance of tails versus center (mode, shoulders) in causing skew. For example, a high 5th means there is a heavy tail with little mode movement and a low 5th means there is more change in the shoulders.

Hmm, I wonder how or why? I can't figure out how it can be an indication of fat tails (referred to by the phrase "importance of tails" in the quote above) with the formula they are using, i.e. when the formula doesn't mention anything about 'c'.

** — That would be the notation for discrete variables as opposed to continuous variables, but given that most real-world applications of statistics use sampling at discrete intervals, it's understandable to use this notation instead of the integral sign.

Continuity of a function.

Most of us would have studied (likely in high school) the idea of functions being continuous.

As the wikipedia section states, we end up with 3 conditions for a function to be continuous at a point c of an interval [a,b]:

  • The function must be defined at c (i.e. f(c) exists).
  • The limit of f(x) as x approaches c must exist.
  • The value of that limit must equal f(c).

 

Now this is a perfectly useful notion for most of the functions we encounter in high school. But there are functions that satisfy these three conditions and still won't help us move forward, as I just discovered while reading up for my statistical self-education. One moment I was trying to understand what the beta distribution is, or for that matter what sense it makes to talk about a probability density function (I mean, I understand what probability is, but how can it have density?), I lose focus for a few seconds, and I find myself tumbling down a click-hole to a curiouser idea: 3 levels/types of continuity of a function (ordered by strictness), namely

  • absolute continuity
  • uniform continuity
  • (ordinary) continuity

 

The last one being what we studied and what I described above.
Now let's climb up one more step on the ladder of abstraction and see what uniform continuity is.
Ah, we have five more types of continuity listed there.

Ok, I won't act vogonic and try to understand or explain all of those. I just put them out there to tease; feel free to click your way in.

To quote the first line from the uniform continuity wiki:

In mathematics, a function f is uniformly continuous if, roughly speaking, it is possible to guarantee that f(x) and f(y) be as close to each other as we please by requiring only that x and y are sufficiently close to each other; unlike ordinary continuity, the maximum distance between f(x) and f(y) cannot depend on x and y themselves. For instance, any isometry (distance-preserving map) between metric spaces is uniformly continuous.

So what does this mean and how does it differ from ordinary continuity? Well, they say it up there: the maximum distance between f(x) and f(y) cannot depend on x and y themselves, i.e. the distance function d(f(x), f(y)) has neither x nor y in its expression/input/right-hand side.

The more formal definition can be quoted like this:

Given metric spaces (X, d1) and (Y, d2), a function f : X → Y is called uniformly continuous if for every real number ε > 0 there exists δ > 0 such that for every x, y ∈ X with d1(x, y) < δ, we have that d2(f(x), f(y)) < ε.

Now why would this be relevant or useful, and why is it stricter than ordinary continuity? Well, note that it doesn't say anything about an interval. Ordinary continuity is always defined on an interval in the input space and is clearly confined to it, i.e. it is a property that is local to the given interval in the input/domain space and may or may not hold on other intervals.

On the other hand, if you can say the function is uniformly continuous, you're effectively saying that one δ works for a given ε across the whole domain, not just on a particular interval around a point.
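A tiny numerical illustration of the difference (my own toy check, not from the linked pages): for f(x) = x^2 on the whole real line, the gap |f(x) - f(y)| for a fixed |x - y| keeps growing as x grows, so no single delta works everywhere; x^2 is continuous but not uniformly continuous on the reals. For sqrt(x) on [0, infinity) the gap stays bounded.

import numpy as np

delta = 0.01   # keep x and y exactly this far apart
for x in [1.0, 10.0, 100.0, 1000.0]:
    y = x + delta
    print(x,
          abs(x ** 2 - y ** 2),           # grows without bound as x grows
          np.sqrt(y) - np.sqrt(x))        # stays small no matter how large x gets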

Now how do we find an even stricter notion (i.e. absolute continuity)? Looking back at the 3 conditions we defined at the start of this blog post, the pattern so far has been to keep strengthening the "how close" requirement: ordinary continuity controls f near a single point, while uniform continuity demands one δ that works everywhere. Absolute continuity tightens this further: for every ε > 0 there must be a δ > 0 such that over any finite collection of disjoint sub-intervals of [a,b] with total length less than δ, the total change in f adds up to less than ε. Absolute continuity implies uniform continuity, which in turn implies ordinary continuity.

Ok, all of this is great, except where the hell is this useful? I mean, are there functions that belong to different continuity classes, so that these definitions/properties and the theorems built on them can be used to differentiate and reason about functions? Turns out there are. I'll start with something I glimpsed on my way down the click-hole: the Cantor distribution. It's the exception that forces us to create a new continuity class; it is neither discrete nor absolutely continuous.

Its distribution therefore has no point masses, no probability mass function, and no probability density function.* That throws a lot of reasoning/theorem systems for a loop.
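For the curious, here's a small sketch of the Cantor distribution's CDF, the Cantor function (my own recursive approximation with an arbitrary cut-off depth): the function is continuous everywhere, yet constant on each removed middle-third interval, and those intervals have total length 1, which is why no density can exist.

def cantor_cdf(x, depth=40):
    """Approximate CDF of the Cantor distribution on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    if depth == 0:
        return 0.5
    if x < 1 / 3:
        return 0.5 * cantor_cdf(3 * x, depth - 1)        # left third: rescale and halve
    if x < 2 / 3:
        return 0.5                                       # middle third: flat, no mass here
    return 0.5 + 0.5 * cantor_cdf(3 * x - 2, depth - 1)  # right third: rescale and halve

print(cantor_cdf(0.25), cantor_cdf(0.5), cantor_cdf(0.75))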

For the other example, i.e. something that is ordinarily continuous but not uniformly continuous, see here. It's a proof-by-contradiction approach.

* — Ok, I confess, the last point about point masses and probability mass/density functions still escapes me. I'll revisit it later, perhaps with the help of Norvig's excellent IPython notebook on probability.

Bayesians vs Frequentists(aka sampling theorists)

This is a long-standing debate/argument, and like most polarized arguments, both sides have some valid and good reasons for their stand. (There goes the punchline/TL;DR.) I'll try to go a few levels deeper and explain why I think this is something of a fake argument. (Disclaimer: I am just a math enthusiast and a (willing-to-speculate) novice. Do your own research; if this post serves as a starting point for that, I'll have done my job.)

  • As EY writes in this post, Bayes' theorem is a law that's more general and should be followed in preference to whatever frequentist tools we have developed.
  • If you read the original post carefully, he doesn't mention the original/underlying distribution, guesses about it, or confidence intervals (see the calibration game).
  • He points to a chapter (in the addendum) here.
  • Most of the post otherwise is about using tools vs using a general theory, and how the general theory is more powerful and saves a lot of time.
  • My first reaction to the post was: but obviously there's a reason those two cases should be treated differently. They both have the same number of samples, but different ways of taking the samples. One sampling method (the one that samples until 60% success) is a biased way of gathering data. (See the sketch after this list.)
  • As a different blog and some comments point out, if we're dealing with robots (deterministic, algorithmic data-collectors) that take data in a rigorous, deterministic, algorithmic manner, the Bayesian priors are the same.
  • However, in real life it's going to be humans, who'll have a lot more decisions to make about whether to consider a data point or not (for example, what stage a patient should be at before being considered a candidate for the experimental drug).
  • The point I am interested in making, however, is related to the known-unknowns vs unknown-unknowns debate.
  • My point being: even if you have a robot collecting the data, if the underlying nature of the distribution is an unknown-unknown (or depends on an unknown-unknown factor, say location, since some diseases are more widespread in some areas), two collectors can gather the same results even while seeing different local distributions.
  • A related point is that determining the right sample size, large enough to be confident about the representativeness of the sample, is in a lot of cases a harder problem.
  • To be fair, EY is not ignorant of the problem described above. He even refers to it a bit in his "0 and 1 are not probabilities" post here. So the original post might have over-simplified for the sake of rhetoric, or simply because he hadn't read The Red Queen.
  • The Red Queen details a bunch of evolutionary theories, eventually arguing that the constant race between parasites and host immune systems is why we have sex as a reproductive mechanism and why we have two genders/sexes.
  • The medicine/biology example is a much more complex system than it seems, so this mistake is easier to make.
  • Yes, in all of the cases above the Bayesian method (which is simpler to use and understand) will work, if the factors (priors) are known before doing the analysis.
  • But my point is that we don't know all the factors (priors) and may not even be able to list all of them, let alone screen them and find the prior probability of each.
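Here's a small sketch of the stopping-rule example being argued over (the counts are illustrative numbers of mine, not EY's): two experimenters both end up with 9 successes and 3 failures, but one fixed n = 12 in advance while the other kept sampling until the 3rd failure. The frequentist tail probabilities against H0: p = 0.5 differ between the two stopping rules, while the Bayesian posterior (with a Beta prior) depends only on the counts.

from scipy import stats

successes, failures = 9, 3
n = successes + failures

# p-value when n = 12 was fixed in advance: P(at least 9 successes out of 12 | p = 0.5)
p_fixed_n = 1 - stats.binom.cdf(successes - 1, n, 0.5)

# p-value when sampling until the 3rd failure: P(at least 9 successes before the 3rd failure | p = 0.5).
# scipy's nbinom counts "failures before the r-th success"; with p = 0.5 the labels are symmetric,
# so this is the tail we want.
p_stop_at_3_failures = 1 - stats.nbinom.cdf(successes - 1, failures, 0.5)

# Bayesian posterior with a uniform Beta(1, 1) prior: identical under both stopping rules
posterior = stats.beta(1 + successes, 1 + failures)

print(p_fixed_n, p_stop_at_3_failures, posterior.mean())

The two tails come out around 0.073 and 0.033, which is exactly the kind of divergence the argument is about, while the posterior is the same Beta(10, 4) either way.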

With that I'll wind this post down. But I'll leave you with a couple more posts I found around the topic that seem to dig into more detail (here and here).

 

P.S: Here's a funny set of Chuck Norris-style facts about Eliezer Yudkowsky (which I happened upon when trying to find the original post and was not aware of before composing this post in my head). And here's an xkcd comic about frequentists vs bayesians.

UPDATE-1 (5-6 hrs after original post conception): I realized my disclaimer doesn't really give you a Bayesian prior with which to judge my post. So here's my history with statistics: I've had trouble understanding the logic/reasoning/proofs behind standard (frequentist?) statistical tests, and was never a fan of rotely going through the steps. So I am still trying to understand the logic behind those tests, but today, if I were to bet, I'd rather bet on results from the Bayesian method than from any conventional method**.

UPDATE-2 (5-6 hrs after original post conception): A good example might be a counter-example, i.e. given the same data (in this case just the frequencies of a distribution, nothing else like mean, variance, kurtosis or skewness), show that the Bayesian method gives different results based on how the data was collected while the frequentist method doesn't. I'm not sure that's possible, though, given the number of methods frequentist/standard statistics uses.

UPDATE-3 (a few weeks after original writing): Here’s another post about the difference in approaches between the two.

UPDATE-4 (a month or so after): I came across this post, which mentions more than two buckets, though obviously they are not all disjoint sets (buckets).

UPDATE-5 (a couple of months further on): There's a slightly different approach to splitting the two cultures, from a different perspective, here.

UPDATE-6: A discussion in my favourite community can be found here.
** — I might tweak the amount I'd bet based on the results from it.

Central Tendency — measures

The 3 common measures of central tendency used in statistics are:

  • 1. Mean
  • 2. Median
  • 3. Mode

There are of course other methods, as the Wikipedia page attests. However, the inspiration for this post was yet another of J.D. Cook's blog posts.

Note that all three of these, and the other measures, obey the basic rules of measure theory.

The point being: what you choose to describe your central tendency is key and should be decided based on what you want to do with it. Or more precisely, what exactly do you want to optimize your process/setup/workflow for? Based on that, you'll have to choose the right measure. If you read the post above you'll understand that:

Note that even within "mean" there are multiple types of mean. For simplicity, I'll assume mean means arithmetic mean within the context of this post.

  • Mean — The mean is a good choice when you want to minimize the variance (aka the squared distance, or the second statistical moment about the central tendency measure). That's to say, your optimization function is dominated by a lot of squared-distance-from-the-centre terms. Think of lowering mean squared error, and how that's used in straight-line fitting.
  • Median — The median is more useful if your optimization function has distance terms, but not squared ones. So this is in effect the choice when you want to minimize the total absolute distance from the central tendency.
  • Midrange — The midrange is useful when your function looks like max(distance from the central measure). (See the small numerical check after this list.)
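Here's a tiny numerical check of those three claims on made-up data (the brute-force grid search is just an easy way to find each minimizer):

import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(0, 12, 1201)   # brute-force grid of candidate "centres"

squared_loss = [np.sum((data - c) ** 2) for c in candidates]
absolute_loss = [np.sum(np.abs(data - c)) for c in candidates]
max_loss = [np.max(np.abs(data - c)) for c in candidates]

print("minimizes squared loss :", candidates[np.argmin(squared_loss)], "= mean", data.mean())
print("minimizes absolute loss:", candidates[np.argmin(absolute_loss)], "= median", np.median(data))
print("minimizes max loss     :", candidates[np.argmin(max_loss)], "= midrange", (data.min() + data.max()) / 2)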

If most of that sounded too abstract, then here's a practical application I can think of right away. Imagine you're doing performance testing and optimization of a small API you've built. I don't want to go into what kind of API or what technology is behind it. So let's just assume you want to run it multiple times, calculate a measure of central tendency from the timings, and then try to improve the code's performance (with profiling + different libraries/data structures, whatever). Which measure of central tendency should you pick?

  • Mean — Most engineers would pick the mean, and in a lot of cases it's enough, but think about it: it optimizes the squared deviations of the run/execution time, which is important and useful in most cases, but in some cases may not be what actually matters.
  • Midrange — An example: your system is a small component of, say, a high-frequency trading platform, and its consumer has a timeout and fails if your API doesn't respond in time (aka your API is mission-critical, it simply cannot be the one that times out). Then you want to make sure that even in the worst case your program completes in time, and since the midrange tracks the maximum deviation, it is the measure to watch. (Note this is still a trade-off against the average/mean case, just like any hard choice.)
  • Median — This is very similar to the mean, except it doesn't really care about variance. If you're picking the median, your optimized program will have the best performance on the typical run/case/dataset.
  • Mode — This is an interesting case; even in the previous timeout example it could be useful. Suppose your API is not mission-critical (i.e. if it fails, the overall algorithm just throws out that data term and proceeds with the other data sources), and what you really want is to maximize the number of runs that land on the common, fast runtime and finish within the timeout, without caring about the worst-case scenario. Then the mode (the most frequent runtime) is the closest fit.

There are other measures, such as:

 

Additionally, you can take the mean of functions (non-negative ones too). See J.D. Cook's blog again.

Squeeze theorem

To quote from "The Girl Next Door":
The first lesson of politics is "Always know whether the squeeze is worth the juice". Now I was trying to finally make a genuine effort at understanding the Central Limit Theorem. Throughout my life (30 years), I have always been suspicious whenever statistics goes beyond the mean, median, mode, SD and variance (i.e. whenever any stat goes above the first and second moments). Part of it is because I never really learnt, or rather never paid enough attention to convince myself of, the theorems involved in reasoning with distributions. Anyway, I figured the Central Limit Theorem would be a good place to start, and in the spirit of learning by teaching I am summarizing what I've learnt so far.

It started off when I came across this post on HN; going through the comments and critique, I realized the demo is more of a special case, and while I did get that specific example (and am sure of what the CLT says), I am still unsure why the Central Limit Theorem is true or how one formulates it in math terms. It is important for me to understand those if I am ever to be able to question someone claiming some implication of the CLT. Anyway, I came across the squeeze theorem in one of the HN comments, and since it seems to be part of the proof of the CLT, I ended up reading about it; here's the result of that.

Anyway, enough story. Let's go onwards. So here goes, straight from the wikipedia page:

Assumptions:
There are three functions f, g, h defined on an interval containing the limit point a (they need not be defined at a itself), such that:

g(x) \leq f(x) \leq h(x) for all x in the interval (except possibly at a)

\lim_{x \to a} g(x) = \lim_{x \to a} h(x) = L

To be proved:
\lim_{x \to a} f(x) = L

Proof:

Limits:

I'll try to clarify what a limit is, as mathematically defined, hopefully in words rather than equations.
Well, according to the wikipedia page, saying the limit of a function f(x) as x approaches c is L means that f(x) can be made as close to the value L as we like,
by making x sufficiently close to c.

Or, to write out the equation:
\lim_{x \to c} f(x) = L
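As a concrete example (my addition, not from the wiki page): since -x^2 \leq x^2 \sin(1/x) \leq x^2 for all x \neq 0, and \lim_{x \to 0} -x^2 = \lim_{x \to 0} x^2 = 0, the squeeze theorem gives \lim_{x \to 0} x^2 \sin(1/x) = 0, even though \sin(1/x) itself has no limit at 0.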

Harmonic Mean

This is a followup to the geometric mean post.

What exactly is the harmonic mean?
Well, to summarize the wikipedia link, it is basically a way to average rates over some objects.

Continuing with the laptop example, let's see how to compare the laptops in terms of best bang for the buck.

Once again, we have three attributes, and we divide each attribute value by the cost of the laptop. This gives us (rather approximately) how many GB per rupee* we get.

Then we apply the formula for the harmonic mean, i.e. 3/(1/x1 + 1/x2 + 1/x3).
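Here's a small sketch of that calculation with made-up numbers (these are not the values from my actual sheet, just an illustration of turning each attribute into a per-rupee rate and then taking the harmonic mean):

from scipy.stats import hmean

# (CPU in MHz, disk in GB, RAM in GB, price in rupees) -- hypothetical values
machines = {
    "laptop_a": (2400, 512, 8, 60000),
    "laptop_b": (2800, 1024, 16, 90000),
    "pi2_plus_sd_card": (900, 32, 1, 4000),
}

for name, (cpu, disk, ram, price) in machines.items():
    rates = [cpu / price, disk / price, ram / price]   # each attribute as a per-rupee rate
    print(name, hmean(rates))                          # 3 / (1/x1 + 1/x2 + 1/x3)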

Just for the fun of argumentation, I threw in a Raspberry Pi 2 plus the cost of a 32 GB SD card.
And of course** the Raspberry Pi 2 comes out on top in the harmonic-mean (most bang for the buck) ranking.

Note how I divided the attributes by cost. I did that because the harmonic mean doesn't make sense to apply to values that are not rates (aka, for the engineers: the units have to have a denominator).

Also note that the Raspberry Pi 2 is lower in both the arithmetic and geometric means of the raw attributes (CPU speed, disk space, RAM), but higher when it comes to value per price. That's one reason to use the harmonic mean of rates (per price, per time, etc.) when comparing similar purchases with multiple attributes/values to evaluate.

Now, so far these are all individual attributes; they don't capture or evaluate other factors.

Like, for example, Apple's Retina display technology. Or, for that matter, CPU cache, AMD vs Intel processors, multithreading support, number of cores, etc.

All of these could be weighted, if you know how to weight them. And weighting them right would require some technical knowledge, and reading up reviews of products with those features in Anandtech's review/comparison blog posts.

* — If you look closely at the Excel sheet, I multiplied the GHz by 1000 to get MHz, so the numbers are on a comparable scale.

** — Of course, because it doesn't come with a monitor, keyboard or mouse. It is simply a bare PCB.

Urlshortener in python

I attempted this (URL shortener) project as a kind of a dare. Over lunch one day, Atul casually commented that a URL shortener

would be ten minutes of work for someone like you, and a blog post about it would make a nice fit for my site.

I was not sure it was 10 minutes of work, but it sounded like a small, fun project for a weekend attempt.

It took more than half a day for me to be content (not happy, just content) with the resulting work.

Nevertheless, it was time well spent. Here's the link to the actual project: https://github.com/emofeedback/urlshortener

 

A few clarifications about some of the choices (should I call it software engineering?):

  1. Redis, because it's extremely efficient, stays in RAM, and there are some unconfirmed reports of it being used to power China's internal Twitter clone.
  2. A RAM-based database means lower latency than a disk+cache based database, but needs more RAM.
  3. Used 5 characters because, well, the internet is big, and that seemed a good number to cover most unique URLs (especially when combined with the 78 characters allowed in URLs, i.e. about 78^5 possible codes). Thanks to a colleague's suggestion for the idea; I was thinking about hashing/randomness before his suggestion.
  4. A hash map of shortened url -> original url and another of original url -> shortened url was kind of against my instincts. I still keep thinking this is too much memory and we should find a different way. I think if I drop the use-case where an already-shortened URL, submitted again, should return the same short code, I can eliminate the original url -> shortened url map.
  5. To check whether a new original URL has already been shortened, the HyperLogLog comes to the rescue: pass the URL through the HLL (in Redis, PFADD returns 1 or 0 depending on whether the element changed the HLL's registers), which serves as a rough, approximate membership check. (A minimal sketch of the overall idea follows this list.)
  6. From the UI point of view, I just put up a minimal HTML form that takes an original URL and returns JSON.
  7. Yet to do: add a proper fabric script to set up the virtualenv, Python packages, the Redis install and the nginx config, and restart nginx.
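Here's a minimal sketch of the idea described above (my own illustration, not the code from the linked repo; it assumes a local Redis server and uses a 62-character alphabet rather than 78):

import random
import string

import redis  # assumes a Redis server on localhost:6379

ALPHABET = string.ascii_letters + string.digits  # 62 characters
CODE_LEN = 5

r = redis.Redis(host="localhost", port=6379, db=0)

def shorten(original_url):
    """Return a 5-character code for original_url, creating one if needed."""
    existing = r.get("url:" + original_url)    # original -> code map (the memory cost discussed above)
    if existing:
        return existing.decode()
    while True:
        code = "".join(random.choice(ALPHABET) for _ in range(CODE_LEN))
        # SETNX only succeeds if the code is unused, so a collision simply retries
        if r.setnx("code:" + code, original_url):
            r.set("url:" + original_url, code)
            return code

def resolve(code):
    """Return the original URL for a code, or None if unknown."""
    value = r.get("code:" + code)
    return value.decode() if value else None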

You can see it live here.
 

Probability of a biased coin.

Story

I started off reading this book. Fairly quickly into it, the authors start with some code plotting posterior Bayesian probability estimates for a coin toss; for the book's style this is just an obligatory example. I have also come across a set of stories designed to teach probability; you can find these here. In one of the exercises in that series, the teacher lies to the student about having a fair coin and demonstrates a series of heads. A question at the end of the story is: how long a series of heads do you have to see before you suspect some foul play?1 Now, that's an interesting question, and while the text gives some answers, it's just a set of heuristics. I, on the other hand, wanted to see some specific graphs based on the bias. So I picked up the code and modified it. Here's the base code for an unbiased coin, written inside an IPython shell.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

plt.figure(figsize=(11, 9))
dist = stats.beta
ntrials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=ntrials[-1])  # simulate the coin tosses
x = np.linspace(0, 1, 100)

for k, N in enumerate(ntrials):
    sx = plt.subplot(len(ntrials) // 2, 2, k + 1)
    if k in [0, len(ntrials) - 1]:
        plt.xlabel("p, probability of heads")
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    # Beta(1 + heads, 1 + tails) is the posterior after starting from a uniform Beta(1, 1) prior
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)
    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities", y=1.02, fontsize=14)
plt.tight_layout()

An unbiased coin means P(H) = P(T) = 0.5. And here is the plot (figure_1).

Now, let's go and change that distribution. The code draws Bernoulli samples from the scipy.stats package. See the line data = stats.bernoulli.rvs(0.5, size=ntrials[-1])? All we have to do is change that 0.5 to 0.6.
Here’s the plot: figure_2_0.6
Notice how the distribution starts deviating very visibly right after 4 tosses. Now, I would say that's about the first sign of trouble. Unfortunately, in a real scenario you won't be able to visualize the distribution, but you can have a heuristic. Let's assume you start off playing with a $1000 bet2 on each coin toss; around the 4th toss is when you say, I am suspicious, I should sum up the past results and reconsider. If they are all 4 heads/tails (whichever you lose on), you should either stop or bet less.3 By the time you're at the 8th toss, if you have seen even 6 heads, you can just quit and call the other person a liar.

If my understanding of casinos is right, they usually don't design any payoffs worse than this. So the rest of the pictures may be just academic. But I had plotted them, so here they are.

Now, let’s try with 0.7.
figure_3_0.7
If you look closely, you'll see that the first 5 tosses all resulted in heads.

Now, let’s try 0.8

figure_4_0.8

Interesting graphs. See, even after 15 tosses there are only 11 heads, while with 0.7, 15 tosses resulted in 13 heads. The power of randomness. This means that for those of you with a high risk appetite, if the reward is big enough, you can afford to wait for 15 tosses before you review your decision.

Now, let's try 0.9. (figure_5_0.9)

Once again, even 15 tosses show only 12 heads. If you're risk-averse, ditch it right away after 3-4 tosses.

Now, how is all of this useful or meaningful in real life?

Well, if you think of stocks, with a company's quarterly profits as coin tosses, it's an interesting idea for forming an investment strategy. But never forget, this is a laboratory experiment. The distribution assumes the probability of heads/tails is static over time. Not only do company profits vary, but the probability of a company making a profit also varies based on a whole lot of other factors.


  1. A biased coin, some hidden manipulation of the toss, false reporting of results, etc.

  2. I know the amount sounds crazy and unrealistic.

  3. I don’t know how much money you have :)