cpython hashing function implementation for long int object.

This is a follow up to this post.

So i went on opening up the Objects/dictobject.c file and started reading the comments. It's extremely well documented. I read through the first set of comments (about 50 lines) to learn that
python's dict has four types of slots. i.e: an entry in python's dict object can be in one of 4 states:

1. Active: an active (key, value) pair
2. Unused: key == value == NULL
3. Dummy: key == dummy, value == NULL — was previously active, but was deleted.
4. Pending: not yet inserted or deleted; key != dummy and key != NULL
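A hypothetical Python model of those four states (just my sketch; the real thing is C structs in Objects/dictobject.c, with a shared dummy sentinel object):

DUMMY = object()   # stands in for the C-level dummy key

def slot_state(key, value):
    if key is None and value is None:
        return "Unused"
    if key is DUMMY:
        return "Dummy"    # was Active once, then deleted
    if value is None:
        return "Pending"  # key set, value not yet stored
    return "Active"       # live (key, value) pair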

As for further implications and usage of these different states, i refer you to go here.

Moving on to the comments on the hashing function.

From the comment section:

Major subtleties ahead: Most hash schemes depend on having a “good” hash
function, in the sense of simulating randomness. Python doesn’t: its most
important hash functions (for strings and ints) are very regular in common
cases:
To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are “consecutive” strings. So this
gives better-than-random behavior in common cases, and that’s very desirable.
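That regularity is easy to see from the interpreter: in CPython the hash of a small int is the int itself (the sole exception is -1, which is reserved as a C-level error code), and the initial slot in a table of size 2**i is just the low-order i bits:

assert hash(42) == 42
assert hash(-1) == -2      # the one exception

table_size = 2 ** 3        # a table of size 2**i ...
for k in range(8):
    # ... indexes with the low-order i bits: keys 0..7 fill slots 0..7,
    # no collisions at all
    print(k, hash(k) & (table_size - 1))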

Also, the comment goes on to talk about how the simple implementation has problems with collisions, and how the collision resolution strategy is crucial. But that's stuff for another post.

Ok, clearly my question can't be answered by looking at this file. i'll have to look deeper.

So i look up where the actual hash function is found and realize it's part of the builtin modules, therefore found in Python/bltinmodule.c. Now, the builtin hash function implementation here just takes a PyObject (the C representation of a Python object) and calls its hash function. I was initially confused by the PyObject var and went searching for where a similar variable was declared and implemented anywhere else in the file. Dang, my C is rusty and broken… Anyway, after asking some awkward newbie questions on the python-dev irc, some developers there clarified that the hash function implementation is type-specific.
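The same dispatch is visible from Python-land: builtin hash() just defers to the type's __hash__, which you can define yourself (a made-up Point class here, purely for illustration):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):                 # hash() will call this type-specific method
        return hash((self.x, self.y))   # delegate to the tuple type's hash

p = Point(3, 4)
assert hash(p) == hash((3, 4))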

Again realization hits like a pile driver. I was not thinking about it and had somehow assumed a generic hash function. Curse you anand, you're letting the dynamic typing of python form bad habits in your thinking. wake up. there's no magic in the real world.
Only people willingly believing there is, for whatever reason (pain-avoidance/laziness/sanity etc…)

2100 hrs.
And stupid me looks up Objects/listobject.c. Of course hashing isn't implemented there, because, well, it isn't supported for a python list (lists are mutable, hence unhashable). Conveniently, the just-post-dinner mind/body state is a good excuse. moving on.

Next stop is the long int object source at Objects/longobject.c

File Size: 154K, Lines: 5K.
Whoa, i could spend a week writing about this cpython long int type implementation. Maybe i will write more blog posts on it after this one on the hash function.

static Py_hash_t
long_hash(PyLongObject *v)
{
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    i = Py_SIZE(v);
    switch(i) {
    case -1: return v->ob_digit[0]==1 ? -2 : -(sdigit)v->ob_digit[0];
    case 0: return 0;
    case 1: return v->ob_digit[0];
    }
    sign = 1;
    x = 0;
    if (i < 0) {
        sign = -1;
        i = -(i);
    }
    while (--i >= 0) {
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
            (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
    return (Py_hash_t)x;
}

I have removed some of the comments for the sake of space and, anyway, am going to read them and paraphrase them. but before i start with the comment paraphrasing, i'll try to summarize the function with my guesses. It'll be good practice to refresh my C, and it should be fairly simple, given the code base is old enough and has meaningful variable names.

Py_uhash_t x; this is an unsigned integer type (the unsigned counterpart of Py_hash_t), renamed.
Py_ssize_t i; this is a signed size type (used to store sizes of vars/objects).
int sign; sign of the long object being passed.
i = Py_SIZE(v); a macro to read the size field of a python object; for a long, its absolute value is the digit count and its sign is the number's sign.

Ok, this is feeling ridiculous. thankfully the variables are over.

welcome to the switch case.
If the size of the object is -1, 0 or 1, this switch-case returns right away.
In the case of -1, it checks ob_digit[0] (i am assuming this is the first, and here the only, digit)
and returns:

           if it's 1:
              return -2
           else:
              return -(sdigit)v->ob_digit[0];

(The -2 is because a hash of -1 is reserved at the C level to signal an error, so no object is allowed to hash to -1.)

here comes sign. it is being assigned the value 1. Ok, i assume that means the Py_SIZE() return value is greater than 1 (ignoring the <= -2 possibility for the moment), i.e. the long object is unsigned. or signed. whatever that sign variable was meant to represent :-)

Oops, hasty thinking: that reasoning is meaningless, as the next if condition (i < 0) covers the negative case by setting sign = -1 and flipping i positive. Then the while loop walks the digits from the most significant one down, folding each ob_digit[i] into x modulo _PyHASH_MODULUS. The shift-and-mask line is really a modular multiplication:

x * 2**PyLong_SHIFT % _PyHASH_MODULUS.

while (--i >= 0) {
    x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
        (x >> (_PyHASH_BITS - PyLong_SHIFT));
    x += v->ob_digit[i];
    if (x >= _PyHASH_MODULUS)
        x -= _PyHASH_MODULUS;
}
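To check my understanding, here's the whole algorithm redone in plain Python. This is my own sketch, not CPython code, and the constants assume a typical 64-bit build (PyLong_SHIFT = 30, _PyHASH_BITS = 61, hence _PyHASH_MODULUS = 2**61 - 1):

PyLong_SHIFT = 30                          # bits per ob_digit on 64-bit builds
_PyHASH_BITS = 61
_PyHASH_MODULUS = (1 << _PyHASH_BITS) - 1  # the Mersenne prime 2**61 - 1

def my_long_hash(n):
    sign = -1 if n < 0 else 1
    n = abs(n)
    digits = []   # 30-bit digits, least significant first, like ob_digit[]
    while n:
        digits.append(n & ((1 << PyLong_SHIFT) - 1))
        n >>= PyLong_SHIFT
    x = 0
    for digit in reversed(digits):   # most significant digit first, like while (--i >= 0)
        # rotate x left by 30 bits within 61 bits; since the modulus is
        # 2**61 - 1, this is exactly x * 2**PyLong_SHIFT % _PyHASH_MODULUS
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) | (x >> (_PyHASH_BITS - PyLong_SHIFT))
        x += digit
        if x >= _PyHASH_MODULUS:
            x -= _PyHASH_MODULUS
    x *= sign
    if x == -1:   # -1 is reserved as the C-level error value
        x = -2
    return x

# spot-check against the real thing (holds on 64-bit CPython):
for n in (0, 1, -1, 2**61, 123456789012345678901234567890, -2**100 + 7):
    assert my_long_hash(n) == hash(n), n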


Insight porn

I had recently unsubscribed (about a month ago) from ribbonfarm's rss feed (on my google reader). I had done it in an effort to cut my posts read on reader per month from ~100 to ~50. But i did not really think a lot about why. i know i spend just that little extra time and/or attention reading ribbonfarm posts. I also know that what i learn from ribbonfarm is valuable. But the number of posts per month convinced me to drop it and instead read it by visiting in the browser. Of course, due to my historical old-posts browsing, it was already on my block list for when i am in 'work' mode.

I hadn't considered the tradeoff (attention vs lesson/learning value) till i saw this post today. He refers to another blog, which talks about insight-porn type blogs. I managed not to click on it and go reading around, yay… Anyway, given that i have been reading ribbonfarm, i do tend to write a blog post** on top of a specific ribbonfarm post and feel a very shallow/temporary rush of work done. Note that it is nowhere close to what i get when i read up some math and try to sum it up, or just take notes/thoughts on it.

Anyway, i realized that in a sense venkat is right. He's generating insights that are useful, but not change-your-world-type deep. At least not any more. I won't pretend to have been following him since back in 2008 to know this, but i do know i kinda have some idea or other of a post's direction with a little skimming. This is probably an effect of me going and digging around his archived posts, reading for a couple of hours at a time. Anyway, these are porn in the sense that they have surprising perspectives and interesting metaphors sometimes, but sometimes they are just one/two-level logical connections from some well-known principles. (A side effect being that you become a lazier thinker.*) I guess that's where the 20-80 rule/problem comes up. In fact, i originally thought i would go and make a list of his blog posts that are not insight porn and those that are, and give that as feedback, before realizing that the list is going to be different for different sets/types of people.

Besides, i don't think there's any reliable way (that i can suggest) for him to measure (nay, grok) the distribution to make the blog better.

Takeaways:

1. i now vow to write fewer posts surrounding/developing ribbonfarm posts, and definitely none on new ones.

2. I vow to read ribbonfarm on an intermittent + serendipity-seeking basis. i now have a list of sites for serendipity, and lesswrong is moving up in that list.

3. Forming a list of stuff to write on a habit basis( some mix of math + open-source s/w) — not yet ready.

4. Progress enough to move john d cook's blog from the google reader subscriptions to the serendipity reading list, and instead get something like the P=NP blog as a subscription.


* — I guess realistically we all optimize for some kinds of thinking and become lazy thinkers in some area or another.

**– One of my own examples is this. it doesn't really qualify as insight porn (it falls short of offering reasonably useful insights); it's not sophisticated, rather crude. but it definitely qualifies as a traffic-generator post..


UPDATE 1: ironically enough, this post itself seems to have become a bit of a clicks attractor..

UPDATE 2: ok, i realized a couple of things. the site traffic monitors (built into wordpress, and google analytics), both perhaps useful if checked once a month or so, are a pain if checked every day (which is what i had started doing). the problem with checking every day, or once in two days, is that it is very easy to settle for the local maxima of clicks coming in from trackback links on existing popular blogs. my core fault so far.

Python list vs dictionary.

Was talking to a colleague (.Net developer) and ended up lecturing him about how an array (list in python) is a specific type of data structure, namely a specific type of associative array. Now, the logic goes like this. An associative array (dict in python) is a key-value store. It is a method to store data in the format of a key mapped to a value. it is usually implemented with the help of a good hash function.

Anyway, the only constraint being that any new insertion has to be of the format (key, value), where the key is a hashable* value.

Add one more condition, that the keys have to be whole numbers in increasing order (starting at 0), and you have an array/list.
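In Python terms the observation is just this (a trivial check, but it makes the point):

as_dict = {0: 'a', 1: 'b', 2: 'c'}   # an associative array with whole-number keys ...
as_list = ['a', 'b', 'c']            # ... behaves just like an array/list
assert all(as_list[i] == as_dict[i] for i in range(3))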

This discussion/lecture got me thinking about how it would be implemented at the python core language level. I promise to actually check the open-source code base and write a summary later, but for now here are my thought process/guesses, after some pruning over a walk.

1. A list, by virtue of having whole numbers for keys, will be easier to access. i.e: it can be stored at constant-interval locations in memory. (I know python being dynamically typed, and python lists being capable of storing different types of values in a single list, complicates things, but only the implementation. In theory, i can just put a pointer in that memory segment to the real memory where the value is stored (you know, in case of a huge wall of text that doesn't fit in the memory intervals).) Effect? Accessing a list can be done in const. time, O(1).** (a quick check of this guess follows this list)

2. A dictionary, since it can have an arbitrary data type as key, cannot be assumed to have const. memory spaces. but then we have a hash function. i.e, we pass our key to a function that ideally returns a distinct hash value for each distinct key (collisions are possible; a good function just makes them rare). Now the lookup becomes two-fold. First

a. hash the key to get the hash value,

b. search the table slot given by that hash value for an entry whose key matches.

Now what is the big-O time for this? My first thought is, well, it depends on

a. the hashing function implementation

b. the table size, or rather the hashed value size w.r.t. the dictionary size.
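Before going on: a quick way to check the pointer guess from point 1 is sys.getsizeof, which reports the list object's own size without following the pointers, so two 3-element lists are the same size no matter how big their elements are:

import sys

small = ['x', 'y', 'z']
huge = ['x' * 10**6, 'y' * 10**6, 'z' * 10**6]
# both lists hold just 3 pointers; the megabyte strings live elsewhere
assert sys.getsizeof(small) == sys.getsizeof(huge)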

Anyway, this reminded me of an older post i had made, and the excursions i had made into the cpython source at that time. And i clearly remember some comment from that Objects/dictobject.c/h file about the hashing function being special enough to make the lookup O(1). Now, i did not really get it at that time and will need to check the code + comment again in context. but the basic reasoning, as i remember it, is that by avoiding most of the outlier cases and assuming the simplest/most commonly used distributions of keys, they can make the hashing function effectively O(1). Will update more some time.
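To make the two-step lookup concrete, here's a toy table in Python. It's only an illustration of the hash-then-search idea (i use separate chaining for simplicity; CPython's dict actually uses open addressing):

class ToyDict:
    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def put(self, key, value):
        index = hash(key) % len(self.slots)   # step a: hash the key to pick a slot
        bucket = self.slots[index]
        for i, (k, _) in enumerate(bucket):   # step b: search the slot for the key
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        index = hash(key) % len(self.slots)   # step a again
        for k, v in self.slots[index]:        # step b again
            if k == key:
                return v
        raise KeyError(key)

d = ToyDict()
d.put('answer', 42)
assert d.get('answer') == 42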


** — Turns out not exactly const. time, but invariant time with respect to the number of elements in the list. In the case of pointers, there will be variation in time depending on the size of the element stored, but the first lookup is a simple const.-time table lookup.

* — by our hash function. but generally, not a file stream, socket handler, port handler , device file,  IPC pipe etc…


ERP s/w & money as pain-aversion tool

A few weeks ago, i wrote about rivers of money as a metaphor for businesses, and how that impacts the ERP s/w market.

Anyway, this post by venkatesh rao, on how money spending is usually a spend on relieving pain, made me think about the traditionally high prices in the ERP s/w market, and about the recent SaaS-based online offerings opening up various parts of ERP tremendously.

Traditionally, the market for ERP has been pain relief for big companies struggling to keep track of their accounts, orders, inventory, people, and other resources. The major pain point was, and has been, the fact that tracking a resource tended to involve calling up people across the globe in multiple timezones. And there wasn't much of a solution in those days; ergo, the world's first ERP s/w was born.**


Now, once it had some amount of confirmation from the market that it works, it started generating money quickly. Organizations were having a huge pain in the ass tracking their resources across multiple locations and time zones, and here was a s/w that made it easier. Sure, there was the pain of getting their employees to enter the data regularly, training their employees to learn to use the ERP s/w, setting up their existing data on the ERP s/w in the format it required, etc…, but hey, it solved a huge amount of pain for the executive management people. it made it simple for them to track things back. It made it simple for them to find out where things were delayed, do an RCA, and assign blame :-P . But with the amount of money and profit margins that an ERP s/w implementation and customization provided, there came the service mindset and, more importantly, competition + aggressive selling from the ERP consultants (technical/functional). The employees (aka staff hierarchy) of big organizations were happy to sit back and let the consultants teach them, spoon-feed them on how to use their ERP, show them how their ERP makes their work easier, prove the cost-cutting effectiveness of an ERP s/w to the organizational top management, etc..


There's recently been a trend of new companies/organizations coming up. In fact, if you buy venkat's argument, we are going through a phase of the decline of big organizations and the rise of a lot of smaller ones.

So what happens to ERP s/w? The profit margins ought to be going down, because the big companies want to spend less and less on ERP costs. On the other hand, there's a growing market for small-scale, low-cost, less training/implementation-oriented ERP s/w.

One effect of this is that the big ERP service providers start cheaper versions of theirs as SaaS products. see ramco, microsoft dynamics, SAP. There's a handicap they enter the market with, though. i.e: the high salaries of the consultants on their payroll limit the price effectiveness they can provide to the customers, which they compensate for with the attention of those consultants in helping the business run more efficiently.

But before they offered these solutions, there were a few technology geeks who found an opportunity and started jumping and hacking away to create cheap products. The majority of them, being geeks and obsessed with simplicity, decided to focus on a subset of the functions of the ERP system, or what would usually be called a module inside an ERP s/w (think accounting, order processing, CRM, invoicing, inventory management etc..). I view the rise of niche products like projectplace, salesforce, basecamp, zoho, quickbooks etc… as being along these lines.

Do note that while these SaaS providers provide cheap, online versions of some of the ERP modules, none of them provide or strive to cover all of the functions involved in an ERP s/w. Also note they are mostly DIY. i.e: they focus on delivering a few specific functionalities efficiently and effectively. They usually stay away from customizations and from providing customer-specific trainings or data conversions etc.


Also note that, given our biased human minds, the customers (usually founders or starters of SMEs) who flock to these products come with unrealistic expectations (i.e: the customizations, individual attention, training, data setup etc. provided by the large organizations, at a cheaper cost). So part of the work of these startups is to create a market (nay, educate (in the educate-and-learn meaning/sense) the market) about the trade-offs between cost and the amount of spoon-feeding they'll receive about optimizing their business. I do not envy the job of the founders of these companies; it's quite a challenging task. They have to focus on improving their product, educating their customers about the trade-off they make by choosing them, and, perhaps more importantly, finding the actual trade-off ratio that can help them grow into a big corporation. [4] In fact, it is one of the reasons it is easier for them to restrict their products' functionality to clear, simple functions.


Where erpnext stands out is that it aims to provide all the modules of a standard ERP s/w. I am probably not the right person to comment on the coverage of functionality, but i haven't seen any other online SaaS provider trying to provide modules for accounting, inventory management (aka warehousing), CRM, order management (aka invoicing) etc. For a better chart and/or comparison see here.


[4] – Some argue they prefer to have a lifestyle business and thereby avoid the problem of experimenting with the trade-off to become a big s/w product company.

Disclaimer: I was an employee at erpnext during the period Jul-2010 to May-2011.


P.S: In fact, venkat goes on to say this is why marketing is hard. Without commenting on the hardness part, i derive from the money-spent-as-pain-relief hypothesis that marketing has to be not only unquantifiable, but also immeasurably inefficient. It must be one of the reasons so many new measures come up and go down: once you have some measure and start focussing on optimizing it, you're restricting yourself to a narrow target group, which is fine, but the more you optimize, the narrower you make your target group.


** — I know that's an oversimplification of the reality, but hey, am not qualified to write about the history of ERP s/w, and my narrative reads smoother :-P

the ramanujan number - john cook's exercise

Disclaimer: OK, it's midnight here and am too sleepy to actually write error-free code, but let me try the algorithm/logic of solving the problem john cook posed here. It may be incomplete.

Ok. it’s clear that we need 4 distinct numbers that can be paired up to form the same sum.
It's also clear that these summation pairs need to be relatively prime. (no, that's not right. it's that all these 4 numbers need to be relatively prime.) because of the smallest-number condition: if these 4 numbers have a common factor, you can take out the factor and, ergo, you have a smaller one…

— Aargh, that's some faulty reasoning there.

Ok, let me work from the other end. there's a number 'n' that can be split into a sum of two numbers. If we consider whole numbers, for any n, there can be n combinations of number pairs.

Our goal is to find two number pairs (tuples, if you will), (a,b) and (c,d), such that
1. a,b,c,d are all 4th powers of some number. i.e: the 4th roots of a,b,c,d are integers.
2. a+b = n = c+d.
3. n is the smallest such number.

Now the first two conditions are reasonably straightforward operations in computation.*
3. How do i codify that, or translate it into arithmetic operators, without having to iterate through all the numbers 1 .. n?

ok, it does mean that a,b,c,d have to be relatively prime to each other**. (Oops, wrong again: it's the fourth roots of a,b,c,d, call them a4,b4,c4,d4, that should be relatively prime.)

*– I can't recall the actual implementation of sqrt/pow(1/4) at the moment, but that's not really a problem.
** — allowing for 1 being treated very ambiguously, and the multiplicative identity property of 1 being ignored. (hmm.. i don't like the sound/look/feel/smell :-P of that)

Anyway, let me try to write out a pseudo-code sort of thing, with Haskell type signatures.

prime_numbers = infinite list via one of the sieve implementations**

relative_prime :: [Integer] -> Bool
-- a function that takes a list of integers and returns a Boolean.
-- (my first draft was Int -> list(size n)* -> Bool, but the size argument is
-- extraneous; in python, or rather in imperative programming, it could serve as
-- an error/sanity check on the input values, but it's not necessary in haskell.)

integers = infinite list generated by adding 1 to the prev. result, starting at 0 or 1?? recursive definition.

is_fourth_power :: Integer -> Bool

is_sum :: [Integer] -> Integer -> Bool

* — however the hell that is represented in haskell.
**– which one, and what the hell to do with that annoying odd 1 ??
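And since the haskell is going nowhere tonight, a brute-force sketch of the same search in Python (the limit is an arbitrary bound i picked; within it this turns up 635318657 = 59**4 + 158**4 = 133**4 + 134**4, which i believe is Euler's answer):

def smallest_double_sum(power=4, limit=200):
    seen = {}    # sum -> first (a, b) pair (a <= b) that produced it
    hits = []    # sums seen twice, i.e. two distinct representations
    for a in range(1, limit):
        for b in range(a, limit):   # b >= a avoids counting (a, b) and (b, a) twice
            n = a ** power + b ** power
            if n in seen:
                hits.append((n, seen[n], (a, b)))
            else:
                seen[n] = (a, b)
    return min(hits) if hits else None   # smallest such n found within the bound

print(smallest_double_sum())
# -> (635318657, (59, 158), (133, 134))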

PyCon2012 – India, Bangalore — Raw brain dump of notes from the two days.

Disclaimer: Warning, loong rant ahead. Some of it may sound harsh, and probably is harsh. Comments are likely to be a result of personal biases, state of mind, attention (awareness vs arousal levels/ratio) etc.. No effort has been put into making these even objective. I am against making anything politically correct in principle, though.

TLDR summary: I can't learn in talks presented to groups of more than 10. It's usually just over-simplified, over-slowed, and over-generalized. My learning works in short, quick feedback cycles. So my takeaway: it's a good idea to attend these conferences if i need a confidence/arrogance boost, but otherwise it's better to just get the slides + code and try them out yourself. the catch: if you don't assign the time block, you'll never get around to trying them out on your own, which is a problem on its own.

29-Sep-2012 Saturday, Bangalore, India
Saturday 29 September 2012 10:04:10 AM IST
Am sitting here at David's keynote address; was a little late in arriving, so could get only a seat at the back. Can't hear very well because of the acoustics, but he seems to be talking about Python Software Foundation (PSF) functions, finances etc.. Not interested enough to try and listen. Another thing i notice is that he too has pauses and umms, not unlike PG at that PyCon (qualitatively; quantitatively a lot less frequent than PG), but overall it affirms my belief: only experienced salesmen can avoid those, even without preparation. us nerds, on the other hand, can't avoid them even if we are well prepared. no point trying, or beating yourself up about it after a botched presentation.

Saturday 29 September 2012 11:43:34 AM IST
Now at Nick Coghlan's highlights of Python 3.3. I love the changes they have done to error reporting in this one. i have been burned so many times by having to read through stacks and stacks of python code to figure out the source of the error. No more of it.. too bad CentOS, which is used in most production systems, has just come to python 2.6. And they have cleaned up the importlib mess now, cool.. Unicode byte sizes are smaller.
Namespace changes: __init__.py is optional; in the absence of __init__.py, the whole sys.path is searched for packages.

Sunday 30 September 2012 11:20:16 AM IST
Ok, i skipped a lot of other sessions, kinda sat in the scikit-learn one and left halfway through it. This session is titled New kids on the scipy block. I was hoping to learn some more about scipy by playing around with it, but it turned out to be more of an explanation/demo/promotion of the ipython notebook for python programming. I was disappointed, but then another epiphany struck me: it's hard to have an interactive session on any of these scipy/machine-learning topics in an hour-long talk, basically because they all simply take a loong time to run, and the speaker can't cover much if he's waiting for the students to complete coding and running the script.
Oh well, time to break out and work on the coursera courses. this was more of an ipython notebook IDE promotion talk. So here i am instead, going and reading up on the unix haters' handbook.

Another epiphany: i have become very good at installing and setting up stuff on my unix box. that's the biggest advantage of having gone completely linux in the last 3 years. I guess i misunderstood my motivations when i switched to linux. And now i have that freedom from proprietary manuals designed to hide IP-sensitive info, and from hand-holding/spoon-feeding manuals.

Sunday 30 September 2012 11:40:50 AM IST
The second part of this session is about llvm. it is interesting so far, but we're just a couple of minutes in. one reason is this presenter/speaker seems more confident. it focuses more on llvm-py. and i have been biased about the looks of the presenters.
am finally spending some time not wondering if the 1.5k was a waste of time.
Sunday 30 September 2012 12:36:10 PM IST
Now at that text mining with python talk.
Sunday 30 September 2012 12:52:44 PM IST
And am already out of that talk.. today's been a good day so far as far as energy, focus and motivation levels are concerned. i haven't had any pills so far.. the early morning was slow, due to the cold i guess, and the confusion, but otherwise it's been fine. am back to my old self, dissing others :-P
and realized i don't have nltk installed on my python setup. What the Fuck??

Sunday 30 September 2012 02:14:42 PM IST
And now at a mysql talk from oracle. the first 15 mins went to a talk by someone from the sales division, generally fluff about how mysql and oracle are treated and how the accusations about oracle killing mysql are false.
Kinda boring and interesting. boring from the technical viewpoint, but interesting from the viewpoint of corporate strategic communications and marketing methods. Of course, it doesn't make sense to talk about oracle killing mysql. neither are living entities, and it's not even clear how effectively we can attribute agency to the group of people that makes up a corporation. Closing the source doesn't make any sense either, as i can see oracle not making much money from the move, and it would just risk losing its image.

5.6 — upcoming features.
apparently, this guy used to be with mysql, and claims mysql never spent much on improving innodb.
the tech team's claim is that 5.6 is competitive with other forks like mariadb, perconadb, and nosql. And a huge performance improvement. that vimal guy did say percona has 3-machine/server synchronization, and that mysql's core optimization team is gone, but oracle has hired more people, as they have a lot more resources. well, given that it's a new team that optimized 5.6, i am still wary of it till it gets adopted by a lot of people. mainly because the implicit knowledge the original team had is probably gone, and this new team is left with test cases and documentation.
anyway, new features
1. ability to use 48 cores (up from 32)
2. Better optimizer(that scares me)
3. Full text search– new
4. FB & Google contributed optimizations for SSD
Replication
1. Crash-safe masters ??
2. Replication checksums
3. Automatic recovery using transactional positioning??

Memcached is a built-in plugin, and has an api, and the plugin runs on top of the innodb storage engine. that's cool. and interesting; now it's less of an either-or decision between the two.
memcached api mapped onto the native InnoDB.
Ok, mysql 5.6+ is clearly on the list of DBs to consider when i get around to that idea (startup–global).
mysql is the blond, nice sister to the big brother oracle.

on an unrelated note:
mysqltest: i was expecting something along the lines of ab (apache benchmarking), but it's a little more work to test. hmm..

Sunday 30 September 2012 03:21:51 PM IST
GlusterFS: Distributed file server
— no meta-data server
— userspace driver
— latency
— POSIX compliant
— hashing on filenames to decide the location of the file
— striping

Most engaged session so far.. that gamble to ditch the regular scheduled sessions and go for that open space place has worked. Hmm.. i guess the planned ones simplify theirs for the sake of a more general/vaguer audience, or at least assume a lower audience knowledge level.

Sunday 30 September 2012 04:30:39 PM IST
Cryptanalysis using python
hot girl, apparently ex-infoscion.. really follows the agenda/script.
clearly in the planned-organized talk mold.
assumes no knowledge of cryptography in the audience; maybe i should have gone to big data. Was hoping to catch up on some math i haven't come across before.

Monome-Dinome Cipher — monoalphabetic multiliteral substitution. each char can be replaced by multiple chars (24-letter is an example) — blurs the relationship
Ok am beginning to think a little better.
How many of you know how to write classes in python?? — Oooh fun.
OK, am out of that presentation. i can pick up that stuff from the presentations when they get online. So much for learning unfamiliar math.

Sunday 30 September 2012 04:43:13 PM IST

Now am at this Big data using python one.
OK, there's a slide listing:
Hive,
Hbase,
Flume,
Avro,
Sqoop

Ok, this is more about hadoop or so it seems.
one slide says the compute moves to the data, while data is moved to the compute in other parallel data architectures???

I understand that since it's map-reduce, the reduction and mapping are done on the nodes/data servers themselves before communicating the result to the central source. But seriously, that's inaccurate marketing jargon.

components?
— Map Reduce
— Job tracker
— Task Tracker
— Name node

Ok this one also assumes the audience knows nothing about map-reduce and tries to simplify it for them.
Sunday 30 September 2012 05:03:58 PM IST
Am out, back in the openspace.

Crap, now i have to figure out what to do till 1045 pm. i need to find a place to hang out, preferably with internet. Ah well, looks like it's going to be a cafe day nearby.

Triggered by GlusterFS plans to have two-way master-master synchronization: what if both the files have changed? they will have to merge, like git does. but then git assumes text/character files. I wonder what happens when you git merge binary files.. hmm, that's another interesting experiment to try.

Damn it, i should not have missed scipy 2011.