It’s just data

Key + Data

Anant Jhingran: The freebase folks do not reveal much about their scaling.  The scaleout models for google and wikipedia (where partitioning/replication strategies work quite well) do not quite work in such a networked graph (after all, a query on person="anant" with one or two pointer chases would end up pinging a few nodes under any partition model), so the question is, if we have billions of pieces of information in a dense graph, how does the query load on the system scale?

I, too, have found precious little about the internals of freebase, and likewise I’m interested in the question at the end of the above paragraph.  But this post is about the stuff in the middle.

For starters: what’s this about “the scaleout models for google … do not quite work in such a networked graph”?  To me, the web is the quintessential networked graph, one that is massively partitioned, and yet PageRank™ seems to scale just fine.

A similar approach could conceivably work for Freebase.  Data would be organized into pages, and then relations would be either embedded in, or attached to, these documents.  Mining this data can be done via MapReduce jobs.
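To make that guess a little more concrete, here is a toy sketch in Python.  It is not how Freebase works (I don't know how Freebase works); the page ids, the `relations` field, and the function names are all made up for illustration.  Each page embeds its own outbound relations, the map step looks at one page at a time, and the reduce step gathers the inbound relations for each target key.

```python
from collections import defaultdict

# Hypothetical pages, keyed by id; each one embeds its own outbound relations.
pages = {
    "anant": {"type": "person", "relations": [("works_at", "ibm"), ("knows", "sam")]},
    "sam": {"type": "person", "relations": [("works_at", "ibm")]},
    "ibm": {"type": "company", "relations": []},
}

def map_page(key, page):
    """Emit (target, (predicate, source)) for each relation embedded in a page."""
    for predicate, target in page["relations"]:
        yield target, (predicate, key)

def reduce_inbound(target, values):
    """Collect every inbound relation for a single target key."""
    return target, sorted(values)

def run(pages):
    grouped = defaultdict(list)
    # Map phase: every page is processed independently of the others,
    # so the pages could live on any number of partitions.
    for key, page in pages.items():
        for target, value in map_page(key, page):
            grouped[target].append(value)  # shuffle: group by target key
    # Reduce phase: one call per target key.
    return dict(reduce_inbound(t, vs) for t, vs in grouped.items())

inbound = run(pages)
```

The point of the sketch is only that neither function ever chases a pointer: the map step needs nothing beyond the page it was handed, and the reduce step needs nothing beyond the values grouped under one key.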

Whether this guess is right or wrong or someplace in between, I’m continuing to see a pattern.  One that Amazon’s Dynamo reinforces.  What I am seeing is that the interesting thing isn’t in the first two columns in this table.  Or in the next three columns, or in the next five columns after that.  Nor even in the next two columns.  The most interesting thing may very well be the last column: memcached.

What do Dynamo, memcached, Berkeley DB, and CouchDB have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?  Namely that everything is accessed by a primary key, and that metadata is either attached to, or embedded within, that data.
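Stripped to its essentials, that shared shape looks something like the sketch below.  The class and method names are hypothetical and are not the API of any of the systems named above; the only things it tries to show are lookup by primary key and metadata that travels with the data.

```python
class KeyDataStore:
    """Toy in-memory store: opaque data plus attached metadata, found by key."""

    def __init__(self):
        self._items = {}

    def put(self, key, data, **metadata):
        # The data is treated as opaque; the metadata is attached alongside it.
        self._items[key] = {"data": data, "meta": dict(metadata)}

    def get(self, key):
        # The only access path is the primary key.
        return self._items[key]["data"]

    def meta(self, key):
        return self._items[key]["meta"]

store = KeyDataStore()
store.put("/blog/key-data", b"<p>Key + Data</p>", content_type="text/html")
body = store.get("/blog/key-data")
kind = store.meta("/blog/key-data")["content_type"]
```

Swap the key for a path, a message id, or a URL and the same picture covers a file system, a mailbox, or the web.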