It’s just data

Key + Data

Anant Jhingran: The freebase folks do not reveal much about their scaling.  The scaleout models for google and wikipedia (where partitioning/replication strategies work quite well) do not quite work in such a networked graph (after all, a query on person="anant" with one or two pointer chases would end up pinging a few nodes under any partition model), so the question is, if we have billions of pieces of information in a dense graph, how does the query load on the system scale?

I, too, have found precious little about the internals of freebase, and likewise I’m interested in the question at the end of the above paragraph.  But this post is about the stuff in the middle.

For starters: what’s this about the scaleout models for google do not quite work in such a networked graph?  To me, the web is the quintessential networked graph, one that is massively partitioned, and yet PageRank™ seems to scale just fine.

A similar approach could conceivably work for Freebase.  Data would be organized into pages, and then relations would be either embedded in, or attached to, these documents.  Mining this data can be done via MapReduce jobs.

Whether this guess is right or wrong or someplace in between, I’m continuing to see a pattern.  One that Amazon’s Dynamo reinforces.  What I am seeing is that the interesting thing isn’t the first two columns in this table.  Or in the next three columns, or in the next five columns after that.  Nor even in the next two columns.  The most interesting thing may very well be the last column: memcached.

What do dynamo, memcached, Berkeley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?  Namely that everything is accessed by a primary key, and that metadata is either attached to, or embedded within, that data.
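To make the "key + data" shape concrete, here is a minimal in-memory sketch. It is not the API of Dynamo, memcached, Berkeley DB, or CouchDB; the names `Record` and `KeyDataStore`, and the sample key, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Record:
    """A value plus the metadata attached to it (hypothetical shape)."""
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class KeyDataStore:
    """In-memory stand-in for a store whose only access path is the key."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # The metadata is stored alongside the data, under the same key.
        self._records[key] = Record(data, dict(metadata))

    def get(self, key: str) -> Optional[Record]:
        # No secondary indices: a lookup either hits the key or returns None.
        return self._records.get(key)


store = KeyDataStore()
store.put("/blog/key-data", b"<p>Key + Data</p>", content_type="text/html")
record = store.get("/blog/key-data")
```

A file path with its attributes, a URL with its response headers, and a message ID with its mail headers all fit the same two-method interface.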


So the future is key-value lookups or MapReduce batch jobs, nothing in between? App developers writing boilerplate code to maintain indices?

Posted by Wes Felter at

With CouchDB, the vision is that there will be both temporary and persistent views, and both are defined by map and optional reduce jobs.

For persistent views, the output of map jobs will be stored and indexed.  This demo and basura explore portions of these ideas... and anything that could be built on BDB could certainly be built on top of Dynamo.
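As a rough sketch of the idea (in Python, not CouchDB's actual view API), a view can be modelled as a map function that emits key/value pairs from each document, plus an optional reduce over the values grouped by key. The helper `build_view` and the sample documents are assumptions made up for this example; keeping the returned dictionary around corresponds loosely to a persistent view, recomputing it on demand to a temporary one.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Doc = Dict[str, Any]
MapFn = Callable[[Doc], Iterable[Tuple[Any, Any]]]
ReduceFn = Callable[[List[Any]], Any]


def build_view(docs: Dict[str, Doc], map_fn: MapFn,
               reduce_fn: Optional[ReduceFn] = None) -> Dict[Any, Any]:
    """Run map over every document, group by emitted key, optionally reduce."""
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for doc in docs.values():
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Sorting the keys stands in for "stored and indexed".
    if reduce_fn is None:
        return {key: grouped[key] for key in sorted(grouped)}
    return {key: reduce_fn(grouped[key]) for key in sorted(grouped)}


def words_by_author(doc: Doc) -> Iterable[Tuple[str, int]]:
    """Map step: emit one (author, word count) pair per document."""
    yield doc["author"], doc["words"]


# Hypothetical documents, each addressed by a primary key.
docs = {
    "post-1": {"author": "sam", "words": 120},
    "post-2": {"author": "wes", "words": 30},
    "post-3": {"author": "sam", "words": 80},
}

map_only_view = build_view(docs, words_by_author)
reduced_view = build_view(docs, words_by_author, sum)
```

The application supplies only the two small functions; maintaining the index is the store's job.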

Posted by Sam Ruby at

I disagree about memcached. The first thing I noted was that LAMP is the platform of choice. The second, but far more interesting thing I noted was the Poisson distribution of languages. Programming languages become incidental.

Posted by Bill de hOra at

I’ll see your Poisson distribution and raise you a Selection Bias.  See these comments for some good counterexamples.

Posted by Sam Ruby at

Not all keys created equal

Sam Ruby on Key + Data: What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and...

Excerpt from Labnotes at

Dynamo

If you care about distributed systems, you need to read the paper about Amazon’s Dynamo. Comments: Making node joining/leaving an administrative command is not something most academics consider, but it significantly reduces complexity.  We made a...

Excerpt from Paul's Journal at

Amazon reveals its secret key-data overlords from the planet Cloud

Only the barest of glances at Dynamo so far, and by far the most interesting pieces are going to be how they do the scalable high availability, and of course we’re talking about “Werner Vogels Scalability(tm)”, but I was immediately struck, as Sam...

Excerpt from Laughing Meme at

Sam Ruby - Key + Data : "What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and...

Excerpt from Tim's Weblog at

Sam, you have your terminology wrong. Those technologies use a surrogate key, not a primary key, for the primary method of resource identification.

[link]

Posted by Noah Slater at

the web is the quintessential networked graph, one that is massively partitioned, and yet PageRank™ seems to scale just fine.

That argument would be more convincing if PageRank were being run against the web directly, instead of the copy (presumably normalized, denormalized, or otherwise transformed in various ways for performance) that lives in Google’s datacenters.

But perhaps you’re suggesting that two or more copies of the data optimized for different purposes (analogous to OLTP/OLAP) should now be assumed?

Posted by Michael R. Bernstein at

[from jzawodn] Key + Data

“What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?”...

Excerpt from del.icio.us/network/telliott at

But perhaps you’re suggesting that two or more copies of the data optimized for different purposes (analogous to OLTP/OLAP) should now be assumed?

I think I’m suggesting that and more.

Data is often naturally partitioned.  Not just for performance and reliability reasons, but for control reasons.  Much of the data you want to query, you can’t control.  There are also the pesky fallacies of distributed computing to deal with.

The solution is often pull and subscribe.  That’s how your feed reader works, how the web works, and how Google works.  When a given site that planet intertwingly subscribes to goes down, the data from the previous successful fetch is used.
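A minimal sketch of that fallback behaviour, assuming a caller-supplied fetch function that raises `OSError` (or a subclass such as `ConnectionError`) when a site is unreachable. `PullCache` and `flaky_fetch` are hypothetical names for illustration; this is not how any particular feed reader or planet is implemented.

```python
from typing import Callable, Dict, Optional


class PullCache:
    """Pull from a source, falling back to the last successful fetch."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._last_good: Dict[str, str] = {}

    def pull(self, source: str) -> Optional[str]:
        try:
            # A successful fetch replaces the previously stored copy.
            self._last_good[source] = self._fetch(source)
        except OSError:
            # Source is down: keep serving whatever was fetched before.
            pass
        return self._last_good.get(source)


calls = {"count": 0}


def flaky_fetch(source: str) -> str:
    """Stand-in fetcher: succeeds on the first call, fails afterwards."""
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("site is down")
    return f"feed contents of {source}"


cache = PullCache(flaky_fetch)
first = cache.pull("example.org/feed")   # fetched from the source
second = cache.pull("example.org/feed")  # source is down, stale copy is used
```

The consumer never has to coordinate with the source; it only has to tolerate data that is slightly out of date.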

I could even see this working in an enterprise setting.  Different departments running their own private servers, with a few common map/reduce jobs that contribute to an overall read-only view of the data.  Note that this is different from OLTP/OLAP, and the inverse of what you were suggesting: one copy of the data, contributing to a distributed implementation of a view.

Posted by Sam Ruby at

Jeremy Zawodny : Key + Data - Key + Data: “What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www?” Tags : links...

Excerpt from HotLinks - Level 1 at

“I’ll see your Poisson distribution and raise you a Selection Bias. ”

The more data the merrier.  It’ll reinforce the idea that the programming language is incidental.

Posted by Bill de hOra at

A URI by any other name...

Many people are buzzing about Amazon’s Dynamo, and for good reason. But the buzz is almost dual in nature, because not only is it very cool technology, but also because of the real and perceived impacts on other architectural designs. After all,...

Excerpt from Jim's Ramblings at

Links - 10.05.2007

Future of Web Startups If there were real money in startups then everybody would be doing them. Key + Data The thing is, these systems aren’t databases at all. They are big distributed caching systems. But it’s not realistic to offer MapReduce as...

Excerpt from discipline and punish at

I could even see this working in an enterprise setting.  Different departments running their own private servers, with a few common map/reduce jobs that contribute to an overall read-only view of the data.

Hmm. With the right sort of commodity infrastructure available, and some common integration patterns, this approach could drastically lower coordination costs. That would affect the ROI of post-M&A integration efforts, and it would shift the transaction cost boundary that defines the Coase ‘Nature of the Firm’ in ways that both lift the upper boundary on the size of corporations and reduce the need for hierarchical command-and-control within them, to the point that the largest corporations may become federated networks rather than feudal ones.

Definitely food for thought.

Posted by Michael R. Bernstein at

Thoughts on Amazon’s Internal Storage System (Dynamo)

Trackback from Dare Obasanjo aka Carnage4Life at

What do dynamo, memcached, Berkley DB, and couc...

What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and that metadata is either...

Excerpt from (cons 'ider 'this) by Mark McGranaghan at
