RegisterSign In
By Peter Buchmann (pbberlin) on Nov 29, 2009 9:50 AM.
how are the changes distributed
Dear Toby,
First let me say that I found McKoi by your post at http://highscalability.com/blog/2009/9/16/paper-a-practical-scalable-distributed-b-tree.html.

I read everything on www.mckoi.com, and I am EXCITED.

I have been interested in databases for a long time,
but after reading all the stuff, there is one issue I would like to understand better:

>> "Reading or writing data ... is handled by the clients."
>> "The Consensus Function is ... for ... moving data around ... for high performance merge operations."
>> "MckoiDDB can partition a database instance into multiple instances (a feature called sharding), splitting the Consensus Function load over multiple servers."

Please give me some architectural and or conceptual detail.

I am hugely curious:
Lets forget about contention - just consider a scenario with three huge tables each with many columns - and many references between the data in the tables.
Lets assume, we need to access the data in several ways, thus needing a lot of columns indexed.

1. Would your concept accomodate the many huge indexes in the RAM of the Java VM ?

2. How would a merge operation distribute the inserts/updates accross all the nodes?

3. How would all the other clients get their local indexes updated?

4. How would partitioning/sharding come in?

5. How would a "mass update" behave?
You know - when you do not WANT to keep a transaction open for a long time, but you HAVE to
because you first insert all the data, and then insert references between them.

If you are busy implementing SQL access or all the other mountains of stuff needed in such an undertaking ... please tell me only,
whether I have to reverse engineer if out of the code for myself - or whether it is covered conceptually somewhere else.

Maybe I can contribute a conceptual draft for the above scenario to your project.
And if it checks out, I would go on and try a big test case and maybe document it.

Regards
Peter
By Tobias Downer (toby) on Nov 29, 2009 4:51 PM.
Thanks Peter,

Please don't waste time reverse engineering! I'll be more than happy to explain how MckoiDDB works under the hood and the design decisions behind the system. I know the documentation isn't as detailed as it could be and some aspects go no further than a simple tutorial. I apologize about that, but I intend to make another stab at the documentation when the SQL module is released.

Here's my answer to your questions;

1) An index is just data therefore a very large index is handled no differently than a very large data file. A page of the index data is loaded into memory on the client that is accessing that part of the index. As far as persistent data is concerned, RAM is only used for caches. Certainly though, this is not hard wired into the design and I'm interested in exploring more memory oriented persistence especially in regards to the low seek time world of SSD technology.

2) A commit operation is not inherently an operation that can be performed in a parallel way because it's managing data consistency rules (how can you determine that a record delete is consistent if another request to delete the same could happen). For this reason, I attempted to make the work of the consensus function as small as possible. It so happens that this problem can be most easily explained as a 'transaction merge' - that is you have multiple transaction commits hitting the consensus function and you must merge the transactions together into a new snapshot that represents the consistent view (or reject an inconsistent merge). To make this merge operation efficient, much of the work such as loading all the data into network's storage resources (preparing indexes, etc) can be distributed over nodes before the 'commit'. Also, the consensus function only needs to deal with meta-data changes in many cases therefore copying data from one transaction in the merge operation can be done efficiently avoiding a byte for byte copy.

3) A client makes a request to the root server for the most current snapshot version of the database. The root node replies to this request with the location in the network where the current root node can be found. If a client needs immediate updates on any changes to the database then the client can pole the root server frequently. If the client is lazy then it can pole less frequently. Because of transactions, the snapshot a client gets of a database regardless of how frequently it poles will be consistent to the schema.

4) Partitioning/Sharding is setup by creating a new database through the interface. A database shard has its own consensus function therefore it is one way to scale out if the consensus function is a bottleneck. What's nice about MckoiDDB's shards is that you can freely share as much data as you want between shards with no consequences. For example, making a shard that becomes an independent branch of another working dataset is no problem and can be set up instantly regardless of the dataset size. Or if, for example, you need to move a very large index from one shard to another you only need to link to the data in your new shard which is a simple meta-data operation rather than copying the data byte for byte (uses exactly the same function to copy information I described in answer 2).

5) MckoiDDB does not mind how long you keep a transaction open for. You could keep one open for weeks and nothing bad will happen. No other client will be stopped or slowed down from reading or writing data if you keep a transaction open somewhere. The transaction state is entirely managed by the client. The difficulty with mass updates comes when you need to verify that the update is consistent in the consensus function. You can make your consensus function trust that an update from a client will be consistent without needing to check. This is probably quite appropriate for a mass update type scenario when you are building or rebuilding a very large index and many clients are involved in the process.

I hope this is useful information. Please feel free to post any other questions you have or if you want clarification.

Toby.
Please sign in or register to post in this topic.
The text on this page is licensed under the Creative Commons Attribution 3.0 License. Java is a registered trademark of Oracle and/or its affiliates.
Mckoi is Copyright © 2000 - 2019 Diehl and Associates, Inc. All rights reserved.