Monday, July 26, 2010

Parallel/distributed databases (raw notes)

Some raw notes about distributed databases.

If you want to start with parallel databases, you need a good knowledge of traditional (non-distributed) databases. Here are some books I personally recommend:

Books about regular DB (For true beginners. You may not want to start working with parallel DB systems if this is the book you are reading now :) )


Comparison between several parallel DB systems

Hive (HBase)
Developer: Facebook/Apache
Source code available: Yes
Free: Yes
Link :

PostgreSQL to HBase Replication
Developer: ?
Source code available: ?

HadoopDB
Developer: Daniel Abadi - Yale
Source code available: Yes
Free: Yes
Quote: "Is just an academic prototype"

Yahoo! Data
Developer: ?
Source code available: ?
Free: Yes

BigTable
Developer: Google
Source code available: ?
Free: ?
Link : see BigTable paper 2006


More about HadoopDB
  1. A hybrid of DBMS and MapReduce technologies targeting analytical query workloads
  2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
  3. An attempt to fill the gap in the market for a free and open source parallel DBMS
  4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).
  5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads


  • HadoopDB is primarily focused on high scalability and the availability required at scale.  Daniel questions whether current MPP databases can truly scale past 100 nodes, whereas Hadoop has real deployments on 3000+ nodes.
  • HadoopDB, like many MPP analytical database platforms, uses shared-nothing relational databases as processing units. HadoopDB uses Postgres.  Unlike other MPP databases, HadoopDB uses Hadoop as the distribution mechanism.
  • Daniel doesn’t dispute DeWitt & Stonebraker’s (and his) paper, which claims MapReduce underperforms when compared to current MPP DBMSs.  HadoopDB, however, is focused on massive scale: hundreds or thousands of nodes.  Currently the largest MPP database we know of is 96 nodes.
  • Early benchmarking shows HadoopDB outperforms Hadoop but is slower than current MPP databases under normal circumstances.  However, when simulating node failure mid-query, HadoopDB outperformed current MPP databases significantly.
  • The larger the cluster, the higher the probability of a node failure mid-query.  Very large Hadoop deployments may experience at least 1 node failure per query (job).
  • HadoopDB is usable today, but should not be considered an “out of the box” solution.  HadoopDB is an outcome of a database research initiative, not a commercial venture.  Anyone planning to use HadoopDB will need the appropriate systems & development skills to deploy it effectively.
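
To make the node-failure argument in the bullets above concrete, here is a back-of-the-envelope sketch. It assumes that nodes fail independently and that every node has the same probability of failing during one query. The 0.1% per-node figure is an illustrative assumption, not a measurement from any of the systems discussed.

```python
def p_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` machines fails during a query.

    Assumes independent failures, each with per-query probability `p_node`
    (an illustrative assumption, not a measured value).
    """
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    # Compare a small MPP-sized cluster with Hadoop-sized clusters.
    for n in (10, 100, 1000, 3000):
        print(f"{n:>5} nodes: {p_any_failure(n, 0.001):.1%}")
```

Under these assumptions the chance of hitting a failure grows quickly with cluster size, which is why restarting the whole query on every failure becomes expensive at thousands of nodes.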

HadoopDB - How it works

Database Connector
The Database Connector is the interface between independent database systems residing on nodes in the cluster and TaskTrackers.

Catalog
The catalog maintains meta-information about the databases. This includes the following: (i) connection parameters such as database location, driver class and credentials, (ii) metadata such as data sets contained in the cluster, replica locations, and data partitioning properties.
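
As a rough illustration of what such a catalog holds, here is a minimal sketch of the two kinds of information as Python data classes. All class and field names are hypothetical and chosen for this note; they are not HadoopDB's actual catalog format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NodeConnection:
    """(i) Connection parameters for one single-node database."""
    location: str       # e.g. a JDBC-style URL (hypothetical value format)
    driver_class: str
    username: str
    password: str


@dataclass
class ChunkInfo:
    """(ii) Metadata about one chunk of a data set."""
    data_set: str
    partition_key: str
    chunk_id: int
    replicas: List[str]  # names of the nodes holding a copy of this chunk


@dataclass
class Catalog:
    connections: Dict[str, NodeConnection] = field(default_factory=dict)
    chunks: List[ChunkInfo] = field(default_factory=list)

    def nodes_for(self, data_set: str) -> Set[str]:
        """Return every node that stores at least one chunk of `data_set`."""
        return {
            node
            for chunk in self.chunks
            if chunk.data_set == data_set
            for node in chunk.replicas
        }
```

A planner could call something like `nodes_for("hits")` to decide where to send work for a given data set.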

Data Loader
The Data Loader is responsible for (i) globally repartitioning data on a given partition key upon loading, (ii) breaking apart single node data into multiple smaller partitions or chunks and (iii) finally bulk-loading the single-node databases with the chunks.
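
The first two steps can be sketched as plain hash partitioning. This is only an illustration of the idea, not HadoopDB's loader: the function names are made up for this note, rows are assumed to be Python dicts, and step (iii), the actual bulk load into each single-node database, is left out.

```python
import zlib
from collections import defaultdict


def stable_hash(value, salt=""):
    """Deterministic hash (Python's built-in hash() is randomized for strings)."""
    return zlib.crc32(f"{salt}:{value}".encode("utf-8"))


def global_partition(rows, key, num_nodes):
    """Step (i): assign each row to a node by hashing the partition key."""
    per_node = defaultdict(list)
    for row in rows:
        per_node[stable_hash(row[key], "node") % num_nodes].append(row)
    return per_node


def local_chunks(node_rows, key, chunks_per_node):
    """Step (ii): split one node's rows into smaller chunks.

    A different salt is used so that chunk assignment is not correlated
    with the node assignment made in step (i).
    """
    chunks = defaultdict(list)
    for row in node_rows:
        chunks[stable_hash(row[key], "chunk") % chunks_per_node].append(row)
    return chunks


def plan_load(rows, key, num_nodes, chunks_per_node):
    """Return {(node, chunk): rows}; each entry would be bulk-loaded in step (iii)."""
    plan = {}
    for node, node_rows in global_partition(rows, key, num_nodes).items():
        for chunk, chunk_rows in local_chunks(node_rows, key, chunks_per_node).items():
            plan[(node, chunk)] = chunk_rows
    return plan
```

Because the node is chosen from the partition key alone, two tables loaded with the same key and the same number of nodes end up with matching keys on the same node.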

SQL to MapReduce to SQL (SMS) Planner
HadoopDB provides a parallel database front-end to data analysts enabling them to process SQL queries. The SMS planner extends Hive. Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS.

Since each table is stored as a separate file in HDFS, Hive assumes no collocation of tables on nodes. Therefore, operations that involve multiple tables usually require most of the processing to occur in the Reduce phase of a MapReduce job. This assumption does not completely hold in HadoopDB: some tables are collocated and, if they are partitioned on the same attribute, the join operation can be pushed entirely into the database layer.
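
The condition for pushing a join down can be sketched as a small check. This is my own simplification, not the SMS planner's code: `TableInfo`, `collocation_group` and `can_push_join` are hypothetical names, and "collocated" is modelled as two tables belonging to the same group (same nodes, same partitioning function).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    name: str
    partition_key: str
    collocation_group: str  # hypothetical: tables in one group share nodes and partitioning


def can_push_join(left: TableInfo, right: TableInfo, left_col: str, right_col: str) -> bool:
    """True if the join can run entirely inside each single-node database.

    Requires that both tables are collocated and that each one is
    partitioned on the column it is being joined on.
    """
    return (
        left.collocation_group == right.collocation_group
        and left.partition_key == left_col
        and right.partition_key == right_col
    )


if __name__ == "__main__":
    orders = TableInfo("orders", "customer_id", "g1")
    customers = TableInfo("customers", "id", "g1")
    logs = TableInfo("logs", "ts", "g2")
    print(can_push_join(orders, customers, "customer_id", "id"))
    print(can_push_join(orders, logs, "customer_id", "ts"))
```

When the check fails, the join falls back to the Hive-style plan, where rows are shuffled and joined in the Reduce phase.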

Quote: "Hadoop simply scales better than any currently available parallel DBMS product."

Final words.
So you need to use a parallel database? Here were the choices I had for my project:

1. Purchase a parallel DB such as Greenplum or Vertica.
Price: $250K.
Thoughts: Everything about this solution is nice except the price.

2. Reduce the amount of data that the DB system must process. For this, keep the existing DB (MySQL): write the results from the BLAST MapReduce jobs to disk and then use a script to upload them to the DB. This way we won't flood the DB with too much data.
Thoughts:  Cheap, some programming required. Not a definitive solution.

3. Use the DB engine to perform the SQL searches, then throw away the data from the DB.
Thoughts:  Cheap, smart, some programming required. Not a definitive solution.

4. Use the DB provided by Hadoop -> HBase/Hive. It is slower but more computers can be used to improve speed.
Thoughts: Cheap (actually free). Unstable (Hadoop is early beta). Difficult to install and maintain. 
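
The upload script from option 2 could look roughly like the sketch below. It is only an outline under several assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, and the tab-separated layout (query_id, subject_id, score) and the table name `hits` are invented for illustration, not the real BLAST output format.

```python
import csv
import sqlite3
from itertools import islice


def load_results(conn, path, batch_size=1000):
    """Load a tab-separated results file into the DB in fixed-size batches.

    Committing per batch keeps each transaction small, so the DB is fed
    at a steady pace instead of being flooded by one huge insert.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(query_id TEXT, subject_id TEXT, score REAL)"
    )
    total = 0
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", batch)
            conn.commit()
            total += len(batch)
    return total
```

With a real MySQL server, the connection object and the parameter placeholder style would differ depending on the driver used.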
