Wednesday, April 15, 2015

MongoDB v3.0 Brings Pluggable Storage Engine, and More!

MongoDB v3.0 brings a set of new features. Some of the major roll-outs include:
  • Pluggable Storage Engine
  • Document-level Locking
  • Compression
For a complete list of features, refer to MongoDB's release notes.

Pluggable Storage Engine

Prior to v3.0, MongoDB ran only on the MMAPv1 storage engine. Since acquiring WiredTiger, MongoDB has developed a pluggable storage engine API, which enables it to run on different storage engines.

List of storage engines:

Storage Engine   Status           Developed By
MMAPv1           Supported        MongoDB
WiredTiger       Supported        MongoDB
In-Memory        In development   MongoDB
RocksDB          In development   RocksDB
InnoDB           In development   InnoDB
FusionIO         In development   FusionIO
HDFS             In development   Hadoop
...              ...              ...

Pluggable storage engines open up new possibilities for replica set deployment. Each member of a replica set can run a different storage engine while sharing the same JSON data model. In an example replica set, different members could run:
  • WiredTiger for write-heavy workloads
  • In-Memory for extremely high throughput
  • HDFS for integration with a Hadoop cluster
  • FusionIO as a backup engine, etc.
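The storage engine is selected per mongod at startup. As a minimal configuration sketch (the `storage.engine` setting and `--storageEngine` flag are the real v3.0 options; the dbPath is just a placeholder):

```yaml
# mongod.conf sketch: select the storage engine at startup
storage:
  engine: wiredTiger     # or mmapv1, the v3.0 default
  dbPath: /data/db       # placeholder path
```

Equivalently, `mongod --storageEngine wiredTiger` on the command line. Note that a mongod cannot switch engines in place; the on-disk data files are engine-specific.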
Document-level Locking

MongoDB was notorious for locking at the database level for all write activities, which hurt throughput under write-heavy workloads. Users had to resort to alternative methods to accommodate them, such as distributing writes across multiple databases or across a sharded cluster. With the v3.0 WiredTiger engine, MongoDB is able to lock at the document level, a real improvement for write-heavy applications.

In addition, MongoDB v3.0 with the default MMAPv1 engine is able to lock at the collection level. That is also an improvement over the previous database-level lock.

WiredTiger ships with a B-tree algorithm by default; an LSM algorithm is available as a configurable option.

  • Read-heavy use case: B-tree > LSM
  • Write-heavy use case: LSM > B-tree
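In v3.0 the LSM option can be requested per collection by passing a WiredTiger configuration string at creation time. A mongo-shell sketch (the collection name is made up; this assumes a mongod already running WiredTiger):

```js
// Create a collection backed by WiredTiger's LSM tree instead
// of the default B-tree; the configString is handed through
// to WiredTiger as-is.
db.createCollection("events", {
  storageEngine: { wiredTiger: { configString: "type=lsm" } }
})
```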

Compression

Compression was not available prior to v3.0. MongoDB v3.0 with the WiredTiger engine can compress data in two flavors: Snappy or Zlib.

  • Snappy - ~70% compression ratio, low CPU overhead, the default option
  • Zlib - ~80% compression ratio, higher CPU overhead, a non-default option
Zlib is suitable for archival purposes, as it trades higher CPU overhead for a higher compression ratio.

Snappy and Zlib compression work on documents and the journal files, while indexes use prefix compression, which compresses indexes at a ~50% ratio.

Note: compression ratio may vary depending on the use case; on average a 70% compression ratio is observed.
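The compressor is also a startup option. A mongod.conf sketch (these are the documented v3.0 settings; the values shown are the available choices):

```yaml
# Sketch: choose the WiredTiger block compressor for collection data
storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib    # snappy (default) | zlib | none
```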

Here's a look at a compression size comparison between different storage configurations:

Test load: 1 collection, 500,000 docs, 20KB/doc

[Chart: data size comparison of MMAPv1 (no compression) vs. WiredTiger with snappy and with zlib]
Compared to MMAPv1, which has no compression option, WiredTiger with snappy or zlib compression does a good job, compressing the data at roughly an ~84% ratio. But does compression affect performance? In the next post, we'll benchmark MongoDB with these storage configurations.

Tuesday, August 27, 2013

Migration, notes from Craigslist


Craigslist uses MongoDB for archiving purposes.
  • Data in the archive is accessed differently than data in production.
    • Updating the schema in the archive takes a month, not including slaves.
  • MySQL concepts carry over to MongoDB:
    • Indexes
    • Master - Slave
    • Binary log = oplog
  • Shard key selection is easy thanks to a unique postID.
  • Data stored on spinning disks takes a long time to access, especially once it grows larger than what fits in RAM.
  • It is wise to test on the same machine spec you will be deploying on.
  • Automatic failovers in replica sets work great. Instead of manually going into the MySQL database and resetting configuration files, system admins can simply watch MongoDB elect a new primary.
  • Know your data.
    • Migrating from a relational model to a document model might cause sizing issues. What happens when data is larger than you think? There are always outliers.
    • String vs. integer: MongoDB is sensitive to the data type stored, for indexing purposes.
  • The balancer can be your friend, but also your enemy.
    • Insert rate can drop by 40x if too much I/O is going on.
    • Turn off the balancer and use pre-splitting if possible.
  • If a slave is down too long and can't catch up using the oplog, it needs to resync with the master and copy the data over. The most painful part: index rebuilds might take days.
    • Solution: having a larger oplog?
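A larger oplog is indeed the usual mitigation: it widens the window a lagging slave has to catch up before a full resync is required. A command-line sketch (the 20 GB figure is purely illustrative, not a recommendation):

```shell
# Start a replica-set member with a 20 GB oplog;
# size this based on write volume and acceptable downtime window
mongod --replSet rs0 --oplogSize 20480
```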


Friday, August 23, 2013

Primary down, all things gone haywire?

What happens if the primary goes down while data has not yet replicated to the secondaries? Read the following:
http://aphyr.com/posts/284-call-me-maybe-mongodb

A solution might be "WriteConcern.MAJORITY", but it does raise performance concerns.
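In the 2013-era mongo shell, majority acknowledgement is requested via getLastError after the write. A sketch (the collection and document are made up for illustration):

```js
// Write, then require acknowledgement from a majority of the
// replica set members before considering the write committed
db.orders.insert({ cust_id: "abc123", price: 25 })
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })
```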

6000 total, 5700 acknowledged, 3319 survivors: 2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻

See this? (╯°□°)╯︵ ┻━┻

I thought it was a funny animation..

Aggregate, now things are moving in a pipeline

MongoDB 2.2 introduces the "aggregation framework"; the official definition is "operations that process data records and return computed results." For easier understanding, imagine a bucket containing good and bad apples and oranges; I can use the aggregation framework to get a quick count of the good apples and good oranges.
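The bucket analogy can be sketched as a pipeline (the collection and field names here are invented for illustration):

```js
// Count good apples and good oranges in a hypothetical "fruits"
// collection of docs like { type: "apple", good: true }
db.fruits.aggregate([
  { $match: { good: true } },
  { $group: { _id: "$type", count: { $sum: 1 } } }
])
```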

From a SQL query point of view:
SQL query
SELECT cust_id, SUM(price) FROM orders
WHERE active=true
GROUP BY cust_id

MongoDB Aggregation
db.orders.aggregate( [
  { $match: { active: true } },
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
] )

Thursday, August 22, 2013

A quick brain twist from SQL to MongoDB

Underlying features of SQL work differently in MongoDB, but this chart gives you a quick boost on how the commands translate:
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
SQL Terms, Functions, and Concepts   MongoDB Aggregation Operators
WHERE                                $match
GROUP BY                             $group
HAVING                               $match
SELECT                               $project
ORDER BY                             $sort
LIMIT                                $limit
SUM()                                $sum
COUNT()                              $sum
join                                 No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.

Head to the page to see examples.
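One row worth expanding: HAVING maps to a $match placed after the $group stage. A sketch reusing the orders example from the previous post:

```js
// SQL: SELECT cust_id, SUM(price) AS total FROM orders
//      GROUP BY cust_id HAVING SUM(price) > 100
db.orders.aggregate([
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } },
  { $match: { total: { $gt: 100 } } }
])
```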

Useful MongoDB DBA commands

Common MongoDB Server commands:

  • Server
    • isMaster
    • serverStatus
    • logout
    • getLastError
  • DB
    • dropDatabase
    • repairDatabase
    • close
    • copydb
    • dbStats
  • Collection
    • DBA
      • create
      • drop 
      • collStats
      • renameCollection
    • User
      • count
      • aggregate
      • mapReduce
      • findAndModify
  • Index
    • ensureIndex
    • dropIndex
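A few of these in a shell session, for flavor (the collection name is made up):

```js
db.stats()                            // dbStats for the current database
db.posts.stats()                      // collection stats (collStats)
db.posts.ensureIndex({ postID: 1 })   // build an index
db.posts.dropIndex({ postID: 1 })     // drop it again
```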

Tuesday, August 20, 2013

DDD (Domain Driven Design)

This thread introduced me to the idea of DDD (Domain Driven Design):
http://programmers.stackexchange.com/questions/158790/best-practices-for-nosql-database-design

A friend on the 6th floor works on a payment system involving Cassandra+Hive+Hadoop, which led me to this interesting article comparing 4 different DBs (MongoDB, Riak, HBase, Cassandra):
http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/