Wednesday, April 15, 2015

MongoDB v3.0 Brings Pluggable Storage Engine, and More!

MongoDB v3.0 brings a set of new features. Some of the major roll-outs include:
  • Pluggable Storage Engine
  • Document-level Locking
  • Compression
For a complete list of features, refer to MongoDB's release notes.

Pluggable Storage Engine

Prior to v3.0, MongoDB ran only on the MMAPv1 storage engine. Since acquiring WiredTiger, MongoDB has developed a pluggable storage engine API, which enables it to run on different storage engines.

List of storage engines:

Storage Engine   Status           Developed By
MMAPv1           Supported        MongoDB
WiredTiger       Supported        MongoDB
In-Memory        In development   MongoDB
RocksDB          In development   RocksDB
InnoDB           In development   InnoDB
FusionIO         In development   FusionIO
HDFS             In development   Hadoop
...              ...              ...

Pluggable storage engines open up new possibilities for replica set deployment. Each member of a replica set can run a different storage engine while sharing the same JSON data model. In an example replica set, different members could run:
  • WiredTiger for write-heavy workloads
  • In-Memory for extremely high throughput
  • HDFS for integration with a Hadoop cluster
  • FusionIO as a backup engine, etc.
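The storage engine is selected per mongod at startup. As a minimal configuration sketch (the `storage.engine` setting and `--storageEngine` flag are the real v3.0 options; the dbPath is just a placeholder):

```yaml
# mongod.conf sketch: select the storage engine at startup
storage:
  engine: wiredTiger     # or mmapv1, the v3.0 default
  dbPath: /data/db       # placeholder path
```

Equivalently, `mongod --storageEngine wiredTiger` on the command line. Note that a mongod cannot switch engines in place; the on-disk data files are engine-specific.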
Document-level Locking

MongoDB was notorious for locking at the database level for all write activities, which hurt throughput under write-heavy workloads. Users had to resort to alternative methods to accommodate them, such as distributing writes across multiple databases or across a sharded cluster. With the v3.0 WiredTiger engine, MongoDB is able to lock at the document level, a real improvement for write-heavy applications.

In addition, MongoDB v3.0 with the default MMAPv1 engine is able to lock at the collection level. That is also an improvement over the previous database-level lock.

WiredTiger ships with a B-tree algorithm by default; an LSM algorithm is available as a configurable option.

  • Read-heavy use case: B-tree > LSM
  • Write-heavy use case: LSM > B-tree
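In v3.0 the LSM option can be requested per collection by passing a WiredTiger configuration string at creation time. A mongo-shell sketch (the collection name is made up; this assumes a mongod already running WiredTiger):

```js
// Create a collection backed by WiredTiger's LSM tree instead
// of the default B-tree; the configString is handed through
// to WiredTiger as-is.
db.createCollection("events", {
  storageEngine: { wiredTiger: { configString: "type=lsm" } }
})
```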

Compression

Compression was not available prior to v3.0. MongoDB v3.0 with the WiredTiger engine can compress data in two flavors: Snappy or Zlib.

  • Snappy - ~70% compression ratio, low CPU overhead, the default option
  • Zlib - ~80% compression ratio, higher CPU overhead, a non-default option
Zlib is suitable for archival purposes, as it trades higher CPU overhead for a higher compression ratio.

Snappy and Zlib compression work on documents and the journal files, while indexes use prefix compression, which compresses indexes at a ~50% ratio.

Note: compression ratio may vary depending on the use case; on average a 70% compression ratio is observed.
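The compressor is also a startup option. A mongod.conf sketch (these are the documented v3.0 settings; the values shown are the available choices):

```yaml
# Sketch: choose the WiredTiger block compressor for collection data
storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib    # snappy (default) | zlib | none
```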

Here's a look at a compression size comparison between different storage configurations:

Test load: 1 collection, 500,000 docs, 20KB/doc

[Chart: data size comparison of MMAPv1 (no compression) vs. WiredTiger with snappy and with zlib]
Compared to MMAPv1, which has no compression option, WiredTiger with snappy or zlib compression does a good job, compressing the data at roughly an ~84% ratio. But does compression affect performance? In the next post, we'll benchmark MongoDB with these storage configurations.

Tuesday, August 27, 2013

Migration, notes from Craigslist


Craigslist uses MongoDB for archiving purposes.
  • Data in the archive is accessed differently than data in production.
    • Updating the schema in the archive takes a month, not including slaves.
  • MySQL concepts carry over to MongoDB:
    • Indexes
    • Master - Slave
    • Binary log = oplog
  • Shard key selection is easy thanks to a unique postID.
  • Data stored on spinning disks takes a long time to access, especially once it grows larger than what fits in RAM.
  • It is wise to test on the same machine spec you will be deploying on.
  • Automatic failovers in replica sets work great. Instead of manually going into the MySQL database and resetting configuration files, system admins can simply watch MongoDB elect a new primary.
  • Know your data.
    • Migrating from a relational model to a document model might cause sizing issues. What happens when data is larger than you think? There are always outliers.
    • String vs. integer: MongoDB is sensitive to the data type stored, for indexing purposes.
  • The balancer can be your friend, but also your enemy.
    • Insert rate can drop by 40x if too much I/O is going on.
    • Turn off the balancer and use pre-splitting if possible.
  • If a slave is down too long and can't catch up using the oplog, it needs to resync with the master and copy the data over. The most painful part: index rebuilds might take days.
    • Solution: having a larger oplog?
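A larger oplog is indeed the usual mitigation: it widens the window a lagging slave has to catch up before a full resync is required. A command-line sketch (the 20 GB figure is purely illustrative, not a recommendation):

```shell
# Start a replica-set member with a 20 GB oplog;
# size this based on write volume and acceptable downtime window
mongod --replSet rs0 --oplogSize 20480
```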


Friday, August 23, 2013

Primary down, all things gone haywire?

What happens if the primary goes down while data has not yet replicated to the secondaries? Read the following:
http://aphyr.com/posts/284-call-me-maybe-mongodb

A solution might be "WriteConcern.MAJORITY", but it does raise performance concerns.
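In the 2013-era mongo shell, majority acknowledgement is requested via getLastError after the write. A sketch (the collection and document are made up for illustration):

```js
// Write, then require acknowledgement from a majority of the
// replica set members before considering the write committed
db.orders.insert({ cust_id: "abc123", price: 25 })
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })
```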

6000 total, 5700 acknowledged, 3319 survivors: 2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻

See this? (╯°□°)╯︵ ┻━┻

I thought it was a funny animation..

Aggregate, now things are moving in a pipeline

MongoDB 2.2 introduces the "aggregation framework"; the official definition is "operations that process data records and return computed results." For easier understanding, imagine a bucket containing good and bad apples and oranges; I can use the aggregation framework to get a quick count of the good apples and good oranges.
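The bucket analogy can be sketched as a pipeline (the collection and field names here are invented for illustration):

```js
// Count good apples and good oranges in a hypothetical "fruits"
// collection of docs like { type: "apple", good: true }
db.fruits.aggregate([
  { $match: { good: true } },
  { $group: { _id: "$type", count: { $sum: 1 } } }
])
```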

From a SQL query point of view:
SQL query
SELECT cust_id, SUM(price) FROM orders
WHERE active=true
GROUP BY cust_id

MongoDB Aggregation
db.orders.aggregate( [
  { $match: { active: true } },
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
] )

Thursday, August 22, 2013

A quick brain twist from SQL to MongoDB

Underlying features of SQL work differently in MongoDB, but this chart gives you a quick boost on how the commands translate:
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
SQL Terms, Functions, and Concepts   MongoDB Aggregation Operators
WHERE                                $match
GROUP BY                             $group
HAVING                               $match
SELECT                               $project
ORDER BY                             $sort
LIMIT                                $limit
SUM()                                $sum
COUNT()                              $sum
join                                 No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.

Head to the page to see examples.
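One row worth expanding: HAVING maps to a $match placed after the $group stage. A sketch reusing the orders example from the previous post:

```js
// SQL: SELECT cust_id, SUM(price) AS total FROM orders
//      GROUP BY cust_id HAVING SUM(price) > 100
db.orders.aggregate([
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } },
  { $match: { total: { $gt: 100 } } }
])
```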

Useful MongoDB DBA commands

Common MongoDB Server commands:

  • Server
    • isMaster
    • serverStatus
    • logout
    • getLastError
  • DB
    • dropDatabase
    • repairDatabase
    • close
    • copydb
    • dbStats
  • Collection
    • DBA
      • create
      • drop 
      • collStats
      • renameCollection
    • User
      • count
      • aggregate
      • mapReduce
      • findAndModify
  • Index
    • ensureIndex
    • dropIndex
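A few of these in a shell session, for flavor (the collection name is made up):

```js
db.stats()                            // dbStats for the current database
db.posts.stats()                      // collection stats (collStats)
db.posts.ensureIndex({ postID: 1 })   // build an index
db.posts.dropIndex({ postID: 1 })     // drop it again
```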

Tuesday, August 20, 2013

DDD (Domain Driven Design)

This thread introduced me to the idea of DDD (Domain Driven Design):
http://programmers.stackexchange.com/questions/158790/best-practices-for-nosql-database-design

A friend on the 6th floor works on a payment system involving Cassandra+Hive+Hadoop, which led me to this interesting article comparing 4 different DBs (MongoDB, Riak, HBase, Cassandra):
http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/