Tuesday, August 27, 2013

Migration, notes from Craigslist


Craigslist uses MongoDB for archival purposes
  • Data in the archive is accessed differently than data in production
    • A schema update in the archive takes a month, not including slaves.
  • MySQL concepts carry over to MongoDB
    • Indexes
    • Master - Slave
    • Binary log = oplog
  • Shard key selection is easy thanks to the unique postID
  • Data stored on spinning disks takes a long time to access, especially once it grows larger than what fits in RAM.
  • It is wise to test on the same machine spec you will be deploying on.
  • Automatic failover in replica sets works great. Instead of manually going into the MySQL database and resetting configuration files, system admins can simply watch MongoDB elect a new primary.
  • Know your data
    • Migrating from a relational model to a document model might cause sizing issues. What happens when data is larger than you think? There are always outliers.
    • String vs. integer: MongoDB is sensitive to the data type stored, for indexing purposes.
  • The balancer can be your friend, but also your enemy
    • The insert rate can drop by 40x if too much I/O is going on
    • Turn off the balancer and use pre-splitting if possible.
  • If a slave is down too long and can't catch up using the oplog, it needs to resync with the master and copy the data over. The most painful part: the index rebuild might take days.
    • Solution: a larger oplog?
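The oplog-size tradeoff above can be sketched with a toy model in plain JavaScript (the numbers and the function are made up for illustration, not from Craigslist's setup): a capped oplog holds a fixed window of history, so a slave can catch up only if its downtime fits inside that window.

```javascript
// Toy model: the oplog is a capped buffer of recent operations.
// A slave that was down can catch up only if every op it missed
// is still inside the oplog window; otherwise it must fully resync
// (and suffer the days-long index rebuild).

function canCatchUp(oplogSizeMB, writeRateMBPerHour, downtimeHours) {
  const oplogWindowHours = oplogSizeMB / writeRateMBPerHour; // hours of history kept
  return downtimeHours <= oplogWindowHours;
}

// Example: 50 GB oplog, 2 GB/hour of writes -> ~25 hours of history.
console.log(canCatchUp(50 * 1024, 2 * 1024, 20)); // true: slave down 20h catches up
console.log(canCatchUp(50 * 1024, 2 * 1024, 40)); // false: slave down 40h must resync
```

This is why "a larger oplog" is the usual answer: it widens the window, at the cost of disk space.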


Friday, August 23, 2013

Primary down, all things gone haywire?

What happens if the primary goes down while data has not yet replicated to the secondaries? Read the following:
http://aphyr.com/posts/284-call-me-maybe-mongodb

A solution might be "WriteConcern.MAJORITY", but it does raise concerns about performance.

6000 total 5700 acknowledged 3319 survivors 2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
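The arithmetic behind that tally from the aphyr post is worth spelling out; a one-liner makes the loss concrete:

```javascript
// Numbers from the aphyr post's tally: of 6000 attempted writes,
// 5700 were acknowledged by the primary, but only 3319 survived
// the partition and failover.
const acknowledged = 5700;
const survivors = 3319;
const lost = acknowledged - survivors;
console.log(lost); // 2381 acknowledged writes lost
```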

See this? (╯°□°)╯︵ ┻━┻

I thought it was a funny animation..

Aggregate, now things are moving in a pipeline

MongoDB 2.2 introduces the "aggregation framework"; the official definition: operations that process data records and return computed results. For easier understanding, imagine a bucket containing good and bad apples and oranges; I can use the aggregation framework to quickly get a total of the good apples and oranges.

From a SQL query point of view:
SQL query
SELECT cust_id, SUM(price) FROM orders
WHERE active=true
GROUP BY cust_id

MongoDB Aggregation
db.orders.aggregate( [
  { $match: { active: true } },
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
] )
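The shell version needs a running server, but the same WHERE + GROUP BY logic (a $match stage followed by $group) can be simulated in plain JavaScript over in-memory documents; the sample orders here are made up:

```javascript
// Plain-JS simulation of the pipeline: filter() plays $match,
// reduce() plays $group with a $sum accumulator.
const orders = [
  { cust_id: "A1", price: 50, active: true },
  { cust_id: "A1", price: 25, active: true },
  { cust_id: "B2", price: 10, active: false }, // filtered out by the "match"
  { cust_id: "B2", price: 40, active: true },
];

const totals = orders
  .filter(o => o.active)            // $match: { active: true }
  .reduce((acc, o) => {             // $group: { _id: "$cust_id", total: { $sum: "$price" } }
    acc[o.cust_id] = (acc[o.cust_id] || 0) + o.price;
    return acc;
  }, {});

console.log(totals); // { A1: 75, B2: 40 }
```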

Thursday, August 22, 2013

A quick brain twist from SQL to MongoDB

The underlying features of SQL work differently in MongoDB, but this chart can give you a quick boost on how the commands translate:
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
SQL Terms, Functions, and Concepts    MongoDB Aggregation Operators
WHERE                                 $match
GROUP BY                              $group
HAVING                                $match
SELECT                                $project
ORDER BY                              $sort
LIMIT                                 $limit
SUM()                                 $sum
COUNT()                               $sum
join                                  No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.

Head to the page to see examples.
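Since $unwind is the one operator in the table without an obvious SQL twin, here is a plain-JavaScript sketch of what it does (the helper function and sample document are my own illustration, not MongoDB API):

```javascript
// $unwind emits one output document per array element, with the
// array field replaced by that single element.
function unwind(docs, field) {
  return docs.flatMap(d =>
    (d[field] || []).map(v => ({ ...d, [field]: v }))
  );
}

const posts = [{ _id: 1, tags: ["db", "nosql"] }];
console.log(unwind(posts, "tags"));
// [ { _id: 1, tags: 'db' }, { _id: 1, tags: 'nosql' } ]
```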

Useful MongoDB DBA commands

Common MongoDB Server commands:

  • Server
    • isMaster
    • serverStatus
    • logout
    • getLastError
  • DB
    • dropDatabase
    • repairDatabase
    • close
    • copydb
    • dbStats
  • Collection
    • DBA
      • create
      • drop 
      • collstats
      • renameCollection
    • User
      • count
      • aggregate
      • mapReduce
      • findAndModify
  • Index
    • ensureIndex
    • dropIndex

Tuesday, August 20, 2013

DDD (Domain Driven Design)

This thread introduced me to the idea of DDD (Domain Driven Design).
http://programmers.stackexchange.com/questions/158790/best-practices-for-nosql-database-design

A friend on the 6th floor works on a payment system involving Cassandra + Hive + Hadoop, which led me to this interesting article comparing 4 different DBs (MongoDB, Riak, HBase, Cassandra):
http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/

Indexes, just like relational database!

Week 4 on M101J - Indexes

  • Indexes take additional space, but provide much faster data retrieval.
  • Creating indexes
    • .ensureIndex( )
    • e.g. db.students.ensureIndex( { student_id : 1 } );
    • e.g. db.students.ensureIndex( { student_id : 1, "class" : -1 } );
  • Multi-key index
    • index key can be array
    • e.g. db.bbb.insert( { a: [1, 2, 3], b: 1 } );
    • two keys in a compound index cannot both be arrays
    • e.g. db.bbb.insert( { a: [1, 2, 3], b: [4, 5, 6] } ); fails with a compound index on a and b
  • Unique index
    • index key has to be unique, duplicate not allowed
    • e.g. db.students.ensureIndex( { student_id: 1, name: 1 }, {unique: true} );
  • Foreground vs background indexing
    • Foreground: fast, but blocks all writes. Suitable for DBAs, who can take a replica out of rotation while it is being indexed.
    • Background: slower (2~4x), but runs concurrently with writes. Suitable for developers in a production setting.
  • .explain( )
    • Useful for examining a query to see if an index is utilized
    • Important keys: "cursor" (did it use BtreeCursor?), "nscannedObjects" (how many objects were actually scanned?)
  • .hint( )
    • $natural: returns results in their natural order
    • .hint( { $natural: 1 } ) will use BasicCursor instead of BtreeCursor
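The difference .explain( ) reports between a collection scan and an index lookup can be imitated in plain JavaScript (a toy illustration with made-up data, not MongoDB internals):

```javascript
// Without an index, a query touches every document, like
// BasicCursor with nscannedObjects = N. An index is essentially a
// precomputed key -> document structure, so a lookup touches ~1.
const students = Array.from({ length: 1000 }, (_, i) => ({ student_id: i }));

// Collection scan: examines all 1000 docs to find one match.
let nscannedObjects = 0;
const scanHit = students.filter(s => {
  nscannedObjects++;
  return s.student_id === 42;
});

// "Index": a Map from student_id to the document.
const index = new Map(students.map(s => [s.student_id, s]));
const indexHit = index.get(42); // one lookup, no scan

console.log(nscannedObjects); // 1000 for the scan, vs 1 lookup via the index
```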


Monday, August 19, 2013

Nice course from 10gen

10gen (maker and distributor of MongoDB) offers some very nice courses on MongoDB:
https://education.10gen.com/

Notes from M101J - week3:

Cool stuff about MongoDB schema

  • Rich documents
    • Store array of data
  • Pre-join data (embed data)
    • Fast access
    • No "Mongo Joins"
    • No constraints
      • No primary key / foreign key
  • Atomic transaction operation
    • Within one document
  • No declared schema
    • Similar structure in documents
Living without transactions
  • Atomic operations
    • To accomplish them:
      • Restructure code to work within the same document
      • Implement a locking mechanism / semaphore
      • Tolerate inconsistency
  • One-to-one relationships
    • To embed or not to embed depends on:
      • Frequency of access
      • Size of items ( > 16MB? )
      • Atomicity of data
  • Benefits of embedding
    • Improved read performance
    • One round trip to the DB
      • Disks have high latency (~1ms per seek) but high bandwidth, so one read of a larger document beats many small reads
    • "Write" latency can be significantly improved by embedding data
  • Decision to denormalize
    • 1:1 - Embed
    • 1:many - Embed (from the many side into the one)
    • many:many - Link (using arrays of _id)
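The three shapes above can be sketched as plain JavaScript objects (the documents here are invented examples of the pattern, not from the course):

```javascript
// 1:1 and 1:many -> embed inside the "one" side:
const post = {
  _id: 1,
  title: "Schema notes",
  comments: [                       // 1:many, embedded from the many side into the one
    { author: "amy", text: "nice" },
    { author: "bob", text: "+1" },
  ],
};

// many:many -> link via arrays of _id on each side:
const authors = [{ _id: "amy", bookIds: [10, 11] }];
const books = [
  { _id: 10, authorIds: ["amy"] },
  { _id: 11, authorIds: ["amy"] },
];

// With no "Mongo Joins", the linking is resolved in the application:
const amysBooks = books.filter(b => authors[0].bookIds.includes(b._id));
console.log(amysBooks.length); // 2
```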

Multi-dimensional skills needed

After two weeks into all things MongoDB, this is what I think is needed to be an expert:

  • Setup, monitor, and administer MongoDB on servers.
  • Understand / use MongoDB in application development.
  • Troubleshoot issues as they arise.
  • Database migration, from other DBs to MongoDB
  • "Sharding" - understand it at a deeper level: when does it occur and how?
  • Distinguish the differences between MongoDB and other DBs, pros / cons.
  • Understand how MongoDB performs / reacts on different storage technologies (SAS vs SSD vs PCI).
  • Understand the advanced inner workings of MongoDB.