Tuesday, August 27, 2013

Migration, notes from Craigslist


Craigslist uses MongoDB for archival purposes
  • Data in the archive is accessed differently than data in production
    • A schema update in the archive takes a month, not including slaves.
  • MySQL concepts carry over to MongoDB
    • Indexes
    • Master - Slave
    • Binary log = oplog
  • Shard key selection is easy thanks to the unique postID
  • Data stored on spinning disks takes a long time to access, especially once it grows larger than what fits in RAM.
  • It is wise to test on the same machine spec you will be deploying on.
  • Automatic failover in replica sets works great. Instead of manually going into the MySQL database and resetting configuration files, system admins can simply watch MongoDB elect a new primary.
  • Know your data
    • Migrating from a relational model to a document model might cause sizing issues. What happens when data is larger than you think? There are always outliers.
    • String vs. integer: MongoDB is sensitive to the data type stored, for indexing purposes.
  • The balancer can be your friend, but also your enemy
    • The insert rate can drop by 40x if too much I/O is going on
    • Turn off the balancer and use pre-splitting if possible.
  • If a slave is down too long and can't catch up using the oplog, it needs to resync with the master and copy the data over. The most painful part: the index rebuild might take days.
    • Solution: a larger oplog?
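The oplog-size tradeoff above can be sketched with a toy model in plain JavaScript (the numbers and the function are made up for illustration, not from Craigslist's setup): a capped oplog holds a fixed window of history, so a slave can catch up only if its downtime fits inside that window.

```javascript
// Toy model: the oplog is a capped buffer of recent operations.
// A slave that was down can catch up only if every op it missed
// is still inside the oplog window; otherwise it must fully resync
// (and suffer the days-long index rebuild).

function canCatchUp(oplogSizeMB, writeRateMBPerHour, downtimeHours) {
  const oplogWindowHours = oplogSizeMB / writeRateMBPerHour; // hours of history kept
  return downtimeHours <= oplogWindowHours;
}

// Example: 50 GB oplog, 2 GB/hour of writes -> ~25 hours of history.
console.log(canCatchUp(50 * 1024, 2 * 1024, 20)); // true: slave down 20h catches up
console.log(canCatchUp(50 * 1024, 2 * 1024, 40)); // false: slave down 40h must resync
```

This is why "a larger oplog" is the usual answer: it widens the window, at the cost of disk space.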


Friday, August 23, 2013

Primary down, all things gone haywire?

What happens if the primary goes down while data has not yet replicated to the secondaries? Read the following:
http://aphyr.com/posts/284-call-me-maybe-mongodb

A solution might be "WriteConcern.MAJORITY", but it does raise concerns about performance.

6000 total 5700 acknowledged 3319 survivors 2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
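The arithmetic behind that tally from the aphyr post is worth spelling out; a one-liner makes the loss concrete:

```javascript
// Numbers from the aphyr post's tally: of 6000 attempted writes,
// 5700 were acknowledged by the primary, but only 3319 survived
// the partition and failover.
const acknowledged = 5700;
const survivors = 3319;
const lost = acknowledged - survivors;
console.log(lost); // 2381 acknowledged writes lost
```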

See this? (╯°□°)╯︵ ┻━┻

I thought it was a funny animation..

Aggregate, now things are moving in a pipeline

MongoDB 2.2 introduces the "aggregation framework"; the official definition: operations that process data records and return computed results. For easier understanding, imagine a bucket containing good and bad apples and oranges; I can use the aggregation framework to quickly get a total of the good apples and oranges.

From a SQL query point of view:
SQL query
SELECT cust_id, SUM(price) FROM orders
WHERE active=true
GROUP BY cust_id

MongoDB Aggregation
db.orders.aggregate( [
  { $match: { active: true } },
  { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
] )
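The shell version needs a running server, but the same WHERE + GROUP BY logic (a $match stage followed by $group) can be simulated in plain JavaScript over in-memory documents; the sample orders here are made up:

```javascript
// Plain-JS simulation of the pipeline: filter() plays $match,
// reduce() plays $group with a $sum accumulator.
const orders = [
  { cust_id: "A1", price: 50, active: true },
  { cust_id: "A1", price: 25, active: true },
  { cust_id: "B2", price: 10, active: false }, // filtered out by the "match"
  { cust_id: "B2", price: 40, active: true },
];

const totals = orders
  .filter(o => o.active)            // $match: { active: true }
  .reduce((acc, o) => {             // $group: { _id: "$cust_id", total: { $sum: "$price" } }
    acc[o.cust_id] = (acc[o.cust_id] || 0) + o.price;
    return acc;
  }, {});

console.log(totals); // { A1: 75, B2: 40 }
```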

Thursday, August 22, 2013

A quick brain twist from SQL to MongoDB

The underlying features of SQL work differently in MongoDB, but this chart can give you a quick boost on how the commands translate:
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
SQL Terms, Functions, and Concepts    MongoDB Aggregation Operators
WHERE                                 $match
GROUP BY                              $group
HAVING                                $match
SELECT                                $project
ORDER BY                              $sort
LIMIT                                 $limit
SUM()                                 $sum
COUNT()                               $sum
join                                  No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.

Head to the page to see examples.
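Since $unwind is the one operator in the table without an obvious SQL twin, here is a plain-JavaScript sketch of what it does (the helper function and sample document are my own illustration, not MongoDB API):

```javascript
// $unwind emits one output document per array element, with the
// array field replaced by that single element.
function unwind(docs, field) {
  return docs.flatMap(d =>
    (d[field] || []).map(v => ({ ...d, [field]: v }))
  );
}

const posts = [{ _id: 1, tags: ["db", "nosql"] }];
console.log(unwind(posts, "tags"));
// [ { _id: 1, tags: 'db' }, { _id: 1, tags: 'nosql' } ]
```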

Useful MongoDB DBA commands

Common MongoDB Server commands:

  • Server
    • isMaster
    • serverStatus
    • logout
    • getLastError
  • DB
    • dropDatabase
    • repairDatabase
    • close
    • copydb
    • dbStats
  • Collection
    • DBA
      • create
      • drop 
      • collstats
      • renameCollection
    • User
      • count
      • aggregate
      • mapReduce
      • findAndModify
  • Index
    • ensureIndex
    • dropIndex

Tuesday, August 20, 2013

DDD (Domain Driven Design)

This thread introduced me to the idea of DDD (Domain Driven Design).
http://programmers.stackexchange.com/questions/158790/best-practices-for-nosql-database-design

A friend on the 6th floor works on a payment system involving Cassandra + Hive + Hadoop, which led me to this interesting article comparing 4 different DBs (MongoDB, Riak, HBase, Cassandra):
http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/

Indexes, just like relational database!

Week 4 on M101J - Indexes

  • Indexes take additional space, but provide much faster data retrieval.
  • Creating indexes
    • .ensureIndex( )
    • e.g. db.students.ensureIndex( { student_id : 1 } );
    • e.g. db.students.ensureIndex( { student_id : 1, "class" : -1 } );
  • Multi-key index
    • index key can be array
    • e.g. db.bbb.insert( { a: [1, 2, 3], b: 1 } );
    • two keys in a compound index cannot both be arrays
    • e.g. db.bbb.insert( { a: [1, 2, 3], b: [4, 5, 6] } ); fails with a compound index on a and b
  • Unique index
    • index key has to be unique, duplicate not allowed
    • e.g. db.students.ensureIndex( { student_id: 1, name: 1 }, {unique: true} );
  • Foreground vs background indexing
    • Foreground: fast, but blocks all writes. Suitable for DBAs, who can take a replica out of rotation while it is being indexed.
    • Background: slower (2~4x), but runs concurrently with writes. Suitable for developers in a production setting.
  • .explain( )
    • Useful for examining a query to see if an index is utilized
    • Important keys: "cursor" (did it use BtreeCursor?), "nscannedObjects" (how many objects were actually scanned?)
  • .hint( )
    • $natural: returns results in their natural order
    • .hint( { $natural: 1 } ) will use BasicCursor instead of BtreeCursor
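The difference .explain( ) reports between a collection scan and an index lookup can be imitated in plain JavaScript (a toy illustration with made-up data, not MongoDB internals):

```javascript
// Without an index, a query touches every document, like
// BasicCursor with nscannedObjects = N. An index is essentially a
// precomputed key -> document structure, so a lookup touches ~1.
const students = Array.from({ length: 1000 }, (_, i) => ({ student_id: i }));

// Collection scan: examines all 1000 docs to find one match.
let nscannedObjects = 0;
const scanHit = students.filter(s => {
  nscannedObjects++;
  return s.student_id === 42;
});

// "Index": a Map from student_id to the document.
const index = new Map(students.map(s => [s.student_id, s]));
const indexHit = index.get(42); // one lookup, no scan

console.log(nscannedObjects); // 1000 for the scan, vs 1 lookup via the index
```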


Monday, August 19, 2013

Nice course from 10gen

10gen (maker and distributor of MongoDB) offers some very nice courses on MongoDB:
https://education.10gen.com/

Notes from M101J - week3:

Cool stuff about MongoDB schema

  • Rich documents
    • Store array of data
  • Pre-join data (embed data)
    • Fast access
    • No "Mongo Joins"
    • No constraints
      • No primary key / foreign key
  • Atomic transaction operation
    • Within one document
  • No declared schema
    • Similar structure in documents
Living without transactions
  • Atomic operations
    • To accomplish them:
      • Restructure code to work within the same document
      • Implement a locking mechanism / semaphore
      • Tolerate inconsistency
  • One-to-one relationships
    • To embed or not to embed depends on:
      • Frequency of access
      • Size of items ( > 16MB? )
      • Atomicity of data
  • Benefits of embedding
    • Improved read performance
    • One round trip to the DB
      • Disks have high latency (~1ms per seek) but high bandwidth, so one read of a larger document beats many small reads
    • "Write" latency can be significantly improved by embedding data
  • Decision to denormalize
    • 1:1 - Embed
    • 1:many - Embed (from the many side into the one)
    • many:many - Link (using arrays of _id)
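The three shapes above can be sketched as plain JavaScript objects (the documents here are invented examples of the pattern, not from the course):

```javascript
// 1:1 and 1:many -> embed inside the "one" side:
const post = {
  _id: 1,
  title: "Schema notes",
  comments: [                       // 1:many, embedded from the many side into the one
    { author: "amy", text: "nice" },
    { author: "bob", text: "+1" },
  ],
};

// many:many -> link via arrays of _id on each side:
const authors = [{ _id: "amy", bookIds: [10, 11] }];
const books = [
  { _id: 10, authorIds: ["amy"] },
  { _id: 11, authorIds: ["amy"] },
];

// With no "Mongo Joins", the linking is resolved in the application:
const amysBooks = books.filter(b => authors[0].bookIds.includes(b._id));
console.log(amysBooks.length); // 2
```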

Multi-dimensional skills needed

After two weeks into all things MongoDB, this is what I think is needed to be an expert:

  • Setup, monitor, and administer MongoDB on servers.
  • Understand / use MongoDB in application development.
  • Troubleshoot issues as they arise.
  • Database migration, from other DBs to MongoDB
  • "Sharding" - understand it at a deeper level: when does it occur and how?
  • Distinguish the differences between MongoDB and other DBs, pros / cons.
  • Understand how MongoDB performs / reacts on different storage technologies (SAS vs SSD vs PCI).
  • Understand the advanced inner workings of MongoDB.