I’ll Give MongoDB Another Try. In Ten Years.

A few weeks ago I wrote a small app that fetches JSON documents from app.net’s API and draws a word cloud. At first I wasn’t keeping the content around after generating the images. Later I thought of other things I’d like to do with the documents, so I decided to start storing them.

I’d never used MongoDB, and I have little interest in the NoSQL hype (particularly for my own toy projects). However, it seemed like a good fit for what I wanted to do: store and query JSON documents without worrying about schemas. I followed the MongoDB Ruby tutorial, which shows you how simple it is to insert documents into Mongo:

doc = {"name" => "MongoDB", "type" => "database", "count" => 1,
       "info" => {"x" => 203, "y" => '102'}}
coll.insert(doc)

So, one gem install and a few lines of code later I was happily inserting documents into a MongoDB server on my puny AWS Micro instance somewhere in Oregon. It worked just fine for all of three weeks.

Yesterday I decided to compute some stats, and I discovered that the most recent document was four days old. Hmmm. I checked my script that fetches and inserts documents; it was running and there were no errors in the log. MongoDB seemed fine too. What could be the problem? Long story short, this blog post from three years ago explains it:

32-bit MongoDB processes are limited to about 2 gb of data.  This has come as a surprise to a lot of people who are used to not having to worry about that.  The reason for this is that the MongoDB storage engine uses memory-mapped files for performance.

By not supporting more than 2gb on 32-bit, we’ve been able to keep our code much simpler and cleaner.  This greatly reduces the number of bugs, and reduces the time that we need to release a 1.0 product. The world is moving toward all 64-bit very quickly.  Right now there aren’t too many people for whom 64-bit is a problem, and in the long term, we think this will be a non-issue.

Sure enough, my database had reached 2GB in size and the inserts started failing silently. WTF zomg LOL zombie sandwiches!

This is a horrendous design flaw for a piece of software that calls itself a database. From the Zen of Python:

Errors should never pass silently.
    Unless explicitly silenced.

There is a post on Hacker News by someone who doesn’t like the Go language because you have to check errors in return values (that’s where I got the quote above). This is worse because, like I just said, MongoDB is a database (or at least it plays one on the web). If you tell a database to store something and it doesn’t complain, you should be able to safely assume that it was stored. In fact, the Ruby tutorial never tells you to check any error codes. That’s what exceptions are for.

This gave me a nasty feeling about MongoDB. If something so elementary can be so wrong, what other problems could be lurking in there? I immediately switched to CouchDB (once again because it was pretty trivial), but if this were a serious project I’d be using Postgres. I’d spend the extra hour figuring out the right schema, or maybe I’d even try the new JSON support in Postgres 9.2.

Wait a second, maybe I should reconsider. After all, relational databases were not designed for the web. And MongoDB is Web Scale.

Slap me silly on Hacker News or maybe Reddit Programming 🙂

66 Replies to “I’ll Give MongoDB Another Try. In Ten Years.”

  1. “inserts started failing silently”…because you are not checking for errors.
    Using getLastError() or safe mode is your friend. Don’t complain about incompetent usage of a technology when you obviously did not read half of the tutorial.
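
    For example, with the old Ruby driver a safe insert looks roughly like this (a sketch; :safe => true is the pre-MongoClient option name, and coll/doc are the objects from the tutorial):

    coll.insert(doc, :safe => true)  # waits for getLastError; raises Mongo::OperationFailure if the write failed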

    1. Something so counterintuitive should be featured prominently in the tutorial. It isn’t. In fact, I can’t find it either here:

      http://api.mongodb.org/wiki/current/Ruby%20Language%20Center.html

      or here:

      https://github.com/mongodb/mongo-ruby-driver

      BTW, your comment is unnecessarily aggressive. The point of the post is that the *DEFAULT* should be to throw an error. It’s a database. Your point is like saying that you don’t know how to drive because you expected a new car to have a steering wheel, and instead it has two arrow keys in the back seat.

      1. It is clearly documented and explained on IRC, mailing lists, slides, etc. a trillion times a day that the default behavior is fire-and-forget (meaning no check by default).

          1. I think it’s better to check the features of a db first, before adopting it or looking for drivers in your preferred language.
            AFAIK, the fire-and-forget policy is a design choice made for performance reasons. Perhaps it’s not what you were looking for, but that’s how it is 😉

          2. You decided to adopt a new technology without reading anything beyond the tutorial. You think MongoDB is your typical DB, yet you forget the NoSQL part, which is a clear indication that it is not your typical DB; still you expect it to behave like one.

            Like others already stated, it is in the manual, the API docs, the forum, and countless blog posts. You didn’t RTFM and decided to blame it on MongoDB because you didn’t do your own research? I guess you must be young too.

        1. “It is clearly documented and explained on IRC, …”

          “Documented on IRC” is about as meaningful as “motorbike helmet made from ice cream”.

          However, I see that there’s a “note” link beside the 32-bit download links for OS X and Windows, which takes you to a footnote explaining the problem.

          It’s such a serious and dangerous limitation though, that it should be in bold red writing just above the links.
          By serious and dangerous, I mean that sensible, decent programmers can and have failed to notice the footnote (or not happened upon it in “random reading”) and ended up puzzled at the silent failures, which nobody expects from any database system.
          In fact the DB drivers should probably print loud warnings on stderr, just to be sure.

      2. I have never even used MongoDB and I knew that you can’t insert more than 2GB in a 32-bit process, just from random reading I’ve stumbled into. Read the documentation.

      3. MongoDB does that to save the time your script would otherwise spend waiting for a return status from MongoDB.
        This is one of the reasons I use Mongo. I push non-critical data like views (as in, to calculate view counts), logs, likes, etc. to Mongo.
        My script continues without waiting for a response from the server, so the page renders faster.

        If you are storing critical data that must not be lost, you should use getLastError to check that the data was inserted properly, or better, go for an RDBMS which maintains all the ACID properties.
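
        In the old Ruby driver that check looks roughly like this (a sketch against the pre-MongoClient API; coll and doc are the tutorial’s objects):

        coll.insert(doc)              # fire-and-forget: returns immediately
        err = coll.db.get_last_error  # ask the server how the last write went
        raise "insert failed: #{err['err']}" if err['err']  # 'err' is nil on success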

  2. I was excited when I first learned about MongoDB, then less excited once I started using it. I entered a command into the shell to update a bunch of records. I got no feedback. Did it work? I have no idea. Everyone said I should query it again to see if my data changed. I changed databases instead.

  3. It is definitely the wrong default. If something says it is a database and just says OK when it could not store a value, that is a critical mistake.

    Otherwise this would be a correct implementation of a really fast database:

    void store(String something) {
      if (shallIThrowAnError) {
        throw new CannotStoreException();
      }
      // no-op (will be implemented later)
    }

  4. In contrast to MongoDB, many of the other NoSQL databases have a strong focus on deployability and reliability. So don’t throw the baby out with the bath water.

    1. Because it fits his worldview well? Do you really need your question answered, or could you have arrived at such a simple answer with your own horsepower, but decided to post anyway?

    2. s/you don’t like/that makes no sense;

      There’s a vast difference between these two, much like HTTP POST requests being cached. Is it possible? Sure! Is it a good idea? Not in the slightest!

      1. @Vincent – yes, I just wanted to post and did not take the time to really think about why this post upset me.

        I have used MongoDB for the past 2 years in production code and have had no issues with Mongo itself. Yes, there are certain limitations, as there are with any database you choose.

        I’ve just seen a lot of posts bashing Mongo, and they seem amateurish. “I’ll try it again in ten years” seems a bit over the top. That title would lead you to believe the post is gonna go on to say how MongoDB is not ready for primetime.

        The post should have been titled “Watch out for MongoDB’s 2GB limit on 32-bit machines” and retold the lesson to read the docs, so hopefully someone else doesn’t repeat the mistake.

  5. If you need an alternative, I know of one, slightly used, document content store (redundant & distributed, built on MySQL and the FS) that might be available. I don’t think that Wink is using it any more.

    Throws an error, which you might choose to ignore, when the document is not, in fact, stored.

    1. Re: the app, would also like to peek at the code – I had problems getting wordcram to generate jpegs (was trying to use xvfb and some other nastiness) mostly because I’m dumb and gave up. So would like to see how you got that to work!

      1. I use xvfb too, like this:

        xvfb-run --server-num=3 --server-args="-screen 0, 800x600x24" java -cp [my stuff].

        I started with PNGs; I tried JPEGs once out of curiosity but couldn’t get the colors right. The code is really straightforward, basically the same as the snippets on this page:

        http://wordcram.org/page/2/

  6. Probably in ten years you will become more careful. At least you have learned that you can’t blindly trust your assumptions – you need to test and study best practices.
    Another lesson – a developer should know deployment details/practices/limitations in addition to coding.

  7. Interesting that you switched to Couch. I was hesitant to recommend it while reading the post because I feared you had been turned off JSON stores entirely; glad to hear that’s not the case.

    In general it feels like Couch actually takes storing data seriously. Append-only and whatnot. It’s slower and a little bulkier than Mongo, but it does the important things right (1.0 bugs notwithstanding.)

    I’d love a follow-up blog post on your experience with Couch.

    (x-posted from HN)

  8. MongoDB is the MySQL of NoSQL. Widely available, well-known, somewhat popular, and all the early versions have horrible deficiencies and insanities which render them unusable for anything actually serious. Maybe some year Mongo will be as quasi-usable as MySQL has finally managed to become… maybe.

    I have heard, however, that some jokers have implemented the Mongo protocol on top of Apache Cassandra – which is a *real* NoSQL database.

  9. Mongo does not fail silently – you did not catch the errors properly via getLastError(). Do you seriously think CouchDB does not have similar gotchas if you don’t read the docs and use it properly? (I know because I used CouchDB on two production projects and had serious problems on both.) The fact that you toss out PostgreSQL JSON support as something to switch to shows you don’t bother really reading any docs – Postgres JSON support at this point is nothing more than storing a blob (it will be enhanced over the next several years). You could use any RDBMS, save “JSON” to it as blobs/text documents, and get the same benefits that you would get using Postgres’ JSON datatype.

    And yes, this is an aggressive post (as you complain in another reply). Expect it if you publicly slam something when it’s obvious you started using it in production without even bothering to read rudimentary documentation about it. That’s a problem on so many levels.

    1. 1) Of course I know that CouchDB has gotchas. This is not a production project. Even with Mongo I lost nothing, as I was backing up my content into text files just in case. I haven’t been a programmer for 30 years for nothing.

      2) Did you read the “maybe even try” part next to Postgres JSON?

      3) I’m tired of repeating that I’ve read a fair amount of documentation, too bad people read selectively.

      4) I expect the Spanish Inquisition.

      5) The WHOLE POINT of this post is that the default of requiring people to do extra work in order to know about errors is a bad design decision for a database. Like I said above, it’s like a new car with the brake and gas pedals switched. You’d better have a GREAT reason to do that, and you should document it EVERYWHERE.

      1. I fully agree with the point you are making in this article.
        Having to call getLastError() where the calls should simply return or throw errors is a sign of poor software design.
        Maybe in ten years…

  10. MySQL truncates a varchar without an explicit error if you add one that is larger than the column size. The only way to find the error is to check the warnings.

  11. How the hell did this get to the top of Hacker News? If you know anything about MongoDB you’re aware of these circumstances and these are in no way hidden issues. It’s very clearly stated as a limitation of 32-bit systems.

    While I understand you’re new to MongoDB, the fact you’re writing a blog post about your experience and will try it again in 100 years is ridiculous, lol 🙂

    1. “very clearly…” riiiiight.

      “There’s no point in acting surprised about it. All the planning charts and demolition orders have been on display at your local planning department in Alpha Centauri for 50 of your Earth years, so you’ve had plenty of time to lodge any formal complaint and it’s far too late to start making a fuss about it now. … What do you mean you’ve never been to Alpha Centauri? Oh, for heaven’s sake, mankind, it’s only four light years away, you know. I’m sorry, but if you can’t be bothered to take an interest in local affairs, that’s your own lookout. Energize the demolition beams.”

    2. Well, I learned something – I would not expect that behaviour in a database, NoSQL or otherwise. Inserts are the whole point of a storage system. Retrievals I could live with failing silently, but updates/inserts – those are WRITE operations and are at a whole different level of critical. Once that data is gone/lost, it’s gone.

      Seems more than a little counterintuitive to me that an insert would fail silently in such a way. Smacks of a toy data store and a pile of issues waiting to happen – e.g. someone forgets to separately check the error in one location in the app = lost data.

      I think the comment about MySQL is appropriate – it took a long time to become what you’d consider a “proper database”, whereas Postgres/Oracle etc. all did that from the word go.

      Sorry, but there’s performance focus and then there’s “we no longer give a shit about the safety of your data”. Operating in the hacky, don’t-care-about-errors mode should be the exception (when you consciously choose to sacrifice knowing about errors) rather than the default.

  12. While I agree that the Mongo behavior is completely idiotic, using some piece of software in production without reading the manual is not really a good idea either. Mongo has a lot of drawbacks and bizarre cases, but the thing is, they are thoroughly documented.

    1. I think we’re all in agreement about that, in fact my project is just a toy (it runs on a free AWS Micro, for Jebus).

      Also, the documentation leaves a lot to be desired in some respects. Check out my follow-up post. Then think about this analogy: the first cars had no fuel gauge. Today nobody would make a car without one. That’s how technology matures.

      1. Right, but the thing is, the whole NoSQL thing is new, and there is a lot of room for experiments like unacknowledged writes. Even though common sense says that’s not really a good idea, our common sense is tainted by years of RDBMS knowledge and might not apply well here.

        I’m just all about giving these things some chance (more than once in ten years) 🙂

        BTW, tag cloud looks good.

    2. I agree: most of the MongoDB bullshit and its misfeatures are documented. The problem is, and remains, the MongoDB marketing. MongoDB is announced as a database that fits all your “webscale” needs. In fact it is a scruffy hand grenade that will explode in your hand – sooner or later.

  13. MongoDB is nice software, and I used it in several learning projects. It’s already a lot of change when you migrate from one SQL DB to another, and it’s a lot more when you cross over to NoSQL.

  14. I did not read all the comments, just too much bashing going on, but there are just two simple things going wrong here:

    – How can you not have read about the 2GB limitation? It’s pretty much the first thing every 10gen engineer mentions at every chance he gets. It’s all over the docs; the first place is right on the downloads page, with a red *note* next to all the 32-bit builds.
    Saying this issue was not documented in the Ruby tutorial is just plain wrong – the Ruby driver docs are about how to use the Ruby driver, not how to install the server, and the docs for that would have had the information. The problem here was jumping right to “apt-get install mongodb” (or whatever you used) without checking what it actually wants/needs from the system to work properly.

    – You started using a tool because it was *cool and new* (the only reason I can think of that makes you use something straight away without checking it out thoroughly first) without knowing what it actually does. It’s all over the internet that MongoDB is known for losing your data by default, and they also say so wherever they can in their documentation and slides. Just use the damn safe-mode features they have if you need them. By default MongoDB is built to be fast, not as reliable as your typical MySQL/Postgres when it comes to data consistency. This is not because it’s crap at what it does, but because of what it was meant to do from the beginning. It has come a long way and already has some really nice features for ensuring the consistency of your data, but that is still just a nice-to-have, where the core goal is to keep the system fast and sacrifice writes for that if it has to.

    As a little addition: how would you have had MongoDB fail your writes? Crash the database because it’s full, and thereby take the whole site down with it? Not any better. By logging to some log files? Better, but who checks those unless he already knows something is wrong?
    MongoDB gives you the ability to check whether a write was successful, which is a very powerful tool used the right way – but one has to use it to know about a failed write.

    I’ve been reading so many horror stories about MongoDB already, but I’ve still been using it in production since version 1.4 and have NEVER had any issues that were not completely my fault. And having used MySQL long before that, I had my issues there too. I still don’t understand how a disk can get full without someone noticing it in a production system these days. It’s like driving with your “gas empty” light on and complaining that your car manufacturer should have automatically sent you a gas refill.

    (I know that last part does not have anything to do with the issue here, but it is a nice addition to what others find wrong about MongoDB.)

    1. Dude, you’re not adding anything to the discussion. All of your questions / points have been addressed in the comments. What’s the point of writing several paragraphs if you haven’t read what was said before? Get your own blog!

      1. My bad for mistaking the comments section for a place to comment on your post 😉

        Yeah, it turned out longer than I thought at first; I got caught up in writing. And of course it had redundant parts, but I really did not feel like reading through hundreds of comments mostly bashing each other.
        I just read your follow-up post (after seeing on Twitter that there is one), which just confirms my point that you simply installed it using your system’s package manager. That there is no obvious warning when 32-bit MongoDB starts up is definitely a shortcoming, but it does not excuse missing research before using it as the heart of a production system.

        1. That was the research part. Do you think my puny toy app running on an AWS Micro is a production system? I guess it must be for some readers of this blog who are young aspiring programmers.

          There’s nothing wrong with starting your research by writing a toy application. Especially if you’re using someone else’s data that you can recover from the source of truth.

          If I had used MongoDB at IndexTank or LinkedIn, you’d have a point. At IndexTank we evaluated Cassandra for a month before deciding it wasn’t for us. It was a long story, and we had already ruled out other alternatives very quickly. In the end we had to write our own LogStorage.

          At Inktomi it took us three years (1999 to 2002) to switch from Solaris to Linux. The larger the company, the more thorough you need to be when analyzing technology.

          The fact that MongoDB doesn’t have a caveat in the right place simply wastes people’s time during the research phase. This is a sign of an immature product, and that is the point of my post.

          For example, this is what you see when you install ZeroMQ through Homebrew:

          ==> Caveats
          To install the zmq gem on 10.6 with the system Ruby on a 64-bit machine,
          you may need to do:

          ARCHFLAGS="-arch x86_64" gem install zmq -- --with-zmq-dir=/usr/local

          If you want to build the Java bindings from https://github.com/zeromq/jzmq
          you will need the Java Developer Package from http://connect.apple.com/

          ----

          This is the kind of knowledge in the right places that accumulates as a product matures. The subtlety of this point is lost on most of the people who got here through Hacker News.

  15. It’s a shame that most people who try MongoDB without success put a “#fail” label on the entire NoSQL movement. Please stop following the trends. It’s as if everybody tried Oracle as their RDBMS and then said: “it’s too heavy/costly/slow for my use case.”

    Look at OrientDB as the next NoSQL product to try. It’s a document-graph DBMS, but it supports transactions, has an extended SQL, relationships are super fast, and it now has a multi-master architecture. It’s licensed under Apache 2 and has an active open-source community.

  16. To add to my last comment: I think the *actual* flaw is the driver not defaulting to calling getLastError after each update. I don’t think MongoDB is to blame but the driver-implementors. It should be possible to turn *off* calling getLastError after each action but the *drivers* should default to *on*.
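
    With the old Ruby driver you could already make that the default yourself by passing :safe when connecting (a sketch; the option names are from the pre-MongoClient driver):

    conn = Mongo::Connection.new("localhost", 27017, :safe => true)  # every write now checks getLastError
    coll = conn.db("test").collection("docs")
    coll.insert(doc)                  # raises Mongo::OperationFailure if the write failed
    coll.insert(doc, :safe => false)  # opt out per operation where raw speed matters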

  17. From a blog post on mongodb.org on November 27th:

    Today we are releasing updated versions of most of the officially supported MongoDB drivers with new error checking and reporting defaults. See below for more information on these changes, and check your driver docs for specifics.

    Over the past several years, it’s become evident that MongoDB’s previous default behavior (where write messages did not wait for a return code from the server by default) wasn’t intuitive and has caused confusion for MongoDB users. We want to rectify that with minimal disruption to the MongoDB apps already in production.

    http://blog.mongodb.org/post/36666163412/introducing-mongoclient
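
    In Ruby the change shows up as the new MongoClient class, which acknowledges writes by default (a sketch; assumes the 1.8+ driver and the doc hash from the tutorial):

    client = Mongo::MongoClient.new("localhost", 27017)  # default write concern is now w:1 (acknowledged)
    coll = client.db("test").collection("docs")
    coll.insert(doc)           # raises Mongo::OperationFailure on a failed write
    coll.insert(doc, :w => 0)  # the old fire-and-forget behavior is now opt-in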

  18. I think if you adopt a completely new technology like NoSQL vs. RDBMS, you have to familiarize yourself first with what it’s designed for.

    When you move from Java to Scala you probably become aware of the fact that there are no checked exceptions any more, so it’s up to you to check if any were thrown; Scala will not force you like Java does. The same goes for Mongo. It’s designed with completely different goals in mind, and the main reason the write concern is fire-and-forget style is its distributed nature (replication).

    They clearly say that they had to drop some of the SQL features like transactions (not atomicity) to achieve their product goals, so you have to be aware of such design decisions in the product.

    Conclusion – nobody to blame but yourself for not spending enough time learning about basic features before using the product.

  19. NoSQL systems have certain limitations (e.g. CAP-related design trade-offs), but silently losing data is not expected in any of them.

    Many tools have gotchas, like silently dropping your data (I’m talking to you, Scribe), or being relatively painful to maintain (you, HBase), but there are alternatives which are easy to use because they were created by systems engineers who had to live with their tools. I won’t name the ones I mean, but they’re out there.

  20. :cough: So it’s been 10 years.

    > This is a horrendous design flaw for a piece of software that calls itself a database.

    Ignoring that 64-bit is hardly new or controversial, even at that time, or that it even represents any form of onus on IT…

    In relation to errors passing silently, there’s a giant warning displayed if you attempt to utilize the 32-bit version, and it’s been there for a very long time, plus red flags of warning on the download page and documentation relating to it. Similar to the warning emitted if it detects you have THP (Transparent Huge Pages) enabled in your kernel, which violates assumptions over memory pages being 4KiB and the performance guarantees of allocation operations.

    > If you tell a database to store something, and it doesn’t complain, you should safely assume that it was stored.

    Or you should check your assumptions before relying on them! By default, writes only confirm the primary heard your request, not that the request was completed or even written to disk-backed journal. If you want to use things safely, invest in learning the properties of their operation!
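
    With the current Ruby driver, opting into stronger guarantees looks something like this (a sketch; assumes the mongo 2.x gem, where the write concern hangs off the client):

    client = Mongo::Client.new(["localhost:27017"], database: "test",
                               write_concern: { w: :majority, j: true })
    client[:docs].insert_one(doc)  # raises Mongo::Error::OperationFailure if the write is not acknowledged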

    I’m very sorry but developers can’t abandon all responsibility for their technical choices. You make your bed, you have to lie in it. Or get fired for failing to RTFM and leaking your company’s data (failing to secure) or encouraging data loss (failing to set a write concern).

    From the comments…

    > Many tools have gotchas… [calls out specific tools]… but there are alternatives which are easy to use… [refuses to name them]

    Thanks for coming out. Back when this article was originally published, I was pushing 3.8 million inserts/second on one RPC collection, and storing hundreds of millions of records totalling ~30TB of personal data in a tag-based non-hierarchical GridFS collection. Today that dataset is up to 56TB. On the professional front, my production platform applications are processing 15K requests per day or so, storing a hair shy of 15M records.

    Sure can use alternatives that are easier to use… I guess?
