A CouchDB User Story: chatting with Assaf

In our interview with Assaf, he talked to us about how he uses CouchDB for an internal project on his organization’s intranet. Assaf’s challenge was unique in that his project could not take advantage of clustering: it had to run entirely on one machine.

Assaf’s machine holds nearly 6TB of data, with around 2 billion documents across ~20 databases, serving roughly 100k reads/day and 20-50GB of writes/day. This led him to “debug the hell out of it”, resulting in this document: Linux tuning for better CouchDB performance.

Assaf went on to tell us more about why he chose CouchDB and how it has best supported his project’s needs.

How did you hear about CouchDB, and why did you choose to use it?

I initially encountered CouchDB on Google.

I had inherited a project that was using Apache SOLR as its main database, but back then (April 2016) it had about 100GB of data, so all was well. The only person with write access to the database was me, so all we needed from SOLR was for it to be very quick at reading, and it was.

But then, I got 1.2TB of zipped, highly nested, schemaless JSONs to index. SOLR has a neat feature, “Schemaless Mode”, which basically just creates an index (= schema entry) for each new field it discovers.

I had to use this mode because every field whose value was a SHA-1 string had to be fast to query, and the field names were randomly generated (weird, I know).

Because the field names were random, SOLR would create new schema entries all the time, which led it to be extremely slow and unstable.

SOLR would also flatten the input JSONs (e.g. {"a":{"b":1}} => {"a.b":1}), which was very annoying for us. After a couple of weeks and not a lot of GBs indexed, we experienced a big power outage. SOLR took 5 days to recover from this incident (checksum on init? data recovery?), so our systems weren’t operational for that time span. This was UNACCEPTABLE!

I started googling for a schemaless DB that could support deeply nested JSONs. I ruled out MongoDB because of a bad past experience: very slow queries on a 10GB collection with indexes. I also ruled out Elasticsearch because of Lucene; I figured Lucene’s many files and file edits were what caused the long recovery time after the power outage.

I specifically googled “schemaless db” and “mongodb vs”, and that is how I came across CouchDB.

I started reading the documentation and got hooked by “just relax”, “there is no turn-off switch, just kill the process”, and the ability to build indexes programmatically, which meant I could recurse into the objects and emit every value that matches the SHA-1 regex.
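A rough sketch of such a view, with the recursion strategy and regex assumed for illustration rather than taken from Assaf’s actual code, might look like this:

    // Hypothetical CouchDB map function: walk each document recursively
    // and emit every string value that looks like a SHA-1 hash.
    function (doc) {
      var SHA1_RE = /^[0-9a-f]{40}$/i;

      function walk(obj) {
        for (var key in obj) {
          var value = obj[key];
          if (typeof value === 'string' && SHA1_RE.test(value)) {
            emit(value, null); // index the hash itself as the view key
          } else if (value && typeof value === 'object') {
            walk(value); // recurse into nested objects and arrays
          }
        }
      }

      walk(doc);
    }

Querying the view by key then finds every document containing a given hash, no matter how deeply or under which randomly generated field name it is nested.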

What would you say is the top benefit of using CouchDB?

Durability. Since the SOLR saga, we’ve experienced a few more power outages, hard disk failures, and filesystem corruptions (at least two of each; yeah, our infrastructure could be better).

Amongst all the panic and horror, I was smiling.

After a power outage, CouchDB has zero recovery time. If a hard disk died or the filesystem got corrupted, CouchDB would simply reacquire the lost data by synchronizing from a replica or replicating from a backup.
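For reference, that kind of recovery replication is a single call to CouchDB’s _replicate endpoint. The sketch below uses placeholder host and database names and omits authentication:

    // Sketch: ask a freshly provisioned node to pull a database back
    // from a surviving replica (host and database names are made up).
    const http = require('http');

    const body = JSON.stringify({
      source: 'http://replica.example.com:5984/mydb', // add credentials if needed
      target: 'mydb',
      create_target: true // recreate the database if the disk was lost
    });

    const req = http.request({
      host: 'localhost',
      port: 5984,
      path: '/_replicate',
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }
    }, (res) => {
      res.setEncoding('utf8');
      res.on('data', (chunk) => process.stdout.write(chunk));
    });

    req.write(body);
    req.end();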

What other tools are you using in your infrastructure? Have you discovered anything that pairs well with CouchDB?

  • couchimport
  • jq
  • curl

What are your future plans with your project? Any cool plans or developments you want to promote?

Yes, I have found a neat trick to import an archive full of JSON files.

I also plan to add a section about client HTTP keep-alives to my document on tuning Linux for better CouchDB performance. I’ve found that using HTTP keep-alives to access CouchDB can drastically improve its performance, because it no longer needs to build and tear down a TCP connection for every interaction with a client. For example, when using Node.js’ request or request-promise package, we turn on "require('http').globalAgent.keepAlive = true" and pass "forever: true" with each request.
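Put together, a minimal version of that setup might look like the following sketch (the database URL and document ID are placeholders):

    // Sketch: reuse TCP connections to CouchDB via HTTP keep-alive.
    require('http').globalAgent.keepAlive = true;
    const rp = require('request-promise');

    function getDoc(id) {
      // forever: true makes request use a keep-alive agent for this call
      return rp({
        uri: 'http://localhost:5984/mydb/' + encodeURIComponent(id),
        json: true,
        forever: true
      });
    }

    getDoc('some-doc-id')
      .then((doc) => console.log(doc._id, doc._rev))
      .catch((err) => console.error(err.statusCode, err.message));

With keep-alive enabled, repeated requests to the same CouchDB node skip the TCP handshake entirely.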

 

Use cases are a great avenue for sharing useful technical information, so let us know how you use CouchDB! Additionally, if there’s something you’d like to see covered on the CouchDB blog, we would love to accommodate. Email us!

For more about CouchDB, visit couchdb.org or follow us on Twitter at @couchdb.

Replication makes CouchDB the single best solution for Hoptree

We are really enjoying all the great use cases we are encountering through the interviews we’ve been doing over the past few weeks (hint, hint). Patrick Wolf and his team at Hoptree were no exception. They even introduced us to this cute video prior to explaining how they’ve leveraged CouchDB for their SaaS application.

Hoptree gives companies a way to increase efficiency and customer interaction by sharing the responsibility of customer texting across an entire team.

How did you hear about CouchDB, and why did you choose to use it?

We had researched Cloudant at the time of the IBM acquisition and learned more about CouchDB. Since at that time we were primarily focused on mobile development, Cloudant and CouchDB were interesting to us because they enabled offline mobile applications. Prior to that, we had tried out several other offline sync solutions which never worked well.

After using CouchDB, we liked it not just because of its replication capabilities but because it’s a great NoSQL database. The fact that it enabled offline replication was a bonus. When it came time to pick a database for Hoptree, CouchDB seemed like the best fit.

Did you have a specific problem that CouchDB solved?

Because Hoptree is a multi-tenant application, and given some of my past experience building multi-tenant applications, it was very important that we keep customer data as segregated as possible. That meant creating a database per customer. While that’s certainly possible with other database management systems (DBMS), we found that their connectors weren’t as well suited to querying many different databases at a time; they typically create pools of persistent connections per database. Because CouchDB uses HTTP, things are greatly simplified: there’s no pooling and no persistent connections. This has also worked well for us as we’ve transitioned to serverless computing, because it allows database access with very little overhead.
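As a rough sketch of what that looks like in practice (the database naming scheme and host are assumptions, not Hoptree’s actual code), each request simply targets the tenant’s own database over plain HTTP, with no connection pool to manage:

    // Sketch: database-per-tenant access over plain HTTP.
    const http = require('http');

    function tenantDbUrl(tenantId) {
      // one CouchDB database per customer, e.g. customer_acme
      return 'http://localhost:5984/customer_' + encodeURIComponent(tenantId);
    }

    function getTenantDoc(tenantId, docId, callback) {
      // no pool and no persistent connection to manage: just an HTTP GET
      http.get(tenantDbUrl(tenantId) + '/' + encodeURIComponent(docId), (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => callback(null, JSON.parse(body))); // assumes a 200 for brevity
      }).on('error', callback);
    }

    getTenantDoc('acme', 'settings', (err, doc) => {
      if (err) return console.error(err);
      console.log(doc);
    });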

For the folks who are unsure of how they could use CouchDB (because there are a lot of databases out there), could you explain the use case?

The replication in CouchDB is really the killer feature that sets it apart from other databases. There are a lot of use cases for using tools like PouchDB to enable offline support in mobile applications. However, we also found it useful server-side.

For example, we use PouchDB to replicate configuration data onto each of our servers. We avoid making additional requests every time we need to read a configuration value, and we don’t have to think about how to cache that data. There’s always an up-to-date version of the configuration available locally.
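A minimal sketch of that pattern, assuming a CouchDB database named config on a placeholder host, might look like this:

    // Sketch: keep a local, always-current copy of configuration data
    // by continuously replicating it from CouchDB with PouchDB.
    const PouchDB = require('pouchdb');

    const localConfig = new PouchDB('config'); // on-disk copy on this server
    const remoteConfig = 'http://couch.example.com:5984/config'; // placeholder URL

    PouchDB.replicate(remoteConfig, localConfig, { live: true, retry: true })
      .on('change', (info) => console.log('config updated:', info.docs_written, 'docs'))
      .on('error', (err) => console.error('replication error', err));

    // Reads now hit the local copy, not the network.
    function getSetting(id) {
      return localConfig.get(id);
    }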

What would you say is the top benefit of using CouchDB?

  • Replication – as mentioned before, this is the feature that sets CouchDB apart.
  • Optimistic Concurrency – I’ll admit that when I first started using CouchDB, dealing with revision IDs seemed like an annoyance. But now, every time I see a Document Update Conflict error, I realize that CouchDB just prevented someone’s data from unknowingly being clobbered. It allows us to confront a problem that we might not otherwise think about until it’s too late (see the sketch after this list).
  • Web technologies – CouchDB fits easily into just about any environment because of its use of common web technologies like HTTP, JavaScript, and JSON.
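To make the optimistic-concurrency point concrete, a common way to handle a Document Update Conflict is to re-read the latest revision and retry the write. The helper below is a hypothetical illustration (the nano client, URL, and database name are assumptions), not Hoptree’s code:

    // Hypothetical retry-on-conflict helper using the nano CouchDB client.
    const nano = require('nano')('http://localhost:5984'); // placeholder URL
    const db = nano.db.use('mydb'); // placeholder database name

    async function updateWithRetry(id, mutate, attempts = 3) {
      for (let i = 0; i < attempts; i++) {
        const doc = await db.get(id); // fetch the current _rev
        mutate(doc);                  // apply the change in memory
        try {
          return await db.insert(doc); // write back with that _rev
        } catch (err) {
          // 409 means someone else updated the doc first: re-read and retry
          if (err.statusCode !== 409) throw err;
        }
      }
      throw new Error('document update conflict persisted after ' + attempts + ' attempts');
    }

    // Example: bump a counter without clobbering a concurrent write.
    // updateWithRetry('some-doc-id', (doc) => { doc.count = (doc.count || 0) + 1; });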

What other tools are you using in your infrastructure? Have you discovered anything that pairs well with CouchDB?

We’re running a Node.js stack. All our REST APIs are backed by Swagger. Because CouchDB stores pure JSON documents, it’s easy to use the JSON Schema models within the Swagger definition to validate the documents we store in CouchDB.
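As an illustration of that idea (the schema, the document shape, and the choice of the ajv validator are assumptions), a document can be checked against a JSON Schema model before it is written to CouchDB:

    // Sketch: validate a document against a JSON Schema model before saving it.
    const Ajv = require('ajv');
    const ajv = new Ajv();

    // In practice this schema would come from the Swagger/OpenAPI definition.
    const messageSchema = {
      type: 'object',
      required: ['type', 'to', 'body'],
      properties: {
        type: { type: 'string' },
        to:   { type: 'string' },
        body: { type: 'string' }
      }
    };

    const validateMessage = ajv.compile(messageSchema);

    function assertValid(doc) {
      if (!validateMessage(doc)) {
        throw new Error('Invalid document: ' + ajv.errorsText(validateMessage.errors));
      }
      return doc; // safe to PUT/POST to CouchDB
    }

    assertValid({ type: 'message', to: '+15555550100', body: 'Hello!' });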

We also make use of AWS Lambda for some services, which works well with CouchDB because of the low overhead in making HTTP calls from a Lambda function.

What are your future plans with your project? Any cool plans or developments you want to promote?

Our two-way messaging service has been live for a few months, and we’re still busy adding new features. When I do get some downtime, I would like to start converting our codebase to TypeScript. There should be some interesting ways to integrate it with CouchDB, but perhaps the open source community will beat me to it.

 

Use cases are a great avenue for sharing technical content and information with the rest of the community. Please consider joining the fun! Additionally, if there’s something you’d like to see covered on the CouchDB blog, we would love to accommodate. Email us!

For more about CouchDB, visit couchdb.org or follow us on Twitter at @couchdb.