Ask HN: Do you test in production?
60 points by bradwood on Jan 14, 2023 | 73 comments
There are a lot of blog posts talking about the fact that testing in prod should not be a taboo like it may have been in the 90s. I've read some of these [1] [2], I get the arguments in favour of it, and I want to try some experiments.

My question is -- how does one go about doing it _safely_? In particular, I'm thinking about data. Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?

Ultimately, I think this will end up looking like implementing SLIs and SLOs in PROD, but for some of my SLOs, I think I need to actually _fake_ the data in order to get the SLIs I need, so how to do this?

Suggestions appreciated -- thanks.

[1] https://increment.com/testing/i-test-in-production/

[2] https://segment.com/blog/we-test-in-production-you-should-too/




Lots of ways to test in production. IMO the way you are suggesting – injecting synthetic data into prod – is the worst of both worlds. You aren't actually testing real world use cases, and end up polluting your prod environment.

Some common ways to go about it:

- Feature flags: every new change goes into your codebase behind a flag. You can flip the flag for a limited set of users and do a broader rollout when ready.

- Staged rollouts: have staging/canary etc. environments and roll out new deployments to them first. Observe metrics and alerts to check if something is wrong.

- Beta releases: have a group of internal/external power users test your features before they go out to the world.


And from real-world experience:

- Feature flags should be easy to create or even automatic. Otherwise your team will bypass or forget to create them, or start reusing the same ones until the flag doesn't make sense anymore

- Don't let devs name the flags, or at least agree on and enforce a convention. Names should also make sense without context (NOT "test-fix-error") -- see the sketch after this list

- Remove flags over time or they will lose all meaning and context

- Categorize your flags to automatically grant them to certain roles/groups
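
A minimal sketch of the flag-check idea, in TypeScript, assuming a home-grown flag store -- the store, the deterministic bucketing, and the example flag name are all illustrative, not any particular vendor's API:

  import { createHash } from "crypto";

  type Flag = { name: string; rolloutPercent: number; allowlist: Set<string> };

  // Illustrative in-memory store; in practice this would be a flag service or config.
  const flags = new Map<string, Flag>([
    ["checkout-new-tax-calc",
     { name: "checkout-new-tax-calc", rolloutPercent: 5, allowlist: new Set(["internal-qa"]) }],
  ]);

  // Deterministically bucket a user into 0..99 so the same user always gets
  // the same answer for the same flag during a staged rollout.
  function bucket(flagName: string, userId: string): number {
    const digest = createHash("sha256").update(`${flagName}:${userId}`).digest();
    return digest.readUInt32BE(0) % 100;
  }

  export function isEnabled(flagName: string, userId: string): boolean {
    const flag = flags.get(flagName);
    if (!flag) return false;                      // unknown or removed flags default to off
    if (flag.allowlist.has(userId)) return true;  // internal/beta users see it first
    return bucket(flagName, userId) < flag.rolloutPercent;
  }

A name like "checkout-new-tax-calc" also stays meaningful without context, per the naming point above.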


You aren't actually testing real world use cases

You can mock them from actual data. The last system I worked with and tested in production processed 0.1 to 1 live requests per minute. Hot-testing it was torture. There was a test site which could produce requests, but using it was torture as well; it also worked only on even days of the week and could only be set up on request (as in IM or phone call, not HTTP) for administrative reasons.

Before anyone asks, no, it couldn’t be tested off-production. I could spin up a “staging” server with all proper infra, but (1) after all the modifications it would just turn into production, (2) incoming data was live anyway, and was an obligation to the company (i.e. all my fails closer to a response went to support and ate limits). Peer services had no such thing as test mode.


IMO your first two suggestions aren't actually a form of testing.

If you've rolled out a feature to real paying customers (not beta testers), and find that it's broken, you don't get a pass because you were only "testing".


Everyone tests in production. Some people also test before production!

Some people try to NOT test in production, but everyone does test in prod in a very real sense because dependencies and environments are different in prod.

I think the question was "Do you INTENTIONALLY test in production"


Yeah. What some people mean by “we don’t test in prod” is that we have very high confidence in all code before it goes to prod. And good for them! But what most people mean is “code goes from zero to affecting all users and developers have little or no access to data about how that’s going.” Bad. No matter how many unit tests you have.


I work for a B2E company that has a structure similar to Salesforce. We test in production all the time even for our secure environments where the data is highly sensitive.

Re: data, it’s a somewhat common practice to notionalize data (think isomorphically faking data). We regularly do this and will often designate rows as notional to hide them from users who aren’t admins. I’ve found this to work exceptionally well; we do this 1-2 times a week, ensure there’s a closed circuit for notional data, and for more critical systems we’ll inform our customers that testing will occur.

I’m sure there are more complex and automated solutions but when it comes to testing, simple and flexible is often the way to go.
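
A minimal sketch of that row-flagging idea in TypeScript, assuming a single isNotional column and an admin check (both names are illustrative, not the commenter's actual schema):

  type Row = { id: string; payload: unknown; isNotional: boolean };

  // Non-admin users never see notional (fake) rows; admins and test tooling do.
  function visibleRows(rows: Row[], viewerIsAdmin: boolean): Row[] {
    return viewerIsAdmin ? rows : rows.filter((r) => !r.isNotional);
  }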


>notionalize data (think isomorphically faking data)

Are these just $5 words for setting a fake data flag on the records?


We do that too but notionalizing for us is usually creating data that looks and behaves realistically but is actually fake. (A side benefit to this is that we can then use it for demos!)


So you mock data and then flag it as fake.


Essentially yes! We usually try to follow some sort of theoretical user story/paint some sort of narrative but at the end of the day it’s just adjusting the mocking.

Just now realizing notionalizing isn’t a widely accepted term for this


Thanks. This sounds interesting.

Can you give a bit more colour on "notionalizing" and "isomorphically faking" please?


Essentially creating fake data that looks very realistic and creates narratives that would span real use cases. Some of this is simple (fake names with faker), some of it is a bit more manually guided (customer-specific terminology and specific business logic).

The goal here is for the data to both be useful for testing and provide coverage not just at a software level, but at a user story level. This helps test things like cross-application interactions; it's also doubly helpful since we can use it for demos without screwing up production data.


Anytime you need to talk to a third party API, you need to test in prod.

Some people have sandbox APIs. They are generally broken and not worth it. See eBay for a super in-depth sandbox API that never works.

You can read the docs 100 times over. At the end of the day, the API is going to work like it works. So you kind of “have to” test in prod for these guys.


Ditto regarding Paypal: you need a sandbox API token to get started with it. Their sandbox token generator was broken for MONTHS, I could not believe it. By the time we got the token, we already fixed all bugs on our side the hard way - by testing in prod - and moved on.


cough CME cough


Hah did we even pretend to have a sandbox at CME? Outside like Saturday testing parties.


There was a sandbox but it never behaved like the prod environments. To the point I had to have a flag to tell my order entry code if we were being certified in the sandbox or actually trading in prod.

I’ve heard that got better after they replaced the gateways with fpga but don’t know for sure.


Ah yes, it's always great when you need special code to pass the certification test, which you then disable for real prod.


A/B tests and feature flags are basically testing in prod. And yes, some of those features sometimes run as a "well, it should work, but we're not entirely sure until we get a significant number of users using the system". It could be an edge case failing or scalability requirements being wrong.

Another variation on the same theme is rewriting systems, where you run production data through both the old and new systems. Quite often that's the only way of doing migrations to a new platform, or a new database, or yes, a newly re-written system.

> Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?

A very common practice is to run a snapshot of prod data (e.g. last hour, or last 24 hours, or even a week/month/year) through a system in staging (or cooking, or pre-cooking, or whatever name you give the system that's just about to be released). However, doing it properly may not be easy, and depends on the systems involved.
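
As a hedged sketch of the "run production data through both systems" variant, in TypeScript: serve users from the old system, shadow-read from the new one, and record mismatches. oldStore, newStore, and reportMismatch are placeholders for your own code, not real APIs:

  // Serve from the old system (source of truth), compare against the new one in the background.
  async function getOrder(
    id: string,
    oldStore: { get: (id: string) => Promise<unknown> },
    newStore: { get: (id: string) => Promise<unknown> },
    reportMismatch: (id: string) => void
  ): Promise<unknown> {
    const result = await oldStore.get(id);
    newStore
      .get(id)
      .then((candidate) => {
        if (JSON.stringify(candidate) !== JSON.stringify(result)) reportMismatch(id);
      })
      .catch(() => reportMismatch(id)); // errors in the new path are also signal
    return result;                      // users only ever see the old system's answer
  }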


snapshotting for us is a PITA -- there are lots of distributed elements to the system, data pipelines, async back-end processing, batch jobs, etc. Add to that, we have data masking requirements that mean we need to (for compliance reasons) obfuscate (datamask) sensitive data.

This doesn't mean it's not possible to do, but it's a pretty big lift to get right, and keep right.


We run canaries in prod; they aren't as extensive as the integration tests that run in our test stages, but they still test happy paths for most of our APIs.


I think it depends on how your application works. If you have the concept of customers, then you can have a test customer in production with test data that doesn't affect real customers for example. You can reset the test customer data each time you want to test.


Good testing is an exercise in pushing I/O to the fringes, as that's what has stateful side-effects. (Some might even argue that anything that tests I/O is an integration test. The term "integration test" is not well defined and not worth getting hung up over IME.)

Once you're into testing I/O, which is ultimately unavoidable no matter how hard you try not to, you either need cooperative third parties who can give you truly representative test systems (rare) or a certain amount of test-in-prod.

Testing database stuff remains hard. You either wrap things in some kind of layer you can mock out, or dupe prod or some subset of it into a staging environment with a daily snapshot or similar and hope any differences (scale, normally) aren't too bad.

Copy-on-write systems or those with time-travel and/or immutability help immensely with test-in-prod, especially if you can effectively branch your data. If it's your own systems you are testing against, things like lakefs.io look pretty useful in this regard.

And yes, feature flags, good metrics, and load balancers that let you send a small percentage of traffic to a new version (if your traffic/system allows such things) all help.


My org has done a bunch of what's already covered here. We have a bunch of customers (SaaS), and though we have a good idea of what's going well in aggregate through observability, it's hard to gauge whether any single org is getting the exactly repeatable results they should expect vs. a statistically acceptable volume for everyone. Because of this, we also set up synthetic accounts for test customers and regularly drive test scenarios through them to make sure a single customer doing the same old boring workloads is also doing alright. It tends to catch large issues caused by changes that affect outputs without changing volumes/latency. It's like end-to-end testing of very common hot paths, running forever in a real customer account that's flagged so it isn't billed. It catches regressions way more often than it rightly should.


In electronic trading, most new systems are tested in production by running with smaller capital allocation first. It is hard to flatten out all bugs unless you are on the real market with real money and real effects (of course, simulations testing and unit testing are heavily employed too).



Hi! I wrote the referenced Segment post! Happy to answer any questions.

The way we did it safely is just as you say: creating fabricated users/organizations/configurations with data generators injecting into the system.

Faking data to look realistic is always challenging, but we used this cool library written by an early segment engineer: https://github.com/yields/phony

Not perfect but works well enough! And it's super simple. :-)


Thanks -- great post. I think I'm coming round to the idea of faking some data, but need to think through how to do this well. We use the faker lib in JS, but phony looks pretty tight also - thanks for the suggestion.
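
For what it's worth, a small sketch of fabricated-but-realistic records with @faker-js/faker (API names vary slightly between faker versions; the test- prefix and isSynthetic flag are just one possible convention, not a standard):

  import { faker } from "@faker-js/faker";

  function fakeUser() {
    return {
      id: `test-${faker.string.uuid()}`, // prefix makes synthetic records easy to find and delete
      name: faker.person.fullName(),
      email: faker.internet.email(),
      company: faker.company.name(),
      isSynthetic: true,                 // exclude from billing, reporting, and real-user views
    };
  }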


Sometimes one cannot get the exact same specs on test hardware versus production, yet a rollout depends on simulating system load to shake out issues.

Performance testing needs a schedule, visibility, timebox, known scope, backout plan, data revert plan, pre- and post-graphs.

- Schedule: folks are clearly tagged in a table with times down the side.

- Visibility: folks who should know, know when it's going to happen, are invited to the session, and are mentioned in the distributed schedule.

- Timebox: it's going to start at a defined time and end at a defined time.

- Known scope: is it going to fulfill an order? How many accounts created?

- Backout plan: DBA and DevOps on standby for stopping the test.

- Data revert plan: we know what rows to delete or update after testing.

- Pretty pictures: you want to show graphs during the test, so that you know what to improve and everyone's time wasn't wasted.

Reference: observing successful runs that didn't result in problems later.


With certain kinds of reporting/BI tools, I've generally found it's not that risky to test in production, provided certain conditions apply, and it comes with a number of advantages where the QA environments don't truly mimic what happens in production (or the time for updates in QA is way too slow, so you don't see varied output cases appearing fast enough to give a good test).

A common dev concern (usually raised by people who have no idea how users actually use stuff!) is that someone might pick up the report and then do something awful based on it, which would be awful^2. I then explain that users can't find/use the updated reports till we tell them where they are and grant access permissions etc., so it's going to be fine and there's no need to panic, which calms them down till they forget by the time this comes up again!

On a side note, these people seem to get much more wound up about principle-based worries (it would be bad to test in Prod being a prime example) compared to concerns based on their own weaknesses (i.e. they rush, forget a whole section of requirements, make mistakes, can't spot obvious bugs), which they seem to imagine are way less likely to cause problems than experience demonstrates.


these people seem to get much more wound up about principle-based worries (it would be bad to test in Prod being a prime example) compared to concerns based on their own weaknesses

That’s because they aren’t aware of their failures before they happen, but are able to predict bad situations which could happen. It’s important to communicate usage patterns and risks to them regularly, otherwise they will be anxious about making that “bold move” and reinforce their anxiety with more developer memes.


As others have said, injecting fabricated data into prod won't give you any value. The only reason to test in prod these days is to try your new feature on data with the breadth and level of detail that prod data has and that you can never fabricate. (Hardware differences between prod and other envs really should not be a problem these days.)

In almost every case you cannot test new functionality on actual prod data, at least not anything that's not strictly "read only" functionality. If you have a new feature that sends automated mail to someone foreclosing on their property, you just do not test that on a real live system.

What you can do is set up a staging environment that is as close to prod config as possible, and then copy the prod database to the staging env. Do your tests in staging. It doesn't matter if data in stg is messed up. There may well be legal, company policy, or security restrictions preventing you from doing this, but it's the only way to test on real-life data without the risk of f**ing up data in the live system.

Then there are integration tests - to other systems, that is - which are a much harder problem.


What I've done in the past is to write a test that runs every five minutes in production, accessing the APIs like a user for the most common app flows. It provided a great way to be sure the app was genuinely working.

That did require having multi-tenancy support, and there was a need to suppress some security features by whitelisting the IP of the test app.
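
A rough sketch of that kind of probe in TypeScript (Node 18+ for global fetch); the endpoints, the test-tenant token variable, and the interval are assumptions for illustration:

  const BASE = "https://api.example.com";
  const headers = { Authorization: `Bearer ${process.env.TEST_TENANT_TOKEN}` };

  async function probe(): Promise<void> {
    const started = Date.now();
    try {
      const login = await fetch(`${BASE}/v1/session`, { method: "POST", headers });
      if (!login.ok) throw new Error(`login failed: ${login.status}`);
      const list = await fetch(`${BASE}/v1/orders?limit=1`, { headers });
      if (!list.ok) throw new Error(`list failed: ${list.status}`);
      console.log(`probe ok in ${Date.now() - started}ms`);
    } catch (err) {
      console.error("probe failed:", err); // in a real setup this would page/alert, not just log
    }
  }

  setInterval(probe, 5 * 60 * 1000); // most common app flow, every five minutes
  probe();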


If you are lucky you have internal users who use the product in production. Then if you are lucky you have a group of external power users who appreciate getting features first and understand that there is a risk of bugs.

Most software probably falls in this category of “we could test more but at some point our users would rather get the product with bugs than wait”.

Whether it’s staged rollouts, feature flags, it’s the same thing. It’s mitigating risk when testing in prod. It’s the best bet.

Some software obviously falls into the category that can’t have serious bugs for any users. Then you just have to keep the software so simple that you can be confident it works.


Testing in production will trend upwards among companies because everybody's workload is shifting towards the cloud and/or making use of external SaaS services. There are a number of cloud services that are not open source that can't be run locally, or run at the same scale as production.

It is not a good use of time to mock everything, because you have no control of external systems. The only reason I'd see it being important is if these external systems are tightly coupled to complex local logic that should be tested locally. However, there are a number of strategies to deal with such "tight coupling" in such cases.


One approach I recall taking as the in-house developer for a company's warehouse management system was to designate the warehouse I was in to be the "experimental" warehouse: new features (or new systems entirely!) would be developed/configured in coordination with that warehouse's team, and they were generally comfortable with the idea that they would get those new features first (with the risks that entails). Once my local site had used the new feature/system for a few weeks without major incident, it would then get rolled out to the other sites.


In most of my projects I made sure that a recent (1) copy of the whole production DB was always available - this is mostly used to be able to replicate erroneous behaviour in a controlled environment.

But it is also useful to get very close to "test in prod" without actually risking anything.

Actually executing data-changing code for testing is actively discouraged, though.

1) the current system takes a snapshot of the production db at the end of the day and uses it to repopulate this "staging" environment from scratch. In past cases I had to accept less frequent updates, though.


Currently not, but there was a project where we had to "develop" in production. I was coding an IoT adapter for a building automation system. We did have development machines, but when we first tested our code in the real env we noticed they were a slightly different version. So there was no other way than to ssh into our machines and use Vim to make the code work, then replicate the changes on our own computers. Fun times, but I don't really miss the stress of messing something up in a real building.


Note: if you plan on accurate financial planning and metrics (esp. if going public), you need to be able to separate your test prod stats from the real prod stats for reporting.


I tried this for a while, marking such tests as production compatible. They relied on test records made for the purpose, sometimes copied to other environments to make the tests.

For 3rd parties with test modes like Stripe you can get E2E coverage, or where the cost of the test is low.

Some safety controls to avoid running non-prod-safe tests are wise.

Another alternative is using anonymized prod copies outside prod. Possibly even mocking 3rd parties to behave like prod, happy, sad, etc.


One box testing works well for some scenarios. It's not completely safe but the risks are low. If there is an issue then it only impacts a very small number of customers. If they retry they'll likely hit one of the thousands of stable instances. Comparing metrics between the one box and normal instances is helpful and can be tied in to CI/CD for automatic rollbacks if necessary.


At my workplace, which deals with millions of customers, what we have is essentially a segmented clone of prod. It's a 1:1 copy of prod with real data flowing through.

We use feature flags to enable-disable features. This way when our devs ship code to prod - it first lands in this segmented clone.

Then incrementally changes are propagated from this segmented area into actual prod-prod.


Is this different from a staging environment?


I put my new feature behind a beta flag or experiment flag. If the flag is off for a user, they don't see it.

Then I turn it on just for the user I test with in prod. Then I test in prod.

When it's time to enable the feature for the rest of the users, the same system lets me slowly dial up which users can see the feature. This separates deployment from launch, which is also a great best practice.


The taboo is when you only test in production. At the very least you should manually try out your app after deploying a change. As for automated integration tests in production, it is as simple as identifying which tests are prod safe and marking them. That really depends on the app, but in a web app it generally means all the GET requests, plus some of the others.
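
A hedged sketch of that marking idea with Jest-style tests in TypeScript: read-only checks run everywhere, mutating ones are skipped when pointed at prod. The TARGET_ENV and BASE_URL variables are made up for illustration:

  const inProd = process.env.TARGET_ENV === "prod";
  const prodSafe = test;                         // read-only tests always run
  const nonProdOnly = inProd ? test.skip : test; // mutating tests skipped against prod

  prodSafe("GET /health returns 200", async () => {
    const res = await fetch(`${process.env.BASE_URL}/health`);
    expect(res.status).toBe(200);
  });

  nonProdOnly("POST /orders creates an order", async () => {
    const res = await fetch(`${process.env.BASE_URL}/orders`, { method: "POST" });
    expect(res.status).toBe(201);
  });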


In a multi-tenant system one of the accounts can be a test account. Within that you can run integration tests. You might need special cases: test payment accounts and credit cards, test pricing plans and so on.

Some basic ping tests and other checks before swapping a new version into production (as in preparing, initiating, and pointing the load balancer at it) would be smart.
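
A small sketch of such a pre-swap check in TypeScript; the /healthz path, the retry counts, and the delay are assumptions:

  // Require several consecutive healthy responses from the new version before
  // pointing the load balancer at it.
  async function readyToSwap(newVersionUrl: string, required = 5, maxAttempts = 30): Promise<boolean> {
    let consecutive = 0;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        const res = await fetch(`${newVersionUrl}/healthz`);
        consecutive = res.ok ? consecutive + 1 : 0;
      } catch {
        consecutive = 0;
      }
      if (consecutive >= required) return true;
      await new Promise((r) => setTimeout(r, 2000)); // small delay between probes
    }
    return false;
  }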


Actually never seen a team that hasn't tested their system in prod. Just be careful with fake data, you might be testing something that does not mirror actual application usage. Feature flags, betas, etc. can be safer than fake data.


Generally, no. I have been known to point my local instance at the production database where I am now, as it's easier to get the dataset where an error occurs. I don't do anything that requires changing the data - strictly selects and views. I make a point to switch it off production ASAP.

I would prefer not having to do that at all though.


A significant source of frustration at $dayjob recently has been the _inability_ to test in production. We've just deployed Stripe, and if you're using prod API keys, there's no testing possible without spending real money. Deploy to production and pray to the tech gods I guess.


In most of the systems I worked with the ACID database is the source of truth. So I carefully (there is framework support to reduce errors) run tests without committing the open transaction. Not recommended, but sometimes database copies don't surface the actual problem.
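
A hedged sketch of that pattern with node-postgres: open a transaction, exercise the write path against real data, and always roll back. The table and account id are made up for illustration:

  import { Client } from "pg";

  async function testWithoutCommitting(): Promise<void> {
    const client = new Client({ connectionString: process.env.DATABASE_URL });
    await client.connect();
    try {
      await client.query("BEGIN");
      // Exercise the write path against real data...
      await client.query(
        "UPDATE accounts SET status = 'suspended' WHERE id = $1",
        ["test-account-id"]
      );
      const { rows } = await client.query(
        "SELECT status FROM accounts WHERE id = $1",
        ["test-account-id"]
      );
      console.log("observed status inside txn:", rows[0]?.status);
    } finally {
      await client.query("ROLLBACK"); // ...and leave production data untouched
      await client.end();
    }
  }

Anything with side effects outside the transaction (emails, queue messages, third-party calls) is of course not protected by the rollback.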


Yes, I usually do testing in production. I feel much more comfortable if I can perform enough testing in production before informing end users to start using it.

In my experience, it starts from the DESIGN PHASE, which should be mindful of making it possible to test in production without impacting end users. The CODING PHASE should make some arrangements to provide more possibilities for testing in the production environment. The DEPLOYMENT process should be able to provide a time gap like a "pre-launch". Then you will be happy to test in the "pre-launch" period in production, and feel confident informing end users about the release.


I see a lot of suggestions in the comments for feature flags -- we've been using these from the beginning, to very good effect.

However flags turn on/off code, not data, and my main area of interest here is how to deal with the test data problem in prod.


What a coincidence! Just right now. I definitely don't consider this a normal situation. However, the crisis in progress leaves no other way to find the root cause of a sudden production database degradation. So here I am.


Just to play Devil's Advocate and be argumentative, what is the point of testing in production when your development/staging environment is guaranteed to be identical to your production environment?


Is it? I would argue if you have live data in some large amount, you’re not going to be keeping that in sync with your dev environment (and it should be sanitized, which has its own fun oddities at times). I would also argue that your staging environment probably isn’t running the same hardware specs as live. Inevitably you’ll find race conditions, odd data structures and other fun things which simply aren’t apparent on systems prior to live.


In large systems an enormous amount of energy and dev time is wasted in trying to maintain staging/dev environments. Accepting that they won't ever be the same and building out tooling for safe testing/rollouts in prod is a far better use of resources


Another commenter made a good point that in many domains/systems, environment parity is by definition impossible because in production the system spends (non-trivial amounts of) money.


Yes

Just because you have staging doesn't mean you don't need unit tests. Similarly, test in stage, then test in prod. Ideally in a way isolated from real prod users (eg, in an insurance system we had fake dealer accounts for testing)


Depends a lot on your application and how big the changes are. If you're an online store and you're pushing out incremental changes to a subset of users, it's a good strategy. If it's an aircraft autopilot, not so much.


I wouldn't personally inject fabricated data into prod just for testing. I use feature flags and test internally in prod before rolling out to real users.


It's more about handling production errors quickly than testing in production. Feature flags are a good way.


I've been testing in prod for 20+ years, here are the best practices I suggest:

tl;dr: Safety comes in the form of confidence that you will know right away when something has gone wrong and can quickly recover from it back to the last known good state.

1) Observability is key. You can't test in prod unless you have really good metrics and monitoring in case you break something. It's also the only way you'll know the test worked. So that has to come first.

2) Automated deployment and rollback. You need your deployments to be fully automated as well as rollback. That way if something goes wrong you can quickly back out the change. It also means that devs can roll out smaller changes, because they don't have to amortize any deployment overhead. If a dev knows it will take 30 minutes minimum to deploy, they won't do it as often. Smaller deployments more often mean smaller blast radii.

3) Automated canaries. Once you have 1 and 2, you can fairly easily build 3. When code is checked in, have it automatically deploy and receive a small portion of traffic. Then have it automatically monitored and compare metrics. If the metrics are worse on the canary, roll it back.

You don't need to automate step 3, it's just a lot easier. But you can totally do step 3 by hand as long as you have 1 and 2.

These steps apply to stateless systems, but they can easily be applied to stateful systems with some small changes. With stateful systems you can still do canaries. But you have to add an abstraction layer between your business functions and their datastore (but you're doing that already right?). In that abstraction layer is where you add the coordination to keep data in sync during transitions from one data store to another (when doing schema changes for example). Or if you're changing the way you write to the data store in any way, so that you can write to both new and old and read from new and old without the code being different between them.

And then lastly you start adding in chaos engineering [0]. If your systems can automatically recover from errors in production, then it can automatically recover from bad deployments.

[0] https://principlesofchaos.org
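
A rough sketch of the canary gate from step 3, in TypeScript; fetchErrorRate and rollback stand in for your metrics and deploy tooling, and the thresholds are illustrative, not recommendations:

  async function canaryGate(
    fetchErrorRate: (target: "canary" | "baseline") => Promise<number>,
    rollback: () => Promise<void>
  ): Promise<boolean> {
    const [canary, baseline] = await Promise.all([
      fetchErrorRate("canary"),
      fetchErrorRate("baseline"),
    ]);
    // Allow for noise: fail only if the canary is markedly worse AND above an
    // absolute floor, so a 0.001% vs 0.002% blip doesn't trigger a rollback.
    const worse = canary > baseline * 1.5 && canary > 0.001;
    if (worse) {
      await rollback();
      return false;
    }
    return true;
  }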


Good answer. The interesting notion for me here, is the distinction between observability (including SLIs, SLOs, etc, not just ad-hoc observability a la honeycomb.io) and testing in production -- they almost feel like 2 sides of the same coin. As really, the test in production is an experiment that is designed to measure something, which feels like it's just an SLI of some sort.

Observability, I agree, is essential, but rather than thinking of it as a pre-requisite as you suggest, I'm thinking of the tests as a form of observability.


Yes, you can do so with a canary tier. Assuming your code is well instrumented to distinguish performance and quality regressions, a canary tier served to customers will catch more regressions than synthetic testing.


Anyone with CI/CD is testing each deployment in production, right?


Well yes, but testing the deployment of some code is not the same as testing its subsequent operation after that deployment.


I don’t recall testing in production ever being taboo in the 90’s.


Testing is called verification when done in production ;)


Is there another way?


Feature flags.


I also code in production


I've never done much testing in production. A long time ago I was too lazy to put my website into git, so I would just ssh into the webserver and edit the HTML files with mg. Not particularly productive or enjoyable, honestly. I am sure the search engines also liked index.html~ being very similar to index.html; was also too lazy to turn off backup files ;)

My priorities with production are getting as much information recorded as possible; if there is ever a bug that occurs and isn't detected by monitoring and debuggable by looking at the telemetry, that's a big problem that is a priority to fix. It is always a work in progress, but something that you can chip away at gradually over time. (Add them as postmortem action items.)

The provided articles mention weird quirks that only happen in production, like network card firmware issues that drop a particular bit pattern. I've definitely seen things like this (at a higher level); I add the bit patterns to my test suite and make the test suite runnable as an "application" in the production environment and then collect my data. As for straight-up hardware problems, that's happened exactly once in my career. I used to maintain a several-thousand replica application; one day one replica was crash looping. I looked at the stack traces, different each time, and couldn't figure out what was possibly wrong with the code. A nearby coworker suggested "just restart that replica with --avoid_parent to schedule it on a different machine". The problem went away and never came back. Shrug. Sometimes the computer doesn't faithfully run the instructions that you put into memory, but it is pretty rare. Detect it and remove the faulty computer, I guess.

For less quirky things, I like the ability to simulate resource constraints, rather than trying to run into them with physical hardware. For example, it's pretty hard to write a load test that makes S3 slow, but it's pretty easy to hack up Minio to sleep for a second every MB of data and now your load tests can see what blows up when S3 is slow. Then you can edit your code to be resilient against that. (etcd on low iops disks has also been a problem in my work; that is easy enough to simulate without changing the code, cgroups provides a mechanism. Now you don't actually have to generate enough load to make your disk slow.) Adjusting network latency with "tc qdisc add dev X netem ..." has also been useful for debugging slow file uploads over high-latency links without actually going through the hassle of renting a server far away to upload things to. I will say the disadvantage there is that the less you know about the full stack, the less you trust your simulations. You'll end up with a lot of pushback along the lines of "that's not a real scenario", and it is true that calling Write() slowly versus the OS not returning from the write() syscall because the disk is busy is a slightly different codepath and there can always be side effects that you're missing. But often the black box model is a worthwhile tradeoff for improved development cycle times; just make sure you add the instrumentation to real production so you can get data about how good your simulation is.
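
A hedged TypeScript sketch of that "make the dependency slow on purpose" idea, written as a wrapper rather than a patched Minio; the ObjectStore interface is a placeholder, not the real S3 SDK:

  interface ObjectStore {
    put(key: string, data: Buffer): Promise<void>;
  }

  // Add artificial delay proportional to payload size so load tests can see
  // what breaks when the object store is slow.
  function withSlowWrites(inner: ObjectStore, msPerMegabyte: number): ObjectStore {
    return {
      async put(key: string, data: Buffer): Promise<void> {
        const delay = (data.length / (1024 * 1024)) * msPerMegabyte;
        await new Promise((r) => setTimeout(r, delay));
        return inner.put(key, data);
      },
    };
  }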

I'm willing to use error/latency budget for unusual production deployments to collect real-world data, for example, running 10% of requests through a build with the race detector enabled. That now accounts for your worst 10% response latency (and errors if you do have data races in hot paths!), but if it's within the budget, it's worth it because you get a stack trace pointing at a critical correctness error in your code, and you can go add that case to your unit tests and never have the problem again. Sometimes you can't think of everything, which is why telemetry from production is so important to me. (This kind of data is important for more than just the mechanics of the code, of course. Talk to your users and see if they like the new icon set. If they don't, your test in production failed and you should fix your app.) Finally, I also like fuzz testing on top of all of this; have a beefy computer generating the most corrupt possible data billions of times a second and see how your app behaves. Every fuzz test I've ever written has exposed a boneheaded subtle mistake in the code, even in code with 100% test coverage.



