Cavalcanti: I’m here to talk about architecting software for leverage. To start, I’ll define what I mean by leverage. Here is a Google definition: leverage is the amount of value you get relative to the size of the investment you made. We expect to get more value than we invested. In the software context, it is the decisions you take, the choices you make, or the technical debt you acquire, relative to the amount of value you can create. I want to look at some examples of architectural decisions we took throughout the Nubank trajectory that were aimed at getting the highest possible leverage at the time. You may be in a similar position in your company, or in a future company, at a phase where you will be making those decisions. You can take us as an example, or at least take away a mindset.
I’m Lucas Cavalcanti. I’m a principal software engineer at Nubank since the end of 2013. A little bit over seven years now. I’m based in Sao Paulo, Brazil.
Growing Rapidly in a Complex Domain
Nubank is the leading FinTech in Latin America, and was ranked the biggest digital bank in the world. Time magazine named us one of the 100 most influential companies in the world, and we were featured in the magazine as well. A big achievement for a company that is a little over seven years old. This is our growth curve; the graph plots the actual number of customers. We’ve got to 35 million customers right now. We process billions of Kafka messages and HTTP requests every day, in a system that has hundreds of microservices, designed by hundreds of engineers. It’s a pretty big scale, and it was not always this size.
I’ll walk through some phases of the company. The first one, which was the startup time, when we valued time to market and feedback. Going to the growth time when we changed the focus to resilience and adaptability. Going next to the consolidation time, when the most important aspects were reliability and observability. Going last to expansion time when we value flexibility and extensibility. These were the values that were important to us at those phases.
Startup Time (2013 – 2015)
The startup time is this magical time where anything is possible, including failing and not having a company. In our case, this was late 2013 to early 2015. We had that incredible chance, that magical time when you have a greenfield project and get to choose any technology you like. You have to have a good reason for the choice, though. We were in a small office, which was actually a little house, in a friendly neighborhood in Sao Paulo, when we launched the first product: a digital-first credit card with no fees and a real-time experience, which was unheard of at the time. We were the first ones to do that, in Brazil at least. There were too many unknowns: we didn’t know what direction the company would take, or whether it was going to succeed. Resources were limited: just a dozen people running the whole company, and we needed to make that work. In our case, we also had a license deadline. If we were not operating by May 2014, we would have to apply for a license that would take up to two years to be granted, which would basically be death for the company.
The first lever is technology choices. The value here is time to market. We needed to launch as fast as we could. The leverage type here is maximizing the amount of work you don’t have to do. Don’t create a thing that’s more complex than you need at that stage. As a database we chose Datomic, a very niche database, which is an immutable ledger of facts. You get auditing for free. Every update preserves history, so you don’t lose previous values. You can query the database as of any point in time, which is very helpful for both auditing and debugging later on. We chose Clojure, which is a functional programming language that runs on the JVM. We can leverage the whole Java ecosystem. Everything that’s written in Java, we can use in Clojure. We get immutability by default. Pretty much every decision that the language took leads us to simplicity. “Simple Made Easy” is a good talk by Rich Hickey. This is true in practice; we use it in production. Functional programming is also close to finance, which is another reason we chose Clojure. It’s easier to map finance logic onto a functional programming language.
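To make the "immutable ledger of facts" idea concrete, here is a minimal sketch in Python. It is not Datomic's API or data model; the `Fact` and `FactLog` names, the transaction counter, and the credit-limit example are all assumptions made up for illustration. It only shows why an append-only store gives you history and point-in-time queries without extra work.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    tx: int          # hypothetical transaction number, increases on every write
    entity: str
    attribute: str
    value: Any


class FactLog:
    """Append-only store: an update adds a new fact, nothing is overwritten."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []
        self._next_tx = 1

    def assert_fact(self, entity: str, attribute: str, value: Any) -> int:
        tx = self._next_tx
        self._facts.append(Fact(tx, entity, attribute, value))
        self._next_tx += 1
        return tx

    def value_as_of(self, entity: str, attribute: str,
                    tx: Optional[int] = None) -> Any:
        """Latest value at or before transaction `tx` (None means "now")."""
        result = None
        for fact in self._facts:          # facts are stored in tx order
            if tx is not None and fact.tx > tx:
                break
            if fact.entity == entity and fact.attribute == attribute:
                result = fact.value
        return result

    def history(self, entity: str, attribute: str) -> List[Tuple[int, Any]]:
        """Every value the attribute ever had: the audit trail."""
        return [(f.tx, f.value) for f in self._facts
                if f.entity == entity and f.attribute == attribute]


ledger = FactLog()
t1 = ledger.assert_fact("card-1", "limit", 1000)
t2 = ledger.assert_fact("card-1", "limit", 2500)

current_limit = ledger.value_as_of("card-1", "limit")          # value now
limit_at_t1 = ledger.value_as_of("card-1", "limit", tx=t1)     # value back then
```

Because the second write did not destroy the first, the old limit is still answerable, which is the property that helps with auditing and debugging.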
We chose to use the hexagonal architecture, so we have an organized way to structure our code. We chose Kafka as the messaging technology, which was booming at the time, and which has a persistent log of messages with a TTL. It’s not forever, but for a period of time you can inspect and look at all the messages that were produced. We have the ability to reset offsets, so you can reprocess old messages if necessary. We had to do it a few times at the beginning. We get partitioning by default as well. Scaling Kafka was also a little bit easier at the time. The debt that we took on at that point is that some of the technologies we chose were very niche, and somewhat unconsolidated. They were not established yet. It was very hard to find people that had experience with those technologies. We basically had to not ask for that experience, and teach people instead, in the beginning of the company.
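The log-with-TTL and offset-reset behavior described above can be illustrated with a toy, single-partition model in Python. This is not a Kafka client and makes no claim about Kafka's internals; `RetainedLog`, `Consumer`, and the explicit `now` timestamps are hypothetical names and simplifications chosen for the sketch.

```python
from typing import List, Tuple


class RetainedLog:
    """Toy single-partition log: append-only, with time-based retention."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._entries: List[Tuple[int, float, str]] = []  # (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message: str, now: float) -> int:
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        """Drop messages older than the retention window (the TTL)."""
        cutoff = now - self.retention_seconds
        self._entries = [e for e in self._entries if e[1] >= cutoff]

    def read_from(self, offset: int) -> List[Tuple[int, str]]:
        return [(o, m) for (o, _, m) in self._entries if o >= offset]


class Consumer:
    def __init__(self, log: RetainedLog) -> None:
        self.log = log
        self.offset = 0              # next offset this consumer will read
        self.seen: List[str] = []

    def poll(self) -> None:
        for offset, message in self.log.read_from(self.offset):
            self.seen.append(message)
            self.offset = offset + 1

    def reset_offset(self, offset: int) -> None:
        """Rewind so that retained messages are processed again."""
        self.offset = offset


topic = RetainedLog(retention_seconds=60)
topic.append("purchase-1", now=0)
topic.append("purchase-2", now=30)

consumer = Consumer(topic)
consumer.poll()              # processes both messages
consumer.reset_offset(0)
consumer.poll()              # reprocesses both, they are still retained
topic.expire(now=70)         # purchase-1 is now older than 60 seconds
consumer.reset_offset(0)
consumer.poll()              # only purchase-2 can still be replayed
```

The last replay shows the trade-off mentioned in the talk: reprocessing works only for messages that are still inside the retention window.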
The next lever is vendors. When you’re looking at time to market, buying instead of building is often the best choice. The first one is using the cloud. You don’t want to be managing your own machines at this phase of the company. We used AWS with CloudFormation for deploy automation from the very beginning. We used DynamoDB as the storage backend for Datomic, which was also easy to operate. We chose to buy an off-the-shelf credit card solution. We didn’t start by building the whole credit card system; we started by using a company that already processed credit card transactions. Instead of building, we just integrated with that company, and we could create the first product faster. The debt here is that, by using those vendors, we are limited by their ability to grow, to scale, and to respond to our problems, which is not always ideal.
The final one here on the startup time is practices. This time the value is getting fast and early feedback. To get that, we need to build a good foundation, so we can build on top of it faster over time. The thing here is that building a foundation takes time, which is not always available when you’re in startup mode and want to launch as fast as you can. By luck, we could do that. We had some time. We took that chance to build a good CI/CD environment as well; continuous deployment was very important for us at the time, so we set up some practices for it. We had very rudimentary fault tolerance, but it was there. From the very beginning, we had immutable infrastructure. Every time we deployed, we created new instances on EC2 and destroyed the old ones, so you don’t have the complexity of dealing with infrastructure changes. We chose from the very beginning to use microservices because we knew that the financial domain is very complex. Containing that complexity in small pieces, in this case small services, was very important to us at the time. We started with that already in place.
Growth Time (2015 – 2016)
Moving a little bit further, if we are lucky and we are successful, the company goes to a growth phase, which in our case was between 2015 and 2016, when we experienced way faster growth than expected. We were expecting to get to 1 million customers within 5 years, and we got there in something like 18 months. We needed to respond to that. First, the office didn’t scale and we had to move to a bigger place. The vendor was not scaling either: the credit card processor was not keeping up, so we needed to keep the system working despite that. The technology decisions we took at the very beginning were starting to not scale as well. We started to see the first bottlenecks, which would be very hard to fix in this hypergrowth scenario.
The first lever here on the growth time is practices. The values are scalability and fault tolerance. We can and we should, as much as we can, avoid optimizations, or at least delay them, because optimized code is way more complex than regular code. In a complex domain, this can go off track very rapidly. To do that, we used infrastructure sharding, instead of sharding just the database or just a piece of the infrastructure. We had several copies of the whole Nubank system. Each shard was a copy of the whole infrastructure, and the shards were our scalability units. We could put a limit on the number of customers running on a copy, move on to the next copy for the next set of customers, and keep creating copies as the base grows. If the shards are small enough, you don’t have to optimize code, or you can delay it as much as you can. To do that, we had to improve our CI/CD. We needed frequent and automatic deploys.
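The "fill a copy, then open the next one" rule can be sketched as follows. This is only an illustration in Python of the assignment policy as described in the talk; `ShardRouter`, the fixed capacity, and the in-memory assignment map are assumptions, not Nubank's actual tooling.

```python
from typing import Dict, List


class ShardRouter:
    """Assign each new customer to the newest shard; open a new one when it is full."""

    def __init__(self, capacity_per_shard: int) -> None:
        self.capacity_per_shard = capacity_per_shard
        self.shard_sizes: List[int] = [0]          # customers per shard
        self.assignment: Dict[str, int] = {}       # customer id -> shard index

    def assign(self, customer_id: str) -> int:
        # A customer stays on the shard they were first assigned to.
        if customer_id in self.assignment:
            return self.assignment[customer_id]
        # When the newest shard is full, stand up another full copy of the stack.
        if self.shard_sizes[-1] >= self.capacity_per_shard:
            self.shard_sizes.append(0)
        shard = len(self.shard_sizes) - 1
        self.shard_sizes[shard] += 1
        self.assignment[customer_id] = shard
        return shard


router = ShardRouter(capacity_per_shard=2)
placements = [router.assign(f"customer-{n}") for n in range(5)]
```

Keeping the capacity small is what lets each copy run unoptimized code; the cost, as noted later in the talk, is that every shard carries a minimum infrastructure bill regardless of how full it is.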
The end-to-end tests we set up at the very beginning started to not scale as well: they started to take more than one hour to run. We had to replace them with consumer-driven contract tests, which run way faster with somewhat weaker guarantees; it was still better to keep deploying frequently than to wait too long to deploy. We started migrating to Docker, instead of running directly on EC2 instances. The investment here is that the sharding project ran for more than a year, which was a very big project for the company at that time. We had to design new tools to accommodate it. The debt here is that the project took way longer than expected and the customer base grew way faster than expected. We ended up with the first shard way bigger than the other shards. For a very long time, this first shard was a special shard that was basically the canary for any performance problem that we could get in the system. Also, each shard has a minimum cost no matter how many customers you have in it, so we started to spend a lot of money running each copy.
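A consumer-driven contract test can be reduced to a small idea: the consumer publishes the fields it depends on, and the provider's own test suite checks its responses against that list, with no need to spin up the whole system. The sketch below is a hand-rolled Python illustration, not the tooling used at Nubank or any specific contract-testing library; the billing contract, field names, and `provider_response` stand-in are hypothetical.

```python
from typing import Any, Dict, List

# Contract published by the consumer: the fields it reads and their types.
BILLING_CONTRACT: Dict[str, type] = {
    "customer_id": str,
    "amount_cents": int,
    "currency": str,
}


def contract_violations(contract: Dict[str, type],
                        response: Dict[str, Any]) -> List[str]:
    """Return a list of ways the provider response breaks the consumer contract."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in response:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(response[field_name], expected_type):
            problems.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}")
    return problems


def provider_response() -> Dict[str, Any]:
    """Stand-in for invoking the provider's handler inside its own test suite."""
    return {"customer_id": "c-1", "amount_cents": 1999, "currency": "BRL",
            "note": "extra fields do not break the consumer"}


ok = contract_violations(BILLING_CONTRACT, provider_response())
broken = contract_violations(
    BILLING_CONTRACT, {"customer_id": "c-1", "amount_cents": "19.99"})
```

This checks only the shape of the interaction, which is the "a little bit less guarantees" trade-off: it will not catch a bug that needs several services running together, but it runs in milliseconds instead of an hour.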
The next lever is in-housing. Especially because our vendors were not scaling, we started to own our own destiny on the most important aspects of the business. We started processing credit cards in-house, bringing the credit card in feature by feature so we could control our own scaling. The same thing for customer support. Delighting our customers was the biggest differentiator that we had as a company, so we also brought the customer support tools in-house, and the customer support people in-house as well. We also had to design for that. The biggest investment here is that bringing in those credit card features took more than 18 months. Every little feature was a migration that we had to do. It was a huge investment that paid off hugely. The vendor would not have scaled to 35 million customers, and we could. The debt here is that, because we were bringing several features in-house, we went a long time without any major product changes. That was a little bit bad.
Consolidation Time (2017 – 2018)
If we’re lucky enough, we get to go to the next phase, which is the consolidation time. This is when we reached cruise mode in the company, between 2017 and 2018, when we were scalable but not in a stable way; sharding had helped a lot with the scaling. We reached a scale where every little corner case that affected 0.1% of the customers was happening to thousands of customers. We had to have a way more stable product and system than we had expected to need. The office also didn’t scale, so we needed to move to a way bigger office that would accommodate 1000 people, near Avenida Paulista, which is one of the most famous streets in Sao Paulo. We launched our second product, the checking account, around this time. In this phase, we were already generating huge amounts of data, so we had to start analyzing that data. This was also a very important point for us.
The first lever here is on the technology side, where we were aiming for scalability and adaptability. The leverage is that, at that scale, we need to be able to make infrastructure changes more easily, so we migrated to Kubernetes, which was also booming at the time. It comes with an ecosystem of several infrastructure tools. With the number of services that we were getting, it was scaling way better than AWS CloudFormation. We also started to set up better monitoring tools, collecting real-time metrics with Prometheus plus Grafana. These metrics were also being used by other tools like Opsgenie, Slack, or CI/CD for canary deploys. This was very important for us to scale. The investment here was another year-long project to set up Kubernetes and migrate to it shard by shard, while the system was already running for millions of customers. These were pretty complex operations that we were able to pull off. The debt here is that, until we had fully migrated, we started hitting AWS limits on creating resources or on the number of resources, and we were spending a lot of money on duplicated infrastructure until the project was finished. It was a very big deal.
We also had to invest a lot in internal tools for resilience and observability. We needed to make it easy for the engineers to operate the system, especially with that number of services and people. We created a command-line repository that we called NuCLI, which put the most common operations, like restarting a service or sending HTTP requests to a service with our credentials, one command away. We also created a tool for declarative infra: a repository where you can describe the resources your services need, and the tool applies them automagically. The investment here is that we needed a dedicated team to curate, maintain, and ensure that all these changes were applied.
At that point, when we are consolidating, data becomes very important. The amount of data we have is not processable by regular tools. You need data to make pretty much every decision in your company, so we used Scala plus Spark to process all the data, extracting it from all the services’ databases across all the shards. We have an ETL process, a repository of dataset definitions that pretty much every team in the company contributes to, which outputs everything to a data warehouse so everyone can access it later. It is integrated with some AI tools, and we use it to support our machine learning models. We can use it as a consistency tool as well. With that amount of data and a distributed architecture, failed distributed transactions become a very big deal, so we also use the ETL to check consistency across the systems. The investment, again, was another big project to create the initial ETL versions and start iterating on them. We also have a dedicated team to make sure that this works well.
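As an illustration of using extracted data as a consistency check, the sketch below compares two datasets that should agree after a distributed transaction completes. It is plain Python rather than Scala with Spark, and the dataset names (`authorizations`, `ledger_entries`), the `purchase_id` key, and the three categories of discrepancy are hypothetical; the talk does not describe the actual checks.

```python
from typing import Dict, List


def find_inconsistencies(authorizations: List[dict],
                         ledger_entries: List[dict]) -> Dict[str, List[str]]:
    """Compare two services' views of the same purchases, keyed by purchase_id."""
    auth = {a["purchase_id"]: a["amount_cents"] for a in authorizations}
    ledger = {e["purchase_id"]: e["amount_cents"] for e in ledger_entries}
    return {
        # Authorized but never booked: the second step of the transaction failed.
        "missing_in_ledger": sorted(set(auth) - set(ledger)),
        # Booked without a matching authorization.
        "orphaned_in_ledger": sorted(set(ledger) - set(auth)),
        # Present on both sides, but the amounts disagree.
        "amount_mismatch": sorted(p for p in set(auth) & set(ledger)
                                  if auth[p] != ledger[p]),
    }


authorizations = [
    {"purchase_id": "p1", "amount_cents": 1000},
    {"purchase_id": "p2", "amount_cents": 2500},
    {"purchase_id": "p3", "amount_cents": 700},
]
ledger_entries = [
    {"purchase_id": "p1", "amount_cents": 1000},
    {"purchase_id": "p2", "amount_cents": 2600},
    {"purchase_id": "p4", "amount_cents": 50},
]

report = find_inconsistencies(authorizations, ledger_entries)
```

Running a comparison like this in batch over the warehouse avoids adding coordination to the live services; the price is that discrepancies are found after the fact rather than prevented.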
Expansion Time (2019 to Present)
Finally, we get to the expansion time, where we are now. From 2019 to now, we started to have products for everyone. That’s why you see this inflection point on the curve. We stopped saying no to customers that were asking for credit cards, for example, and we started offering products for everyone. We started launching in many countries, in many offices, and many other products that we were building. We started acquiring companies, so also the interface between these companies became important.
The first lever here on the expansion time, for extensibility and productivity, is what I’m calling horizontal platforms. This is basically a specialized technology team that builds abstracted tools for every other team to use. For example, for mobile and web, we have a team building tools in Flutter, design systems, and component libraries, so the regular engineer, the engineer who is not a specialist, can still evolve and use that system. Same for infrastructure: instead of everyone knowing how to operate Kubernetes, we create abstract tools so every engineer can do that. We also built an experimentation platform, because at this size we want to run experiments and improve our product. Having a platform that lets you do that easily, while monitoring KPIs and whatever else is relevant for the test, is key. We now need dedicated teams to create, maintain, and operate those platforms, to make sure that every engineer can be productive with any technology we use.
Finally, here, business platforms: domain specialist teams building abstracted APIs for all the other product teams to use. This is the point where we can get to innovation, to the endless possibilities of every product that we can build on a platform. One example is banking as a service: creating platforms to run the basic operations of a bank. Take the credit platform. You don’t have to figure out how to issue a loan, how to report it, or how to account for it; we have a platform where every product can issue a loan. As soon as that’s done, the product can do whatever it wants to do. The same goes for the assets and payments platforms in open banking. These are building blocks that we can use for building a bank. Also, on the credit card side, we had to launch credit cards in other countries, so we had to make the system more generic. To do that, we broke the system into the most relevant parts of a credit card: for example, handling credit limits, closing bills every month, processing credit card transactions in different ways, and, if the customer doesn’t pay the full bill, generating the debt and renegotiating it. Also, when you have many products, a flexible acquisition process is very important, so we have that as well. The investment here is that the domain platforms require a very deep understanding of the domain. We went through a long period of discussions to design the right breaks in the system on the platform side, because creating the wrong abstraction here leads you to failure as well. You have to be sure that you’re creating the right abstraction, because the cost of rebuilding an abstraction is very high.
To recap: in the startup time, we were trying to delay writing code as much as we could while building the foundation for growing. When we got to growth, we started bringing in-house all the core features that we needed to, and started sharding so we could scale faster. The consolidation time was all about maturing our infrastructure and creating a data environment, so everyone could use data to make the company grow. Finally, the expansion time is about building horizontal and business platforms, so the possibilities are multiplied and everyone’s productivity is multiplied.
Questions and Answers
Porcelli: You started with the idea to use microservices. Usually, when people pick the microservice architecture, the justification is more about Conway’s Law. You were small and didn’t have that justification, but you went that way anyway. Did you consider going for a monolith? What was the thought process there?
Cavalcanti: I think the main reason was because of the domain complexity. We knew that the finance domain was very complex, so we had some concerns on the system already at that time that were very different, like handling credit card transactions, versus handling customer data, versus handling the acquisition process. We started with very few services. I think it was like four or five, but already broken into services because we knew that if we were going to succeed and scale, we would have to go to that architecture, eventually.
Porcelli: This is a challenge, because distributed systems are way more complex than a single monolith.
That leads to some questions about sharding. When you started scaling, you mentioned sharding. You had to balance the shards somehow, on top of the challenge of creating them. Did you have to deal with outliers, customers that required a larger amount of resources within a shard?
Cavalcanti: I think we do have some people that run 10x or 100x more transactions than the others. It eventually just affects that particular customer; their bill may not open on time, for example, because there are too many transactions there. It doesn’t affect the other customers, because most of our operations are already in batch. It doesn’t affect us that much. We did have that problem on the first shard, because we took longer than expected to build the shard infrastructure. The first shard got too big: it has the customers who have been at Nubank the longest, and it also had the highest number of customers. That created some challenges for us. It still does sometimes, but now all the shards are about the same size. We are managing them better.
Porcelli: Then, connecting to sharding. How was the migration to Kubernetes? Some questions around the Kubernetes move. Did you adopt a Kubernetes service offering like EKS, or something like that, or you went to Kubernetes? How was this transition from your infrastructure? You already mentioned that it was immutable.
Cavalcanti: I think the main thing to take into account here is that this was seven or eight years ago. Kubernetes had only just launched; it became publicly available in 2014. We didn’t have those tools when we started the company. We didn’t even have Docker when we started the company, so we had to migrate to Docker, and then migrate to Kubernetes, and then eventually migrate to EKS, because EKS also didn’t exist at the time. It became available in the Sao Paulo region, I think, this year, or something like that. I think if we were starting the company today, we’d probably use EKS on Amazon, and that’s it. We didn’t have those tools at the time.
Porcelli: Another question that I’m seeing quite frequently here is the use of Clojure, and the stack that you mentioned. That’s not that common, I’d say. You take advantage of the ecosystem. You also mentioned the JVM. Can you help us understand a little bit this dichotomy? You’re starting easy, but you pick something that very few are picking to start with.
Cavalcanti: The main thing for us with Clojure is that the language pushes you toward simplicity from the very beginning, in the sense that the amount of the language you have to learn to be successful with it is very small. You can be productive in Clojure after studying for maybe a couple of weeks. All of our services look about the same. As soon as you get past those parentheses, as in you switch from braces to parentheses and are fine with it, everything else is simpler. We don’t have the cognitive load that most languages impose for learning the syntax and the constructs of the language. Clojure is a Lisp: parentheses and symbols. You don’t have to learn that many language features to be able to use it, or to copy and paste other code, like we all do.
Porcelli: One question now on more of the business aspects: how did cost factor into your architecture decisions? Also, when you’re operating in finance, you have to follow a lot of regulations. How do you deal with the separation between infrastructure and business code, and with all the challenges of regulation?
Cavalcanti: We do have specific teams. When we reach a certain size, it pays off to have a team focused on specific aspects. We have a team just managing operational risk, or a team just building platforms for the regulated part of the company. Giving a loan is very regulated in the market, at least in Brazil, so we created a platform around it. Every time we need to issue a loan, the platform handles it, and we have many teams that can issue loans easily. That was mostly the way we handled it: by having a specialized team that knows a lot about the regulations and about how to assess risk, and then periodic assessments from other teams that might be impacted by it.
Porcelli: You’ve mentioned a few times that there are specialized teams and specialists. What do team sizes look like at Nubank?
Cavalcanti: The team is not just engineers, usually. We have teams with BAs, with product managers, with business product managers, sometimes data scientists. For the engineering part, we usually have a tech manager, and two to six engineers per team. It varies a lot on the context, but it’s around that size per team.
Porcelli: The two-pizza rule.
Let’s now switch gears to data. You mentioned ETL, what tools do you use for ETL?
Cavalcanti: We use Spark for building all the infrastructure to transform the data. We had to build a lot of tools internally to extract Datomic data and transform it in a way that is consumable by the ETL. I’m not a specialist in the data part, so I don’t know very much about the specifics. I know we used Mesos to run the clusters that run the ETL process. We use some BI tools like Databricks and Looker. I think there are several tools that this whole ecosystem uses. We have our own repository of dataset definitions that the whole company contributes to. It’s Scala with Spark, with some abstractions that we created so it’s easier for people to use.
Porcelli: You started with Kafka, and today lots of things are about streaming, but you picked the batch-processing ETL route. Any reason not to go head-on into streaming? Was it that streaming wasn’t that popular at the time, and you needed to adjust?
Cavalcanti: The main reason was that when we needed ETL, Kafka Streams was not stable, or not released, at the time. We already had data in Datomic from dozens of microservices by then. It would have been hard for us to migrate to a Kafka Streams model on the architecture we had chosen. If we were to start today, we would go with streaming. We do have some use cases where streams are more appropriate. We use it for collecting metrics in real time, for example. So we do use Kafka Streams, but not for the regular database part.