Challenge
With more than 300 million active users and total 2017 revenue of more than $55 billion, JD.com is China's largest retailer, and its operations are the epitome of hyperscale. For example, there are more than a trillion images in JD.com's product databases—with 100 million being added daily—and this enormous amount of data needs to be instantly accessible. In 2014, JD.com moved its applications to containers running on bare metal machines using OpenStack and Docker to "speed up the delivery of our computing resources and make the operations much simpler," says Haifeng Liu, JD.com's Chief Architect. But by the end of 2015, with tens of thousands of nodes running in multiple data centers, "we encountered a lot of problems because our platform was not strong enough, and we suffered from bottlenecks and scalability issues," says Liu. "We needed infrastructure for the next five years of development, now."
Solution
JD.com turned to Kubernetes to accommodate its clusters. At the beginning of 2016, the company began to transition from OpenStack to Kubernetes, and today, JD.com runs the world's largest Kubernetes cluster. "Kubernetes has provided a strong foundation on top of which we have customized the solution to suit our needs as China's largest retailer."
Impact
"We have greater data center efficiency, better managed resources, and smarter deployment with the Kubernetes platform," says Liu. Deployment time went from several hours to tens of seconds. Efficiency has improved by 20-30%, measured in IT costs. With the further optimizations the team is working on, Liu believes there is the potential to save hundreds of millions of dollars a year. But perhaps the best indication of success was the annual Singles Day shopping event, which ran on the Kubernetes platform for the first time in 2018. Over 11 days, transaction volume on JD.com was $23 billion, and "our e-commerce platforms did great," says Liu. "Infrastructure led the way to prep for 11.11. We took the approach of predicting volume, emulating the behavior of customers to prepare beforehand, and drilled for malfunctions. Because of Kubernetes's scalability, we were able to handle an extremely high level of demand."
For example, there are more than a trillion images in JD.com's product databases for customers, with 100 million being added daily. And this enormous amount of data needs to be instantly accessible to enable a smooth online customer experience.
In 2014, JD.com moved its applications to containers running on bare metal machines using OpenStack and Docker to "speed up the delivery of our computing resources and make the operations much simpler," says Haifeng Liu, JD.com's Chief Architect. But by the end of 2015, with hundreds of thousands of nodes in multiple data centers, "we encountered a lot of problems because our platform was not strong enough, and we suffered from bottlenecks and scalability issues," Liu adds. "We needed infrastructure for the next five years of development, now."
After considering a number of orchestration technologies, JD.com decided to adopt Kubernetes to accommodate its ever-growing clusters. "The main reason is because Kubernetes can give us more efficient, scalable and much simpler application deployments, plus we can leverage it to do flexible platform scheduling," says Liu.
The fact that Kubernetes is based on Google's Borg also gave the company confidence. The team liked that Kubernetes has a clear and simple architecture, and that it's developed mostly in Go, which is a popular language within JD.com. Though he felt that at the time Kubernetes "was not mature enough," Liu says, "we adopted it anyway."
The team spent a year developing the new container engine platform based on Kubernetes, and at the end of 2016, began promoting it within the company. "We wanted the cluster to be the default way for creating services, so scalability is easier," says Liu. "We talked to developers, interest grew, and we solved problems together." Some of these problems included networking performance and etcd scalability. "But during the past two years, Kubernetes has become more mature and very stable," he adds.
Today, the company runs the world's largest Kubernetes cluster. "We customized Kubernetes and built a modern system on top of it," says Liu. "This entire ecosystem of Kubernetes plus our own optimizations have helped us save costs and time. We have greater data center efficiency, better managed resources, and smarter deployment with the Kubernetes platform."
The results are clear: Deployment time went from several hours to tens of seconds. Efficiency has improved by 20-30%, measured in IT costs. But perhaps the best indication of success was the annual Singles Day shopping event, which ran on the Kubernetes platform for the first time in 2018. Over 11 days, transaction volume on JD.com was $23 billion, and "our e-commerce platforms did great," says Liu. "Infrastructure led the way to prep for 11.11. We took the approach of predicting volume, emulating the behavior of customers to prepare beforehand, and drilled for malfunctions. Because of Kubernetes's scalability, we were able to handle an extremely high level of demand."
JD.com is now in its second stage with Kubernetes: The platform is already stable, scalable, and flexible, so the focus is on how to run things much more efficiently to further reduce costs. With the optimizations the team is working on with resource management, Liu believes there is the potential to save hundreds of millions of dollars a year.
"We run Kubernetes and container clusters on roughly tens of thousands of physical bare metal nodes," he says. "Using Kubernetes and leveraging our own machine learning pipeline to predict how many resources we need for each application we use, and our own intelligent scaling algorithm, we can improve our resource usage. If we boost the resource usage, for example, by several percent, that means we can reduce huge hardware costs. Then we don't need that many servers to get that same amount of workload. That can save us a lot of resources."
JD.com, which won CNCF's 2018 End User Award, is also using Helm, CNI, Harbor, and Vitess on its platform. JD.com developers have made considerable contributions to Vitess, the CNCF project for scalable MySQL cluster management, and the company hopes to donate its own project to CNCF in the near future. Community participation is a priority for JD.com. "We have a good partnership with this community," says Liu. "We can share our successful experience with the community, and we also receive good feedback from others. So it's mutually beneficial."
To that end, Liu offers this advice for other companies considering adopting cloud native technology. "First you need to combine this technology with your own businesses, and the second is you need clear goals," he says. "You cannot just use the technology because others are using it. You need to consider your own objectives."
For JD.com's objectives, these cloud native technologies have been an ideal fit with the company's own homegrown innovation. "Kubernetes helped us reduce the complexity of operations to make distributed systems stable and scalable," says Liu. "Most importantly, we can leverage Kubernetes for scheduling resources to reduce hardware costs. That's the big win."