
Egnyte Architecture: Lessons Learned In Building And Scaling A Multi Petabyte Content Platform

 

 

This is a guest post by Kalpesh Patel, an engineer who works for Egnyte from home. Egnyte is a Secure Content Platform built specifically for businesses. He and his colleagues spend their productive hours scaling large distributed file systems. You can reach him at @kpatelwork.

Introduction

Your laptop has a filesystem used by hundreds of processes. There are a couple of downsides, though, if you try to use it to support tens of thousands of users working simultaneously on hundreds of millions of files that together contain petabytes of data. It is limited by disk space; it can’t expand storage elastically; it chokes if you run a few I/O-intensive processes or try collaborating with 100 other users. Take that problem and transform it into a cloud-native file system used by millions of paid users spread across the globe, and you get an idea of our roller coaster ride of scaling this system to meet monthly growth and SLA requirements while providing the stringent consistency and durability characteristics we have all come to expect from our laptops.

Egnyte is a secure Content Collaboration and Data Governance platform, founded in 2007 when Google Drive wasn't born and AWS S3 was cost-prohibitive. Our only option was to roll up our sleeves and build basic cloud file system components, such as an object store, ourselves. Over time, costs for S3 and GCS became reasonable, and with Egnyte’s storage plugin architecture, our customers can now bring in any storage backend of their choice. To help our customers manage the ongoing data explosion, we have designed many of the core components over the last few years. In this article, I will share the current architecture, some of the lessons we learned scaling it, and some of the things we are looking to improve in the near future.

Egnyte Connect Platform

 


 

From Bare-Metal To Kubernetes

 

Auth0 Architecture: Running In Multiple Cloud Providers And Regions

 

 

This article was written by Dirceu Pereira Tiegs, Site Reliability Engineer at Auth0, and was originally published on the Auth0 blog.

Auth0 provides authentication, authorization, and single sign-on services for apps of any type (mobile, web, native) on any stack. Authentication is critical for the vast majority of apps. We designed Auth0 from the beginning so that it could run anywhere: on our cloud, on your cloud, or even on your own private infrastructure.

In this post, we'll talk more about our public SaaS deployments and provide a brief introduction to the infrastructure behind auth0.com and the strategies we use to keep it up and running with high availability. 

A lot has changed since then at Auth0. These are some of the highlights:

  • We went from processing a couple of million logins per month to 1.5+ billion logins per month, serving thousands of customers, including FuboTV, Mozilla, JetPrivilege, and more.

  • We implemented new features like custom domains, scaled bcrypt operations, vastly improved user search, and much more.

  • To scale our organization and handle the increases in traffic, the number of services that compose our product went from under 10 to over 30.

  • The number of cloud resources grew immensely as well; we used to have a couple dozen nodes in one environment (US), now we have more than a thousand across four environments (US, US-2, EU, AU).

  • We doubled down on using a single cloud provider for each of our environments and moved all our public cloud infrastructure to AWS.

Core Service Architecture

 


 

Give Meaning To 100 Billion Events A Day - The Analytics Pipeline At Teads

 

This is a guest post by Alban Perillat-Merceroz, Software Engineer at Teads.tv.

In this article, we describe how we orchestrate Kafka, Dataflow, and BigQuery together to ingest and transform a large stream of events. When you add scale and latency constraints, reconciling and reordering them becomes a challenge; here is how we tackle it.

 

 
Teads for Publisher, one of the webapps powered by Analytics

 

In digital advertising, day-to-day operations generate a lot of events we need to track in order to transparently report campaign performance. These events come from:

  • Users’ interactions with the ads, sent by the browser. These events are called tracking events and can be standard (start, complete, pause, resume, etc.) or custom events coming from interactive creatives built with Teads Studio. We receive about 10 billion tracking events a day.
  • Events coming from our back-ends, regarding ad auctions’ details for the most part (real-time bidding processes). We generate more than 60 billion of these events daily, before sampling, and should double this number in 2018.

In this article, we focus on tracking events, as they are on the most critical path of our business.

 
Simplified overview of our technical context with the two main event sources

 

Tracking events are sent by the browser over HTTP to a dedicated component that, amongst other things, enqueues them in a Kafka topic. Analytics is one of the consumers of these events (more on that below).
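The collector component itself isn't shown in the post, but the flow it describes can be sketched roughly as follows. This is a minimal sketch, not Teads' actual code; the kafka-python client, broker addresses, topic name tracking-events, and event fields are all assumptions made for illustration.

```python
# Minimal sketch of the ingestion step described above: an HTTP-facing collector
# that enqueues each tracking event into a Kafka topic for downstream consumers
# such as Analytics. Client library, brokers, topic name, and event fields are
# illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],   # hypothetical brokers
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    acks="all",       # don't acknowledge the event until all replicas have it
    linger_ms=20,     # small batching window: a little latency for more throughput
)

def handle_tracking_request(params: dict) -> None:
    """Called once per incoming HTTP tracking request from a browser."""
    event = {
        "type": params.get("type", "start"),        # start, complete, pause, resume, ...
        "creative_id": params.get("creative_id"),
        "timestamp_ms": int(params["ts"]),
    }
    # Keying by creative keeps events for the same ad in one partition (ordered).
    producer.send(
        "tracking-events",
        key=str(event["creative_id"]).encode("utf-8"),
        value=event,
    )
```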

We have an Analytics team whose mission is to take care of these events and is defined as follows:

We ingest the growing amount of logs,
We transform them into business-oriented data,
Which we serve efficiently and tailored for each audience.

To fulfill this mission, we build and maintain a set of processing tools and pipelines. Due to the organic growth of the company and new product requirements, we regularly challenge our architecture.

Why We Moved To BigQuery

 


 

How Ipdata Serves 25M API Calls From 10 Infinitely Scalable Global Endpoints For $150 A Month

 

This is a guest post by Jonathan Kosgei, founder of ipdata, an IP Geolocation API. 

I woke up on Black Friday last year to a barrage of emails from users reporting 503 errors from the ipdata API.

Our users typically call our API on each page request on their websites to geolocate their users and localize their content. So this particular failure was directly impacting our users’ websites on the biggest sales day of the year. 

I only lost one user that day but I came close to losing many more.

This sequence of events, its inexplicable nature (CPU, memory, and I/O were nowhere near capacity), and concerns about how well (if at all) we would scale given our outage were a big wake-up call to rethink our existing infrastructure.

Our Tech Stack At The Time

 


 

Netflix: What Happens When You Press Play?

 

 

This article is a chapter from my new book Explain the Cloud Like I'm 10. The first release was written specifically for cloud newbies. I've made some updates and added a few chapters—Netflix: What Happens When You Press Play? and What is Cloud Computing?—that level it up to a couple ticks past beginner. I think even fairly experienced people might get something out of it.

So if you are looking for a good introduction to the cloud or know someone who is, please take a look. I think you'll like it. I'm pretty proud of how it turned out. 

I pulled this chapter together from dozens of sources that were at times somewhat contradictory. Facts on the ground change over time and depend on who is telling the story and what audience they're addressing. I tried to create as coherent a narrative as I could. If there are any errors I'd be more than happy to fix them. Keep in mind this article is not a technical deep dive. It's a big picture type article. For example, I don't mention the word microservice even once :-)

 

Netflix seems so simple. Press play and video magically appears. Easy, right? Not so much.

 

Given our discussion in the What is Cloud Computing? chapter, you might expect Netflix to serve video using AWS. Press play in a Netflix application and video stored in S3 would be streamed from S3, over the internet, directly to your device. 

A completely sensible approach…for a much smaller service. 

But that’s not how Netflix works at all. It’s far more complicated and interesting than you might imagine.

To see why, let’s look at some impressive Netflix statistics for 2017.

  • Netflix has more than 110 million subscribers.
  • Netflix operates in more than 200 countries. 
  • Netflix has nearly $3 billion in revenue per quarter.
  • Netflix adds more than 5 million new subscribers per quarter.
  • Netflix plays more than 1 billion hours of video each week. As a comparison, YouTube streams 1 billion hours of video every day while Facebook streams 110 million hours of video every day.
  • Netflix played 250 million hours of video on a single day in 2017.
  • Netflix accounts for over 37% of peak internet traffic in the United States.
  • Netflix plans to spend $7 billion on new content in 2018. 

What have we learned? 

Netflix is huge. They’re global, they have a lot of members, they play a lot of videos, and they have a lot of money.

Another relevant factoid is that Netflix is subscription based. Members pay Netflix monthly and can cancel at any time. When you press play to chill on Netflix, it had better work. Unhappy members unsubscribe.

Netflix operates in two clouds: AWS and Open Connect.

How does Netflix keep their members happy? With the cloud of course. Actually, Netflix uses two different clouds: AWS and Open Connect. 

Both clouds must work together seamlessly to deliver endless hours of customer-pleasing video.

The three parts of Netflix: client, backend, CDN.

You can think of Netflix as being divided into three parts: the client, the backend, and the CDN. 

The client is the user interface on any device used to browse and play Netflix videos. It could be an app on your iPhone, a website on your desktop computer, or even an app on your Smart TV. Netflix controls each and every client for each and every device. 

Everything that happens before you hit play happens in the backend, which runs in AWS. That includes things like preparing all new incoming video and handling requests from all apps, websites, TVs, and other devices.

Everything that happens after you hit play is handled by Open Connect. Open Connect is Netflix’s custom global content delivery network (CDN). When you press play the video is served from Open Connect. Don’t worry; we’ll talk about what this means later.

Interestingly, at Netflix they don’t actually say "hit play on a video"; they say "clicking start on a title." Every industry has its own lingo.

By controlling all three areas—client, backend, CDN—Netflix has achieved complete vertical integration. 

Netflix controls your video viewing experience from beginning to end. That’s why it just works when you click play from anywhere in the world. You reliably get the content you want to watch when you want to watch it. 

Let’s see how Netflix makes that happen.

In 2008 Netflix Started Moving To AWS

 


 

ButterCMS Architecture: A Mission-Critical API Serving Millions Of Requests Per Month

This is a guest post by Jake Lumetta, co-founder and CEO of ButterCMS.

ButterCMS lets developers add a content management system to any website in minutes. Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other strategies to make sure we keep our customers’ websites up and running.
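The post doesn't include code, but the general shape of "keep serving content even when the origin has trouble" can be sketched with standard edge-caching headers. This is an illustrative sketch only, not ButterCMS's actual setup; the Flask app, route, TTLs, and header values are assumptions.

```python
# Illustrative sketch only: an API origin that sets caching headers so an edge
# cloud such as Fastly can keep serving responses, even stale ones, if the origin
# itself has an outage. Routes, TTLs, and header values are assumptions.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v2/posts/<slug>")
def get_post(slug):
    post = {"slug": slug, "title": "Hello world"}   # stand-in for a database lookup
    resp = jsonify(post)
    # Let the edge cache the response for 5 minutes...
    resp.headers["Surrogate-Control"] = "max-age=300"
    # ...and allow stale copies to be served while revalidating, or for up to a day
    # if the origin is erroring, so a backend outage doesn't take customer sites down.
    resp.headers["Cache-Control"] = (
        "public, max-age=60, stale-while-revalidate=600, stale-if-error=86400"
    )
    return resp

if __name__ == "__main__":
    app.run(port=8000)
```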

At its core, ButterCMS offers:

ButterCMS Tech Stack

 


 

Why Morningstar Moved To The Cloud: 97% Cost Reduction

 

 

Enterprises won't move to the cloud. If they do, it's tantamount to admitting your IT group sucks. That has been the common wisdom. Morningstar, an investment research provider, is moving to the cloud, and they're about as enterprisey as it gets. And they don't strike me as incompetent; they just don't want to worry about all the low-level IT stuff anymore. 

Mitch Shue, Morningstar's CTO, gave a short talk at AWS Summit Series 2017 on their move to AWS. It's not full of nitty gritty technical details. That's not the interesting part. The talk is more about their motivations, the process they used to make the move, and some of the results they've experienced. While that's more interesting, we've heard a lot of it before.

What I found most interesting was the idea of Morningstar as a canary test. If Morningstar succeeds, the dam might burst and we'll see a lot more adoption of the cloud by stodgy mainstream enterprises. It's a copycat world. That sort of precedent gives other CTOs the cover they need to make the same decision.

The most important idea in the whole talk: the cost savings of moving to the cloud are nice, but what they were more interested in is "creating a frictionless development experience to spur innovation and creativity."

Software is eating the world. Morningstar is no doubt looking at the future and sees the winners will be those who can develop the best software, the fastest. They need to get better at developing software. Owning your own infrastructure is a form of technical debt. Time to pay down the debt and get to the real work of innovating, not plumbing.

Here's my gloss of the talk:

 


 

The AdStage Migration From Heroku To AWS

 

This is a guest repost by G Gordon Worley III, Head of Site Reliability Engineering at AdStage.

When I joined AdStage in the Fall of 2013 we were already running on Heroku. It was the obvious choice: super easy to get started with, less expensive than full-sized virtual servers, and flexible enough to grow with our business. And grow we did. Heroku let us focus exclusively on building a compelling product without the distraction of managing infrastructure, so by late 2015 we were running thousands of dynos (containers) simultaneously to keep up with our customers.

We needed all those dynos because, on the backend, we look a lot like Segment, and like them many of our costs scale linearly with the number of users. At $25/dyno/month, our growth projections put us breaking $1 million in annual infrastructure expenses by mid-2016 when factored in with other technical costs, and that made up such a large proportion of COGS that it would take years to reach profitability. The situation was, to be frank, unsustainable. The engineering team met to discuss our options, and some quick calculations showed us we were paying more than $10,000 a month for the convenience of Heroku over what similar resources would cost directly on AWS. That was enough to justify an engineer working full-time on infrastructure if we migrated off Heroku, so I was tasked to become our first Head of Operations and spearhead our migration to AWS.
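As a rough illustration of that calculation (only the $25/dyno/month price and the >$10,000/month premium over AWS come from the post; the dyno count below is a hypothetical stand-in for "thousands of dynos"):

```python
# Back-of-the-envelope version of the cost argument above. Only the $25/dyno/month
# price and the >$10,000/month AWS premium come from the post; the 3,000-dyno count
# is a hypothetical stand-in for "thousands of dynos".
DYNOS = 3_000
PRICE_PER_DYNO_MONTH = 25            # USD

heroku_monthly = DYNOS * PRICE_PER_DYNO_MONTH
heroku_yearly = heroku_monthly * 12
aws_monthly_estimate = heroku_monthly - 10_000   # the premium cited in the post

print(f"Heroku: ${heroku_monthly:,}/month (${heroku_yearly:,}/year)")
print(f"Comparable AWS estimate: ${aws_monthly_estimate:,}/month")
# At these numbers the Heroku premium alone roughly pays for a full-time
# infrastructure engineer, which is exactly the tradeoff described above.
```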

It was good timing, too, because Heroku had become our biggest constraint. Our engineering team had adopted a Kanban approach, so ideally we would have a constant flow of stories moving from conception to completion. At the time, though, we were generating lots of work-in-progress that routinely clogged our release pipeline. Work was slow to move through QA and often got sent back for bug fixes. Too often things “worked on my machine” but would fail when exposed to our staging environment. Because AdStage is a complex mix of interdependent services written on different tech stacks, it was hard for each developer to keep their workstation up-to-date with production, and this also made deploying to staging and production a slow process requiring lots of manual intervention. We had little choice in the matter, though, because we had to deploy each service as its own Heroku application, limiting our opportunities for automation. We desperately needed to find an alternative that would permit us to automate deployments and give developers earlier access to reliable test environments.

So in addition to cutting costs by moving off Heroku, we also needed to clear the QA constraint. I otherwise had free rein in designing our AWS deployment so long as it ran all our existing services with minimal code changes, but I added several desiderata:


 

Architecture Of Probot - My Slack And Messenger Bot For Answering Questions

 

I programmed a thing. It’s called Probot. Probot is a quick and easy way to get high quality answers to your accounting and tax questions. Probot will find a real live expert to answer your question and handle all the details. You can get your questions answered over Facebook Messenger, Slack, or the web. Answers start at $10. That’s the pitch.

Seems like a natural in this new age of bots, doesn’t it? I thought so anyway. Not so much (so far), but more on that later.

I think Probot is interesting enough to cover because it’s a good example of how one programmer (me) can accomplish quite a lot using today’s infrastructure.

All this newfangled cloud/serverless/services stuff does in fact work. I was able to program a system spanning Messenger, Slack, and the web in a way that is relatively scalable, available, and affordable, while requiring minimal devops.

Gone are the days of worrying about VPS limits, driving down to a colo site to check on a sick server, or even worrying about auto-scaling clusters of containers/VMs. At least for many use cases.

Many years of programming experience and writing this blog is no protection against making mistakes. I made a lot of stupid stupid mistakes along the way, but I’m happy with what I came up with in the end.

Here’s how Probot works....

Platform

 


 


What is the best way to handle 25000 concurrent requests toward the database?

Create Date: November 27, 2019 at 08:25 AM         Tag: DISTRIBUTED SYSTEMS         Author Name: Sun, Charles


First, 25,000 concurrent requests are truly world class. To put it in perspective, Amazon Prime Day 2016 resulted in 600 items sold per second.

Companies that support that kind of traffic also do not dispatch every HTTP request as an individual database access; those requests instead go through several layers of services performing caching and other business logic, which ultimately results in far fewer actual database connections than the raw traffic numbers would imply. So I would question any architecture requiring that volume of database accesses first and foremost.
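To make the caching-layer point concrete, here is a minimal cache-aside sketch. It is an illustration, not something prescribed in this thread; the Redis client, key scheme, and 60-second TTL are assumptions.

```python
# Minimal cache-aside sketch: most reads are answered by the cache layer and never
# open a database connection, so the database sees far fewer than 25,000 concurrent
# requests. Redis, the key scheme, and the 60-second TTL are illustrative choices.
import json
import redis

cache = redis.Redis(host="cache-host", port=6379)

def fetch_from_database(item_id: str) -> dict:
    # Stand-in for the real query; only cache misses ever reach this point.
    return {"id": item_id, "value": 42}

def get_item(item_id: str) -> dict:
    key = f"item:{item_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                # served without touching the database
    item = fetch_from_database(item_id)
    cache.setex(key, 60, json.dumps(item))       # keep it for 60 seconds
    return item
```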

That aside —

The number of connections isn't as important as what these connections are doing.

You'd have to understand exactly what each connection is doing and design a scalable architecture for that. These designs are rarely just about the database; they also incorporate the cache layer (if any), the service interfaces, the load balancer setup...

Here's a partial listing of strategies to mitigate the load on a database:

  1. Optimize the schema and query for your database. That comes first and foremost. If, for instance, you're doing an uncacheable regular expression search on terabytes of data with three tables joined together, you're in for a challenge to say the least. Key goal: minimize extraneous rows touched in a query.
  2. Replication. Replicate the data on multiple servers. However, there is a necessary tradeoff in how writes are reflected in the other replicas. Perhaps you've picked something like HBase, and your writes won't return until they've been written to all the HDFS replicas. Perhaps you've picked something like Cassandra, and there's a lag in data freshness while the gossip protocols resolve.
  3. Partitioning. Figure out a way to evenly distribute queries and split data across multiple nodes accordingly (a minimal sketch follows this list).
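As a minimal illustration of strategy 3, here is a simple hash-partitioning sketch; the shard names and key scheme are made up for the example.

```python
# Minimal illustration of hash partitioning: each key is deterministically hashed
# to one shard, so both data and query load are spread across nodes.
# Shard names and the key scheme are made up for the example.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:1234"))   # the same key always routes to the same shard
```

Note that plain modulo hashing forces data to move whenever the shard count changes; consistent hashing is the usual refinement once you need to add or remove nodes.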

I would not ever have 2,500 servers contacting a single database. There would be no point to it. You obviously have a distributed system, but the single database becomes a single point of failure. This would lead to a very unbalanced and unstable system.

You should look into using a distributed DBMS (DDBMS) for database replication and use multiple database servers.

At that level it would be a small drop in the bucket to spin up 10 additional database servers so each one only had to handle 250 application servers. Then use database replication to handle the data distribution between them. If you know your servers will be requesting data every 2 minutes, you can have the data propagate between the databases every minute, between the calls.

I would also use private networking if all droplets are located in the same datacenter. I mention this again below because it is a great idea in your case.

The biggest unknown here is how you plan to handle database requests and how complex each object is. You told us the properties but not what data is in them. A bit is easier to handle than an int, which is easier to handle than a string.

For JSON, MongoDB is a great database to use. I would test it and scale up. Using DigitalOcean you have virtual CPU cores, not real CPUs, so how many connections you can handle at once will be determined by the complexity of the data and how well the virtual cores handle it.
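As one concrete (and purely illustrative) way of combining MongoDB with the replication advice above, reads can be spread across a replica set; the hostnames, replica set name, database, and document shape below are made up.

```python
# Purely illustrative: a MongoDB replica set where reads prefer secondaries,
# spreading the query load across multiple database servers as suggested above.
# Hostnames, replica set name, database, and document shape are made up.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://db1.example:27017,db2.example:27017,db3.example:27017/?replicaSet=rs0"
)
collection = client.get_database(
    "telemetry", read_preference=ReadPreference.SECONDARY_PREFERRED
).readings

# Each server reports its properties as a JSON-like document every couple of minutes.
collection.insert_one({"server_id": "droplet-17", "props": [0, 1, 1, 0, 1, 0, 1]})
print(collection.find_one({"server_id": "droplet-17"}))
```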

I would try a DDBMS as I mention above, or look at AWS and some of their database solutions.

Doing some napkin math..

If each property was 1 bit... just 1 bit..

7 * 7,000 * 2,500 = 122,500,000 bits

so you're passing about 122.5 Mb (megabits) every 2 minutes

Then about 3.7 Gb an hour

About 88 Gb a day

About 2.65 Tb per 30-day month.

They do have plans that allow for this traffic, so you're looking at a $20-a-month plan for the database at the very least, just from the bandwidth.
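Spelled out as a quick script (this assumes the 7 * 7,000 * 2,500 product above means 7 one-bit properties on 7,000 objects reported by each of 2,500 servers every 2 minutes):

```python
# The napkin math above, spelled out. Assumes 7 one-bit properties on 7,000 objects
# reported by each of 2,500 servers every 2 minutes.
PROPERTIES = 7
OBJECTS = 7_000
SERVERS = 2_500

bits_per_2min = PROPERTIES * OBJECTS * SERVERS        # 122,500,000 bits
mb_per_2min = bits_per_2min / 1_000_000               # ~122.5 megabits
gb_per_hour = mb_per_2min * 30 / 1_000                # ~3.7 gigabits
gb_per_day = gb_per_hour * 24                         # ~88 gigabits
tb_per_month = gb_per_day * 30 / 1_000                # ~2.65 terabits

print(f"{mb_per_2min:.1f} Mb per 2 minutes")
print(f"{gb_per_hour:.2f} Gb/hour, {gb_per_day:.1f} Gb/day, {tb_per_month:.3f} Tb per 30-day month")
```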

Edit:

This is some good reading for that problem as well

 
