How Big MNCs like Google, Facebook, and Instagram Store, Manage, and Manipulate Thousands of Terabytes of Data with High Speed and High Efficiency

Siddhant Sharma
5 min read · Sep 17, 2020

Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didn’t have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn’t have those traditional forms. They didn’t have to merge big data technologies with their traditional IT infrastructures because those infrastructures didn’t exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture.

Consider, however, the position of large, well-established businesses. Big data in those environments can’t be kept separate; it must be integrated with everything else that’s going on in the company. Analytics on big data have to coexist with analytics on other types of data. Hadoop clusters have to do their work alongside IBM mainframes. Data scientists must somehow get along and work jointly with mere quantitative analysts.

In order to understand this coexistence, we interviewed 20 large organizations in the early months of 2013 about how big data fit into their overall data and analytics environments. Overall, we found the expected coexistence; in not a single one of these large organizations was big data being managed separately from other types of data and analytics. The integration was in fact leading to a new management perspective on analytics, which we’ll call “Analytics 3.0.” In this paper, we’ll describe the overall context for how organizations think about big data, along with the organizational structure and the skills it requires, among other topics. We’ll conclude by describing the Analytics 3.0 era.

Big data may be new for startups and for online firms, but many large firms view it as something they have been wrestling with for a while. Some managers appreciate the innovative nature of big data, but more find it “business as usual” or part of a continuing evolution toward more data. They have been adding new forms of data to their systems and models for many years, and don’t see anything revolutionary about big data. Put another way, many were pursuing big data before big data was big.

“It’s About Variety, not Volume: The survey indicates companies are focused on the variety of data, not its volume, both today and in three years. The most important goal and potential reward of Big Data initiatives is the ability to analyze diverse data sources and new data types, not managing very large data sets.”

One of the great innovations of the past decade or so, little heralded outside the tech press, has been the growth of a huge infrastructure of hardware and software that enables a startup like Instagram to create a simple app and have it serve millions of users from day one, without spending a small fortune on servers, on technicians to maintain them, and on the space to keep them.

Take Pinterest, another ‘hot’ social media app. It handles 18 million users and 410 terabytes of data (that’s over 4 lakh GB) and, as of December 2011, had a team of just 12 people. How do they do it?

Amazon’s Hidden World

Amazon is better known to the vast majority of us as the world’s largest online retailer, but to the tech community it is also the equivalent of an electric utility. Both Instagram and Pinterest installed and ran their software on Amazon’s ‘cloud’ computing platform.

All their data is stored on servers and in data centres (essentially vast warehouses with hundreds if not thousands of servers loaded with hard disks) owned and operated by Amazon and rented to companies like Instagram and Pinterest by the hour.

But Amazon provides not just storage but also applications that companies can run in the ‘cloud’. It’s as if you had nothing but a keyboard, a screen, a mouse, and an internet connection but could run Windows and MS Office without noticing the difference. So ubiquitous has Amazon become that, by one estimate, one in three internet users visits a site run off Amazon’s cloud service at least once every day.

Google to Hadoop

When a user types a search term into Google, the task is farmed out not to one machine but to thousands. As Baseline, a trade magazine, described it in a study of Google’s technology, asking a single person to find every occurrence of a term in a magazine would take a long time. But farm that task out to hundreds of people, each searching through one page, and the time taken to get a result falls sharply.

But the metaphor can be taken further. If one of those people drops out, their work can be reassigned to one of the others. Since commodity hardware can and does fail quite often, Google designs its software specifically to work around those failures.
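The magazine metaphor above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of Google's actual software: the names `search_with_reassignment`, `count_on_page`, and `broken_workers` are hypothetical, threads stand in for separate machines, and a "failure" is simulated by marking some worker IDs as down. Each page is handed to one worker, and if that worker is down the page is reassigned to the next one.

```python
from concurrent.futures import ThreadPoolExecutor


class WorkerFailed(Exception):
    """Raised when a (simulated) worker machine is unavailable."""


def count_on_page(worker_id, page, term, broken_workers):
    """One worker's job: count occurrences of `term` on a single page."""
    if worker_id in broken_workers:
        # Simulated hardware failure (an assumption made for this sketch).
        raise WorkerFailed(f"worker {worker_id} is down")
    return page.lower().count(term.lower())


def search_with_reassignment(pages, term, num_workers, broken_workers=frozenset()):
    """Farm each page out to a worker; reassign the page if that worker fails."""

    def run(indexed_page):
        index, page = indexed_page
        # Try the preferred worker first, then every other worker in turn.
        for offset in range(num_workers):
            worker_id = (index + offset) % num_workers
            try:
                return count_on_page(worker_id, page, term, broken_workers)
            except WorkerFailed:
                continue  # reassign this page to the next worker
        raise RuntimeError("no healthy worker left")

    # Threads play the role of the many people (or machines) in the metaphor.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_page_counts = list(pool.map(run, enumerate(pages)))
    return sum(per_page_counts)


if __name__ == "__main__":
    pages = ["Big data is big", "No match here", "BIG wins"]
    # Workers 0 and 2 are down, yet the search still completes.
    print(search_with_reassignment(pages, "big", num_workers=3, broken_workers={0, 2}))
```

The caller gets the same answer whether or not some workers are down; only when every worker has failed does the search give up.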

In 2006, the then-CIO of Google, Douglas Merrill, told a conference (quoted in Baseline) that at then-prevalent market conditions, “I can get about a 1,000-fold computer power increase at about 33 times lower cost if I go to the failure-prone infrastructure. So if [I] can do that, I will.”

The broad principle is to take a task (like an individual search), break it down into smaller tasks, have hundreds if not thousands of individual computers chew away at those smaller tasks, put the results together and serve them up to the user.
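This split, process, and combine pattern is commonly known as map-reduce, and it is the model Hadoop implements. The sketch below shows the idea on a toy word-count job, assuming everything runs in one process; the function names (`split_into_chunks`, `map_chunk`, `shuffle`, `reduce_groups`, `word_count`) are made up for this illustration and are not Hadoop's API. In a real cluster, each chunk would be processed on a different machine.

```python
from collections import defaultdict


def split_into_chunks(items, num_chunks):
    """Break the big task into roughly equal smaller tasks."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(lines):
    """Map step: one worker emits a (word, 1) pair for every word it sees."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each word's counts end up together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_groups(groups):
    """Reduce step: combine each word's partial counts into one total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines, num_chunks=4):
    chunks = split_into_chunks(lines, num_chunks)
    # Each chunk is independent, so these calls could run on separate machines.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters", "data data"], num_chunks=2))
```

Because no chunk depends on any other, adding more workers shortens the map step, which is the same reason hundreds of readers finish the magazine faster than one.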

Such a brief description doesn’t begin to capture the system that Google built. But the way Google has managed the millions of gigabytes that it stores has inspired a range of other software projects, such as the open source Hadoop, which is specifically used to handle enormous (millions of GB) sets of data. Companies that need to process such volumes of data (such as pharma companies doing drug research) can use Amazon to store all that data and Hadoop to process it.

Hadoop is just one element of the continuing revolution in data management, just as Google Drive, Dropbox, and others represent the consumer side. Expect more radical innovations and more tricky questions.
