Friday, January 23, 2015

Data Systems

Generally, all data systems have the components below:

1. Data Collection:
Enables feeding data into the system: copying log files, syncing with an external database, and so on.

2. Data Storage
Stores the data. The storage can be plain disk files or a database.

3. Processing
Processes the large volume of data, whether it sits on disk or in a database.

4. Querying
Fetches data from the data storage with appropriate filters.

5. Reporting
Displays, visualizes, or generates reports from information derived from the data.

Data Systems components

What differentiates BigData systems from traditional data systems is scalability. Traditional systems satisfied the requirements of a past era, when a single database installation could serve a whole organization. In the age of the internet, such systems cannot process web-scale data.

Difference b/w Traditional data systems and BigData systems

BigData systems:

There are many BigData systems available; below is a list of a few well-known ones.

1. Hadoop: open-source Apache project; Yahoo! was a major early contributor and adopter
2. Cosmos: Microsoft's internal BigData system
3. Oracle Big Data Appliance: a hardware-and-software combo from Oracle

Among these, Hadoop is the best-known open-source BigData ecosystem.

Thursday, June 19, 2014

Introduction to BigData

The "Big" in BigData literally signifies the size of the data at hand. For example:

1. Data generated by billions of likes from Facebook users every day
2. Data generated by millions of tweets

This activity results in a huge amount of back-end data, sizing up to terabytes; hence the name BigData. Along with size, the rate at which the data is generated is also high, and so is the variety: it may be unstructured log files or structured records. So the 3 V's, or properties, that describe BigData are Volume, Velocity, and Variety. Although BigData pundits define it with 3 V's, the first 'V', i.e. Volume, is by itself enough for data to be called BigData; Velocity and Variety are a result of the source of the data and should not be a prerequisite for considering a BigData solution.

Let's look at processing BigData:

Processing BigData requires "Divide and Conquer" approach.

For example, suppose we have 1 K (1,000) records to search for a specific word (grep), and checking 1 K lines takes around 1 second. How long would it take to grep 1 billion lines?

With a single dedicated system, it will take 1,000,000 seconds to process 1,000,000,000 (1 billion) records, i.e. approximately 11.5 days.
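The arithmetic above can be checked with a quick back-of-the-envelope calculation, assuming the same fixed grep rate of 1,000 lines per second:

```python
# Back-of-the-envelope timing for sequential grep at a fixed rate.
lines = 1_000_000_000          # 1 billion lines to search
rate = 1_000                   # lines grepped per second on one machine

seconds = lines // rate        # total time on a single system
days = seconds / (60 * 60 * 24)

print(seconds)                 # 1000000
print(round(days, 2))          # 11.57
```

So a single machine spends roughly a million seconds, a little over eleven and a half days, on the job.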

So what? We can wait for 11 days; what's the issue? The problem is that in those 11 days, we would have generated billions more records to grep. This is a never-ending chase, as more data is generated continuously.

What's the solution?

1. Build a super-fast supercomputer, so grepping is faster than the data generation
2. Set up multiple computers to process the 1 billion lines and parallelize the processing, so grepping is faster than the data generation
3. If the requirement is only to count the lines containing the specified word, take a statistical shortcut: grep only 100,000 lines for the specific word, then extrapolate the count to 1 billion lines

Currently option 2 is a lot cheaper than option 1, and more certain than option 3. Also, option 3 applies only when we need counts or averages; it is not suitable for other kinds of text processing.
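Option 3, the statistical shortcut, can be sketched as below. This is a hypothetical helper, assuming matching lines are spread roughly uniformly through the data; the function name and parameters are illustrative, not from any real tool.

```python
import random

def estimated_match_count(lines, word, sample_size=100_000, seed=42):
    """Estimate how many lines contain `word` by grepping only a
    random sample, then extrapolating to the full data set."""
    random.seed(seed)  # fixed seed so the sketch is reproducible
    sample = random.sample(lines, min(sample_size, len(lines)))
    matches = sum(1 for line in sample if word in line)
    # Scale the sample's match fraction up to the whole data set.
    return int(matches / len(sample) * len(lines))
```

The estimate gets better with a larger sample, but as noted above, this only yields counts, not the matching lines themselves.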

To implement option 2, the steps below form a golden template, the Divide and Conquer method, for handling such processing.

1. Set up multiple systems in a network to process the BigData
2. Divide the big job into smaller job parts
3. Distribute the job parts to the individual systems
4. Process the individual job parts
5. Collate the individual results
6. Sum up / roll up the results

In short, BigData processing is the application of parallel processing when the data at hand is huge, unstructured, and continuously increasing, and hence cannot be processed using traditional single-instance tools due to size and time limitations.
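The six-step template above can be sketched in miniature in Python. Here a thread pool stands in for the cluster of networked machines, and all names are illustrative rather than part of any real framework:

```python
# A minimal sketch of the Divide and Conquer template for counting
# lines that contain a word. A thread pool stands in for the cluster.
from concurrent.futures import ThreadPoolExecutor

def grep_part(chunk, word):
    """Step 4: process one job part on one 'system'."""
    return sum(1 for line in chunk if word in line)

def parallel_grep_count(lines, word, parts=4):
    # Steps 1-2: divide the big job into smaller job parts
    size = max(1, len(lines) // parts)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Step 3: distribute the job parts to the workers
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial = list(pool.map(lambda c: grep_part(c, word), chunks))
    # Steps 5-6: collate the individual results and roll them up
    return sum(partial)
```

On a real cluster the chunks would be files shipped to separate machines, but the shape of the computation is the same.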

When the above Divide and Conquer template is applied to grepping a billion lines, it works out as below.

1. Set up 100 systems in a network
2. Split the 1 billion records into 100 files, each with 10,000,000 (10 million) records
3. Now that we have 100 systems and 100 files, copy one file to each system
4. Start grep on the copied file on all 100 systems, storing each result in a temporary file
5. Copy the temporary files generated in step 4 to one central server
6. On the central server, merge all the temporary files. The grep of 1 billion lines is complete

Ignoring the work/time overhead involved in splitting, copying, and merging, we completed the grep of 1 billion lines in under 3 hours, a job that would have taken around 11.5 days on a single system!
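The "under 3 hours" figure follows from the same fixed grep rate assumed earlier:

```python
# Timing for the 100-system setup, assuming the same grep rate as before.
lines = 1_000_000_000      # 1 billion lines in total
rate = 1_000               # lines grepped per second per machine
machines = 100

per_machine = lines // machines     # 10 million lines on each system
seconds = per_machine // rate       # all machines run in parallel
hours = seconds / 3600

print(seconds)                      # 10000
print(round(hours, 2))              # 2.78
```

Each system greps its 10 million lines in about 10,000 seconds, so the whole job finishes in roughly 2.8 hours.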

Apart from these steps, we also need to ensure a few things, listed below.

1. Since the data is huge, our file system should support it, and it should be accessible from all the systems in the network
2. Since our systems are not specialised and are low cost, they are prone to failure; we should handle the failure of individual systems and also make sure there is no data loss

The above template can be used for processing data in many scenarios, not just grep; simply follow the steps of splitting the job, distributing, processing, and collating. To simplify and automate these steps, we have frameworks like Hadoop, which automate most of the common steps (splitting, distributing, and collating the results), so users of the framework need to work only on the processing of the data.

More on Hadoop:

1. It has a distributed file system (called HDFS) which replicates data across machines to guard against data loss while storing and retrieving terabytes of data
2. It has databases (such as HBase) which use HDFS to store their data, and hence provide a way to store structured data running to billions or trillions of rows
3. It has the MapReduce framework, which can divide jobs, distribute them, and collate the results for the final rollup/sum-up. The MapReduce framework also handles individual system failures by keeping track of which job part is assigned to which system and reassigning parts to different systems on failure
4. It has an SQL-like query engine named Hive to retrieve data from HDFS or databases like HBase. Hive internally schedules MapReduce jobs to perform the task
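The MapReduce flow described in point 3 can be illustrated with a toy, in-memory word count. This only mirrors the programming model (map each line to key/value pairs, shuffle by key, reduce per key); a real Hadoop job runs these phases across HDFS blocks and cluster nodes.

```python
# Toy in-memory simulation of the MapReduce model: map, shuffle, reduce.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)

def reduce_phase(key, values):
    """Reduce: roll up all the counts emitted for one key."""
    return (key, sum(values))

def mapreduce_wordcount(lines):
    shuffled = defaultdict(list)
    for line in lines:                      # map over every input line
        for key, value in map_phase(line):
            shuffled[key].append(value)     # shuffle: group values by key
    # reduce each key's group and collate into the final result
    return dict(reduce_phase(k, v) for k, v in shuffled.items())
```

In a real cluster, the map calls run on the machines holding the data, the shuffle moves pairs over the network, and the reduce calls run on yet other machines; the framework handles all of that plumbing.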

Hadoop MapReduce is a batch/bulk data-processing framework: you need the complete data at hand before processing begins. This does not suit many real-life scenarios where real-time results are needed. For example, Hadoop MapReduce can be great for finding fraud in banking systems by crunching the billions of transactions done on the previous day, but it is not suited when the requirement is to stop fraud in real time, such as blocking spammers abusing the system as it happens.

As an alternative to Hadoop MapReduce, frameworks like Twitter's Storm allow processing of data in real time.

General Applications of BigData:

Most applications of BigData derive insights, i.e. analytics of some form. The general perception is that BigData means analytics and vice versa; however, we can apply the above pattern to use cases that are not analytics. One such non-analytical BigData deployment was done to maintain the huge mail system at my current company. Millions of mails need to be archived, deleted, or moved to different storage every day. The data at hand is huge (amounting to tens of TB every day) and cannot be acted upon in the traditional way; it has to be distributed to achieve the operational targets. We applied the pattern discussed above and achieved it using a custom framework instead of Hadoop MapReduce.

BigData can also be used in system security. A mail security system was deployed to keep a tab on compromised accounts and to keep spammers/abusers in check, again using a custom framework. However, this security system uses some kind of analytics involving users' usage patterns/behaviour, and hence would be categorised as "analytics of some form".