Wednesday, March 28, 2012

Big Data

What is Big Data?

Previously, we thought of data as a set of relational databases neatly packed into tables.

Today, data comes from the internet, social networking sites, online shopping sites, mobile devices, and text messaging, just to name a few sources.  Where does this data go?  How can it be stored, indexed, packaged, and queried, and how can any one technology keep up with the volume?  All of these questions have led to the term "Big Data".

Wikipedia defines Big Data as "data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set."

What are "NoSQL" databases? Why are they important?

NoSQL databases are structured quite differently from traditional relational databases. Key characteristics of a NoSQL database include:

  • Easy to use in conventional load-balanced clusters
  • Persistent data (not just caches)
  • Scale to available memory
  • Have no fixed schemas and allow schema migration without downtime
  • Have individual query systems rather than using a standard query language
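To make the "no fixed schema" and "individual query system" points concrete, here is a minimal sketch of a schemaless document store in Python. It is an illustration of the idea only, not the API of any real NoSQL product; the class and field names are invented.

```python
# Minimal sketch of a schemaless document store: records are plain
# dictionaries, so two documents in the same store can carry completely
# different fields -- no fixed schema and no migration step needed.
class DocumentStore:
    def __init__(self):
        self.docs = {}          # key -> document (a dict)

    def put(self, key, doc):
        self.docs[key] = doc    # a real store would also persist to disk

    def get(self, key):
        return self.docs.get(key)

    def find(self, predicate):
        # "Individual query system": a filter function instead of SQL.
        return [d for d in self.docs.values() if predicate(d)]


store = DocumentStore()
store.put("u1", {"name": "Ann", "followers": 120})
store.put("u2", {"name": "Bo", "city": "Oslo"})   # different fields, same store
popular = store.find(lambda d: d.get("followers", 0) > 100)
```

Note how adding a new field to one document requires no schema change and no downtime for the rest of the data.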

Three key drivers have created an interest in Big Data and NoSQL:
  • Data from Social Networking and Web 2.0 sites
  • Data changes over time, and many data models don't evolve to keep pace with those changes
  • NoSQL technology is becoming a commodity, so anyone can get it and use it relatively easily

The following table and website compare the different companies that are offering NoSQL technologies, including Amazon Web Services, Google Bigtable, and MongoDB.


What is Hadoop?
A current leader in NoSQL technology for handling Big Data is the Apache™ Hadoop™ project, which develops open-source software for reliable, scalable, distributed computing.

Hadoop's software library allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
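That "simple programming model" is MapReduce: a map step emits key/value pairs from each record, and a reduce step combines the values for each key. The sketch below simulates both phases locally in plain Python; on a real cluster, Hadoop runs the map work in parallel across machines and shuffles the intermediate pairs by key before reducing.

```python
from collections import defaultdict

# Sketch of the MapReduce model: count words across "distributed" input.
def map_phase(lines):
    # Map: each input line yields (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    # (Hadoop would shuffle/sort the pairs by key between the phases.)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big clusters", "data everywhere"]
result = reduce_phase(map_phase(lines))
# result == {"big": 2, "data": 2, "clusters": 1, "everywhere": 1}
```

Because map and reduce are independent, pure functions of their input, the framework can split the data across hundreds of machines without changing the program.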

What is Pig?

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records.
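The kind of dataflow a Pig Latin script expresses, such as load, filter, group, and aggregate, can be sketched with ordinary Python; the record fields here are invented for illustration, and the comments show the rough Pig Latin counterpart of each step.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical page-view records, standing in for a loaded data set.
records = [
    {"user": "ann", "site": "shop", "ms": 120},
    {"user": "bo",  "site": "shop", "ms": 340},
    {"user": "ann", "site": "blog", "ms": 80},
]

# Roughly: FILTER records BY ms > 100;
slow = [r for r in records if r["ms"] > 100]

# Roughly: GROUP slow BY site; FOREACH group GENERATE site, COUNT(slow);
slow.sort(key=itemgetter("site"))           # groupby needs sorted input
counts = {site: len(list(g))
          for site, g in groupby(slow, key=itemgetter("site"))}
# counts == {"shop": 2}
```

In Pig, each of those steps would be one Pig Latin statement, and the platform would compile the whole pipeline into MapReduce jobs over the cluster.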

What is Hive?


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
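HiveQL reads much like ordinary SQL. As a rough analogue only, the snippet below runs the same kind of summarization query against an in-memory SQLite database; the table and column names are invented, and Hive itself would compile such a query into MapReduce jobs over files in HDFS rather than execute it locally.

```python
import sqlite3

# SQL-like summarization in the spirit of HiveQL (SQLite used as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (site TEXT, ms INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("shop", 120), ("shop", 340), ("blog", 80)])

# Ad-hoc summary: hit count and average response time per site.
rows = conn.execute(
    "SELECT site, COUNT(*), AVG(ms) FROM page_views "
    "GROUP BY site ORDER BY site"
).fetchall()
# rows == [('blog', 1, 80.0), ('shop', 2, 230.0)]
```

The appeal of Hive is exactly this: analysts who already know SQL can summarize Hadoop-scale data without writing MapReduce code by hand.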

Big Data is a growing problem and a growing opportunity.  With so much data being transmitted through so many different means, companies are constantly looking for ways to manage the increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources) of data.





