Friday, 13 May 2016

Spark Training Bangalore

What is Spark?

Spark is an open-source project with a large community, and it is the most active Apache project at the moment.
Spark provides a general data processing platform. It runs programs up to 100 times faster in memory, or 10 times faster on disk, than Hadoop. In 2014, Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3 times faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.

CODE SAMPLE:
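// Word count; assumes sparkContext is an existing SparkContext
sparkContext.textFile("hdfs://...")              // read the input as lines
            .flatMap(line => line.split(" "))    // split each line into words
            .map(word => (word, 1))              // pair every word with a count of 1
            .reduceByKey(_ + _)                  // sum the counts per word
            .saveAsTextFile("hdfs://...")        // write the result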
Spark Core:
Spark Core is the engine for large-scale distributed and parallel data processing. It is responsible for:
·         memory management.
·         monitoring jobs on a cluster.
Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.
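The two ways of creating an RDD can be sketched roughly as follows. This is only an illustration, assuming Spark runs in local mode; the object name RddCreationSketch and the sample numbers are made up for the example, and the external path is left as a placeholder, so that line is commented out.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  // Distributes a driver-side collection and returns how many elements the RDD holds.
  def run(): Long = {
    // Assumption: local mode, using all available cores.
    val conf = new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // 1. Distribute a collection that lives in the driver program.
      val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // 2. Load an external dataset. The path is a placeholder, so this stays commented out.
      // val lines = sc.textFile("hdfs://...")

      numbers.count()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```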
RDDs support two types of operations:
·         Transformations are operations (such as map or filter) performed on an RDD that yield a new RDD containing the result.
·         Actions are operations (such as reduce, count or first) that return a value after running a computation on an RDD.
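The difference between the two kinds of operations shows up in the return types. The following is a small sketch, assuming local mode; the object name and the numbers are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformationsVsActions {
  // Returns (number of even squares, sum of even squares) for the numbers 1 to 10.
  def run(): (Long, Int) = {
    val conf = new SparkConf().setAppName("TransformationsVsActions").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val numbers: RDD[Int] = sc.parallelize(1 to 10)

      // Transformations: each call returns a new RDD.
      val evens: RDD[Int] = numbers.filter(_ % 2 == 0)
      val squares: RDD[Int] = evens.map(n => n * n)

      // Actions: each call runs the computation and returns a plain value to the driver.
      val howMany: Long = squares.count()
      val total: Int = squares.reduce(_ + _)
      (howMany, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```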
Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., file) on which the operation is to be performed. The transformations are only actually computed when an action is called, and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file is transformed in various ways and then passed to the first() action, Spark only processes and returns the result for the first line, rather than doing the work for the entire file.
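One way to observe this laziness is to count how often the function passed to map actually runs. The sketch below assumes Spark 2.0 or later (for longAccumulator) and local mode; the object name, the accumulator name and the data are made up. Accumulator updates inside transformations are not guaranteed to be exact if tasks are retried, so the counter is only a rough probe.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before any action, result of first(), map calls seen after first()).
  def run(): (Long, Int, Long) = {
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("mapCalls") // counts invocations of the map function

      // Defining the transformation does not run it.
      val doubled = sc.parallelize(1 to 1000000, 4).map { n =>
        calls.add(1)
        n * 2
      }
      val before: Long = calls.value // no action yet, so nothing has been computed

      // The action triggers only as much work as it needs.
      val firstValue: Int = doubled.first()
      val after: Long = calls.value // expected to be far below 1,000,000

      (before, firstValue, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```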
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
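A minimal sketch of caching, assuming local mode; the object name and the word list are hypothetical. Calling cache only marks the RDD for in-memory storage, and the first action is what actually fills the cache.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Returns (number of words, total number of characters).
  def run(): (Long, Int) = {
    val conf = new SparkConf().setAppName("CacheSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

      // Mark the transformed RDD to be kept in memory; nothing is computed yet.
      val lengths = words.map(_.length).cache()

      // The first action computes the RDD and stores its partitions.
      val howMany: Long = lengths.count()

      // Later actions can reuse the stored partitions instead of re-running the map.
      val total: Int = lengths.reduce(_ + _)

      (howMany, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```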
Spark SQL:
Spark SQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to combine SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query:
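As a rough sketch of a SQL query combined with a code transformation, assuming Spark 2.0 or later (where SparkSession is the entry point) and local mode. The people view, its columns, the sample rows and the object name are all hypothetical. The query runs against an in-memory temporary view so that no Hive installation is needed; querying real Hive tables would additionally need enableHiveSupport() on the builder and the Hive classes on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  // Returns the upper-cased names of everyone aged 21 or over.
  def run(): Array[String] = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical in-memory data registered as a temporary view.
      val people = Seq(("Asha", 34), ("Ravi", 19), ("Meena", 27)).toDF("name", "age")
      people.createOrReplaceTempView("people")

      // The SQL part of the pipeline.
      val adults = spark.sql("SELECT name, age FROM people WHERE age >= 21")

      // The code part: an ordinary transformation applied to the query result.
      adults.map(row => row.getString(0).toUpperCase).collect()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```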
Training:


Peopleclick is one of the leading IT training institutes providing Spark training in Bangalore. The trainers at Peopleclick are all working professionals. After completing the Spark training, candidates get placed in MNCs. The trainers also provide live project training in Bangalore, are very supportive, and guide candidates throughout the course. For more information, please visit: www.hadooptrainingbangalore.com/spark-training-bangalore
