Monday, August 3, 2015

CREATING SCHEMA, READING AND WRITING DATA - PIG TUTORIAL

The first step in processing a data set with Pig is to define a schema for it. A schema describes the data set in terms of its fields. Let's see how to define a schema with an example.

Consider the following products data set in Hadoop as an example:
10,iphone,1000
20,samsung,2000
30,nokia,3000

Here the first field is the product id, the second field is the product name and the third field is the product price.

Defining Schema: 

The LOAD operator is used to read a data set and define its schema. Let's see the different ways of using the LOAD operator to define a schema for the above data set.

1. Creating Schema without specifying any fields. 

In this method, we don't specify any field names for creating the schema. An example is shown below:
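grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',');

Pig is a data flow language. Each statement in Pig consists of a relation and an operation: the left side of the statement is the relation and the right side is the operation. Pig statements must be terminated with a semicolon. Here A is the relation and /user/hadoop/products is the path of the file in HDFS. USING PigStorage(',') tells Pig that the fields are separated by commas; without it, Pig expects tab-separated fields and would read each line of this file as a single field.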

To view the schema of a relation, use the describe statement which is shown below:
grunt> describe A;
Schema for A unknown.

As no fields are defined, the above describe statement on A reports "Schema for A unknown". To display the contents on the console, use the DUMP operator.
grunt> DUMP A;
(10,iphone,1000)
(20,samsung,2000)
(30,nokia,3000)

To write the data set into HDFS, use the STORE operator as shown below:
grunt> STORE A INTO 'hadoop directory name';

2. Defining schema without specifying any data types. 

We can create a schema just by specifying the field names without any data types. An example is shown below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id, product_name, price);

grunt> describe A;
A: {id: bytearray,product_name: bytearray,price: bytearray}

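grunt> STORE A INTO '/user/hadoop/products_pipe' USING PigStorage('|'); -- Writes the data into HDFS with pipe as the delimiter. products_pipe is only an example name; STORE fails if the output directory already exists, so it cannot be the input directory.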

PigStorage is used to specify the field delimiter. The default field delimiter is tab, so if your data is tab separated you can omit the USING PigStorage clause. In the STORE operation, you can use PigStorage to specify the output separator.

The field names are specified in the AS clause. As we didn't specify any data types, Pig assigned bytearray, its default data type, to every field.
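
As a small sketch of the default delimiter, the statements below load a tab-separated file and store it again without any USING clause. The paths /user/hadoop/products_tab and /user/hadoop/products_tab_out are made-up names for illustration and are not part of the example above.

```pig
-- Assumption: products_tab is a tab-separated copy of the products data.
-- No USING clause is needed because tab is the default delimiter.
A = LOAD '/user/hadoop/products_tab' AS (id, product_name, price);

-- Without USING, STORE also writes tab-separated output.
-- The output directory is an example name and must not already exist.
STORE A INTO '/user/hadoop/products_tab_out';
```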

3. Defining schema with field names and data types. 

To specify a data type, follow the field name with a colon and the type. Take a look at the example below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);

grunt> describe A;
A: {id: int,product_name: chararray,price: int}
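
Each field gets its own type, so the types can be chosen to fit the data. As an illustration only, assuming the prices could contain fractional values, the price field might be declared as a double instead of an int:

```pig
-- Same file and field names as above; only the type of price differs.
A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:double);
describe A;
-- A: {id: int,product_name: chararray,price: double}
```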

Accessing the Fields: 

So far, we have seen how to define a schema, how to print the contents of the data on the console and how to write data to HDFS. Now we will see how to access the fields.

The fields can be accessed in two ways: 

  • Field Names: We can use the name of a field to access the values of that particular field.
  • Positional Notation: Field positions start at 0. $0 indicates the first field, $1 indicates the second field, and so on.

Example:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> B = FOREACH A GENERATE id;
grunt> C = FOREACH A GENERATE $1,$2;
grunt> DUMP B;
(10)
(20)
(30)
grunt> DUMP C;
(iphone,1000)
(samsung,2000)
(nokia,3000)

FOREACH works like a for loop that iterates over the records of a relation. The GENERATE keyword specifies what to produce for each record. In the above example, GENERATE is used to pick fields from the relation A.
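
Field names and positions can also be mixed in one GENERATE, and a generated field can be renamed with AS or computed from an expression. The sketch below is one way to do this; the relation name D and the aliases product_id and double_price are chosen here only for illustration.

```pig
A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);

-- $0 is the id field; AS gives it a new name in D.
-- price is an int, so price * 2 is also an int.
D = FOREACH A GENERATE $0 AS product_id, product_name, price * 2 AS double_price;

describe D;
-- D: {product_id: int,product_name: chararray,double_price: int}
```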

Note: It is always good practice to check the schema of a relation using the describe statement before performing an operation. By knowing the schema, you will know how to access the fields in the relation.
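
For instance, when describe reports that the schema is unknown, there are no field names to use, so only positional access is available. A minimal sketch of giving such a relation names and types afterwards is shown below; the relation name B is only an example.

```pig
-- No AS clause, so describe A reports: Schema for A unknown.
A = LOAD '/user/hadoop/products' USING PigStorage(',');

-- Use positions, then cast and name the fields explicitly.
B = FOREACH A GENERATE (int)$0 AS id, (chararray)$1 AS product_name;

describe B;
-- B: {id: int,product_name: chararray}
```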
