Monday, August 3, 2015

HOW TO FILTER RECORDS - PIG TUTORIAL EXAMPLES

Pig lets you remove unwanted records from a relation based on a condition. The FILTER operator does this, much like the WHERE clause in SQL. The syntax of the FILTER operator is shown below:
<new relation> = FILTER <relation> BY <condition>

Here <relation> is the data set on which the filter is applied, <condition> is the filter condition, and <new relation> is the relation created from the rows that satisfy the condition.

Pig Filter Examples: 

Let's consider the sales data set below as an example:
year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500 
2002, iphone, 2000
2000, nokia,  1200
2001, nokia,  1500
2002, nokia,  900

1. Select products whose quantity is greater than or equal to 1000.
grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int,product:chararray,quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)

2. Select products whose quantity is greater than 1000 and whose year is 2001.
grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)

3. Select products whose year is not 2000.
grunt> D = FILTER A BY year != 2000;
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)

You can use the logical operators (NOT, AND, OR) and the relational operators (<, >, ==, !=, >=, <=) in filter conditions.
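
As a rough sketch of how these operators combine, the example below reuses relation A from example 1 and groups the sub-conditions with parentheses. The relation name E is only illustrative, and the output shown assumes the six sample records are loaded exactly as in example 1.
grunt> -- E is an illustrative name; A is the relation loaded in example 1
grunt> E = FILTER A BY (year == 2000 OR year == 2002) AND NOT (quantity < 1000);
grunt> DUMP E;
(2000,iphone,1000)
(2002,iphone,2000)
(2000,nokia,1200)

The parentheses make the grouping explicit instead of relying on operator precedence, which keeps longer conditions easier to read.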
