================================================================ CSE 344 Lecture 27: Pig Latin examples ================================================================ To try the examples, spin-up an Amazon Elastic MapReduce cluster with one instance. On your master node, create a file called: /tmp/mydata.txt that has the following content FlashyThing,Electronics,50 Gizmo,Gadget,20 SuperGizmo,Gadget,100 Camera,Electronics,250 Umbrella,Misc,10 Towel,Bathroom,15 1 - Write a script that counts the number of items in each category. For categories with more than 1 item, it displays the category name and the number of items. raw = LOAD '/tmp/mydata.txt' USING PigStorage(',') AS (product, category, price); A = GROUP raw BY category; B = FOREACH A GENERATE $0 as category, COUNT($1) as nbitems; C = FILTER B BY nbitems > 1; STORE C INTO '/tmp/myresult.txt' USING PigStorage(); 2 - Write a script that computes the sum of all prices in each category. raw = LOAD '/tmp/mydata.txt' USING PigStorage(',') AS (product, category, price); A = GROUP raw BY category; B = FOREACH A GENERATE $0 as category, SUM($1.price) as totalprice; STORE B INTO '/tmp/myresult.txt' USING PigStorage(); 3 - Write a script equivalent to the following query: SELECT category, AVG(price) FROM Product WHERE price > 10 GROUP By category HAVING COUNT(*) > 1 raw = LOAD '/tmp/mydata.txt' USING PigStorage(',') AS (product, category, price); good_products = FILTER raw BY price > 10; groups = GROUP good_products BY category; big_groups = FILTER groups BY COUNT(good_products) > 1; result = FOREACH big_groups GENERATE $0, AVG($1.price); STORE result INTO '/tmp/myresult.txt' USING PigStorage(); 4 - Write a script that finds pairs of products in the same category raw = LOAD '/tmp/mydata.txt' USING PigStorage(',') AS (product,category, price); raw2 = FOREACH raw GENERATE product as product2, category as category2, price as price2; join_result = JOIN raw BY category, raw2 BY category2; filtered_pairs = FILTER join_result BY product != product2; result = FOREACH filtered_pairs generate product, product2; STORE result INTO '/tmp/myresult.txt' USING PigStorage();