Quick Hit: Common Ways to Interact with Hadoop

MapReduce: geniuses only. If you are on this page, read the next option!

Pig: Short for Pig Latin. Allows you to query Hadoop like SQL. Developed by Yahoo. Easy to learn.

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
 -- Extract words from each line and put them into a pig bag
 -- datatype, then flatten the bag to get one word on each row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 -- filter out any words that are just white spaces
 filtered_words = FILTER words BY word MATCHES '\w+';
 -- create a group for each word
 word_groups = GROUP filtered_words BY word;
 -- count the entries in each group
 word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
 -- order the records by count
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

Hive: originally built by Facebook, a social networking site (you knew you would learn something). It has a SQL-like language called HiveQL. The queries are translated into MapReduce, Tez, or Spark jobs.

2 CREATE TABLE docs (line STRING);
4 CREATE TABLE word_counts AS
5 SELECT word, count(1) AS count FROM
6  (SELECT explode(split(line, 's')) AS word FROM docs) temp
7 GROUP BY word
8 ORDER BY word;

Oozie: an orchestration framework that allows you to string together different MapReduce, Pig, and Hive jobs.




Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s