Data Mining 2.0: Mine your data like Facebook, Twitter and Yahoo

The Problem

Facebook

750 million active users, 1 in 3 Internet users have a Facebook account
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
Holds 30PB of data for analysis, adds 12 TB of compressed data daily

Twitter

200 million users, 200 million daily tweets
1.6 billion search queries a day
7 TB data for analysis generated daily

Traditional data storage, techniques & analysis tools just do not work at these scales !

MapReduce

Source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/

Yahoo! 4000 node cluster, sorted 1TB random integers in 62 seconds using MapReduce

Apache Hadoop

Open source platform for reliable, scalable, distributed computing
Provides MapReduce framework for distributed processing
Provides a distributed file system for high-throughput data access (HDFS)

Example word count Map code

public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException {
         String line = value.toString();
         StringTokenizer itr = new StringTokenizer(line);
         while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             output.collect(word, one);
        }
    }
}

Example word count Reduce code

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, 
                       OutputCollector<Text, IntWritable> output, 
                       Reporter reporter) throws IOException {
       int sum = 0;
       while (values.hasNext()) {
           sum += values.next().get();
       }
       output.collect(key, new IntWritable(sum));
   }

}

The word counting app

public class WordCount {
   public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}

That seems like a lot of code, so...

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL
Developed and used as primary platform by Facebook
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
Developed and used as primary platform by Yahoo!

Word counting example using Pig

lines = LOAD '../data/words.txt' USING TextLoader() AS (sentence:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
groupedWords = GROUP words BY word;
counts = FOREACH groupedWords GENERATE group, COUNT(words);
STORE counts INTO 'output/wordcounts' USING PigStorage();

Where words.txt is:

Deer Bear River
Car Car River
Deer Car Bear

Generates:

Car     3
Bear    2
Deer    2
River   2

Word counting example using Hive

create table textlines(text string);    

load data local inpath 'C:\work\ClearPoint\Data20\data\words.txt' overwrite into table textlines;

create table words(word string);

insert overwrite table words select explode(split(text, '[ \t]+')) word from textlines;

select word, count(*) from words group by word;

MovieLens sample data data

Most of the following examples use a sample of the MovieLens, movie recommendation data http://www.grouplens.org/node/73. The sample contains 100,000 ratings for 1682 movies by 943 users.

u.data

Ratings in tab separated list of user id, item id, rating, timestamp

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116

u.item

Movie data, | separated list of movie id, movie title, release date, video release date,IMDb URL, unknown, Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir,Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

u.user

User demographic data, | separated list of user id, age, gender, occupation, zip code

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067

Extract simple stats using Pig

users = LOAD '../data/u.user' USING PigStorage('|') AS (userId,age,gender,occupation,zipCode);

allUsers = GROUP users ALL;
stats = FOREACH allUsers GENERATE COUNT(users), AVG(users.age), SUM(users.age);

byGender = GROUP users BY gender;
genderStats = FOREACH byGender GENERATE group, COUNT(users), AVG(users.age);

programmers = FILTER users BY occupation == 'programmer';
progsByAge = GROUP programmers BY age;
progCountsByAge = FOREACH progsByAge GENERATE group AS age, COUNT(programmers) as NumProgs;
progsCountsByAgeSorted = ORDER progCountsByAge BY NumProgs DESC;

STORE stats INTO 'output/stats' USINg PigStorage('\t');
STORE genderStats INTO 'output/genderStats' USING PigStorage('\t');
STORE progsCountsByAgeSorted INTO 'output/progsCountsByAgeSorted' USING PigStorage('\t');

Results of simple stats

output/stats

943 34.05196182396607   32111.0

output/genderStats

F   273 33.81318681318681
M   670 34.149253731343286

output/progsCountsByAgeSorted

Finding the Top 25 rated movies

votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,imdbURL,unknown,
  Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,Horror,Musical,Mystery,Romance,
  SciFi,Thriller,War,Western);

movieVotesGroup = GROUP votes BY itemId;

movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId, 
                                     AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;

moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;

moviesWithVotesSorted = ORDER moviesWithVotes BY movieVotes::avgRating DESC;

top25 = LIMIT moviesWithVotesSorted 25;
STORE top25 INTO 'output/topMovies' USING PigStorage('\t');

The top 25 movies

1653    5.0 1   1653    Entertaining Angels: The Dorothy Day Story (1996)   27-Sep-1996     http://us.imdb.com/M/title-exact?Entertaining%20Angels:%20The%20Dorothy%20Day%20Story%20(1996)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1293    5.0 3   1293    Star Kid (1997) 16-Jan-1998     http://us.imdb.com/M/title-exact?imdb-title-120478  0   0   1   0   1   0   0   0   0   1   0   0   0   0   0   1   0   0   0
1467    5.0 2   1467    Saint of Fort Washington, The (1993)    01-Jan-1993     http://us.imdb.com/M/title-exact?Saint%20of%20Fort%20Washington,%20The%20(1993) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
814 5.0 1   814 Great Day in Harlem, A (1994)   01-Jan-1994     http://us.imdb.com/M/title-exact?Great%20Day%20in%20Harlem,%20A%20(1994)    0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1500    5.0 2   1500    Santa with Muscles (1996)   08-Nov-1996     http://us.imdb.com/M/title-exact?Santa%20with%20Muscles%20(1996)    0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
1201    5.0 1   1201    Marlene Dietrich: Shadow and Light (1996)   02-Apr-1996     http://us.imdb.com/M/title-exact?Marlene%20Dietrich:%20Shadow%20and%20Light%20(1996)%20(TV) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1122    5.0 1   1122    They Made Me a Criminal (1939)  01-Jan-1939     http://us.imdb.com/M/title-exact?They%20Made%20Me%20a%20Criminal%20(1939)   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
1189    5.0 3   1189    Prefontaine (1997)  24-Jan-1997     http://us.imdb.com/M/title-exact?Prefontaine%20(1997)   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1599    5.0 1   1599    Someone Else's America (1995)   10-May-1996     http://us.imdb.com/M/title-exact?Someone%20Else's%20America%20(1995)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1536    5.0 1   1536    Aiqing wansui (1994)    22-Jul-1996     http://us.imdb.com/M/title-exact?Aiqing%20Wansui%20(1994)   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1449    4.625   8   1449    Pather Panchali (1955)  22-Mar-1996     http://us.imdb.com/M/title-exact?Pather%20Panchali%20(1955) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1594    4.5 2   1594    Everest (1998)  10-Mar-1998     http://us.imdb.com/Title?Everest+(1998) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
119 4.5 4   119 Maya Lin: A Strong Clear Vision (1994)  01-Jan-1994     http://us.imdb.com/M/title-exact?Maya%20Lin:%20A%20Strong%20Clear%20Vision%20(1994) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1398    4.5 2   1398    Anna (1996) 13-Nov-1996     http://us.imdb.com/M/title-exact?Anna%20(1996)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1642    4.5 2   1642    Some Mother's Son (1996)    27-Dec-1996     http://us.imdb.com/M/title-exact?Some%20Mother's%20Son%20(1996) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
408 4.491071428571429   112 408 Close Shave, A (1995)   28-Apr-1996     http://us.imdb.com/M/title-exact?Close%20Shave,%20A%20(1995)    0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0
318 4.466442953020135   298 318 Schindler's List (1993) 01-Jan-1993     http://us.imdb.com/M/title-exact?Schindler's%20List%20(1993)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   1   0
169 4.466101694915254   118 169 Wrong Trousers, The (1993)  01-Jan-1993     http://us.imdb.com/M/title-exact?Wrong%20Trousers,%20The%20(1993)   0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
483 4.45679012345679    243 483 Casablanca (1942)   01-Jan-1942     http://us.imdb.com/M/title-exact?Casablanca%20(1942)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   1   0   0   1   0
114 4.447761194029851   67  114 Wallace & Gromit: The Best of Aardman Animation (1996)  05-Apr-1996     http://us.imdb.com/Title?Wallace+%26+Gromit%3A+The+Best+of+Aardman+Animation+(1996) 0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
64  4.445229681978798   283 64  Shawshank Redemption, The (1994)    01-Jan-1994     http://us.imdb.com/M/title-exact?Shawshank%20Redemption,%20The%20(1994) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
603 4.3875598086124405  209 603 Rear Window (1954)  01-Jan-1954     http://us.imdb.com/M/title-exact?Rear%20Window%20(1954) 0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0
12  4.385767790262173   267 12  Usual Suspects, The (1995)  14-Aug-1995     http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995)   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0
50  4.3584905660377355  583 50  Star Wars (1977)    01-Jan-1977     http://us.imdb.com/M/title-exact?Star%20Wars%20(1977)   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   1   0   1   0
178 4.344   125 178 12 Angry Men (1957) 01-Jan-1957     http://us.imdb.com/M/title-exact?12%20Angry%20Men%20(1957)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0

R

R is an open-source language and environment for statistical computing and graphics
The R environment provides:
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hard-copy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques
http://www.r-project.org/

User stats

> users  <- read.table('C:/work/ClearPoint/Data20/data/u.user', header=FALSE, sep='|', 
+ col.names=c('userid','age','gender','occupation','zipcode'));
> summary(users);

    userid           age        gender          occupation     zipcode   
 Min.   :  1.0   Min.   : 7.00   F:273   student      :196   55414  :  9  
 1st Qu.:236.5   1st Qu.:25.00   M:670   other        :105   55105  :  6  
 Median :472.0   Median :31.00           educator     : 95   10003  :  5  
 Mean   :472.0   Mean   :34.05           administrator: 79   20009  :  5  
 3rd Qu.:707.5   3rd Qu.:43.00           engineer     : 67   55337  :  5  
 Max.   :943.0   Max.   :73.00           programmer   : 66   27514  :  4  
                                         (Other)      :335   (Other):909
> table(users$age);     

 7 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
 1  1  1  5  3  6  5 14 18 23 32 27 37 28 33 38 34 35 36 32 39 25 28 26 17 27 21 19 17 22 21 10 21 13 23 
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 68 69 70 73 
15 12 14 20 19 20 20  6 12  4 11  6  9  3  3  9  3  2  3  2  3  1  2  2  3  1

Plots

> plot(users$gender);

Basic Histogram

> hist(users$age);

Lattice - prettier graphs & panels

> library(lattice);
> xyplot(occupation~age,users);

conditioning variable

> xyplot(occupation~age|gender,users);

Age distribution by gender

> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));

Age distribution by gender

> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));

Age distribution by occupation

> histogram(~age|occupation, data=users,col=heat.colors(10));

Gender distribution by occupation

> histogram(~gender|occupation, data=users, col=heat.colors(2));

Are newer movies better ?

First prep data using Pig:

votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,
         imdbURL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,
         Horror,Musical,Mystery,Romance,SciFi,Thriller,War,Western);
movieVotesGroup = GROUP votes BY itemId;
movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId, 
                                     AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;
moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;
movieRatings = FOREACH moviesWithVotes GENERATE movieVotes::movieId, movieVotes::avgRating, 
               movieVotes::numVotes, REGEX_EXTRACT(movies::releaseDate,'(.*)-(.*)-(.*)',3), movies::movieTitle;
STORE movieRatings INTO 'output/movieRatings' USING PigStorage('\t');

output/movieRatings:

1   3.8783185840707963  452 1995    Toy Story (1995)
2   3.2061068702290076  131 1995    GoldenEye (1995)
3   3.033333333333333   90  1995    Four Rooms (1995)
4   3.550239234449761   209 1995    Get Shorty (1995)
5   3.302325581395349   86  1995    Copycat (1995)

Load data into R

> ratings  <- read.table('C:/work/ClearPoint/Data20/pig/output/movieRatings/part-r-00000', header=FALSE, sep='\t', 
            col.names=c('movieid','avgrating','numvotes','year','title'));

> summary(ratings); 

    movieid         avgrating        numvotes           year                                title    
 Min.   :   1.0   Min.   :1.000   Min.   :  1.00   Min.   :1930   Butcher Boy, The (1998)      :  2  
 1st Qu.: 326.0   1st Qu.:2.750   1st Qu.: 11.00   1st Qu.:1989   Desperate Measures (1998)    :  2  
 Median : 697.0   Median :3.278   Median : 42.00   Median :1994   Money Talks (1997)           :  2  
 Mean   : 710.7   Mean   :3.171   Mean   : 73.93   Mean   :1988   Substance of Fire, The (1996):  2  
 3rd Qu.: 984.0   3rd Qu.:3.741   3rd Qu.:101.00   3rd Qu.:1996   1-900 (1994)                 :  1  
 Max.   :1682.0   Max.   :5.000   Max.   :583.00   Max.   :1998   12 Angry Men (1957)          :  1  
                                                   NA's   :   1   (Other)                      :803

Looks like we have some issues, source data has duplicate movies (with different movie Ids) and we are missing a year...

Is there a correlation between movie age and average rating?

> smoothScatter(ratings$year,ratings$avgrating);

Doesn't look like it....

Calculating the correlation

> cor(ratings$year,ratings$avgrating, use='complete.obs');
[1] -0.206226

Looks like there is some correlation

> cor(ratings$year,ratings$numvotes, use='complete.obs');
[1] -0.001756821

Only a very, very weak correlation between year and number of votes

> cor(ratings$avgrating,ratings$numvotes, use='complete.obs');
[1] 0.468041

Interesting there is quite a strong correlation between average rating and number of votes. Wonder why?

Grouping by decade might help...

> ratings$decade <- trunc(ratings$year / 10) * 10;
> histogram(~avgrating|as.factor(decade),ratings,layout=c(1, 7));

Not only structured data

REGISTER lib/piggybank.jar;
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE HourExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('dd/MMM/yyyy:HH:mm:ss Z','yyyy-MM-dd HH:00','Pacific/Auckland');

accesslog = LOAD '../data/access.log' USING LogLoader  AS (remoteAddr, remoteLogname, user, time, method,
                                                           uri, proto, status, bytes, referer, userAgent);
byHour = GROUP accesslog BY HourExtractor(time);

hourCounts = FOREACH byHour GENERATE group as hourOfDay, COUNT(accesslog) as numberOfHits;
hourCountsOrdered = ORDER hourCounts by hourOfDay;

STORE hourCountsOrdered INTO 'output/hitsbyhour' using PigStorage('\t');

output/hitsbyhour

2011-08-15 20:00    47
2011-08-15 21:00    9
2011-08-15 22:00    28
2011-08-15 23:00    48
2011-08-16 00:00    22
2011-08-16 01:00    74
2011-08-16 02:00    34
2011-08-16 03:00    27
2011-08-16 04:00    38
2011-08-16 05:00    32
2011-08-16 06:00    20
2011-08-16 07:00    33

Data Mining 2.0: Mine your data like Facebook, Twitter and Yahoo

The Problem

Facebook

Twitter

Traditional data storage, techniques & analysis tools just do not work at these scales !

MapReduce

Apache Hadoop

Example word count Map code

Example word count Reduce code

The word counting app

That seems like a lot of code, so...

Word counting example using Pig

Word counting example using Hive

MovieLens sample data data

u.data

u.item

u.user

Extract simple stats using Pig

Results of simple stats

output/stats

output/genderStats

output/progsCountsByAgeSorted

Finding the Top 25 rated movies

The top 25 movies

R

User stats

Plots

Basic Histogram

Lattice - prettier graphs & panels

conditioning variable

Age distribution by gender

Age distribution by gender

Age distribution by occupation

Gender distribution by occupation

Are newer movies better ?

Load data into R

Is there a correlation between movie age and average rating?

Calculating the correlation

Grouping by decade might help...

Not only structured data

That's all folks

Questions?