Data Mining 2.0: Mine your data like Facebook, Twitter and Yahoo

The Problem

Facebook

Twitter

Traditional data storage, techniques & analysis tools just do not work at these scales !

MapReduce

Source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/

Source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/

Yahoo! 4000 node cluster, sorted 1TB random integers in 62 seconds using MapReduce

Apache Hadoop

Example word count Map code

public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException {
         String line = value.toString();
         StringTokenizer itr = new StringTokenizer(line);
         while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             output.collect(word, one);
        }
    }
}

Example word count Reduce code

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, 
                       OutputCollector<Text, IntWritable> output, 
                       Reporter reporter) throws IOException {
       int sum = 0;
       while (values.hasNext()) {
           sum += values.next().get();
       }
       output.collect(key, new IntWritable(sum));
   }

}   

The word counting app

public class WordCount {
   public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}   

That seems like a lot of code, so...

Word counting example using Pig

lines = LOAD '../data/words.txt' USING TextLoader() AS (sentence:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
groupedWords = GROUP words BY word;
counts = FOREACH groupedWords GENERATE group, COUNT(words);
STORE counts INTO 'output/wordcounts' USING PigStorage();

Where words.txt is:

Deer Bear River
Car Car River
Deer Car Bear

Generates:

Car     3
Bear    2
Deer    2
River   2

Word counting example using Hive

create table textlines(text string);    

load data local inpath 'C:\work\ClearPoint\Data20\data\words.txt' overwrite into table textlines;

create table words(word string);

insert overwrite table words select explode(split(text, '[ \t]+')) word from textlines;

select word, count(*) from words group by word; 

MovieLens sample data data

Most of the following examples use a sample of the MovieLens, movie recommendation data http://www.grouplens.org/node/73. The sample contains 100,000 ratings for 1682 movies by 943 users.

u.data

Ratings in tab separated list of user id, item id, rating, timestamp

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116

u.item

Movie data, | separated list of movie id, movie title, release date, video release date,IMDb URL, unknown, Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir,Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

u.user

User demographic data, | separated list of user id, age, gender, occupation, zip code

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067

Extract simple stats using Pig

users = LOAD '../data/u.user' USING PigStorage('|') AS (userId,age,gender,occupation,zipCode);

allUsers = GROUP users ALL;
stats = FOREACH allUsers GENERATE COUNT(users), AVG(users.age), SUM(users.age);

byGender = GROUP users BY gender;
genderStats = FOREACH byGender GENERATE group, COUNT(users), AVG(users.age);

programmers = FILTER users BY occupation == 'programmer';
progsByAge = GROUP programmers BY age;
progCountsByAge = FOREACH progsByAge GENERATE group AS age, COUNT(programmers) as NumProgs;
progsCountsByAgeSorted = ORDER progCountsByAge BY NumProgs DESC;

STORE stats INTO 'output/stats' USINg PigStorage('\t');
STORE genderStats INTO 'output/genderStats' USING PigStorage('\t');
STORE progsCountsByAgeSorted INTO 'output/progsCountsByAgeSorted' USING PigStorage('\t');

Results of simple stats

output/stats

943 34.05196182396607   32111.0

output/genderStats

F   273 33.81318681318681
M   670 34.149253731343286

output/progsCountsByAgeSorted

27  7
30  5
25  5
28  4
21  3
35  3
23  3
29  3
38  3
41  2
24  2
46  2
49  2
26  2
31  2
32  2
33  2
37  2
40  2
20  1
34  1
36  1
42  1
44  1
50  1
52  1
53  1
60  1
63  1

Finding the Top 25 rated movies

votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,imdbURL,unknown,
  Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,Horror,Musical,Mystery,Romance,
  SciFi,Thriller,War,Western);

movieVotesGroup = GROUP votes BY itemId;

movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId, 
                                     AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;

moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;

moviesWithVotesSorted = ORDER moviesWithVotes BY movieVotes::avgRating DESC;

top25 = LIMIT moviesWithVotesSorted 25;
STORE top25 INTO 'output/topMovies' USING PigStorage('\t');

The top 25 movies

1653    5.0 1   1653    Entertaining Angels: The Dorothy Day Story (1996)   27-Sep-1996     http://us.imdb.com/M/title-exact?Entertaining%20Angels:%20The%20Dorothy%20Day%20Story%20(1996)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1293    5.0 3   1293    Star Kid (1997) 16-Jan-1998     http://us.imdb.com/M/title-exact?imdb-title-120478  0   0   1   0   1   0   0   0   0   1   0   0   0   0   0   1   0   0   0
1467    5.0 2   1467    Saint of Fort Washington, The (1993)    01-Jan-1993     http://us.imdb.com/M/title-exact?Saint%20of%20Fort%20Washington,%20The%20(1993) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
814 5.0 1   814 Great Day in Harlem, A (1994)   01-Jan-1994     http://us.imdb.com/M/title-exact?Great%20Day%20in%20Harlem,%20A%20(1994)    0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1500    5.0 2   1500    Santa with Muscles (1996)   08-Nov-1996     http://us.imdb.com/M/title-exact?Santa%20with%20Muscles%20(1996)    0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
1201    5.0 1   1201    Marlene Dietrich: Shadow and Light (1996)   02-Apr-1996     http://us.imdb.com/M/title-exact?Marlene%20Dietrich:%20Shadow%20and%20Light%20(1996)%20(TV) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1122    5.0 1   1122    They Made Me a Criminal (1939)  01-Jan-1939     http://us.imdb.com/M/title-exact?They%20Made%20Me%20a%20Criminal%20(1939)   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
1189    5.0 3   1189    Prefontaine (1997)  24-Jan-1997     http://us.imdb.com/M/title-exact?Prefontaine%20(1997)   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1599    5.0 1   1599    Someone Else's America (1995)   10-May-1996     http://us.imdb.com/M/title-exact?Someone%20Else's%20America%20(1995)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1536    5.0 1   1536    Aiqing wansui (1994)    22-Jul-1996     http://us.imdb.com/M/title-exact?Aiqing%20Wansui%20(1994)   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1449    4.625   8   1449    Pather Panchali (1955)  22-Mar-1996     http://us.imdb.com/M/title-exact?Pather%20Panchali%20(1955) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1594    4.5 2   1594    Everest (1998)  10-Mar-1998     http://us.imdb.com/Title?Everest+(1998) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
119 4.5 4   119 Maya Lin: A Strong Clear Vision (1994)  01-Jan-1994     http://us.imdb.com/M/title-exact?Maya%20Lin:%20A%20Strong%20Clear%20Vision%20(1994) 0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
1398    4.5 2   1398    Anna (1996) 13-Nov-1996     http://us.imdb.com/M/title-exact?Anna%20(1996)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
1642    4.5 2   1642    Some Mother's Son (1996)    27-Dec-1996     http://us.imdb.com/M/title-exact?Some%20Mother's%20Son%20(1996) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
408 4.491071428571429   112 408 Close Shave, A (1995)   28-Apr-1996     http://us.imdb.com/M/title-exact?Close%20Shave,%20A%20(1995)    0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0
318 4.466442953020135   298 318 Schindler's List (1993) 01-Jan-1993     http://us.imdb.com/M/title-exact?Schindler's%20List%20(1993)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   1   0
169 4.466101694915254   118 169 Wrong Trousers, The (1993)  01-Jan-1993     http://us.imdb.com/M/title-exact?Wrong%20Trousers,%20The%20(1993)   0   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
483 4.45679012345679    243 483 Casablanca (1942)   01-Jan-1942     http://us.imdb.com/M/title-exact?Casablanca%20(1942)    0   0   0   0   0   0   0   0   1   0   0   0   0   0   1   0   0   1   0
114 4.447761194029851   67  114 Wallace & Gromit: The Best of Aardman Animation (1996)  05-Apr-1996     http://us.imdb.com/Title?Wallace+%26+Gromit%3A+The+Best+of+Aardman+Animation+(1996) 0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
64  4.445229681978798   283 64  Shawshank Redemption, The (1994)    01-Jan-1994     http://us.imdb.com/M/title-exact?Shawshank%20Redemption,%20The%20(1994) 0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
603 4.3875598086124405  209 603 Rear Window (1954)  01-Jan-1954     http://us.imdb.com/M/title-exact?Rear%20Window%20(1954) 0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0
12  4.385767790262173   267 12  Usual Suspects, The (1995)  14-Aug-1995     http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995)   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0
50  4.3584905660377355  583 50  Star Wars (1977)    01-Jan-1977     http://us.imdb.com/M/title-exact?Star%20Wars%20(1977)   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   1   0   1   0
178 4.344   125 178 12 Angry Men (1957) 01-Jan-1957     http://us.imdb.com/M/title-exact?12%20Angry%20Men%20(1957)  0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0

R

User stats

> users  <- read.table('C:/work/ClearPoint/Data20/data/u.user', header=FALSE, sep='|', 
+ col.names=c('userid','age','gender','occupation','zipcode'));
> summary(users);

    userid           age        gender          occupation     zipcode   
 Min.   :  1.0   Min.   : 7.00   F:273   student      :196   55414  :  9  
 1st Qu.:236.5   1st Qu.:25.00   M:670   other        :105   55105  :  6  
 Median :472.0   Median :31.00           educator     : 95   10003  :  5  
 Mean   :472.0   Mean   :34.05           administrator: 79   20009  :  5  
 3rd Qu.:707.5   3rd Qu.:43.00           engineer     : 67   55337  :  5  
 Max.   :943.0   Max.   :73.00           programmer   : 66   27514  :  4  
                                         (Other)      :335   (Other):909
> table(users$age);     

 7 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
 1  1  1  5  3  6  5 14 18 23 32 27 37 28 33 38 34 35 36 32 39 25 28 26 17 27 21 19 17 22 21 10 21 13 23 
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 68 69 70 73 
15 12 14 20 19 20 20  6 12  4 11  6  9  3  3  9  3  2  3  2  3  1  2  2  3  1   

Plots

> plot(users$gender);

Basic Histogram

> hist(users$age);

Lattice - prettier graphs & panels

> library(lattice);
> xyplot(occupation~age,users);

conditioning variable

> xyplot(occupation~age|gender,users);

Age distribution by gender

> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));

Age distribution by gender

> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));

Age distribution by occupation

> histogram(~age|occupation, data=users,col=heat.colors(10));

Gender distribution by occupation

> histogram(~gender|occupation, data=users, col=heat.colors(2));

Are newer movies better ?

First prep data using Pig:

votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,
         imdbURL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,
         Horror,Musical,Mystery,Romance,SciFi,Thriller,War,Western);
movieVotesGroup = GROUP votes BY itemId;
movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId, 
                                     AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;
moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;
movieRatings = FOREACH moviesWithVotes GENERATE movieVotes::movieId, movieVotes::avgRating, 
               movieVotes::numVotes, REGEX_EXTRACT(movies::releaseDate,'(.*)-(.*)-(.*)',3), movies::movieTitle;
STORE movieRatings INTO 'output/movieRatings' USING PigStorage('\t');

output/movieRatings:

1   3.8783185840707963  452 1995    Toy Story (1995)
2   3.2061068702290076  131 1995    GoldenEye (1995)
3   3.033333333333333   90  1995    Four Rooms (1995)
4   3.550239234449761   209 1995    Get Shorty (1995)
5   3.302325581395349   86  1995    Copycat (1995)

Load data into R

> ratings  <- read.table('C:/work/ClearPoint/Data20/pig/output/movieRatings/part-r-00000', header=FALSE, sep='\t', 
            col.names=c('movieid','avgrating','numvotes','year','title'));

> summary(ratings); 

    movieid         avgrating        numvotes           year                                title    
 Min.   :   1.0   Min.   :1.000   Min.   :  1.00   Min.   :1930   Butcher Boy, The (1998)      :  2  
 1st Qu.: 326.0   1st Qu.:2.750   1st Qu.: 11.00   1st Qu.:1989   Desperate Measures (1998)    :  2  
 Median : 697.0   Median :3.278   Median : 42.00   Median :1994   Money Talks (1997)           :  2  
 Mean   : 710.7   Mean   :3.171   Mean   : 73.93   Mean   :1988   Substance of Fire, The (1996):  2  
 3rd Qu.: 984.0   3rd Qu.:3.741   3rd Qu.:101.00   3rd Qu.:1996   1-900 (1994)                 :  1  
 Max.   :1682.0   Max.   :5.000   Max.   :583.00   Max.   :1998   12 Angry Men (1957)          :  1  
                                                   NA's   :   1   (Other)                      :803

Looks like we have some issues, source data has duplicate movies (with different movie Ids) and we are missing a year...

Is there a correlation between movie age and average rating?

> smoothScatter(ratings$year,ratings$avgrating);


Doesn't look like it....

Calculating the correlation

> cor(ratings$year,ratings$avgrating, use='complete.obs');
[1] -0.206226

Looks like there is some correlation

> cor(ratings$year,ratings$numvotes, use='complete.obs');
[1] -0.001756821

Only a very, very weak correlation between year and number of votes

> cor(ratings$avgrating,ratings$numvotes, use='complete.obs');
[1] 0.468041

Interesting there is quite a strong correlation between average rating and number of votes. Wonder why?

Grouping by decade might help...

> ratings$decade <- trunc(ratings$year / 10) * 10;
> histogram(~avgrating|as.factor(decade),ratings,layout=c(1, 7));

Not only structured data

REGISTER lib/piggybank.jar;
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE HourExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('dd/MMM/yyyy:HH:mm:ss Z','yyyy-MM-dd HH:00','Pacific/Auckland');

accesslog = LOAD '../data/access.log' USING LogLoader  AS (remoteAddr, remoteLogname, user, time, method,
                                                           uri, proto, status, bytes, referer, userAgent);
byHour = GROUP accesslog BY HourExtractor(time);

hourCounts = FOREACH byHour GENERATE group as hourOfDay, COUNT(accesslog) as numberOfHits;
hourCountsOrdered = ORDER hourCounts by hourOfDay;

STORE hourCountsOrdered INTO 'output/hitsbyhour' using PigStorage('\t');    

output/hitsbyhour

2011-08-15 20:00    47
2011-08-15 21:00    9
2011-08-15 22:00    28
2011-08-15 23:00    48
2011-08-16 00:00    22
2011-08-16 01:00    74
2011-08-16 02:00    34
2011-08-16 03:00    27
2011-08-16 04:00    38
2011-08-16 05:00    32
2011-08-16 06:00    20
2011-08-16 07:00    33

That's all folks

Questions?