**************************************************************************
ABOUT THE GREENPLUM MAPREDUCE DEMOS FOR Apache Cloudberry
**************************************************************************

This package contains two example programs of Greenplum MapReduce:

1. Grep - This Greenplum MapReduce program searches through web access 
          log files and extracts the URL portion of the log.
       
          Specification File: 1_grep.yml
          Data Files: access_log, access_log2
          Outputs To: STDOUT

2. Word Count - This Greenplum MapReduce program goes through a text 
                document and counts the distinct words found.

                Specification File: 2_wordcount.yml
                Data Files: whitepaper.txt
                Outputs To: STDOUT


 
**************************************************************************
BEFORE YOU BEGIN
**************************************************************************

1. You must be a Greenplum Database superuser (such as gpadmin) to run 
   the MapReduce demos.


2. Create a database to use for the Greenplum MapReduce demos:

    $ createdb gpmrdemo

3. In this database, create the procedural languages used by the demos:

    $ psql gpmrdemo -c 'CREATE LANGUAGE plperl;'
    $ psql gpmrdemo -c 'CREATE LANGUAGE plpython3u;'


NOTE: If the CREATE LANGUAGE command does not succeed because a
      library (.so file) could not be found, you will need to make
      sure that all hosts in your Greenplum array have a shared 
      Perl and Python installation. You must also make sure the
      Perl and Python libraries are findable by your runtime linker
      on all hosts in your Greenplum Database array. Greenplum
      provides installer packages for Perl and Python on 
      https://emc.subscribenet.com  if you need them.


4. On your segment hosts, create a location to put the Greenplum
   MapReduce demo data files. If you are running multiple segment 
   instances per segment host, you can create this location on just 
   one of your segment hosts. For example:

      $ gpssh -h seghost1 -e 'mkdir /home/gpadmin/gpmrdata'

    If you are running with only segment instance per host, 
    you will need to use two segment hosts:

      $ $ gpssh -h seghost1 -h seghost2 -e 'mkdir /home/gpadmin/gpmrdata'




**************************************************************************
RUNNING DEMO 1: GREP
**************************************************************************

1. Copy the data files for this demo to the demo data location on your 
   segment hosts. For example:

    $ gpscp -h seghost1 -h seghost2 data/accesslog data/accesslog2 =:/home/gpadmin/gpmrdata


2. Edit the 1_grep.yml file and change the <seghostname1>, <seghostname2>, 
   and <file_path> place holders to reflect the actual location of the 
   data files. For example:

     FILE:
        - myseghost1:/home/gpadmin/gpmrdata/access_log
        - myseghost2:/home/gpadmin/gpmrdata/access_log2 

3. Execute the 1_grep.yml Greenplum MapReduce job:

    $ gpmapreduce -f 1_grep.yml gpmrdemo

4. The program should return 47 output rows.


**************************************************************************
RUNNING DEMO 2: WORD COUNT
**************************************************************************

1. Copy the data file for this demo to the demo data location on a segment 
   host. For example:

    $ scp data/whitepaper.txt myseghost1:/home/gpadmin/gpmrdata


2. Edit the 2_wordcount.yml file and change the <seghostname> and 
   <file_path> place holders to reflect the actual location of the 
   data file. For example:

     FILE:
        - myseghost1:/home/gpadmin/gpmrdata/whitepaper.txt 

3. Execute the 2_wordcount.yml Greenplum MapReduce job:

    $ gpmapreduce -f 2_wordcount.yml gpmrdemo | more

4. The program should return 1488 output rows.


**************************************************************************
DEMO CLEANUP
**************************************************************************

After you have run the demos, run the following commands to clean up:

1. Remove the demo data on the segment hosts:

    $ gpssh -h seghost1 -h seghost2 -e 'rm /home/gpadmin/gpmrdata'

2. Drop the demo database:

     $ dropdb gpmrdemo




                                   