In this post, we will build a 20-node Beowulf cluster on Amazon EC2 and run some computations using both MPI and its Python wrapper pyMPI. This tutorial will only describe how to get the cluster running and show a few example computations. I'll save detailed benchmarking for a later write-up.
One way to build an MPI cluster on EC2 would be to customize something like Warewulf, or to rebundle one of the leading Linux cluster distributions, such as Parallel Knoppix or the Rocks Cluster Distribution, onto an Amazon AMI. Both of these distros have kernels which should work with EC2. To get things running quickly as a proof of concept, I implemented a "roll-your-own" style cluster based on a Fedora Core 6 AMI managed with some simple Python scripts. I've found this approach suitable for running occasional parallel computations on EC2 with 20 nodes, and I have been running a cluster off and on for several months without any major issues. If you need to run a much larger cluster or require more complex user management, I'd recommend modifying one of the standard distributions. This will save you some maintenance headaches and give you the additional benefit of the user/developer base for those systems.
The main task I use the cluster for is distributing large matrix computations, which is a problem well suited to existing libraries based on MPI. Depending on your needs, another platform such as Hadoop, Rinda, or cow.py might make more sense. I use Hadoop for some other projects, including MapReduce-style tasks with Jython, and highly recommend it. That said, let's start building the MPI cluster...
The only prerequisite we assume is that the tutorial on Amazon EC2 has been completed and all needed web service accounts, authorizations, and keypairs have been created.
The command blocks which begin with peter-skomorochs-computer:~ pskomoroch$ are run on my local laptop, the commands preceded by -bash-3.1# or [lamuser@domu-12-31-33-00-03-46 ~]$ are run on EC2.
It's looking like this will be a long tutorial, so I'll break it into three parts...
Update: March 5, 2007 - I'm in the process of publishing a public AMI, and have changed a few things in the tutorial. The steps describing copying over RSA keys have been moved from this post to part 2 of the tutorial. People interested in testing an MPI cluster on EC2 can skip all the installs and just use my example AMI with their own keys, as described in part 2.
Part 1 of 3
- Fire Up a Base Image
- Rebundle a Larger Base Image
- Uploading the AMI to Amazon S3
- Registering the Larger Base Image
- Modifying the Larger Image
- Rebundle the compute node image
- Upload node AMI to Amazon S3
- Register Compute Node Image
Part 2 of 3
- Launching the EC2 nodes
- Cluster Configuration and Booting MPI
- Testing the MPI Cluster
- Changing the Cluster Size
- Cluster Shutdown
Part 3 of 3
- Basic MPI Cluster Administration on EC2 with Python
- Example application: Parallel Distributed Matrix Multiplication with PyMPI and Numpy
- Benchmarking EC2 for MPI
Fire Up a Base Image
We will build our cluster on top of the Fedora Core 6 base image published by "marcin the cool". Navigate to your local bin directory holding the Amazon EC2 developer tools and fire up the public image:
peter-skomorochs-computer:~ pskomoroch$ ec2-run-instances ami-78b15411 -k gsg-keypair
RESERVATION r-e264818b 027811143419 default
INSTANCE i-2b1efa42 ami-78b15411 pending gsg-keypair 0
To check on the status of the instance run the following:
peter-skomorochs-computer:~ pskomoroch$ ec2-describe-instances i-2b1efa42
RESERVATION r-e264818b 027811143419 default
INSTANCE i-2b1efa42 ami-78b15411 domU-12-31-33-00-03-46.usma1.compute.amazonaws.com running gsg-keypair 0
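If you would rather script this status check than eyeball it, something like the following Python sketch could pull the state and hostname out of an INSTANCE line. The field layout and the list of state names are assumptions based only on the sample output in this post, and parse_instance_line is a hypothetical helper, not part of the EC2 tools.

```python
# Sketch: parse one INSTANCE line as printed by ec2-describe-instances.
# Assumption: the line looks like the samples shown in this post.

# Assumed set of instance states; adjust if your tools report others.
KNOWN_STATES = ("pending", "running", "shutting-down", "terminated")


def parse_instance_line(line):
    """Return (instance_id, state, hostname or None) for an INSTANCE line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "INSTANCE":
        raise ValueError("not an INSTANCE line: %r" % line)
    instance_id = fields[1]
    # Find the state by value rather than by position, because the
    # hostname column is absent while the instance is still pending.
    state = next((f for f in fields[2:] if f in KNOWN_STATES), None)
    hostname = next(
        (f for f in fields[2:] if f.endswith(".amazonaws.com")), None
    )
    return instance_id, state, hostname


if __name__ == "__main__":
    sample = (
        "INSTANCE i-2b1efa42 ami-78b15411 "
        "domU-12-31-33-00-03-46.usma1.compute.amazonaws.com "
        "running gsg-keypair 0"
    )
    print(parse_instance_line(sample))
```

A wrapper script could call this in a loop and only attempt the ssh step once the state is "running" and a hostname is present.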
The status has changed from "pending" to "running", so we are ready to ssh into the instance as root:
peter-skomorochs-computer:~ pskomoroch$ ssh -i id_rsa-gsg-keypair root@domU-12-31-33-00-03-46.usma1.compute.amazonaws.com
The authenticity of host 'domu-12-31-33-00-03-46.usma1.compute.amazonaws.com (188.8.131.52)' can't be established.
RSA key fingerprint is ZZZZZZ
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'domu-12-31-33-00-03-46.usma1.compute.amazonaws.com,184.108.40.206' (RSA) to the list of known hosts
-bash-3.1#
Here are some basic stats on the EC2 machine:
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 250
stepping : 1
cpu MHz : 2405.452
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm ts fid vid ttp
bogomips : 627.50
The first change we make will be to modify the ssh properties to avoid timeouts:
Edit /etc/ssh/sshd_config and add the following line:
This image boots up fast, but it is missing a lot of basics along with the MPI libraries and Amazon AMI packaging tools. The main partition is fairly small, so before we start our installs, we will need to rebundle a larger version.
In order to rebundle, we need the Amazon developer tools installed...
Amazon AWS AMI tools install
Install the Amazon AWS ami tools from the rpm:
yum -y install wget nano tar bzip2 unzip zip fileutils
yum -y install ruby
yum -y install rsync make
cd /usr/local/src
wget http://s3.amazonaws.com/ec2-downloads/ec2-ami-tools.noarch.rpm
rpm -i ec2-ami-tools.noarch.rpm
Rebundle a Larger Base Image
Copy over the pk/cert files:
peter-skomorochs-computer:~ pskomoroch$ scp -i id_rsa-gsg-keypair ~/.ec2/pk-FOOXYZ.pem ~/.ec2/cert-BARXYZ.pem root@domU-12-31-33-00-03-46.usma1.compute.amazonaws.com:/mnt/
pk-FOOXYZ.pem 100% 721 0.7KB/s 00:00
cert-BARXYZ.pem 100% 689 0.7KB/s 00:00
peter-skomorochs-computer:~ pskomoroch$
Using the -s parameter, we boost the trimmed-down Fedora Core 6 image from 1.5 GB to 5.5 GB so we have room to install more packages (substitute your own cert and user option values from the Amazon tutorial).
-bash-3.1# ec2-bundle-vol -d /mnt -k /mnt/pk-FOOXYZ.pem -c /mnt/cert-BARXYZ.pem -u 99999ABC -s 5536
Copying / into the image file /mnt/image...
Excluding:
  /sys
  /proc
  /proc/sys/fs/binfmt_misc
  /dev
  /media
  /mnt
  /proc
  /sys
  /mnt/image
  /mnt/img-mnt
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.015051 seconds, 69.7 MB/s
mke2fs 1.39 (29-May-2006)
warning: 256 blocks unused.
Bundling image file...
Splitting /mnt/image.tar.gz.enc...
Created image.part.00
Created image.part.01
Created image.part.02
Created image.part.03
Created image.part.04
Created image.part.05
Created image.part.06
Created image.part.07
Created image.part.08
Created image.part.09
Created image.part.10
Created image.part.11
Created image.part.12
Created image.part.13
Created image.part.14
...
Created image.part.39
Created image.part.40
Created image.part.41
Generating digests for each part...
Digests generated.
Creating bundle manifest...
ec2-bundle-vol complete.
Uploading the AMI to Amazon S3
This step is identical to the Amazon tutorial. Use your own Amazon-assigned AWS Access Key ID (aws-access-key-id) and AWS Secret Access Key (aws-secret-access-key). I'll use the following values in the code examples:
- Access Key ID: 1AFOOBARTEST
- Secret Access Key: F0Bar/T3stId
bash-3.1# ec2-upload-bundle -b FC6_large_base_image -m /mnt/image.manifest.xml -a 1AFOOBARTEST -s F0Bar/T3stId
Setting bucket ACL to allow EC2 read access ...
Uploading bundled AMI parts to https://s3.amazonaws.com:443/FC6_large_base_image ...
Uploaded image.part.00 to https://s3.amazonaws.com:443/FC6_large_base_image/image.part.00.
Uploaded image.part.01 to https://s3.amazonaws.com:443/FC6_large_base_image/image.part.01.
...
Uploaded image.part.48 to https://s3.amazonaws.com:443/FC6_large_base_image/image.part.48.
Uploaded image.part.49 to https://s3.amazonaws.com:443/FC6_large_base_image/image.part.49.
Uploading manifest ...
Uploaded manifest to https://s3.amazonaws.com:443/FC6_large_base_image/image.manifest.xml.
ec2-upload-bundle complete
The upload will take several minutes...
Registering the Larger Base Image
To register the new image with Amazon EC2, we switch back to our local machine and run the following:
peter-skomorochs-computer:~/src/amazon_ec2 pskomoroch$ ec2-register FC6_large_base_image/image.manifest.xml
IMAGE ami-3cb85d55
Included in the output is an AMI identifier (ami-3cb85d55 in the example above), which we will use as our base for building the compute nodes.
Modifying the Larger Image
We need to start an instance of the larger image we registered and install some needed libraries.
First, start the new image:
peter-skomorochs-computer:~ pskomoroch$ ec2-run-instances ami-3cb85d55 -k gsg-keypair
RESERVATION r-e264818b 027811143419 default
INSTANCE i-2z1efa32 ami-3cb85d55 pending gsg-keypair 0
Wait for a hostname so we can ssh into the instance...
peter-skomorochs-computer:~ pskomoroch$ ec2-describe-instances i-2z1efa32
RESERVATION r-e264818b 027811143419 default
INSTANCE i-2z1efa32 ami-3cb85d55 domU-12-31-33-00-03-57.usma1.compute.amazonaws.com running gsg-keypair 0
ssh in as root:
peter-skomorochs-computer:~ pskomoroch$ ssh -i id_rsa-gsg-keypair root@domU-12-31-33-00-03-57.usma1.compute.amazonaws.com
The authenticity of host 'domu-12-31-33-00-03-57.usma1.compute.amazonaws.com (220.127.116.11)' can't be established.
RSA key fingerprint is 23:XY:FO...
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'domu-12-31-33-00-03-57.usma1.compute.amazonaws.com,18.104.22.168' (RSA) to the list of known hosts.
-bash-3.1#
Run the following yum installs to get some needed libraries:
yum -y install python-devel
yum -y install gcc
yum -y install gcc-c++
yum -y install subversion gcc-gfortran
yum -y install fftw-devel swig
yum -y install compat-gcc-34 compat-gcc-34-g77 compat-gcc-34-c++ compat-libstdc++-33 compat-db compat-readline43
yum -y install hdf5-devel
yum -y install readline-devel
yum -y install python-numeric python-numarray Pyrex
yum -y install python-psyco
yum -y install wxPython-devel zlib-devel freetype-devel tk-devel tkinter gtk2-devel pygtk2-devel libpng-devel
yum -y install octave
For improved performance in matrix operations, we will want to install processor-specific math libraries. Since the Amazon machines run on AMD Opteron processors, we will install ACML instead of Intel MKL.
- Log in to the AMD developer page
- Download acml-3-6-0-gnu-32bit.tgz, and scp the archive over to the EC2 instance.
peter-skomorochs-computer:~ pskomoroch$ scp acml-3-6-0-gnu-32bit.tgz root@domU-12-31-33-00-03-57.usma1.compute.amazonaws.com:/usr/local/src/
acml-3-6-0-gnu-32bit.tgz 100% 9648KB 88.5KB/s 01:49
- To install ACML, decompress the archive, run the install script, and accept the license. Note where it installs ACML (in my case /opt/acml3.6.0/).
- cd into the examples directory under the install location (/opt/acml3.6.0/gnu32/examples/ in my case) and run the tests by issuing make.
-bash-3.1# chmod +x /usr/lib/gcc/i386-redhat-linux/3.4.6/libg2c.a
-bash-3.1# ln -s /usr/lib/gcc/i386-redhat-linux/3.4.6/libg2c.a /usr/lib/libg2c.a
-bash-3.1# cd /usr/local/src/
-bash-3.1# ls
acml-3-6-0-gnu-32bit.tgz ec2-ami-tools.noarch.rpm
-bash-3.1# tar -xzvf acml-3-6-0-gnu-32bit.tgz
contents-acml-3-6-0-gnu-32bit.tgz
install-acml-3-6-0-gnu-32bit.sh
README.32-bit
ACML-EULA.txt
-bash-3.1# bash install-acml-3-6-0-gnu-32bit.sh
Add the libraries to the default path by adding the following to /etc/profile:
LD_LIBRARY_PATH=/opt/acml3.6.0/gnu32/lib
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC LD_LIBRARY_PATH
Example of running the ACML tests:
-bash-3.1# cd /opt/acml3.6.0/gnu32/examples/
-bash-3.1# make
Compiling program cdotu_c_example.c:
gcc -c -I/opt/acml3.6.0/gnu32/include -m32 cdotu_c_example.c -o cdotu_c_example.o
Linking program cdotu_c_example.exe:
gcc -m32 cdotu_c_example.o /opt/acml3.6.0/gnu32/lib/libacml.a -lg2c -lm -o cdotu_c_example.exe
Running program cdotu_c_example.exe:
(export LD_LIBRARY_PATH='/opt/acml3.6.0/gnu32/lib:/opt/acml3.6.0/gnu32/lib'; ./cdotu_c_example.exe > cdotu_c_example.res 2>&1)
ACML example: dot product of two complex vectors using cdotu
------------------------------------------------------------
Vector x:
  ( 1.0000, 2.0000)
  ( 2.0000, 1.0000)
  ( 1.0000, 3.0000)
Vector y:
  ( 3.0000, 1.0000)
  ( 1.0000, 4.0000)
  ( 1.0000, 2.0000)
r = x.y = ( -6.000, 21.000)
Compiling program cfft1d_c_example.c:
gcc -c -I/opt/acml3.6.0/gnu32/include -m32 cfft1d_c_example.c -o cfft1d_c_example.o
Linking program cfft1d_c_example.exe:
gcc -m32 cfft1d_c_example.o /opt/acml3.6.0/gnu32/lib/libacml.a -lg2c -lm -o cfft1d_c_example.exe
Running program cfft1d_c_example.exe:
(export LD_LIBRARY_PATH='/opt/acml3.6.0/gnu32/lib:/opt/acml3.6.0/gnu32/lib'; ./cfft1d_c_example.exe > cfft1d_c_example.res 2>&1)
ACML example: FFT of a complex sequence using cfft1d
----------------------------------------------------
Components of discrete Fourier transform:
      Real    Imag
0  ( 2.4836,-0.4710)
1  (-0.5518, 0.4968)
2  (-0.3671, 0.0976)
3  (-0.2877,-0.0586)
4  (-0.2251,-0.1748)
5  (-0.1483,-0.3084)
6  ( 0.0198,-0.5650)
Original sequence as restored by inverse transform:
      Original             Restored
      Real    Imag         Real    Imag
0  ( 0.3491,-0.3717)   ( 0.3491,-0.3717)
1  ( 0.5489,-0.3567)   ( 0.5489,-0.3567)
2  ( 0.7478,-0.3117)   ( 0.7478,-0.3117)
3  ( 0.9446,-0.2370)   ( 0.9446,-0.2370)
4  ( 1.1385,-0.1327)   ( 1.1385,-0.1327)
5  ( 1.3285, 0.0007)   ( 1.3285, 0.0007)
6  ( 1.5137, 0.1630)   ( 1.5137, 0.1630)
...
ACML example: solution of linear equations using sgetrf/sgetrs
--------------------------------------------------------------
Matrix A:
   1.8000   2.8800   2.0500  -0.8900
   5.2500  -2.9500  -0.9500  -3.8000
   1.5800  -2.6900  -2.9000  -1.0400
  -1.1100  -0.6600  -0.5900   0.8000
Right-hand-side matrix B:
   9.5200  18.4700
  24.3500   2.2500
   0.7700 -13.2800
  -6.2200  -6.2100
Solution matrix X of equations A*X = B:
   1.0000   3.0000
  -1.0000   2.0000
   3.0000   4.0000
  -5.0000   1.0000
Testing: no example difference files were generated.
Test passed OK
-bash-3.1#
If everything checks out, the next step is to compile a version of cblas from source.
See http://www.netlib.org/blas/ for more details
- Download the cblas source code and unzip into /usr/local/src
To compile, we follow George Nurser's writeup (thanks for the help on this part, George...). For the 32-bit EC2 machines, we changed the compile flags in /usr/local/src/CBLAS/Makefile.LINUX to:
CFLAGS = -O3 -DADD_ -pthread -fno-strict-aliasing -m32 -msse2 -mfpmath=sse -march=opteron -fPIC
FFLAGS = -Wall -fno-second-underscore -fPIC -O3 -funroll-loops -march=opteron -mmmx -msse2 -msse -m3dnow
RANLIB = ranlib
BLLIB = /opt/acml3.6.0/gnu32/lib/libacml.so
CBDIR = /usr/local/src/CBLAS
Next we link Makefile.LINUX to Makefile.in and run "make". The resulting cblas_LINUX.a must then be copied to libcblas.a in the same directory as libacml.so:
-bash-3.1# cd /usr/local/src/CBLAS
-bash-3.1# ln -s Makefile.LINUX Makefile.in
-bash-3.1# make all
-bash-3.1# cd /usr/local/src/CBLAS/lib/LINUX
-bash-3.1# cp cblas_LINUX.a /opt/acml3.6.0/gnu32/lib/libcblas.a
-bash-3.1# cd /opt/acml3.6.0/gnu32/lib/
-bash-3.1# chmod +x libcblas.a
This directory then needs to be added to the $LD_LIBRARY_PATH and $LD_RUN_PATH before we compile numpy.
export LD_LIBRARY_PATH=/opt/acml3.6.0/gnu32/lib
export LD_RUN_PATH=/opt/acml3.6.0/gnu32/lib
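Note that the exports above replace any existing value of these variables. If you drive the build from a Python script and want to keep whatever was already on the path, a small helper along these lines could prepend the ACML directory instead. This is only a sketch: prepend_path is a hypothetical helper, and the change applies only to the Python process and the child processes it launches.

```python
import os


def prepend_path(current, new_dir):
    """Return a colon-separated path with new_dir first and no duplicates."""
    parts = [p for p in current.split(":") if p and p != new_dir]
    return ":".join([new_dir] + parts)


if __name__ == "__main__":
    # ACML library directory used throughout this post.
    acml_lib = "/opt/acml3.6.0/gnu32/lib"
    for var in ("LD_LIBRARY_PATH", "LD_RUN_PATH"):
        os.environ[var] = prepend_path(os.environ.get(var, ""), acml_lib)
        print(var, "=", os.environ[var])
```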
Compile numpy from source:
cd /usr/local/src
svn co http://svn.scipy.org/svn/numpy/trunk/ ./numpy-trunk
cd numpy-trunk
Before building numpy with setup.py, we need to configure a site.cfg file in both the numpy-trunk directory and the distutils subdirectory. I overlooked this the first time, which resulted in a slower default Numpy install that was missing the ACML-optimized lapack and blas. If the install fails, make sure that you get rid of earlier attempts with:
rm -rf /usr/lib/python2.4/site-packages/numpy
rm -rf /usr/local/src/numpy-trunk/build
Again, for more details see George Nurser's writeup.
Contents of both site.cfg files for my install:
[DEFAULT]
library_dirs = /usr/local/lib
include_dirs = /usr/local/include

[blas]
blas_libs = cblas, acml
library_dirs = /opt/acml3.6.0/gnu32/lib
include_dirs = /usr/local/src/CBLAS/src

[lapack]
language = f77
lapack_libs = acml
library_dirs = /opt/acml3.6.0/gnu32/lib
include_dirs = /opt/acml3.6.0/gnu32/include
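Since a wrong or missing site.cfg silently produces the slower default build, it can be worth checking the file before compiling. Here is a minimal sketch using Python's standard configparser module (so it assumes a modern Python 3 on whatever machine runs the check, not the Python 2.4 on the image). check_site_cfg is a hypothetical helper, and it only inspects the library_dirs entries.

```python
import configparser

# The site.cfg contents from this post, reproduced for the check.
SITE_CFG = """\
[DEFAULT]
library_dirs = /usr/local/lib
include_dirs = /usr/local/include

[blas]
blas_libs = cblas, acml
library_dirs = /opt/acml3.6.0/gnu32/lib
include_dirs = /usr/local/src/CBLAS/src

[lapack]
language = f77
lapack_libs = acml
library_dirs = /opt/acml3.6.0/gnu32/lib
include_dirs = /opt/acml3.6.0/gnu32/include
"""


def check_site_cfg(text, expected_dir):
    """Return a list of problems found in the [blas] and [lapack] sections."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    problems = []
    for section in ("blas", "lapack"):
        if not cfg.has_section(section):
            problems.append("missing section [%s]" % section)
        # A section without its own library_dirs inherits the [DEFAULT]
        # value, which is then reported as a mismatch.
        elif cfg.get(section, "library_dirs") != expected_dir:
            problems.append(
                "[%s] library_dirs is not %s" % (section, expected_dir)
            )
    return problems


if __name__ == "__main__":
    issues = check_site_cfg(SITE_CFG, "/opt/acml3.6.0/gnu32/lib")
    print(issues if issues else "site.cfg points at the ACML directory")
```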
We execute the actual compile with the following:
python setup.py build
python setup.py install
cd ../
rm -R numpy-trunk
Take a look at the instructions for the lapack and blas environment as described here:
I found that no modifications from the defaults were needed; the install should pick up the libraries built in the previous steps.
Install Scipy from source:
cd /usr/local/src
svn co http://svn.scipy.org/svn/scipy/trunk/ ./scipy-trunk
cd scipy-trunk
python setup.py build
python setup.py install
cd ../
rm -R scipy-trunk
Verify numpy and scipy work and are using the correct libraries:
-bash-3.1# python
Python 2.4.4 (#1, Oct 23 2006, 13:58:00)
[GCC 4.1.1 20061011 (Red Hat 4.1.1-30)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy,scipy
>>> numpy.show_config()
blas_info:
    libraries = ['cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
lapack_info:
    libraries = ['acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
    NOT AVAILABLE
lapack_opt_info:
    libraries = ['acml', 'cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
atlas_blas_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
>>> scipy.show_config()
blas_info:
    libraries = ['cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
lapack_info:
    libraries = ['acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
    NOT AVAILABLE
djbfft_info:
    NOT AVAILABLE
lapack_opt_info:
    libraries = ['acml', 'cblas', 'acml']
    library_dirs = ['/opt/acml3.6.0/gnu32/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
fftw3_info:
    libraries = ['fftw3']
    library_dirs = ['/usr/lib']
    define_macros = [('SCIPY_FFTW3_H', None)]
    include_dirs = ['/usr/include']
umfpack_info:
    NOT AVAILABLE
atlas_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
atlas_blas_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
>>>
Now that we have numpy and scipy, we can install matplotlib:
yum -y install python-matplotlib
We can benchmark the performance improvement from the ACML libraries using a script George Nurser provided:
EC2 image with Default Numpy:
-bash-3.1# python bench_blas2.py
Tests       x.T*y    x*y.T      A*x      A*B    A.T*x     half     2in2

Dimension: 5
Array      1.8900   0.4300   0.3900   0.4300   1.2600   1.4500   1.6000
Matrix     6.6100   2.0900   0.9100   0.9400   1.4200   3.1300   3.8100

Dimension: 50
Array     18.8300   2.1600   0.7000  12.8300   2.3100   1.7300   1.9000
Matrix    66.3900   3.9900   1.2200  13.4600   1.7500   3.4300   4.1100

Dimension: 500
Array      1.9800   5.1500   0.6600 125.9200   7.5600   0.3500   0.6700
Matrix     6.8400   5.2200   0.6700 125.9700   0.9000   0.4000   0.7300
EC2 image with Numpy built with ACML:
-bash-3.1# python bench_blas2.py
Tests       x.T*y    x*y.T      A*x      A*B    A.T*x     half     2in2

Dimension: 5
Array      2.0300   0.6500   0.3800   0.7100   1.2000   1.4400   1.5200
Matrix     6.7500   2.4100   0.8400   1.2400   1.3800   3.0300   3.5600

Dimension: 50
Array     20.4500   2.7500   0.5900  11.8300   2.2200   1.7300   1.8000
Matrix    68.2400   4.5900   1.1100  12.4200   1.7100   3.3600   3.9100

Dimension: 500
Array      2.1800   5.1900   0.5800  77.1200   7.4200   0.3300   0.6900
Matrix     6.9500   5.2800   0.5900  77.3400   0.6200   0.3800   0.7500
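To make the comparison easier to read, the two runs can be turned into ratios. The sketch below uses the dimension-500 "Array" rows copied from the output above and assumes the figures are elapsed times, so that lower is better; the benchmark script itself is not shown here, so treat that as an assumption. Under it, a ratio above 1 means the ACML build was faster, and the A*B column comes out at roughly 1.63.

```python
# Dimension-500 "Array" rows copied from the two benchmark runs above.
COLUMNS = ["x.T*y", "x*y.T", "A*x", "A*B", "A.T*x", "half", "2in2"]
DEFAULT_500 = [1.9800, 5.1500, 0.6600, 125.9200, 7.5600, 0.3500, 0.6700]
ACML_500 = [2.1800, 5.1900, 0.5800, 77.1200, 7.4200, 0.3300, 0.6900]


def ratios(before, after):
    """Return before/after for each column (above 1 means 'after' is smaller)."""
    return [b / a for b, a in zip(before, after)]


if __name__ == "__main__":
    for name, ratio in zip(COLUMNS, ratios(DEFAULT_500, ACML_500)):
        print("%-6s %.2f" % (name, ratio))
```

Most columns barely move between the two builds. The large matrix-matrix product (A*B at dimension 500) is where the difference shows up.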
Install mpich2 from source:
cd /usr/local/src
wget http://www-unix.mcs.anl.gov/mpi/mpich2/downloads/mpich2-1.0.5.tar.gz
tar -xzvf mpich2-1.0.5.tar.gz
cd mpich2-1.0.5
./configure
make
make install
Build pyMPI from source:
cd /usr/local/src
wget -O pyMPI-2.4b2.tar.gz "http://downloads.sourceforge.net/pympi/pyMPI-2.4b2.tar.gz?modtime=1122458975&big_mirror=0"
tar -xzvf pyMPI-2.4b2.tar.gz
cd pyMPI-2.4b2
The basic build and install is invoked with:
./configure --with-includes=-I/usr/local/include
make
make install
This will build a default version of pyMPI based on the python program the configure script finds in your path. It also tries to find mpcc, mpxlc, or mpicc to do the compiling and linking with the MPI libraries.
Install PyTables from source (requires the previous yum install of hdf5-devel):
cd /usr/local/src
wget http://downloads.sourceforge.net/pytables/pytables-1.4.tar.gz
tar -xvzf pytables-1.4.tar.gz
cd pytables-1.4/
python setup.py build_ext --inplace
python setup.py install
Configuration and Cleanup
To help reduce the image size, let's remove the compressed source files we downloaded:
-bash-3.1# rm ec2-ami-tools.noarch.rpm mpich2-1.0.5.tar.gz pyMPI-2.4b2.tar.gz acml-3-6-0-gnu-32bit.tgz contents-acml-3-6-0-gnu-32bit.tgz pytables-1.4.tar.gz
For the mpich configuration we need to add a couple of additional files to the base install:
Create the file .mpd.conf as follows (with your own password):
cd /etc
touch .mpd.conf
chmod 600 .mpd.conf
nano .mpd.conf

secretword=Myp@ssW0rD
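Since this file gets created once for root and again for the non-root user, it is an obvious candidate for scripting. A minimal Python sketch, assuming a POSIX system; write_mpd_conf is a hypothetical helper, and the demo writes into a temporary directory rather than touching /etc.

```python
import os
import stat
import tempfile


def write_mpd_conf(path, secretword):
    """Create path with mode 600 containing a single secretword line."""
    # O_EXCL makes the call fail instead of overwriting an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("secretword=%s\n" % secretword)
    # Set the mode explicitly in case the umask altered it at creation.
    os.chmod(path, 0o600)


if __name__ == "__main__":
    # Demo in a scratch directory; substitute your own path and password.
    scratch = tempfile.mkdtemp()
    target = os.path.join(scratch, ".mpd.conf")
    write_mpd_conf(target, "Myp@ssW0rD")
    print(target, oct(stat.S_IMODE(os.stat(target).st_mode)))
```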
Next we set the ssh variable "StrictHostKeyChecking" to "no". This is an evil hack to avoid the tedious adding of each compute node host... I'm assuming these EC2 nodes will only connect to each other, so please be careful.
See the following article for why this is risky: http://www.securityfocus.com/infocus/1806
Edit the ssh_config file (typically /etc/ssh/ssh_config) and change the following line:
# StrictHostKeyChecking ask
to:
StrictHostKeyChecking no
Changing this setting avoids having to manually accept each compute node later on:
The authenticity of host 'domu-12-31-34-00-00-3a.usma2.compute.amazonaws.com (22.214.171.124)' can't be established.
RSA key fingerprint is 58:ae:0b:e7:a6:d8:d0:00:4f:ca:22:53:42:d5:e5:22.
Are you sure you want to continue connecting (yes/no)? yes
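If you would rather make the ssh_config change from a script than by hand, the edit amounts to a one-line text substitution. A rough Python sketch that works on the file contents as a string; set_strict_host_key_checking is a hypothetical helper, it only rewrites the first matching line, and the same security caveat applies.

```python
import re

# Matches a StrictHostKeyChecking line, commented out or not.
_OPTION = re.compile(r"^[ \t]*#?[ \t]*StrictHostKeyChecking\b.*$", re.MULTILINE)


def set_strict_host_key_checking(config_text, value="no"):
    """Set the first StrictHostKeyChecking line to value, or append one."""
    replacement = "StrictHostKeyChecking %s" % value
    if _OPTION.search(config_text):
        return _OPTION.sub(replacement, config_text, count=1)
    # No such line yet: append it, making sure it starts on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + replacement + "\n"


if __name__ == "__main__":
    sample = "Host *\n#   StrictHostKeyChecking ask\n"
    print(set_strict_host_key_checking(sample))
```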
Creating a non-root user
We should run the MPI process as a non-root user, so we will create a "lamuser" account on the instance (in another version of this tutorial, I used LAM instead of MPICH2). Substitute your own cert, keys, and passwords.
-bash-3.1# adduser lamuser
-bash-3.1# passwd lamuser
Changing password for user lamuser.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.
Now configure the .bash_profile and .bashrc:
-bash-3.1# cd /home/lamuser/
-bash-3.1# ls
-bash-3.1# ls .
./ ../ .bash_logout .bash_profile .bashrc
-bash-3.1# nano .bash_profile
The contents of .bash_profile should be as follows (uncomment the LAM settings if you want to use LAM MPI instead of MPICH2):
-bash-3.1# more .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
LAMRSH="ssh -x"
export LAMRSH
#LD_LIBRARY_PATH="/usr/local/lam-7.1.2/lib/"
#export LD_LIBRARY_PATH
MPICH_PORT_RANGE="2000:8000"
export MPICH_PORT_RANGE
PATH=$PATH:$HOME/bin
#PATH=/usr/local/lam-7.1.2/bin:$PATH
#MANPATH=/usr/local/lam-7.1.2/man:$MANPATH
export PATH
#export MANPATH
We need to give lamuser the same MPI configuration we created for the root user earlier...
Create the file .mpd.conf as follows (with your own password for the secretword):
cd /home/lamuser
touch .mpd.conf
chmod 600 .mpd.conf
nano .mpd.conf

secretword=Myp@ssW0rD
The last step is to set ownership on the directory contents to the user:
chown -R lamuser:lamuser /home/lamuser
Adding the S3 Libraries
Download the developer tools for S3 to the instance:
-bash-3.1# wget http://developer.amazonwebservices.com/connect/servlet/KbServlet/download/134-102-759/s3-example-python-library.zip
-bash-3.1# unzip s3-example-python-library.zip
Archive: s3-example-python-library.zip
creating: s3-example-libraries/python/
inflating: s3-example-libraries/python/README
inflating: s3-example-libraries/python/S3.py
inflating: s3-example-libraries/python/s3-driver.py
inflating: s3-example-libraries/python/s3-test.py
Rebundle the compute node image
We are going to make this a public AMI, so we need to clear out some data first.
Here's the advice from the Amazon EC2 Developer Guide:
We have looked at making shared AMIs safe, secure and useable for the users who launch them, but if you publish a shared AMI you should also take steps to protect yourself against the users of your AMI. This section looks at steps you can take to do this.
We recommend against storing sensitive data or software on any AMI that you share. Users who launch a shared AMI potentially have access to rebundle it and register it as their own. Follow these guidelines to help you to avoid some easily overlooked security risks:
- Always delete the shell history before bundling. If you attempt more than one bundle upload in the same image the shell history will contain your secret access key.
- Bundling a running instance requires your private key and X509 certificate. Put these and other credentials in a location that will not be bundled (such as the ephemeral store).
- Exclude the ssh authorized keys when bundling the image. The Amazon public images store the public key an instance was launched with in that instance's ssh authorized keys file.
ssh into the modified image and clean up:
rm -f /root/.ssh/authorized_keys
rm -f /home/lamuser/.ssh/authorized_keys
rm ~/.bash_history
rm /var/log/secure
rm /var/log/lastlog
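Because this cleanup has to be repeated every time the image is rebundled, it may be worth keeping the list of files in a script. A minimal Python sketch that removes whichever of the listed files exist and reports what it deleted; remove_if_present is a hypothetical helper, and the demo runs against a temporary directory so it is safe to try anywhere.

```python
import os
import tempfile

# The files removed by hand in this post before bundling a public AMI.
SENSITIVE_FILES = [
    "/root/.ssh/authorized_keys",
    "/home/lamuser/.ssh/authorized_keys",
    "~/.bash_history",
    "/var/log/secure",
    "/var/log/lastlog",
]


def remove_if_present(paths):
    """Delete each existing file in paths and return the ones removed."""
    removed = []
    for path in paths:
        expanded = os.path.expanduser(path)
        try:
            os.remove(expanded)
        except FileNotFoundError:
            continue  # already gone, nothing to clean up
        removed.append(expanded)
    return removed


if __name__ == "__main__":
    # Demo on a scratch file; on the instance you would pass SENSITIVE_FILES.
    scratch = tempfile.mkdtemp()
    victim = os.path.join(scratch, "authorized_keys")
    open(victim, "w").close()
    print(remove_if_present([victim, os.path.join(scratch, "missing")]))
```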
The ec2-bundle-vol command has some optional parameters we will use:
-bash-3.1# ec2-bundle-vol --help
Usage: ec2-bundle-vol PARAMETERS

MANDATORY PARAMETERS
-c, --cert PATH              The path to the user's PEM encoded RSA public key certificate file.
-k, --privatekey PATH        The path to the user's PEM encoded RSA private key file.
-u, --user USER              The user's EC2 user ID (Note: AWS account number, NOT Access Key ID).

OPTIONAL PARAMETERS
-e, --exclude DIR1,DIR2,...  A list of absolute directory paths to exclude. E.g. "dir1,dir2,dir3". Overrides "--all".
-a, --all                    Include all directories, including those on remotely mounted filesystems.
-p, --prefix PREFIX          The filename prefix for bundled AMI files. E.g. "my-image". Defaults to "image".
-s, --size MB                The size, in MB (1024 * 1024 bytes), of the image file to create. The maximum size is 10240 MB.
-v, --volume PATH            The absolute path to the mounted volume to create the bundle from. Defaults to "/".
-d, --destination PATH       The directory to create the bundle in. Defaults to "/tmp".
--ec2cert PATH               The path to the EC2 X509 public key certificate. Defaults to "/etc/aes/amiutil/cert-ec2.pem".
--debug                      Display debug messages.
-h, --help                   Display this help message and exit.
-m, --manual                 Display the user manual and exit.
Execute the same bundle command we ran previously, but give the image a prefix name:
-bash-3.1# ec2-bundle-vol -d /mnt -p fc6-python-mpi-node -k /mnt/pk-FOOXYZ.pem -c /mnt/cert-BARXYZ.pem -u 99999ABC -s 5536
Copying / into the image file /mnt/image...
Excluding:
  /sys
  /proc
  /proc/sys/fs/binfmt_misc
  /dev
  /media
  /mnt
  /proc
  /sys
  /mnt/image
  /mnt/img-mnt
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.015051 seconds, 69.7 MB/s
mke2fs 1.39 (29-May-2006)
warning: 256 blocks unused.
Bundling image file...
Splitting /mnt/image.tar.gz.enc...
Created fc6-python-mpi-node.part.00
Created fc6-python-mpi-node.part.01
Created fc6-python-mpi-node.part.02
Created fc6-python-mpi-node.part.03
Created fc6-python-mpi-node.part.04
Created fc6-python-mpi-node.part.05
Created fc6-python-mpi-node.part.06
Created fc6-python-mpi-node.part.07
Created fc6-python-mpi-node.part.08
Created fc6-python-mpi-node.part.09
Created fc6-python-mpi-node.part.10
Created fc6-python-mpi-node.part.11
Created fc6-python-mpi-node.part.12
Created fc6-python-mpi-node.part.13
Created fc6-python-mpi-node.part.14
...
Created fc6-python-mpi-node.part.39
Created fc6-python-mpi-node.part.40
Created fc6-python-mpi-node.part.41
Generating digests for each part...
Digests generated.
Creating bundle manifest...
ec2-bundle-vol complete.
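The two bundle commands in this post differ only in the -p prefix, so they are easy to generate from one place. The Python sketch below builds the argument list using only the flags shown in the help text above, and rejects sizes beyond the 10240 MB limit that the help text quotes. bundle_command is a hypothetical helper; it only assembles the command and does not run it.

```python
import shlex

MAX_IMAGE_MB = 10240  # maximum quoted in the ec2-bundle-vol help text


def bundle_command(private_key, cert, user_id, size_mb,
                   destination="/mnt", prefix=None):
    """Return the ec2-bundle-vol argument list for the given settings."""
    if not 0 < size_mb <= MAX_IMAGE_MB:
        raise ValueError("size must be between 1 and %d MB" % MAX_IMAGE_MB)
    args = ["ec2-bundle-vol", "-d", destination]
    if prefix:
        args += ["-p", prefix]
    args += ["-k", private_key, "-c", cert, "-u", user_id, "-s", str(size_mb)]
    return args


if __name__ == "__main__":
    cmd = bundle_command("/mnt/pk-FOOXYZ.pem", "/mnt/cert-BARXYZ.pem",
                         "99999ABC", 5536, prefix="fc6-python-mpi-node")
    # Print a copy-and-paste friendly version of the command.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list could be handed to subprocess on the instance, which avoids typing the key paths and user ID twice.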
Now remove the keys and delete the bash history:
-bash-3.1# rm /mnt/pk-*.pem /mnt/cert-*.pem
Upload the keyless node AMI to Amazon S3
bash-3.1# ec2-upload-bundle -b datawrangling-images -m /mnt/fc6-python-mpi-node.manifest.xml -a 1AFOOBARTEST -s F0Bar/T3stId
Setting bucket ACL to allow EC2 read access ...
Uploading bundled AMI parts to https://s3.amazonaws.com:443/datawrangling-images ...
Uploaded image.part.00 to https://s3.amazonaws.com:443/datawrangling-images/fc6-python-mpi-node.part.00.
Uploaded image.part.01 to https://s3.amazonaws.com:443/datawrangling-images/fc6-python-mpi-node.part.01.
...
Uploaded image.part.48 to https://s3.amazonaws.com:443/datawrangling-images/fc6-python-mpi-node.part.48.
Uploaded image.part.49 to https://s3.amazonaws.com:443/datawrangling-images/fc6-python-mpi-node.part.49.
Uploading manifest ...
Uploaded manifest to https://s3.amazonaws.com:443/datawrangling-images/fc6-python-mpi-node.manifest.xml.
ec2-upload-bundle complete
The upload will take several minutes...
Register Compute Node Image
To register the new image with Amazon EC2, we switch back to our local machine and run the following:
peter-skomorochs-computer:~ pskomoroch$ ec2-register datawrangling-images/fc6-python-mpi-node.manifest.xml
IMAGE ami-3e836657
Included in the output is an AMI identifier for our MPI compute node image (ami-3e836657 in the example above). In the next part of this tutorial, we will run some basic tests of MPI and pyMPI on EC2 using this image. In part 3, we will add some Python scripts to automate routine cluster maintenance and show some computations which we can run with the cluster.