《并行程序》的编程还是非常麻烦,不仅仅要先编写串行程序,再并行化,而且还要性能分析。两道小题,几乎把前四章全部知识都考了一遍。
这次的作业让我见识了加拿大大学的作业量。我把题目拷贝在这里,以便将来回顾时可以客观评估一下这个学位的含金量是多少。
Assignment 1
Due date: end of week 4
Question 1
A good random number generator has to pass strict quality tests. One family of tests involves checking whether certain ordered sequences of numbers are present in the output with the expected frequency. For this assignment, you will write a serial and then a parallel program to perform a simple test of this kind.
Use a random number generator of your choice to generate a random sequence of integers 0 and 1. Let the sequence length be N. Now determine how many times a subsequence of n identical integers appears in it. For example, for n=4, you would be looking for subsequences 0000 and 1111. Sequence 000000 contains 3 such subsequences.
- Write a serial program to perform this task. You can then use this program to check whether your parallel program is correct, if you use the same random sequence for testing.
- On the basis of the serial program, write a parallel program to perform this task. Use the divide and conquer algorithm, which in this case means having the master (zero) process generate the long random sequence, then parceling out the data to other processes for counting the occurrences of subsequences of various length n.
Write a program that can handle cases up to N=10**9. Run your tests on the course server. Be careful not to exceed the memory of the server.
Use non-blocking communications as appropriate to parcel out the data generated on the master process, so that you overlap computation and communication. Use standard communications as needed to exchange data between neighbouring processes when looking for subsequences which straddle the border between processes. Use collective communications to collect the final data.
For simplicity, you can assume that N/p is an integer. You can also assume that n < N/p, which avoids the possibility that a subsequence you are looking for will be spread across more than 2 processes.
- Based on your results, briefly discuss whether your random number generator functions properly. Make a histogram which shows the frequency of sequences of various values of n for case N=10**9. Use log scale. Compare with what the theoretical expectation of a perfect random number generator would be.
- Test the performance of your code on the course server for N=1000, N=1000*1000 and N=1000*1000*1000, running with 1,2 4 and 8 MPI processes. Comment on the speedup you see compared to the theoretical speedup you could expect.
Question 2
We can use derived datatypes to write functions for (dense) matrix I/O when we store the matrix by block columns
- Write a function that writes to file a square matrix ditributed by block columns among the processes. Suppose that the order of the matrix is n and the number of processes is p, and assume that n is evenly divisible by p. The function should successively gather blocks of n/p rows to process 0, and process 0 should print each block of n/p rows immediately after it has been received. For each gather of n/p rows, each process should send (using MPI_Send) a block of order n/p x n/p to process 0. Process 0 should carry out the gather using a sequence of calls to MPI_Recv. The datatype argument should be a derived datatype created with MPI_Type_vector
- Write a function that reads in a square matrix stored in row-major order in a single file. Process 0 should read in the number of rows and broadcast this information to the other processes. Assume that n, the number of rows, is evenly divisible by p, the number of processes. Process 0 should then read in a block of n/p rows and distribute blocks of n/p columns to each of the processes: the first n/p columns go to 0, the next n/p to 1 etc. Process 0 should then repeat this process for each block of n/p rows. Use a derived datatype created with MPI_Type_vector so that the data sent to each process can be sent with a single call to MPI_Send.
- Write a main program that calls the functions written in a and b. Analyze the performance for various matrix sizes.
上述题目,对于答案的有很详细的要求:
You will be asked to write code for problems related to material covered in the lesson notes. The code should contain comments clearly explaining what it is doing. The code should be submitted electronically, all materials bundled in a zip file. Included should be the source code files in such a format that the instructor can compile and run them. A PDF file containing the discussion for the assignment should also be included. The discussion portion should be at least 2 double spaced pages in length with a font size of 12.
Students should work on the assignments using the computers provided by the university for the course, which have all the necessary software and hardware installed. Instructions for logging in and using these computers will be provided to the students at the start of the course.
这个题目不仅要求全部做对,即代码能运行,还要详细的注解。客观的说,这个题目对于没有工作过的人来说,工作量比较大,主要是细节部分太容易扣分。比如,第二题,从process 0分到其他process的时候,其实内存的排列是不同的,即一行与下一行的距离是不一样的。Process 0是很长距离,而其他的process则是紧紧挨着的,这就决定了要定义三种自定义类型,一是把一行归为一类;二是process 0的内存距离为一类vector;第三类是其他process的内存要紧紧挨着为一类vector。第一道题也用到了不少通信的内容,要完全答对,需要用到分发,非阻塞通信,聚合等函数,真是花了不少时间。
我总共花了一个周末,再加5个晚上,把所有的题目报告做完。我非常详细的编写了程序,并且写好注释做好图标,最终写好设计文档,在星期六上交了。我最终的成绩是A+,满分。