15418 Final Project

View the Project on GitHub taichia/15418Project

Welcome to our 15418 Final Project.

For our project, we are designing a Deep Boltzmann Machine with parallel tempering on a GPU.

Background Information

Deep Boltzmann Machines are powerful generative models capable of unsupervised learning on large datasets. However, due to their expressivity, the training process is extremely compute-intensive. Our goal is to implement a Deep Boltzmann Machine training and inference program on modern GPUs.

Boltzmann Machines can do very interesting things, and they don’t require labels on the data in order to learn from it, making them useful in settings where people have collected a lot of data but don’t have the resources to label all of it. They learn structure in the data, which can then be used to classify similar datapoints or even generate new, similar data. For instance, they can learn what a certain alphabet system’s characters “look like” and generate plausible examples that would fit into the alphabet, or complete one half of a character given the other, which can be useful for filling in missing parts of data.
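
As a toy, stdlib-only illustration of the kind of computation involved (our own sketch, not the project’s code): each hidden unit of a (Restricted) Boltzmann Machine turns on with probability given by a sigmoid of its weighted input from the visible units.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, b, rng):
    """Sample a hidden layer given visible units v:
    h_j ~ Bernoulli(sigmoid(b_j + sum_i v_i * W[i][j])).
    Toy sketch with plain Python lists, not the project code."""
    h = []
    for j in range(len(b)):
        act = b[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        h.append(1 if rng.random() < sigmoid(act) else 0)
    return h
```

Training repeats this sampling step (and the symmetric visible-given-hidden step) an enormous number of times, which is why the workload maps so naturally onto a GPU.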

Proposal

The link to the proposal can be found here: Proposal

Checkpoints (brief; see the proposal for more detail):

Milestones:

For our 75% goal, we plan on getting a parallelized version of the machine implemented with CUDA. We are currently writing the sequential code in Python, so we had to research how to do CUDA programming from Python. In our research, we found that we can use NumbaPro to run vectorized Python code on the 1080. We also found that benchmarking will be easy with the built-in profiler (or nvprof), which shows exactly where in our code, and on which device (CPU or GPU), the most time is spent. There are a lot of resources available on using CUDA from Python, most of them on YouTube (CUDACast #10 and onward).

For our 100% goal, we will tune our CUDA implementation to achieve maximum speedup. We can do this in several ways, including adjusting the grid and block sizes of each CUDA kernel, the network size, the learning rate, etc.
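
As one example of the first knob, here is a hypothetical helper (ours, not project code) for picking a 1-D launch configuration: the block size is the tunable parameter (usually a multiple of the 32-thread warp), and the grid must have enough blocks to cover the whole array.

```python
def launch_config(n, threads_per_block=256):
    """Pick a 1-D CUDA launch configuration covering n elements.
    threads_per_block is the tunable knob (typically a multiple of 32);
    the last block may be only partially full."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceil(n / tpb)
    return blocks, threads_per_block
```

Sweeping `threads_per_block` over values like 128/256/512 and timing each kernel is one simple way to search this part of the tuning space.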

For our 125% goal, we could implement different training algorithms and check their quality and performance against the one we wrote. Such algorithms include the Metropolis-Hastings algorithm, Gibbs sampling, slice sampling, etc. All of these can be parallelized in some way, so we can implement crude versions of them to compare quality and speed.
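
To make the comparison concrete, here is a stdlib-only sketch of a single Metropolis-Hastings update with a symmetric proposal (function names are our own, not project code):

```python
import math
import random

def metropolis_step(state, energy, propose, rng):
    """One Metropolis-Hastings update with a symmetric proposal.
    Sketch only; the real sampler would operate on whole unit vectors."""
    candidate = propose(state, rng)
    delta = energy(candidate) - energy(state)
    # Accept downhill moves always; uphill moves with probability exp(-delta).
    if delta <= 0 or rng.random() < math.exp(-delta):
        return candidate
    return state
```

Since independent chains never communicate, each chain can run on its own GPU thread or block, which is what makes these samplers attractive to parallelize.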

Midpoint Progress Review

Summary so far:

We’ve finished all of the preliminary research we wanted to do on both Deep Boltzmann Machines and the underlying CUDA portion. Additionally, we’ve implemented a serial program that trains and tests on MNIST, a handwritten-digit (0-9) dataset. While MNIST is often known as a “toy” dataset, the unoptimized Deep Boltzmann Machine program takes ~5 minutes per epoch on just 10% of the data, and given that we want to run about 100 epochs over all of the data, this is far too slow. It is currently written in Python/NumPy, which is single-threaded but otherwise has well-optimized matrix operations.

Next steps:

The plan from here on out is to convert the NumPy portions into CUDA code or equivalent CUDA matrix-operation libraries (for example, we don’t think it’s necessary to re-implement matrix multiplication from scratch). This will allow us to train more powerful models, which should increase the quality of our deliverables.

The simplest deliverable is a graph of the cross-entropy reconstruction error for various model parameters. However, since we are creating a generative model, we can also sample the distribution for “characters” to show off what it has learned about the dataset: some samples will look like real characters, while others might look like a morph of two or more. We can also do completions, where half of a character is given and we fill in the other half with the correct pixels, and upscaling, where we take a scaled-down character and try to recreate the original image. Ideally we can create all of these visual deliverables; the algorithmic ones, where we compare and contrast different algorithms, are not necessary but would definitely be “nice to have”.
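
The cross-entropy reconstruction error mentioned above can be sketched in stdlib-only Python (the exact formula in our program may differ slightly, e.g. in how it guards against log(0)):

```python
import math

def reconstruction_cross_entropy(v, v_recon, eps=1e-10):
    """Cross-entropy between binary visible units v and reconstruction
    probabilities v_recon; eps keeps log() away from zero."""
    return -sum(
        x * math.log(p + eps) + (1 - x) * math.log(1 - p + eps)
        for x, p in zip(v, v_recon)
    )
```

A perfect reconstruction scores near zero, and the error grows as the reconstruction probabilities drift away from the true pixels, so plotting this per epoch gives the training curve we want.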

For calling CUDA from Python, we will be using this website from NVIDIA for reference and tutorials (https://developer.nvidia.com/how-to-cuda-python). It goes over examples of how to use CUDA from Python modules, and also how to use nvprof and other profilers to find where the code spends the most time, information we can then use to achieve higher speedup in our machine.

Moving on:

For the week of November 21st up until Thanksgiving, we will try to install Anaconda on AFS and test the functionality of CUDA from Python using this code. If this fails, we will have to ask the professors to let us borrow a 1080. In the meantime, I can borrow a non-1080 NVIDIA GPU from one of my friends, or we can use our laptops’ 870M/970M until we find a way to borrow a 1080.

From Thanksgiving until November 28th, we plan on implementing a crude CUDA implementation of our code.

From November 28th to December 3rd (very flexible), we will attempt to fine-tune the algorithm to achieve maximum speedup. We will use tools like nvprof to see where in our code the biggest gains are available. Of course, because training takes a long time, we are not sure whether we can improve both speed and quality in just one week; this process might take only one day, or the entire seven days.

From December 3rd until December 8th, we will implement crude versions of a parallel tempering algorithm and the Metropolis-Hastings algorithm. This should not take too long, since by then we should be familiar with using CUDA from Python. The only problem we might run into is that, due to the time constraint, these algorithms won’t be fine-tuned down to the smallest detail.
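
The core of parallel tempering is the replica-swap test between chains running at different inverse temperatures; below is a stdlib-only sketch of the standard acceptance rule (names are ours, not project code):

```python
import math
import random

def maybe_swap(energies, betas, i, j, rng):
    """Decide whether to swap replicas i and j in parallel tempering.
    Accept with probability min(1, exp((beta_i - beta_j) * (E_i - E_j)))."""
    delta = (betas[i] - betas[j]) * (energies[i] - energies[j])
    # delta >= 0 means the swap only helps; otherwise accept stochastically.
    return delta >= 0 or rng.random() < math.exp(delta)
```

Between swap attempts, every chain steps independently, so the chains themselves parallelize trivially across GPU threads or blocks; only the occasional swap needs coordination.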

From December 8th to December 13th, we will begin implementing the visual portion of the deliverable. We will create a slideshow as opposed to a physical poster board because we will be displaying videos of our work, which we think works best presented digitally rather than as a before-and-after picture.

One problem we might run into is installing the necessary programs on our AFS space. For CUDA to work from Python according to the tutorial, we will need to install Anaconda Accelerate in order to use the NumbaPro compiler. We are not entirely sure whether this can be installed on a remote machine; if not, we would need to install a Linux operating system on our home desktop along with a 1080 borrowed from the school. Another problem is whether we will have time to achieve our 125% milestone, as writing new algorithms for the sake of testing might take too long, and we are as of now unsure how much time fine-tuning the Deep Boltzmann Machine will take.

Preliminary results: the natural reconstruction cross-entropy is about 540; our program is able to reduce that to about 210-230 after 30 epochs (which took more than an hour!). Our goal from here is to speed the program up so that we can afford more model complexity, lowering both training time and error.

Final Report

The link to the results can be found here: Final Report

Authors and Contributors

The two team members that worked on this project are Taichi Akiyama (@taichia) and Edgar Chen (@EggyLv999).