Question : What is the difference between a CPU and a GPU?
Answer :
CPU – a few fast cores
– Good at sequential processing
GPU – many slower cores (thousands)
– Good at parallel computation

Fig. Beefy GPU vs beefy CPU

Question : Can GPUs be programmed?
Answer : Yes, they can.
CUDA (Nvidia-specific) – lets you write C code that runs directly on the GPU
– High-level APIs: cuBLAS, cuFFT, cuDNN, etc.
OpenCL – similar to CUDA, but runs on anything
– but is usually slower.

Programming with GPUs : Udacity course, Intro to Parallel Programming

Question : So with GPUs, how long does training take?
Answer : VGG – 2-3 weeks of training on 4 Titan Black GPUs (about $1K each)
ResNet-101 – 2-3 weeks on 4 GPUs.

Question : Do we need to do anything special to use multiple GPUs?
Answer : The simple way to compute across multiple GPUs is to split the mini-batch across them. For example, VGG is very memory intensive, so a single GPU can't hold a large mini-batch. With a mini-batch of 128, we split it into 4 equal chunks of 32, run the forward and backward pass for one chunk on each GPU, sum the resulting gradients once all 4 GPUs have finished, and then update the weights.
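The mini-batch split described above can be sketched in plain Python. This is only an illustration: the one-weight linear model, the toy data, and the names `grad_sum`, `chunks` and `per_gpu_grads` are assumptions, and the "GPUs" are simulated by an ordinary loop. The point it shows is that summing the per-chunk gradients gives the same gradient as processing the whole mini-batch at once.

```python
import math

def grad_sum(w, batch):
    """Sum of per-example gradients of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch)

# Toy mini-batch of 128 (x, y) pairs; the data and the model are made up.
batch = [(i / 128, 2.0 * (i / 128) + 1.0) for i in range(128)]
w = 0.5
lr = 0.01
num_gpus = 4

# Split the mini-batch into 4 equal chunks of 32, one per simulated GPU.
chunk = len(batch) // num_gpus
chunks = [batch[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]

# Each "GPU" computes the gradient for its own chunk at the same weights.
per_gpu_grads = [grad_sum(w, c) for c in chunks]

# Sum the gradients once every chunk is done, then update the weights once.
total_grad = sum(per_gpu_grads)
w_new = w - lr * total_grad / len(batch)

# The summed chunk gradients match the gradient of the whole mini-batch.
full_grad = grad_sum(w, batch)
print(math.isclose(total_grad, full_grad, rel_tol=1e-9))
```

On real hardware each chunk would live on a different device and the sum would be a cross-GPU reduction, but the arithmetic is the same.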

Question : With this kind of parallelism, do we always have to wait for every gradient computation to finish before updating the weights? Doesn't one slow CPU or GPU block all the others and waste computation?
Answer : Yes, with synchronous updates some compute time is lost while the faster workers wait. The alternative is asynchronous SGD, where each worker applies its weight update as soon as it finishes, without waiting for the others.
Fig. Synchronous vs asynchronous updates
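To make the difference concrete, here is a small simulation of both update schemes. It is a toy sketch under stated assumptions, not how any framework implements it: each worker owns one made-up target value, the loss is the mean of 0.5 * (w - target)^2, and the function names `sync_sgd` and `async_sgd` are hypothetical. In the asynchronous version each worker computes its gradient from the weights it last read, which may already be stale.

```python
def sync_sgd(targets, lr=0.1, steps=200):
    """All workers compute gradients at the same weights; one update per round."""
    w = 0.0
    for _ in range(steps):
        grads = [w - t for t in targets]      # every worker sees the same w
        w -= lr * sum(grads) / len(grads)     # update only after all have finished
    return w

def async_sgd(targets, lr=0.1, steps=200):
    """Workers take turns; each applies its gradient without waiting for the others."""
    w = 0.0
    last_read = [w] * len(targets)            # weights each worker last pulled
    for step in range(steps):
        k = step % len(targets)
        grad = last_read[k] - targets[k]      # computed from a possibly stale copy
        w -= lr * grad                        # applied immediately
        last_read[k] = w                      # worker pulls the fresh weights
    return w

targets = [1.0, 2.0, 3.0, 4.0]                # optimum is their mean, 2.5
w_sync = sync_sgd(targets)
w_async = async_sgd(targets)
print(w_sync, w_async)
```

In this toy setting the synchronous run settles on the optimum, while the asynchronous run ends near it and keeps moving slightly because each update uses one worker's stale gradient. That noise is the price paid for never blocking.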

Question : Are there any bottlenecks?
Answer : CPU-GPU communication : Transferring data between the CPU and the GPU is slow, so it is better to do the whole forward and backward pass on the GPU and minimize the data exchanged between the two.
CPU-disk bottleneck : Hard disks are slow to read from, so use an SSD and store the pre-processed images in one giant contiguous file. Read the data sequentially, since random reads are expensive.
GPU memory bottleneck : A Titan X has 12 GB of memory. AlexNet, which is small compared to current state-of-the-art networks, needs about 3 GB with a batch size of 256.
Floating-point precision : Many programming environments default to 64-bit double precision, but deep nets typically use 32-bit single precision. The fewer bits each number takes, the more numbers fit in memory, so smaller data types are preferred. For example, simply casting 64-bit arrays to 32-bit in numpy gives a decent speed-up.
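The numpy cast mentioned in the last point looks like this. The array shape and variable names are arbitrary choices for the example, and it only shows the memory saving. Any speed-up depends on the hardware and the operation, so nothing is timed here.

```python
import numpy as np

# numpy creates float64 arrays by default.
x64 = np.random.default_rng(0).standard_normal((256, 1024))

# Cast to single precision: half the bytes per number.
x32 = x64.astype(np.float32)

print(x64.dtype, x64.nbytes)
print(x32.dtype, x32.nbytes)
```

The values survive the cast to within single-precision rounding error, which is typically far smaller than the noise already present in SGD.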

Question : If smaller floating-point types are better, should we use something even smaller, i.e. 16 bit?
Answer : Yes. 16-bit (half) precision works well and is already supported in cuDNN. Nervana's 16-bit floating-point kernels are the fastest right now and are winning all the challenges.

Question : Is there any problem with 16 bit?
Answer : 16 bits give only 2^16 = 65,536 distinct bit patterns, so relatively few numbers can be represented, and both the range and the precision are limited. In practice, networks trained in 16 bit were found to have a hard time converging.
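A quick way to see these limits is to round a few numbers to 16 bits. The sketch below assumes "16 bit" means IEEE 754 half precision, which is one common 16-bit format and may not be the one every system uses. It relies on Python's standard `struct` module, whose `e` format code packs half-precision floats. The helper name `to_half` is made up for this example.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Largest finite half-precision value.
largest = to_half(65504.0)

# Near 1.0 the spacing between neighbours is 2**-10, so a small update vanishes.
lost_update = to_half(1.0 + 1e-4)

# Values far below the smallest subnormal (about 6e-8) flush to zero.
underflow = to_half(1e-8)

# Values beyond the representable range cannot be packed at all.
try:
    struct.pack("<e", 70000.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(largest, lost_update, underflow, overflowed)
```

The vanishing update is the case that matters most for training: a weight of 1.0 plus a small gradient step rounds straight back to 1.0, so learning can stall.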

Question : So what can we do about it? How can we make these networks converge?
Answer : The trick is stochastic rounding. All parameters (weights and biases) and activations are stored in 16 bit. During a multiplication they are up-converted to higher precision, and the result is then rounded back down to 16 bit stochastically: up or down at random, with the nearer value being more likely, so that the rounding error averages out to zero.
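Here is a minimal sketch of stochastic rounding itself, written for a fixed-point style grid. The grid spacing `step`, the helper name `stochastic_round` and the test value are all assumptions chosen for illustration. The answer above does not specify the exact 16-bit format, so this shows the rounding rule rather than any particular implementation.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability
    equal to the fractional distance from the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)
step = 2 ** -8            # assumed spacing of the low-precision grid
x = 0.3 * step            # a value smaller than one grid step

# Round-to-nearest always throws this value away.
nearest = round(x / step) * step

# Stochastic rounding returns 0 or step, but is correct on average.
samples = [stochastic_round(x, step, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print(nearest, mean, x)
```

With round-to-nearest, updates smaller than half a grid step are always lost. With stochastic rounding they are kept in expectation, which is why it helps low-precision training converge.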

Question : Can we go down to 8 bit? How low can we go?
Answer : A 2015 paper used 10-bit activations and 12-bit parameter updates.

Question : Can we go any lower?
Answer : Yes, we can train with 1-bit activations and weights.
– All activations and weights are +1 or -1
– Fast multiplication with XNOR
– Gradients use higher-precision floating point
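The XNOR trick works because multiplying two values from {+1, -1} gives +1 exactly when they match. A dot product then reduces to counting matching bits. The sketch below encodes +1 as bit 1 and -1 as bit 0. The helper names `pack` and `xnor_dot` and the two example vectors are made up for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, one bit per value."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * matches - n                               # matches minus mismatches

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

fast = xnor_dot(pack(a), pack(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(fast, reference)
```

On hardware, the XNOR and the bit count each handle a whole machine word of weights in a single instruction, which is where the speed comes from.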

Takeaways :
– GPUs are faster than CPUs
– Low precision makes things faster and still works
– 32 bit is standard now, 16 bit soon
– In the future, maybe binary nets
– Beware of bottlenecks