I didn't see in the paper why these methods are not used in practice. There are parallel implementations of Strassen's algorithm out there (I have written one). Yes, they are faster, but they do not guarantee stability. They are also significantly harder to parallelize than block-cyclic MM.
> There are parallel implementations of strassens algorithm out there
Do you have one off the top of your head you could link the code to? It doesn't have to be your implementation specifically. I just want to read the code for a parallel implementation.
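While waiting for a link, here is a minimal serial sketch of Strassen's recursion in NumPy (not a tuned parallel implementation, and assuming n x n inputs with n a power of two; real implementations pad or peel odd sizes):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm: 7 recursive products instead of 8."""
    n = A.shape[0]
    if n <= leaf:                      # below the cutoff, classical matmul wins
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven recursive products:
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine into the four quadrants of the result:
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The seven independent recursive calls are where the (nontrivial) parallelism lives in the real implementations being discussed.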
It seems that not very many mathematicians are working on optimizing this problem, but matrix multiplication is used every day by many people and companies.
Did we already exhaust economies of scale on this one?
There has been an enormous amount of effort placed into optimizing matrix multiplication.
The key here is that the methods have impracticably huge constant factors, such that anything beyond Strassen would need matrices larger than we could conceivably use in order to be more effective.
This comment also seems to suggest that legions of scientists and engineers aren't trying to optimize matrix multiplication. They are. Here are a few directions to pay attention to:
1. Specialized matrix multiplications for specific sizes and with reduced precision.
For example, Intel's libxsmm, methods for multiplying with reduced precision on [CGT]PU.
2. Structured matrix multiplication
Circulant and FFT-like matrices can be multiplied in linearithmic time. If you can set your problem up to use these instead (see Choromanski's Orthogonal Random Features [2016] or Smola's Deep-Fried Convnets [2015]/Fastfood [2014]), you can dramatically improve the speed and memory efficiency of your application.
3. Hardware-specific optimizations.
Optimizing algorithms by cache efficiency, using FMA (fused multiply-addition) and SIMD instructions, and prefetching all play a big part in improving these multiplications. I'm commenting specifically on CPUs, but similar issues and methods exist on all hardware backends.
In summary, many people are working very hard on improving matrix multiplication. Lower asymptotic complexity algorithms are not the answer for this due to the absurd constant factors they require.
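To make point 2 concrete, here is a small sketch of the circulant case: multiplying by an n x n circulant matrix in O(n log n) via the FFT instead of O(n^2) for a dense matrix-vector product. A circulant matrix is fully determined by its first column c.

```python
import numpy as np

def circulant_matvec(c, x):
    # Circulant matrices are diagonalized by the DFT, so
    # C @ x == IFFT(FFT(c) * FFT(x)) -- a circular convolution.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

def circulant_dense(c):
    # Dense equivalent, for checking: column j is c rolled down by j.
    n = len(c)
    return np.column_stack([np.roll(c, j) for j in range(n)])
```

This is the trick behind the orthogonal-random-features and Fastfood-style methods mentioned above: structure the matrix so the FFT applies.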
For 2x2 matrix multiplication, it is known that Strassen's algorithm (7 multiplications) is optimal, in the sense that there exists no algebraic circuit that uses only 6 "scalar" multiplications. I took your question to ask whether for 3x3 matrix multiplication, the minimum number of scalar multiplications required was similarly known. It is not, and there are more options to take into account than just orderings.
Well, most matrix multiplications tend to involve small matrices (eg 4x4), and the time complexity of multiplying 4x4 matrices is O(1). These are actively optimised in the practical sense: hand-written assembly and new hardware.
If one has to multiply very large matrices then the goal is usually not so much to optimise matrix multiplication in general but to optimise the multiplication of the subset of matrices one is interested in. In these situations one cares about eg density, symmetry, and choice of basis. There is a lot of information in a large random matrix and there is typically less information in an interesting large matrix. Therefore the goal is to skip the work required by all the entropy one does not have.
This is no longer the case, given deep learning, though. Right? Effectively, the training process can be seen as a giant matrix multiplication in one step.
Deep learning mostly uses matrices with largest dimension <1000. The size of the matrix directly corresponds to the number of input and output units of a fully connected layer.
Even a 2000x2000 matrix is relatively small as far as non O(N^3) matrix multiplication algorithms go. The faster asymptotic algorithms have larger fixed costs and do not map as nicely to current CPU/GPU architectures. Additionally, the people capable of writing high performance matrix multiplication mostly haven't spent time on the non-N^3 algorithms. I believe every common implementation (MKL, cuBLAS, Eigen, and various other BLAS's) all use the N^3 algorithm with optimizations more focused on computer architecture (cache and register blocking, limiting instruction dependencies, SIMD) than on algorithmic cleverness.
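The cache and register blocking mentioned above can be sketched in Python for clarity (real BLAS libraries do this in hand-tuned assembly with SIMD and prefetching; this just shows the loop structure):

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """O(N^3) matmul restructured into tile-sized updates so each
    loaded block of A and B is reused many times while cache-resident."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for p in range(0, k, tile):
            for j in range(0, m, tile):
                # One tile-sized update (slicing clips at the edges):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

The tile size is chosen per architecture so the working set fits in a given cache level; production kernels block at several levels at once.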
The only attempt that I know of to use a non-N^3 algorithm for practical matrix multiplication is a 2016 paper[0] called Strassen's Algorithm Reloaded. Figure 5 shows that their implementation of Strassen's outperforms MKL on a single core on about a (2K, 2K, 2K) matrix multiplication problem (look at the upper right part of the figure). However, you're rarely multiplying on a single core. Figure 6 shows that for a 10-core system, Strassen's becomes marginally faster with square matrices of size 4K, and only becomes significantly faster at size 8K. Although these results are super cool in my opinion, they're mostly not applicable to deep learning in its current incarnation.
Finally, Strassen's algorithm is one of the simplest non-N^3 matmul algorithms (known since 1969, complexity O(n^2.8)) and I believe it has much lower fixed costs than something more recent that has complexity more like O(n^2.37).
Thanks for the links! I had never actually seen it quantified for the size matrix you needed. Just "large" which is always relative. :)
And I also hadn't looked closely enough to see if amount of training data influenced matrix size. Makes sense that it would only influence a single dimension.
The typical neural net matrix multiplication is N_EXAMPLES X N_FEATURES_IN multiplied with N_FEATURES_IN X N_FEATURES_OUT.
The output feature count is completely independent of the data size, and input feature count is only dependent on the dimensionality of the data (not the number of points), and that's only in the first layer of the network.
Even with datasets with a huge number of examples, the net usually only trains on a small "minibatch" of examples at a time, typically somewhere between 16 and 1024. This minibatch size is the algorithmic N_EXAMPLES. Given these numbers, the typical neural net matrix multiplication is probably something like (32, 256) x (256, 128). This is not nearly large enough for non-N^3 matmul algorithms to accelerate things.
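A back-of-envelope sketch using the (32, 256) x (256, 128) shapes above shows just how small these products are:

```python
import numpy as np

batch, f_in, f_out = 32, 256, 128         # minibatch x features, as above
X = np.random.randn(batch, f_in)          # activations for one minibatch
W = np.random.randn(f_in, f_out)          # layer weights
Y = X @ W                                 # one fully connected layer
assert Y.shape == (batch, f_out)

# One multiply + one add per (i, p, j) triple:
flops = 2 * batch * f_in * f_out          # about 2 million flops
```

A modern CPU core retires billions of flops per second, so asymptotic cleverness has nothing to bite on at this scale.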
If deep learning were just a big matrix multiplication then it would not be deep; it would be linear. The nonlinear (eg sigmoid) functions mixed between the layers make it deep. I’m not really sure what it is about the training that you think is just a matrix multiplication.
It’s also worth noting that eg a convolution can be written as a matrix multiplication (or rather some tensor products and contractions), but there would be much redundancy from the fact that distant points cannot influence each other and the same thing is done to each point in the convolution. This is much less general than matrix multiplication.
If floating point values were real numbers, simple matrix multiplications would not produce nonlinearities.
However, researchers at OpenAI demonstrated that the use of subnormal floating point values and their discontinuity [0] provides sufficient nonlinearity to learn nonlinear associations.
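A small self-contained demonstration of the kind of discontinuity being discussed (not the OpenAI experiment itself): float multiplication underflows, and subnormals fill part of the gap near zero with reduced precision.

```python
import sys

# Floats are not reals: multiplication can underflow discontinuously.
tiny = 1e-200
assert tiny > 0
assert tiny * tiny == 0.0           # true product 1e-400 underflows to zero

# Subnormal values sit below the smallest normal double (~2.2e-308):
assert 1e-310 > 0                   # representable, but only as a subnormal
assert 1e-310 < sys.float_info.min  # float_info.min is the smallest *normal*
```

It is exactly this departure from linearity over the reals that can be exploited as a (weak) activation function.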
Then, would it be better to store rationals as 2 ints with understood division, and irrationals as 2 ints as understood powers?
Why are we working with these numbers so imprecisely, and with such discontinuity? When I've done math work in school, I always worked with the primitives without approximating: 2^.5 was handled as such unless an approximate answer was asked for.
I also ran across this in AutoCAD. Because it forces a fixed precision, one cannot define a sqrt length: 1.414 doesn't reach all the way along the hypotenuse of a right triangle with legs 1 and 1.
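On the "two ints with understood division" idea: Python's standard library already provides exactly that, and it shows both the appeal and the cost.

```python
from fractions import Fraction

# Binary floats cannot represent 0.1 or 0.2 exactly:
assert 0.1 + 0.2 != 0.3

# Storing rationals as two ints ("understood division") fixes that:
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# But numerators/denominators grow with each operation, hardware has no
# native support, and irrationals like 2**0.5 still need symbolic
# handling (eg sympy) -- which is why numerics use floats anyway.
```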
This misses the point. Yes, there have to be nonlinear activations. However, almost all of the compute time is spent in matrix multiplications (or stencil operations for convolutions, which are a form of structured matrix multiplication). So when talking about runtimes, you're basically talking about matrix multiplication.
Apologies, I did not mean to imply it was only a matter of multiplication. Only that training has several steps that are a giant matrix multiplication. The more data to train, the larger the matrix.
Edit: also, it isn't the activation function that makes it deep, is it? Rather, it is the literal depth of the network. Right?
Right. But without an activation function, the matrix multiplications would just compose into a single matrix for the whole net. I guess the floating-point rounding brought up above is supposed to serve as an activation function for this purpose, though I haven't looked into it.
So my question is how large is the linear part? Granted, I was hoping it was big enough to bring these algorithms into consideration. Looks like I'm right in that they are larger than 2x2. However, they are still not 8kx8k. :(
> the training process can be seen as a giant matrix multiplication in one step.
It can't be seen that way.
Deep learning becomes deep when you interleave linear matrix multiplications with nonlinear functions. After that you can't reduce the problem to one step. You also lose associativity, so you can't even apply matrix chain multiplication optimization.
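A tiny worked example of the collapse: without a nonlinearity the layers compose into a single matrix; with one in between, they don't.

```python
import numpy as np

x  = np.array([1.0, -2.0])
W1 = np.array([[2.0, 0.0],
               [0.0, 3.0]])
W2 = np.array([[1.0, 1.0]])

# Without a nonlinearity the two layers collapse into one matrix:
# W1 @ x = [2, -6], and both orderings give [-4].
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

relu = lambda v: np.maximum(v, 0.0)
# With ReLU in between, no single matrix reproduces the result:
# relu([2, -6]) = [2, 0], so W2 @ [2, 0] = [2], not [-4].
assert (W2 @ relu(W1 @ x))[0] == 2.0
```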
Fully connected layers are matrix-vector multiplication (which turns into matrix-matrix if you do an entire batch at once). Convolutional layers are lots of tiny dot products. You can rewrite that as matrix-matrix multiplication.
That's right. For a fully connected layer the matrix is of dimension (batch size) x (number of neurons in layer k), which gets multiplied by a matrix of dimension (number of neurons in layer k) x (number of neurons in layer k+1). People do this because going through the network with a batch is more efficient than going through the network separately for each element in the batch, mainly because multiplying two matrices is more efficient than doing n matrix-vector products.

The batch size is limited because you get diminishing speedup out of increasing your batch size. Stochastic gradient descent also gets less improvement out of each data point per pass through the data set as you increase the batch size, so you need more passes. If you take the whole data set to be your batch size then you get ordinary gradient descent, which needs a lot more passes through the data set than stochastic gradient descent.
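The batching described above can be checked directly: one matrix-matrix product for the minibatch equals a loop of per-example matrix-vector products, it's just much faster in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_k, n_k1 = 32, 64, 48
X = rng.standard_normal((batch, n_k))    # (batch) x (neurons in layer k)
W = rng.standard_normal((n_k, n_k1))     # (layer k) x (layer k+1)

# One matrix-matrix product for the whole batch...
Y_batched = X @ W
# ...equals a per-example loop of matrix-vector products:
Y_looped = np.stack([W.T @ x for x in X])
assert np.allclose(Y_batched, Y_looped)
```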
Not sure I understand this. The circuit for addition is easier than the circuitry for multiplication.
That said, in circuits it is far less likely that one would ever multiply that large a matrix, even if I were correct that many steps of large training can be cast as a matrix multiply.
Extremely good explanation. But even general matrices are conjectured to be multipliable in O(n^2). Right now we are a little higher than that, around O(n^2.37). It's crazy that we can't find the minimum-complexity algorithm yet. The low complexity would be due to symmetries in pure uncorrelated matrices, which is clearly symmetry in multiplication itself or just in the numbers themselves. Someone will eventually find it.
The algorithms with the best asymptotic complexity aren't used in practice because they are slow for the n we care about. In practice optimising matrix multiplication comes down to making efficient use of SIMD, multiple cores and caches.
In addition to all the other reasons given (large constant, etc.), these faster algorithms are also often less stable numerically, as another poster alluded to above. See, e.g., https://en.wikipedia.org/wiki/Strassen_algorithm .
At least in my limited experience, matrix multiplication optimization is some combination of a) so specific to the specific matrices being used that the optimization amounts to finding "tricks" to reduce the problem (dgemm to sgemm or sgemm to syrk, etc.) and thus is not generally useful or b) so highly tied to IP that general optimizations (be they mathematical or advances in GPU or CPU low-level code) are kept secret.
The OP's WordPress install appears to be infected with malicious adware; every time I load the site I get quickly redirected to a random one of a number of scummy pages which hijack the back button, have autoplaying sound, display false error messages, and attempt to ask for browser-level permissions.
Put up some nicely formatted text, such as Google ad results, and clearly mark it as an ad. Why the fuck do ad companies have to do insanely annoying, borderline unethical things? Why is this industry so rotten to the core?
Because suckers who are not using adblockers are the only market segment for these ads.
In this day and age, every computer-literate person installs adblocking on every system, or asks someone to help do it, and almost never sees these ads. What's left is the people who are more likely to fall for scams.