If deep learning were just a big matrix multiplication, it would not be deep; it would be linear. The nonlinear activation functions (e.g. sigmoid) mixed between the layers are what make it deep. I’m not really sure what it is about the training that you think is just a matrix multiplication.
It’s also worth noting that e.g. a convolution can be written as a matrix multiplication (or rather as some tensor products and contractions), but that matrix would be highly redundant: distant points cannot influence each other, and the same weights are applied at every point of the convolution. This is much less general than an arbitrary matrix multiplication.
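To make the structure concrete, here's a minimal pure-Python sketch of a 1-D "valid" convolution written as multiplication by a banded (Toeplitz) matrix. Every row repeats the same kernel shifted by one, which is exactly the redundancy described above (signal and kernel values are made up for illustration):

```python
def conv1d(signal, kernel):
    """Direct 'valid' convolution (cross-correlation convention)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def toeplitz_matrix(kernel, n):
    """Banded matrix whose product with a length-n signal gives the
    same result: row r places the kernel starting at column r."""
    k = len(kernel)
    return [[kernel[c - r] if 0 <= c - r < k else 0.0 for c in range(n)]
            for r in range(n - k + 1)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]          # a simple difference filter

# The dense matrix reproduces the convolution exactly, but most of its
# entries are zeros or repeats of the same three weights.
assert conv1d(signal, kernel) == matvec(toeplitz_matrix(kernel, len(signal)), signal)
```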
If floating point values were real numbers, simple matrix multiplications would not produce nonlinearities.
However, researchers at OpenAI demonstrated that the use of subnormal floating point values and their discontinuity [0] provides sufficient nonlinearity to learn nonlinear associations.
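As a small illustration of the underlying discontinuity (not the full construction from that work), here is a sketch of how subnormals and rounding break exact linearity in IEEE-754 doubles:

```python
# Smallest positive subnormal double, 2**-1074.
tiny = 5e-324
assert tiny > 0.0

# Halving it underflows straight to zero, so scaling back up cannot
# undo it: f(x * 0.5) * 2 != x. A truly linear map could never do this.
assert (tiny * 0.5) * 2.0 == 0.0
assert tiny * (0.5 * 2.0) == tiny

# Rounding also absorbs small additions at large magnitudes
# (the spacing between doubles near 1e16 is 2.0):
assert (1e16 + 1.0) == 1e16
```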
Then, would it be better to store rationals as two ints with an understood division, and irrationals as two ints with an understood power?
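The first half of that already exists in Python's standard library: `fractions.Fraction` stores exactly two ints with an understood division. A quick sketch of where it helps, and where the power-encoding idea runs into trouble:

```python
from fractions import Fraction

# Binary floats round; exact rationals don't.
assert 0.1 + 0.2 != 0.3
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# The catch for irrationals: a "two ints as an understood power"
# encoding covers values like 2**(1/2), but the set isn't closed under
# arithmetic -- e.g. sqrt(2) + sqrt(3) is not a rational power of a
# rational, so every operation would need a more general symbolic
# representation (which is what computer algebra systems do, at a
# real runtime cost).
assert Fraction(1, 3) * 3 == 1
```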
Why are we working with these numbers so imprecisely, and with such discontinuity? When I did math work in school, I always worked with the primitives without approximating: 2^.5 was handled as such unless an approximate answer was asked for.
I also ran across this in AutoCAD. Because it forces a fixed numeric precision, one cannot define a length of sqrt(2): 1.414 doesn't reach all the way across the hypotenuse of a right triangle with legs 1, 1.
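The anecdote in numbers, assuming the 1,1 right triangle above: the hypotenuse is sqrt(2), and neither a 4-digit decimal nor even a full binary double hits it exactly.

```python
import math

hyp = math.sqrt(2.0)

# The 4-digit decimal falls short of the true hypotenuse:
assert 1.414 ** 2 < 2.0

# Even the closest double only lands near it -- squaring it does not
# give back exactly 2:
assert hyp ** 2 != 2.0
assert math.isclose(hyp ** 2, 2.0, rel_tol=1e-15)
```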
This misses the point. Yes, there have to be nonlinear activations. However, almost all of the compute time is spent in matrix multiplications (or stencil operations for convolutions, which are a form of structured matrix multiplication). So when talking about runtimes, you're basically talking about matrix multiplication.
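A back-of-envelope sketch of that claim, with made-up layer sizes: per fully connected layer, the matmul costs about 2 * d_in * d_out flops per sample, while the activation costs only about d_out.

```python
# Hypothetical sizes, chosen only for illustration.
d_in, d_out, batch = 4096, 4096, 32

matmul_flops = 2 * d_in * d_out * batch   # one multiply-add per weight per sample
activation_ops = d_out * batch            # one nonlinearity per output per sample

# The matmul does ~2 * d_in times more work than the activation,
# so runtime is dominated by the linear part.
assert matmul_flops // activation_ops == 2 * d_in
```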
Apologies, I did not mean to imply it was only a matter of multiplication. Only that training has several steps that are a giant matrix multiplication. The more data to train, the larger the matrix.
Edit: also, it isn't the activation function that makes it deep, is it? Rather, it is the literal depth of the network. Right?
Right. But without an activation function, the matrix multiplications would just compose into a single matrix for the whole net. I guess the floating-point rounding brought up above is supposed to serve as an activation function for this purpose, though I haven't looked into it.
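The collapse is easy to see on tiny pure-Python matrices (values made up for illustration): two linear layers with nothing between them equal one layer with the composed matrix.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

w1 = [[1.0, 2.0], [3.0, 4.0]]   # layer 1 weights
w2 = [[0.0, 1.0], [1.0, 1.0]]   # layer 2 weights
x  = [5.0, 6.0]                 # input

two_layers = matvec(w2, matvec(w1, x))    # "deep", but purely linear
collapsed  = matvec(matmul(w2, w1), x)    # one equivalent matrix

# Without a nonlinearity in between, the depth buys nothing:
assert two_layers == collapsed
```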
So my question is how large is the linear part? Granted, I was hoping it was big enough to bring these algorithms into consideration. Looks like I'm right in that they are larger than 2x2. However, they are still not 8kx8k. :(