If deep learning were just a big matrix multiplication, it would not be deep; it would be linear. The nonlinear activation functions (e.g. sigmoid) mixed between the layers are what make it deep. I’m not really sure what it is about the training that you think is just a matrix multiplication.
It’s also worth noting that e.g. a convolution can be written as a matrix multiplication (or rather as some tensor products and contractions), but that matrix would be highly redundant: distant points cannot influence each other, and the same weights are applied at every point of the convolution. This is much less general than an arbitrary matrix multiplication.
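To make the structure concrete, here's a minimal pure-Python sketch of a 1-D "valid" convolution written as multiplication by a banded (Toeplitz) matrix. Every row repeats the same kernel shifted by one, which is exactly the redundancy described above (signal and kernel values are made up for illustration):

```python
def conv1d(signal, kernel):
    """Direct 'valid' convolution (cross-correlation convention)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def toeplitz_matrix(kernel, n):
    """Banded matrix whose product with a length-n signal gives the
    same result: row r places the kernel starting at column r."""
    k = len(kernel)
    return [[kernel[c - r] if 0 <= c - r < k else 0.0 for c in range(n)]
            for r in range(n - k + 1)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]          # a simple difference filter

# The dense matrix reproduces the convolution exactly, but most of its
# entries are zeros or repeats of the same three weights.
assert conv1d(signal, kernel) == matvec(toeplitz_matrix(kernel, len(signal)), signal)
```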
If floating point values were real numbers, simple matrix multiplications would not produce nonlinearities.
However, researchers at OpenAI demonstrated that the use of subnormal floating point values and their discontinuity [0] provides sufficient nonlinearity to learn nonlinear associations.
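As a small illustration of the underlying discontinuity (not the full construction from that work), here is a sketch of how subnormals and rounding break exact linearity in IEEE-754 doubles:

```python
# Smallest positive subnormal double, 2**-1074.
tiny = 5e-324
assert tiny > 0.0

# Halving it underflows straight to zero, so scaling back up cannot
# undo it: f(x * 0.5) * 2 != x. A truly linear map could never do this.
assert (tiny * 0.5) * 2.0 == 0.0
assert tiny * (0.5 * 2.0) == tiny

# Rounding also absorbs small additions at large magnitudes
# (the spacing between doubles near 1e16 is 2.0):
assert (1e16 + 1.0) == 1e16
```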
Then, would it be better to store rationals as two ints with an understood division, and irrationals as two ints with an understood power?
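The first half of that already exists in Python's standard library: `fractions.Fraction` stores exactly two ints with an understood division. A quick sketch of where it helps, and where the power-encoding idea runs into trouble:

```python
from fractions import Fraction

# Binary floats round; exact rationals don't.
assert 0.1 + 0.2 != 0.3
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# The catch for irrationals: a "two ints as an understood power"
# encoding covers values like 2**(1/2), but the set isn't closed under
# arithmetic -- e.g. sqrt(2) + sqrt(3) is not a rational power of a
# rational, so every operation would need a more general symbolic
# representation (which is what computer algebra systems do, at a
# real runtime cost).
assert Fraction(1, 3) * 3 == 1
```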
Why are we working with these numbers so imprecisely, and with such discontinuity? When I did math work in school, I always worked with the primitives without approximating: 2^.5 was handled as such unless an approximate answer was asked for.
I also ran across this in AutoCAD. Because it forces a fixed numeric precision, one cannot define a length of sqrt(2): 1.414 doesn't reach all the way across the hypotenuse of a right triangle with legs 1, 1.
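The anecdote in numbers, assuming the 1,1 right triangle above: the hypotenuse is sqrt(2), and neither a 4-digit decimal nor even a full binary double hits it exactly.

```python
import math

hyp = math.sqrt(2.0)

# The 4-digit decimal falls short of the true hypotenuse:
assert 1.414 ** 2 < 2.0

# Even the closest double only lands near it -- squaring it does not
# give back exactly 2:
assert hyp ** 2 != 2.0
assert math.isclose(hyp ** 2, 2.0, rel_tol=1e-15)
```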
This misses the point. Yes, there have to be nonlinear activations. However, almost all of the compute time is spent in matrix multiplications (or stencil operations for convolutions, which are a form of structured matrix multiplication). So when talking about runtimes, you're basically talking about matrix multiplication.
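A back-of-envelope sketch of that claim, with made-up layer sizes: per fully connected layer, the matmul costs about 2 * d_in * d_out flops per sample, while the activation costs only about d_out.

```python
# Hypothetical sizes, chosen only for illustration.
d_in, d_out, batch = 4096, 4096, 32

matmul_flops = 2 * d_in * d_out * batch   # one multiply-add per weight per sample
activation_ops = d_out * batch            # one nonlinearity per output per sample

# The matmul does ~2 * d_in times more work than the activation,
# so runtime is dominated by the linear part.
assert matmul_flops // activation_ops == 2 * d_in
```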
Apologies, I did not mean to imply it was only a matter of multiplication. Only that training has several steps that are a giant matrix multiplication. The more data to train, the larger the matrix.
Edit: also, it isn't the activation function that makes it deep, is it? Rather, it is the literal depth of the network. Right?
Right. But without an activation function, the matrix multiplications would just compose into a single matrix for the whole net. I guess the floating-point rounding brought up above is supposed to serve as an activation function for this purpose, though I haven't looked into it.
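The collapse is easy to see on tiny pure-Python matrices (values made up for illustration): two linear layers with nothing between them equal one layer with the composed matrix.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

w1 = [[1.0, 2.0], [3.0, 4.0]]   # layer 1 weights
w2 = [[0.0, 1.0], [1.0, 1.0]]   # layer 2 weights
x  = [5.0, 6.0]                 # input

two_layers = matvec(w2, matvec(w1, x))    # "deep", but purely linear
collapsed  = matvec(matmul(w2, w1), x)    # one equivalent matrix

# Without a nonlinearity in between, the depth buys nothing:
assert two_layers == collapsed
```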
So my question is how large is the linear part? Granted, I was hoping it was big enough to bring these algorithms into consideration. Looks like I'm right in that they are larger than 2x2. However, they are still not 8kx8k. :(