Hacker News

If deep learning were just a big matrix multiplication then it would not be deep, it would be linear. The nonlinear (e.g. sigmoid) functions mixed between the layers make it deep. I’m not really sure what it is about the training that you think is just a matrix multiplication.
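A quick way to see the point about nonlinearity (a minimal numpy sketch, not from the original comment): a linear map f satisfies f(a + b) = f(a) + f(b), and the sigmoid visibly does not.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the activation mentioned above."""
    return 1.0 / (1.0 + np.exp(-z))

# Additivity fails, so the sigmoid is not linear:
print(sigmoid(1.0) + sigmoid(2.0))  # ~1.612
print(sigmoid(3.0))                 # ~0.953, not the same
```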

It’s also worth noting that e.g. a convolution can be written as a matrix multiplication (or rather as tensor products and contractions), but with a lot of redundancy, since distant points cannot influence each other and the same operation is applied at every point of the convolution. This is much less general than matrix multiplication.
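To make that concrete, here is a minimal sketch (the kernel and signal are made up) of a 1-D convolution rewritten as multiplication by a banded matrix; the zeros off the band encode exactly the redundancy described above. The kernel here is symmetric, so the flip that np.convolve performs does not change the result.

```python
import numpy as np

kernel = np.array([1.0, -2.0, 1.0])       # symmetric, so flipping it is a no-op
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Build the banded matrix whose rows are shifted copies of the kernel.
n, k = len(x), len(kernel)
M = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    M[i, i:i + k] = kernel

print(M @ x)                               # [ 5. -6.  7.]
print(np.convolve(x, kernel, mode="valid"))  # same values
```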



If floating point values were real numbers, simple matrix multiplications would not produce nonlinearities.

However, researchers at OpenAI demonstrated that the use of subnormal floating point values and their discontinuity [0] provides sufficient nonlinearity to learn nonlinear associations.
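A minimal illustration of that discontinuity (assuming IEEE 754 binary64 doubles with round-to-nearest-even, i.e. ordinary Python floats; this example is not taken from the linked post): near the bottom of the subnormal range, halving and then doubling is no longer the identity.

```python
# Smallest positive subnormal double (assumes IEEE 754 binary64).
tiny = 5e-324

# In exact real arithmetic, (x / 2) * 2 == x. In the subnormal range it isn't:
print(tiny / 2)                # 0.0 -- the halfway tie rounds to even (zero)
print((tiny / 2) * 2)          # 0.0, not tiny: the value is destroyed
print((tiny * 2) / 2 == tiny)  # True: doubling first stays representable
```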

[0] https://blog.openai.com/nonlinear-computation-in-linear-netw...


Then, would it be better to store rationals as two ints with an understood division, and irrationals as two ints with an understood power?

Why are we working with these numbers so imprecisely, and with such discontinuity? When I've done math work in school, I always worked with the primitives without approximating: 2^0.5 was handled as such unless an approximate answer was asked for.
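For what it's worth, exact rational arithmetic in the "two ints with an understood division" style already exists in the Python standard library (fractions); irrationals like 2^0.5 still need a symbolic representation, which is a separate problem. A minimal sketch:

```python
from fractions import Fraction

# Binary floats round decimal fractions, so arithmetic drifts:
print(0.1 + 0.2 == 0.3)  # False

# Exact ratios of ints do not:
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
```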

I also ran across this in AutoCAD. Because it forces a finite precision, one cannot define a length of exactly sqrt(2): 1.414 doesn't reach all the way across the hypotenuse of a right triangle with legs 1 and 1.


That post completely disregards the fact that subnormals are usually orders of magnitude slower to work with on common hardware.


This misses the point. Yes, there have to be nonlinear activations. However, almost all of the compute time is spent in matrix multiplications (or stencil operations for convolutions, which are a form of structured matrix multiplication). So when talking about runtimes, you're basically talking about matrix multiplication.
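A back-of-envelope sketch of why the linear part dominates (the layer width n = 4096 is an arbitrary assumption, not a number from the thread): one dense layer does about 2n² multiply-adds in the matrix product but only n activation evaluations.

```python
n = 4096                        # hypothetical layer width (assumption)
matmul_flops = 2 * n * n        # one multiply and one add per weight
activation_flops = n            # one nonlinearity per output unit
print(matmul_flops // activation_flops)  # 8192: the matmul dominates per layer
```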


Apologies, I did not mean to imply it was only a matter of multiplication. Only that training has several steps that are each a giant matrix multiplication. The more data you train on, the larger the matrices.

Edit: also, it isn't the activation function that makes it deep, is it? Rather, it is the literal depth of the network. Right?


Right. But without an activation function, the matrix multiplications would just compose into a single matrix for the whole net. I guess the floating-point rounding brought up above is supposed to serve as an activation function for this purpose, though I haven't looked into it.
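That collapse is easy to check numerically (a minimal numpy sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))   # first "layer" (no activation)
W2 = rng.standard_normal((2, 3))   # second "layer"
x = rng.standard_normal(4)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with the pre-composed matrix:
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True
```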


But can't you typically do the activation as a separate step? Where you first incorporate all of the inputs, then decide activation?


I'm not sure quite what you mean, but you can do the linear part of one layer all together, then the nonlinear part, etc.:

    a = x  # start with the input vector
    for weight, bias in layers:
        a = sigmoid(np.dot(weight, a) + bias)


So my question is how large is the linear part? Granted, I was hoping it was big enough to bring these algorithms into consideration. Looks like I'm right in that they are larger than 2x2. However, they are still not 8kx8k. :(
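For reference, assuming "these algorithms" means fast matrix multiplication in the Strassen family, here is a minimal sketch for square power-of-two matrices (the cutoff below which it falls back to ordinary multiplication is an arbitrary choice); real implementations only win above fairly large sizes, which is why the crossover question matters:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiplication for square power-of-two matrices (sketch)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to the ordinary algorithm on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight:
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```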


They can be pretty big! I don't know what's normal these days, though.




