This works well when you need one prediction per item. For example, given a batch of images, you may want your neural network to output, for each image, a probability for every possible label. In that case you give the softmax function a matrix in which each row corresponds to an image and each value in a row is a feature used to determine how the label probabilities are distributed.
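To make the row-wise behaviour concrete, here is a minimal pure-Python sketch of a softmax applied to each row of a matrix. It is not Theano's implementation; the helper names softmax_row and softmax_rows and the input scores are made up for illustration.

```python
import math

def softmax_row(row):
    """Softmax of one list of scores (hypothetical helper, not a Theano call)."""
    m = max(row)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(matrix):
    """Apply the softmax independently to every row of a 2D list."""
    return [softmax_row(row) for row in matrix]

# Two "images" with two candidate labels each (illustrative scores).
probs = softmax_rows([[1.0, 2.0], [1.0, 3.0]])
for row in probs:
    print(row, "sum =", sum(row))
```

Each output row is a separate probability distribution, which is why the rows never influence one another.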

But suppose that instead of a single label we want to generate a sentence or a list of labels. Now we need to provide a 3D tensor where the first dimension corresponds to sentences, the second dimension corresponds to word placeholders within each sentence, and the third dimension corresponds to the features used to determine the probabilities of the possible words that can fill each placeholder. In this case the softmax, which expects a matrix, will not accept the 3D tensor.

Assuming that each output probability depends exclusively on its own features and that features are not shared among different placeholders, the solution is to reshape your 3D tensor into a 2D tensor, apply your softmax, and then reshape the 2D result back into the original 3D shape. Here's an example:

Original 3D tensor:
    [ [ [111,112], [121,122] ], [ [211,212], [221,222] ] ]
Reshaped 2D tensor:
    [ [111,112], [121,122], [211,212], [221,222] ]
Applied softmax:
    [ [111',112'], [121',122'], [211',212'], [221',222'] ]
Reshaping back to 3D:
    [ [ [111',112'], [121',122'] ], [ [211',212'], [221',222'] ] ]
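The same three steps can be sketched without Theano, using nested Python lists. This is only an illustration of the idea under the assumption stated above (each placeholder's probabilities depend only on its own features); softmax_row and softmax_3d are hypothetical helper names, not library functions.

```python
import math

def softmax_row(row):
    """Softmax of one list of scores (hypothetical helper)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_3d(tensor):
    """Softmax over the last axis of a d1 x d2 x d3 nested list."""
    d1, d2 = len(tensor), len(tensor[0])
    # Step 1: reshape (d1, d2, d3) -> (d1*d2, d3) by concatenating the rows.
    flat = [row for sentence in tensor for row in sentence]
    # Step 2: apply the ordinary row-wise softmax to the 2D list.
    flat_probs = [softmax_row(row) for row in flat]
    # Step 3: reshape (d1*d2, d3) -> (d1, d2, d3) by regrouping d2 rows at a time.
    return [flat_probs[i * d2:(i + 1) * d2] for i in range(d1)]

x = [[[1, 2], [1, 3]], [[1, 4], [1, 5]]]
y = softmax_3d(x)
for sentence in y:
    print(sentence)
```

Because the rows are flattened and regrouped in the same order, every probability ends up in the position of the features it was computed from.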

It would be nice if Theano did this automatically behind the scenes. In the meantime, here is a snippet to help you:

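    import theano
    import theano.tensor as T

    X = T.tensor3()
    # Symbolic sizes of the three dimensions
    (d1,d2,d3) = X.shape
    # Flatten to 2D, apply softmax to each row, restore the 3D shape
    Y = T.nnet.softmax(X.reshape((d1*d2,d3))).reshape((d1,d2,d3))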

Here is what happens step by step:

    print(X.reshape((d1*d2,d3)).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
    >>> [[ 1. 2.]
         [ 1. 3.]
         [ 1. 4.]
         [ 1. 5.]]

    print(T.nnet.softmax(X.reshape((d1*d2,d3))).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
    >>> array([[ 0.26894142, 0.73105858],
               [ 0.11920292, 0.88079708],
               [ 0.04742587, 0.95257413],
               [ 0.01798621, 0.98201379]])

    print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))
    >>> [[[ 0.26894142 0.73105858]
          [ 0.11920292 0.88079708]]
         [[ 0.04742587 0.95257413]
          [ 0.01798621 0.98201379]]]

The categorical_crossentropy function can be used in the same way:

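    import theano
    import theano.tensor as T

    X = T.tensor3()
    S = T.imatrix()  # index of the correct choice for each placeholder
    (d1,d2,d3) = X.shape
    (e1,e2) = S.shape
    # Flatten X to 2D and S to 1D, compute one loss per row, restore the 2D shape
    Y = T.nnet.categorical_crossentropy(
            T.nnet.softmax(X.reshape((d1*d2,d3))),
            S.reshape((e1*e2,))
        ).reshape((d1,d2))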

And this is how it is used:

    print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]], S: [[0,1],[1,0]] }))
    >>> array([[ 1.31326169, 0.12692801],
               [ 0.04858735, 4.01814993]])

Here "S" selects which probability in each softmax the negative log is applied to. For example, the number in "S" corresponding to [1,2] is 0, so we choose the first probability that comes out of its softmax, 0.26894142, and apply the negative log: -ln(0.26894142) = 1.31326169. Similarly, the number in "S" corresponding to [1,4] is 1, so we choose the second probability, 0.95257413, and compute -ln(0.95257413) = 0.04858735.
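The selection step can be checked by hand with a few lines of plain Python. This sketch starts from the softmax probabilities printed earlier (rounded to eight places) rather than calling Theano, and the variable names are made up for illustration.

```python
import math

# Softmax outputs from the example, one row per placeholder (flattened 2D form).
probs = [
    [0.26894142, 0.73105858],
    [0.11920292, 0.88079708],
    [0.04742587, 0.95257413],
    [0.01798621, 0.98201379],
]
# S = [[0,1],[1,0]] flattened to one target index per row.
targets = [0, 1, 1, 0]

# Pick the probability at the target index in each row and take -ln of it.
losses = [-math.log(row[t]) for row, t in zip(probs, targets)]

# Regroup into the original 2 x 2 layout (sentences x placeholders).
d2 = 2
loss_matrix = [losses[i:i + d2] for i in range(0, len(losses), d2)]
for row in loss_matrix:
    print(row)
```

Up to rounding of the input probabilities, the printed values should match the four entries of the Theano output.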