Difficulty showing the obvious?
If I want to build a machine learning algorithm for data that falls into three distinct modes, the received wisdom, at least as I've received it, is that you should not train a single model, e.g. an artificial neural network (ANN), on all the data, but instead train one ANN on each of the three modes. This should help with training and generalisation, since the examples each ANN encounters will be more similar.
This seems straightforward, but I have trouble showing it. For instance, AFAIK the number of examples needed for a given performance of an ANN scales with the number of weights, so if the number of weights can be reduced to one third, you'd get the same performance using only a third of the training examples.
An ANN with 5 inputs, 12 hidden nodes and 1 output has 72 weights (5*12 + 12*1, not counting biases). I use 90 examples, with 30 examples of each mode, to train it to a given performance.
I then make three smaller ANNs, each with 5 inputs, 4 hidden nodes and 1 output. Each net now has 24 weights (a third of 72) and so only requires 30 examples to give the same performance. But since I'm only feeding examples of one mode to each ANN, I still need 90 examples in total to train all three correctly! And this assumes that the modes didn't "share" hidden nodes in the original ANN, so that splitting the 12 hidden nodes up into 4*3 makes sense.
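To make the arithmetic above concrete, here is a minimal Python sketch. It assumes a single hidden layer, counts biases only if asked, and takes the linear "examples scale with weights" rule at face value, calibrated to the 90 examples for 72 weights from the question. The names count_weights, examples_needed and EXAMPLES_PER_WEIGHT are made up for illustration and don't come from any library.

```python
def count_weights(n_inputs, n_hidden, n_outputs, include_biases=False):
    """Count the weights of a fully connected net with one hidden layer."""
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    if include_biases:
        # One bias per hidden node and per output node.
        weights += n_hidden + n_outputs
    return weights


# Assumption taken from the question: the number of examples needed scales
# linearly with the number of weights, and 72 weights need 90 examples.
EXAMPLES_PER_WEIGHT = 90 / 72


def examples_needed(n_weights):
    """Examples needed under the assumed linear scaling rule."""
    return n_weights * EXAMPLES_PER_WEIGHT


big_net = count_weights(5, 12, 1)    # one ANN trained on all three modes
small_net = count_weights(5, 4, 1)   # one ANN per mode

examples_big = examples_needed(big_net)
examples_three_small = 3 * examples_needed(small_net)

print("weights, big net:", big_net)
print("weights, each small net:", small_net)
print("examples, big net:", examples_big)
print("examples, three small nets in total:", examples_three_small)
```

Under these assumptions the two totals come out the same, which is exactly the puzzle: the linear rule by itself gives no advantage to splitting. Note also that the "one third" only holds exactly without biases; with include_biases=True the counts become 85 and 29.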
Should I ditch this received wisdom, or have I simply misunderstood when it should be applied?

