Description
mlswarm provides two main classes:
- neuralnet - train neural networks
- function - minimize functions
For each, there are three available algorithms:
- swarm - swarm-like optimization algorithm
- swarm_derivfree - similar to the former but derivative-free, using Gaussian clouds
- gradient_descent - gradient descent optimization
For a neuralnet object there are three main methods (see examples):
- nn = neuralnet(...) - define neural network architecture and create neural network
- nn.init_cloud(N) - initialize the cloud with N particles
- nn.train(...) - define the training data and algorithm parameters, and start the algorithm
For a function object there are three main methods (see examples):
- func = function(lambda x: ...) - create a function object
- func.init_cloud(...) - define the array of initial particle positions
- func.minimize(...) - define the algorithm's parameters and start the algorithm
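A minimal usage sketch of the function workflow above, for a simple quadratic objective. The import path is assumed, and the call to minimize() is left elided (as in the list) because its parameters depend on the chosen algorithm and are package-specific:

```python
import numpy as np
from mlswarm import function   # assumed import path

# Objective to minimize: f(x) = sum(x_i^2), minimum at the origin
func = function(lambda x: np.sum(x**2))

# Cloud of 20 particles in 2 dimensions (array of initial particle positions)
func.init_cloud(np.random.randn(20, 2))

# Pick one of the algorithms ("swarm", "swarm_derivfree" or "gradient_descent")
# and its parameters here; the exact argument names are defined by the package.
func.minimize(...)
```

The neuralnet workflow follows the same three-step pattern: construct the network, call init_cloud(N), then train(...).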
Advantages
- It is possible to use linear and non-differentiable activation functions because there is no backpropagation (reference)
- (Multivariate) linear regression, which uses a linear activation function, can be done (see Example 1)
- Step activation functions can be used for binary classification (see Example 5). They are probably the simplest and fastest to compute (a small sketch follows this list)
- Since there is no backpropagation, we do not have to worry about gradients vanishing or exploding, so there is more freedom when defining the initial values of the weights. I should try other initialization schemes.
- Usually the mean of the cloud has a higher validation metric on the test set than the individual particles. I think that by training different particles and using their mean, we reduce the problem of overfitting, hence the better results.
- So far, better than other derivative-free methods for training Neural Networks: Genetic Algorithms, Simulated Annealing, hill climbing and random hill climbing, as available in the Python package mlrose
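To illustrate the step-activation point above, here is a small standalone numpy sketch (independent of the mlswarm API; the weights and bias are made up) showing how a step activation turns a linear combination of inputs into a binary prediction:

```python
import numpy as np

def step(z):
    # Heaviside step: 1 if the pre-activation is non-negative, else 0.
    # Not differentiable at 0, so unusable with backpropagation,
    # but fine for a derivative-free optimizer.
    return (z >= 0).astype(float)

# Made-up weights and bias for a 2-input binary classifier
w = np.array([1.0, 1.0])
b = -1.0

X = np.array([[0.2, 0.3],
              [0.9, 0.8]])        # two example inputs
predictions = step(X @ w + b)     # -> array([0., 1.])
```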
Notes
- Using np.var(params, axis=0) instead of np.mean([np.linalg.norm(param - params_mean)**2 for param in params]) to compute the cloud variance leads to significantly better results in Neural Networks (see the sketch after this list)
- Using kernel = kernel(-norm/(2*var)) leads to unstable results in Neural Networks
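A small sketch contrasting the two variance computations mentioned in the first note, assuming params is an (N, d) array holding the N particles' flattened parameter vectors (a toy example, not the package's internal code):

```python
import numpy as np

params = np.random.randn(30, 5)           # 30 particles, 5 parameters each (toy data)
params_mean = np.mean(params, axis=0)

# Per-parameter variance across the cloud: one value per coordinate, shape (5,)
var_per_param = np.var(params, axis=0)

# Single scalar: mean squared L2 distance of each particle to the cloud mean
var_scalar = np.mean([np.linalg.norm(param - params_mean)**2 for param in params])

# The scalar equals the sum (not the mean) of the per-parameter variances
assert np.isclose(var_scalar, var_per_param.sum())
```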
A few ideas
- Check the shape of the cloud over iterations, for instance by calculating the p-value of a normality test. From what I have seen, the test is usually positive across iterations and particles
- Train each layer of the network separately. In this view, a particle would contain several sub-particles representing the different network layers
- Force the clouds to have a certain variance in order to avoid getting stuck in local minima
- The L2 norm between particles increases with the dimension of the parameter space. This requires choosing a different kernel_a for different training sessions. Maybe use a heuristic to choose kernel_a?
sum_i(kernel(i,j)) lies in [1/N, N], ranging from no interaction between particles to maximum interaction between particles.
sum_i(kernel(i,j)) / N can be seen as a measure of the interaction intensity between particles.
Maybe choose kernel_a such that sum_i(kernel(i,j)) / N ~ 0.5, i.e. the interaction intensity is 50% (see the sketch below).
I have implemented this; it is used when you choose kernel_a="auto".
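A sketch of one way such a heuristic could be implemented, assuming a Gaussian kernel of the form kernel(i, j) = exp(-kernel_a * ||x_i - x_j||^2) (the kernel actually used by mlswarm may differ). It bisects on kernel_a until the average interaction intensity sum_i kernel(i, j) / N is close to 0.5:

```python
import numpy as np

def interaction_intensity(positions, kernel_a):
    # positions: (N, d) array of particle positions
    diffs = positions[:, None, :] - positions[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)          # pairwise squared L2 distances, (N, N)
    kernel = np.exp(-kernel_a * sq_dists)         # assumed Gaussian kernel
    # Average over j of sum_i kernel(i, j) / N
    return kernel.sum(axis=0).mean() / len(positions)

def choose_kernel_a(positions, target=0.5, iters=60):
    # Intensity is 1 at kernel_a = 0 and decreases toward 1/N as kernel_a grows,
    # so grow an upper bound first, then bisect on the monotone curve.
    lo, hi = 0.0, 1.0
    while interaction_intensity(positions, hi) > target:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if interaction_intensity(positions, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

cloud = np.random.randn(20, 10)      # 20 particles in a 10-dimensional parameter space
kernel_a = choose_kernel_a(cloud)
print(kernel_a, interaction_intensity(cloud, kernel_a))   # intensity should be close to 0.5
```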