Answer by Simon Cooke:

It's just that it's simple. If you're dealing with a scalar value that can be positive or negative, then finding the difference between that and another value can be difficult – you have to take into account the sign of a and b, and futz with the values to get the difference in the right form so that it's a meaningful absolute difference.

Or… you can just square the difference, and hey presto, you have a positive value that tells you how far away you are from your target. Now sure, you do lose some information in doing so – you lose whether you're over or under, but if you're iterating on a solution, you most likely don't care – all you want to do is find out how close to the ballpark you are.
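To make that concrete, here's a minimal sketch in Python. The function name `squared_error` and the sample numbers are made up for illustration; they aren't from any particular library.

```python
def squared_error(predicted, target):
    """Return the squared difference, which is never negative."""
    diff = predicted - target
    return diff * diff

# Overshooting and undershooting by the same amount score identically,
# which is exactly the over/under information you give up.
over = squared_error(7.0, 5.0)    # 2 above the target
under = squared_error(3.0, 5.0)   # 2 below the target

# Values with opposite signs need no special handling.
mixed = squared_error(-1.0, 2.0)

print(over, under, mixed)
```

Both the overshoot and the undershoot come out as the same positive number, and the mixed-sign case needs no sign juggling at all.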

Even better, because you're squaring the difference, this value automatically works as a fitness value for your solution – if you're off anywhere, squaring magnifies the big errors far more than the small ones (being off by 10 costs 100 times as much as being off by 1), which lets you progressively home in on a solution more quickly.

So least-squares is pretty much the simplest thing you can come up with that generalizes a fitness function across an entire dataset, without any special-case messing around at any point, and that also magnifies errors. Any function which is continuous and amplifies differences will also work, but squaring is the easiest.
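As a rough sketch of what "across an entire dataset" means, you just add up the squared differences for every point. The helper `sum_of_squares`, the toy data, and the three candidate lines below are all invented for the example.

```python
def sum_of_squares(predict, xs, ys):
    """Total squared error of a candidate function over a whole dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Toy data lying exactly on y = 2x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

exact = sum_of_squares(lambda x: 2.0 * x + 1.0, xs, ys)  # the true line
close = sum_of_squares(lambda x: 2.0 * x + 1.5, xs, ys)  # off by 0.5 everywhere
far = sum_of_squares(lambda x: 2.0 * x + 3.0, xs, ys)    # off by 2 everywhere

print(exact, close, far)
```

The candidate that is four times further off at each point ends up with sixteen times the score, so the worse guess is penalized much more heavily than a linear measure would penalize it.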

If you wanted to, you could make it a linear function, and just take the absolute value of the difference – but that'd be slower to iterate to a solution. (You could also square the difference and then take the square root, which gives you the same value as the absolute difference.)
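Here's one way to see the "slower to iterate" point, as a sketch only. It assumes plain gradient descent with a fixed step size, which is my assumption and not something the answer specifies; `descend`, `sign`, and the numbers are hypothetical.

```python
import math

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def descend(gradient, start, target, step=0.1, iterations=50):
    """Fixed-step gradient descent on a one-parameter error function."""
    x = start
    for _ in range(iterations):
        x -= step * gradient(x - target)
    return x

target = 1.05

# Squared error: the gradient 2 * (x - target) shrinks as you get closer,
# so the steps get smaller and the guess settles onto the target.
x_squared = descend(lambda d: 2.0 * d, 0.0, target)

# Absolute error: the gradient is always -1 or +1, so the step never
# shrinks and the guess ends up hopping back and forth around the target.
x_absolute = descend(lambda d: sign(d), 0.0, target)

# Squaring and then square-rooting gives back the absolute difference.
same = math.sqrt((-3.0) ** 2) == abs(-3.0)

print(abs(x_squared - target), abs(x_absolute - target), same)
```

With these particular settings the squared version lands very close to the target, while the absolute-value version stays stuck about half a step away. A shrinking step size or a different optimizer would change that picture, so treat it as an illustration and not a general rule.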

Is minimizing the Least-Squares function an intuitive thing to do in optimization?