Rating ranting madness 
Let me introduce you to two of my friends - reliability and validity. When it comes to judging rating systems, both play a major role.
Let's begin with reliability. Reliability could be described as "repeatability". Let's say you've got a thermometer and put it into 50° water, and it shows you 50°. You clean it and test it again - again it shows 50°, that's great! Why is it great? Because it could just as well have shown 47° on the first try and 52° on the second.
Let's say you've got two thermometers, and run the test with both. Both show 50°. Great! That's reliability!
Now if only one weren't made for Celsius and the other for Fahrenheit. Seems like one is broken.
There are several kinds of reliability. Reliability is the exactness of measurement. For the ratings discussed here, the important aspects are in subject reliability (one thermometer on different occasions) and between subject reliability (comparability of different measures).
Why are they important? As for in subject reliability: without it you've got a measurement error in the consistency of ratings. That is, you rate something as an 8 one time and as a 9 another time. Even trained raters can only achieve a consistency of 0.9 at most on a nine point scale. Most people? Let's say a seven point scale at most is recommended.
As for between subject reliability: without it you can't compare the outputs. That's like when a 6/10 is an above average rating for one person, and a 7/10 is a below average rating for another.
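To make the 6/10 versus 7/10 point concrete, here is a minimal Python sketch. The two rating histories are made-up numbers and the helper name `relative_to_own_mean` is purely illustrative; it only shows that the same-looking numbers mean different things once you look at each rater's own average.

```python
from statistics import mean

# Hypothetical rating histories for two raters (made-up numbers).
history = {
    "strict_rater": [3, 4, 5, 4, 6],
    "generous_rater": [8, 9, 8, 9, 8],
}

def relative_to_own_mean(rater, rating):
    """Return how far a rating lies above (+) or below (-) the rater's own average."""
    return rating - mean(history[rater])

# A 6/10 from the strict rater versus a 7/10 from the generous rater.
strict_six = relative_to_own_mean("strict_rater", 6)
generous_seven = relative_to_own_mean("generous_rater", 7)

print(f"strict rater's 6/10:   {strict_six:+.1f} relative to own average")
print(f"generous rater's 7/10: {generous_seven:+.1f} relative to own average")
```

With these assumed histories the 6/10 sits above its rater's average and the 7/10 below, so comparing the raw numbers would point in the wrong direction.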
Reliability is a purely technical aspect: the lower the reliability, the higher the error in measurement. Nevertheless it should be pointed out that reliability needs the ability to measure. Sounds obvious? Well, from a purely mathematical position a broken thermometer whose scale is stuck at 57° will be 100% in subject reliable, but it can't show the difference between cold and hot water.
Is high reliability enough? Well, it's nice if your thermometer always shows the same temperature when put into the same water. It's not so nice if it's actually measuring air pressure. Validity is first and foremost the question of whether the measurement measures what it ought to measure. Intertwined with this is the question of to what extent the results can be generalized.
Reliability is necessary but not sufficient for validity. Mathematically, validity can't be higher than the square root of reliability, because the lower the reliability, the higher the error. On the other hand, something can be as reliable as possible - if it doesn't measure what you want to measure, it's no good.
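The square root bound is easy to play with. The following Python sketch assumes reliability is expressed as a coefficient between 0 and 1; the function name `max_validity` is just an illustrative label, not a standard term.

```python
import math

def max_validity(reliability):
    """Upper bound on the validity coefficient for a reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(reliability)

# A few example reliabilities and the ceiling each puts on validity.
for r in (0.9, 0.5, 0.25):
    print(f"reliability {r:.2f} -> validity at most {max_validity(r):.2f}")
```

Keep in mind this is only a ceiling: a measurement with reliability 0.9 can still have a validity of zero if it measures the wrong thing.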
Validity also equals comparability. If things are valid and about the same area, they can be compared - although for completeness' sake it should be mentioned that they often need to be transformed onto the same scale (e.g. you can compare the results of thermometers made for °C and °F, you only need to know how to convert degrees of one into the other. The thermometers mentioned above, both showing 50° in the same water, are therefore not reliable, because one of them would have to show an entirely different number).
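For the thermometer case the transformation is the usual Celsius/Fahrenheit conversion. A short Python sketch (the function names are just illustrative):

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def fahrenheit_to_celsius(fahrenheit):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9

# Water at 50 °C should read 122 °F on a working Fahrenheit thermometer.
print(celsius_to_fahrenheit(50))   # 122.0
# A Fahrenheit thermometer showing 50 would mean the water is only 10 °C.
print(fahrenheit_to_celsius(50))   # 10.0
```

So if both thermometers show 50 in the same water, the two readings disagree by a wide margin once they are put on the same scale.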
Now that you know my two friends, let's take a look at the actual ratings mostly used in our context:
- At first look it seems that by using a scale of 1 to 10 we already have too wide a range, diminishing in subject reliability.
- A second look shows that only a two to three point system is effectively used: 10 and 9. Eight and lower nowadays usually gets deleted and the rater banned.
=> Unfortunately this creates another issue, that of a ceiling effect. It's like using a thermometer made for Kelvin to measure water temperature when the thermometer only goes up to 200 K. Whatever water you measure (whether it's 5°C or 95°C doesn't matter), the thermometer will always show 200 K. While too many choices create issues, the inability to differentiate does, too!
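The ceiling effect from the thermometer example can be sketched in a few lines of Python. The 200 K limit comes from the example above; the function name `reading_kelvin` is only an illustrative choice.

```python
CEILING_K = 200.0  # highest value the thermometer in the example can display

def reading_kelvin(true_celsius):
    """Return what the capped thermometer shows for a given true temperature."""
    true_kelvin = true_celsius + 273.15
    # Anything above the ceiling is displayed as the ceiling itself.
    return min(true_kelvin, CEILING_K)

cold = reading_kelvin(5)    # true value is 278.15 K
hot = reading_kelvin(95)    # true value is 368.15 K

print(cold, hot)  # both readings are stuck at the ceiling
```

The two water samples differ by 90 degrees, yet the readings are identical, which is the same thing that happens when every mod ends up at 9 or 10.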
Alright, so reliability is out of the window. And with it validity.
But is it? If you take a closer look, then no: the currently often used system does have a value. Just not the one you're expecting:
- The nominal ratings are useless, the numbers meaningless. This is not only caused by the limited range, but also because the people rating use wildly different criteria. One rates a mod a 10 because the preview screenshots look nice, another because he's played the mod thoroughly, enjoyed it and thinks it's the best mod of all time. And a third because the mod has no major technical flaw, without intending to rate the mod's content at all. You've got neither in subject nor between subject reliability, therefore the validity of the nominal rating is nil.
- What is reliable, though, is the number of ratings. Not how they rate, but that they rate. See, the popular mod sites are quite good at keeping track of the number of votes, and by eliminating everything below 8 the votes usually even have about the same meaning.
Now what meaning does the number of ratings have? That's the question of validity. The answer is, in short, popularity. Yet there is a correlation between mod quality and popularity. While this correlation is nowhere near perfect, if you look through several mods at *most popular download location* you'll notice cases where Mod A has a rating of 10 and Mod B a rating of 9.71, but Mod A has eleven votes and Mod B eight thousand. Which mod is likely better? You know the answer.
Of course validity is diminished by the fact that some mods are simply better advertised than others. Here another factor can be used: the number of votes versus the number of downloads, in light of how long the mod has been available. While the result is certainly also far from optimal, it can be used as a good hint of how good a mod is.
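One possible way to read "votes versus downloads in light of time" is sketched below in Python. This is only my illustration, not an established formula: the two ratios are assumptions, and apart from the vote counts of eleven and eight thousand all numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class ModStats:
    name: str
    votes: int
    downloads: int
    days_available: int

def votes_per_download(mod):
    """Share of downloaders who bothered to vote (assumed indicator)."""
    return mod.votes / mod.downloads if mod.downloads else 0.0

def votes_per_day(mod):
    """Votes collected per day of availability (assumed indicator)."""
    return mod.votes / mod.days_available if mod.days_available else 0.0

# Hypothetical download counts and ages; only the vote counts echo the text.
mods = [
    ModStats("Mod A", votes=11, downloads=400, days_available=30),
    ModStats("Mod B", votes=8000, downloads=160000, days_available=800),
]

for mod in mods:
    print(f"{mod.name}: {votes_per_download(mod):.4f} votes/download, "
          f"{votes_per_day(mod):.2f} votes/day")
```

Neither ratio corrects for advertising on its own, so treat them as hints to be read side by side rather than as a score.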
That's what the current system offers.
Needless to say, I find the current system horribly broken and refuse to take part in it. Why? Not for statistical reasons, although getting some validity out of it requires some unconventional thinking. Ethical reasons play the major role. Currently people are forced to vote high or risk their account. I've seen people getting banned for pointing out that a mod includes "recompile all" garbage, and for pointing out other existing major issues. Now, while I'm pretty confident that my mods don't have major issues, I'm happy when issues are pointed out so I can fix them. I don't want to see people punished for it. And I know that my mods have some controversial aspects. If someone doesn't like them, should he be punished? Of course not. Unfortunately the widely used current system is likely to result in both.
(Excursus: Of course there's another issue, which is legitimate versus illegitimate complaints. I too tend to react with annoyance to illegitimate complaints, like those people whining about LAME's spell sorting. Legitimate complaints are a whole different matter, though, and I believe that the action taken against illegitimate complaints has also greatly damaged the possibility of even mentioning legitimate ones.
The moderators in this forum have so far, in my perception, shown far superior handling in differentiating between legitimate and illegitimate complaints. Yet it has to be pointed out that they don't need to deal with a heavily distorted rating system on top of written messages.)
Are more valid rating systems possible? I don't think anything will change anywhere, but yes, I do see two possible solutions:
The first is to replace the rating system with a "mod kudos" system. In effect it would work like the current system (in which only top marks can effectively be used), except without the hypocrisy and the danger of someone accidentally downrating a mod by giving it an 8/10. Top lists would then be generated from the number of kudos within time X.
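A top list by kudos within time X could look roughly like this in Python. It is a sketch under assumptions: kudos are stored as (mod name, timestamp) pairs, and the function name `top_mods`, the sample events and the 30 day window are all made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_mods(kudos_events, now, window, limit=10):
    """Rank mods by the number of kudos received within the given time window.

    kudos_events is an iterable of (mod_name, timestamp) pairs.
    """
    cutoff = now - window
    counts = Counter(
        name for name, timestamp in kudos_events if cutoff <= timestamp <= now
    )
    return counts.most_common(limit)

# Hypothetical kudos events.
now = datetime(2024, 1, 31)
events = [
    ("Mod A", datetime(2024, 1, 30)),
    ("Mod B", datetime(2024, 1, 29)),
    ("Mod B", datetime(2024, 1, 25)),
    ("Mod A", datetime(2023, 11, 1)),  # too old for a 30 day window
]

ranking = top_mods(events, now, window=timedelta(days=30))
print(ranking)  # [('Mod B', 2), ('Mod A', 1)]
```

There is no number to misuse here: a user either gives kudos or doesn't, and the list only counts how many did so recently.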
A second solution with actual ratings is also possible, which can remove the statistical problems currently created by the ceiling effect (through use of statistical methods other than averages). Unfortunately I don't see any possibility of it becoming reality. Every rating system worth its name needs to differentiate between mods, and I don't see any chance for a nondistorted rating system today. A nondistorted and therefore valid rating system requires that the full spectrum can be used and is used. I don't see this happening at all.
So yeah, if you ask me, some major mod sites should switch to a mod kudos system instead of continuing to use the current methods, which are (when it comes to their intent) completely broken and effectively working as a kudos system either way.