An Attempt at Creating 2yo Ratings
In recent articles I have looked at a very simple ratings method for all-age handicap races which, on initial testing, seems to have shown more positives than negatives, writes Dave Renham. I hope and expect to write further about these ratings at a later date, but need more time to do some further detailed research. This will take several weeks, probably a couple of months.
In this somewhat related article, I would like to share with you the process I went through when trying to create ratings for two-year-old (2yo) races. My plan was to stick to a similar methodology which in essence was:
- a) find what I thought were key factors/variables;
- b) use PRB (Percentage of Rivals Beaten) data once more as my metric;
- c) combine the PRB figures in the same way as the all-age handicap ratings by simply adding up the relevant scores.
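To make steps b) and c) concrete, here is a minimal Python sketch. It assumes the usual definition of PRB for a single run, `(runners - position) / (runners - 1)`, which this article does not spell out. The function names and the example factor scores are hypothetical and only there for illustration.

```python
def prb(position: int, runners: int) -> float:
    """Percentage of Rivals Beaten for one run, on a 0-1 scale.

    Assumed formula: (runners - position) / (runners - 1).
    A winner scores 1.0 and the last horse home scores 0.0.
    """
    if runners < 2:
        raise ValueError("PRB needs at least two runners")
    if not 1 <= position <= runners:
        raise ValueError("position must be between 1 and runners")
    return (runners - position) / (runners - 1)


def rating(factor_scores: dict[str, float]) -> float:
    """Combine factor PRB scores by simple addition (step c)."""
    return sum(factor_scores.values())


# Hypothetical factor scores for one runner, not real figures.
example = {"trainer": 0.58, "sire": 0.55, "lto_position": 0.61}
total = rating(example)
```

Because the scores are simply added, every factor carries equal weight, which keeps the method easy to reproduce in a spreadsheet.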
There are a number of different types of 2yo races such as maidens, novice events, Group/Listed races (which are all non-handicaps) and nurseries (handicaps). My idea was to try to rate the maiden and novice races. To me these are quite similar types of race and hence I hoped that one cap could be worn by both. Of course that would not necessarily be the case, but even if the ratings worked well for one of the two then I would have achieved something.
To begin with, let me discuss factors I considered for use. Here was my ‘longlist’:
- Trainer record – in 2yo maidens/novices
- Sire stats – in 2yo maidens/novices
- Debut course
- Horse Sex – colt, gelding or filly
- Horse purchase price
- Most Recent form – Last time out (LTO) finishing position
- Recent market data – LTO price
- Fitness – days since last race
- Draw
The eagle-eyed among regular readers will note that the last four factors are ones I used in my original ratings for all-age handicaps.
From this starting point I felt I needed to trim the list down, for two reasons. Firstly, as I mentioned in my very first ratings article, when creating ratings I prefer not to overcomplicate things. Secondly, some of the above factors would cause problems for one reason or another.
The draw was the first to be discarded. In all of the articles I have written on the draw in the past, I have mentioned that draw bias works best in handicap races. Hence, although the draw may affect some 2yo races at certain courses, I felt it was not a reliable enough factor to use here. Next to go was purchase price as I had no easy way to source it, or indeed back check it on past results. Further, many horses are home-bred and therefore never go through a sales ring. I felt it had importance, which is why it made the longlist, and I wished I had some data I could ‘crunch’ to see how important it actually was, but I felt it was a no-go for these ratings.
Fitness using the days since last run metric was the third factor I decided to discard. My main reasoning here is that the advantage of a quick return, that tends to happen in older age handicaps, is not replicated for 2yo runners. I looked briefly at some win and placed stats which were very even across the days ranges, so I felt it was unlikely that the more accurate PRB figures would really give a wide enough spread of figures. I felt it wasn’t worth the hours of data gathering and sorting if the figures were likely to be almost identical across the board. One makes decisions like this all the time when delving into horse racing research. Of course sometimes we make incorrect ones but, with experience, decision making improves.
That left me with six factors/variables so let’s look at each in a little more depth.
1 Trainer record – I am not someone who bets often in 2yo races. Occasionally I will if I spot what I feel is a good betting opportunity. However, my main bets that involve 2yos occur when I play the Tote Placepot. Most meetings have at least one 2yo race in their first six so I have to use some methodology to choose which juvenile runners I am going to put into my ‘pot’. Trainer information is always my first call.
Many trainers do follow a similar path year in, year out; they generally stick to the same training methods, know which races to target, etc. Now it should be noted (albeit it is fairly obvious) that each year trainers have a completely new ‘string’ of 2yos, so variances in overall performance are going to happen from year to year. However, when we think about the bigger stables they tend to keep many of the same owners, and these are likely to be purchasing similar animals to what they have done in the past. Hence, past trainer 2yo data is usually quite a good guide to future performance. The graph below offers a real life illustration through the record of Charlie Appleby in 2yo maidens/novice races over the past four full years:
These figures are very similar from season to season and, as I am writing this, his current stats for 2023 are in the same ballpark – 31% win strike rate and 55% each way strike rate.
So how best to utilise past 2yo trainer data was my main consideration as there were different stats I could potentially use. One option would be to use PRB figures calculated from all 2yo maiden and novice events for each individual trainer. However, my concern with that was that the number of runs that a 2yo has is usually extremely important. This is a graph I shared in a previous article written in April when examining 2yos on their second starts:
As we can see there is a significant difference in 2yo performance on debut compared to second starts. Such differences would be replicated when comparing the relevant PRB figures. Not only that, this graph is taking all 2yo runners into account and, as you can imagine, some specific trainers have even more acute differences. For example, and once again using data from 2017 to 2022, Michael Dods had a 2yo debut win strike rate of 5.3%, whereas his second starters won over 16% of the time. William Haggas’s 2yo debutants scored less than 12% of the time, but on second start won 27% of their races. These are just two examples showing one potential pitfall of using overall 2yo trainer data to produce a trainer rating score.
In retrospect, this was the point at which the alarm bells should have been ringing about how complicated just creating the trainer part of these ratings would be. Still, I thought that using previous runs would almost certainly be the way I would want to go, and that trainer stats would make the final ‘cut’. Before digging any further, though, I wanted to look at the other five factors.
2 Sire Stats – sire stats are often an important part of the 2yo betting picture due to the limited past run data most juveniles have. In some cases, especially early season, all the runners in a 2yo race will be making their debut. Hence we have no past form to go on, so we have to look elsewhere. Sires are the fathers of the respective horses and can have a significant influence on their offspring. When we dig deeper we find that the offspring of a good proportion of sires have clear traits or preferences. These may be going/ground related, distance related, age related, experience related, etc.
Having essentially decided to use previous starts as a key factor in determining the trainer rating PRB score, it would be difficult to do the same for sire stats, as this would potentially overlap somewhat. It is not as bad as using LTO position and Beaten distance LTO as two factors in a system as they are virtually the same metric. However, the improvement from debut to second start for sires would mirror trainer improvement to some extent.
Therefore for sire stats I felt a distance metric made more sense: splitting the 2yo sire PRB data into two, obtaining figures for sprints (5-6f) and for longer 2yo races (7f or more). The majority of 2yo races are contested at a mile or less so this seemed logical to me. To give an example of a sire whose 2yo distance stats differ across these two distance ranges, let me share the non-handicap 2yo win stats for Kingman. In 5-6f races his strike rate has been 12.8%, at 7f or longer this increases to 22.2%.
Interestingly, though, when I calculated PRB figures for Kingman they were closer than I had expected. His progeny’s 2yo PRB for 7f+ was 0.64 compared with 0.60 for 5-6f. This comparison helps to highlight why I believe PRB figures are the most accurate of all the statistical metrics that compare performance. Win stats are a good barometer, but PRB figures are much better because they effectively ‘grade’ each run, rather than just recording whether the horse won or didn’t, or placed or didn’t.
No Nay Never is another sire whose 2yo offspring show a distance bias. At sprint distances his 2yo non-handicappers score over 19% of the time, at 7f or longer this drops to 13.6%. The PRB figures for No Nay Never this time do underline the strength of this bias as the sprint figure stands at 0.63, while the 7f+ one is much lower at 0.53.
Sire stats using this distance metric looked a good option to use in the ratings.
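The contrast between win strike rate and PRB can be sketched in a few lines of Python. This is only an illustration: the runs below are made up, the sire name is hypothetical, PRB is assumed to be `(runners - position) / (runners - 1)`, and treating anything under 7f as a sprint is my own assumption for distances between 6f and 7f.

```python
from collections import defaultdict


def distance_band(furlongs: float) -> str:
    """Split 2yo races into sprints (5-6f) and longer races (7f or more)."""
    return "5-6f" if furlongs < 7 else "7f+"


def sire_band_stats(runs):
    """Return {(sire, band): (win strike rate, mean PRB)}.

    Each run is a tuple: (sire, furlongs, position, runners).
    """
    grouped = defaultdict(list)
    for sire, furlongs, position, runners in runs:
        grouped[(sire, distance_band(furlongs))].append((position, runners))

    stats = {}
    for key, results in grouped.items():
        wins = sum(1 for position, _ in results if position == 1)
        mean_prb = sum((n - p) / (n - 1) for p, n in results) / len(results)
        stats[key] = (wins / len(results), mean_prb)
    return stats


# Made-up runs for a hypothetical sire, not real results.
runs = [
    ("Sire A", 5, 1, 9),   # won a sprint
    ("Sire A", 6, 5, 9),   # mid-field in a sprint
    ("Sire A", 7, 2, 11),  # close second over further
    ("Sire A", 8, 3, 11),  # third over further
]
stats = sire_band_stats(runs)
```

In this toy data the sire has no winners at 7f+ yet a higher mean PRB there than in sprints, because near misses still earn credit under PRB while counting for nothing in a win strike rate.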
3 Debut course – this is something I have researched in the past and the track at which a horse makes its debut can be a factor in how it subsequently performs. It particularly affects the second career start, as we can see if we compare the PRB figures for second-starting 2yos that made their debut at either Ascot, Newmarket, Redcar or Ripon.
The importance of the debut course becomes less of a factor the more runs a 2yo has, but it still can have a bearing, so I would have to separate out the number of runs since debut in some way or other. Alarm bells were ringing this time as this factor is definitely going to be time consuming from a data gathering aspect, as I would need to collect the LTO course data one course at a time and then combine the number of career runs with each course. That could mean anything between 100 and 200 separate data ‘dumps’ into Excel as well as adding extra columns and data to it. Ouch. However, at this point I was undeterred, as there have been times in the past when I have had to perform an enormous amount of data collection to write an article or series of articles. Also, I felt this factor was really important and would improve the ratings if it was included.
Having slightly buried my head in the sand regarding the enormity of this project, I now considered a question: does factoring in debut course combined with past career runs conflict or overlap with the trainer data idea, which was going to use past runs too? I guessed it would to a small extent, but I was open-minded enough not to dismiss using this metric because of that slight concern. Clearly trainers have their preferred starting points for 2yos in terms of races and courses for debut runs, but individual course debut data combines all trainers and hence any significant overlap is far from clear. I was fairly confident – hopeful at least! – that the two factors would not conflict enough to make the ratings biased in any way.
Before moving on, I started to think about another problem that I had known would be a real issue in terms of 2yo ratings. What to do if the horse was making its debut? They have no past race data to work with; no debut course stats. What PRB rating could be assigned to those runners? I had several things to ponder, but decided to move onto the next factor as I felt it would at least have fewer issues.
4 Horse Sex – the sex of a horse has relevance and in 2yo races there are essentially three types of runners – colts (entire males), geldings (males who have been gelded) and fillies (females). I did some initial number crunching as this data collection was easy to do and not time consuming. I compared their PRB figures based on about 25,000 2yo runs in maidens and novices. Here are the findings:
As we can see colts have the best record, followed by fillies and finally geldings. The majority of 2yo runners are colts and fillies (around 87% of all runners combined), with geldings making up the remaining 13%.
These stats look promising from a ratings perspective, and I had some data collection completed!
Onto the last two factors now, both of which I used last time.
5 Most Recent form – LTO finishing position is a good barometer of most recent form and it seemed to work well in the handicap ratings. However, I would have the same issue as with the debut course stats when it came to horses making their debut. What PRB figure would I use?
6 LTO price – LTO price also seemed to work well with the handicap ratings but again the question was what to do about debutants?
*
At this point I was feeling happy that potentially I had six factors to combine to create the ratings. On the flip side, there were a myriad of issues. Perhaps the biggest was the problem of 2yos that were making their debut. These runners would not have PRB figures for three of the six factors (LTO course, LTO position, LTO price). I needed to consider the options.
Option 1 – Use just one of the three LTO factors, giving debut runners a standard PRB figure based on all debut run performances.
Option 2 – Combine the three LTO factors, giving debut runners a standard PRB figure based on all debut run performances, and divide the score by three. This would mean all three factors had some relevance (in essence 1/3 of a rating factor).
Option 3 – Use the ratings only on 2yo races where all the runners had previously run at least once.
Of the three I felt the last option made the most sense as I really wanted to combine all six factors if I could. Based on a look at race data going back to 2019, 33% of 2yo races involved horses that all had run at least once previously. This would still provide around 350 races a year where the ratings could be employed. Added to that I had the facility to pull out all these races.
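For anyone with results in a programmable form, picking out the races that qualify under Option 3 is a simple filter. The Python sketch below is hypothetical: the record layout and field names such as `previous_runs` are assumptions, and the two races are invented.

```python
def no_debutants(race) -> bool:
    """True if every runner in the race has run at least once before."""
    return all(runner["previous_runs"] >= 1 for runner in race["runners"])


# Hypothetical race records; the field names are assumptions.
races = [
    {"id": "R1", "runners": [{"name": "A", "previous_runs": 1},
                             {"name": "B", "previous_runs": 3}]},
    {"id": "R2", "runners": [{"name": "C", "previous_runs": 0},
                             {"name": "D", "previous_runs": 2}]},
]

# Races where the ratings could be applied under Option 3.
rateable = [race["id"] for race in races if no_debutants(race)]

# Share of races that qualify (the 33% figure above is this kind of share).
share = len(rateable) / len(races)
```

A single debutant is enough to rule a race out, which is why only about a third of races qualify.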
Having decided that was the preferred way forward, thoughts turned to the enormity of the data collection. As a researcher one is limited by the amount of data one has, or can access. We are also limited to a great extent by our computer skills. If you are able to write and use sophisticated computer programs for example, this gives you a huge advantage over those who cannot. If you have a vast database of results with every single type of variable/factor you can think of you also have a big advantage. Time is such a precious commodity and, without either of the above, my constant issue was the hours required for complicated or detailed research.
My expertise in terms of data number crunching is purely Microsoft Excel-based. I am proficient using Excel and use certain time-saving tricks such as cell formulae, pivot tables, functions like VLOOKUP, and so on. However, I cannot write VBA code for macros, which impinges greatly on what I am able to do in terms of quantity and within certain time frames.
Back to the problem in hand. It was time to look at each factor again and try to work out how much work / hours would be involved with each one.
- Trainer record – the advantage I have from a research perspective in terms of trainer data is that when I export thousands of results, the trainer of each horse is part of the data set. Hence as a rule trainer data collection/manipulation is not as time consuming as many other things. On the negative side I would be looking at probably three or four separate data sets which I would need to combine and sort. Once that was done I could create the necessary formulae to calculate individual PRB figures and, once those were added for all runners, I could use a pivot table to help calculate each trainer’s individual PRB figure. At least I didn’t have to worry about getting the debut stats; that would save me a little time.
The ideal plan would be to have PRB trainer figures for horses that have raced once, raced twice, and then group those who have raced three or more times together.
That part of the research was not too daunting; definitely doable. It would take several hours probably, but not several days!
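For readers who prefer code to spreadsheets, the pivot table step might look something like the Python sketch below. It is not the process I used (that was Excel). The row layout, the trainer name and the results are all invented, and PRB is assumed to be `(runners - position) / (runners - 1)`.

```python
from collections import defaultdict


def runs_bucket(previous_runs: int) -> str:
    """Group runners as raced once, raced twice, or raced three or more times."""
    if previous_runs < 1:
        raise ValueError("debutants are not covered by these figures")
    if previous_runs == 1:
        return "1"
    if previous_runs == 2:
        return "2"
    return "3+"


def trainer_prb_table(rows):
    """Mean PRB per (trainer, bucket), rounded to two decimal places.

    Each row is a tuple: (trainer, previous_runs, position, runners).
    """
    totals = defaultdict(lambda: [0.0, 0])  # running PRB sum and run count
    for trainer, previous_runs, position, runners in rows:
        cell = totals[(trainer, runs_bucket(previous_runs))]
        cell[0] += (runners - position) / (runners - 1)
        cell[1] += 1
    return {key: round(total / count, 2)
            for key, (total, count) in totals.items()}


# Invented results for a hypothetical trainer.
rows = [
    ("Trainer X", 1, 4, 10),
    ("Trainer X", 1, 1, 10),
    ("Trainer X", 3, 2, 6),
    ("Trainer X", 5, 6, 6),
]
table = trainer_prb_table(rows)
```

Each key in `table` corresponds to one cell of the pivot table: a trainer in one of the three run-count groups.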
- Sire stats – when I started thinking about how ‘easy’ it would be pulling and then crunching the sire PRB data for the two distance ranges, I suddenly realised that a trick I often use with sire data collection would not work for PRB figures. I could pull sire data relatively quickly if I was using win strike rates or each way strike rates. BUT not for PRB figures. It suddenly dawned on me that I would have to go one at a time, sire by sire. If that wasn’t bad enough from a time perspective, I also realised that even once I’d done that I’d need to find a way of ‘marrying’ the sire data with the trainer data. That would be even more time consuming and rather fiddly to do.
I thought then, OK I could ditch the sire stats part. I’ll still have five factors to use. The other ratings worked well with five, and even with four when I rated races without the draw factor.
- Debut course – back to this potentially tricky factor. I no longer needed to worry about debutants and what figure I would assign to them. However, as I mentioned earlier, I would still need to collect the LTO course data one course at a time combined with the number of career runs the horse had. As with the trainer data collection plan the aim would be to have ‘course on debut’ PRB figures for horses that had raced once, twice, and three or more. Earlier I had reckoned that I would need to collect separate data around 100 to 200 times and marry it together; it was clear that this was going to be within that range, although at the lower end (roughly 110).
-------
It was at this point that, if I had a towel nearby, I would have thrown it in! I had already reached the moment where the data collection and subsequent number crunching was too much to comprehend and hence attempt. It would take several weeks – far too many hours of my time for what I was endeavouring to do. Not only that, I still had three other rating factors where I would need to gather data. That being said, data collection for those three factors would all be far less onerous than the first three. However, it would still be several hours’ worth to add on top.
I was at a crossroads: I needed to decide whether I totally shelved my idea, or adapted it in some way. It has already been established that logically I cannot back test the data over several hundred races as I’d like to, due to the vast amount of time it would take. However, an alternative would be to look to rate races one by one, in real time as it were. Find races for the remainder of the season that qualify and then number crunch each individual race. To be able to do that though, I would still need to have sourced and collated the trainer data from the last few seasons (probably going back to 2015 or thereabouts).
In addition, I would need to source and calculate the PRB figures for LTO position and LTO price. I cannot use the PRB figures I used in the all-age handicap ratings because I used past all-age handicaps to calculate them. To collate the LTO position and LTO price PRB figures for 2yos would not take too long. Again, hours rather than days. On a more positive note the horse sex figures I had already calculated so that rating factor is no problem.
Then, for the sire stats (which I could incorporate doing it this way) and the debut course stats, I would need to check each horse in the race, crunching and then collating the relevant figures. That would take some time, and rating one race would potentially take up to 20 minutes if there was a big field of runners. On the plus side, once I had calculated the individual sire PRB figure that could be added to my 2yo ratings database.
The same would apply for the course on debut/number of career runs PRB figures. Once one was calculated that, too, could be added to the database. After rating, say, 20 to 30 races, the sire PRB stats and the debut course PRB stats would be starting to build up. That would make rating subsequent races far easier as I would start to have some data to hand for some horses that I didn’t need to recalculate.
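The idea of building up the database race by race is essentially a cache: calculate a figure the first time it is needed, store it, and reuse it afterwards. A minimal Python sketch follows. The class, the stand-in calculation and the figures are all hypothetical; in practice the slow step would be the manual data pull.

```python
class PrbCache:
    """Store PRB figures once calculated so later races can reuse them."""

    def __init__(self, calculate):
        self._calculate = calculate  # the slow step, e.g. a manual data pull
        self._store = {}
        self.misses = 0              # how many times the slow step was needed

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._calculate(key)
        return self._store[key]


def slow_sire_prb(key):
    """Stand-in for the manual calculation; returns made-up figures."""
    made_up = {("Sire A", "5-6f"): 0.57, ("Sire B", "7f+"): 0.52}
    return made_up[key]


sire_cache = PrbCache(slow_sire_prb)
first = sire_cache.get(("Sire A", "5-6f"))  # calculated and stored
again = sire_cache.get(("Sire A", "5-6f"))  # served from the store
```

The same structure would serve for the debut course figures, keyed by course and number of career runs instead of sire and distance band.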
Hence this is a potential way forward for these ratings should I choose to go that route in future. It will still be a very slow process, and because of that I am undecided in terms of what to do. What is most likely to happen is that I will start to collate some stats over the coming weeks, then try and rate five or six races, and go from there. If the first few races offer some positive signs, it will be easier to plough on and look at more races. If they don’t then it possibly is back to the drawing board.
-----
I hope this article has highlighted the fact that not all horse racing research goes smoothly.
It also shows that, despite all the best intentions, some ideas, no matter how good they may turn out to be, are simply too time-consuming or difficult to research. What has happened to me here is not a one off. In the past I have started researching numerous ideas with the plan of writing about them, only to abort the process at some point. So I’m used to the disappointment!
That was going to be the end of the article, but before checking it through I decided to source and collate the trainer data. As I have now done that I feel it is only fair to share the data with you. If nothing else, you now have some 2yo trainer PRB figures that may prove useful.
Below is a table of 2yo maiden/novice PRB figures for a selection of trainers. I have chosen the 30 trainers who have saddled the most 2yo runners. The figures are grouped as I discussed earlier into horses that have run once previously, horses that have had two career starts and then horses who have run three or more times:
As you can see most trainers have similar figures in the first two columns, with the third column being the best. My next job will be to source and calculate the PRB trainer figures for horses making their debut. However, that will need to wait for another time.
So I will finish here and ponder what next as far as my attempt to produce ratings for 2yo maiden/novice races is concerned. There will be an update in the future, I promise!
- DR