
As seems to have become the norm in the preamble to the Advanced Technologies and Treatments in Diabetes conference, another accuracy study was released in the past month (along with various “new” CGMs getting CE mark clearance).
As ever, all of these things deserve scrutiny, even if only to bring them to people’s attention.
So what is this latest study?
It’s “Performance of Three Continuous Glucose Monitoring Systems in Adults With Type 1 Diabetes” undertaken in Germany with Guido Freckmann involved. Having Guido involved is important as he’s been one of the leading lights in attempting to introduce standards around CGM accuracy measurements.
The study
The study itself compares the Abbott Libre3, Dexcom G7 and Medtronic Simplera side by side, and is probably one of the best designed studies of its type. Quoting from the study, the method is:
Method: Twenty-four adult participants with type 1 diabetes mellitus wore one sensor of each CGM system in parallel for up to 15 days. Sensors of DG7 and MSP were exchanged on days 5 and 8, respectively. Three 7-hour sessions with 15-minute comparator blood glucose–level measurements using YSI 2300 (YSI, venous), Cobas Integra (INT, venous), and Contour Next (CNX, capillary) were conducted on days 2, 5, and 15. Simultaneously, glucose-level excursions with transient hyperglycemia and hypoglycemia were induced according to a recently published testing procedure. The accuracy was evaluated using various metrics, including mean absolute relative differences (MARDs).
This is generally considered to be about as good as it gets in terms of procedures, even if the number of participants, and therefore datapoints, is a little low (only 2088 datapoints).
To analyse the data, the recorded datapoints of each system were taken retrospectively and compared against the various blood glucose testing systems. The CGM datapoint closest in time to each comparator reading was used, within a ±5 minute window.
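For anyone who wants to see what that matching and MARD calculation look like in practice, here’s a minimal sketch in Python. It isn’t the study’s actual code: the function names, data layout and use of mg/dL are my own assumptions, but the logic follows the description above (nearest CGM reading to each comparator value, within ±5 minutes).

```python
# A minimal sketch of the pairing and MARD calculation described above.
# cgm and comparator are lists of (timestamp, value-in-mg/dL) tuples;
# these names and structures are assumptions, not the study's actual code.
from datetime import timedelta

MAX_GAP = timedelta(minutes=5)

def pair_readings(cgm, comparator):
    """Match each comparator value to the CGM reading closest in time, within ±5 minutes."""
    pairs = []
    for ref_time, ref_value in comparator:
        nearest_time, nearest_value = min(cgm, key=lambda point: abs(point[0] - ref_time))
        if abs(nearest_time - ref_time) <= MAX_GAP:
            pairs.append((nearest_value, ref_value))
    return pairs

def mard(pairs):
    """Mean absolute relative difference, as a percentage."""
    return 100 * sum(abs(cgm_v - ref_v) / ref_v for cgm_v, ref_v in pairs) / len(pairs)
```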
The results
The results were fairly striking. The table below breaks them out clearly.

A key thing this table highlights is that the chosen comparator is important in determining results in an accuracy study. We see this here with the three comparisons and what they measure.
The YSI (YSI 2300 Stat Plus system) uses a glucose oxidase-based measurement method.
The INT (Cobas Integra 400 Plus system) uses a hexokinase-based method.
The CNX (Contour Next) uses a glucose dehydrogenase-based method (and was probably selected as being the most “accurate” of the finger prick tests). Incidentally, the Abbott Optium strips that Abbott normally recommend for Libre comparison are glucose oxidase-based.
All of the sensors in use are glucose oxidase-based.
This adds to the complexity of how one should measure accuracy. As the table above shows, all the sensors are roughly equivalent when compared to the glucose oxidase-based YSI venous test (which is generally the comparator used in MARD-style studies). The other two options show a wider variety of outcomes depending on the blood test in use.

As we’ve discussed at Diabettech many times, MARD isn’t really a good measure of accuracy at the best of times, and is subject to a variety of biases within study design. The report also helpfully includes the CG-DIVA data, shown above. This demonstrates that, compared to the YSI tests, the different sensors meet the iCGM standards to differing degrees at different glucose ranges, and also that, across the multiple sensors used in the study, both the Libre3 and G7 tended to overstate values while the Simplera tended to understate them.
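To make the over/understating point concrete, here’s a hedged illustration of the simpler idea underneath it: a signed bias per glucose range. This is not CG-DIVA itself, which has its own published definition; the ranges and function names are illustrative only.

```python
# An illustration of signed bias by glucose range (not CG-DIVA itself).
# A positive mean relative difference means the sensor tends to read higher
# than the comparator in that range; negative means lower. Ranges are mine.
RANGES = [(0, 70), (70, 180), (180, 400)]  # mg/dL bands, for illustration only

def bias_by_range(pairs):
    """pairs is a list of (cgm_value, reference_value) tuples in mg/dL."""
    results = {}
    for low, high in RANGES:
        in_range = [(c, r) for c, r in pairs if low <= r < high]
        if in_range:
            results[(low, high)] = 100 * sum((c - r) / r for c, r in in_range) / len(in_range)
    return results
```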
The study vs the real world
While I’ve often railed at the lack of consistency across studies of CGM accuracy, one thing I’ve mentioned far less frequently is the applicability to the real world.
And in this particular case, there are a number of things we can discuss.
There are two main components of the study design that raise questions about how the outcomes of this study would compare with real world use of these sensors: the data value used when comparing with the blood tests, and the timing of that data value.
Data value used
As the paper clearly states, five-minute recorded data is used to allow systematic comparison of the blood data with the CGM reading.
For the Simplera and G7, the recorded data is the actual CGM reading at the time.
For the Libre3, that’s not the case. The key here is recorded vs reported data.
When you look at the LibreView app to make a decision, you see a datapoint “created” by the app, using the actual reading and some algorithmic magic to extrapolate what’s going on, to try and provide a number close to a fingerprick.
The recorded five-minute data is not this one-minute value sampled at five-minute intervals. Nor is it the average of those values that you could have seen in the app at a point in time.
It’s a retrospectively smoothed average of the data, with various outliers removed.
So with the best will in the world, the study is comparing apples with oranges when you try and read across between the Libre3 and either of the other two sensors. If anything, by using the smoothed, retrospectively corrected, recorded datapoint, you may be biasing the Libre3 CGM readings to be more, rather than equally “accurate”.
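A toy example may help show why recorded and reported values differ. This is emphatically not Abbott’s algorithm, just a simple contrast between a value a real-time display could show (which can only use past readings) and a retrospectively smoothed record (which can also use readings that came afterwards).

```python
# A toy contrast between "reported" (real-time) and "recorded" (retrospective)
# values. This is not Abbott's algorithm; both functions are illustrative.
def reported_at(readings, i, window=3):
    """What a live display could show: it can only use readings up to index i."""
    start = max(0, i - window + 1)
    return sum(readings[start:i + 1]) / (i + 1 - start)

def recorded_at(readings, i, half_window=2):
    """A retrospectively smoothed value: a centred average that also uses later readings."""
    lo, hi = max(0, i - half_window), min(len(readings), i + half_window + 1)
    return sum(readings[lo:hi]) / (hi - lo)
```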
Timing of the CGM data point used
A key question is about the timing of the datapoint used in these comparisons.
In this study, the nearest CGM reading to the comparator is used, within a ±5 minute period.
There are a number of arguments for operating like this, but equally a number of criticisms as well.
As we said in the previous discussion of CGM head-to-heads:
The Dexcom G7 accuracy study suggests that the mean lag between venous and CGM readings was 3 minutes 30 seconds. For the Libre3, the equivalent study suggests it had a lag of 1 minute 48 seconds.
What does this mean?
It means that, because the recorded CGM data comes at five-minute intervals, in the worst case the accuracy calculations use a comparator reading taken 2 minutes 29 seconds after the CGM data point.
Given that for the G7 the CGM data was based on a blood glucose value some 210 seconds earlier, in the worst case the comparator blood value being compared against the CGM reading could be nearly six minutes later than the blood value that CGM reading was actually based on. For the Libre3, this timeframe is smaller, but it still exists. If we assume that around 50% of the CGM readings used fall before the blood comparator reading, this biases any study towards the Libre3, due to its reduced lag from blood to reading.
If we wanted the study to reflect the absolute accuracy of the systems, it would need to use the first CGM reading in the five minute period after the comparator blood value.
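To put some rough numbers on that, here’s a small sketch using the lag figures quoted above and the worst-case matching offset, followed by the alternative matching rule just described. The constants and function name are mine, for illustration only.

```python
# Rough numbers for the timing argument above; constants are taken from the
# figures quoted in the text and the names are mine, for illustration only.
MATCH_OFFSET_S = 149                     # worst-case gap to the nearest CGM point (2 min 29 s)
MEAN_LAG_S = {"G7": 210, "Libre3": 108}  # mean venous-blood-to-CGM lag from the accuracy studies

for sensor, lag in MEAN_LAG_S.items():
    print(f"{sensor}: comparator can trail the underlying blood value by ~{MATCH_OFFSET_S + lag} s")
# G7: ~359 s (almost six minutes); Libre3: ~257 s (just over four minutes)

# The alternative matching rule described above: the first CGM reading *after*
# the comparator blood value, within the five-minute window.
from datetime import timedelta

def first_after(cgm, ref_time, window=timedelta(minutes=5)):
    """cgm is a list of (timestamp, value) tuples; returns the first reading after ref_time."""
    candidates = [point for point in cgm if ref_time <= point[0] <= ref_time + window]
    return min(candidates, key=lambda point: point[0]) if candidates else None
```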
But, and it’s quite a big one, this may not be reflective of how users would look at the data. We should also remember that the datapoints the apps provide may be a reflection not of the data as it was, but of what it is expected to be now.
As an end user, it depends on your knowledge of CGMs and how they work. If you are aware that there is an inherent lag in the system, then you would probably want to see the above method.
If, on the other hand, like many users, you aren’t aware of it, you’d want to see the closest result to the fingerprick that you’ve just done, which is what the study reflects.
And what about an AID system? What datapoints does it use?
Do studies need to take this into account?
If there was a way to reflect the real Libre3 datapoint at the five-minute intervals used in these types of study, rather than the Abbott-adapted, recorded value, I’d like to see it used. It would then align with what a user might see and provide a better handle on the user experience of accuracy. There are applications in the wild that allow the minute-by-minute datapoints created by the onboard processing algorithm of the LibreView app to be used, but I can’t see Abbott supporting their use in a study like this. I think that would go a long way to harmonising user experience with study data evaluation.
And what about timing? As I mentioned, this one is harder. For some people, the nearest CGM reading to the comparator reflects their view of the world. For others, the first CGM point following the comparator blood reading fits their model. As long as the process is consistent across a study and more importantly, is standardised so that all studies use the same technique, I suspect that it doesn’t matter too much.
Ultimately, the key question is:
Does it really matter?
As long as the long term health outcomes reflect improvements with the use of these technologies, and informed choices can be made across standardised metrics that give some reflection of what users see in the real world, some of the minutiae will remain just that: irrelevant to the average end user.