

agreement for combined review of PUNLMP + LG versus HG
(kappa values 0.70 and 0.91, respectively).
3.5. Discussion
3.5.1. Principal findings
This study demonstrates that both classifications identify
patients at risk of tumour progression and recurrence; the
risk rises significantly with increasing grade.
Additionally, we found that the 2004/2016 classification
identifies patients with generally better prognosis. Our
analysis demonstrates lower progression rates in all three
grades of the 2004/2016 classification compared with the
1973 classification. Progression rates in G1 patients were
similar to those in LG patients, while the rates in G3 patients
were higher than those in HG patients. We found a lower
recurrence rate in PUNLMP versus G1 patients, but a higher
recurrence rate in G3 compared with HG patients.
Reproducibility assessment was hindered by a paucity of
available studies [3,33]. In both studies, the interobserver
reproducibility for G1 versus G2 versus G3 tumours was
poor (kappa values 0.003–0.365), while the interobserver
reproducibility for PUNLMP versus LG versus HG was poor
to fair (kappa values 0.17–0.516). Comparing the reproducibility
of G1 + G2 versus G3 and PUNLMP + LG versus HG
tumours, kappa values were slightly higher for the 2004/2016
classification (0.46–0.72) than for the 1973 classification
(0.44–0.58). These findings
suggest that the interobserver reproducibility of the 2004/2016
classification may be slightly better than that of the
1973 classification; however, the interobserver kappa
values for both systems are disappointingly low.
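To illustrate why raw percentage agreement and kappa can diverge, the following minimal Python sketch computes Cohen's kappa for two readings of the same set of slides (two pathologists, or one pathologist at two time points). The function names and the 2 × 2 counts are hypothetical and are not taken from the included studies, which may have used other kappa variants (for example, weighted or multi-rater kappa).

```python
from typing import Sequence


def observed_agreement(table: Sequence[Sequence[int]]) -> float:
    """Proportion of cases on the diagonal (both readings give the same grade)."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


def cohen_kappa(table: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa for a square contingency table.

    Rows are the grades from the first reading, columns the grades
    from the second reading.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_observed = observed_agreement(table)
    # Agreement expected by chance if the two readings were independent.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (total * total)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts (NOT study data): non-high-grade vs high-grade.
hypothetical_counts = [
    [85, 5],  # first reading non-high-grade
    [5, 5],   # first reading high-grade
]

print(f"agreement = {observed_agreement(hypothetical_counts):.0%}")
print(f"kappa     = {cohen_kappa(hypothetical_counts):.2f}")
```

With these hypothetical counts the two readings agree in 90% of cases, yet kappa is only about 0.44, because most of that agreement would be expected by chance when one category dominates.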
The repeatability of both 1973 and 2004/2016 classifications
was assessed in two studies [3,16]. In general, the
intraobserver repeatability for G1 versus G2 versus G3 for the
two pathologists was good (kappa values 0.61–0.69),
whereas the repeatability for PUNLMP versus LG versus
HG was fair to good (kappa values 0.56–0.83). Moreover,
repeatability for G1 + G2 versus G3 and PUNLMP + LG versus
HG was good to excellent (kappa values 0.88 and 0.80,
respectively). One study [16] suggests that the intraobserver
repeatability of the 2004/2016 classification may be better
than that of the 1973 classification; however, another
demonstrated no difference [3].
3.5.2. How do the review findings impact clinical practice and further research?
To address this, a discussion of the background, rationale,
and critique of both grading systems is essential. Tumour
grade is routinely used to determine prognosis, treatment,
and follow-up of patients with NMIBC. Ideally, a grading
system has to be practical, reproducible, and prognostically
valid. EAU guidelines currently advocate the simultaneous
use of both 1973 and 2004/2016 WHO classifications for
grade because the 2004/2016 classification has not been
sufficiently validated against the 1973 system [4].
Although the 1973 classification is well understood by
clinicians, it has been criticised for a poorly defined G2
category, seen as a “default diagnosis”. Pathologists tend to
classify a majority of tumours into the middle group when
using a three-tier grading system [35].
The 2004/2016 classification is based on better-defined
histological criteria. In theory, this should reduce inter- and
intraobserver variability within a two-tiered classification,
with the addition of the PUNLMP category. However, several
studies have shown considerable interobserver variability
using the WHO 2004/2016 system [3,16,33].
There are several groups that are problematic for both
grading systems:
3.5.2.1. G2 category. A high percentage of NMIBC is classified
as G2 disease; previous studies have suggested that this is
due to a lack of a clear definition of this category [8,36]. The
proportion of G2 tumours in the 20 studies analysed in this
systematic review was 50%; G1 tumours comprised 29% and
G3 tumours 21%. This confirms the tendency to classify
most patients as G2 in the 1973 classification and
corresponds to the incidence of G2 tumours reported in
the literature, which varies from 13% to 69% [37,38].
3.5.2.2. HG category. The primary objective of the 2004/2016
system was to improve the stratification of patients
according to the risk of progression [8]. However, the
inclusion of some G2 patients significantly enlarges the
high-risk group. The percentage of patients with HG tumours
(1887 cases, 42%) was two-fold higher than that of patients
with G3 tumours (929 cases, 21%; Table 1). Treating HG tumours the
same as G3 disease could lead to overtreatment of patients
Table 5 – Intraobserver repeatability for the 1973 and 2004/2016 WHO classifications

| Study | 1973 WHO: pathologist (type of analysis) | 1973 WHO: agreement (95% CI) | 1973 WHO: kappa (95% CI) | 2004 WHO: pathologist (type of analysis) | 2004 WHO: agreement (95% CI) | 2004 WHO: kappa (95% CI) |
|---|---|---|---|---|---|---|
| Mangrud (2014) [16] | A (G1 vs G2 vs G3) | 68% (61–74%) | 0.69 (0.59–0.79) | NA | NA | NA |
| | A (G1 + G2 vs G3) | 88% (82–92%) | 0.66 (0.54–0.79) | NA | NA | NA |
| | B (G1 vs G2 vs G3) | 63% (56–70%) | 0.61 (0.48–0.74) | B (PUNLMP vs LG vs HG) | 93% (88–96%) | 0.83 (0.74–0.92) |
| | B (G1 + G2 vs G3) | 89% (83–93%) | 0.68 (0.55–0.80) | | | |
| van Rhijn (2010) [3] | A (G1 vs G2 vs G3) | 80% | 0.67 (0.57–0.76) | A (PUNLMP vs LG vs HG) | 71% | 0.56 (0.46–0.66) |
| | D (G1 vs G2 vs G3) | 81% | 0.69 (0.59–0.78) | D (PUNLMP vs LG vs HG) | 82% | 0.69 (0.60–0.78) |
| | A (G1 + G2 vs G3) | 91% | 0.64 (0.48–0.81) | A (PUNLMP + LG vs HG) | 86% | 0.68 (0.57–0.80) |
| | D (G1 + G2 vs G3) | 95% | 0.88 (0.80–0.96) | D (PUNLMP + LG vs HG) | 90% | 0.80 (0.72–0.89) |

CI = confidence interval; G1 = grade 1; G2 = grade 2; G3 = grade 3; HG = high grade; LG = low grade; NA = not available; PUNLMP = papillary urothelial neoplasm with low malignant potential; WHO = World Health Organization.
European Urology 72 (2017) 801–813