EURURO Vol. 72 No. 5

agreement for combined review of PUNLMP + LG versus HG

(kappa values 0.70 and 0.91, respectively).

3.5. Discussion

3.5.1. Principal findings

This study demonstrates that both classifications identify

patients at risk of tumour progression and recurrence; the

risk rises significantly with increasing grade.

Additionally, we found that the 2004/2016 classification

identifies patients with generally better prognosis. Our

analysis demonstrates lower progression rates in all three

grades of the 2004/2016 classification compared with the

1973 classification. Progression rates in G1 patients were

similar to those in LG patients, while the rates in G3 patients

were higher than those in HG patients. We found a lower

recurrence rate in PUNLMP versus G1 patients, but a higher

recurrence rate in G3 compared with HG patients.

Reproducibility assessment was hindered by a paucity of

available studies [3,33]. In both studies, the interobserver

reproducibility for G1 versus G2 versus G3 tumours was

poor (kappa values 0.003–0.365), while the interobserver

reproducibility for PUNLMP versus LG versus HG was poor

to fair (kappa values 0.17–0.516). Comparing the reproduc-

ibility of G1 + G2 versus G3 and PUNLMP + LG versus HG

tumours, kappa values were slightly higher for the 2004/

2016 classification (0.44–0.58 vs 0.46–0.72). These findings

suggest that the interobserver reproducibility of the 2004/

2016 classification may be slightly better than that of the

1973 classification; however, the interobserver kappa

values for both systems are disappointingly low.
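For readers less familiar with the kappa statistic quoted above, the following is a minimal sketch of how an unweighted Cohen's kappa can be computed for two sets of grade assignments. It assumes the standard unweighted definition, kappa = (p_o − p_e) / (1 − p_e); the included studies may have used weighted or multi-rater variants. The function name `cohen_kappa` and the example grades are hypothetical and are not data from any of the reviewed studies.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (observed agreement, unweighted Cohen's kappa) for two raters.

    ratings_a and ratings_b are equal-length sequences of category labels,
    one label per tumour, e.g. "G1"/"G2"/"G3" or "PUNLMP"/"LG"/"HG".
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(ratings_a)

    # Observed agreement: fraction of cases given the same grade twice.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over categories of the product of the two
    # marginal proportions.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return p_o, (p_o - p_e) / (1 - p_e)


# Hypothetical example: ten tumours graded by two pathologists
# (or twice by the same pathologist for intraobserver repeatability).
first = ["G1", "G2", "G2", "G3", "G2", "G1", "G3", "G2", "G2", "G3"]
second = ["G1", "G2", "G3", "G3", "G2", "G2", "G3", "G2", "G1", "G3"]

agreement, kappa = cohen_kappa(first, second)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Because kappa corrects the raw agreement for agreement expected by chance, a high percentage agreement can coexist with a modest kappa when most tumours fall into a single category.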

The repeatability of both 1973 and 2004/2016 classifications

was assessed in two studies [3,16]. In general, the

intraobserver repeatability for G1 versus G2 versus G3 for the

two pathologists was good (kappa values 0.61–0.69),

whereas the repeatability for PUNLMP versus LG versus

HG was fair to good (kappa values 0.56–0.83). Moreover,

repeatability for G1 + G2 versus G3 and PUNLMP + LG versus

HG was good to excellent (kappa values 0.88 and 0.80,

respectively). One study [16] suggests that the intraobserver

repeatability of the 2004/2016 classification may be better

than that of the 1973 classification; however, another

demonstrated no difference [3].

3.5.2. How do the review findings impact clinical practice and further research?

To address this, a discussion of the background, rationale,

and critique of both grading systems is essential. Tumour

grade is routinely used to determine prognosis, treatment,

and follow-up of patients with NMIBC. Ideally, a grading

system has to be practical, reproducible, and prognostically

valid. EAU guidelines currently advocate the simultaneous

use of both 1973 and 2004/2016 WHO classifications for

grade because the 2004/2016 classification has not been

sufficiently validated against the 1973 system [4].

Although the 1973 classification is well understood by

clinicians, it has been criticised for a poorly defined G2

category, seen as a ‘‘default diagnosis’’. Pathologists tend to

classify a majority of tumours into the middle group when

using a three-tier grading system [35].

The 2004/2016 classification is based on better-defined

histological criteria. In theory, this should reduce inter- and

intraobserver variability within a two-tiered classification,

with the addition of the PUNLMP category. However, several

studies have shown considerable interobserver variability

using the WHO 2004/2016 system [3,16,33].

There are several groups that are problematic for both

grading systems:

3.5.2.1. G2 category.

A high percentage of NMIBC is classified

as G2 disease; previous studies have suggested that this is

due to a lack of a clear definition of this category [8,36]. The

proportion of G2 tumours in the 20 studies analysed in this

systematic review was 50%; G1 tumours comprised 29% and

G3 tumours 21%. This confirms the tendency to classify

most patients as G2 in the 1973 classification and

corresponds to the incidence of G2 tumours reported in

the literature, which varies from 13% to 69% [37,38].

3.5.2.2. HG category.

The primary objective of the 2004/2016

system was to improve the stratification of patients

according to the risk of progression [8]. However, the

inclusion of some G2 patients significantly enlarges the

high-risk group. The percentage of patients with HG tumours

was twofold higher (1887 cases, 42%) than that of patients with G3

tumours (929 cases, 21%; Table 1). Treating HG tumours the

same as G3 disease could lead to overtreatment of patients

Table 5 – Intraobserver repeatability for the 1973 and 2004/2016 WHO classifications

| Study | 1973 WHO classification: pathologist (type of analysis) | Agreement (95% CI) | Kappa (95% CI) | 2004 WHO classification: pathologist (type of analysis) | Agreement (95% CI) | Kappa (95% CI) |
| --- | --- | --- | --- | --- | --- | --- |
| Mangrud (2014) [16] | A (G1 vs G2 vs G3) | 68% (61–74%) | 0.69 (0.59–0.79) | | | |
| | A (G1 + G2 vs G3) | 88% (82–92%) | 0.66 (0.54–0.79) | | | |
| | B (G1 vs G2 vs G3) | 63% (56–70%) | 0.61 (0.48–0.74) | B (PUNLMP vs LG vs HG) | 93% (88–96%) | 0.83 (0.74–0.92) |
| | B (G1 + G2 vs G3) | 89% (83–93%) | 0.68 (0.55–0.80) | | | |
| van Rhijn (2010) [3] | A (G1 vs G2 vs G3) | 80% | 0.67 (0.57–0.76) | A (PUNLMP vs LG vs HG) | 71% | 0.56 (0.46–0.66) |
| | D (G1 vs G2 vs G3) | 81% | 0.69 (0.59–0.78) | D (PUNLMP vs LG vs HG) | 82% | 0.69 (0.60–0.78) |
| | A (G1 + G2 vs G3) | 91% | 0.64 (0.48–0.81) | A (PUNLMP + LG vs HG) | 86% | 0.68 (0.57–0.80) |
| | D (G1 + G2 vs G3) | 95% | 0.88 (0.80–0.96) | D (PUNLMP + LG vs HG) | 90% | 0.80 (0.72–0.89) |

CI = confidence interval; G1 = grade 1; G2 = grade 2; G3 = grade 3; HG = high grade; LG = low grade; PUNLMP = papillary urothelial neoplasm with low malignant potential; WHO = World Health Organization.

European Urology 72 (2017) 801–813
