Determining depth of invasion

Clinical question

How should the depth of invasion of oral cavity carcinomas be determined?

Recommendation

Determine the depth of invasion with an imaging technique, preferably MRI, with intraoral ultrasound as the alternative.

Considerations

The results for depth of invasion are summarized in Table 1, which can be found in the summary of the literature (under the ‘evidence base’ tab). For agreement on T-stage based on depth-of-invasion measurements compared with histopathology, no data were found for CT or PET-CT. For agreement categorized at a threshold, no data were found comparing CT or PET-CT with histopathology. Finally, for agreement of depth-of-invasion measurements on a continuous scale (in millimeters), no data were found for CT, PET-CT, or ultrasound.

The certainty of the evidence was assessed with a modified version of GRADE (Mokkink, 2018). For agreement on T-stage based on depth-of-invasion measurements, confidence in a clinical assessment (with unclear procedures) was low according to GRADE. Goel (2016) reported an agreement of κ = 0.47 between a clinical assessment (with unclear procedures) and histopathology. Confidence in the reported MRI outcomes for agreement on T-stage was moderate. Three studies reported outcomes here (Goel, 2016; Verma, 2019; Vidiri, 2019). Verma (2019) and Vidiri (2019) used the 8^th edition of the TNM classification (Verma (2019) also used the 7^th edition). The kappa values were 0.69 (95% CI: not reported), 0.65 (95% CI: not calculated), 0.74 (experienced reader, 95% CI: 0.56 to 0.92), and 0.60 (inexperienced reader, 95% CI: 0.40 to 0.80) (Goel, 2016; Verma, 2019; Vidiri, 2019). These results lie around the predefined decision threshold (that is, κ ≥ 0.70 was regarded as sufficient agreement).

Two studies were included for depth-of-invasion measurements categorized at a threshold (Alsaffar, 2016; Iida, 2018). The only threshold used was 5 millimeters, giving two categories (that is, a depth of invasion of < 5 millimeters or ≥ 5 millimeters). According to GRADE, confidence in the reported outcomes was very low for clinical assessment, MRI, and ultrasound. The very low GRADE rating was mainly due to the risk of bias and the small sample sizes. In both studies, the time between the preoperative assessment and the histopathological assessment was unclear.

Finally, for agreement between modalities and histopathology on a continuous scale (in millimeters), data were found only for MRI (Mao, 2019; Vidiri, 2019). Both studies partially reported the time between the preoperative assessment and the histopathological assessment. Mao (2019) stated that the preoperative MRI measurement was performed within 1 week before resection, whereas this was 3 to 4 weeks for Vidiri (2019). The working group had determined a priori that a measurement up to 4 weeks before surgical resection was considered adequate. According to GRADE, there is moderate confidence in the results of the two studies for measuring depth of invasion with MRI on a continuous scale (in millimeters). In one study, 95% of the MRI measurements (n=150) lay between an underestimation of 0.97 millimeters and an overestimation of 5.61 millimeters (Mao, 2019). In the other study (n=43), 95% of the measurements lay between an underestimation of 5.5 millimeters and an overestimation of 4.9 millimeters for an experienced radiologist, and between an underestimation of 6.6 millimeters and an overestimation of 5.8 millimeters for an inexperienced radiologist (Vidiri, 2019).

The results for tumor thickness are summarized in Table 2, which can be found in the summary of the literature (under the ‘evidence base’ tab). For agreement on T-stage based on tumor thickness measurements, no data were found for CT, PET-CT, MRI, or ultrasound. For agreement categorized at a threshold, no data were found for any modality of interest (that is, clinical examination, CT, PET-CT, MRI, and ultrasound). Finally, for agreement of tumor thickness measurements on a continuous scale (in millimeters), clinical examination was the only modality for which no data were found.

For agreement on T-stage based on tumor thickness measurements, one study was included (Choi, 2017). This study compared a work-up (consisting of endoscopic assessment, palpation, and imaging by CT or MRI) with histopathological assessment. A kappa of 0.80 was reported, but the 95% CI was not given and confidence in this outcome was low (according to GRADE).

No studies were included that categorized agreement on tumor thickness using specific thresholds. Consequently, no data are available for any modality of interest in this situation.

When agreement was examined on a continuous scale (in millimeters), confidence in the reported results was, according to GRADE, very low for CT, low for MRI, and moderate for ultrasound. Confidence in the reported MRI outcomes was downgraded because of the limited sample size and the risk of bias. The time between the preoperative assessment and the histopathological assessment was unclear in both studies. With MRI, 95% of the measurements lay between an underestimation of 5.3 millimeters and an overestimation of 5.4 millimeters in 83 participants (Brouwer de Koning, 2019) and between an underestimation of 4.6 millimeters and an overestimation of 4.99 millimeters in 150 patients (Nair, 2018), relative to histopathology. Confidence in the reported ultrasound outcomes was moderate. Klein Nulent (2018) combined data from 10 studies, so that the presented Bland-Altman plots contained data from 240 patients. It was reported that 95% of the ultrasound measurements lay between an underestimation of 5.5 millimeters and an overestimation of 6.5 millimeters relative to histopathology (Klein Nulent, 2018).
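The ranges quoted above are Bland-Altman 95% limits of agreement. As a minimal sketch of how such limits are conventionally derived from paired measurements, the example below assumes the common definition (mean difference ± 1.96 standard deviations of the differences) and the sign convention used in this module (positive difference = imaging overestimates histopathology). The function name and the measurement values are hypothetical and are not taken from any included study.

```python
from statistics import mean, stdev


def bland_altman(imaging_mm, histology_mm, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Differences are imaging minus histopathology, so a positive bias
    means the imaging modality overestimates on average.
    """
    diffs = [i - h for i, h in zip(imaging_mm, histology_mm)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd


# Hypothetical paired measurements in millimeters (illustration only).
imaging = [4.0, 7.5, 10.0, 12.5, 6.0, 9.0]
histology = [5.0, 6.5, 11.0, 10.5, 6.0, 8.0]

bias, lower, upper = bland_altman(imaging, histology)
print(f"bias {bias:.2f} mm, 95% limits of agreement {lower:.2f} to {upper:.2f} mm")
```

A negative lower limit corresponds to what the text calls underestimation, and a positive upper limit to overestimation.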

Underestimating the depth of invasion or tumor thickness increases the risk of inadequate resection margins, whereas overestimating it increases the risk of excessively wide resection margins. If the resection margins are inadequate, there is (often in combination with other adverse histopathological findings) an indication for adjuvant therapy in the form of re-operation or radiotherapy with or without chemotherapy. Inadequate resection margins may reduce survival. Adjuvant treatment or excessively wide resection margins may impair oral function and reduce quality of life. In general, underestimation is considered more serious than overestimation.

After the literature search date, the working group was made aware of several relevant publications. These publications were therefore not included in the literature analysis, but they are briefly discussed here for consideration. This brief discussion may not give a complete overview of the literature published after the systematic search for this guideline module, and it contains no GRADE assessments.

Noorlag (2020) used retrospective data to compare tumor depth measured with MRI or intraoral ultrasound against a postoperative histopathological assessment. The authors reported a Pearson correlation coefficient of 0.792 (p<0.001) for MRI relative to histopathology. They also examined which imaging modality differed more from histopathology in tumors with a small depth of invasion (≤ 1 centimeter) and a large depth of invasion (> 1 centimeter). The authors concluded that intraoral ultrasound would be more accurate than MRI for tumors with a small depth of invasion, but that ultrasound would underestimate the depth of invasion in thicker tumors.

Baba (2020) used retrospective data to examine the correlation between MRI and histopathology for measuring depth of invasion in the buccal mucosa. A correlation was reported between coronal T2-weighted MRI and histopathology (Spearman’s r=0.67, p=0.012) and between coronal contrast-enhanced, fat-suppressed T1-weighted MRI (CET1) and histopathology (Spearman’s r=0.68, p<0.001). The authors concluded that MRI measurements could be helpful in estimating the histopathological depth of invasion.

Chin (2020) reported the agreement between contrast-enhanced CT and histopathology for measuring depth of invasion in patients with squamous cell carcinoma of the tongue. For axial contrast-enhanced CT versus histopathology, an ICC (ICC=0.96, 95% CI: 0.89 to 0.98, p<0.001) and a Bland-Altman plot (mean difference: -0.72 millimeters, 95% limits of agreement: -4.78 to 3.34 millimeters) were reported. For coronal contrast-enhanced CT, the ICC was 0.957 (95% CI: 0.86 to 0.99, p<0.001) and the mean difference was -1.11 millimeters (95% limits of agreement: -4.93 to 2.73 millimeters). The authors concluded that agreement between contrast-enhanced CT and histopathology was excellent.

Cocker (2020) used retrospective data to assess the agreement between histopathology and several modalities for measuring depth of invasion in patients with oral cavity carcinomas. Data were reported for ultrasound (exact agreement: 9 measurements, within 3 millimeters: 52 measurements, beyond 3 millimeters: 17 measurements), MRI (exact agreement: 1 measurement, within 3 millimeters: 58 measurements, beyond 3 millimeters: 45 measurements), and CT (exact agreement: 1 measurement, within 3 millimeters: 11 measurements, beyond 3 millimeters: 9 measurements). The authors concluded that ultrasound was the most reliable of the three modalities and that current imaging modalities may not provide robust and accurate measurements.
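To make the counts from Cocker (2020) easier to compare across modalities, the sketch below converts them into the share of measurements that fell within 3 millimeters of histopathology. It assumes that the three reported categories are mutually exclusive (that is, "exact" measurements are not also counted under "within 3 millimeters"); the text does not state this explicitly. The dictionary keys and function name are illustrative.

```python
# Counts as quoted in the text for Cocker (2020).
counts = {
    "ultrasound": {"exact": 9, "within_3mm": 52, "beyond_3mm": 17},
    "MRI": {"exact": 1, "within_3mm": 58, "beyond_3mm": 45},
    "CT": {"exact": 1, "within_3mm": 11, "beyond_3mm": 9},
}


def share_within_3mm(c):
    """Fraction of measurements that were exact or within 3 mm.

    Assumes the three categories do not overlap.
    """
    total = c["exact"] + c["within_3mm"] + c["beyond_3mm"]
    return (c["exact"] + c["within_3mm"]) / total


shares = {modality: share_within_3mm(c) for modality, c in counts.items()}
for modality, share in shares.items():
    print(f"{modality}: {share:.1%} of measurements within 3 mm")
```

Under that assumption, ultrasound has the largest share of measurements within 3 millimeters, which is in line with the authors' conclusion.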

Filauro (2020) reported the correlation of MRI or ultrasound with histopathology for depth-of-invasion measurements in patients with oral cavity carcinomas. A correlation was found between MRI and histopathology (Spearman’s r=0.83, p<0.000) and between ultrasound and histopathology (Spearman’s r=0.76, p<0.0001). Agreement on T-stage was also reported for MRI and histopathology (weighted kappa=0.53, 95% CI: 0.32 to 0.74, p<0.0001) and for ultrasound and histopathology (weighted kappa=0.36, 95% CI: 0.14 to 0.58). The authors concluded that both modalities are valid ways to determine depth of invasion preoperatively, albeit with different cost-effectiveness profiles and indications.

Waech (2020) examined the correlation between MRI or contrast-enhanced CT and histopathology for measuring depth of invasion in patients with oral cavity carcinomas. Spearman’s r was reported for the correlation of T1-weighted MRI (r=0.635, p<0.0001), T2-weighted MRI (r=0.679, p<0.0001), and contrast-enhanced CT (r=0.718, p<0.0001) with histopathology. The authors concluded that preoperative measurements with MRI or CT overestimate the histological depth of invasion, particularly in tumors with a depth of invasion of less than five millimeters.

When depth of invasion is used for T-staging, over- or underestimation can change the clinical T-stage. Because the T-stage based on depth of invasion usually does not affect the primary treatment plan, this seems less important for T-staging than for planning the resection. If the policy for elective treatment of the cervical lymph nodes is based on depth of invasion (rather than on a sentinel node procedure or imaging of the neck), under- or overestimation can affect the management of the neck.

An advantage of MRI is that it is usually already performed to assess the extent of the primary tumor and the cervical lymph nodes. The depth of invasion and tumor thickness can also still be determined from the images at a later time. MRI may be contraindicated; the usual contraindications for MRI apply, and patients should be screened for them as is standard practice. Patients might prefer another imaging modality because of claustrophobia that can occur during an MRI examination. Ultrasound generally depends more on the experience of the operator, and measurements can only be made in real time. In addition, the tumor must be sufficiently accessible to the intraoral probe. Although so-called hockey-stick probes have made tumors at the back of the mouth more accessible, factors such as (altered) local anatomy, dentition, trismus, and tumor size can prevent a good measurement. There are no direct contraindications for intraoral ultrasound, except when the tumor cannot be reached with the probe, for example because of pain, trismus, and/or tumor location. This ultrasound examination can easily be combined with an ultrasound examination of the cervical lymph nodes, if one is planned; only the probe needs to be changed.

In the diagnostic work-up, MRI and/or ultrasound are often already performed for other purposes. These images can also be used to determine the depth of invasion (and/or tumor thickness) without new imaging. Few additional costs are therefore expected from using MRI or intraoral ultrasound. When a superficial tumor is expected not to be visible on MRI, ultrasound can be considered if the location is accessible. The location of the tumor can guide the choice between MRI and ultrasound. Data on the use of ultrasound are available mainly for tumors of the tongue and less so for other tumor locations. For intraoral ultrasound, a hockey-stick probe can be used, specifically for tumor locations that are difficult to reach. The working group is aware that this probe is not available everywhere and that purchasing one entails additional costs. The extra time that ultrasound takes and the training of staff can also entail costs. Both MRI and ultrasound are accepted imaging techniques in the diagnostic work-up of patients with head and neck tumors, so no problems are expected regarding acceptability, feasibility, and implementation.

Rationale for the recommendation: weighing the arguments for and against the interventions

Imaging techniques such as MRI and intraoral ultrasound, which are already used for other purposes in patients with head and neck tumors, can be used to measure the depth of invasion without major additional costs. MRI and intraoral ultrasound appear preferable to the possible alternatives, such as palpation or CT. Because little information is available and there is little certainty about the use of palpation, the working group considers it important to use at least one imaging technique to measure the depth of invasion.

Evidence base

Background

Depth of invasion of the primary tumor is a prognostic factor. In the new TNM classification, depth of invasion has been included as an important parameter for staging oral cavity carcinomas. In oral cavity carcinomas, depth of invasion is also a predictor of the presence of lymph node metastases. Depth of invasion is the distance from the (reconstructed) mucosa to the deepest point of the tumor in the tissue. This is not the same as tumor thickness. In ulcerative tumors, tumor thickness is smaller than the depth of invasion; in exophytic tumors, the reverse is true. Various techniques are used to determine the depth of invasion preoperatively, but it is still unclear which modality is best for the preoperative staging of oral cavity carcinomas.

Summary of literature

Description of studies included for the agreement on depth of invasion

Alsaffar (2016) assessed the agreement between palpation or MR images and histopathology for the depth of invasion. The study recruited patients with newly diagnosed oral squamous cell carcinoma (n=53), of whom 34 were male. The mean age was 64 years (SD or range not reported). Various T-stages (T1: n=22, T2: n=22, T3: n=7, T4: n=2) and N-stages (N0: n=32, N1: n=7, N2: n=11) were in the sample. It was unclear which staging system was used, but it was likely the AJCC TNM staging system (presumably the 7^th edition). Palpation was performed by the treating surgeon, prior to the radiological assessment. Preoperative MRI was performed and the depth of invasion was measured from the adjacent mucosa to the deepest point of tumor invasion. The time period between the preoperative assessments and the histopathological assessment (on formalin-fixed specimens) was unclear. Tumor invasion was categorized in two categories: < 5 millimeters and ≥ 5 millimeters. A Cohen’s kappa was calculated to assess the agreement.

Goel (2016) recruited patients (n=61) to assess the agreement between clinical examination or MRI and histopathology for the depth of invasion (categorized in T-stages) in patients with biopsy proven squamous cell carcinomas of the tongue or gingiva-buccal area. Forty-five of the included patients were male and various T-stages were in the sample (T1: n=4, T2: n=16, T3: n=13, T4: n=28). A TNM staging system was used (unclear edition). No procedures were described for the clinical examination or histopathological assessment. However, the tissue was probably fixed with formalin. MRI was performed with a 1.5T scanner (used sequence: axial and coronal T2WI, postcontrast T1WI). The time period between the preoperative assessments and histopathological assessment was unclear. Agreement between the clinical examination or MR imaging and histopathology on the T-stage was calculated with a Cohen’s kappa.

Iida (2018) assessed the agreement on depth of invasion between ultrasound and histopathology in patients with an early oral tongue squamous cell carcinoma between June 2008 and December 2015. Fifty-six patients were included, of whom 34 were male, with a mean age of 59 years (range: 25 to 90). All participants had their carcinoma located on the lateral border of the tongue. Tumor stage was not reported for the participants. It was unclear whether a staging system was used and, if so, which edition. The ultrasound assessment was performed in an outpatient clinic, using a 16‐MHz scanner and a T‐shaped ultrasonographic probe, with patients extending their tongue during the preoperative ultrasound measurement. Histopathological assessment was performed with a micrometer on the tumor specimen, which was formalin-fixed and paraffin-embedded. The time period between preoperative ultrasound and postoperative histopathology was unclear. The depth of invasion was categorized by a threshold, resulting in two categories: < 5 millimeters and ≥ 5 millimeters. A Cohen’s kappa was calculated for the agreement.

Mao (2019) investigated the agreement between MR imaging and histopathology in patients first diagnosed with squamous cell carcinoma of the tongue (n=150). The mean age of patients was 58 years (SD: 12.1). There were 80 males and 70 females in the sample, with various tumor locations: ventral side of the tongue (n=35), border of the tongue (n=89), dorsal side of the tongue (n=19), and the base of the tongue (n=7). Several tumor morphologies were identified: ulcer type (n=41), invasive type (n=94), and exogenous type (n=15). Participants had a T-stage of T1 (n=43), T2 (n=71), or T3 (n=36) and an N-stage of N1 (n=16), N2b (n=17), or N2c (n=2). The 7^th edition of the AJCC staging system was used. A 1.5T MR scanner was used 1 week preoperatively to measure the depth of invasion with a section thickness of 1 millimeter (used sequences: T1 axial, coronal and sagittal sequences, T2 axial and coronal sequences with fat suppression, T1-weighted axial, coronal and sagittal sequences with fat suppression and contrast media). Surgical tumor specimens were preserved in formalin. Pathological sectioning and staining were performed to measure the tumor invasion. Agreement was quantified in Bland-Altman plots for all participants, per T-stage, and per tumor morphology.

Verma (2019) assessed the agreement between MR imaging and histopathology for tumor thickness (per T-stage) in patients with biopsy-proven squamous cell carcinoma of the tongue (n=50). The sample consisted of 38 males and 12 females, with a mean age of 49 years (SD not reported). Various T-stages were present in the sample: T1 (n=24), T2 (n=18), and T3 (n=8). The 7^th and 8^th editions of the AJCC staging system were used for the study. No other characteristics were reported. Tumor thickness was preoperatively assessed with MR imaging (4 millimeter slices, used sequences: T1WI axial and coronal, T2WI axial, coronal and sagittal, coronal STIR, and postcontrast axial T1W). Tumor dimensions (anteroposteriorly, mediolaterally, superoinferiorly) were measured. Tumor thickness was measured in three dimensions with histopathology (presumably on formalin-fixed material), but no further procedures were reported. The time period between the preoperative MR imaging and the histopathological assessment was unclear. A Cohen’s kappa was not calculated by the authors, but could be calculated from the presented 3-by-3 table showing the T-classifications of MR imaging and histopathology.

Vidiri (2019) assessed the depth of invasion as well as tumor thickness (per T-stage) in patients diagnosed with oral tongue squamous cell carcinoma between 2013 and 2018. The median age of the patients (n=43, 18 males and 25 females) was 65 years (range: 31 to 81). Various T-stages were present: T1 (n=10), T2 (n=12), and T3 (n=21). The 8^th edition of the AJCC staging system was used. Preoperative MR imaging was performed with a 1.5T scanner 3 to 4 weeks preoperatively (used sequences: coronal T2W, axial FSE T2W, pre-contrast axial T1WI, DWI through single-shot spin-echo and echo-planar imaging). Two radiologists, one experienced and one inexperienced, assessed the images independently of each other. Agreement between histopathology and the results of each radiologist was reported separately. Resected tissue was fixed in formalin; embedding, sectioning, and staining (with hematoxylin and eosin) were performed for histopathological analysis. Bland-Altman plots were reported for the depth of invasion and a Cohen’s kappa was reported for agreement on T-stage (tumor thickness).

Studies included for the agreement on tumor thickness

Brouwer de Koning (2019) investigated the agreement between ultrasound or MR imaging and histopathology for tumor thickness in clinically staged T1-2 oral cavity carcinomas. MR images were acquired between 2011 and 2016. A total of 83 patients were included in the analyses with a mean age of 61 years (range: 31 to 88). Forty-five patients were male. Several tumor locations were included in the study: tongue (n=58), floor of the mouth (n=24), palate (n=2), and the lip (n=1). The 7^th and 8^th editions of the AJCC staging system were used. Tumor thickness was measured with ultrasound in 46 patients and with MR imaging in 76 patients. For ultrasound, the probe (13 to 7 MHz transducer) was placed directly on the lesion. MR imaging was performed and tumor dimensions were measured in 3D (used sequences: T1W TSE TRA, TR/TE 538/10ms, flip angle 90, matrix 288/248, slice thickness of 4mm; STIR TSE COR, TR/TE 2500/60ms, matrix 216/170; T1 3D Thrive fat-saturation, intravenous injection of 15cc gadoterate meglumine, TR/TE 9.86/4.59ms, flip angle 10, matrix 200/179, slice thickness 1mm). Radiologists reported the MRI outcome and suggested a T-stage. The pathologist reported the tumor dimensions in the pathology report. Further pathological procedures were not described and it was unclear how specimens were fixed. The time period between the preoperative assessments and the histopathological assessment was unclear.

Choi (2017) assessed the agreement between clinical examination and histopathology for the tumor thickness (categorized in T-stages) in 252 patients with biopsy-proven squamous cell carcinomas of the oral cavity. Patients had a median age of 55 years (range: 47 to 65) and had tumors on the tongue (n=195), floor of mouth (n=34), or buccal mucosa (n=23). Various T-stages were included in the sample: pT1 (n=109), pT2 (n=80), pT3 (n=25), pT4a (n=37), and pT4b (n=1). The 7^th edition of the AJCC staging system was used. Clinical examination consisted of a physician performing preoperative endoscopic assessment, palpation, and imaging with either CT or MRI. Surgical specimens of the primary tumor were assessed microscopically, but it remained unclear how specimens were fixed. Agreement between the T-stages as assessed with the preoperative clinical examination and postoperative histopathology was quantified with a Cohen’s kappa. The time period between both assessments was unclear.

Klein Nulent (2018) performed a systematic review with a search up to the 6^th of July 2016 in PubMed (Medline), EMBASE, and the Cochrane databases for studies comparing intraoral ultrasound tumor thickness measurements with postoperative pathological assessment. Included studies had to contain patients with oral squamous cell carcinoma and the ultrasound measurements had to be performed preoperatively or intraoperatively. Included patients had tongue tumors, buccal mucosa tumors, tumors on the floor of the mouth, lip tumors, or alveolar mucosa tumors. Data were extracted according to the 7^th edition of the AJCC staging system. Ten out of the twelve included studies (n=240) were used for the assessment of the agreement between intraoral ultrasound and postoperative histopathology in a Bland-Altman plot. For one of the included studies the authors estimated the individual patient data from a figure. The QUADAS-2 tool was used to assess the risk of bias of the included studies. Time between preoperative measurement and postoperative histology was not assessed. The tissue fixation method for histopathological analyses was not reported for the individual studies included in the systematic review.

Nair (2018) recruited patients with biopsy-proven T1N0 (n=18) or T2N0 (n=6) primary squamous cell carcinomas of the tongue to assess the agreement between ultrasound and histopathology for tumor thickness. A total of 25 patients were recruited with a median age of 55 years (range: 22 to 76). Sixteen of the recruited patients were male. The 6^th edition of the AJCC staging system was used. Preoperative ultrasound assessments using a 17 or 9 MHz conventional linear probe were performed with patients extending their tongue. The probe was placed directly upon the lesion. The surgical tumor specimens were placed in saline and immediately sent to the pathology department for assessment (specimens were not fixed in formalin), where the specimens were cut into 2 to 3 millimeter thick transverse slices. The time period between the preoperative ultrasound assessment and the histopathological assessment was unclear.

Shintani (2001) assessed the agreement between CT or MRI and histopathology for the measurement of tumor thickness in 38 patients with oral cancer. Furthermore, ultrasound (7.5-MHz intracavitary transducers) was assessed and these data were included in the systematic review of Klein Nulent (2018). The patients had a mean age of 58.2 years (SD: not reported, range: 36 to 91) and had tumors on the tongue (n=26), buccal mucosa (n=8), and floor of mouth (n=4). The 5^th edition of the UICC TNM staging system was used. Tumor thickness was measured with a contrast-enhanced 5-mm axial CT in 38 participants and with 4-mm axial and coronal T1/T2-weighted MR imaging in 26 patients. Histological sections (presumably from formalin-fixed tissue) were assessed with a micrometer. The authors state that tumors smaller than 5 millimeters were difficult to differentiate with CT or MRI. Tumors were not detected by CT in 19 patients and by MRI in 11 patients. These patients were therefore not included in the analyses. Furthermore, the time period between the preoperative assessment and the histopathologic assessment was unclear.

Results

Depth of invasion

Results concerning the instrument agreement for depth of invasion are summarized in Table 1.

Categorical (T-stage)

Clinical examination

Goel (2016) reported the agreement between a clinical examination and histopathology in T-stages in 63 patients. It was unclear what the procedures for clinical examination were. Clinical assessment for T-stage showed an agreement of Cohen’s kappa = 0.47 (95%CI: not reported) with histopathology.

CT

No studies were included that reported the T-stage agreement between CT and histopathology while measuring depth of invasion.

PET-CT

No studies were included that reported the T-stage agreement between PET-CT and histopathology while measuring depth of invasion.

MRI

Goel (2016) reported the agreement between MRI and histopathology in T-stages in 63 patients. A Cohen’s kappa of 0.69 (95%CI: not reported) was found.

Verma (2019) reported the T-stage classifications of MRI and histopathology in a three-by-three table (T1-3) from 50 included patients. A Cohen’s kappa was not reported but could be calculated from the table. Here, a kappa of 0.65 was calculated (95%CI: not calculated).
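As an illustration of how an unweighted Cohen’s kappa can be derived from a square cross-classification table such as the one described above, a minimal sketch follows. The counts in the example are hypothetical and are NOT the table from Verma (2019); the function name is likewise illustrative.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table.

    table[i][j] is the number of patients placed in category i by
    imaging and in category j by histopathology.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of patients on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical T1-T3 cross-classification (rows: MRI, columns: histopathology).
example_table = [
    [15, 3, 0],
    [2, 10, 3],
    [0, 2, 5],
]

kappa = cohens_kappa(example_table)
print(f"kappa = {kappa:.2f}")
```

In this hypothetical table 30 of 40 patients are classified identically (75%), yet kappa is lower because part of that agreement is expected by chance.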

Vidiri (2019) reported the T-stage agreement of an experienced and an inexperienced radiologist interpreting MR imaging with the histopathology results. The experienced radiologist showed a Cohen’s kappa of 0.74 (95%CI: 0.56 to 0.92) for the agreement of T-stage between MRI and histopathology. For the inexperienced radiologist the Cohen’s kappa was 0.60 (95%CI: 0.40 to 0.80).

Ultrasound

No studies were included that reported the T-stage agreement between ultrasound and histopathology while measuring depth of invasion.

Categorical (at a threshold)

Clinical examination

Alsaffar (2016) categorized the depth of invasion, which resulted in two categories: < 5 millimeters depth of invasion and ≥ 5 millimeters depth of invasion. The treating surgeon performed a palpation to assess the depth of invasion. The agreement between the treating surgeon’s preoperative palpation and the postoperative histopathology was quantified with a Cohen’s kappa (n=53). A kappa of 0.61 (95%CI: 0.36 to 0.87) was reported.

CT

No studies were included that reported the agreement at a specified threshold between CT and histopathology while measuring depth of invasion.

PET-CT

No studies were included that reported the agreement at a specified threshold between PET-CT and histopathology while measuring depth of invasion.

MRI

Alsaffar (2016) assessed the agreement between MRI and histopathology for measuring depth of invasion and used two categories: < 5 millimeters and ≥ 5 millimeters. A Cohen’s kappa of 0.80 (95%CI: 0.59 to 1.00) was reported (n=43).

Ultrasound

Iida (2018) assessed the agreement between ultrasound and histopathology for the depth of invasion in 53 participants. Depth of invasion was categorized in: < 5 millimeters and ≥ 5 millimeters. A Cohen’s kappa of 0.65 (95%CI: 0.43 to 0.87) was reported.

Continuous (in millimeters)

Clinical examination

No studies were included that reported the agreement on a continuous scale between clinical examination (palpation) and histopathology while measuring depth of invasion.

CT

No studies were included that reported the agreement on a continuous scale between CT and histopathology while measuring depth of invasion.

PET-CT

No studies were included that reported the agreement on a continuous scale between PET-CT and histopathology while measuring depth of invasion.

MRI

Mao (2019) constructed Bland-Altman plots for the agreement between MRI and histopathology measuring the depth of invasion. Several plots were constructed for different tumor stages and types. Overall, MRI showed a mean overestimation of 2.32 millimeters when compared to the histopathologic results (n=150). In 95% of the measurements, MRI measured between an underestimation of 0.97 millimeters (-0.97 millimeters) and an overestimation of 5.61 millimeters. Furthermore, agreement per tumor stage was assessed: T1 (mean difference: 1.46 millimeters, 95% limits of agreement: -0.67 to 3.63 millimeters, n=43), T2 (mean difference: 2.08 millimeters, 95% limits of agreement: -0.45 to 4.62 millimeters, n=71), and T3 (mean difference: 3.79 millimeters, 95% limits of agreement: -0.13 to 7.7 millimeters, n=36). Finally, agreement per tumor type was assessed: ulcer type (mean difference: 3.72 millimeters, 95% limits of agreement: -0.16 to 7.6 millimeters, n=41), invasive type (mean difference: 1.83 millimeters, 95% limits of agreement: -0.59 to 4.25 millimeters, n=91), and exogenous type (mean difference: 1.53 millimeters, 95% limits of agreement: 0.01 to 3.06 millimeters, n=15).

Vidiri (2019) constructed Bland-Altman plots for the agreement between MRI, read by an experienced or an inexperienced radiologist, and histopathology for the depth of invasion in 43 patients. The MRI measurements by the experienced radiologist had a mean underestimation of 0.3 millimeters (-0.3 millimeters), where 95% of the measurements lay between an underestimation of 5.5 millimeters (-5.5 millimeters) and an overestimation of 4.9 millimeters. For the inexperienced radiologist the MRI measurements had a mean underestimation of 0.4 millimeters (-0.4 millimeters), while 95% of the MRI measurements lay between an underestimation of 6.6 millimeters (-6.6 millimeters) and an overestimation of 5.8 millimeters.

Ultrasound

No studies were included that reported the agreement on a continuous scale between ultrasound and histopathology while measuring depth of invasion.

Table 1 Study results for depth of invasion per measurement level per instrument

Variable	Measurement level	Measurement instrument	Threshold	Author	Result	Risk of Bias (COSMIN, unless stated otherwise)
Depth of invasion	Categorical (T-stage)	Clinical examination (unclear procedures)	T-stage	Goel 2016	Kappa for the agreement of T-stage, n=61 (clinical examination versus pathological data): K=0.47 (95%CI: not reported)	Doubtful
		CT	No studies were included that reported the T-stage agreement between CT and histopathology while measuring depth of invasion.
		PET-CT	No studies were included that reported the T-stage agreement between PET-CT and histopathology while measuring depth of invasion.
		MRI	T-stage	Goel 2016	Kappa for the agreement of T-stage, n=61 (MRI versus pathological data): K=0.69 (95%CI: not reported)	Doubtful
			T-stage	Verma 2019	Kappa for the agreement of tumour thickness (T-stage) as measured by MRI and histopathology was not reported. However, the 3x3 table was reported from which a kappa could be calculated in n= 50: Kappa = 0.65	Doubtful
			T-stage	Vidiri 2019	Kappa for the agreement of T-stage, n=43 (MRI experienced radiologist versus pathological data): K=0.74 (95%CI: 0.56-0.92) Kappa for the agreement of T-stage, n=43 (MRI inexperienced radiologist versus pathological data): K=0.60 (95%CI: 0.40-0.80)	Adequate
		Ultrasound	T-stage	No studies were included that reported the T-stage agreement between ultrasound and histopathology while measuring depth of invasion.
	Categorical (at a threshold)	Clinical examination (palpation)	5 mm	Alsaffar 2016	Kappa at a threshold of 5 millimetres, n=53: K=0.61 (95%CI: 0.36-0.87)	Doubtful
		CT	No studies were included that reported the agreement at a specified threshold between CT and histopathology while measuring depth of invasion.
		PET-CT	No studies were included that reported the agreement at a specified threshold between PET-CT and histopathology while measuring depth of invasion.
		MRI	5 mm	Alsaffar 2016	Kappa at a threshold of 5 millimetres, n=53: K=0.80 (95%CI: 0.59-1.00)	Doubtful
		Ultrasound	5 mm	Iida 2018	Kappa at a threshold of 5 millimetres, n=59: K=0.651 (95%CI: 0.43-0.87)	Doubtful
	Continuous	Clinical examination	No studies were included that reported the agreement on a continuous scale between clinical examination (palpation) and histopathology while measuring depth of invasion.
		CT	No studies were included that reported the agreement on a continuous scale between CT and histopathology while measuring depth of invasion.
		PET-CT	No studies were included that reported the agreement on a continuous scale between PET-CT and histopathology while measuring depth of invasion.
		MRI	NA	Mao 2019	Bland-Altman plot overall n=150 (MRI-histopathology), mm: Mean difference: 2.32. 95% upper limit: 5.61 95% lower limit: -0.97 Bland-Altman plot tumour T1-stage n=43 (MRI-histopathology), mm: Mean difference: 1.48. 95% upper limit: 3.63 95% lower limit: -0.67 Bland-Altman plot tumour T2-stage n=71 (MRI-histopathology), mm: Mean difference: 2.08. 95% upper limit: 4.62 95% lower limit: -0.45 Bland-Altman plot tumour T3-stage n=36 (MRI-histopathology), mm: Mean difference: 3.79. 95% upper limit: 7.70 95% lower limit: -0.13 Bland-Altman plot ulcer type tumour n=41 (MRI-histopathology), mm: Mean difference: 3.72. 95% upper limit: 7.60 95% lower limit: -0.16 Bland-Altman plot invasive type tumour n=91 (MRI-histopathology), mm: Mean difference: 1.83. 95% upper limit: 4.25 95% lower limit: -0.59 Bland-Altman plot exogenous type tumour n=15 (MRI-histopathology), mm: Mean difference: 1.53. 95% upper limit: 3.06 95% lower limit: 0.01	Adequate
		MRI	NA	Vidiri 2019	Bland-Altman plot n=43 (MRI experienced radiologist-histopathology), mm: Mean difference: -0.3 95% upper limit: 4.9 95% lower limit: -5.5 Bland-Altman plot n=43 (MRI inexperienced radiologist-histopathology), mm: Mean difference: -0.4 95% upper limit: 5.8 95% lower limit: -6.6	Adequate
		Ultrasound	No studies were included that reported the agreement on a continuous scale between ultrasound and histopathology while measuring depth of invasion.
NA: Not Applicable

Tumor thickness

Results concerning the instrument agreement for tumor thickness are summarized in Table 2.

Categorical (T-stage)

Clinical examination

Choi (2017) reported a Cohen’s kappa = 0.81 (95%CI not reported) for the agreement of T-stages between a preoperative clinical assessment and a histopathologic assessment in 252 participants. The clinical examination consisted of an endoscopic assessment, a palpation, and either CT or MR imaging.

CT

No studies were included that reported the T-stage agreement between CT and histopathology while measuring tumor thickness.

PET-CT

No studies were included that reported the T-stage agreement between PET-CT and histopathology while measuring tumor thickness.

MRI

No studies were included that reported the T-stage agreement between MRI and histopathology while measuring tumor thickness.

Ultrasound

No studies were included that reported the T-stage agreement between ultrasound and histopathology while measuring tumor thickness.

Categorical (at a threshold)

Clinical examination

No studies were included that reported the agreement at a specified threshold between clinical examination and histopathology while measuring tumor thickness.

CT

No studies were included that reported the agreement at a specified threshold between CT and histopathology while measuring tumor thickness.

PET-CT

No studies were included that reported the agreement at a specified threshold between PET-CT and histopathology while measuring tumor thickness.

MRI

No studies were included that reported the agreement at a specified threshold between MRI and histopathology while measuring tumor thickness.

Ultrasound

No studies were included that reported the agreement at a specified threshold between ultrasound and histopathology while measuring tumor thickness.

Continuous (in millimeters)

Clinical examination

No studies were included that reported the agreement on a continuous scale between clinical examination (palpation) and histopathology while measuring tumor thickness.

CT

Shintani (2001) did not report agreement parameters. However, the agreement between CT and histopathology could be calculated from the reported individual patient data (n=19). CT had a mean overestimation of 5.93 millimeters. When the 95% limits of agreement were calculated, 95% of the CT measurements lay between an underestimation of 5.66 millimeters (-5.66 millimeters) and an overestimation of 17.53 millimeters compared to histopathology.
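A calculation of this kind can be sketched as follows, assuming the conventional Bland-Altman definition of the 95% limits of agreement (mean difference plus or minus 1.96 times the standard deviation of the differences). The source does not state which exact formula was used, and the function name and paired measurements below are hypothetical, not the data from Shintani (2001). Differences are taken as imaging minus histopathology, so a positive value indicates an overestimation by imaging.

```python
from statistics import mean, stdev


def bland_altman(imaging_mm, histopathology_mm):
    """Return (mean difference, lower 95% LoA, upper 95% LoA) in millimeters."""
    # Difference per patient: imaging minus histopathology.
    differences = [i - h for i, h in zip(imaging_mm, histopathology_mm)]
    bias = mean(differences)
    spread = stdev(differences)  # sample standard deviation of the differences
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


# Hypothetical paired measurements (mm), for illustration only.
imaging = [6.0, 8.5, 12.0, 4.0, 9.5]
histopathology = [5.0, 7.0, 11.5, 4.5, 8.0]

bias, lower, upper = bland_altman(imaging, histopathology)
print(f"mean difference: {bias:.2f} mm, 95% LoA: {lower:.2f} to {upper:.2f} mm")
```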

PET-CT

No studies were included that reported the agreement on a continuous scale between PET-CT and histopathology while measuring tumor thickness.

MRI

Brouwer de Koning (2019) constructed a Bland-Altman plot where the mean overestimation of MRI was 1.3 millimeters in 83 patients. Ninety-five percent of the MRI measurements fell between an underestimation of 6.1 millimeters (-6.1 millimeters) and an overestimation of 8.6 millimeters compared to histopathology.

Shintani (2001) did not report agreement parameters. Nonetheless, the agreement between MRI and histopathology could be calculated (n=13). The mean difference was an overestimation of 8.55 millimeters by MRI. When the 95% limits of agreement were calculated, 95% of the MRI measurements lay between an underestimation of 5.94 millimeters (-5.94 millimeters) and an overestimation of 23.05 millimeters compared to histopathology.

Ultrasound

Brouwer de Koning (2019) reported a mean overestimation of 0.05 millimeters by ultrasound when compared to histopathological results in 83 patients. The ultrasound measurements were in 95% of the cases between an underestimation of 5.3 millimeters (-5.3 millimeters) and an overestimation of 5.4 millimeters when compared to histopathologic results.

Klein Nulent (2018) performed a systematic review and used individual patient data from 240 patients to construct a Bland-Altman plot. Ultrasound had a mean overestimation of 0.5 millimeters compared to histopathology. In 95% of the measurements the ultrasound resulted in measurements between -5.5 millimeters (5.5 millimeters underestimation) and 6.5 millimeters (6.5 millimeters overestimation) when compared to histopathology results.

Nair (2018) recruited 24 patients to assess the agreement between ultrasound and histopathology for measuring tumor thickness. A Bland-Altman plot showed that ultrasound underestimated the tumor thickness by a mean of 0.15 millimeters (-0.15 millimeters) compared to histopathology. The limits of agreement were not reported but could be approximated from the reported figure. Here, 95% of the ultrasound measurements were between an underestimation of 4.6 millimeters (-4.6 millimeters) and an overestimation of 4.99 millimeters compared to histopathologic results.

Table 2 Study results for tumor thickness per measurement level per instrument

Variable	Measurement level	Measurement instrument	Threshold	Author	Result	COSMIN Risk of Bias
Tumor thickness	Categorical (T-stage)	Clinical examination (endoscopic + palpation + CT or MRI)	T-stage	Choi 2017	Kappa for the agreement of T-stage n=252 (clinical examination versus pathological data): K=0.81 (95%CI: not reported)	Doubtful
		CT	No studies were included that reported the T-stage agreement between CT and histopathology while measuring tumor thickness.
		PET-CT	No studies were included that reported the T-stage agreement between PET-CT and histopathology while measuring tumor thickness.
		MRI	No studies were included that reported the T-stage agreement between MRI and histopathology while measuring tumor thickness.
		Ultrasound	No studies were included that reported the T-stage agreement between ultrasound and histopathology while measuring tumor thickness.
	Categorical (at a threshold)	Clinical examination	No studies were included that reported the agreement at a specified threshold between clinical examination and histopathology while measuring tumor thickness.
		CT	No studies were included that reported the agreement at a specified threshold between CT and histopathology while measuring tumor thickness.
		PET-CT	No studies were included that reported the agreement at a specified threshold between PET-CT and histopathology while measuring tumor thickness.
		MRI	No studies were included that reported the agreement at a specified threshold between MRI and histopathology while measuring tumor thickness.
		Ultrasound	No studies were included that reported the agreement at a specified threshold between ultrasound and histopathology while measuring tumor thickness.
	Continuous	Clinical examination	No studies were included that reported the agreement on a continuous scale between clinical examination (palpation) and histopathology while measuring tumor thickness.
		CT	NA	Shintani 2001	Bland-Altman parameters calculated from presented data n=19 (CT-histopathology), mm: Mean difference: 5.93. 95% upper limit: 17.53 95% lower limit: -5.66	Doubtful
		PET-CT	NA	No studies were included that reported the agreement on a continuous scale between PET-CT and histopathology while measuring tumor thickness.
		MRI	NA	Brouwer de Koning 2019	Bland-Altman plot n=83 (MRI-histopathology), mm: Mean difference: 1.3. 95% upper limit: 8.6 95% lower limit: -6.1	Doubtful
		MRI	NA	Shintani 2001	Bland-Altman parameters calculated from presented data n=13 (MRI-histopathology), mm: Mean difference: 8.55. 95% upper limit: 23.05 95% lower limit: -5.94	Doubtful
		Ultrasound	NA	Brouwer de Koning 2019	Bland-Altman plot n=83 (US-histopathology), mm: Mean difference: 0.05. 95% upper limit: 5.4 95% lower limit: -5.3	Doubtful
			NA	Klein Nulent 2018	Bland-Altman plot n=240 (ultrasound-histopathology), mm: Mean difference: 0.5. 95% upper limit: 6.5 95% lower limit: -5.5	Klein Nulent 2018 assessed the risk of bias with the QUADAS-2 tool. For flow and timing: 4 low risk / 1 high risk / 7 unclear
			NA	Nair 2018	Bland-Altman plot overall n=150 (US-histopathology), mm: Mean difference: -0.15 95% upper limit: 4.99 95% lower limit: -4.6 (limits of agreement were approximated from the provided Bland-Altman plot)	Doubtful
NA: Not applicable

Level of evidence of the literature

Depth of invasion

Categorical (T-stage)

Clinical examination

The level of evidence regarding clinical examination for the outcome measure ‘categorical agreement (T-stage)’ was downgraded by 3 levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality), and the number of included patients (1 level for imprecision: sample size was less than 100, but more than 50); publication bias was not assessed.

CT

GRADE could not be applied because none of the included studies reported data about the categorical agreement on T-stage between CT and histopathology when measuring depth of invasion.

PET-CT

GRADE could not be applied because none of the included studies reported data about the categorical agreement on T-stage between PET-CT and histopathology when measuring depth of invasion.

MRI

The level of evidence regarding MRI for the outcome measure ‘categorical agreement (T-stage)’ was downgraded by 1 level because of study limitations (1 level for risk of bias: multiple studies of doubtful quality and one study of adequate quality); publication bias was not assessed.

Ultrasound

GRADE could not be applied because none of the included studies reported data about the categorical agreement on T-stage between ultrasound and histopathology when measuring depth of invasion.

Categorical (at a threshold)

Clinical examination

The level of evidence regarding clinical examination (palpation) for the outcome measure ‘categorical agreement (at a threshold)’ was downgraded by 3 levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality) and the number of included patients (1 level for imprecision: sample size was less than 100, but more than 50); publication bias was not assessed.

CT

GRADE could not be applied because none of the included studies reported data about the categorical agreement for depth of invasion at a threshold between CT and histopathology.

PET-CT

GRADE could not be applied because none of the included studies reported data about the categorical agreement for depth of invasion at a threshold between PET-CT and histopathology.

MRI

The level of evidence regarding MRI for the outcome measure ‘categorical agreement (at a threshold)’ was downgraded by 3 levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality) and the number of included patients (1 level for imprecision: sample size was less than 100, but more than 50); publication bias was not assessed.

Ultrasound

The level of evidence regarding ultrasound for the outcome measure ‘categorical agreement (at a threshold)’ was downgraded by 3 levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality) and the number of included patients (1 level for imprecision: sample size was less than 100, but more than 50); publication bias was not assessed.

Continuous (in millimeters)

Clinical examination

GRADE could not be applied because none of the included studies reported data about the agreement on a continuous scale for depth of invasion between a clinical examination and histopathology.

CT

GRADE could not be applied because none of the included studies reported data about the agreement on a continuous scale for depth of invasion between CT and histopathology.

PET-CT

GRADE could not be applied because none of the included studies reported data about the agreement on a continuous scale for depth of invasion between PET-CT and histopathology.

MRI

The level of evidence regarding MRI for the outcome measure ‘continuous agreement (in millimeters)’ was downgraded by 1 level because of conflicting results (1 level for inconsistency). Mao (2019) reported a mean overestimation of 2.32 millimeters, while Vidiri (2019) reported a mean underestimation of 0.3 millimeters. Furthermore, the lower 95% limit of agreement in Vidiri (2019) lay further from zero than in Mao (2019): an underestimation of 5.5 millimeters versus an underestimation of 0.97 millimeters. Publication bias was not assessed.

Ultrasound

GRADE could not be applied because none of the included studies reported data about the agreement on a continuous scale for depth of invasion between ultrasound and histopathology.

Tumor thickness

Categorical (T-stage)

Clinical examination

The level of evidence regarding a clinical examination (consisting of an endoscopic examination, palpation, and either CT or MR imaging) for the outcome measure ‘categorical agreement (T-stage)’ was downgraded by two levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality); publication bias was not assessed.

CT

GRADE was not applied because none of the included studies reported data about the categorical agreement on T-stage between CT and histopathology when measuring tumor thickness.

PET-CT

GRADE was not applied because none of the included studies reported data about the categorical agreement on T-stage between PET-CT and histopathology when measuring tumor thickness.

MRI

GRADE was not applied because none of the included studies reported data about the categorical agreement on T-stage between MRI and histopathology when measuring tumor thickness.

Ultrasound

GRADE was not applied because none of the included studies reported data about the categorical agreement on T-stage between ultrasound and histopathology when measuring tumor thickness.

Categorical (at a threshold)

Clinical examination

GRADE was not applied because none of the included studies reported data about the categorical agreement for tumor thickness at a threshold between a clinical examination and histopathology.

CT

GRADE was not applied because none of the included studies reported data about the categorical agreement for tumor thickness at a threshold between CT and histopathology.

PET-CT

GRADE was not applied because none of the included studies reported data about the categorical agreement for tumor thickness at a threshold between PET-CT and histopathology.

MRI

GRADE was not applied because none of the included studies reported data about the categorical agreement for tumor thickness at a threshold between MRI and histopathology.

Ultrasound

GRADE was not applied because none of the included studies reported data about the categorical agreement for tumor thickness at a threshold between ultrasound and histopathology.

Continuous (in millimeters)

Clinical examination

GRADE was not applied because none of the included studies reported data about the agreement on a continuous scale for tumor thickness between a clinical examination and histopathology.

CT

The level of evidence regarding CT for the outcome measure ‘agreement on a continuous measurement level (in millimeters)’ was downgraded by 4 levels because of study limitations (2 levels for risk of bias: there is only one study of doubtful quality) and the number of included patients (2 levels for imprecision: the sample size was less than 50); publication bias was not assessed.

PET-CT

GRADE was not applied because none of the included studies reported data about the agreement on a continuous scale for tumor thickness between PET-CT and histopathology.

MRI

The level of evidence regarding MRI for the outcome measure ‘agreement on a continuous measurement level (in millimeters)’ was downgraded by 2 levels because of study limitations (1 level for risk of bias: there were multiple studies of doubtful quality) and the number of included patients (1 level for imprecision: the sample size was less than 100, but more than 50); publication bias was not assessed.

Ultrasound

The level of evidence regarding ultrasound for the outcome measure ‘agreement on a continuous measurement level (in millimeters)’ was downgraded by 1 level because of study limitations (1 level for risk of bias: there were multiple studies of doubtful quality. Klein Nulent 2018 assessed the risk of bias with the QUADAS-2 and scored 4 studies with low risk / 1 study with high risk / 7 studies with unclear risk on the ‘flow and timing’ item); publication bias was not assessed.

Conclusions

Depth of invasion

The agreement estimates of modalities measuring depth of invasion and their certainty (following GRADE) are summarized in Table 3.

Table 3 Summarized results for the agreement and GRADE certainty of clinical examination, CT, PET-CT, MRI, or intraoral ultrasound measuring depth of invasion

Modality	Agreement on a categorical level per T-Stage (GRADE certainty)	Agreement on a categorical level using a threshold (GRADE certainty)	Agreement on a continuous level (GRADE certainty)
Clinical examination	K = 0.47 (unclear procedures) (VERY LOW)	K = 0.61 (95%CI: 0.36-0.87) at a 5-millimeter threshold (palpation) (VERY LOW)	NA
Clinical examination	References: Goel 2016	References: Alsaffar 2016	NA
CT	NA	NA	NA
PET-CT	NA	NA	NA
MRI	Range: K = 0.60-0.74 (MODERATE)	K = 0.80 (95%CI: 0.59-1.00) at a 5-millimeter threshold (VERY LOW)	Range upper 95% LoA: 4.9-5.8* Range mean difference: -0.4–2.32* Range lower 95% LoA: -0.97– -6.6* (MODERATE)
MRI	References: Goel 2016; Verma 2019; Vidiri 2019	References: Alsaffar 2016	References: Mao 2019; Vidiri 2019
Ultrasound	NA	K = 0.65 (95%CI: 0.43-0.87) at a 5-millimeter threshold (VERY LOW)	NA
Ultrasound	NA	References: Iida 2018	NA
* Sub-analyses in Mao 2019 were not included in the range. CI: Confidence Interval; LoA: Limit of Agreement; NA: Not Available

Tumor thickness

The agreement estimates of modalities measuring tumor thickness and their certainty (following GRADE) are summarized in Table 4.

Table 4 Summarized results for the agreement and GRADE certainty of clinical examination, CT, PET-CT, MRI, or intraoral ultrasound measuring tumor thickness

Modality	Agreement on a categorical level per T-Stage (GRADE certainty)	Agreement on a categorical level using a threshold (GRADE certainty)	Agreement on a continuous level (GRADE certainty)
Clinical examination	K = 0.81 (endoscopic examination, palpation, and either CT or MR imaging) (LOW)	NA	NA
Clinical examination	References: Choi 2017	NA	NA
CT	NA	NA	Upper 95% LoA: 17.53 Mean difference: 5.93 Lower 95% LoA: -5.66 (VERY LOW)
CT	NA	NA	References: Shintani 2001
PET-CT	NA	NA	NA
MRI	NA	NA	Range upper 95% LoA: 8.6-23.05 Range mean difference: 1.3-8.55 Range lower 95% LoA: -5.94– -6.1 (LOW)
MRI	NA	NA	References: Brouwer de Koning 2019; Shintani 2001
Ultrasound	NA	NA	Range upper 95% LoA: 4.99-6.5 Range mean difference: -0.15-0.5 Range lower 95% LoA: -4.6– -5.5 (MODERATE)
Ultrasound	NA	NA	References: Brouwer de Koning 2019; Klein Nulent 2018; Nair 2018
CI: Confidence Interval; LoA: Limit of Agreement; NA: Not Available

Search and select

A systematic review of the literature was performed to answer the following question:

What is the agreement between preoperative clinical examination (by palpation), computed tomography (CT), positron emission tomography/computed tomography (PET-CT), magnetic resonance imaging (MRI) or intraoral ultrasound, and postoperative histopathologic results for measuring the depth of the invasion (or tumor thickness) by a tumor in patients with an oral cavity carcinoma?

P: patients with an oral cavity carcinoma;

I: preoperative determination of the depth of invasion (or tumor thickness) with palpation, CT, PET-CT, MRI, or intraoral ultrasound;

C: comparisons between palpation, CT, PET-CT, MRI, or intraoral ultrasound with postoperative pathological assessment as a reference standard;

O: agreement parameters on a continuous (depth in millimeters) or categorical (at a threshold, or for T-stage) measurement level.

Relevant outcome measures

The guideline development group considered agreement parameters regarding the final T-staging of the tumor and agreement on a continuous measurement level as critical outcome measures for decision making.

A priori, the working group did not define the outcome measures listed above but used the definitions used in the studies.

The working group defined an underestimation or overestimation of more than 2 millimeters compared to the postoperative pathological assessment as a clinically important disagreement. This is acknowledged to be an arbitrary choice, since evidence regarding the clinical importance of the 2-millimeter border is lacking. A Cohen’s kappa (K) was considered sufficient when it was greater than or equal to 0.70 (Terwee, 2007; Prinssen, 2016).
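The two decision thresholds described above can be written down as a minimal sketch. The constant and function names are hypothetical and chosen for illustration only; the values (K of at least 0.70, a difference of more than 2 millimeters) are taken from the definitions above.

```python
KAPPA_SUFFICIENT = 0.70        # kappa at or above this value counts as sufficient
MAX_DIFFERENCE_MM = 2.0        # differences larger than this are clinically important


def kappa_is_sufficient(kappa):
    """True when the agreement meets the predefined kappa threshold."""
    return kappa >= KAPPA_SUFFICIENT


def is_clinically_important_disagreement(preoperative_mm, histopathology_mm):
    """True when the preoperative measurement deviates by more than 2 mm,
    in either direction, from the postoperative histopathology."""
    return abs(preoperative_mm - histopathology_mm) > MAX_DIFFERENCE_MM


print(kappa_is_sufficient(0.74))                       # experienced radiologist, Vidiri (2019)
print(kappa_is_sufficient(0.65))                       # kappa calculated for Verma (2019)
print(is_clinically_important_disagreement(7.5, 5.0))  # hypothetical measurements
```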

The working group defined a time of 4 weeks or less between the preoperative assessment and the surgical resection as adequate. This is acknowledged to be an arbitrary interval; however, it was presumed that this period would usually not allow a change in the construct to be measured.

Search and select (Methods)

The databases Medline (via OVID) and Embase (via Embase.com) were searched with relevant search terms until 12 November 2019 for systematic reviews and primary diagnostic studies. The detailed search strategy is depicted under the tab Methods. The systematic literature search resulted in 311 hits. Studies were selected based on the following criteria: patients had an oral cavity carcinoma; the agreement between a preoperative assessment of the depth of invasion or tumor thickness (with palpation (clinical examination), CT, PET-CT, MRI, or intraoral ultrasound) and a postoperative pathological assessment was reported; and the reported parameters concerned absolute agreement or allowed these to be calculated. Initially, 35 studies were selected after the screening of title and abstract. The working group checked the methods of the full-text studies to determine whether ‘depth of invasion’ or ‘tumor thickness’ was measured. After reading the full text, 24 studies were excluded (see the table with reasons for exclusion under the tab Evidence tables). Ten primary studies and one systematic review were included.

Results

Six primary studies were included in the analyses of literature for depth of invasion. One systematic review (10 studies provided information) and four primary studies were included for tumor thickness. Important study characteristics and results were extracted in the evidence tables. Results are summarized in Table 1 (depth of invasion) and Table 2 (tumor thickness) under the ‘summary of literature’. The assessment of the risk of bias is summarized in the risk of bias tables (under the tab Evidence tables).

Risk of bias was assessed with the Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) risk of bias checklist (Mokkink, 2010). The boxes concerning reliability and measurement error were used for the risk of bias assessment, since the clinical question concerned inter-instrument reliability and/or inter-instrument agreement. The conclusive risk of bias outcome is the lowest score on the COSMIN 4-point risk of bias tool (i.e. the lowest-score-counts principle). The study design and procedures were assessed for each instrument separately. For example, when a study assessed both CT and MRI measurements (separately) versus histopathological measurements, the design and procedures of CT versus histopathology and of MRI versus histopathology were assessed individually for potential risk of bias. A preoperative assessment with an interval of 4 weeks or shorter before surgery was deemed appropriate.

The adapted GRADE assessment was conducted in accordance with the described procedure by Mokkink (2018). The adapted GRADE procedure entailed that three levels could be downgraded in the risk of bias domain: one level for a serious risk (multiple studies of doubtful quality or one study of adequate quality), two levels for a very serious risk (multiple studies of inadequate quality or one study of doubtful quality), or three levels for an extremely serious risk (only one study of inadequate quality). The inconsistency domain could be downgraded by one or two levels when there was unexplained heterogeneity between the reported outcomes. A maximum of two levels could be downgraded for imprecision: one level (body of evidence contains n=50 to n=100), or two levels (body of evidence contains less than n=50). When the included study did not completely match the PICO as defined in this guideline module, one or two levels could be downgraded for indirectness. Publication bias is not assessed in the adjusted GRADE procedure.
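The downgrading rules described above can be sketched as a small calculation. This is an illustrative sketch and not the working group's tool: the function names are hypothetical, and it assumes that the certainty starts at HIGH and cannot fall below VERY LOW (an assumption that is consistent with the certainty ratings in Tables 3 and 4). Treating a body of evidence of exactly n=50 or n=100 as a one-level downgrade is also an assumption about the boundary cases.

```python
GRADE_LEVELS = ["HIGH", "MODERATE", "LOW", "VERY LOW"]

# Maximum number of levels that can be downgraded per domain, as described above.
MAX_DOWNGRADE = {
    "risk_of_bias": 3,
    "inconsistency": 2,
    "imprecision": 2,
    "indirectness": 2,
}


def imprecision_downgrade(total_n):
    """Levels to downgrade for imprecision, given the total sample size."""
    if total_n < 50:
        return 2
    if total_n <= 100:  # assumption: the boundaries 50 and 100 are inclusive
        return 1
    return 0


def grade_certainty(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Return the certainty label after applying the downgrades per domain."""
    downgrades = {
        "risk_of_bias": risk_of_bias,
        "inconsistency": inconsistency,
        "imprecision": imprecision,
        "indirectness": indirectness,
    }
    for domain, levels in downgrades.items():
        if not 0 <= levels <= MAX_DOWNGRADE[domain]:
            raise ValueError(f"{domain}: {levels} levels is outside the allowed range")
    total = sum(downgrades.values())
    # Assumption: start at HIGH and never go below VERY LOW.
    return GRADE_LEVELS[min(total, len(GRADE_LEVELS) - 1)]


# One study of doubtful quality (2 levels) with n=61 (1 level for imprecision).
print(grade_certainty(risk_of_bias=2, imprecision=imprecision_downgrade(61)))
```

For example, a single downgrade for risk of bias without further downgrades yields MODERATE in this sketch, which matches the rating for MRI on T-stage agreement in Table 3.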

References

Alsaffar HA, Goldstein DP, King EV, de Almeida JR, Brown DH, Gilbert RW, Gullane PJ, Espin-Garcia O, Xu W, Irish JC. Correlation between clinical and MRI assessment of depth of invasion in oral tongue squamous cell carcinoma. J Otolaryngol Head Neck Surg. 2016 Nov 22;45(1):61. PubMed PMID: 27876067; PubMed Central PMCID: PMC5120480.
Baba A, Masuda K, Hashimoto K, Matsushima S, Yamauchi H, Ikeda K, Yamazaki M, Suzuki T, Ogane S, Kurokawa R, Kurokawa M, Ota Y, Mogami T, Nomura T, Ojiri H. Correlation between the magnetic resonance imaging features of squamous cell carcinoma of the buccal mucosa and pathologic depth of invasion. Oral Surg Oral Med Oral Pathol Oral Radiol. 2021 Jan 8:S2212-4403(21)00002-X. doi: 10.1016/j.oooo.2020.12.023. Epub ahead of print. PMID: 33516643.
Brouwer de Koning SG, Karakullukcu MB, Lange CAH, Ruers TJM. The oral cavity tumor thickness: Measurement accuracy and consequences for tumor staging. Eur J Surg Oncol. 2019 Nov;45(11):2131-2136. doi: 10.1016/j.ejso.2019.06.005. Epub 2019 Jun 4. PubMed PMID: 31227341.
Chin SY, Kadir K, Ibrahim N, Rahmat K. Correlation and accuracy of contrast-enhanced computed tomography in assessing depth of invasion of oral tongue carcinoma. Int J Oral Maxillofac Surg. 2020 Nov 5:S0901-5027(20)30377-5. doi: 10.1016/j.ijom.2020.09.025. Epub ahead of print. PMID: 33162298.
Choi N, Noh Y, Lee EK, Chung M, Baek CH, Baek KH, Jeong HS. Discrepancy between cTNM and pTNM staging of oral cavity cancers and its prognostic significance. J Surg Oncol. 2017 Jun;115(8):1011-1018. doi: 10.1002/jso.24606. Epub 2017 Mar 23. PubMed PMID: 28334428.
Cocker H, Francies O, Adams A, Sassoon I, Schilling C. Do we have a robust method for preoperative tumour depth assessment for oral cavity tumours with clinically negative necks? Int J Oral Maxillofac Surg. 2020 Dec 25:S0901-5027(20)30416-1. doi: 10.1016/j.ijom.2020.11.002. Epub ahead of print. PMID: 33358587.
Filauro M, Missale F, Marchi F, Iandelli A, Carobbio ALC, Mazzola F, Parrinello G, Barabino E, Cittadini G, Farina D, Piazza C, Peretti G. Intraoral ultrasonography in the assessment of DOI in oral cavity squamous cell carcinoma: a comparison with magnetic resonance and histopathology. Eur Arch Otorhinolaryngol. 2020 Oct 21. doi: 10.1007/s00405-020-06421-w. Epub ahead of print. PMID: 33084951.
Goel V, Parihar PS, Parihar A, Goel AK, Waghwani K, Gupta R, Bhutekar U. Accuracy of MRI in Prediction of Tumour Thickness and Nodal Stage in Oral Tongue and Gingivobuccal Cancer With Clinical Correlation and Staging. J Clin Diagn Res. 2016 Jun;10(6):TC01-5. doi: 10.7860/JCDR/2016/17411.7905. Epub 2016 Jun 1. PubMed PMID: 27504375; PubMed Central PMCID: PMC4963735.
Iida Y, Kamijo T, Kusafuka K, Omae K, Nishiya Y, Hamaguchi N, Morita K, Onitsuka T. Depth of invasion in superficial oral tongue carcinoma quantified using intraoral ultrasonography. Laryngoscope. 2018 Dec;128(12):2778-2782. doi: 10.1002/lary.27305. Epub 2018 Oct 16. PubMed PMID: 30325049.
Klein Nulent TJW, Noorlag R, Van Cann EM, Pameijer FA, Willems SM, Yesuratnam A, Rosenberg AJWP, de Bree R, van Es RJJ. Intraoral ultrasonography to measure tumor thickness of oral cancer: A systematic review and meta-analysis. Oral Oncol. 2018 Feb;77:29-36. doi: 10.1016/j.oraloncology.2017.12.007. Epub 2017 Dec 18. PubMed PMID: 29362123.
Mao MH, Wang S, Feng ZE, Li JZ, Li H, Qin LZ, Han ZX. Accuracy of magnetic resonance imaging in evaluating the depth of invasion of tongue cancer. A prospective cohort study. Oral Oncol. 2019 Apr;91:79-84. doi: 10.1016/j.oraloncology.2019.01.021. Epub 2019 Mar 4. PubMed PMID: 30926067.
Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, … de Vet HC. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539-49.
Mokkink LB, Prinsen CA, Patrick DL, Alonso J, Bouter LM, de Vet HC, Terwee CB. COSMIN methodology for systematic reviews of patient-reported outcome measures (PROMs). User manual. 2018;78:1. Available from: https://www.cosmin.nl/wp-content/uploads/COSMIN-syst-review-for-PROMs-manual_version-1_feb-2018-1.pdf.
Nair AV, Meera M, Rajamma BM, Anirudh S, Nazer PK, Ramachandran PV. Preoperative ultrasonography for tumor thickness evaluation in guiding management in patients with early oral tongue squamous cell carcinoma. Indian J Radiol Imaging. 2018 Apr-Jun;28(2):140-145. doi: 10.4103/ijri.IJRI_151_17. PubMed PMID: 30050234; PubMed Central PMCID: PMC6038222.
Noorlag R, Klein Nulent TJW, Delwel VEJ, Pameijer FA, Willems SM, de Bree R, van Es RJJ. Assessment of tumour depth in early tongue cancer: Accuracy of MRI and intraoral ultrasound. Oral Oncol. 2020 Jul 9;110:104895. doi: 10.1016/j.oraloncology.2020.104895. Epub ahead of print. PMID: 32653839.
Prinsen CA, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, et al. How to select outcome measurement instruments for outcomes included in a "Core Outcome Set" – a practical guideline. Trials. 2016;17(1):449.
Shintani S, Yoshihama Y, Ueyama Y, Terakado N, Kamei S, Fijimoto Y, Hasegawa Y, Matsuura H, Matsumura T. The usefulness of intraoral ultrasonography in the evaluation of oral cancer. Int J Oral Maxillofac Surg. 2001 Apr;30(2):139-43. PubMed PMID: 11405449.
Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34-42.
Verma A, Singhal A, Hadi R, Singh P, Raj G. Evaluation of tumor thickness in three dimensions on magnetic resonance imaging and its comparison with final histopathology in squamous cell carcinoma of the tongue. Clinical Cancer Investigation Journal. 2019;8(4):161.
Vidiri A, Panfili M, Boellis A, Cristalli G, Gangemi E, Pellini R, Marzi S, Covello R. The role of MRI-derived depth of invasion in staging oral tongue squamous cell carcinoma: inter-reader and radiological-pathological agreement. Acta Radiol. 2019 Jul 18:284185119862946. doi: 10.1177/0284185119862946. (Epub ahead of print) PubMed PMID: 31319692.
Waech T, Pazahr S, Guarda V, Rupp NJ, Broglie MA, Morand GB. Measurement variations of MRI and CT in the assessment of tumor depth of invasion in oral cancer: A retrospective study. Eur J Radiol. 2021 Feb;135:109480. doi: 10.1016/j.ejrad.2020.109480. Epub 2020 Dec 15. PMID: 33370639.

Evidence tables

Study reference

Study characteristics

Patient characteristics

Measurement properties and procedures

Follow-up

Interpretability of results

Outcome measures and effect size ⁴

Comments

Klein Nulent 2018 (systematic review)

Instruments assessed:

Ultrasound (intraoral) versus histopathology for tumor thickness

Included studies:

A: Joshi 2014

B: Yesuratnam 2014

C: Chammas 2011

D: Lodder 2011

E: Kodama 2010

F: Mark Taylor 2010

G: Kaneoya 2009

H: Baek 2008

I: Yamane 2007

J: Songra 2006

K: Helbig 2005

L: Shintani 2001

Setting and country:

A: India

B: Australia

C: Brazil

D: Netherlands

E: Japan

F: Canada

G: Japan

H: Korea

I: Japan

J: UK

K: Germany

L: Japan

Funding and conflicts of interest:

SR authors declare that they have no conflicts of interest and that no specific grants were received. CoI and funding are not reported for the included studies.

Inclusion criteria:

Patients with OSCC, pre- or intraoperative measurement of tumor thickness or margin assessment by ultrasound, measurements compared to histopathological tumor thickness or margin width.

Exclusion criteria:

Duplicates, reviews / book chapters/ case reports / editorials / oral presentations / notes / poster presentations, analyses of head and neck SCC without subgroup analysis for oral SCC, languages other than English or German.

Search date:

6 July 2016

Search sources:

PubMed, Embase, Cochrane databases

Sample characteristics¹:

Sample size (in Bland-Altman plot), n:

A: 7

B: 66

C: 19

D: 33

E: 13

F: 21

G: -

H: 20

I: -

J: 14

K: 9

L: 38

Tumor site:

A: tongue and buccal mucosa

B: tongue

C: tongue

D: tongue and FOM

E: tongue

F: tongue and FOM

G: tongue

H: tongue

I: tongue

J: tongue, FOM, lip, alveolar mucosa

K: tongue

L: tongue, FOM, buccal mucosa

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error (inter-instrument)

Studies included in Bland-Altman plot:

A: Yes

B: Yes

C: Yes

D: Yes

E: Yes

F: Yes

G: No

H: Yes

I: No

J: Yes

K: Yes

L: Yes

Timing of US:

A: preoperative

B: preoperative

C: preoperative

D: preoperative

E: intraoperative

F: preoperative

G: unclear

H: intraoperative

I: unclear

J: intraoperative

K: intraoperative

L: unclear

Device and probe:

A: RIC5 GE Voluson (9-5MHz)

B: Philips iU22 (7-15MHz)

C: GE Med. Systems Logiq 500 (5-10MHz)

D: Philips iU22 (5-7MHz, 7-15MHz)

E: Aloka SSD-1200CV (7.5MHz)

F: unclear (10-12MHz)

G: Toshiba PLM-1202S (12MHz)

H: Aloka UST-9120 (8-10MHz)

I: Aloka SSD-630 (10MHz)

J: AT HDI-5000 (5-10MHz)

K: Diasonics VERSUST (8-12MHz)

L: Toshiba PEF-704LA (7.5MHz), Aloka UST-995 (7.5MHz), Aloka UST-5536 (7.5MHz)

Preoperative or intraoperative ultrasound was compared to a histopathological assessment.

Incomplete outcome data:

Number of participants for whom no individual data were available, n:

A: all data available

B: 22

C: all data available

D: 32

E: all data available

F: all data available

G: not included in Bland-Altman plot

H: all data available

I: not included in Bland-Altman plot

J: all data available

K: all data available

L: 1

How were missing data handled?

Excluded from Bland-Altman plot.

Length of follow-up (if applicable):

Not reported in the SR, somewhat deducible from ‘Flow and timing’ in the risk of bias assessment; 9/16 unclear, 1/16 high risk, 6/16 low risk.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described? (yes/no):

Not described

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Bland-Altman plot (US-histopathology), mm:

Mean difference: 0.5.

95% upper limit: 6.5

95% lower limit: -5.5

Authors used the QUADAS-2 tool for the risk of bias assessment.

Individual patient data from Taylor 2010 was estimated from a figure by the SR authors.

TNM edition: TNM7 (data was extracted according to AJCC7)

Alsaffar 2016

Instruments assessed:

Clinical assessment versus histopathology and MRI assessment versus histopathology for depth of invasion (categorized)

Setting and country: Cancer center, Canada

Sampling method:

Patients referred to the cancer center were recruited.

Funding and conflicts of interest: Authors had no CoI or funding

Inclusion criteria:

Newly diagnosed oral SCC

Exclusion criteria:

Referred with an MRI that was reviewed by the surgeon prior to clinical exam and enrolment in the study, CT imaging only, carcinoma in situ or previous excisional biopsy, previous head and neck (chemo)radiation.

N total at baseline:

N= 53

Sample characteristics¹:

Mean age ± SD (or median age (range)):

Sex (male/female):

G: 34/19

T-stage, n:

T1: 22

T2: 22

T3: 7

T4: 2

N-stage, n:

N0: 32

N1: 7

N2: 11

N3: 0

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

Clinical examination consisted of the treating surgeon performing a palpation prior to radiographic evaluation of the tumor depth.

For MRI and pathologic assessment, the depth of invasion was measured from the adjacent mucosa to the deepest tumor aspect.

Invasion depth was dichotomized into 2 categories: <5 mm invasion and ≥5 mm invasion.

Incomplete outcome data:

From sample (and subgroups if applicable)

How were missing data handled?

Length of follow-up (if applicable):

Unclear time between preoperative assessment and pathological assessment.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described?

Pathological depth <5: 10

Pathological depth ≥5: 43

Radiological depth <5: 9

Radiological depth ≥5: 40

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Kappa at a cut-off point of 5mm invasion depth for MRI:

K=0.80 (95%CI: 0.59-1.00)

Kappa at a cut-off point of 5mm invasion depth for clinical assessment:

K=0.61 (95%CI: 0.36-0.87)

Unclear time between preoperative assessment and pathological assessment.

TNM edition: unclear

Brouwer de Koning, 2019

Instruments assessed:

Ultrasound versus histopathology and MRI versus histopathology for tumor thickness

Setting and country: Hospital, Netherlands

Sampling method:

Database between 2011 and 2016

Funding and conflicts of interest: Authors declare that they have no CoI. Funding not reported.

Inclusion criteria:

Oral cavity cancer that was clinically staged as T1 or 2, US or MR images were acquired between 2011-2016.

Exclusion criteria:

None reported.

N total at baseline:

N=142 treated, n=83 included

Sample characteristics¹:

Mean age ± SD (or median age (range)):

61 (range: 31-88)

Sex (male/female):

G: 45M/38F

Disease site, n:

Tongue: 58

FOM: 24

Palate: 2

Lip: 1

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

Tumor thickness was measured in n=76 with US and in n=46 with MRI.

Preoperative MRI and US were performed (US after excision was excluded).

For US, a Hitachi EUB-900 was used, with the EUP-054 transducer at 13-7MHz. The probe was placed directly on the lesion.

MRI was performed with a Philips Achieva 3T (dedicated 16-channel SENSE neurovascular coil). Sequences used: T1W TSE TRA (TR/TE 538/10 ms, flip angle 90, matrix 288/248, slice thickness 4 mm); STIR TSE COR (TR/TE 2500/60 ms, matrix 216/170); and T1 3D THRIVE with fat saturation and intravenous injection of 15 cc gadoterate meglumine (TR/TE 9.86/4.59 ms, flip angle 10, matrix 200/179, slice thickness 1 mm). The tumor dimensions were measured in 3D and reported by the radiologist (suggesting a T-stage). The radiologist's reports were used for the study.

Tumor dimensions were also reported by the pathologist in the pathological report. Further procedures on the histopathological measurement were not reported.

Incomplete outcome data:

From sample (and subgroups if applicable)

n=32 were excluded because tumor dimensions were reported by only 1 modality; n=11 because no histopathology data were available; n=5 because US was acquired after excision; n=1 because histopathology showed scar tissue; and n=10 because the tumor was not assessable with US (leaving n=83 for analysis).

How were missing data handled?

Patients were excluded from analysis.

Length of follow-up (if applicable):

Unclear time between preoperative assessment and histopathology.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described?

US tumor thickness, mean (SD) in mm: 5.1 (3.1)

Histopathology tumor thickness, mean (SD) in mm: 5.1 (3.5)

MR tumor thickness, mean (SD) in mm: 7.4 (3.5)

Histopathology tumor thickness, mean (SD) in mm: 6.1 (3.2)

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Bland-Altman plot (US-histopathology), mm:

Mean difference: 0.05.

95% upper limit: 5.4

95% lower limit: -5.3

Bland-Altman plot (MRI-histopathology), mm:

Mean difference: 1.3.

95% upper limit: 8.6

95% lower limit: -6.1

All tumors were SCC.

TNM edition: AJCC7 and AJCC8 were used. However, AJCC7 was used to classify the T-stage (using the greatest dimension).

Choi 2017

Instruments assessed: composite clinical assessment (endoscopic + palpation + CT or MRI) versus histopathology for tumor thickness (categorized for T-stage)

Setting and country: Hospital, Korea

Sampling method: Database between 1996 and 2012

Funding and conflicts of interest: Funded by Samsung biomedical research institute basic clinical collaborative research grant and National Research Foundation of Korea. Authors state that there were no CoI to declare.

Inclusion criteria:

Biopsy-proven SCC of the oral cavity; patients had undergone curative surgical resection of the primary tumor and neck dissection or sentinel node biopsy as initial treatment.

Exclusion criteria:

Synchronous or metachronous cancers, distant metastasis

N total at baseline:

N=252

Sample characteristics¹:

Median age (range):

55 (range: 47-65)

Sex (male/female):

164M/88F

Tumor site, n (%):

Tongue: 195 (77.4%)

FOM: 34 (13.5%)

Buccal: 23 (9.1%)

pT classification, n (%):

T1: 109 (43.3%)

T2: 80 (31.7%)

T3: 25 (9.9%)

T4a: 37 (14.7%)

T4b: 1 (0.4%)

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

Clinical staging was performed preoperatively by physicians with endoscopes (magnified view) and palpation and either contrast-enhanced CT or MRI. No further procedures described.

Surgical specimens of the primary tumor were assessed grossly and microscopically. No further procedures described.

Incomplete outcome data:

From sample (and subgroups if applicable)

How were missing data handled?

Length of follow-up (if applicable):

Unclear time period between preoperative assessment and postoperative histopathological assessment

Loss-to-follow-up (if applicable):

From sample (and subgroups if applicable)

Was the distribution of the (total) scores in the study sample described? (yes/no):

pT classification, n (%):

T1: 109 (43.3%)

T2: 80 (31.7%)

T3: 25 (9.9%)

T4a: 37 (14.7%)

T4b: 1 (0.4%)

cT classification, n (%):

T1: 114 (45.2%)

T2: 87 (34.5%)

T3: 25 (9.9%)

T4a: 37 (14.7%)

T4b: 1 (0.4%)

Percentage of the sample with the lowest score possible:

Percentage of the sample with the highest score possible:

Minimally important change/difference determined or referred? (yes/no)

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Kappa for the agreement of T-stage (clinical examination versus pathological data):

K=0.81 (95%CI: not reported, p<0.001)

Unclear time period between preoperative and postoperative assessment.

TNM edition: TNM7 (AJCC7) was used.

Goel 2016

Instruments assessed:

Clinical examination versus histopathology, and MRI versus histopathology for tumor thickness (categorized for T-stage)

Setting and country: Hospital, India

Sampling method:

Prospective data gathering of consecutive participants between July 2013 and August 2015.

Funding and conflicts of interest:

Authors state that there are no financial or competing interests.

Inclusion criteria:

Biopsy-proven SCC of the tongue or gingivobuccal region with enlarged neck nodes

Exclusion criteria:

Claustrophobic patients, metallic implants, other cancers of oral cavity, MR staging > T4a, patients not willing to undergo MRI.

N total at baseline:

N=61

Sample characteristics¹:

Mean age ± SD (or median age (range)):

Sex (male/female):

G: 45M/16F

T-stage, n:

T1: 4

T2: 16

T3: 13

T4: 28

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

No procedures of the clinical examination were reported.

MRI was performed with 1.5-T (BRIVO MR 355 1.5Tesla GE MRI). Patients lay supine on the MRI table and a head coil was applied. Sequences of 4mm thickness with 1mm intersection gap and a 256x256 matrix were used (240mm FOV). Routine T1WI, T2WI, Coronal STIR followed by post contrast axial T1W were performed.

Axial and coronal T2WI and post contrast T1WI were used to measure the lesion size. Lesions were staged (T-stage).

No procedures for histopathology were reported.

The agreement of clinical examination or MRI with histopathology on T-stage was assessed (T1-T4).

Incomplete outcome data:

From sample (and subgroups if applicable)

How were missing data handled?

Length of follow-up (if applicable):

Unclear how much time there was between clinical or MRI examination and histopathological assessment.

Loss-to-follow-up (if applicable):

From sample (and subgroups if applicable)

Was the distribution of the (total) scores in the study sample described? (yes/no):

T-stage histopathology, n:

T1: 4

T2: 16

T3: 13

T4: 28

T-stage clinical examination, n:

T1: 0

T2: 34

T3: 11

T4: 16

T-stage MRI, n:

T1: 5

T2: 18

T3: 9

T4: 29

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Kappa for the agreement of T-stage (clinical examination versus pathological data):

K=0.47 (95%CI: not reported)

Kappa for the agreement of T-stage (MRI versus pathological data):

K=0.69 (95%CI: not reported)

Agreement was categorized in T-stages.

No procedures for histopathology were reported.

TNM edition: Unclear

Iida 2018

Instruments assessed:

Ultrasound versus histopathology for tumor invasion (categorized)

Setting and country: US was performed in an outpatient clinic, Hospital, Japan.

Sampling method:

Funding and conflicts of interest:

Inclusion criteria:

Early oral tongue SCC in patients between June 2008 and December 2015.

Exclusion criteria:

Local recurrence after partial glossectomy, prior radiotherapy, prior chemotherapy.

N total at baseline:

N=56

Sample characteristics¹:

Mean age ± SD (or median age (range)):

59 (range: 25-90)

Sex (male/female):

G: 34M/22F

Disease site, n:

Lateral edge: 56

Other: 0

Tumor size by clinical measurement, n:

≤20mm: 44

>20mm: 12

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

Histological assessment was performed by using a micrometer in the tumor specimen (formalin-fixed paraffin-embedded). Depth of invasion was measured from the level of basement membrane of the closest normal mucosa.

Preoperative intraoral US was performed in an outpatient clinic. A Hitachi Aloka UST-5713T (16MHz) was used. The patient extended the tongue, which was held with gauze on the contralateral side. The reference line was defined as the line connecting the tumor-normal mucosal junctions on both sides. Measurements had a resolution of 0.1mm.

Instrument agreement was categorized using a cutoff value of 5mm, resulting in 2 categories: ≥5mm invasion, and <5mm invasion.

Incomplete outcome data:

How were missing data handled?

Length of follow-up (if applicable):

The time between preoperative measurement and histopathological assessment was unclear.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described?

Invasion assessed by histology, median (range): 3.5 (0.0-12.0)

Invasion assessed by ultrasound, median (range): 3.6 (0.7-9.2)

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Kappa at a cut-off point of 5mm invasion depth:

K=0.651 (95%CI: 0.43-0.87)

Instrument agreement was categorized using a cutoff value of 5mm, resulting in 2 categories: ≤5mm invasion, and >5mm invasion.

TNM edition: Unclear

Mao 2019

Instruments assessed:

MRI versus histopathology for depth of invasion

Setting and country: Hospital, China

Sampling method:

Prospective, from April 2015 to December 2017

Funding and conflicts of interest: Funded by the Discipline Construction Fund of Beijing Stomatology Hospital (grant/award number 17-09-14). The authors declared that the funding source had no role in the study. The authors declared that there was no CoI.

Inclusion criteria:

First diagnosis of SCC of the tongue

Exclusion criteria:

Ineligible for MRI, T4-stage, recurrent disease, received neoadjuvant treatment, prior radiotherapy.

N total at baseline:

N=150

Sample characteristics¹:

Mean age ± SD:

58.01 (SD: 12.10)

Sex (male/female):

G: 80M/70F

Disease location, n:

Ventral tongue: 35

Tongue border: 89

Dorsal tongue: 19

Tongue base: 7

Morphology, n:

Ulcer: 41

Invasive: 94

Exogenous: 15

T-stage, n:

T1: 43

T2: 71

T3: 36

pN-stage, n:

N1: 16

N2b: 17

N2c: 2

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

A 1.5T MR was used with a section thickness of 1mm (Siemens Magnetom Aera). Preoperative MRI was performed within one week before surgery. The scanning protocol consisted of: T1 axial / coronal / sagittal sequences, T2 axial / coronal sequences with fat suppression, and T1-weighted axial / coronal / sagittal sequences with fat suppression and contrast media.

Intraoperative tumor specimens were obtained by dissecting along the coronal/axial interval of 3mm, and the depth of invasion was measured on a micrometer. Surgical specimens were preserved in formalin. Pathological sections and staining were used to measure the depth of invasion (i.e., the vertical distance between the simulated normal mucosal junction and the deepest point of infiltration).

Incomplete outcome data:

From sample (and subgroups if applicable)

How were missing data handled?

Length of follow-up (if applicable):

Time between preoperative imaging and histopathological assessment was unclear; however, preoperative imaging was performed within 1 week prior to surgery.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described? (yes/no):

Depth of invasion, mean mm (SD):

MRI: 11.75 (6.49)

Pathological: 9.43 (5.57)

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Bland-Altman plot overall n=150 (MRI-histopathology), mm:

Mean difference: 2.32.

95% upper limit: 5.61

95% lower limit: -0.97

T-stage

Bland-Altman plot tumor T1-stage n=43 (MRI-histopathology), mm:

Mean difference: 1.48.

95% upper limit: 3.63

95% lower limit: -0.67

Bland-Altman plot tumor T2-stage n=71 (MRI-histopathology), mm:

Mean difference: 2.08.

95% upper limit: 4.62

95% lower limit: -0.45

Bland-Altman plot tumor T3-stage n=36 (MRI-histopathology), mm:

Mean difference: 3.79.

95% upper limit: 7.70

95% lower limit: -0.13

Morphology

Bland-Altman plot ulcer type tumor n=41 (MRI-histopathology), mm:

Mean difference: 3.72.

95% upper limit: 7.60

95% lower limit: -0.16

Bland-Altman plot invasive type tumor n=91 (MRI-histopathology), mm:

Mean difference: 1.83.

95% upper limit: 4.25

95% lower limit: -0.59

Bland-Altman plot exogenous type tumor n=15 (MRI-histopathology), mm:

Mean difference: 1.53.

95% upper limit: 3.06

95% lower limit: 0.01

Patient characteristics were extracted from table 1 in the article’s supplement.

TNM edition: TNM7 (AJCC7)

Nair 2018

Instruments assessed:

Ultrasound versus histopathology for tumor thickness

Setting and country: hospital, India

Sampling method:

Recruitment between January 2012 and December 2013. Unclear recruitment method.

Funding and conflicts of interest:

Authors declare that there was no funding or CoI.

Inclusion criteria:

Biopsy proven T1N0 or T2N0 primary SCC of the tongue, tumor located on the lateral two-third of the tongue.

Exclusion criteria:

Tongue tumor crossing the midline of the tongue or involving the tip of the tongue, lateral surface of anterior two-third of the tongue infiltrating into surrounding structures, irradiated tumor, recurrent tumor, tumor of other subsites in the oral cavity.

N total at baseline:

N=24

Sample characteristics¹:

Mean age ± SD (or median age (range)):

55 (range: 22-76)

Sex (male/female):

G: 16M/8F

T-stage, n:

pT1: 18

pT2: 6

Describe the assessed measurement properties and their procedures:

Reliability

Preoperative measurements were performed with ultrasound at 17 or 19MHz. The tongue was extended, and the probe was placed directly on the tumor. Tumor thickness was measured from the surface to the deepest point of invasion. For ulcer type tumors, an imaginary line was drawn over the area, joining the normal mucosa on both ends, and the deepest point of invasion was measured.

After resection, the specimens were placed in saline (not fixed with formalin) and sent to the pathology department. The specimen was cut into 2-3mm thick transverse slices.

Incomplete outcome data:

From sample (and subgroups if applicable)

How were missing data handled?

Length of follow-up (if applicable):

Unclear time between preoperative US assessment and histopathological assessment.

Loss-to-follow-up (if applicable):

From sample (and subgroups if applicable)

Was the distribution of the (total) scores in the study sample described? (yes/no):

Pathological (range): 2-15mm

Ultrasound (range): 1-14mm

Deduced from a figure (table 3)

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Reliability

ICC=0.821 (95%CI not provided, ICC model not reported)

Measurement error

Bland-Altman plot overall n=24 (US-histopathology), mm:

Mean difference: -0.15

95% upper limit: not reported.

95% lower limit: not reported.

LoAs can be approximated from the provided Bland-Altman plot:

UL: 4.99

LL: -4.6

The relatively large spread in histological tumor thickness (2-15mm) may influence the ICC; Pearson’s r=0.69, ICC=0.82.

95% LoAs were approximated from the Bland-Altman plot (table 4).

TNM edition: TNM6 (AJCC6)

Shintani 2001

Instruments assessed:

CT versus histopathology and MRI versus histopathology for tumor thickness

Setting and country: Hospital, Japan

Sampling method:

Unclear

Funding and conflicts of interest:

Unclear

Inclusion criteria:

Unclear

Exclusion criteria:

Unclear

N total at baseline:

N=38 (38 had CT scans and 24 also had an MRI)

Sample characteristics¹:

Mean age ± SD:

58.2 (SD not reported, range: 36-91)

Sex (male/female):

Not reported

Tumor location, n:

Tongue: 26

Buccal: 8

FOM: 4

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

Tumor thickness was measured by a contrast-enhanced 5-mm axial CT scan in 39 participants.

A 4-mm axial and coronal T1/T2-weighted MRI was used in 26 participants.

An ocular micrometer was used for histological sections.

Incomplete outcome data:

Study included n=39, but n=38 was reported in table 1.

MRI was not available in n=14 participants.

Small tumors were not detected by CT and MRI: 19/38 not detected by CT, 11/24 not detected by MRI.

How were missing data handled?

Excluded from analysis.

Length of follow-up (if applicable):

Unclear time period between preoperative examination and postoperative histology.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described? (yes/no):

pT-stage was not reported.

cT-stage, n:

T1: 14

T2: 13

T3: 8

T4: 4

Tumor thickness by histology, millimetres range:

1-33mm

Tumor thickness by CT, millimetres range:

1-37mm

Tumor thickness by MRI, millimetres range:

3.7-40mm

Percentage of the sample with the lowest score possible:

Percentage of the sample with the highest score possible:

Minimally important change/difference determined or referred? (yes/no)

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

The mean difference and the 95% limits of agreement could be calculated from the presented data.

Bland-Altman parameters calculated from presented data for CT n=19 (CT-histopathology), mm:

Mean difference: 5.93.

95% upper limit: 17.53

95% lower limit: -5.66

Bland-Altman parameters calculated from presented data for MRI n=13 (MRI-histopathology), mm:

Mean difference: 8.55.

95% upper limit: 23.05

95% lower limit: -5.94
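The Bland-Altman parameters above were derived from the paired measurements presented in the study. As a minimal sketch of that type of calculation, assuming the conventional definition (mean difference ± 1.96 × the sample SD of the differences), the following could be used. The function name `bland_altman` and the example values are hypothetical and are not data from any included study.

```python
from statistics import mean, stdev


def bland_altman(index_mm, reference_mm, z=1.96):
    """Return (mean difference, lower 95% LoA, upper 95% LoA).

    Differences are index test minus reference (e.g. imaging minus
    histopathology), so a positive bias means the index test overestimates.
    """
    if len(index_mm) != len(reference_mm):
        raise ValueError("paired measurements must have equal length")
    if len(index_mm) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")
    diffs = [i - r for i, r in zip(index_mm, reference_mm)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD (n - 1 in the denominator)
    return bias, bias - z * sd, bias + z * sd


# Hypothetical paired tumor thickness values in mm (illustration only).
imaging = [6.0, 9.0, 12.0, 15.0, 20.0]
histology = [5.0, 7.0, 11.0, 12.0, 18.0]

bias, lower, upper = bland_altman(imaging, histology)
print(f"Mean difference: {bias:.2f} mm, 95% LoA: {lower:.2f} to {upper:.2f} mm")
```

Tumors that were not detected by the imaging modality have no paired value and cannot enter such a calculation, which matches the reduced n reported for CT and MRI.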

Ultrasound data was already included in the systematic review by Klein Nulent 2018 and therefore not extracted in this evidence table.

T-staging according to UICC 1997

Tumors <5mm were difficult to assess by CT and MRI, according to the authors. 19/38 were not detected by CT and 11/24 were not detected by MRI.

No agreement parameters were reported, however the mean difference and 95%LoA could be calculated from the presented data.

TNM edition: TNM5 (UICC TNM5, 1997)

Verma 2019

Instruments assessed:

MRI versus histopathology for tumor thickness (categorized, T-stage)

Setting and country:

Hospital, India

Sampling method:

Prospective over the course of 1.5 years

Funding and conflicts of interest:

The authors declared that there was no funding and no conflict of interest.

Inclusion criteria:

Biopsy proven SCC of the tongue planned for surgery.

Exclusion criteria:

Previous history of head and neck cancer, prior surgery or radiotherapy to the neck

N total at baseline:

N=50

Sample characteristics¹:

Mean age ± SD:

49 (SD not reported)

Sex (male/female):

G: 38M/12F

No other characteristics provided.

Describe the assessed measurement properties and their procedures:

Reliability

Measurement error

MRI was performed with 4mm slices (3T GE Signa). T1WI axial / coronal, T2WI axial / coronal / sagittal, coronal STIR, and postcontrast axial T1WI were performed.

Tumor thickness on MRI was measured as follows: a reference line was drawn as the longest tumor diameter anteroposteriorly (axial view), mediolaterally (coronal view), and superoinferiorly (sagittal view). Tumor thickness was the distance from the reference line to the deepest infiltration point and to the most projecting point of the tumor.

Tumor thickness was assessed with histopathology in three dimensions. No further procedures were reported.

Incomplete outcome data:

How were missing data handled?

Length of follow-up (if applicable):

Time between preoperative MRI and histopathological assessment was unclear.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described? (yes/no):

MRI T-stage, n:

T1: 21

T2: 19

T3: 10

Histopathological T-stage, n:

T1: 24

T2: 18

T3: 8

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Measurement error

Kappa for the agreement of tumor thickness (T-stage) as measured by MRI and histopathology was not reported. However, the 3x3 table was reported from which a kappa could be calculated:

Kappa = 0.65
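As an illustration of how an unweighted Cohen's kappa can be derived from a square cross-classification table, a minimal sketch follows. The function name `cohen_kappa` and the 3x3 counts are hypothetical; they are not the table reported by Verma (2019), which is not reproduced in this evidence table.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square table of counts.

    table[i][j] = number of tumors staged as category i by the index test
    (e.g. MRI) and category j by the reference (e.g. histopathology).
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("table must be square")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one observation")

    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical 3x3 table (rows: index-test T1-T3, columns: reference T1-T3).
example_table = [
    [8, 2, 0],
    [1, 6, 1],
    [0, 1, 6],
]
kappa = cohen_kappa(example_table)
print(f"Kappa = {kappa:.2f}")
```

Because T-stage is ordinal, a weighted kappa (linear or quadratic) is often preferred; the sketch above deliberately shows only the unweighted form.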

TNM edition: TNM7 and TNM8 (AJCC7/8); however, kappa was calculated based on the TNM7 classification.

Vidiri 2019

Instruments assessed:

MRI versus histopathology for depth of invasion and tumor thickness (categorized, T-stage)

Setting and country:

Retrospective database between 2013 and 2018

Sampling method:

Database

Funding and conflicts of interest:

The authors declared that there were no conflicts of interest and that no funding was received for the research, authorship, and/or publication.

Inclusion criteria:

Oral tongue SCC, preoperative MRI within 3-4 weeks before surgery, availability of depth of invasion data in the histopathological report, presence of a measurable tumor on MRI.

Exclusion criteria:

Preoperative chemoradiation, mandible infiltration (T4a tumors).

N total at baseline:

N=43

Sample characteristics¹:

Median age (range):

65 (31-82)

Sex (male/female):

G: 18M/25F

pT-stage, n:

pT1: 10

pT2: 12

pT3: 21

Describe the assessed measurement properties and their procedures:

Reliability

MRI was performed with a 1.5T scanner (GE Optima MR 450W). T2W coronal, FSE T2W axial, and pre-contrast T1W axial images were acquired. DWI was obtained with single-shot spin-echo and echo-planar imaging. Furthermore, post-contrast T1W images in the axial plane and T1W images with liver acquisition in the axial and coronal planes were acquired.

Depth of invasion on MRI was measured using a reference line (connecting the junctions of the tumor surface and the normal mucosal surface), ignoring exophytic portions of the tumor. Invasion was measured by drawing a line from the reference line to the deepest point of invasion.

The radiologists assessed the cT-stage as well.

MR images were assessed independently by an experienced and an inexperienced radiologist. They were blinded to the histopathological results.

Resected tissue samples were fixed in formalin. Embedding, sectioning and staining (hematoxylin and eosin) were done for histopathological analysis. Depth of invasion was measured by drawing a plumb line from the level of the basement membrane of the closest normal mucosa to the deepest point of invasion.

Incomplete outcome data:

How were missing data handled?

Length of follow-up (if applicable):

Time between MRI and histopathology is unclear; however, MRI was performed 3-4 weeks before surgical resection.

Loss-to-follow-up (if applicable):

Was the distribution of the (total) scores in the study sample described? (yes/no):

Depth of invasion for T1 tumors, median mm:

Histopathology: 4

Radiologist A: 4.95

Radiologist B: 4.95

Depth of invasion for T2 tumors, median mm:

Histopathology: 7

Radiologist A: 6.7

Radiologist B: 7

Depth of invasion for T3 tumors, median mm:

Histopathology: 15

Radiologist A: 14.7

Radiologist B: 16

pT-stage, n:

pT1: 10

pT2: 12

pT3: 21

cT-stage experienced radiologist, n:

cT1: 7

cT2: 15

cT3: 21

cT-stage inexperienced radiologist, n:

cT1: 6

cT2: 19

cT3: 18

Percentage of the sample with the lowest score possible:

Not relevant

Percentage of the sample with the highest score possible:

Not relevant

Minimally important change/difference determined or referred? (yes/no)

Not relevant

Outcome measures and effect size (include 95%CI and p-value if available):

Reliability

Reliability

ICC for experienced MR radiologist and histopathologic result, depth of invasion:

ICC=0.90 (95%CI: 0.81-0.94)

ICC for inexperienced MR radiologist and histopathologic result, depth of invasion:

ICC=0.87 (95%CI: 0.76-0.92)

Measurement error

For depth of invasion

Bland-Altman plot (MRI experienced radiologist-histopathology), mm:

Mean difference: -0.3

95% upper limit: 4.9

95% lower limit: -5.5

Bland-Altman plot (MRI inexperienced radiologist-histopathology), mm:

Mean difference: -0.4

95% upper limit: 5.8

95% lower limit: -6.6

For tumor thickness (categorized, T-stage)

Kappa for the agreement of T-stage (MRI experienced radiologist versus pathological data):

K=0.74 (95%CI: 0.56-0.92)

Kappa for the agreement of T-stage (MRI inexperienced radiologist versus pathological data):

K=0.60 (95%CI: 0.40-0.80)

ICC (2,1) was used for absolute agreement.
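For readers unfamiliar with the ICC(2,1) notation, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-measurement ICC in the Shrout and Fleiss formulation, computed from the ANOVA mean squares. It assumes this is the model meant by "ICC (2,1)"; the function name `icc_2_1` and the example values are hypothetical and are not data from Vidiri (2019).

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores: list of rows, one row per tumor, one column per method
    (e.g. [MRI depth in mm, histopathological depth in mm]).
    """
    n = len(scores)      # number of subjects (tumors)
    k = len(scores[0])   # number of methods or raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-subject mean square
    ms_cols = ss_cols / (k - 1)              # between-method mean square
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical depths in mm: the second method reads 1 mm higher throughout.
offset_example = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc_offset = icc_2_1(offset_example)
print(f"ICC(2,1) = {icc_offset:.2f}")
```

In the hypothetical example the two methods rank the tumors identically, yet the constant 1 mm offset lowers the ICC, because the absolute-agreement form penalizes systematic differences between methods.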

TNM edition: TNM 8 (AJCC8)

¹ Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., Bouter, L. M., … de Vet, H. C. (2010). The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Quality of life research: an international journal of quality of life aspects of treatment, care and rehabilitation, 19(4), 539-49.

Table of quality assessment for systematic reviews of RCTs and observational studies

Based on AMSTAR checklist (Shea, 2007; BMC Med Res Methodol 7: 10; doi:10.1186/1471-2288-7-10) and PRISMA checklist (Moher, 2009; PLoS Med 6: e1000097; doi:10.1371/journal.pmed.1000097)

Study

First author, year

Appropriate and clearly focused question?¹

Yes/no/unclear

Comprehensive and systematic literature search?²

Yes/no/unclear

Description of included and excluded studies?³

Yes/no/unclear

Description of relevant characteristics of included studies?⁴

Yes/no/unclear

Appropriate adjustment for potential confounders in observational studies?⁵

Yes/no/unclear/not applicable

Assessment of scientific quality of included studies?⁶

Yes/no/unclear

Enough similarities between studies to make combining them reasonable?⁷

Yes/no/unclear

Potential risk of publication bias taken into account?⁸

Yes/no/unclear

Potential conflicts of interest reported?⁹

Yes/no/unclear

Klein Nulent 2018

Yes

Reason: Aim could have been more specified, however the inclusion/exclusion criteria are specific and probably reproducible.

Yes

Reason: Multiple databases were searched. Search period was described. Syntax is available in the supplementary materials.

No (partially)

Reason: Excluded studies were not described or referenced, however reasons for exclusion were provided in the study selection flow diagram.

Yes

Reason: table 2 provides the study characteristics.

Not applicable

Yes

Reason: QUADAS-2 tool was used.

Not applicable

Reason: a meta-analysis was not performed.

Reason: publication bias was not assessed (publication bias is difficult to assess for clinimetric questions).

Reason: It was reported for the systematic review authors, but not for the included studies.

¹ Research question (PICO) and inclusion criteria should be appropriate and predefined.
² Search period and strategy should be described; at least Medline searched; for pharmacological questions at least Medline + EMBASE searched.
³ Potentially relevant studies that are excluded at final selection (after reading the full text) should be referenced with reasons.
⁴ Characteristics of individual studies relevant to research question (PICO), including potential confounders, should be reported.
⁵ Results should be adequately controlled for potential confounders by multivariate analysis (not applicable for RCTs).
⁶ Quality of individual studies should be assessed using a quality scoring tool or checklist (Jadad score, Newcastle-Ottawa scale, risk of bias table et cetera).
⁷ Clinical and statistical heterogeneity should be assessed; clinical: enough similarities in patient characteristics, intervention and definition of outcome measure to allow pooling? For pooled data: assessment of statistical heterogeneity using appropriate statistical tests (for example Chi-square, I²)?
⁸ An assessment of publication bias should include a combination of graphical aids (for example funnel plot, other available tests) and/or statistical tests (for example Egger regression test, Hedges-Olkin). Note: If no test values or funnel plot are included, score “no”. Score “yes” if the review mentions that publication bias could not be assessed because there were fewer than 10 included studies.
⁹ Sources of support (including commercial co-authorship) should be reported in both the systematic review and the included studies. Note: To get a “yes,” source of funding or support must be indicated for the systematic review AND for each of the included studies.

COSMIN risk of bias assessment of included studies

Reliability
Author: Alsaffar 2019
Instrument: clinical examination
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Reliability
Author: Alsaffar 2019
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Brouwer de Koning 2019
Instrument: ultrasound
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Brouwer de Koning 2019
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Reliability
Author: Choi 2017
Instrument: clinical examination (composite of endoscopic + palpation + CT or MRI)
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Reliability
Author: Goel 2016
Instrument: clinical examination
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Reliability
Author: Goel 2016
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Reliability
Author: Iida 2018
Instrument: ultrasound
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Mao 2019
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Shintani 2001
Instrument: CT
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Shintani 2001
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Nair 2018
Instrument: Ultrasound
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws
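
The checklist item on continuous scores asks whether SEM, SDC or LoA were calculated, or whether LoA can be calculated from the data presented. As a rough illustration only, the sketch below shows how these quantities could be derived from paired depth-of-invasion measurements in millimetres. The function name `measurement_error`, the input values and the choice of SEM = SD of the differences / √2 (one common formulation) are assumptions made for this example; they are not taken from the included studies.

```python
import math
import statistics


def measurement_error(method_a, method_b):
    """Return Bland-Altman limits of agreement, SEM and SDC for paired data.

    method_a, method_b: equally long sequences of paired measurements (mm),
    e.g. an imaging measurement and the histopathological reference.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least two pairs")

    # Pairwise differences between the two methods
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)   # systematic difference (bias)
    sd_diff = statistics.stdev(diffs)    # sample SD of the differences

    # 95% limits of agreement: bias +/- 1.96 * SD of the differences
    loa_lower = mean_diff - 1.96 * sd_diff
    loa_upper = mean_diff + 1.96 * sd_diff

    # Assumed formulation: SEM from the SD of the differences,
    # SDC as 1.96 * sqrt(2) * SEM
    sem = sd_diff / math.sqrt(2)
    sdc = 1.96 * math.sqrt(2) * sem

    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "loa_lower": loa_lower,
        "loa_upper": loa_upper,
        "sem": sem,
        "sdc": sdc,
    }


if __name__ == "__main__":
    # Hypothetical values in mm, for illustration only
    imaging = [5.0, 7.0, 9.0, 11.0]
    histopathology = [4.0, 7.0, 8.0, 12.0]
    for name, value in measurement_error(imaging, histopathology).items():
        print(f"{name}: {value:.2f}")
```

With this formulation the SDC equals 1.96 times the SD of the differences, so it matches the half-width of the limits of agreement.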

Reliability
Author: Verma 2019
Instrument: MRI
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? e.g. type of administration, environment, instructions	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
For ordinal scores: Was a weighted kappa calculated?	Weighted kappa calculated		Unweighted kappa calculated or not described		Not applicable
For ordinal scores: Was the weighting scheme described? e.g. linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Measurement error
Author: Vidiri 2019
Instrument: MRI (DOI)
	Very Good	Adequate	Doubtful	Inadequate	NA
Were patients stable in the interim period on the construct to be measured?	Patients were stable (evidence provided)	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?	SEM, SDC, or LoA calculated	Possible to calculate LoA from the data presented		SEM calculated based on Cronbach’s alpha, or on SD from another population	Not applicable
For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?	% positive and negative agreement calculated	% agreement calculated		% agreement not calculated	Not applicable
Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

Table of excluded studies

Authors and year	Reason for exclusion
Angelelli 2017	No parameters of interest reported
Bashir 2011	No parameters of interest reported
Chammas 2011	Included in the systematic review by Klein Nulent 2018
Iwai 2002	No parameters of interest reported
Jayasankaran 2017	No parameters of interest reported
Jung 2009	No parameters of interest reported
Junn 2017	No parameters of interest reported
Kodama 2010	Included in the systematic review by Klein Nulent 2018
Koopaie 2014	Article not available in English
Lam 2004	No parameters of interest reported
Lwin 2012	No parameters of interest reported
Madana 2015	No parameters of interest reported
Morand 2019	No parameters of interest reported
Moreno 2017	No parameters of interest reported
Park 2011	No parameters of interest reported
Preda 2006	No parameters of interest reported
Sarode 2010	Letter to the editor
Sarode 2012	Educational/opinion paper
Shintani 1997	No parameters of interest reported
Songra 2006	Included in the systematic review by Klein Nulent 2018
Weimar 2018	No parameters of interest reported
Yesuratnam 2014	Included in the systematic review by Klein Nulent 2018
Yuen 2008	No parameters of interest reported
Zhou 2008	Article not available in English

Accountability

Authorisation date and validity

Last reviewed:

Last authorised:

Planned reassessment: 01-01-2028

Initiative and authorisation

Initiative:

Nederlandse Vereniging voor Keel-Neus-Oorheelkunde en Heelkunde van het Hoofd-Halsgebied

Authorised by:

Nederlandse Internisten Vereniging
Nederlandse Vereniging voor Keel-Neus-Oorheelkunde en Heelkunde van het Hoofd-Halsgebied
Nederlandse Vereniging voor Nucleaire geneeskunde
Nederlandse Vereniging voor Pathologie
Nederlandse Vereniging voor Plastische Chirurgie
Nederlandse Vereniging voor Radiologie
Nederlandse Vereniging voor Radiotherapie en Oncologie
Nederlandse Vereniging voor Mond- Kaak- en Aangezichtschirurgie

General information

The development/revision of this guideline module was supported by the Kennisinstituut van de Federatie Medisch Specialisten and was funded by the Stichting Kwaliteitsgelden Medisch Specialisten (SKMS). The funder had no influence whatsoever on the content of the guideline module.

Composition of the working group

To develop the guideline module, a multidisciplinary working group was established in 2019, consisting of representatives of all relevant specialties involved in the care of patients with head and neck tumours.

Working group

Prof. Dr. R. de Bree, ENT physician/head and neck surgeon, UMC Utrecht, Utrecht, NVKNO (chair)
Dr. M.B. Karakullukcu, ENT physician/head and neck surgeon, NKI, Amsterdam, NVKNO
Dr. H.P. Verschuur, ENT physician/head and neck surgeon, Haaglanden MC, Den Haag, NVKNO
Dr. M. Walenkamp, ENT resident, LUMC, Leiden, NVKNO
Dr. A. Sewnaik, ENT physician/head and neck surgeon, Erasmus MC, Rotterdam, NVKNO
Drs. L.H.E. Karssemakers, oral and maxillofacial surgeon-oncologist/head and neck surgeon, NKI, Amsterdam, NVMKA
Dr. M.J.H. Witjes, oral and maxillofacial surgeon-oncologist, UMC Groningen, Groningen, NVMKA
Drs. L.A.A. Vaassen, oral and maxillofacial surgeon-oncologist, Maastricht UMC+, Maastricht, NVMKA
Drs. W.L.J. Weijs, oral and maxillofacial surgeon-oncologist, Radboud UMC, Nijmegen, NVMKA
Drs. E.M. Zwijnenburg, radiation oncologist, Radboud UMC, Nijmegen, NVRO
Dr. A. Al-Mamgani, radiation oncologist, NKI, Amsterdam, NVRO
Prof. Dr. C.H.J. Terhaard, radiation oncologist, UMC Utrecht, Utrecht, NVRO
Drs. J.G.M. Van den Hoek, radiation oncologist, UMC Groningen, Groningen, NVRO
Dr. E. Van Meerten, medical oncologist, Erasmus MC Kanker Instituut, Rotterdam, NIV
Dr. M. Slingerland, medical oncologist, LUMC, Leiden, NIV
Drs. M.A. Huijing, plastic surgeon, UMC Groningen, Groningen, NVPC
Prof. Dr. S.M. Willems, clinical pathologist, UMC Groningen, Groningen, NVVP
Prof. Dr. E. Bloemena, clinical pathologist, Amsterdam UMC, location VUmc, Amsterdam, NVVP
R.A. Burdorf, chair of the executive board of the patient association, Patiëntenvereniging HOOFD-HALS, PvHH
P.S. Verdouw, head of the information centre of the patient association, Patiëntenvereniging HOOFD-HALS, PvHH
A.A.M. Goossens, oncology nurse practitioner, Haaglanden MC, Den Haag, V&VN
Dr. P. de Graaf, radiologist, Amsterdam UMC, Amsterdam, NVvR
Dr. W.V. Vogel, nuclear medicine physician/radiation oncologist, NKI, Amsterdam, NVNG
Drs. G.J.C. Zwezerijnen, nuclear medicine physician, Amsterdam UMC, Amsterdam, NVNG

Sounding board group

Dr. C.M. Speksnijder, physiotherapist, UMC Utrecht, Utrecht, KNGF
Ir. A. Kok, dietitian, UMC Utrecht, Utrecht, NVD
Dr. M.M. Hakkesteegt, speech therapist, Erasmus MC, Rotterdam, NVLF
Drs. D.J.M. Buurman, dentist (maxillofacial prosthodontics), Maastricht UMC+, Maastricht, KNMT
W. Van der Groot-Roggen, dental hygienist, UMC Groningen, Groningen, NVvM
Drs. D.J.S. Dona, occupational physician/clinical occupational physician in oncology, Radboud UMC, Nijmegen, NVKA
Dr. M. Sloots, occupational therapist, UMC Utrecht, Utrecht
J. Poelstra, medical social worker, in a personal capacity

With support from

Dr. J. Boschman, senior advisor, Kennisinstituut van de Federatie Medisch Specialisten
Dr. C. Gaasterland, advisor, Kennisinstituut van de Federatie Medisch Specialisten
Dr. A. Van der Hout, advisor, Kennisinstituut van de Federatie Medisch Specialisten
Dr. L. Oostendorp, advisor, Kennisinstituut van de Federatie Medisch Specialisten
Drs. M. Oerbekke, advisor, Kennisinstituut van de Federatie Medisch Specialisten
Drs. A. Hoeven, junior advisor, Kennisinstituut van de Federatie Medisch Specialisten

Declarations of interest

The Code for the prevention of improper influence due to conflicts of interest (Code ter voorkoming van oneigenlijke beïnvloeding door belangenverstrengeling) was followed. All working group members declared in writing whether, in the past three years, they had direct financial interests (employment at a commercial company, personal financial interests, research funding) or indirect interests (personal relationships, reputation management). During the development or revision of a module, changes in interests are reported to the chair. The declaration of interest is reconfirmed during the commentary phase.

An overview of the interests of the working group members and the judgement on how any interests were handled is given in the table below. The signed declarations of interest can be requested from the secretariat of the Kennisinstituut van de Federatie Medisch Specialisten.

Working group member	Position	Ancillary positions	Reported interests	Action taken
*Bree, de*	ENT physician/head and neck surgeon, UMC Utrecht	* Member of the General Board of Patiëntenvereniging Hoofd-Hals (unpaid) * Chair of the NWHHT Research Steering Group * Member of the NWHHT Guidelines Committee * Member of the NWHHT executive board * Member of the Clinical Audit Board of the Dutch Head and Neck Audit (DHNA) * Member of the DORP scientific advisory committee * Chair of the Advisory Committee on head and neck cancer research (IKNL/PALGA/DHNA/NWHHT)	None	None
*Slingerland*	Medical oncologist, LUMC	* 2018-present: Treasurer of the "Dutch Association of Medical Oncology" (NVMO, attendance fees) * 2018-present: Member of the "Dutch Working Group for Head-Neck Tumors" (NWHHT, Systemic therapy) * 2016-present: Member of the "Dutch Working Group for Head-Neck Tumors" (NWHHT, study group steering group (coordinating)) * 2016-present: Member of the "Dutch Working Group for Head-Neck Tumors" (NWHHT, Elderly Platform) * 2012-present: Member of the "Working Group for Head-Neck Tumors" (WHHT), "University Cancer Centre" (UCK) Leiden - Den Haag * 2019: Member CAB DHNA	* Participation in the MSD national expert forum on head and neck cancer, 2-5-2018 * Participation in the Checkmate study, sponsor Bristol-Myers Squibb (BMS): An open label, randomized phase 3 clinical trial of nivolumab versus therapy of investigator's choice in recurrent or metastatic platinum-refractory squamous cell carcinoma of the head and neck (SCCHN) * Participation in the Commence study, sponsor Radboud University, in collaboration with Merck Serono International SA (among several Dutch medical centers): A phase IB-II study of the combination of cetuximab and methotrexate in recurrent or metastatic squamous cell carcinoma of the head and neck. A study of the Dutch Head and Neck Society, MOHN01/COMMENCE study. * Participation in the HESPECTA study: Phase I study: to determine the biological activity of two HPV16E6 specific peptides coupled to Amplivant®, a Toll-like receptor ligand in non-metastatic patients treated for HPV16-positive head and neck cancer. * Participation in the PINCH study (not yet open): PD-L1 ImagiNg to predict durvalumab treatment response in HNSCC (PINCH) trial; patients with biopsy-proven locally recurrent or metastatic HNSCC * Participation in the ISA 101b-HN-01-17 study (not yet open): A randomized, Double-blind, Placebo-Controlled, Phase 2 Study of Cemiplimab versus the combination of Cemiplimab with ISA101b in the Treatment of Subjects.	Two medical oncologists participate in the working group, so that one of the two takes the lead on modules about systemic therapy. Action: the working group member is excluded from decision-making on modules that relate to the subjects of the reported studies: nivolumab, cetuximab + methotrexate, Amplivant, durvalumab, cemiplimab.
*Meerten, van*	Medical oncologist, Erasmus MC Kanker Instituut	None	Currently Principal Investigator for the Netherlands of a randomised phase III trial on the added value of pembrolizumab to chemoradiotherapy in patients with advanced head and neck cancer. Sponsor: GlaxoSmithKline Research & Development Ltd. The study is ongoing; results will only be known after publication of the guideline. Possible future participation in industry-sponsored studies on the treatment of head and neck cancer	Two medical oncologists participate in the working group, so that one of the two takes the lead on modules about systemic therapy. Action: the working group member is excluded from decision-making on modules that relate to the subject of the reported study: the added value of pembrolizumab in patients with advanced head and neck cancer.
*Huijing*	Plastic surgeon, UMC Groningen	None	None	None
*Sewnaik*	ENT physician/head and neck surgeon, Erasmus MC	Head of the head and neck surgery section	None	None
*Vaassen*	Oral and maxillofacial surgeon-oncologist, Maastricht UMC+ / CBT Zuid-Limburg	* Board member NVMKA * Acting head of oral and maxillofacial surgery, MUMC	None	None
*Witjes*	Oral and maxillofacial surgeon-oncologist, UMC Groningen	None	PI of KWF grant RUG 2015-8084: Image guided surgery for margin assessment of head & neck Cancer using cetuximab-IRDye800 cONjugate (ICON); no financial interest	None. Funding by KWF was not considered an interest.
*Bloemena*	Clinical pathologist, Amsterdam UMC (location VUmc) / Radboud UMC / Academisch Centrum voor Tandheelkunde Amsterdam (ACTA)	* Board member of the Nederlandse Vereniging voor Pathologie (NVVP), attendance fee (until 1-12-20) * Chair of the Continuing Education Committee (NVVP) * Chair (until 1-12-20) of the PALGA Scientific Council, unpaid	None	None
*Willems*	Clinical pathologist, UMC Groningen	Vice-chair PALGA, general board NWHHT, CAB DHNA, co-chair and founder of the head and neck pathology expertise group NL, head and neck pathology UMC Groningen	* PD-L1 trainer NL for MSD * Research funding from Pfizer, Roche, MSD, BMS, Lilly, Novartis, Bayer, Amgen, AstraZeneca	None
*Karakullukcu*	ENT physician/head and neck surgeon, NKI/AVL	None	None	None
*Verschuur*	ENT physician/head and neck surgeon, Haaglanden MC	* Trainer of ENT residents * Chair of the day	None	None
*Walenkamp*	ENT resident, LUMC	None	None	None
*Al-Mamgani*	Radiation oncologist, NKI/AVL	None	None	None
*Terhaard*	Radiation oncologist, UMC Utrecht	Not applicable	None	None
*Hoek, van den*	Radiation oncologist, UMCG	Not applicable	None	None
*Zwijnenburg*	Radiation oncologist, head and neck, Radboud UMC	None	None	None
*Burdorf*	Patient representative	None	None	None
*Verdouw*	Head of the information centre of patiëntenvereniging HOOFD-HALS	None	Employed by the patient association. Its members benefit from a revision of the guideline	None
*Karssemakers*	Head and neck surgeon, NKI/AVL; oral and maxillofacial surgeon-oncologist, Amsterdam UMC (location AMC) / oral surgery group Amsterdam West	Not applicable	None	None
*Goossens*	Nurse practitioner, Haaglanden Medisch Centrum (HMC)	* Board member (treasurer) PWHHT (unpaid) * Member of the PVHH information committee (unpaid)	None	None
*Zwezerijnen*	Nuclear medicine physician, Amsterdam UMC (location VUmc); PhD candidate, Amsterdam UMC (location VUmc)	Member, as nuclear medicine physician, of the HOVON imaging working group (discussing guidelines and setting up/conducting scientific studies on imaging in haematology); unpaid	None	None
*Vogel*	Nuclear medicine physician/radiation oncologist, AVL	None	In recent years incidental advice or teaching, paid by Bayer, but not related to head and neck. KWF grant on salivary gland toxicity after treatment. No interest in the guideline	None
*Graaf, de*	Radiologist, Amsterdam UMC (location VUmc)	Board member of the Head and Neck radiology section (unpaid)	None	None
*Weijs*	Oral and maxillofacial surgeon-oncologist, Radboudumc	Oral and maxillofacial surgeon, Weijsheidstand B.V. Working as a general practising oral and maxillofacial surgeon, paid (0.1 FTE)	None	None

Patient perspective

Attention was paid to the patient perspective by inviting the patient association HOOFD-HALS (PVHH) to the invitational conference and by including PVHH delegates in the working group. The report of the conference (see appendices) was discussed in the working group. The input obtained was taken into account when drafting the key questions, choosing the outcome measures and drafting the considerations. The draft guideline was also submitted to the patient association HOOFD-HALS for comment, and any comments received were reviewed and incorporated.

Development method

Evidence based

Procedure

AGREE

This guideline module was drawn up in accordance with the requirements stated in the report Medisch Specialistische Richtlijnen 2.0 of the Guidelines advisory committee of the Raad Kwaliteit. This report is based on the AGREE II instrument (Appraisal of Guidelines for Research & Evaluation II; Brouwers, 2010).

Bottleneck analysis and key questions

During the preparatory phase, the working group inventoried the bottlenecks in the care for patients with head and neck tumours. The working group assessed the recommendation(s) from the previous guideline module (NVKNO, 2014) for the need for revision. Bottlenecks were also put forward by the patient association and invited parties during the invitational conference (see the appendices for the report of the invitational conference). Based on the outcomes of the bottleneck analysis, the working group drafted the key questions and then finalised them.

Outcome measures

After formulating the search question belonging to the key question, the working group inventoried which outcome measures are relevant to the patient, considering both desired and undesired effects. A maximum of eight outcome measures was used. The working group rated these outcome measures according to their relative importance for decision-making on recommendations, as crucial (critical for decision-making), important (but not crucial) and unimportant. The working group also defined, at least for the crucial outcome measures, which differences it considered clinically (patient) relevant.

Method of literature summary

A detailed description of the strategy for searching and selecting literature and of the risk-of-bias assessment of the individual studies can be found under ‘Search and select’ under Evidence base. The assessment of the strength of the scientific evidence is explained below.

Assessing the strength of the scientific evidence

The strength of the scientific evidence was determined according to the GRADE method. GRADE stands for ‘Grading Recommendations Assessment, Development and Evaluation’ (see http://www.gradeworkinggroup.org/). The basic principles of the GRADE methodology are: naming and prioritising the clinically (patient) relevant outcome measures, a systematic review per outcome measure, and an assessment of the certainty of evidence per outcome measure based on the eight GRADE domains (domains for downgrading: risk of bias, inconsistency, indirectness, imprecision, and publication bias; domains for upgrading: dose-response relationship, large effect, and residual plausible confounding).

GRADE distinguishes four levels for the quality of the scientific evidence: high, moderate, low and very low. These levels refer to the degree of certainty about the literature conclusion, in particular the degree of certainty that the literature conclusion adequately supports the recommendation (Schünemann, 2013; Hultcrantz, 2017).

GRADE	Definition
High	there is high certainty that the true effect of treatment lies close to the estimated effect of treatment; it is very unlikely that the literature conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Moderate	there is moderate certainty that the true effect of treatment lies close to the estimated effect of treatment; it is possible that the conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Low	there is low certainty that the true effect of treatment lies close to the estimated effect of treatment; there is a real chance that the conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Very low	there is very low certainty that the true effect of treatment lies close to the estimated effect of treatment; the literature conclusion is very uncertain.

When assessing (grading) the strength of the scientific evidence in guidelines according to the GRADE methodology, thresholds for clinical decision-making play an important role (Hultcrantz, 2017). These are the thresholds that, when crossed, would lead to a change in the recommendation. To determine the thresholds for clinical decision-making, all relevant outcome measures and considerations must be weighed. The thresholds for clinical decision-making are therefore not directly comparable with the Minimal Clinically Important Difference (MCID). Particularly in situations where an intervention has no important disadvantages and the costs are relatively low, the threshold for clinical decision-making regarding the effectiveness of the intervention may lie at a lower value (closer to the null effect) than the MCID (Hultcrantz, 2017).

Considerations (from evidence to recommendation)

To arrive at a recommendation, other aspects besides (the quality of) the scientific evidence are also important and are weighed, such as additional arguments from, for example, biomechanics or physiology, values and preferences of patients, costs (resource use), acceptability, feasibility and implementation. These aspects are systematically listed and assessed (weighed) under the heading ‘Considerations’ and may be (partly) based on expert opinion. A structured format was used, based on the evidence-to-decision framework of the international GRADE Working Group (Alonso-Coello, 2016a; Alonso-Coello, 2016b). This evidence-to-decision framework is an integral part of the GRADE methodology.

Formulating recommendations

The recommendations answer the key question and are based on the available scientific evidence, the most important considerations, and a weighing of the favourable and unfavourable effects of the relevant interventions. The strength of the scientific evidence and the weight that the working group assigns to the considerations together determine the strength of the recommendation. In accordance with the GRADE methodology, a low certainty of evidence for conclusions in the systematic literature analysis does not a priori rule out a strong recommendation, and weak recommendations are also possible with a high certainty of evidence (Agoritsas, 2017; Neumann, 2016). The strength of the recommendation is always determined by weighing all relevant arguments together. For each recommendation, the working group has recorded how it arrived at the direction and strength of the recommendation.

The GRADE methodology distinguishes between strong and weak (or conditional) recommendations. The strength of a recommendation refers to the degree of certainty that the benefits of the intervention outweigh the harms (or vice versa), seen across the whole spectrum of patients for whom the recommendation is intended. The strength of a recommendation has clear implications for patients, clinicians and policymakers (see the table below). A recommendation is not a dictate; even a strong recommendation based on high-quality evidence (GRADE rating HIGH) will not always apply under all possible circumstances and for every individual patient.

Implications of strong and weak recommendations for different guideline users
	Strong recommendation	Weak (conditional) recommendation
For patients	Most patients would choose the recommended intervention or approach and only a small number would not.	A considerable proportion of patients would choose the recommended intervention or approach, but many patients would not.
For clinicians	Most patients should receive the recommended intervention or approach.	There are several suitable interventions or approaches. The patient should be supported in choosing the intervention or approach that best fits his or her values and preferences.
For policymakers	The recommended intervention or approach can be regarded as standard policy.	Policy-making requires extensive discussion involving many stakeholders. There is a greater chance of local differences in policy.

Organisation of care

In the bottleneck analysis and during the development of the guideline module, explicit attention was paid to the organisation of care: all aspects that are preconditions for providing care (such as coordination, communication, (financial) resources, staffing and infrastructure). Preconditions relevant to answering this specific key question are mentioned in the considerations. More general, overarching or additional aspects of the organisation of care are addressed in the module Organisation of care.

Commentary and authorisation phase

The draft guideline module was submitted to the (scientific) societies and (patient) organisations involved for comment. The comments were collected and discussed with the working group. In response to the comments, the draft guideline module was adjusted and finalised by the working group. The final guideline module was submitted to the participating (scientific) societies and (patient) organisations for authorisation and was authorised or approved by them.

References

Agoritsas T, Merglen A, Heen AF, Kristiansen A, Neumann I, Brito JP, Brignardello-Petersen R, Alexander PE, Rind DM, Vandvik PO, Guyatt GH. UpToDate adherence to GRADE criteria for strong recommendations: an analytical survey. BMJ Open. 2017 Nov 16;7(11):e018593. doi: 10.1136/bmjopen-2017-018593. PubMed PMID: 29150475; PubMed Central PMCID: PMC5701989.

Alonso-Coello P, Schünemann HJ, Moberg J, Brignardello-Petersen R, Akl EA, Davoli M, Treweek S, Mustafa RA, Rada G, Rosenbaum S, Morelli A, Guyatt GH, Oxman AD; GRADE Working Group. GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 1: Introduction. BMJ. 2016 Jun 28;353:i2016. doi: 10.1136/bmj.i2016. PubMed PMID: 27353417.

Alonso-Coello P, Oxman AD, Moberg J, Brignardello-Petersen R, Akl EA, Davoli M, Treweek S, Mustafa RA, Vandvik PO, Meerpohl J, Guyatt GH, Schünemann HJ; GRADE Working Group. GRADE Evidence to Decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 2: Clinical practice guidelines. BMJ. 2016 Jun 30;353:i2089. doi: 10.1136/bmj.i2089. PubMed PMID: 27365494.

Brouwers MC, Kho ME, Browman GP, Burgers JS, Cluzeau F, Feder G, Fervers B, Graham ID, Grimshaw J, Hanna SE, Littlejohns P, Makarski J, Zitzelsberger L; AGREE Next Steps Consortium. AGREE II: advancing guideline development, reporting and evaluation in health care. CMAJ. 2010 Dec 14;182(18):E839-42. doi: 10.1503/cmaj.090449. Epub 2010 Jul 5. Review. PubMed PMID: 20603348; PubMed Central PMCID: PMC3001530.

Hultcrantz M, Rind D, Akl EA, Treweek S, Mustafa RA, Iorio A, Alper BS, Meerpohl JJ, Murad MH, Ansari MT, Katikireddi SV, Östlund P, Tranæus S, Christensen R, Gartlehner G, Brozek J, Izcovich A, Schünemann H, Guyatt G. The GRADE Working Group clarifies the construct of certainty of evidence. J Clin Epidemiol. 2017 Jul;87:4-13. doi: 10.1016/j.jclinepi.2017.05.006. Epub 2017 May 18. PubMed PMID: 28529184; PubMed Central PMCID: PMC6542664.

Medisch Specialistische Richtlijnen 2.0 (2012). Adviescommissie Richtlijnen van de Raad Kwaliteit. https://richtlijnendatabase.nl/over_deze_site/richtlijnontwikkeling.html.

Neumann I, Santesso N, Akl EA, Rind DM, Vandvik PO, Alonso-Coello P, Agoritsas T, Mustafa RA, Alexander PE, Schünemann H, Guyatt GH. A guide for health professionals to interpret and use recommendations in guidelines developed with the GRADE approach. J Clin Epidemiol. 2016 Apr;72:45-55. doi: 10.1016/j.jclinepi.2015.11.017. Epub 2016 Jan 6. Review. PubMed PMID: 26772609.

Schünemann H, Brożek J, Guyatt G, et al. GRADE handbook for grading quality of evidence and strength of recommendations. Updated October 2013. The GRADE Working Group, 2013. Available from http://gdt.guidelinedevelopment.org/central_prod/_design/client/handbook/handbook.html.

Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, Williams JW Jr, Kunz R, Craig J, Montori VM, Bossuyt P, Guyatt GH; GRADE Working Group. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008 May 17;336(7653):1106-10. doi: 10.1136/bmj.39500.677199.AE. Erratum in: BMJ. 2008 May 24;336(7654). doi: 10.1136/bmj.a139. Schünemann, A Holger J (corrected to Schünemann, Holger J). PubMed PMID: 18483053; PubMed Central PMCID: PMC2386626.

Wessels M, Hielkema L, van der Weijden T. How to identify existing literature on patients' knowledge, views, and values: the development of a validated search filter. J Med Libr Assoc. 2016 Oct;104(4):320-324. PubMed PMID: 27822157; PubMed Central PMCID: PMC5079497.

Search strategy

The search strategies are available on request. Please contact the Richtlijnendatabase.
