Introduction

This vignette explains how to use the functions:

  • calc_futime() to calculate follow-up time from index event until next event, death or end of follow-up date
  • pat_status() to determine patient status at end of follow-up
  • renumber_time_id() to calculate a consecutive index of events per case ID
  • reshape_long() to transpose dataset in wide format to data in long format
  • reshape_wide() to transpose dataset in long format to data in wide format (the wide format is required for many package functions)
  • sir_byfutime() to calculate standardized incidence ratios (SIRs) with custom grouping variables stratified by follow-up time
  • summarize_sir_results() to summarize detailed SIR results produced by sir_byfutime()
  • vital_status() to determine whether the patient is alive or dead at end of follow-up

Some functions come in multiple variants built on different frameworks. The variants give the same results but differ in execution time and memory use:

  • tidytable variants carry the suffix _tt()
  • tidyr variants carry the suffix _tidyr()

Calculate follow-up times

To obtain accurate follow-up times, run the following steps in the order given:

  1. Use the long version dataset

  2. Filter all cases in the long version of the dataset that are relevant for your analysis. Make sure that:

  • for each case_id the index event (e.g. First Cancer FC) is still included and is the remaining row with the smallest time_id (the TUMID3 variable for ZfKD data, and SEQ_NUM for SEER data)
  • each case_id may or may not have a countable incident event (e.g. Second Primary Cancer SPC). If it is to be counted, this event must be the second entry per case_id (second smallest time_id)
  • in the long version dataset a count_var indicates whether the countable incident event (SPC) has occurred. It is coded 0 for non-occurrence (or an event that is not counted) and 1 for a counted incident event.
  3. Renumber filtered long dataset: In the filtered long dataset, run the helper function msSPChelpR::renumber_time_id_dt() (or the non-data.table variant msSPChelpR::renumber_time_id()). It renumbers all events per case_id and (if the conditions from step 2 are fulfilled) assigns time_var_new = 1 to each index event and time_var_new = 2 to each second (possibly countable incident) event. Any SIR-related function will only count the second event if, in addition to time_var_new = 2, count_var = 1 is also true for this row.

  4. Reshape dataset: Run msSPChelpR::reshape_wide_dt() or the non-data.table variant msSPChelpR::reshape_wide(), so that the dataset is transposed to wide format (1 row per case_id, creating variables such as count_var.2).

  5. Set flag for Second Primary Cancer diagnosis: After filtering and reshaping it is essential to set p_spc again. This variable will be used by later steps of the analysis.

  6. Determine patient status at a defined end of follow-up by using the msSPChelpR::pat_status() function. The date for end of follow-up is always defined via the fu_end = parameter and must:

  • be in “YYYY-MM-DD” format

  • precede the end of data collection. E.g. if the last incident events for the dataset you are using were collected at the end of 2014, use fu_end = "2014-12-15" or earlier.

  Based on the newly calculated patient status, you might want to exclude cases for which patient status cannot be determined.

  7. Calculate follow-up time for the same dataset by using the msSPChelpR::calc_futime() function and the same fu_end as in step 6. By default, all functions of the msSPChelpR package expect follow-up times in numeric years.
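
The requirements on fu_end listed above can be checked before the date is passed to pat_status() and calc_futime(). The following is a minimal base R sketch; check_fu_end() and its last_collection argument are hypothetical names used for illustration and are not part of msSPChelpR.

```r
# Hypothetical helper (not part of msSPChelpR): validate an end-of-follow-up date.
check_fu_end <- function(fu_end, last_collection) {
  # fu_end must be a single string in strict "YYYY-MM-DD" format
  if (!is.character(fu_end) || length(fu_end) != 1 ||
      !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", fu_end)) {
    stop("fu_end must be a single string in 'YYYY-MM-DD' format")
  }
  fu_date <- as.Date(fu_end, format = "%Y-%m-%d")
  if (is.na(fu_date)) {
    stop("fu_end is not a valid calendar date")
  }
  # fu_end must not lie after the last date covered by data collection
  if (fu_date > as.Date(last_collection, format = "%Y-%m-%d")) {
    stop("fu_end lies after the end of data collection")
  }
  fu_date
}

# Data collected until the end of 2014, follow-up ends mid-December 2014
fu_date <- check_fu_end("2014-12-15", last_collection = "2014-12-31")
```

Running such a check once and reusing the same string for both function calls helps keep patient status and follow-up time consistent.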

Calculate Standardized Incidence Ratios (SIR)

In order to calculate SIR using the package functions, the following data structure is needed:

  • Wide format data wide_df with one row per patient that has encountered the index event (i.e. diagnosed with a first primary cancer FC)

  • The dataset wide_df needs to contain the following variables (columns) per patient (row):
    • region_var - variable in df that contains information on region where case was incident.
    • age_var - variable in df that contains information on age-group.
    • sex_var - variable in df that contains information on biological sex.
    • year_var - variable in df that contains information on year or year-period when case was incident.
    • site_var - variable in df that contains information on case (count event) diagnosis. Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD) but coding system between dataset and reference data must be coherent.
    • futime_var - variable in df that contains follow-up time per person (in years) from the date of first cancer until death, date of event (case) or end of FU date, whichever comes first. In case you have not calculated the FU time yet, you can use the workflow described in the previous chapter.
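
Before running the SIR calculation it can help to confirm that the wide dataset really has this structure. The following base R sketch uses a made-up toy dataset; the column names mirror those of the SEER example in this vignette, and the checks themselves are illustrative assumptions rather than something the package performs in this form.

```r
# Toy wide-format data with one row per patient (all values are made up).
wide_df <- data.frame(
  fake_id       = c("A1", "A2", "A3"),
  registry.1    = c("Reg 01", "Reg 01", "Reg 02"),
  fc_agegroup.1 = c("60 - 64", "65 - 69", "60 - 64"),
  sex.1         = c("Male", "Female", "Male"),
  t_yeardiag.1  = c("1990 - 1994", "1995 - 1999", "1990 - 1994"),
  t_site_icd.2  = c("C34", NA, "C50"),
  p_futimeyrs   = c(11.5, 7.46, 3.33),
  stringsAsFactors = FALSE
)

# Columns that will be passed as region, age group, sex, year, site and
# follow-up time variables (column names are illustrative).
required_cols <- c("registry.1", "fc_agegroup.1", "sex.1",
                   "t_yeardiag.1", "t_site_icd.2", "p_futimeyrs")

# Which required columns are absent from the data?
missing_cols <- setdiff(required_cols, names(wide_df))

# One row per patient, and follow-up time must be numeric and non-negative
one_row_per_case <- !any(duplicated(wide_df$fake_id))
futime_ok <- is.numeric(wide_df$p_futimeyrs) &&
  all(wide_df$p_futimeyrs >= 0, na.rm = TRUE)
```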

If your data has the required structure, you can calculate and summarize SIRs with the following two steps:

  1. Calculate SIR per SPC diagnosis with age-, sex-, region- and period-specific strata using the msSPChelpR::sir_byfutime() function. This calculation usually requires a reference dataset refrates_df that defines the population standard rates. refrates_df must use the same category coding for age, sex, region, year and cancer_site as age_var, sex_var, region_var, year_var and site_var.
  • The theory behind calculating stratified SIRs will be covered in the chapter "Theory behind SIRs".
  2. Summarize SIR results using the msSPChelpR::summarize_sir_results() function on the stratified SIR results produced by the previous step.
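
Mismatched category coding between the cohort and the reference rates is easy to overlook. As a rough illustration, the base R sketch below compares the category labels of two toy data frames. The data, the object names cohort, refrates_toy and var_map, and the check itself are assumptions made for this example; only the reference column names age and sex are taken from the description above.

```r
# Toy cohort and toy reference rates (all values are made up).
cohort <- data.frame(
  fc_agegroup.1 = c("60 - 64", "65 - 69", "70-74"),  # note the deviating "70-74"
  sex.1         = c("Male", "Female", "Male"),
  stringsAsFactors = FALSE
)
refrates_toy <- data.frame(
  age = c("60 - 64", "65 - 69", "70 - 74"),
  sex = c("Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

# Map each cohort column to the reference column it has to match
var_map <- c(fc_agegroup.1 = "age", sex.1 = "sex")

# For each pair, list cohort categories that do not occur in the reference data
unmatched <- lapply(names(var_map), function(v) {
  setdiff(unique(cohort[[v]]), unique(refrates_toy[[var_map[[v]]]]))
})
names(unmatched) <- names(var_map)

unmatched
```

Here the age group "70-74" is reported because the reference data spell it "70 - 74". Such categories would find no matching reference rate.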

Theory behind SIRs

The theoretical considerations behind how SIRs are calculated will be explained in this chapter in the next version of this vignette.
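
As a minimal sketch of the usual definition: an SIR is the ratio of observed cases to the number of cases expected if the cohort experienced the reference population's incidence rates, with expected cases summed over strata as person-years at risk times the stratum-specific reference rate. All numbers below are made up. The exact Poisson (chi-square based) confidence limits are an assumption chosen for illustration; sir_byfutime() may compute its intervals differently.

```r
# Toy strata (made-up numbers): person-years at risk in the cohort, reference
# incidence rates per 100,000 person-years, and observed second cancers.
strata <- data.frame(
  agegroup = c("50 - 54", "55 - 59", "60 - 64"),
  pyar     = c(1200, 1500, 900),
  ref_rate = c(50, 80, 120),
  observed = c(1, 2, 3)
)

# Expected cases per stratum = person-years * reference rate
strata$expected <- strata$pyar * strata$ref_rate / 1e5

O <- sum(strata$observed)   # total observed cases: 6
E <- sum(strata$expected)   # total expected cases: 0.6 + 1.2 + 1.08 = 2.88
sir <- O / E

# Exact Poisson confidence limits for the SIR (assumed method)
alpha <- 0.05
sir_lci <- qchisq(alpha / 2, df = 2 * O) / (2 * E)
sir_uci <- qchisq(1 - alpha / 2, df = 2 * (O + 1)) / (2 * E)

round(c(sir = sir, lci = sir_lci, uci = sir_uci), 2)
```

An SIR above 1 means more cases were observed in the cohort than expected from the reference rates.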

Examples

SEER lung cancer

Step 1 - Long dataset

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
#> # A tibble: 113,999 × 15
#>    fake_id SEQ_NUM registry      sex   race  datebirth  t_datediag t_sit…¹ t_dco
#>    <chr>     <int> <chr>         <chr> <chr> <date>     <date>     <chr>   <chr>
#>  1 100004        1 SEER Reg 20 … Male  White 1926-01-01 1992-07-15 C50     hist…
#>  2 100004        2 SEER Reg 20 … Male  White 1926-01-01 2004-01-15 C54     hist…
#>  3 100004        3 SEER Reg 20 … Male  White 1926-01-01 2006-06-15 C34     hist…
#>  4 100004        4 SEER Reg 20 … Male  White 1926-01-01 2018-06-15 C14     DCO …
#>  5 100034        1 SEER Reg 21 … Male  White 1979-01-01 2000-06-15 C50     hist…
#>  6 100037        1 SEER Reg 01 … Fema… White 1938-01-01 1996-01-15 C54     hist…
#>  7 100038        1 SEER Reg 20 … Male  White 1989-01-01 1991-04-15 C50     hist…
#>  8 100038        2 SEER Reg 20 … Male  White 1989-01-01 2000-03-15 C80     hist…
#>  9 100039        1 SEER Reg 02 … Fema… White 1946-01-01 2003-08-15 C50     hist…
#> 10 100039        2 SEER Reg 02 … Fema… White 1946-01-01 2011-04-15 C34     hist…
#> # … with 113,989 more rows, 6 more variables: fc_age <int>, datedeath <date>,
#> #   p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>, t_yeardiag <chr>, and
#> #   abbreviated variable name ¹​t_site_icd

Step 2 - Filter long dataset

#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  #extract the case IDs as a plain character vector
  pull(fake_id) %>%
  unique()

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
#> # A tibble: 62,661 × 15
#>    fake_id SEQ_NUM registry      sex   race  datebirth  t_datediag t_sit…¹ t_dco
#>    <chr>     <int> <chr>         <chr> <chr> <date>     <date>     <chr>   <chr>
#>  1 100004        1 SEER Reg 20 … Male  White 1926-01-01 1992-07-15 C50     hist…
#>  2 100004        2 SEER Reg 20 … Male  White 1926-01-01 2004-01-15 C54     hist…
#>  3 100004        3 SEER Reg 20 … Male  White 1926-01-01 2006-06-15 C34     hist…
#>  4 100004        4 SEER Reg 20 … Male  White 1926-01-01 2018-06-15 C14     DCO …
#>  5 100039        1 SEER Reg 02 … Fema… White 1946-01-01 2003-08-15 C50     hist…
#>  6 100039        2 SEER Reg 02 … Fema… White 1946-01-01 2011-04-15 C34     hist…
#>  7 100039        3 SEER Reg 02 … Fema… White 1946-01-01 2018-01-15 C80     hist…
#>  8 100073        1 SEER Reg 01 … Male  White 1960-01-01 1993-11-15 C44     hist…
#>  9 100073        2 SEER Reg 01 … Male  White 1960-01-01 2003-12-15 C34     hist…
#> 10 100143        1 SEER Reg 02 … Male  White 1944-01-01 1992-03-15 C50     hist…
#> # … with 62,651 more rows, 6 more variables: fc_age <int>, datedeath <date>,
#> #   p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>, t_yeardiag <chr>, and
#> #   abbreviated variable name ¹​t_site_icd

Step 3 - Renumber time_id

renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid", 
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
   select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
#> # A tibble: 62,661 × 5
#>    fake_id sex    t_site_icd t_datediag t_tumid
#>    <chr>   <chr>  <chr>      <date>       <int>
#>  1 100004  Male   C50        1992-07-15       1
#>  2 100004  Male   C54        2004-01-15       2
#>  3 100004  Male   C34        2006-06-15       3
#>  4 100004  Male   C14        2018-06-15       4
#>  5 100039  Female C50        2003-08-15       1
#>  6 100039  Female C34        2011-04-15       2
#>  7 100039  Female C80        2018-01-15       3
#>  8 100073  Male   C44        1993-11-15       1
#>  9 100073  Male   C34        2003-12-15       2
#> 10 100143  Male   C50        1992-03-15       1
#> # … with 62,651 more rows
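
In the output above the new index equals SEQ_NUM because filtering by patient kept every tumor of each case. Renumbering matters once individual tumors are dropped and the remaining sequence numbers have gaps. The base R sketch below illustrates the idea on made-up data; it is not the package's implementation.

```r
# Toy long data after filtering: the original sequence numbers have gaps.
long_df <- data.frame(
  fake_id    = c("A1", "A1", "A2", "A2", "A2"),
  SEQ_NUM    = c(2L, 4L, 1L, 3L, 5L),
  t_site_icd = c("C50", "C34", "C34", "C18", "C80"),
  stringsAsFactors = FALSE
)

# Sort by case and original sequence number ...
long_df <- long_df[order(long_df$fake_id, long_df$SEQ_NUM), ]

# ... then assign a consecutive index 1, 2, 3, ... within each case
long_df$t_tumid <- ave(long_df$SEQ_NUM, long_df$fake_id, FUN = seq_along)

long_df
```

After this step the first remaining tumor of every case has index 1 and the next one has index 2, regardless of the original numbering.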

Step 4 - Reshape to wide dataset

usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#now the data is in the wide format as required by many package functions. 
#This means each case is one row, and additional tumors per case ID are 
#added as new columns, using the time_id as column name suffix.
usdata_wide
#> # A tibble: 31,997 × 127
#>    fake_id SEQ_NUM.1 regist…¹ sex.1 race.1 datebirt…² t_datedi…³ t_sit…⁴ t_dco.1
#>    <chr>       <int> <chr>    <chr> <chr>  <date>     <date>     <chr>   <chr>  
#>  1 100004          1 SEER Re… Male  White  1926-01-01 1992-07-15 C50     histol…
#>  2 100039          1 SEER Re… Fema… White  1946-01-01 2003-08-15 C50     histol…
#>  3 100073          1 SEER Re… Male  White  1960-01-01 1993-11-15 C44     histol…
#>  4 100143          1 SEER Re… Male  White  1944-01-01 1992-03-15 C50     histol…
#>  5 100182          1 SEER Re… Male  Other  1927-01-01 1991-09-15 C18     histol…
#>  6 100197          1 SEER Re… Fema… White  1945-01-01 2012-06-15 C34     histol…
#>  7 100208          1 SEER Re… Male  White  1970-01-01 2019-11-15 C34     histol…
#>  8 100230          1 SEER Re… Male  White  1947-01-01 1992-11-15 C44     histol…
#>  9 100234          1 SEER Re… Male  White  1988-01-01 2010-02-15 C34     DCO ca…
#> 10 100266          1 SEER Re… Fema… White  1956-01-01 2010-07-15 C34     histol…
#> # … with 31,987 more rows, 118 more variables: fc_age.1 <int>,
#> #   datedeath.1 <date>, p_alive.1 <chr>, p_dodmin.1 <date>,
#> #   fc_agegroup.1 <chr>, t_yeardiag.1 <chr>, SEQ_NUM.2 <int>, registry.2 <chr>,
#> #   sex.2 <chr>, race.2 <chr>, datebirth.2 <date>, t_datediag.2 <date>,
#> #   t_site_icd.2 <chr>, t_dco.2 <chr>, fc_age.2 <int>, datedeath.2 <date>,
#> #   p_alive.2 <chr>, p_dodmin.2 <date>, fc_agegroup.2 <chr>,
#> #   t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>, sex.3 <chr>, …
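
The long-to-wide transposition can be illustrated on a small scale with base R's stats::reshape(). This sketch uses made-up data and only shows the resulting layout; the package functions are not implemented this way as far as this vignette states.

```r
# Toy long data: patient A1 has two tumors, patient A2 has one.
long_df <- data.frame(
  fake_id    = c("A1", "A1", "A2"),
  t_tumid    = c(1L, 2L, 1L),
  t_site_icd = c("C50", "C34", "C34"),
  t_yeardiag = c(1992L, 2004L, 2010L),
  stringsAsFactors = FALSE
)

# One row per fake_id; every other variable gets the time index as suffix
wide_df <- reshape(long_df, idvar = "fake_id", timevar = "t_tumid",
                   direction = "wide", sep = ".")

wide_df
```

Patient A2 has no second tumor, so t_site_icd.2 and t_yeardiag.2 are NA for that row. The same pattern explains the NA values in the .2 columns of usdata_wide.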

Step 5 - Recalculate p_spc


usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ "No SPC",
                         !is.na(t_site_icd.2)           ~ "SPC developed",
                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc
  #(1 = SPC developed and counted, 0 = no SPC)
  dplyr::mutate(count_spc = dplyr::case_when(!is.na(t_site_icd.2)  ~ 1,
                            TRUE ~ 0))
usdata_wide %>%
   dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, 
                 t_datediag.1, t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 × 8
#>    fake_id sex.1  p_spc         count_spc t_site…¹ t_datedi…² t_sit…³ t_datedi…⁴
#>    <chr>   <chr>  <chr>             <dbl> <chr>    <date>     <chr>   <date>    
#>  1 100004  Male   SPC developed         1 C50      1992-07-15 C54     2004-01-15
#>  2 100039  Female SPC developed         1 C50      2003-08-15 C34     2011-04-15
#>  3 100073  Male   SPC developed         1 C44      1993-11-15 C34     2003-12-15
#>  4 100143  Male   SPC developed         1 C50      1992-03-15 C34     1995-07-15
#>  5 100182  Male   SPC developed         1 C18      1991-09-15 C34     1998-10-15
#>  6 100197  Female SPC developed         1 C34      2012-06-15 C50     2017-04-15
#>  7 100208  Male   No SPC                0 C34      2019-11-15 NA      NA        
#>  8 100230  Male   SPC developed         1 C44      1992-11-15 C34     2003-11-15
#>  9 100234  Male   No SPC                0 C34      2010-02-15 NA      NA        
#> 10 100266  Female No SPC                0 C34      2010-07-15 NA      NA        
#> # … with 31,987 more rows, and abbreviated variable names ¹​t_site_icd.1,
#> #   ²​t_datediag.1, ³​t_site_icd.2, ⁴​t_datediag.2

Step 6 - Determine patient status at end of FU

usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = FALSE, check = TRUE, 
             as_labelled_factor = TRUE)
#> # A tibble: 10 × 3
#>    p_alive.1 p_status                                                          n
#>    <chr>     <fct>                                                         <int>
#>  1 Alive     Patient alive after FC (with or without following SPC after …  5940
#>  2 Alive     Patient alive after SPC                                       11316
#>  3 Alive     NA - Patient not born before end of FU                            4
#>  4 Alive     NA - Patient did not develop cancer before end of FU            849
#>  5 Dead      Patient alive after FC (with or without following SPC after …   863
#>  6 Dead      Patient alive after SPC                                        1360
#>  7 Dead      Patient dead after FC                                          6208
#>  8 Dead      Patient dead after SPC                                         5325
#>  9 Dead      NA - Patient did not develop cancer before end of FU             68
#> 10 Dead      NA - Patient date of death is missing                            64
#> # A tibble: 7 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6803
#> 2 Patient alive after SPC                                                12676
#> 3 Patient dead after FC                                                   6208
#> 4 Patient dead after SPC                                                  5325
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> 7 NA - Patient date of death is missing                                     64

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1, 
                 t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 × 8
#>    fake_id p_status     p_ali…¹ datedeat…² t_sit…³ t_datedi…⁴ t_sit…⁵ t_datedi…⁶
#>    <chr>   <fct>        <chr>   <date>     <chr>   <date>     <chr>   <date>    
#>  1 100004  Patient ali… Alive   NA         C50     1992-07-15 C54     2004-01-15
#>  2 100039  Patient ali… Alive   NA         C50     2003-08-15 C34     2011-04-15
#>  3 100073  Patient dea… Dead    2005-06-01 C44     1993-11-15 C34     2003-12-15
#>  4 100143  Patient ali… Alive   NA         C50     1992-03-15 C34     1995-07-15
#>  5 100182  Patient dea… Dead    2007-05-01 C18     1991-09-15 C34     1998-10-15
#>  6 100197  Patient ali… Alive   NA         C34     2012-06-15 C50     2017-04-15
#>  7 100208  NA - Patien… Alive   NA         C34     2019-11-15 NA      NA        
#>  8 100230  Patient dea… Dead    2008-05-01 C44     1992-11-15 C34     2003-11-15
#>  9 100234  Patient dea… Dead    2015-07-01 C34     2010-02-15 NA      NA        
#> 10 100266  Patient ali… Alive   NA         C34     2010-07-15 NA      NA        
#> # … with 31,987 more rows, and abbreviated variable names ¹​p_alive.1,
#> #   ²​datedeath.1, ³​t_site_icd.1, ⁴​t_datediag.1, ⁵​t_site_icd.2, ⁶​t_datediag.2

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1", 
             check = TRUE, as_labelled_factor = TRUE)
#> # A tibble: 9 × 3
#>   p_alive.1 p_status                                                           n
#>   <chr>     <fct>                                                          <int>
#> 1 Alive     Patient alive after FC (with or without following SPC after e…  5940
#> 2 Alive     Patient alive after SPC                                        11316
#> 3 Alive     NA - Patient not born before end of FU                             4
#> 4 Alive     NA - Patient did not develop cancer before end of FU             849
#> 5 Dead      Patient alive after FC (with or without following SPC after e…   867
#> 6 Dead      Patient alive after SPC                                         1361
#> 7 Dead      Patient dead after FC                                           6230
#> 8 Dead      Patient dead after SPC                                          5362
#> 9 Dead      NA - Patient did not develop cancer before end of FU              68
#> # A tibble: 6 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6807
#> 2 Patient alive after SPC                                                12677
#> 3 Patient dead after FC                                                   6230
#> 4 Patient dead after SPC                                                  5362
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> # A tibble: 31,997 × 130
#>    fake_id SEQ_NUM.1 regist…¹ sex.1 race.1 datebirt…² t_datedi…³ t_sit…⁴ t_dco.1
#>    <chr>       <int> <chr>    <chr> <chr>  <date>     <date>     <chr>   <chr>  
#>  1 100004          1 SEER Re… Male  White  1926-01-01 1992-07-15 C50     histol…
#>  2 100039          1 SEER Re… Fema… White  1946-01-01 2003-08-15 C50     histol…
#>  3 100073          1 SEER Re… Male  White  1960-01-01 1993-11-15 C44     histol…
#>  4 100143          1 SEER Re… Male  White  1944-01-01 1992-03-15 C50     histol…
#>  5 100182          1 SEER Re… Male  Other  1927-01-01 1991-09-15 C18     histol…
#>  6 100197          1 SEER Re… Fema… White  1945-01-01 2012-06-15 C34     histol…
#>  7 100208          1 SEER Re… Male  White  1970-01-01 2019-11-15 C34     histol…
#>  8 100230          1 SEER Re… Male  White  1947-01-01 1992-11-15 C44     histol…
#>  9 100234          1 SEER Re… Male  White  1988-01-01 2010-02-15 C34     DCO ca…
#> 10 100266          1 SEER Re… Fema… White  1956-01-01 2010-07-15 C34     histol…
#> # … with 31,987 more rows, 121 more variables: fc_age.1 <int>,
#> #   datedeath.1 <date>, p_alive.1 <chr>, p_dodmin.1 <date>,
#> #   fc_agegroup.1 <chr>, t_yeardiag.1 <chr>, SEQ_NUM.2 <int>, registry.2 <chr>,
#> #   sex.2 <chr>, race.2 <chr>, datebirth.2 <date>, t_datediag.2 <date>,
#> #   t_site_icd.2 <chr>, t_dco.2 <chr>, fc_age.2 <int>, datedeath.2 <date>,
#> #   p_alive.2 <chr>, p_dodmin.2 <date>, fc_agegroup.2 <chr>,
#> #   t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>, sex.3 <chr>, …

Step 6b - Remove patients irrelevant to analysis depending on status

usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)
#> # A tibble: 4 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6803
#> 2 Patient alive after SPC                                                12676
#> 3 Patient dead after FC                                                   6208
#> 4 Patient dead after SPC                                                  5325

Step 7 - Calculate FU time

usdata_wide <- usdata_wide %>%
   calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
               dattype = "seer", time_unit = "years", 
               lifedat_var = "datedeath.1", 
               fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")
#> # A tibble: 4 × 5
#>   p_status                                       mean_…¹ min_f…² max_f…³ media…⁴
#>   <fct>                                            <dbl>   <dbl>   <dbl>   <dbl>
#> 1 Patient alive after FC (with or without follo…    9.58  0.0438    27.0    8.29
#> 2 Patient alive after SPC                           8.69  0         26.9    7.50
#> 3 Patient dead after FC                             8.54  0         25.8    7.47
#> 4 Patient dead after SPC                            6.33  0         26.5    5.08
#> # … with abbreviated variable names ¹​mean_futime, ²​min_futime, ³​max_futime,
#> #   ⁴​median_futime

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)
#> # A tibble: 31,012 × 7
#>    fake_id p_status             p_fut…¹ p_ali…² datedeat…³ t_datedi…⁴ t_datedi…⁵
#>    <chr>   <fct>                  <dbl> <chr>   <date>     <date>     <date>    
#>  1 100004  Patient alive after…   11.5  Alive   NA         1992-07-15 2004-01-15
#>  2 100039  Patient alive after…    7.67 Alive   NA         2003-08-15 2011-04-15
#>  3 100073  Patient dead after …   10.1  Dead    2005-06-01 1993-11-15 2003-12-15
#>  4 100143  Patient alive after…    3.33 Alive   NA         1992-03-15 1995-07-15
#>  5 100182  Patient dead after …    7.08 Dead    2007-05-01 1991-09-15 1998-10-15
#>  6 100197  Patient alive after…    4.83 Alive   NA         2012-06-15 2017-04-15
#>  7 100230  Patient dead after …   11.0  Dead    2008-05-01 1992-11-15 2003-11-15
#>  8 100234  Patient dead after …    5.37 Dead    2015-07-01 2010-02-15 NA        
#>  9 100266  Patient alive after…    7.46 Alive   NA         2010-07-15 NA        
#> 10 100274  Patient dead after …    2.38 Dead    2006-06-01 2004-01-15 NA        
#> # … with 31,002 more rows, and abbreviated variable names ¹​p_futimeyrs,
#> #   ²​p_alive.1, ³​datedeath.1, ⁴​t_datediag.1, ⁵​t_datediag.2
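
The follow-up times shown above can be cross-checked by hand. The sketch below takes three patients from the output (100004, 100234 and 100266) and computes the time from first cancer to the earliest of SPC diagnosis, death and fu_end. Dividing the number of days by 365.25 is an assumption made for this illustration; calc_futime() may convert days to years differently, although the rounded values agree for these rows.

```r
# Three patients copied from the output above
toy <- data.frame(
  fake_id      = c("100004", "100234", "100266"),
  t_datediag.1 = as.Date(c("1992-07-15", "2010-02-15", "2010-07-15")),
  t_datediag.2 = as.Date(c("2004-01-15", NA, NA)),
  datedeath.1  = as.Date(c(NA, "2015-07-01", NA)),
  stringsAsFactors = FALSE
)
fu_end <- as.Date("2017-12-31")

# Follow-up ends at the earliest of SPC diagnosis, death or fu_end
# (dates are compared as day counts; missing dates are ignored)
end_day <- pmin(as.numeric(toy$t_datediag.2),
                as.numeric(toy$datedeath.1),
                as.numeric(fu_end),
                na.rm = TRUE)

# Follow-up time in years (assumed conversion: days / 365.25)
toy$futime_yrs <- (end_day - as.numeric(toy$t_datediag.1)) / 365.25

round(toy$futime_yrs, 2)
# → 11.50  5.37  7.46
```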

Step 8 - Calculate SIR

sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    race_var = "race.1",
    site_var = "t_site_icd.1", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)
#> Calculating SIR ■■■■■■■■■                         27% | ETA:  3s
#> Calculating SIR ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      91% | ETA:  0s
#> Calculating SIR ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#>  40 strata are affected.
#>  - This might be caused by cases where SPC occured at the same day as first cancer.
#>  - You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> ! Check attribute `problems_not_empty` of results to see what strata are affected.
#>  
#> [INFO Unexpected Cases] There are observed cases in the results file that do not occur in the refrates_df.
#>  2682 strata are affected.
#> A possible explanation can be:
#>  - DCO cases or
#>  - diagnosis of second cancer occured in different time period than first cancer
#> ! Check attribute `notes_refcases` of results to see what strata are affected.
#> 
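
The first message suggests checking for cases in which the SPC was diagnosed on the same day as the first cancer, since these contribute an observed event but zero follow-up time. A minimal base R sketch of such a check on made-up data follows. The object names are illustrative, and whether to exclude these cases is an analysis decision that the message leaves open.

```r
# Toy wide data (made up): A1 has FC and SPC diagnosed on the same day.
toy_wide <- data.frame(
  fake_id      = c("A1", "A2", "A3"),
  t_datediag.1 = as.Date(c("2001-03-15", "2005-06-15", "2010-02-15")),
  t_datediag.2 = as.Date(c("2001-03-15", "2009-01-15", NA)),
  stringsAsFactors = FALSE
)

# Flag cases with an SPC date identical to the first cancer date
same_day <- !is.na(toy_wide$t_datediag.2) &
  toy_wide$t_datediag.1 == toy_wide$t_datediag.2

# Number of affected cases, and the data without them
sum(same_day)
toy_checked <- toy_wide[!same_day, ]
```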

sircalc_results %>% print(n = 100)
#> # A tidytable: 421,296 × 22
#>     age     region      sex   race  year  yvar_…¹ yvar_…² fu_time t_site obser…³
#>     <chr>   <chr>       <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>    <dbl>
#>   1 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C14          0
#>   2 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C18          0
#>   3 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C34          0
#>   4 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C44          0
#>   5 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C50          0
#>   6 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C54          0
#>   7 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C64          0
#>   8 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall to 1 m… C80          0
#>   9 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C14          0
#>  10 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C18          0
#>  11 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C34          0
#>  12 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C44          0
#>  13 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C50          0
#>  14 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C54          0
#>  15 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C64          0
#>  16 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.0833… C80          0
#>  17 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C14          0
#>  18 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C18          0
#>  19 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C34          0
#>  20 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C44          0
#>  21 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C50          0
#>  22 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C54          0
#>  23 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C64          0
#>  24 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 0.167-… C80          0
#>  25 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C14          0
#>  26 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C18          0
#>  27 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C34          0
#>  28 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C44          0
#>  29 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C50          0
#>  30 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C54          0
#>  31 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C64          0
#>  32 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 1-5 ye… C80          0
#>  33 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C14          0
#>  34 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C18          0
#>  35 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C34          1
#>  36 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C44          0
#>  37 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C50          0
#>  38 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C54          0
#>  39 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C64          0
#>  40 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall 5-10 y… C80          0
#>  41 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C14          0
#>  42 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C18          0
#>  43 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C34          1
#>  44 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C44          0
#>  45 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C50          0
#>  46 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C54          0
#>  47 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C64          0
#>  48 00 - 04 SEER Reg 0… Fema… Black 1990… total_… Overall Total … C80          0
#>  49 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C14          0
#>  50 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C18          0
#>  51 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C34          0
#>  52 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C44          0
#>  53 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C50          0
#>  54 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C54          0
#>  55 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C64          0
#>  56 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   to 1 m… C80          0
#>  57 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C14          0
#>  58 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C18          0
#>  59 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C34          0
#>  60 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C44          0
#>  61 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C50          0
#>  62 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C54          0
#>  63 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C64          0
#>  64 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.0833… C80          0
#>  65 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C14          0
#>  66 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C18          0
#>  67 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C34          0
#>  68 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C44          0
#>  69 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C50          0
#>  70 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C54          0
#>  71 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C64          0
#>  72 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   0.167-… C80          0
#>  73 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C14          0
#>  74 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C18          0
#>  75 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C34          0
#>  76 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C44          0
#>  77 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C50          0
#>  78 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C54          0
#>  79 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C64          0
#>  80 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   1-5 ye… C80          0
#>  81 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C14          0
#>  82 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C18          0
#>  83 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C34          1
#>  84 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C44          0
#>  85 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C50          0
#>  86 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C54          0
#>  87 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C64          0
#>  88 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   5-10 y… C80          0
#>  89 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C14          0
#>  90 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C18          0
#>  91 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C34          1
#>  92 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C44          0
#>  93 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C50          0
#>  94 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C54          0
#>  95 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C64          0
#>  96 00 - 04 SEER Reg 0… Fema… Black 1990… race.1  Black   Total … C80          0
#>  97 00 - 04 SEER Reg 0… Fema… Black 1990… t_dco.1 histol… to 1 m… C14          0
#>  98 00 - 04 SEER Reg 0… Fema… Black 1990… t_dco.1 histol… to 1 m… C18          0
#>  99 00 - 04 SEER Reg 0… Fema… Black 1990… t_dco.1 histol… to 1 m… C34          0
#> 100 00 - 04 SEER Reg 0… Fema… Black 1990… t_dco.1 histol… to 1 m… C44          0
#> # … with 421,196 more rows, 12 more variables: expected <dbl>, sir <dbl>,
#> #   sir_lci <dbl>, sir_uci <dbl>, pyar <dbl>, n_base <dbl>,
#> #   ref_inc_cases <dbl>, ref_population_pyar <dbl>, ref_inc_crude_rate <dbl>,
#> #   fu_time_sort <int>, yvar_sort <int>, warning <chr>, and abbreviated
#> #   variable names ¹​yvar_name, ²​yvar_label, ³​observed

Step 9 - Summarize SIR results

# The summarize function is versatile. Here, for example, is the summary by follow-up time

sircalc_results %>%
  # summarize results across region, age, year, race and t_site
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long",  output_information = "minimal",
                        add_total_row = "only",  add_total_fu = "no",
                        collapse_ci = FALSE,  shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
                        ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
#> Warning: The results file `sir_df` contains observed cases in i_observed that do not occur in the refrates_df (ref_inc_cases).
#> Therefore calculation of the variables n_base and ref_population_pyar is ambiguous.
#> We take the first value of each variable. Expect small inconsistencies in the calculation of n_base, ref_population_pyar and ref_inc_crude_rate across strata.
#> ! If you want to know more, please check the `warnings` column of `sir_df`.
#> # A tidytable: 7 × 8
#>   yvar_label fu_time              fu_time_…¹ t_site obser…² expec…³   sir sir_ci
#>   <chr>      <chr>                     <int> <chr>    <dbl>   <dbl> <dbl> <chr> 
#> 1 Overall    to 1 month                    1 Total      327    20.6 15.9  14.23…
#> 2 Overall    0.0833-0.167 years            2 Total       80    20.4  3.92 3.11 …
#> 3 Overall    0.167-1 years                 3 Total      724   196.   3.69 3.43 …
#> 4 Overall    1-5 years                     4 Total     2998   760.   3.95 3.81 …
#> 5 Overall    5-10 years                    5 Total     3089   605.   5.1  4.92 …
#> 6 Overall    10+ years                     6 Total     4241   500.   8.49 8.23 …
#> 7 Overall    Total 0 to Inf years          7 Total    11459  2102.   5.45 5.35 …
#> # … with abbreviated variable names ¹​fu_time_sort, ²​observed, ³​expected
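
The sir column in the summary can be checked by hand, since an SIR is the number of observed cases divided by the number of expected cases. The following is a minimal sketch of that arithmetic, not the package's own code. The helper sir_with_ci() is hypothetical, and the exact Poisson confidence limits (computed through the chi-squared distribution) are an assumption about how an interval at level alpha might be obtained; the method actually used by sir_byfutime() and summarize_sir_results() may differ. The input values are the rounded observed and expected counts printed above, so results will deviate slightly from the table.

```r
# Hypothetical helper, not part of msSPChelpR:
# SIR = observed / expected, with exact Poisson confidence limits (assumed method).
sir_with_ci <- function(observed, expected, alpha = 0.05) {
  stopifnot(all(observed >= 0), all(expected > 0))
  sir <- observed / expected
  # Exact limits via the chi-squared relationship to the Poisson distribution;
  # the lower limit is set to 0 when no cases were observed.
  lci <- ifelse(observed == 0, 0,
                stats::qchisq(alpha / 2, df = 2 * observed) / 2 / expected)
  uci <- stats::qchisq(1 - alpha / 2, df = 2 * (observed + 1)) / 2 / expected
  data.frame(observed, expected, sir = sir, sir_lci = lci, sir_uci = uci)
}

# Rounded counts taken from the "to 1 month" and "Total 0 to Inf years" rows
res <- sir_with_ci(observed = c(327, 11459), expected = c(20.6, 2102))
print(res)
```

Lowering alpha (for example to 0.01) widens the interval, which is the role the alpha argument plays in the summarize_sir_results() call above.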

Built with

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] msSPChelpR_0.9.0.9000 magrittr_2.0.3        dplyr_1.1.0          
#> 
#> loaded via a namespace (and not attached):
#>  [1] bslib_0.4.2       compiler_4.2.2    pillar_1.8.1      jquerylib_0.1.4  
#>  [5] forcats_1.0.0     tools_4.2.2       digest_0.6.31     timechange_0.2.0 
#>  [9] lubridate_1.9.2   jsonlite_1.8.4    evaluate_0.20     memoise_2.0.1    
#> [13] lifecycle_1.0.3   tibble_3.2.0      pkgconfig_2.0.3   rlang_1.0.6      
#> [17] cli_3.6.0         yaml_2.3.7        haven_2.5.2       pkgdown_2.0.7    
#> [21] xfun_0.37         fastmap_1.1.1     withr_2.5.0       stringr_1.5.0    
#> [25] knitr_1.42        hms_1.1.2         desc_1.4.2        generics_0.1.3   
#> [29] fs_1.6.1          vctrs_0.5.2       sass_0.4.5        systemfonts_1.0.4
#> [33] sjlabelled_1.2.0  rprojroot_2.0.3   tidyselect_1.2.0  data.table_1.14.8
#> [37] glue_1.6.2        R6_2.5.1          textshaping_0.3.6 fansi_1.0.4      
#> [41] rmarkdown_2.20    purrr_1.0.1       tidyr_1.3.0       ellipsis_0.3.2   
#> [45] htmltools_0.5.4   insight_0.19.0    tidytable_0.10.0  ragg_1.2.5       
#> [49] utf8_1.2.3        stringi_1.7.12    cachem_1.0.7