Introduction to the msSPChelpR package - from long dataset to SIR analyses

Introduction

This vignette explains how to use the functions:

calc_futime() to calculate follow-up time from index event until next event, death or end of follow-up date
pat_status() to determine patient status at end of follow-up
renumber_time_id() to calculate a consecutive index of events per case ID
reshape_long() to transpose dataset in wide format to data in long format
reshape_wide() to transpose dataset in long format to data in wide format (the wide format is required for many package functions)
sir_byfutime() to calculate standardized incidence ratios (SIRs) with custom grouping variables stratified by follow-up time
summarize_sir_results() to summarize detailed SIR results produced by sir_byfutime()
vital_status() to determine vital status (whether the patient is alive or dead) at end of follow-up

For some functions there are multiple variants of the same function using varying frameworks. They give the same results but will differ in execution time and memory use:

tidytable variants carry the suffix _tt()
tidyr variants carry the suffix _tidyr()

Recommended Workflows

Calculate follow-up times

Run the following steps in the order given to obtain accurate follow-up times.

Use the long version dataset
Filter all cases in the long version of the dataset that are relevant for your analysis. Make sure that:

for each case_id the index event (e.g. First Cancer FC) is still included and is the remaining row with the smallest time_id for that case (TUMID3 variable for ZfKD data, and SEQ_NUM for SEER data)
each case_id may or may not have a countable incident event (e.g. Second Primary Cancer SPC). If this event is to be counted, it should be the second entry per case_id (second smallest time_id)
in the long version dataset a count_var should indicate whether the countable incident event (SPC) has occurred or not. It is coded 0 for non-occurrence (or an event that is not counted) and 1 for a counted incident event.

Renumber filtered long dataset: On the filtered long dataset, run the helper function msSPChelpR::renumber_time_id_dt() (or the non-data.table variant msSPChelpR::renumber_time_id()). It renumbers all events per case_id and, provided the conditions of the filtering step are met, assigns time_var_new = 1 to each index event and time_var_new = 2 to each second event (the possibly countable incident event). SIR-related functions count the second event only if the row has count_var = 1 in addition to time_var_new = 2.
Reshape dataset: Run msSPChelpR::reshape_wide_dt() or the non-data.table variant msSPChelpR::reshape_wide() to transpose the dataset to wide format (one row per case_id, creating variables such as count_var.2).
Set flag for Second Primary Cancer diagnosis: After filtering and reshaping it is essential to set p_spc again. This variable will be used by later steps of the analysis.
Determine patient status at a defined end of follow-up by using the msSPChelpR::pat_status() function. This date for end of follow-up must:

be in “YYYY-MM-DD” format; it is always defined via the fu_end = parameter
precede the end of data collection. E.g. if the last incident events for the dataset you are using were collected at the end of 2014, your fu_end must be fu_end = "2014-12-15" or earlier.
Based on the newly calculated patient status, you might want to exclude cases for which patient status cannot be determined

Calculate follow-up time for the same dataset by using the msSPChelpR::calc_futime() function and the same fu_end as for step 6. By default, all functions of the msSPChelpR package require follow-up times as numeric years.
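The idea behind the follow-up time in the last step can be sketched in base R: follow-up runs from the first cancer diagnosis to the earliest of SPC diagnosis, death, or fu_end. This is a minimal illustration on invented toy data, not the implementation of calc_futime(); the data frame toy_wide, the column futime_yrs and the conversion of 365.25 days per year are assumptions made for this sketch.

```r
# Toy wide-format data; column names mirror the vignette, all values are invented.
toy_wide <- data.frame(
  fake_id      = c("A", "B", "C"),
  t_datediag.1 = as.Date(c("2000-01-15", "2005-06-15", "2010-03-15")),
  t_datediag.2 = as.Date(c("2004-01-15", NA, NA)),   # SPC date (NA = no SPC)
  datedeath.1  = as.Date(c(NA, "2009-06-15", NA)),   # death date (NA = alive)
  stringsAsFactors = FALSE
)
fu_end <- as.Date("2017-12-31")

# Earliest of SPC diagnosis, death and end of follow-up, in days since 1970.
# Missing dates are ignored via na.rm = TRUE.
end_day <- pmin(as.numeric(toy_wide$t_datediag.2),
                as.numeric(toy_wide$datedeath.1),
                as.numeric(fu_end),
                na.rm = TRUE)

# Follow-up time in years (assumption: 1 year = 365.25 days).
toy_wide$futime_yrs <- (end_day - as.numeric(toy_wide$t_datediag.1)) / 365.25

toy_wide[, c("fake_id", "futime_yrs")]
```

Patient A is censored at the SPC diagnosis, patient B at death, and patient C at fu_end.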

Calculate Standardized Incidence Ratios (SIR)

In order to calculate SIR using the package functions, the following data structure is needed: wide-format data wide_df with one row per patient who has encountered the index event (i.e. has been diagnosed with a first primary cancer FC).

The dataset wide_df needs to contain the following variables (columns) per patient (row):
- region_var - variable in wide_df that contains information on the region where the case was incident.
- age_var - variable in wide_df that contains information on the age group.
- sex_var - variable in wide_df that contains information on biological sex.
- year_var - variable in wide_df that contains information on the year or year-period when the case was incident.
- site_var - variable in wide_df that contains information on the diagnosis of the case (count event). Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD), but the coding system must be consistent between the dataset and the reference data.
- futime_var - variable in wide_df that contains the follow-up time per person in years, measured from the date of the first cancer to death, the date of the event (case), or the end of follow-up, whichever comes first. If you have not calculated the follow-up time yet, you can use the workflow described in the previous chapter.

If your data has the required structure, you can calculate and summarize SIRs with the following two steps:

Calculate SIR per SPC diagnosis with age-, sex-, region- and period-specific strata using the msSPChelpR::sir_byfutime() function. This calculation usually requires a reference dataset refrates_df that defines the population standard rates. refrates_df must use the same category coding for age, sex, region, year and cancer_site as age_var, sex_var, region_var, year_var and site_var.

The theory behind calculating stratified SIRs is covered in the chapter "Theory behind SIRs".

Summarize SIR results using the msSPChelpR::summarize_sir_results() function on the stratified sir results produced by the previous step.
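As a rough orientation for what these two steps compute, the following base R sketch derives an SIR as observed divided by expected cases, where the expected count is the sum over strata of person-years at risk times the reference rate. The toy strata, the helper sir_with_ci() and the exact Poisson confidence limits based on the chi-squared distribution are assumptions for illustration; they are not taken from the package source.

```r
# Hypothetical strata: observed SPC counts, person-years at risk (PYAR) and
# reference incidence rates per 100,000 person-years. All values are invented.
strata <- data.frame(
  agegroup      = c("50 - 54", "55 - 59", "60 - 64"),
  observed      = c(2, 5, 8),
  pyar          = c(1000, 2000, 2500),
  ref_rate_100k = c(50, 100, 200),
  stringsAsFactors = FALSE
)

# Expected cases per stratum = person-years x reference rate.
strata$expected <- strata$pyar * strata$ref_rate_100k / 1e5

# Pooled SIR with exact Poisson confidence limits (chi-squared formulation).
sir_with_ci <- function(observed, expected, alpha = 0.05) {
  o <- sum(observed)
  e <- sum(expected)
  lower <- if (o == 0) 0 else stats::qchisq(alpha / 2, df = 2 * o) / 2
  upper <- stats::qchisq(1 - alpha / 2, df = 2 * (o + 1)) / 2
  c(observed = o, expected = e, sir = o / e,
    sir_lci = lower / e, sir_uci = upper / e)
}

res <- sir_with_ci(strata$observed, strata$expected)
res
```

In this toy example 15 cases are observed against 7.5 expected, giving an SIR of 2.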

Theory behind SIRs

The next version of this vignette will explain in this chapter the theoretical considerations behind the calculation of SIRs.

Examples

SEER lung cancer

Step 1 - Long dataset

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
#> # A tibble: 113,999 × 16
#>    fake_id SEQ_NUM registry   sex   race  datebirth  t_datediag t_site_icd t_dco
#>    <chr>     <int> <chr>      <chr> <chr> <date>     <date>     <chr>      <chr>
#>  1 100004        1 SEER Reg … Male  White 1926-01-01 1992-07-15 C50        hist…
#>  2 100004        2 SEER Reg … Male  White 1926-01-01 2004-01-15 C54        hist…
#>  3 100004        3 SEER Reg … Male  White 1926-01-01 2006-06-15 C34        hist…
#>  4 100004        4 SEER Reg … Male  White 1926-01-01 2018-06-15 C14        DCO …
#>  5 100034        1 SEER Reg … Male  White 1979-01-01 2000-06-15 C50        hist…
#>  6 100037        1 SEER Reg … Fema… White 1938-01-01 1996-01-15 C54        hist…
#>  7 100038        1 SEER Reg … Male  White 1989-01-01 1991-04-15 C50        hist…
#>  8 100038        2 SEER Reg … Male  White 1989-01-01 2000-03-15 C80        hist…
#>  9 100039        1 SEER Reg … Fema… White 1946-01-01 2003-08-15 C50        hist…
#> 10 100039        2 SEER Reg … Fema… White 1946-01-01 2011-04-15 C34        hist…
#> # ℹ 113,989 more rows
#> # ℹ 7 more variables: t_hist <int>, fc_age <int>, datedeath <date>,
#> #   p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>, t_yeardiag <chr>

Step 2 - Filter long dataset

#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  #extract the ids as a plain character vector
  pull(fake_id)

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
#> # A tibble: 62,661 × 16
#>    fake_id SEQ_NUM registry   sex   race  datebirth  t_datediag t_site_icd t_dco
#>    <chr>     <int> <chr>      <chr> <chr> <date>     <date>     <chr>      <chr>
#>  1 100004        1 SEER Reg … Male  White 1926-01-01 1992-07-15 C50        hist…
#>  2 100004        2 SEER Reg … Male  White 1926-01-01 2004-01-15 C54        hist…
#>  3 100004        3 SEER Reg … Male  White 1926-01-01 2006-06-15 C34        hist…
#>  4 100004        4 SEER Reg … Male  White 1926-01-01 2018-06-15 C14        DCO …
#>  5 100039        1 SEER Reg … Fema… White 1946-01-01 2003-08-15 C50        hist…
#>  6 100039        2 SEER Reg … Fema… White 1946-01-01 2011-04-15 C34        hist…
#>  7 100039        3 SEER Reg … Fema… White 1946-01-01 2018-01-15 C80        hist…
#>  8 100073        1 SEER Reg … Male  White 1960-01-01 1993-11-15 C44        hist…
#>  9 100073        2 SEER Reg … Male  White 1960-01-01 2003-12-15 C34        hist…
#> 10 100143        1 SEER Reg … Male  White 1944-01-01 1992-03-15 C50        hist…
#> # ℹ 62,651 more rows
#> # ℹ 7 more variables: t_hist <int>, fc_age <int>, datedeath <date>,
#> #   p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>, t_yeardiag <chr>

Step 3 - Renumber `time_id`

renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid", 
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
   select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
#> # A tibble: 62,661 × 5
#>    fake_id sex    t_site_icd t_datediag t_tumid
#>    <chr>   <chr>  <chr>      <date>       <int>
#>  1 100004  Male   C50        1992-07-15       1
#>  2 100004  Male   C54        2004-01-15       2
#>  3 100004  Male   C34        2006-06-15       3
#>  4 100004  Male   C14        2018-06-15       4
#>  5 100039  Female C50        2003-08-15       1
#>  6 100039  Female C34        2011-04-15       2
#>  7 100039  Female C80        2018-01-15       3
#>  8 100073  Male   C44        1993-11-15       1
#>  9 100073  Male   C34        2003-12-15       2
#> 10 100143  Male   C50        1992-03-15       1
#> # ℹ 62,651 more rows
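In the output above the original SEQ_NUM values happen to be consecutive already. Renumbering matters when filtering has removed rows and left gaps. The following base R sketch shows the idea on invented toy data; it illustrates the concept only and is not the implementation of renumber_time_id(). The data frame toy_long is hypothetical.

```r
# Toy long-format data with gaps in SEQ_NUM, as might remain after filtering.
toy_long <- data.frame(
  fake_id    = c("A", "A", "A", "B", "B"),
  SEQ_NUM    = c(1L, 3L, 4L, 2L, 3L),
  t_site_icd = c("C50", "C34", "C14", "C34", "C80"),
  stringsAsFactors = FALSE
)

# Sort by case and original sequence number, then count rows within each case.
toy_long <- toy_long[order(toy_long$fake_id, toy_long$SEQ_NUM), ]
toy_long$t_tumid <- ave(toy_long$SEQ_NUM, toy_long$fake_id, FUN = seq_along)

toy_long
```

After this step every case starts at t_tumid = 1 and the second remaining event always has t_tumid = 2, regardless of its original SEQ_NUM.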

Step 4 - Reshape to wide dataset

usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#now the data is in the wide format as required by many package functions. 
#This means each case is a row, and the tumors of each case ID are 
#added as new columns, using the time_id as column name suffix.
usdata_wide
#> # A tibble: 31,997 × 136
#>    fake_id SEQ_NUM.1 registry.1            sex.1 race.1 datebirth.1 t_datediag.1
#>    <chr>       <int> <chr>                 <chr> <chr>  <date>      <date>      
#>  1 100004          1 SEER Reg 20 - Detroi… Male  White  1926-01-01  1992-07-15  
#>  2 100039          1 SEER Reg 02 - Connec… Fema… White  1946-01-01  2003-08-15  
#>  3 100073          1 SEER Reg 01 - San Fr… Male  White  1960-01-01  1993-11-15  
#>  4 100143          1 SEER Reg 02 - Connec… Male  White  1944-01-01  1992-03-15  
#>  5 100182          1 SEER Reg 02 - Connec… Male  Other  1927-01-01  1991-09-15  
#>  6 100197          1 SEER Reg 02 - Connec… Fema… White  1945-01-01  2012-06-15  
#>  7 100208          1 SEER Reg 02 - Connec… Male  White  1970-01-01  2019-11-15  
#>  8 100230          1 SEER Reg 01 - San Fr… Male  White  1947-01-01  1992-11-15  
#>  9 100234          1 SEER Reg 01 - San Fr… Male  White  1988-01-01  2010-02-15  
#> 10 100266          1 SEER Reg 01 - San Fr… Fema… White  1956-01-01  2010-07-15  
#> # ℹ 31,987 more rows
#> # ℹ 129 more variables: t_site_icd.1 <chr>, t_dco.1 <chr>, t_hist.1 <int>,
#> #   fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>, p_dodmin.1 <date>,
#> #   fc_agegroup.1 <chr>, t_yeardiag.1 <chr>, SEQ_NUM.2 <int>, registry.2 <chr>,
#> #   sex.2 <chr>, race.2 <chr>, datebirth.2 <date>, t_datediag.2 <date>,
#> #   t_site_icd.2 <chr>, t_dco.2 <chr>, t_hist.2 <int>, fc_age.2 <int>,
#> #   datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>, …
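To see where the .1 and .2 column suffixes come from, the same long-to-wide transposition can be sketched with base R's reshape() on a small invented dataset. This is only an illustration of the resulting layout, not a statement about how the reshape_wide variants are implemented; toy_long and toy_wide are hypothetical objects.

```r
# Toy long-format data: case A has two tumors, case B has one.
toy_long <- data.frame(
  fake_id    = c("A", "A", "B"),
  t_tumid    = c(1L, 2L, 1L),
  t_site_icd = c("C50", "C34", "C34"),
  t_yeardiag = c(1992L, 2006L, 2010L),
  stringsAsFactors = FALSE
)

# One row per fake_id; every other column is repeated once per t_tumid value
# and gets the time id appended as a suffix (default separator ".").
toy_wide <- reshape(toy_long,
                    idvar     = "fake_id",
                    timevar   = "t_tumid",
                    direction = "wide")

names(toy_wide)
toy_wide
```

Case B has no second tumor, so its t_site_icd.2 and t_yeardiag.2 are NA. This is the pattern the next step relies on to flag whether an SPC occurred.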

Step 5 - Recalculate `p_spc`


usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ "No SPC",
                         !is.na(t_site_icd.2)           ~ "SPC developed",
                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc (1 = SPC occurred, 0 = no SPC)
  dplyr::mutate(count_spc = dplyr::case_when(!is.na(t_site_icd.2)   ~ 1,
                            TRUE ~ 0))
usdata_wide %>%
   dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, 
                 t_datediag.1, t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 × 8
#>    fake_id sex.1  p_spc         count_spc t_site_icd.1 t_datediag.1 t_site_icd.2
#>    <chr>   <chr>  <chr>             <dbl> <chr>        <date>       <chr>       
#>  1 100004  Male   SPC developed         1 C50          1992-07-15   C54         
#>  2 100039  Female SPC developed         1 C50          2003-08-15   C34         
#>  3 100073  Male   SPC developed         1 C44          1993-11-15   C34         
#>  4 100143  Male   SPC developed         1 C50          1992-03-15   C34         
#>  5 100182  Male   SPC developed         1 C18          1991-09-15   C34         
#>  6 100197  Female SPC developed         1 C34          2012-06-15   C50         
#>  7 100208  Male   No SPC                0 C34          2019-11-15   NA          
#>  8 100230  Male   SPC developed         1 C44          1992-11-15   C34         
#>  9 100234  Male   No SPC                0 C34          2010-02-15   NA          
#> 10 100266  Female No SPC                0 C34          2010-07-15   NA          
#> # ℹ 31,987 more rows
#> # ℹ 1 more variable: t_datediag.2 <date>

Step 6 - Determine patient status at end of FU

usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = FALSE, check = TRUE, 
             as_labelled_factor = TRUE)
#> # A tibble: 10 × 3
#>    p_alive.1 p_status                                                          n
#>    <chr>     <fct>                                                         <int>
#>  1 Alive     Patient alive after FC (with or without following SPC after …  5986
#>  2 Alive     Patient alive after SPC                                       11421
#>  3 Alive     NA - Patient not born before end of FU                            4
#>  4 Alive     NA - Patient did not develop cancer before end of FU            873
#>  5 Dead      Patient alive after FC (with or without following SPC after …   909
#>  6 Dead      Patient alive after SPC                                        1294
#>  7 Dead      Patient dead after FC                                          6116
#>  8 Dead      Patient dead after SPC                                         5286
#>  9 Dead      NA - Patient did not develop cancer before end of FU             44
#> 10 Dead      NA - Patient date of death is missing                            64
#> # A tibble: 7 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6895
#> 2 Patient alive after SPC                                                12715
#> 3 Patient dead after FC                                                   6116
#> 4 Patient dead after SPC                                                  5286
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> 7 NA - Patient date of death is missing                                     64

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1, 
                 t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 × 8
#>    fake_id p_status p_alive.1 datedeath.1 t_site_icd.1 t_datediag.1 t_site_icd.2
#>    <chr>   <fct>    <chr>     <date>      <chr>        <date>       <chr>       
#>  1 100004  Patient… Alive     NA          C50          1992-07-15   C54         
#>  2 100039  Patient… Alive     NA          C50          2003-08-15   C34         
#>  3 100073  Patient… Dead      2012-06-01  C44          1993-11-15   C34         
#>  4 100143  Patient… Alive     NA          C50          1992-03-15   C34         
#>  5 100182  Patient… Alive     NA          C18          1991-09-15   C34         
#>  6 100197  Patient… Alive     NA          C34          2012-06-15   C50         
#>  7 100208  NA - Pa… Dead      2019-11-15  C34          2019-11-15   NA          
#>  8 100230  Patient… Alive     NA          C44          1992-11-15   C34         
#>  9 100234  Patient… Alive     NA          C34          2010-02-15   NA          
#> 10 100266  Patient… Dead      2010-07-15  C34          2010-07-15   NA          
#> # ℹ 31,987 more rows
#> # ℹ 1 more variable: t_datediag.2 <date>

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1", 
             check = TRUE, as_labelled_factor = TRUE)
#> # A tibble: 9 × 3
#>   p_alive.1 p_status                                                           n
#>   <chr>     <fct>                                                          <int>
#> 1 Alive     Patient alive after FC (with or without following SPC after e…  5986
#> 2 Alive     Patient alive after SPC                                        11421
#> 3 Alive     NA - Patient not born before end of FU                             4
#> 4 Alive     NA - Patient did not develop cancer before end of FU             873
#> 5 Dead      Patient alive after FC (with or without following SPC after e…   913
#> 6 Dead      Patient alive after SPC                                         1295
#> 7 Dead      Patient dead after FC                                           6138
#> 8 Dead      Patient dead after SPC                                          5323
#> 9 Dead      NA - Patient did not develop cancer before end of FU              44
#> # A tibble: 6 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6899
#> 2 Patient alive after SPC                                                12716
#> 3 Patient dead after FC                                                   6138
#> 4 Patient dead after SPC                                                  5323
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> # A tibble: 31,997 × 139
#>    fake_id SEQ_NUM.1 registry.1            sex.1 race.1 datebirth.1 t_datediag.1
#>    <chr>       <int> <chr>                 <chr> <chr>  <date>      <date>      
#>  1 100004          1 SEER Reg 20 - Detroi… Male  White  1926-01-01  1992-07-15  
#>  2 100039          1 SEER Reg 02 - Connec… Fema… White  1946-01-01  2003-08-15  
#>  3 100073          1 SEER Reg 01 - San Fr… Male  White  1960-01-01  1993-11-15  
#>  4 100143          1 SEER Reg 02 - Connec… Male  White  1944-01-01  1992-03-15  
#>  5 100182          1 SEER Reg 02 - Connec… Male  Other  1927-01-01  1991-09-15  
#>  6 100197          1 SEER Reg 02 - Connec… Fema… White  1945-01-01  2012-06-15  
#>  7 100208          1 SEER Reg 02 - Connec… Male  White  1970-01-01  2019-11-15  
#>  8 100230          1 SEER Reg 01 - San Fr… Male  White  1947-01-01  1992-11-15  
#>  9 100234          1 SEER Reg 01 - San Fr… Male  White  1988-01-01  2010-02-15  
#> 10 100266          1 SEER Reg 01 - San Fr… Fema… White  1956-01-01  2010-07-15  
#> # ℹ 31,987 more rows
#> # ℹ 132 more variables: t_site_icd.1 <chr>, t_dco.1 <chr>, t_hist.1 <int>,
#> #   fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>, p_dodmin.1 <date>,
#> #   fc_agegroup.1 <chr>, t_yeardiag.1 <chr>, SEQ_NUM.2 <int>, registry.2 <chr>,
#> #   sex.2 <chr>, race.2 <chr>, datebirth.2 <date>, t_datediag.2 <date>,
#> #   t_site_icd.2 <chr>, t_dco.2 <chr>, t_hist.2 <int>, fc_age.2 <int>,
#> #   datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>, …

Step 6b - Remove patients irrelevant to analysis depending on status

usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)
#> # A tibble: 4 × 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6895
#> 2 Patient alive after SPC                                                12715
#> 3 Patient dead after FC                                                   6116
#> 4 Patient dead after SPC                                                  5286

Step 7 - Calculate FU time

usdata_wide <- usdata_wide %>%
   calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
               dattype = "seer", time_unit = "years", 
               lifedat_var = "datedeath.1", 
               fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")
#> # A tibble: 4 × 5
#>   p_status                       mean_futime min_futime max_futime median_futime
#>   <fct>                                <dbl>      <dbl>      <dbl>         <dbl>
#> 1 Patient alive after FC (with …        9.56     0.0438       27.0          8.29
#> 2 Patient alive after SPC               8.70     0            26.9          7.50
#> 3 Patient dead after FC                 8.60     0            25.9          7.54
#> 4 Patient dead after SPC                6.29     0            25.3          5.17

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)
#> # A tibble: 31,012 × 7
#>    fake_id p_status  p_futimeyrs p_alive.1 datedeath.1 t_datediag.1 t_datediag.2
#>    <chr>   <fct>           <dbl> <chr>     <date>      <date>       <date>      
#>  1 100004  Patient …       11.5  Alive     NA          1992-07-15   2004-01-15  
#>  2 100039  Patient …        7.67 Alive     NA          2003-08-15   2011-04-15  
#>  3 100073  Patient …       10.1  Dead      2012-06-01  1993-11-15   2003-12-15  
#>  4 100143  Patient …        3.33 Alive     NA          1992-03-15   1995-07-15  
#>  5 100182  Patient …        7.08 Alive     NA          1991-09-15   1998-10-15  
#>  6 100197  Patient …        4.83 Alive     NA          2012-06-15   2017-04-15  
#>  7 100230  Patient …       11.0  Alive     NA          1992-11-15   2003-11-15  
#>  8 100234  Patient …        7.87 Alive     NA          2010-02-15   NA          
#>  9 100266  Patient …        0    Dead      2010-07-15  2010-07-15   NA          
#> 10 100274  Patient …        7.38 Dead      2011-06-01  2004-01-15   NA          
#> # ℹ 31,002 more rows

Step 8 - Calculate SIR

sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    race_var = "race.1",
    site_var = "t_site_icd.1", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)
#> Calculating SIR ■■■■■■■■■■■■                      36% | ETA:  2s
#> Calculating SIR ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      91% | ETA:  0s
#> Calculating SIR ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#> ℹ 30 strata are affected.
#>  - This might be caused by cases where SPC occured at the same day as first cancer.
#>  - You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> ! Check attribute `problems_not_empty` of results to see what strata are affected.
#>  
#> [INFO Unexpected Cases] There are observed cases in the results file that do not occur in the refrates_df.
#> ℹ 2665 strata are affected.
#> A possible explanation can be:
#>  - DCO cases or
#>  - diagnosis of second cancer occured in different time period than first cancer
#> ! Check attribute `notes_refcases` of results to see what strata are affected.
#> 
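The futime_breaks argument used above defines the follow-up intervals by which results are stratified. One common way to stratify person-time is to split each patient's follow-up across the intervals, so that a patient contributes to every interval they pass through. The base R sketch below illustrates that idea; whether sir_byfutime() allocates person-time in exactly this way is an assumption, and split_futime() is a hypothetical helper.

```r
# Same breaks as in the call above: 0-1 month, 1-2 months, 2-12 months,
# 1-5 years, 5-10 years, 10+ years (all expressed in years).
futime_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)
lower <- head(futime_breaks, -1)
upper <- tail(futime_breaks, -1)

# Person-time one patient contributes to each interval: the overlap of
# [0, futime] with [lower, upper), never negative.
split_futime <- function(futime) {
  pmax(pmin(futime, upper) - lower, 0)
}

# A patient with 7 years of follow-up.
contrib <- split_futime(7)
round(contrib, 4)
sum(contrib)
```

The contributions add up to the patient's total follow-up time, and intervals that start after the end of follow-up receive zero person-time.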

sircalc_results %>% print(n = 100)
#> # A tidytable: 421,430 × 22
#>     age    region sex   race  year  yvar_name yvar_label fu_time t_site observed
#>     <chr>  <chr>  <chr> <chr> <chr> <chr>     <chr>      <chr>   <chr>     <dbl>
#>   1 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C14           0
#>   2 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C18           0
#>   3 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C34           0
#>   4 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C44           0
#>   5 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C50           0
#>   6 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C54           0
#>   7 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C64           0
#>   8 00 - … SEER … Fema… Black 1990… total_var Overall    to 1 m… C80           0
#>   9 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C14           0
#>  10 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C18           0
#>  11 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C34           0
#>  12 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C44           0
#>  13 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C50           0
#>  14 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C54           0
#>  15 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C64           0
#>  16 00 - … SEER … Fema… Black 1990… total_var Overall    0.0833… C80           0
#>  17 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C14           0
#>  18 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C18           0
#>  19 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C34           0
#>  20 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C44           0
#>  21 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C50           0
#>  22 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C54           0
#>  23 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C64           0
#>  24 00 - … SEER … Fema… Black 1990… total_var Overall    0.167-… C80           0
#>  25 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C14           0
#>  26 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C18           0
#>  27 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C34           0
#>  28 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C44           0
#>  29 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C50           0
#>  30 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C54           0
#>  31 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C64           0
#>  32 00 - … SEER … Fema… Black 1990… total_var Overall    1-5 ye… C80           0
#>  33 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C14           0
#>  34 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C18           0
#>  35 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C34           0
#>  36 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C44           0
#>  37 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C50           0
#>  38 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C54           0
#>  39 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C64           0
#>  40 00 - … SEER … Fema… Black 1990… total_var Overall    5-10 y… C80           0
#>  41 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C14           0
#>  42 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C18           0
#>  43 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C34           1
#>  44 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C44           0
#>  45 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C50           0
#>  46 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C54           0
#>  47 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C64           0
#>  48 00 - … SEER … Fema… Black 1990… total_var Overall    10+ ye… C80           0
#>  49 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C14           0
#>  50 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C18           0
#>  51 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C34           1
#>  52 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C44           0
#>  53 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C50           0
#>  54 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C54           0
#>  55 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C64           0
#>  56 00 - … SEER … Fema… Black 1990… total_var Overall    Total … C80           0
#>  57 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C14           0
#>  58 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C18           0
#>  59 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C34           0
#>  60 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C44           0
#>  61 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C50           0
#>  62 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C54           0
#>  63 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C64           0
#>  64 00 - … SEER … Fema… Black 1990… race.1    Black      to 1 m… C80           0
#>  65 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C14           0
#>  66 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C18           0
#>  67 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C34           0
#>  68 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C44           0
#>  69 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C50           0
#>  70 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C54           0
#>  71 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C64           0
#>  72 00 - … SEER … Fema… Black 1990… race.1    Black      0.0833… C80           0
#>  73 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C14           0
#>  74 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C18           0
#>  75 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C34           0
#>  76 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C44           0
#>  77 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C50           0
#>  78 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C54           0
#>  79 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C64           0
#>  80 00 - … SEER … Fema… Black 1990… race.1    Black      0.167-… C80           0
#>  81 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C14           0
#>  82 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C18           0
#>  83 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C34           0
#>  84 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C44           0
#>  85 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C50           0
#>  86 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C54           0
#>  87 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C64           0
#>  88 00 - … SEER … Fema… Black 1990… race.1    Black      1-5 ye… C80           0
#>  89 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C14           0
#>  90 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C18           0
#>  91 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C34           0
#>  92 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C44           0
#>  93 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C50           0
#>  94 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C54           0
#>  95 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C64           0
#>  96 00 - … SEER … Fema… Black 1990… race.1    Black      5-10 y… C80           0
#>  97 00 - … SEER … Fema… Black 1990… race.1    Black      10+ ye… C14           0
#>  98 00 - … SEER … Fema… Black 1990… race.1    Black      10+ ye… C18           0
#>  99 00 - … SEER … Fema… Black 1990… race.1    Black      10+ ye… C34           1
#> 100 00 - … SEER … Fema… Black 1990… race.1    Black      10+ ye… C44           0
#> # ℹ 421,330 more rows
#> # ℹ 12 more variables: expected <dbl>, sir <dbl>, sir_lci <dbl>, sir_uci <dbl>,
#> #   pyar <dbl>, n_base <dbl>, ref_inc_cases <dbl>, ref_population_pyar <dbl>,
#> #   ref_inc_crude_rate <dbl>, fu_time_sort <int>, yvar_sort <int>,
#> #   warning <chr>

Step 9 - Summarize SIR results

# summarize_sir_results() is versatile; this example produces a summary with minimal output

sircalc_results %>%
  # summarize results across region, age, year, race and t_site
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long",  output_information = "minimal",
                        add_total_row = "only",  add_total_fu = "no",
                        collapse_ci = FALSE,  shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
                        ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
#> Warning: The results file `sir_df` contains observed cases in i_observed that do not occur in the refrates_df (ref_inc_cases).
#> Therefore calculation of the variables n_base and ref_population_pyar is ambiguous.
#> We take the first value of each variable. Expect small inconsistencies in the calculation of n_base, ref_population_pyar and ref_inc_crude_rate across strata.
#> ! If you want to know more, please check the `warnings` column of `sir_df`.
#> # A tidytable: 7 × 8
#>   yvar_label fu_time          fu_time_sort t_site observed expected   sir sir_ci
#>   <chr>      <chr>                   <int> <chr>     <dbl>    <dbl> <dbl> <chr> 
#> 1 Overall    to 1 month                  1 Total       306     20.6 14.9  13.25…
#> 2 Overall    0.0833-0.167 ye…            2 Total        74     20.4  3.62 2.84 …
#> 3 Overall    0.167-1 years               3 Total       717    196.   3.65 3.39 …
#> 4 Overall    1-5 years                   4 Total      2995    760.   3.94 3.8 -…
#> 5 Overall    5-10 years                  5 Total      3113    605.   5.14 4.96 …
#> 6 Overall    10+ years                   6 Total      4254    502.   8.47 8.22 …
#> 7 Overall    Total 0 to Inf …            7 Total     11459   2105.   5.44 5.34 …
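
The `sir` column in the summary above is the ratio of observed to expected cases, and `sir_ci` holds its confidence limits. The following is a minimal sketch of that arithmetic, assuming an exact Poisson interval based on the chi-square distribution; this section does not show how the package computes its limits internally, so treat the interval method as an assumption. The helper name `sir_exact_ci()` is hypothetical and not part of msSPChelpR. The inputs are the rounded `observed` and `expected` values printed in the first and last rows of the table, so results can differ slightly from the package output.

```r
# Hypothetical helper (not part of msSPChelpR): SIR with an exact Poisson CI.
# Assumption: limits are derived from chi-square quantiles (Garwood interval).
sir_exact_ci <- function(observed, expected, alpha = 0.05) {
  stopifnot(all(observed >= 0), all(expected > 0))
  # SIR = observed cases / expected cases
  sir <- observed / expected
  # Lower limit uses 2 * observed degrees of freedom
  sir_lci <- stats::qchisq(alpha / 2, df = 2 * observed) / 2 / expected
  # Upper limit uses 2 * (observed + 1) degrees of freedom
  sir_uci <- stats::qchisq(1 - alpha / 2, df = 2 * (observed + 1)) / 2 / expected
  data.frame(observed, expected, sir, sir_lci, sir_uci)
}

# Rounded values taken from rows 1 and 7 of the summary table above
res <- sir_exact_ci(observed = c(306, 11459), expected = c(20.6, 2105))
print(round(res, 2))
```

Because the expected counts are shown with only three significant digits in the printed table, use the unrounded `expected` column of the results when reproducing values exactly.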

Built with

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] msSPChelpR_0.9.1.9000 magrittr_2.0.3        dplyr_1.1.4          
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_1.8.8     compiler_4.3.2     tidyselect_1.2.0   stringr_1.5.1     
#>  [5] tidytable_0.10.2   tidyr_1.3.0        jquerylib_0.1.4    systemfonts_1.0.5 
#>  [9] textshaping_0.3.7  yaml_2.3.8         fastmap_1.1.1      R6_2.5.1          
#> [13] generics_0.1.3     sjlabelled_1.2.0   knitr_1.45         forcats_1.0.0     
#> [17] tibble_3.2.1       desc_1.4.3         insight_0.19.7     lubridate_1.9.3   
#> [21] bslib_0.6.1        pillar_1.9.0       rlang_1.1.3        utf8_1.2.4        
#> [25] cachem_1.0.8       stringi_1.8.3      xfun_0.41          fs_1.6.3          
#> [29] sass_0.4.8         timechange_0.3.0   memoise_2.0.1      cli_3.6.2         
#> [33] pkgdown_2.0.7      withr_3.0.0        digest_0.6.34      haven_2.5.4       
#> [37] hms_1.1.3          lifecycle_1.0.4    vctrs_0.6.5        data.table_1.14.10
#> [41] evaluate_0.23      glue_1.7.0         ragg_1.2.7         fansi_1.0.6       
#> [45] rmarkdown_2.25     purrr_1.0.2        tools_4.3.2        pkgconfig_2.0.3   
#> [49] htmltools_0.5.7

Introduction to the msSPChelpR package - from long dataset to SIR analyses

Marian Eberl

26 October 2020
