This is a user guide accompanying Masnadi F, Armelloni E N, Guicciardi S, Pellini G, Raicevich S, Mazzoldi C, Scanu M, Sabatini L, Tassetti A N, Ferrà C, Grati F, Bolognini L, Domenichetti F, Cacciamani R, Calì F, Polidori P, Fabi G, Luzi F, Giovanardi O, Bernarello V, Pasanisi E, Franceschini G, Breggion C, Bozzetta B, Sambo A, Prioli G, Gugnali A, Piccioni E, Fiori F, Caruso F, Scarcella G, “Relative survival scenarios: an application to undersized common sole (Solea solea L.) in a beam trawl fishery in the Mediterranean Sea”. It is intended to facilitate the use of SOLEA: Survival toOL on a scEnario bAsis.

The tool aids the interpretation of fishing discard survival data by means of a scenario approach.

Starting from the onboard vitality assessment, the fishing condition parameters and the delayed survival experiment, the tool identifies the main mortality drivers occurring within the reference fishery and uses them to define the scenarios on which overall survival is calculated.

Installation

Download the input data and R code from here by clicking the green “clone or download” button, and store them in a single folder.

To run the tool, open Data_analysis.R and go to lines 23-29 to set the basic input parameters.

  1. setwd: write the path to the folder where data and code are stored within the brackets. Alternatively, in RStudio the working directory can be set by pressing Ctrl+Shift+H and navigating to the right folder.

  2. surv_data: in the present code version, type “Relative” to analyze delayed survival data with the Cox proportional hazards model (Therneau and Grambsch 2000) and return overall survival in relative terms. Option for future development: by typing “Absolute”, the script will use the Kaplan-Meier model (Kaplan and Meier 1958) to assess delayed mortality and will give overall survival in absolute terms.

  3. n_scenarios: the number of scenarios you want to display. Note that a large number of scenarios (>4) requires a large amount of data.

  4. censor: the duration of the captive experiment (hours).

  5. filename: the name of the input file.

setwd("~/CNR/SOS/github/SOLEA") 
surv_data<-"Relative" ## Absolute if KM, relative if not
n_scenarios<-as.numeric(4)
censor<-as.numeric(120)
filename<-"Input_data.csv"
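
Before running the rest of the script, it can help to check that these parameters are consistent with the descriptions above. The following is only a minimal sketch and is not part of Data_analysis.R; the helper name check_inputs is hypothetical.

```r
# Hypothetical helper (not part of the SOLEA code): basic sanity checks
# on the input parameters described above.
check_inputs <- function(surv_data, n_scenarios, censor, filename) {
  problems <- character(0)
  if (!surv_data %in% c("Relative", "Absolute")) {
    problems <- c(problems, "surv_data must be \"Relative\" or \"Absolute\"")
  }
  if (!is.numeric(n_scenarios) || n_scenarios < 1) {
    problems <- c(problems, "n_scenarios must be a positive number")
  }
  if (!is.numeric(censor) || censor <= 0) {
    problems <- c(problems, "censor must be a positive number of hours")
  }
  if (!grepl("\\.csv$", filename)) {
    problems <- c(problems, "filename should end in .csv")
  }
  problems  # empty character vector when everything looks fine
}

check_inputs("Relative", 4, 120, "Input_data.csv")
```

An empty result means no problem was detected; otherwise each element describes one parameter to fix.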

You will need several packages; these should be installed automatically when you run the code for the first time. If R still reports a missing package, install it through the Tools menu in RStudio.


list.of.packages <- c("tidyverse", "caret","rpart","rpart.plot","e1071","randomForest","survival","survminer","data.table","Boruta")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

Data required

The tool requires one data file in .csv format (“Input_data.csv”). Each row of the file must contain the details of a single assessed specimen.

  • Survival data:
    1. Vitality_class: vitality category assessed onboard. See ICES (2014) for references.
    2. Status: onboard immediate mortality assessment, 0 for dead during sorting, 1 for alive during sorting.
    3. Survivability_days: period during which the specimen remained alive in captivity (number of days). If the specimen was not assessed for delayed mortality, fill with NA.
  • Fishing condition data: include every variable relevant to your fishery. If categories are needed, prefer letters to numbers.
print(summary(Data))
##   Towing_speed      Delta_T             Catch_weight  Air_exposure  
##  Min.   :5.500   Min.   :-16.200   Low        :932   Min.   :10.00  
##  1st Qu.:6.000   1st Qu.: -9.300   Medium_Low :247   1st Qu.:18.00  
##  Median :6.400   Median : -2.010   Medium_High:404   Median :25.00  
##  Mean   :6.546   Mean   : -2.888   High       :284   Mean   :28.41  
##  3rd Qu.:7.100   3rd Qu.:  2.500                     3rd Qu.:35.00  
##  Max.   :7.700   Max.   : 10.660                     Max.   :78.00  
##                                                                     
##  Vitality_class Towing_duration  Seabed_type Survivability_days status  
##  A   : 158      Min.   : 29.00   CFM:1259    Min.   :0.500      0:1085  
##  B   : 276      1st Qu.: 53.00   CMS:  67    1st Qu.:1.000      1: 782  
##  C   : 348      Median : 58.00   CSM: 535    Median :3.000              
##  Dead:1085      Mean   : 58.24   IFS:   6    Mean   :2.966              
##                 3rd Qu.: 65.00               3rd Qu.:5.000              
##                 Max.   :100.00               Max.   :5.000              
##                                              NA's   :1635
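
The requirements above can be checked before running the analysis. The sketch below is an illustration under stated assumptions: the helper name check_survival_columns is hypothetical, the column names follow the list in this guide (note that the summary above shows the status column in lower case, so adjust the names to your file), and the example rows are made up.

```r
# Hypothetical helper: checks the survival columns described above.
# Column names follow the list in this guide; adjust if your file differs.
check_survival_columns <- function(Data) {
  required <- c("Vitality_class", "Status", "Survivability_days")
  missing_cols <- setdiff(required, names(Data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Status must be 0 (dead during sorting) or 1 (alive during sorting)
  stopifnot(all(Data$Status %in% c(0, 1)))
  # Survivability_days is NA when delayed mortality was not assessed
  days <- Data$Survivability_days
  stopifnot(all(is.na(days) | days >= 0))
  invisible(TRUE)
}

# Made-up rows for illustration only (one row per specimen)
toy <- data.frame(
  Vitality_class     = c("A", "B", "Dead", "C"),
  Status             = c(1, 1, 0, 1),
  Survivability_days = c(5, 2.5, NA, NA),
  Air_exposure       = c(12, 30, 45, 22),
  Seabed_type        = c("CFM", "CSM", "CFM", "CMS")
)
check_survival_columns(toy)
```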

Core functions

From now on the user does not need to modify the code, except where explicitly stated in the present document. It is suggested to run it line by line.

After setting the input parameters, run the code up to line 53.

Collinearity(Data)
## [1] "Check coplot in Figures folder"

The subfolder “Plot&Figures” is created automatically. The .png coplot (Zuur et al. 2009), which indicates collinearity between variables, is stored there. If any value exceeds 0.7, you might decide to exclude one of the variables from the following steps. This must be done manually, by typing “-” followed by the variable name within the brackets (Towing_speed in the example). If more variables need to be removed, separate them with commas (e.g.: -Variable1, -Variable2, -Variable3).


Data<-Data %>% dplyr::select(-Towing_speed)
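
As a complement to reading the coplot, pairs of variables above the 0.7 threshold can also be listed numerically. This is a minimal base R sketch, not the Collinearity function of the tool: the helper name flag_collinear is hypothetical, Pearson correlation on numeric columns is an assumption, and the data are synthetic.

```r
# Flag pairs of numeric variables whose absolute Pearson correlation
# exceeds a threshold (0.7, as in the text). Base R only.
flag_collinear <- function(df, threshold = 0.7) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- cor(num, use = "pairwise.complete.obs")
  # Keep each pair once by looking at the upper triangle only
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
}

# Synthetic data: Towing_duration is built to track Towing_speed closely
set.seed(1)
speed <- runif(50, 5.5, 7.7)
toy <- data.frame(
  Towing_speed    = speed,
  Towing_duration = 10 * speed + rnorm(50, sd = 0.5),
  Air_exposure    = runif(50, 10, 78)
)
flag_collinear(toy)
```

Each returned row is a candidate pair from which one variable could be dropped with dplyr::select, as shown above.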

This is an automated procedure that analyzes the importance of the remaining variables and excludes those that are not informative.

## # A tibble: 5 x 2
##   name            value    
##   <chr>           <fct>    
## 1 Delta_T         Confirmed
## 2 Catch_weight    Confirmed
## 3 Air_exposure    Confirmed
## 4 Towing_duration Confirmed
## 5 Seabed_type     Confirmed

The random forest (RF) algorithm (Breiman 2001) is run on the selected features. The out-of-bag (OOB) estimate gives the error rate calculated on the confusion matrix.

## 
## Call:
##  randomForest(formula = status ~ ., data = db, importance = TRUE,      ntree = tree_best, mtry = try_best) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 25.44%
## Confusion matrix:
##     0   1 class.error
## 0 882 203   0.1870968
## 1 272 510   0.3478261
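
The error rates in this output follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in base R from the four counts printed above; only the object names are introduced here.

```r
# Confusion matrix copied from the output above
# (rows = observed class, columns = predicted class)
cm <- matrix(c(882, 203,
               272, 510),
             nrow = 2, byrow = TRUE,
             dimnames = list(observed = c("0", "1"),
                             predicted = c("0", "1")))

# Per-class error: misclassified / total observed in that class
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all misclassified / all specimens
oob_error <- 1 - sum(diag(cm)) / sum(cm)

round(class_error, 7)      # 0.1870968 0.3478261
round(100 * oob_error, 2)  # 25.44
```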

This is an additional analysis built on the RF model. It is performed automatically on all variables included in the analysis, and the figure is saved in .tiff format in the “Figures&Plots” subfolder. The Y axis shows the centered log odds, which represents the likelihood that the variable on the X axis reduces immediate survival. For more information see Friedman (2001).

This step divides the dataset into fishing scenarios. The numbers within the boxes are the number of live specimens (right) and dead specimens (left) within the node at the moment of sampling. 0 or 1 indicates the prevalence of dead or alive specimens within the node.
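
The idea of splitting specimens into scenarios with a classification tree can be sketched as follows. This is an illustration only: it assumes that scenarios correspond to the terminal nodes of an rpart tree (rpart is among the required packages, but the exact call used by the tool is not shown in this guide), and the data are synthetic.

```r
library(rpart)

# Synthetic data: survival at sorting is made less likely by long air exposure
set.seed(42)
n <- 400
toy <- data.frame(
  Air_exposure    = runif(n, 10, 78),
  Towing_duration = runif(n, 29, 100)
)
p_alive <- ifelse(toy$Air_exposure > 40, 0.15, 0.75)
toy$status <- factor(rbinom(n, 1, p_alive), levels = c(0, 1))

# Classification tree on immediate survival
fit <- rpart(status ~ Air_exposure + Towing_duration,
             data = toy, method = "class")

# fit$where gives, for each specimen, the row of fit$frame holding its leaf
toy$scenario <- fit$where

# Dead (0) and alive (1) counts per terminal node
table(scenario = toy$scenario, status = toy$status)
```

Each row of the resulting table plays the role of one box in the tree plot: the counts of dead and live specimens in that terminal node.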



This step is performed according to what was specified in the input parameters. The relationship between vitality class and delayed survival is assessed within the scenarios.
In the case of relative survival, a Cox proportional hazards model is fitted, which gives the following tables as a result. coef is the relative hazard compared to the reference class (A), se is the standard error of the coefficient, and ul and ll are the upper limit and the lower limit of the hazard distribution (95% percentile). sign indicates whether the value is significantly different from class A (s) or not (ns). Values are used in the following steps only if significant; otherwise the value of class A (1) is used.


## # A tibble: 2 x 1
##   i$Indicator $sign $coef   $se    $ul   $ll
##   <chr>       <chr> <dbl> <dbl>  <dbl> <dbl>
## 1 B           s      1.39 0.445 0.104  0.598
## 2 C           s      1.48 0.475 0.0895 0.575
## # A tibble: 2 x 1
##   i$Indicator $sign  $coef   $se   $ul   $ll
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl>
## 1 B           ns    -0.333 0.398 0.640 3.04 
## 2 C           s      1.03  0.420 0.157 0.813
## # A tibble: 2 x 1
##   i$Indicator $sign $coef   $se   $ul   $ll
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B           ns    0.256 0.494 0.294  2.04
## 2 C           ns    0.626 0.449 0.222  1.29
## # A tibble: 2 x 1
##   i$Indicator $sign $coef   $se   $ul   $ll
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B           ns    0.611 0.379 0.258 1.14 
## 2 C           s     1.22  0.485 0.114 0.761
## # A tibble: 2 x 1
##   i$Indicator $sign $coef   $se   $ul   $ll
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B           s     0.537 0.208 0.388 0.879
## 2 C           s     1.23  0.226 0.188 0.454
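
For readers unfamiliar with the model, the sketch below fits a Cox proportional hazards model of delayed survival on vitality class with the survival package. It is an illustration on synthetic data, not the tool's own code: the column names time and event are assumptions, and the way the tool derives sign, ul and ll is not reproduced here, so the standard coxph quantities are reported instead.

```r
library(survival)

# Synthetic delayed-survival data (illustration only): class C is given
# the highest death rate, class A the lowest
set.seed(7)
n <- 150
toy <- data.frame(
  Vitality_class = factor(rep(c("A", "B", "C"), each = n / 3))
)
rate <- c(A = 0.05, B = 0.15, C = 0.40)[as.character(toy$Vitality_class)]
time_to_death <- rexp(n, rate)          # days until death
censor_days <- 5                        # end of captivity (120 hours)
toy$time  <- pmin(time_to_death, censor_days)
toy$event <- as.integer(time_to_death <= censor_days)  # 1 = died, 0 = censored

# Cox proportional hazards model; class A is the reference level
fit <- coxph(Surv(time, event) ~ Vitality_class, data = toy)

s <- summary(fit)
data.frame(
  Indicator = sub("Vitality_class", "", rownames(s$coefficients)),
  coef      = s$coefficients[, "coef"],
  se        = s$coefficients[, "se(coef)"],
  p_value   = s$coefficients[, "Pr(>|z|)"],
  row.names = NULL
)
```

A positive coef means a higher hazard of death than class A during captivity.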

Future option: in the case of absolute survival, a Kaplan-Meier model is fitted, which gives the following tables as a result. coef is the survival rate after the censor period for the given vitality class, se is the standard error of the coefficient, and ul and ll are the upper limit and the lower limit of the survival distribution (95% percentile). sign indicates whether the value is significantly different from class A (s) or not (ns). Values are used in the following steps only if significant; otherwise the value of class A is used.


## # A tibble: 3 x 1
##   i$Indicator $sign $coef    $se   $ul   $ll
##   <chr>       <chr> <dbl>  <dbl> <dbl> <dbl>
## 1 A           s     0.710 0.0815 0.889 0.567
## 2 B           s     0.25  0.108  0.584 0.107
## 3 C           ns    0.308 0.128  0.695 0.136
## # A tibble: 3 x 1
##   i$Indicator $sign  $coef      $se    $ul     $ll
##   <chr>       <chr>  <dbl>    <dbl>  <dbl>   <dbl>
## 1 A           s     0.0833   0.0798  0.544  0.0128
## 2 B           ns    0.286    0.0986  0.562  0.145 
## 3 C           s     0      NaN      NA     NA     
## # A tibble: 3 x 1
##   i$Indicator $sign $coef   $se   $ul    $ll
##   <chr>       <chr> <dbl> <dbl> <dbl>  <dbl>
## 1 A           s     0.375 0.121 0.706 0.199 
## 2 B           ns    0.364 0.145 0.795 0.166 
## 3 C           ns    0.231 0.117 0.623 0.0855
## # A tibble: 3 x 1
##   i$Indicator $sign $coef    $se   $ul    $ll
##   <chr>       <chr> <dbl>  <dbl> <dbl>  <dbl>
## 1 A           s     0.703 0.0751 0.867 0.570 
## 2 B           ns    0.525 0.0790 0.705 0.391 
## 3 C           ns    0.222 0.139  0.754 0.0655
## # A tibble: 3 x 1
##   i$Indicator $sign $coef    $se   $ul   $ll
##   <chr>       <chr> <dbl>  <dbl> <dbl> <dbl>
## 1 A           s     0.573 0.0505 0.681 0.482
## 2 B           s     0.398 0.0522 0.514 0.308
## 3 C           s     0.188 0.0563 0.338 0.104
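
The Kaplan-Meier option can be illustrated in the same spirit. The sketch below estimates the survival probability of each vitality class at the end of the captivity period with survival::survfit. The data are synthetic, the column names time and event are assumptions, and the significance test against class A used by the tool is not reproduced.

```r
library(survival)

# Synthetic data (illustration only) for a 5-day captivity experiment
set.seed(11)
n <- 150
toy <- data.frame(
  Vitality_class = factor(rep(c("A", "B", "C"), each = n / 3))
)
rate <- c(A = 0.05, B = 0.15, C = 0.40)[as.character(toy$Vitality_class)]
time_to_death <- rexp(n, rate)
censor_days <- 5
toy$time  <- pmin(time_to_death, censor_days)
toy$event <- as.integer(time_to_death <= censor_days)  # 1 = died, 0 = censored

# Kaplan-Meier curves, one per vitality class
km <- survfit(Surv(time, event) ~ Vitality_class, data = toy)

# Survival probability at the end of the captivity period, with 95% limits
end <- summary(km, times = censor_days, extend = TRUE)
data.frame(
  Indicator = sub("Vitality_class=", "", end$strata),
  coef      = end$surv,
  se        = end$std.err,
  ll        = end$lower,
  ul        = end$upper
)
```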

The last step combines information on immediate mortality with delayed survival for the highlighted scenarios. The output displayed consists of three tables per scenario: the first gives the immediate survival (IS) results, the second gives the weighted delayed survival results, and the third gives the overall survival (Relative Survival, RS) with confidence limits.


## [1] "Scenario : 3"
## # A tibble: 1 x 3
## # Groups:   IS [1]
##   IS    Percentage Scenar
##   <chr>      <dbl> <chr> 
## 1 1          0.221 3     
## # A tibble: 3 x 3
##   Indicator Percentage Scenar
##   <fct>          <dbl> <chr> 
## 1 A              0.289 3     
## 2 B              0.258 3     
## 3 C              0.454 3     
## # A tibble: 1 x 4
##      RS upper_ci low_ci Scenario
##   <dbl>    <dbl>  <dbl> <chr>   
## 1 0.105    0.132 0.0863 3       
## [1] "Scenario : 4"
## # A tibble: 1 x 3
## # Groups:   IS [1]
##   IS    Percentage Scenar
##   <chr>      <dbl> <chr> 
## 1 1          0.546 4     
## # A tibble: 3 x 3
##   Indicator Percentage Scenar
##   <fct>          <dbl> <chr> 
## 1 A              0.198 4     
## 2 B              0.446 4     
## 3 C              0.356 4     
## # A tibble: 1 x 4
##      RS upper_ci low_ci Scenario
##   <dbl>    <dbl>  <dbl> <chr>   
## 1 0.273    0.338  0.232 4       
## [1] "Scenario : 6"
## # A tibble: 1 x 3
## # Groups:   IS [1]
##   IS    Percentage Scenar
##   <chr>      <dbl> <chr> 
## 1 1          0.437 6     
## # A tibble: 3 x 3
##   Indicator Percentage Scenar
##   <fct>          <dbl> <chr> 
## 1 A              0.135 6     
## 2 B              0.326 6     
## 3 C              0.539 6     
## # A tibble: 1 x 4
##      RS upper_ci low_ci Scenario
##   <dbl>    <dbl>  <dbl> <chr>   
## 1 0.198    0.198  0.198 6       
## [1] "Scenario : 7"
## # A tibble: 1 x 3
## # Groups:   IS [1]
##   IS    Percentage Scenar
##   <chr>      <dbl> <chr> 
## 1 1          0.716 7     
## # A tibble: 3 x 3
##   Indicator Percentage Scenar
##   <fct>          <dbl> <chr> 
## 1 A              0.182 7     
## 2 B              0.390 7     
## 3 C              0.428 7     
## # A tibble: 1 x 4
##      RS upper_ci low_ci Scenario
##   <dbl>    <dbl>  <dbl> <chr>   
## 1 0.335    0.436  0.274 7       
## [1] "Scenario : Aggregate"
## # A tibble: 1 x 3
## # Groups:   IS [1]
##   IS    Percentage Scenar   
##   <chr>      <dbl> <chr>    
## 1 1          0.419 Aggregate
## # A tibble: 3 x 3
##   Indicator Percentage Scenar   
##   <fct>          <dbl> <chr>    
## 1 A              0.202 Aggregate
## 2 B              0.353 Aggregate
## 3 C              0.445 Aggregate
## # A tibble: 1 x 4
##      RS upper_ci low_ci Scenario 
##   <dbl>    <dbl>  <dbl> <chr>    
## 1 0.228    0.268  0.194 Aggregate

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Friedman, J H. 2001. “Greedy function approximation: a gradient boosting machine.” Annals of Statistics 29 (5): 1189–1232. https://projecteuclid.org/euclid.aos/1013203451.

ICES. 2014. “Report of the Workshop on Methods for Estimating Discard Survival (WKMEDS). ICES HQ, Copenhagen, Denmark.” ICES CM 2014/ACOM:51. 114pp.

Kaplan, E. L., and Paul Meier. 1958. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association 53 (282): 457–81. https://doi.org/10.1080/01621459.1958.10501452.

Therneau, Terry M., and Patricia M. Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4757-3294-8.

Zuur, Alain F., Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed effects models and extensions in ecology with R. Vol. 58. Statistics for Biology and Health 12. Springer-Verlag New York, NY, USA. https://doi.org/10.1007/978-0-387-87458-6.