The logic behind ENMTML

We structured ENMTML as a single function with multiple arguments, which, once filled, require a single Ctrl+R to fit, project, evaluate models and present them to users in a clear and simple way.

The main function (ENMTML) has several arguments, which users need to specify according to their modeling needs.

Because we know this is not a simple task, our paper cites the publications that proposed each method, together with a brief explanation of each.

How to run?

ENMTML(pred_dir, 
       proj_dir = NULL, 
       result_dir = NULL,
       occ_file, 
       sp, 
       x, 
       y, 
       min_occ = 10,
       thin_occ = NULL, 
       eval_occ = NULL, 
       colin_var = NULL,
       imp_var = FALSE, 
       sp_accessible_area = NULL, 
       pseudoabs_method,
       pres_abs_ratio = 1, 
       part,
       save_part = FALSE,
       save_final = TRUE,
       algorithm, 
       thr, 
       msdm = NULL, 
       ensemble = NULL,
       extrapolation = FALSE, 
       cores = 1)

The possible input options are described below.

Function Arguments

  • pred_dir: character. Directory path with predictors (supported file formats are ASC, BIL, TIFF or TXT).

  • proj_dir: character. Directory path containing folders with predictors for different regions or time periods used to project models (supported file formats are ASC, BIL, TIFF or TXT).

  • result_dir: character. Directory path with the folder in which model results will be recorded:

    • NULL: Results will be recorded in a default Result folder, at the same level as the pred_dir folder (usage result_dir=NULL).
    • Simple name: A folder with the specified name will be created at the same level as the pred_dir folder (e.g. usage result_dir="MyFolderName")
    • Complete path: A folder will be created at the specified path (e.g.  result_dir="C:/Users/mypc/Documents/MyFolderName").
  • occ_file: character. Path to the tab-delimited TXT file, which must contain at least three columns: species names, and the latitude and longitude of species occurrences.

  • sp: character. Name of the column with information about species names.

  • x: character. Name of the column with information about longitude.

  • y: character. Name of the column with information about latitude.
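As a minimal sketch of the input that occ_file, sp, x and y describe, the lines below write a small tab-delimited file and read it back. The species names, coordinates and the column names species, x and y are invented for illustration; any column names should work as long as they match what is passed to sp, x and y.

```r
# Hypothetical occurrence records (names and coordinates are invented)
occ <- data.frame(
  species = c("Species_a", "Species_a", "Species_b"),
  x = c(-54.6, -55.1, -60.3),   # longitude
  y = c(-25.7, -26.2, -31.4)    # latitude
)

# Write a tab-delimited TXT file, the format described for occ_file
occ_path <- file.path(tempdir(), "occurrences.txt")
write.table(occ, occ_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back to check that the three columns are present
occ_check <- read.table(occ_path, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
print(names(occ_check))
```

With a file like this, the corresponding arguments would be sp='species', x='x' and y='y'.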

  • min_occ: integer. Minimum number of unique occurrences (species with fewer than this number will be excluded).

  • thin_occ: character. Perform spatial filtering (thinning, based on the spThin package) on the presences. This argument takes a named vector whose elements are named ‘method’, or ‘method’ and ‘distance’ (more information below). Three thinning methods are available (default NULL):

    • MORAN: Distance defined by Moran Variogram. Usage thin_occ=c(method='MORAN').
    • CELLSIZE: Distance defined by 2x cellsize (Haversine Transformation). Usage thin_occ=c(method='CELLSIZE').
    • USER-DEFINED: User-defined distance. This option requires a vector with two values. Usage thin_occ=c(method='USER-DEFINED', distance='300'). The second value is the distance in km that will be used for thinning, so distance='300' means that all records within a radius of 300 km will be deleted.
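Most arguments of this function follow the same convention as thin_occ: a named vector built with c(). The sketch below only shows what base R stores in such a vector; the variable names mixed and thin_km are illustrative and not part of the package.

```r
# A named character vector, as used for thin_occ
thin_occ <- c(method = "USER-DEFINED", distance = "300")

# c() keeps a single type, so a number mixed with text is stored as text
mixed <- c(method = "USER-DEFINED", distance = 300)
print(is.character(mixed))

# Elements are retrieved by name; convert when a number is needed
thin_km <- as.numeric(thin_occ[["distance"]])
print(thin_km)
```

This is why the usage examples can quote numeric values such as '300' or '0.7': inside a vector that also holds the method name, they end up as text either way.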
  • eval_occ: character. Path to a tab-delimited TXT file with species names, latitude and longitude; these three columns must have the same column names as the database used in the occ_file argument. This external occurrence database will be used for external model validation (i.e., it will not be used for model fitting) (default NULL).

  • colin_var: character. Method to reduce variable collinearity:

    • PCA: Perform a Principal Component Analysis on predictors and use Principal Components as environmental variables. Usage colin_var=c(method='PCA').
    • VIF: Variance Inflation Factor. Usage colin_var=c(method='VIF').
    • PEARSON: Select variables by Pearson correlation, a threshold of maximum correlation must be specified by user. Usage colin_var=c(method='PEARSON', threshold='0.7').
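To illustrate the idea behind the PEARSON option, here is a rough base R sketch that drops a variable when its absolute correlation with an earlier variable exceeds the threshold. It is not the package's own code; the simulated data, the variable names bio1 to bio3 and the "drop the later variable" rule are assumptions made for the example.

```r
set.seed(1)
# Simulated predictor values at 100 sites (variable names are hypothetical)
env <- data.frame(bio1 = rnorm(100))
env$bio2 <- env$bio1 + rnorm(100, sd = 0.1)  # almost a copy of bio1
env$bio3 <- rnorm(100)                       # independent of the others

threshold <- 0.7
cors <- abs(cor(env))                        # absolute Pearson correlations
cors[upper.tri(cors, diag = TRUE)] <- 0      # keep each pair only once

# Drop a variable when it is too correlated with one listed before it
too_correlated <- apply(cors, 1, function(r) any(r > threshold))
kept <- names(env)[!too_correlated]
print(kept)
```

Here bio2 is removed because it is nearly identical to bio1, while bio1 and bio3 are kept.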
  • imp_var: logical. Compute variable importance and the data for response curves for the selected algorithms? (default FALSE)

  • sp_accessible_area: character. Restrict the accessible area for each species, i.e., the area used for model fitting. It is necessary to provide a vector for this argument. Three methods are implemented:

    • BUFFER (type 1): The area used for model fitting is delimited by a buffer with a width equal to the maximum distance among pairs of occurrences for each species. Usage sp_accessible_area=c(method='BUFFER', type='1').
    • BUFFER (type 2): The area used for model fitting is delimited by a buffer with a width defined by the user in km. Note that this buffer width will be used for all species. Usage sp_accessible_area=c(method='BUFFER', type='2', width='300').
    • MASK: This method delimits the area used for model fitting based on the polygon in which a species' occurrences fall. For instance, it is possible to delimit the calibration area based on an ecoregion shapefile. This option requires the path to the file that will be used as the mask. The following file formats can be loaded: ‘.bil’, ‘.asc’, ‘.tif’, ‘.shp’, and ‘.txt’. Usage sp_accessible_area=c(method='MASK', filepath='C:/Users/mycomputer/ecoregion/olson.shp').
  • pseudoabs_method: character. Pseudo-absence allocation method. It is necessary to provide a vector for this argument. Only one method can be chosen. The following methods are implemented:

    • RND: Random allocation of pseudo-absences throughout the area used for model fitting. Usage pseudoabs_method=c(method='RND').
    • ENV_CONST: Pseudo-absences are environmentally constrained to a region with lower suitability values predicted by a Bioclim model. Usage pseudoabs_method=c(method='ENV_CONST').
    • GEO_CONST: Pseudo-absences are allocated far from occurrences based on a geographical buffer. This method requires a second value, which expresses the buffer width in km. Usage pseudoabs_method=c(method='GEO_CONST', width='50').
    • GEO_ENV_CONST: Pseudo-absences are constrained environmentally (based on a Bioclim model) but distributed geographically far from occurrences based on a geographical buffer. This method requires a second value, which expresses the buffer width in km. Usage pseudoabs_method=c(method='GEO_ENV_CONST', width='50').
    • GEO_ENV_KM_CONST: Pseudo-absences are constrained by a three-level procedure; it is similar to GEO_ENV_CONST, with an additional step that distributes the pseudo-absences in environmental space using k-means cluster analysis. This method requires a second value, which expresses the buffer width in km. Usage pseudoabs_method=c(method='GEO_ENV_KM_CONST', width='50').
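The geographical constraint used by GEO_CONST can be pictured with a small base R sketch: candidate points closer than the buffer width to any presence are discarded. This is only an illustration under stated assumptions, not the package's implementation; the helper haversine_km, the coordinates and the candidate extent are all invented for the example.

```r
# Great-circle distance in km (haversine formula); helper written for this sketch
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

set.seed(1)
# Hypothetical presences and random candidate points (longitude, latitude)
pres <- data.frame(x = c(-54.6, -55.1), y = c(-25.7, -26.2))
cand <- data.frame(x = runif(200, -58, -52), y = runif(200, -29, -23))

width_km <- 50
# Distance from every candidate to its nearest presence
nearest <- sapply(seq_len(nrow(cand)), function(i) {
  min(haversine_km(cand$x[i], cand$y[i], pres$x, pres$y))
})

# Keep only candidates outside the buffer around the presences
pseudo_abs <- cand[nearest > width_km, ]
print(nrow(pseudo_abs))
```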
  • pres_abs_ratio: numeric. Presence-Absence ratio (values between 0 and 1).

  • part: character. Partition method for model validation. Only one method can be chosen. It is necessary to provide a vector for this argument. The following methods are implemented:

    • BOOT: Random bootstrap partition. Usage part=c(method='BOOT', replicates='2', proportion='0.7'). replicates refers to the number of replicates and takes a value >=1. proportion refers to the proportion of occurrences used for model fitting and takes a value >0 and <=1. In this example proportion='0.7' means that 70% of the data will be used for model training and 30% for model testing.
    • KFOLD: Random partition in k-fold cross-validation. Usage part=c(method= 'KFOLD', folds='5'). folds refers to the number of folds for data partitioning and takes a value >=1.
    • BANDS: Geographic partition structured as bands arranged in a latitudinal way (type 1) or longitudinal way (type 2). Usage part=c(method= 'BANDS', type='1'). type refers to the bands disposition.
    • BLOCK: Geographic partition structured as a checkerboard (a.k.a. block cross-validation). Usage part=c(method= 'BLOCK').
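The two random partition schemes can be sketched in a few lines of base R. The example assumes 20 occurrences and mirrors the values from the usage examples above (folds='5', proportion='0.7'); it is a conceptual illustration, not the code the package runs.

```r
set.seed(42)
n_occ <- 20                      # hypothetical number of occurrences

# KFOLD with folds = 5: every record receives one fold label
folds <- 5
fold_id <- sample(rep(seq_len(folds), length.out = n_occ))
print(table(fold_id))

# BOOT with proportion = 0.7: 70% for training, the rest for testing
proportion <- 0.7
train_idx <- sample(n_occ, size = round(proportion * n_occ))
test_idx <- setdiff(seq_len(n_occ), train_idx)
print(c(train = length(train_idx), test = length(test_idx)))
```

With KFOLD each fold is held out once for testing; with BOOT the random split is repeated as many times as given in replicates.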
  • save_part: logical. If TRUE, the function will save .tif files of the partial models, i.e., the models created with each occurrence partition (default FALSE).

  • save_final: logical. If TRUE, the function will save .tif files of the final model, i.e., the model fitted with all occurrence data (default TRUE).

  • algorithm: character. Algorithm to construct ecological niche models (it is possible to use more than one method):

    • BIO: Bioclim
    • MAH: Mahalanobis
    • DOM: Domain
    • ENF: Ecological-Niche Factor Analysis
    • MXS: Maxent simple (only linear and quadratic features, based on MaxNet package)
    • MXD: Maxent default features (all features, based on MaxNet package)
    • SVM: Support Vector Machine
    • GLM: Generalized Linear Model
    • GAM: Generalized Additive Model
    • BRT: Boosted Regression Tree
    • RDF: Random Forest
    • MLK: Maximum Likelihood
    • GAU: Gaussian Process

    Usage algorithm=c('BIO', 'SVM', 'GLM', 'GAM', 'GAU').
  • thr: character. Threshold used for presence-absence predictions. It is possible to use more than one threshold type. It is necessary to provide a vector for this argument:

    • LPT: The highest threshold at which there is no omission. Usage thr=c(type='LPT').
    • MAX_TSS: Threshold at which the sum of the sensitivity and specificity is the highest. Usage thr=c(type='MAX_TSS').
    • MAX_KAPPA: The threshold at which kappa is the highest (“max kappa”). Usage thr=c(type='MAX_KAPPA').
    • SENSITIVITY: A threshold value specified by the user. Usage thr=c(type='SENSITIVITY', sens='0.6'). ‘sens’ is the value used to binarize the models. Note that this method applies the same ‘sens’ value to all algorithms and species.
    • JACCARD: The threshold at which Jaccard is the highest. Usage thr=c(type='JACCARD').
    • SORENSEN: The threshold at which Sorensen is highest. Usage thr=c(type='SORENSEN').

    To use more than one threshold type, concatenate the names of the threshold types, e.g., thr=c(type=c('LPT', 'MAX_TSS', 'JACCARD')). When the SENSITIVITY threshold is used in combination with others, it is necessary to specify the desired sensitivity value, e.g., thr=c(type=c('LPT', 'MAX_TSS', 'SENSITIVITY'), sens='0.8').
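As a worked illustration of two of the threshold definitions above, the sketch below computes LPT and MAX_TSS from a handful of made-up suitability values. It uses the usual definition TSS = sensitivity + specificity - 1 and treats a cell as predicted present when its suitability is greater than or equal to the threshold; the data and variable names are assumptions, and this is not the package's own code.

```r
# Hypothetical suitability values at presences and at (pseudo-)absences
pres_suit <- c(0.9, 0.8, 0.6, 0.3)
abs_suit  <- c(0.5, 0.4, 0.2, 0.1)

# LPT: the highest threshold with no omission, i.e. the lowest presence value
lpt <- min(pres_suit)

# MAX_TSS: try every observed value as a threshold and keep the best one
candidates <- sort(unique(c(pres_suit, abs_suit)))
tss <- sapply(candidates, function(t) {
  sensitivity <- mean(pres_suit >= t)   # presences predicted present
  specificity <- mean(abs_suit < t)     # absences predicted absent
  sensitivity + specificity - 1
})
max_tss_thr <- candidates[which.max(tss)]

print(c(LPT = lpt, MAX_TSS = max_tss_thr))
```

For these values LPT is 0.3 and the MAX_TSS threshold is 0.6, where sensitivity is 0.75 and specificity is 1.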

  • msdm: character. Include spatial restrictions in model projection. These methods restrict ecological niche models so that they predict less potential area and come closer to species distribution models. They are classified into ‘a Priori’ and ‘a Posteriori’ methods. A Priori methods include geographical layers as predictors during model fitting, whereas a Posteriori methods constrain models based on occurrence and suitability patterns. Only one method can be given for this argument; for the MCP-B method, msdm is filled in a different way, see below (default NULL):

    a Priori methods (the created layer is added as a predictor at the moment of model fitting):

    • XY: Create two layers, one with latitude and one with longitude. Usage msdm=c(method='XY').
    • MIN: Create a layer with information of the distance from each cell to the closest occurrence. Usage msdm=c(method='MIN').
    • CML: Create a layer with information of the summed distance from each cell to all occurrences. Usage msdm=c(method='CML').
    • KER: Create a layer with a Gaussian-Kernel on the occurrence data. Usage msdm=c(method='KER').

    a Posteriori methods:

    • OBR: Occurrence based restriction, uses the distance between points to exclude far suitable patches (Mendes et al., in prep). Usage msdm=c(method='OBR').
    • LR: Lower Quantile, selects the nearest 25% of patches (Mendes et al., in prep). Usage msdm=c(method='LR').
    • PRES: Select only the patches with confirmed occurrence data (Mendes et al., in prep). Usage msdm=c(method='PRES').
    • MCP: Excludes suitable cells outside the Minimum Convex Polygon (MCP) built based on occurrences data. Usage msdm=c(method='MCP').
    • MCP-B: Creates a buffer (with a width size defined by user in km) around the MCP. Usage msdm=c(method='MCP-B', width=100). In this case width=100 means that a buffer with 100km of width will be created around the MCP.
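To make the a Priori layers more concrete, here is a toy base R sketch of what the MIN and CML descriptions amount to on a tiny grid. It assumes a 5 x 5 grid with planar distances in grid units and two invented occurrences; the package works on real raster layers, so treat this purely as an illustration of the definitions.

```r
# A 5 x 5 grid of cell centres and two hypothetical occurrences (grid units)
cells <- expand.grid(x = 1:5, y = 1:5)
occ <- data.frame(x = c(1, 4), y = c(1, 5))

# Distance matrix: one row per cell, one column per occurrence
d <- sapply(seq_len(nrow(occ)), function(i) {
  sqrt((cells$x - occ$x[i])^2 + (cells$y - occ$y[i])^2)
})

min_layer <- apply(d, 1, min)   # MIN: distance to the closest occurrence
cml_layer <- rowSums(d)         # CML: summed distance to all occurrences

# View the MIN layer as a grid (rows index x, columns index y)
print(round(matrix(min_layer, nrow = 5), 2))
```

Cells that hold an occurrence get a MIN value of 0, and values grow with distance from the nearest record.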
  • ensemble: character. Method used to ensemble the different algorithms. It is possible to use more than one method. A vector must be provided for this argument. For the SUP, W_MEAN or PCA_SUP methods it is necessary to provide an evaluation metric (i.e., AUC, Kappa, TSS, Jaccard, Sorensen or Fpb), see below (default NULL):

    • MEAN: Simple average of the different models. Usage ensemble=c(method='MEAN').
    • W_MEAN: Weighted average of models based on their performance. An evaluation metric must be provided. Usage ensemble=c(method='W_MEAN', metric='TSS').
    • SUP: Average of the best models (e.g., TSS over the average). An evaluation metric must be provided. Usage ensemble=c(method='SUP', metric='TSS').
    • PCA: Performs a Principal Component Analysis (PCA) and returns the first axis. Usage ensemble=c(method='PCA').
    • PCA_SUP: PCA of the best models (e.g., TSS over the average). An evaluation metric must be provided. Usage ensemble=c(method='PCA_SUP', metric='Fpb').
    • PCA_THR: PCA performed only with those cells with suitability values above the selected threshold. Usage ensemble=c(method='PCA_THR').

    To use more than one ensemble method, concatenate the names of the ensemble methods within the argument, e.g., ensemble=c(method=c('MEAN', 'PCA')) or ensemble=c(method=c('MEAN', 'W_MEAN', 'PCA_SUP'), metric='Fpb').
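The three averaging ensembles can be illustrated with a small numeric sketch. The suitability values and TSS scores below are invented, and normalising the weights so they sum to one is an assumption of this example rather than a documented detail of the package.

```r
# Hypothetical suitability of three cells predicted by three algorithms
suit <- cbind(BIO = c(0.2, 0.6, 0.9),
              GLM = c(0.4, 0.5, 0.7),
              SVM = c(0.1, 0.8, 0.6))
# Hypothetical TSS of each algorithm
tss <- c(BIO = 0.5, GLM = 0.75, SVM = 0.25)

# MEAN: simple average across algorithms
mean_ens <- rowMeans(suit)

# W_MEAN: average weighted by performance (weights sum to 1 here)
weights <- tss[colnames(suit)] / sum(tss)
w_mean_ens <- as.vector(suit %*% weights)

# SUP: average of the models whose TSS is above the mean TSS
best <- names(tss)[tss > mean(tss)]
sup_ens <- rowMeans(suit[, best, drop = FALSE])

print(best)
print(round(cbind(mean_ens, w_mean_ens, sup_ens), 3))
```

Only GLM exceeds the mean TSS of 0.5 in this toy case, so the SUP ensemble equals the GLM prediction.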

  • extrapolation: logical. If TRUE the function will calculate extrapolation based on Mobility-Oriented Parity analysis (MOP) for current conditions. If the argument proj_dir is used, the extrapolation layers for other regions or time periods will also be calculated (default FALSE).

  • cores: numeric. Define the number of CPU cores to run modeling procedures in parallel (default 1).
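Putting the pieces together, a minimal call only needs the arguments that have no default in the signature above. The sketch below collects them in a list and runs the function only when the package is installed and the inputs exist. The paths and column names are placeholders to be replaced with your own, and the particular choice of methods is just one possible combination.

```r
# Arguments without defaults in the signature; all values are placeholders
enm_args <- list(
  pred_dir = "path/to/predictors",         # hypothetical folder
  occ_file = "path/to/occurrences.txt",    # hypothetical tab-delimited file
  sp = "species",                          # column names in occ_file
  x = "x",
  y = "y",
  pseudoabs_method = c(method = "RND"),
  part = c(method = "KFOLD", folds = "5"),
  algorithm = c("BIO", "GLM"),
  thr = c(type = "MAX_TSS")
)

# Run only if the package and the input data are actually available
ready <- requireNamespace("ENMTML", quietly = TRUE) &&
  dir.exists(enm_args$pred_dir) &&
  file.exists(enm_args$occ_file)

if (ready) {
  do.call(ENMTML::ENMTML, enm_args)
} else {
  message("ENMTML or the input data were not found; nothing was run.")
}
```

Optional arguments such as thin_occ, colin_var, msdm or ensemble can be added to the list in the same way.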

What are my results?

Within the result_dir folder you will find several sub-folders: Algorithm, Ensemble (decision-based), Projection (decision-based), Extrapolation (decision-based), BLOCK (decision-based), Extent Masks (decision-based).

There are also some .txt files (some of them will only be created under certain modeling settings):

  • Evaluation_Table.txt: Contains the results for model evaluation, with several metrics
  • InfoModeling.txt: Information on the chosen modeling parameters
  • Number_Unique_Occurrences.txt: Number of unique occurrences for each species
  • Occurrences_Cleaned.txt: Dataset produced after selecting a single occurrence per grid cell (unique occurrences)
  • Occurrences_Filtered.txt: Dataset produced after occurrences were corrected for spatial sampling bias (thinned occurrences)
  • Thresholds_Algorithm.txt: Information about the thresholds used to create the presence-absence maps for each algorithm (presence-absence maps are created from the threshold of complete models)
  • Thresholds_Ensemble.txt: Information about the thresholds used to create the presence-absence maps for ensembled models
  • Moran_&_Mess: Contains information about autocorrelation and environmental similarity between the datasets used to fit and evaluate the model

CITATION:

Andrade, A.F.A., Velazco, S.J.E., De Marco Jr, P., 2020. ENMTML: An R package for a straightforward construction of complex ecological niche models. Environmental Modelling & Software 125, 104615. https://doi.org/10.1016/j.envsoft.2019.104615

Test the package and give us feedback here or by e-mail!