| Title: | Enhancing Data Quality of Biogeographic Ranges with Application for Marine Invertebrates |
|---|---|
| Description: | Provides step-by-step automation for integrating biodiversity data from multiple online aggregators, merging and cleaning datasets while addressing challenges such as taxonomic inconsistencies, georeferencing issues, and spatial or environmental outliers. Includes functions to extract environmental data and to define the biogeographic ranges in which species are most likely to occur. For methodological details, see the associated publication <doi:10.1002/ecog.08203>. |
| Authors: | Priyanka Soni [aut, cre] (ORCID: <https://orcid.org/0000-0001-8358-1645>), Austin Hendy [ctb] (Manuscript mentoring and advising), David Bottjer [ctb] (Manuscript mentoring and advising), Vijay Barve [ctb] (Code development feedback and advising) |
| Maintainer: | Priyanka Soni <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.3 |
| Built: | 2026-05-14 05:54:38 UTC |
| Source: | https://github.com/sonipri/ecocleanr |
Get Decimal Places of Coordinate Values
decimal_places(coord)
coord |
A numeric coordinate value in decimal degrees |
A numeric value giving the number of decimal places in the coordinate.
decimal_places(12.7000000)
decimal_places(45.67788)
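One way to count decimal places in base R is to inspect the formatted number. The sketch below is only an illustration: `count_decimals` is a hypothetical helper, not the package's implementation of decimal_places, and it assumes a single finite numeric input.

```r
# Hypothetical helper (an assumption for illustration, not package code).
count_decimals <- function(x) {
  # Format with up to 15 significant digits and no scientific notation,
  # so floating-point noise such as 0.30000000000000004 is dropped.
  s <- format(x, digits = 15, scientific = FALSE, trim = TRUE)
  # No decimal point means no decimal places.
  if (!grepl(".", s, fixed = TRUE)) {
    return(0L)
  }
  # Strip trailing zeros, then count the characters after the point.
  s <- sub("0+$", "", s)
  nchar(sub("^[^.]*\\.", "", s))
}

count_decimals(12.7000000)
count_decimals(45.67788)
```

Trailing zeros typed in the source (as in 12.7000000) are not stored in a numeric value, so they cannot be counted by any approach of this kind.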
Calculate Geographic Distance and Mahalanobis Distance to Estimate the Outlier Probability of a Data Point
distance_calc(data, latitude, longitude, env_layers, itr = 15, k = 3)
data |
data table with spatial and environmental variables |
latitude |
name of the latitude column, passed in from ec_flag_outlier |
longitude |
name of the longitude column, passed in from ec_flag_outlier |
env_layers |
column names of the environmental variables, e.g. env_layers <- c("Temperature", "pH") |
itr |
number of clustering iterations to run, e.g. 100 or 1000 (default 15) |
k |
number of clusters to choose in each iteration |
A list holding the calculated distances for each iteration.
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  temperature_mean = c(12, 13, 14),
  temperature_min = c(9, 6, 10),
  temperature_max = c(14, 16, 18)
)
env_layers <- c("temperature_mean", "temperature_min", "temperature_max")
result_list <- distance_calc(data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers, itr = 100, k = 3
)
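To make the two distance measures concrete, here is a minimal base R sketch that computes a great-circle (haversine) distance and a Mahalanobis distance for each point relative to the mean of the data set. It is not the package's algorithm: the reference point (the overall mean rather than cluster centres), the `haversine_km` helper, and the data values are all assumptions made for illustration.

```r
# Made-up occurrence points; the last one is deliberately far away.
pts <- data.frame(
  lon  = c(-117.0, -117.8, -116.9, -117.2, -117.5, -100.0),
  lat  = c(32.9, 33.5, 31.9, 32.5, 33.0, 45.0),
  temp = c(12, 13, 14, 12.5, 13.5, 25),
  ph   = c(8.1, 8.0, 8.2, 8.05, 8.15, 7.2)
)

# Hypothetical helper: great-circle distance in km on a sphere of radius r.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Geographic distance of every point from the mean coordinate.
pts$geo_distance <- haversine_km(pts$lon, pts$lat, mean(pts$lon), mean(pts$lat))

# stats::mahalanobis() returns squared distances, so take the square root.
env <- as.matrix(pts[, c("temp", "ph")])
pts$maha_distance <- sqrt(mahalanobis(env, colMeans(env), cov(env)))

pts[, c("geo_distance", "maha_distance")]
```

The Mahalanobis step needs more observations than environmental variables; otherwise the covariance matrix is singular and cannot be inverted.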
Condition for running this function: all data frames should have the same fields, following Darwin Core (DwC) standards, e.g.
attribute_list <- c("source", "catalogNumber", "basisOfRecord", "occurrenceStatus", "institutionCode", "verbatimEventDate", "scientificName", "individualCount", "organismQuantity", "abundance", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "locality", "verbatimLocality", "municipality", "county", "stateProvince", "country", "countryCode")
Manually assign the source name in the "source" field, for example gbif, obis, invertEBase, etc.
Assign the values of individual count or organism count to abundance; most online sources populate one of them with the specimen count.
This function depends on a successful download of the data files; it also accepts CSV files from the local system.
ec_db_merge(db_list, datatype = "modern", occurrenceStatus = "occurrenceStatus", basisOfRecord = "basisOfRecord")
db_list |
list of data frames to merge, e.g. GBIF, iDigBio, InvertEBase and any local file. |
datatype |
default "modern". datatype accepts the text input "modern" or "fossil". |
occurrenceStatus |
name of the occurrence status column; the default is "occurrenceStatus", but a different name can be supplied if required. |
basisOfRecord |
name of the basis of record column; the default is "basisOfRecord", but a different name can be supplied if required. |
A data frame of occurrence records filtered to include only those classified as "modern" or "fossil".
db1 <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -117, NA, NA),
  decimalLatitude = c(20, 34, NA, NA),
  catalogNumber = c("12345", "89888", "LACM8898", "SDNHM6767"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db1",
  abundance = c(1, NA, 8, 23)
)
db2 <- data.frame(
  species = "A",
  decimalLongitude = c(-120.2, -117.1, NA, NA),
  decimalLatitude = c(20.2, 34.1, NA, NA),
  catalogNumber = c("123452", "898828", "LACM82898", "SDNHM62767"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db2",
  abundance = c(1, 2, 3, 19)
)
db_list <- list(db1, db2)
merge_modern_data <- ec_db_merge(db_list = db_list, datatype = "modern")
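The general idea of stacking sources that share the same columns and then splitting them by record type can be sketched in base R as follows. This is an assumption-laden illustration, not what ec_db_merge does internally: the rule "a record is fossil if basisOfRecord contains the word fossil" and the data frames `db_a` and `db_b` are hypothetical.

```r
# Two hypothetical sources with identical column names.
db_a <- data.frame(
  catalogNumber = c("A1", "A2"),
  basisOfRecord = c("preserved_specimen", "fossilspecimen"),
  source = "db_a"
)
db_b <- data.frame(
  catalogNumber = c("B1", "B2"),
  basisOfRecord = c("material_sample", "FossilSpecimen"),
  source = "db_b"
)

# rbind() requires the same columns in every data frame, which is why
# the shared field list matters.
merged <- do.call(rbind, list(db_a, db_b))

# Assumed rule for this sketch: case-insensitive match on "fossil".
is_fossil <- grepl("fossil", merged$basisOfRecord, ignore.case = TRUE)
modern_only <- merged[!is_fossil, ]
fossil_only <- merged[is_fossil, ]
```

Keeping the "source" column filled in before merging makes it possible to trace each row back to its aggregator afterwards.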
Extract the Environmental Data
ec_extract_env_layers(data, env_layers = env_layers, latitude = "decimalLatitude", longitude = "decimalLongitude")
data |
data table containing coordinate information |
env_layers |
vector of environmental layers to extract, e.g. BO_sstmean, BO_sstmax, BO_sstmin, BO_chlomean, BO_phosphate, or a MARSPEC layer. Check list_layer for the exact layer codes. |
latitude |
default assigned as "decimalLatitude" |
longitude |
default assigned as "decimalLongitude" |
A data table with unique coordinates and environmental predictors.
## Not run:
env_layers <- c("BO_sstmean", "BO_chlomean", "BO_dissox", "BO_salinity")
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9)
)
data_x <- ec_extract_env_layers(data,
  env_layers = env_layers,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Flag Occurrences That Have an Extreme Uncertainty Error Radius
ec_filter_by_uncertainty(data, uncertainty_col = "coordinateUncertaintyInMeters", percentile = 0.96, ask = TRUE, latitude = "decimalLatitude", longitude = "decimalLongitude")
data |
data table to be cleaned of extreme uncertainty values |
uncertainty_col |
name of the coordinateUncertaintyInMeters column |
percentile |
percentile used to derive the threshold, e.g. to remove the most extreme 5% of uncertainty values, give percentile as 0.95 |
ask |
lets the user decide whether the uncertainty threshold value is acceptable or too high/low |
latitude |
default set to "decimalLatitude"; this column is used to filter records that have no georeference. |
longitude |
default set to "decimalLongitude". |
A data frame with the occurrences of extreme uncertainty removed.
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -117, NA, NA),
  decimalLatitude = c(20, 34, NA, NA),
  cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
  locality = c(NA, NA, "Los Angeles, CA", "San Pedro, CA"),
  coordinateUncertaintyInMeters = c(1000, 2000, 9999900, NA)
)
data <- ec_filter_by_uncertainty(
  data,
  uncertainty_col = "coordinateUncertaintyInMeters",
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  percentile = 0.96,
  ask = TRUE
)
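The percentile argument can be understood as a quantile cut-off on the uncertainty column. The base R sketch below shows that idea only; it is not the package's code, the values are made up, and keeping records with a missing uncertainty is an assumption of this sketch.

```r
# Made-up uncertainty radii in metres, including one missing value.
unc <- c(1000, 2000, 9999900, NA, 500, 1500)

# Threshold at the 96th percentile of the non-missing values.
threshold <- quantile(unc, probs = 0.96, na.rm = TRUE)

# Assumption for this sketch: records with missing uncertainty are kept.
keep <- is.na(unc) | unc <= threshold

unc[keep]
```

With very small samples the threshold is interpolated between the two largest values, so inspecting it before filtering (the role of ask = TRUE) is sensible.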
Flag Occurrences That Are Not in the East Atlantic or Are Inland
ec_flag_non_east_atlantic(ocean_names, buffer_distance = 50000, data, latitude = "decimalLatitude", longitude = "decimalLongitude")
ocean_names |
Names of the oceans to use: "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland; data points beyond this distance are considered bad data points, e.g. buffer_distance <- 25000 |
data |
Data table containing latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
## Not run:
ocean_names <- c("North Atlantic Ocean", "South Atlantic Ocean")
buffer_distance <- 25000
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
  decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_east_atlantic(
  ocean_names,
  buffer_distance,
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Flag Occurrences That Are Not in the East Pacific or Are Inland
ec_flag_non_east_pacific(ocean_names, buffer_distance = 50000, data, latitude = "decimalLatitude", longitude = "decimalLongitude")
ocean_names |
Names of the oceans to use: "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland; data points beyond this distance are considered bad data points, e.g. buffer_distance <- 25000 |
data |
Data table containing latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
## Not run:
ocean_names <- c("North Pacific Ocean", "South Pacific Ocean")
buffer_distance <- 25000
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -78, -110),
  decimalLatitude = c(20, 34, 30)
)
data$flag_non_region <- ec_flag_non_east_pacific(
  ocean_names,
  buffer_distance,
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Flag Occurrences That Are in the Wrong Ocean Basin or Are Inland
ec_flag_non_region(direction, ocean, buffer = 50000, data, latitude = "decimalLatitude", longitude = "decimalLongitude")
direction |
either "east" or "west". These values filter the shapefiles to the east or west side of the selected ocean (e.g. Pacific) in both the northern and southern hemispheres. |
ocean |
values such as "pacific" or "atlantic" |
buffer |
Buffer distance used to decide whether a data point is inland; data points beyond this distance are considered bad data points, e.g. buffer <- 25000 |
data |
Data table containing latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
## Not run:
direction <- "east"
buffer <- 25000
ocean <- "pacific"
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
  decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_region(
  direction,
  ocean,
  buffer = 50000,
  data
)
## End(Not run)
Flag Occurrences That Are Not in the West Atlantic or Are Inland
ec_flag_non_west_atlantic(ocean_names, buffer_distance = 50000, data, latitude = "decimalLatitude", longitude = "decimalLongitude")
ocean_names |
Names of the oceans to use: "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland; data points beyond this distance are considered bad data points, e.g. buffer_distance <- 25000 |
data |
Data table containing latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
## Not run:
ocean_names <- c("North Atlantic Ocean", "South Atlantic Ocean")
buffer_distance <- 25000
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
  decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_west_atlantic(
  ocean_names,
  buffer_distance,
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Flag Occurrences That Are Not in the West Pacific or Are Inland
ec_flag_non_west_pacific(ocean_names, buffer_distance = 50000, data, latitude = "decimalLatitude", longitude = "decimalLongitude")
ocean_names |
Names of the oceans to use: "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland; data points beyond this distance are considered bad data points, e.g. buffer_distance <- 25000 |
data |
Data table containing latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
## Not run:
ocean_names <- c("North Pacific Ocean", "South Pacific Ocean")
buffer_distance <- 25000
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -78, -110),
  decimalLatitude = c(20, 34, 30)
)
data$flag_non_region <- ec_flag_non_west_pacific(
  ocean_names,
  buffer_distance,
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Flag Outlier Occurrences - using Spatial and Non-spatial Attributes
ec_flag_outlier(data, latitude = "decimalLatitude", longitude = "decimalLongitude", env_layers, itr = 50, k = 3, geo_quantile = 0.99, maha_quantile = 0.99)
data |
data table with spatial and environmental variables |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
column names of the environmental variables, e.g. env_layers <- c("Temperature", "pH") |
itr |
number of clustering iterations to run, e.g. 100 or 1000 |
k |
number of clusters to choose in each iteration |
geo_quantile |
percentile of geo_distance used as the threshold for identifying outliers; default 0.99 |
maha_quantile |
percentile of maha_distance used as the threshold for identifying outliers; default 0.99 |
A column called flag_outlier holding the outlier probability from 0 to 1; values near 1 indicate likely outliers and values near 0 indicate good data points.
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  BO_sstmean = c(12, 13, 14),
  BO_sstmin = c(9, 6, 10),
  BO_sstmax = c(14, 16, 18)
)
env_layers <- c("BO_sstmean", "BO_sstmin", "BO_sstmax")
res <- ec_flag_outlier(data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers,
  itr = 100,
  k = 3,
  geo_quantile = 0.99,
  maha_quantile = 0.99
)
data$outlier <- res$outlier
iteration_list <- res$result$list
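The role of geo_quantile and maha_quantile can be illustrated with a plain quantile cut-off on two distance vectors. The sketch below is hypothetical: the simulated distances, the single pass, and the rule "flag a point if it exceeds either threshold" are assumptions, and it does not reproduce how ec_flag_outlier combines iterations into a probability.

```r
set.seed(1)

# Simulated distances for 100 points; the last point is extreme in both.
geo  <- c(runif(99, 0, 100), 5000)
maha <- c(runif(99, 0, 3), 40)

# Thresholds at the 99th percentile of each distance.
geo_cut  <- quantile(geo, probs = 0.99)
maha_cut <- quantile(maha, probs = 0.99)

# Assumed rule for this sketch: exceeding either threshold flags the point.
flag <- as.integer(geo > geo_cut | maha > maha_cut)

which(flag == 1)
```

Lowering either quantile flags more points, so the two arguments trade off sensitivity against the risk of discarding valid records.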
Flag Occurrences That Have Poor Precision
ec_flag_precision(data, latitude = "decimalLatitude", longitude = "decimalLongitude", threshold = 2)
data |
data frame |
latitude |
name of the latitude field in the data file. We prefer decimalLatitude as the accepted name, following TDWG standards. |
longitude |
name of the longitude field in the data file. We prefer decimalLongitude as the accepted name, following TDWG standards. |
threshold |
default set to 2 |
A column of flags marking bad records based on low precision as well as rounding.
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120.67, -78, -110, -60, -75.5, -130.78, -10.2, 5.4),
  decimalLatitude = c(20.7, 34.6, 30.0, 10.5, 40.4, 25.66, 15.0, 35.9)
)
data$flag_cordinate_precision <- ec_flag_precision(
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  threshold = 2
)
Filter records to georeference using GEOLocate
ec_flag_with_locality(data, uncertainty = "coordinateUncertaintyInMeters", locality = "locality", verbatimLocality = "verbatimLocality")
data |
data table with occurrence information |
uncertainty |
The coordinateUncertaintyInMeters column is mandatory in the data table. |
locality |
The locality column is mandatory in the data table. |
verbatimLocality |
The verbatimLocality column is mandatory in the data table. |
Flags records that have no coordinates assigned but do have locality and verbatim locality information, so that coordinates can be assigned using external tools such as GEOLocate.
A column in which records flagged as 1 have the potential to be georeferenced.
data <- data.frame(
  coordinateUncertaintyInMeters = c(NA, "N/A", 50, "30", NA, "N/A", NA),
  locality = c("Santa Cruz", NA, "Los Angeles", "N/A", "", "San Diego", NA),
  verbatimLocality = c(NA, "CA coast", "", "N/A", "Long Beach", NA, "")
)
data$flag_check_geolocate <- ec_flag_with_locality(
  data,
  uncertainty = "coordinateUncertaintyInMeters",
  locality = "locality",
  verbatimLocality = "verbatimLocality"
)
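Real occurrence tables mix several encodings of "missing" (NA, empty strings, "N/A"), which is why the example above uses all three. The base R sketch below shows one way to normalise them before flagging. It is an illustration only: the `is_blank` helper and the rule "uncertainty is missing and at least one locality field is filled" are assumptions, not the documented logic of ec_flag_with_locality.

```r
# Same values as the example above, stored as character vectors.
unc  <- c(NA, "N/A", "50", "30", NA, "N/A", NA)
loc  <- c("Santa Cruz", NA, "Los Angeles", "N/A", "", "San Diego", NA)
vloc <- c(NA, "CA coast", "", "N/A", "Long Beach", NA, "")

# Hypothetical helper: treat NA, "" and "N/A" as missing.
is_blank <- function(x) {
  is.na(x) | trimws(x) %in% c("", "N/A")
}

# Assumed rule: no usable uncertainty, but some locality text to work from.
flag <- as.integer(is_blank(unc) & (!is_blank(loc) | !is_blank(vloc)))

flag
```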
Map view of occurrence data points
ec_geographic_map(data, latitude = "decimalLatitude", longitude = "decimalLongitude")
data |
Data table |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
A map view showing the occurrence records.
## Not run:
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  temperature_mean = c(12, 13, 14),
  temperature_min = c(9, 6, 10),
  temperature_max = c(14, 16, 18)
)
ec_geographic_map(data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Map View of Data Points with Outlier Probability from 0 to 1
ec_geographic_map_w_flag(data, flag_column, latitude = "decimalLatitude", longitude = "decimalLongitude")
data |
Data table containing coordinates (decimalLongitude and decimalLatitude) and a column of flags from 0 to 1 |
flag_column |
name of the column holding the flag, e.g. flag_outlier |
latitude |
default set to "decimalLatitude"; change it if the column has a different name. |
longitude |
default set to "decimalLongitude"; change it if the column has a different name. |
A geographic map showing occurrence data points on a color gradient, with flagged records in warm colors.
## Not run:
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  temperature_mean = c(12, 13, 14),
  temperature_min = c(9, 6, 10),
  temperature_max = c(14, 16, 18),
  flag_outlier = c(0, 0.5, 1)
)
ec_geographic_map_w_flag(data,
  flag_column = "flag_outlier",
  latitude = "decimalLatitude",
  longitude = "decimalLongitude"
)
## End(Not run)
Impute Environmental Variables Using Mean Values of Occurrences Within a Certain Radius
ec_impute_env_values(data_x, latitude = "decimalLatitude", longitude = "decimalLongitude", radius_km = 10, iter = 3)
data_x |
the data table data_x returned by ec_extract_env_layers |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
radius_km |
radius, in kilometers, of the circle within which the values of data points are averaged to impute the values of missing data points |
iter |
number of times to iterate the imputation, e.g. 1, 2 or 3 |
An updated data_x table with imputed values for the missing environmental variables. Note that the imputation will not work if the data points are too sparse.
data_x <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  BO_sstmean = c(12, NA, 14),
  BO_sstmin = c(9, NA, 10),
  BO_sstmax = c(14, NA, 18)
)
radius_km <- 10
iter <- 3
data_x <- ec_impute_env_values(data_x,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  radius_km,
  iter
)
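Radius-based mean imputation can be sketched in base R as a single pass over the rows with missing values. This is a simplified illustration under stated assumptions: the `haversine_km` helper, the column names, and the single pass (no repeated iterations) are hypothetical and not taken from the package.

```r
# Made-up points: three close together, one isolated.
obs <- data.frame(
  lon = c(-117.00, -117.05, -116.98, -110.00),
  lat = c(32.90, 32.92, 32.88, 25.00),
  sst = c(12, NA, 14, NA)
)
radius_km <- 10

# Hypothetical helper: great-circle distance in km.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

for (i in which(is.na(obs$sst))) {
  d <- haversine_km(obs$lon[i], obs$lat[i], obs$lon, obs$lat)
  # Neighbours within the radius that have an observed (or filled) value.
  near <- d <= radius_km & !is.na(obs$sst)
  if (any(near)) {
    obs$sst[i] <- mean(obs$sst[near])
  }
}

obs
```

The isolated point has no neighbour within 10 km and stays missing, which matches the caveat that imputation does not help when data points are too sparse.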
Merge the Updated Georeferenced Occurrence Points Back into the Main Data File
ec_merge_corrected_coordinates(data_corrected, data, catalog = "cleaned_catalog", latitude = "decimalLatitude", longitude = "decimalLongitude", uncertainty_col = "coordinateUncertaintyInMeters")
data_corrected |
After assigning coordinates with an online georeferencing tool such as GEOLocate, load the CSV file back into R under the name data_corrected. The field names of data_corrected are hardcoded as "corrected_longitude", "corrected_latitude", "corrected_uncertainty" and "cleaned_catalog"; these are merged with "decimalLongitude", "decimalLatitude", "coordinateUncertaintyInMeters" and "cleaned_catalog" of the data table. |
data |
data table to be updated with the assigned coordinates |
catalog |
catalog column used to match the records back to the main data file. |
latitude |
default set to "decimalLatitude", this is a column name of data |
longitude |
default set to "decimalLongitude", this is a column name of data |
uncertainty_col |
this is a column name of data and default set to "coordinateUncertaintyInMeters" |
A data frame with updated coordinate information
data <- data.frame(
  species = "A",
  decimalLongitude = c(-120, -119.8, NA, NA),
  decimalLatitude = c(20, 34, NA, NA),
  cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
  locality = c(NA, NA, "Los Angeles, CA", "San Pedro, CA"),
  coordinateUncertaintyInMeters = c(9999, NA, NA, NA)
)
data_corrected <- data.frame(
  corrected_longitude = c(-120, -119.8, 118, 118.3),
  corrected_latitude = c(20, 34, 33, 32.9),
  cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
  corrected_uncertainty = c(9999, NA, 5000, 1000)
)
data <- ec_merge_corrected_coordinates(data_corrected, data,
  catalog = "cleaned_catalog",
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  uncertainty_col = "coordinateUncertaintyInMeters"
)
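Matching corrected records back by catalog number can be sketched with base R's match(). The example below is a minimal illustration, not the package's implementation: the data frames `main` and `fixes` are made up, and filling only rows whose coordinates are missing is an assumption of this sketch.

```r
main <- data.frame(
  cleaned_catalog = c("12345", "LACM8898", "SDNHM6767"),
  decimalLongitude = c(-120, NA, NA),
  decimalLatitude = c(20, NA, NA)
)
fixes <- data.frame(
  cleaned_catalog = c("LACM8898", "SDNHM6767"),
  corrected_longitude = c(-118.0, -118.3),
  corrected_latitude = c(33.0, 32.9)
)

# Position of each main record in the corrections table (NA if absent).
idx <- match(main$cleaned_catalog, fixes$cleaned_catalog)

# Assumed rule: only fill rows that have a correction and no coordinate yet.
fill <- !is.na(idx) & is.na(main$decimalLongitude)
main$decimalLongitude[fill] <- fixes$corrected_longitude[idx[fill]]
main$decimalLatitude[fill] <- fixes$corrected_latitude[idx[fill]]

main
```

Because the join relies on the catalog key alone, duplicated or inconsistently formatted catalog numbers should be cleaned before this step.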
Scatter Plot of geo_distance vs maha_distance with Geo- and Maha-Quantile Thresholds to Show the Outliers Beyond Those Thresholds
ec_plot_distance(x, geo_quantile = 0.99, maha_quantile = 0.99, iterative = TRUE, geo_distance = "geo_distance", maha_distance = "maha_distance")
x |
the iteration_list derived from ec_flag_outlier, used to draw the scatter plots of geo_distance vs maha_distance |
geo_quantile |
percentile of geo_distance used as the threshold for identifying outliers; default 0.99 |
maha_quantile |
percentile of maha_distance used as the threshold for identifying outliers; default 0.99 |
iterative |
TRUE or FALSE, default TRUE. If TRUE, loops through the plot for each iteration in the list of outlier probability results; if FALSE, the loop exits after the first iteration. |
geo_distance |
default set to "geo_distance"; the column holding the calculated distance, as output by ec_flag_outlier |
maha_distance |
default set to "maha_distance"; the column holding the calculated distance, as output by ec_flag_outlier |
A list of plots, one for each iteration outcome.
df1 <- data.frame(
  latitude = runif(5, 30, 35),
  longitude = runif(5, -120, -115),
  temperature = rnorm(5, 15, 2),
  pH = rnorm(5, 8, 0.1),
  geo_distance = runif(5, 0, 100),
  maha_distance = runif(5, 0, 10)
)
df2 <- data.frame(
  latitude = runif(5, 30, 35),
  longitude = runif(5, -120, -115),
  temperature = rnorm(5, 16, 2),
  pH = rnorm(5, 7.9, 0.1),
  geo_distance = runif(5, 0, 100),
  maha_distance = runif(5, 0, 10)
)
# Store both data frames in a list
iteration_list <- list(df1, df2)
plot <- ec_plot_distance(iteration_list,
  geo_quantile = 0.99,
  maha_quantile = 0.99,
  iterative = TRUE
)
Plot cleaned data overlaid on the overall occurrence data to show the accepted ranges of spatial and non-spatial attributes
ec_plot_var_range(
  data,
  summary_df,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
data |
data table that still includes outlier data points |
summary_df |
summary output of the final cleaned data, produced by the function ec_var_summary |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
list of environmental variables |
A plot which shows spatial and environmental variables with the acceptable range for species habitability
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9, -116.5),
  decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
  temperature_mean = c(12, 13, 14, 11),
  temperature_min = c(9, 6, 10, 10),
  temperature_max = c(14, 16, 18, 17),
  flag_outlier = c(0, 0.5, 1, 0.7)
)
# this data table includes data points that were considered outliers
data_x <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.5),
  decimalLatitude = c(32.9, 33.5, 32.4),
  temperature_mean = c(12, 13, 11),
  temperature_min = c(9, 6, 10),
  temperature_max = c(14, 16, 17),
  flag_outlier = c(0, 0.5, 0.7)
)
# cleaned database after removing outliers above a chosen probability x.
# In this example, data points with > 0.7 probability of being
# outliers were removed
env_layers <- c("temperature_mean", "temperature_min", "temperature_max")
summary_df <- ec_var_summary(data_x,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
# this is the final cleaned data table, which
# will be used to derive the summary of the acceptable niche
ec_plot_var_range(data, summary_df,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
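The cleaned table in this example is the full table minus the records whose outlier probability exceeds the chosen cut-off. A minimal base-R sketch of that filtering step follows. The variable name `threshold` is illustrative, and the cut-off of 0.7 is taken from the example's comment.

```r
# Occurrence table with one outlier probability per record
# (coordinates and flag_outlier values copied from the example).
data <- data.frame(
  decimalLongitude = c(-117, -117.8, -116.9, -116.5),
  decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
  flag_outlier = c(0, 0.5, 1, 0.7)
)

# Assumed cut-off: keep records with probability at or below 0.7,
# i.e. drop only those strictly above it.
threshold <- 0.7
data_x <- data[data$flag_outlier <= threshold, ]
data_x
```

Only the record with `flag_outlier = 1` is dropped, which matches the three-row `data_x` used in the example.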
Remove Duplicate Records from the Merged Data
ec_rm_duplicate(data, catalogNumber = "catalogNumber", abundance = "abundance")
data |
merged data frame, the output of ec_db_merge |
catalogNumber |
mandatory field, considered unique for each occurrence record. |
abundance |
mandatory field, created during data extraction by combining the individual count and quantity fields (these may vary from one source to another; we aim to standardize them as "abundance"). |
This function adds a cleaned_catalog column containing standardized catalog numbers and removes duplicates based on the generated cleaned_catalog column and the abundance column of the data. Mandatory fields are catalogNumber, source and abundance.
A data frame with unique catalog numbers. The output has a cleaned_catalog field in place of catalogNumber. Where duplicates exist, the record with an abundance value, if any, is the one kept.
db1 <- data.frame(
  species = "A",
  decimalLongitude = c(-120.2, -117.1, NA, NA),
  decimalLatitude = c(20.2, 34.1, NA, NA),
  catalogNumber = c("12345", "89888", "LACM8898", "SDNHM6767"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db1",
  abundance = c(1, NA, 8, 23)
)
db2 <- data.frame(
  species = "A",
  decimalLongitude = c(-120.2, -117.1, NA, NA),
  decimalLatitude = c(20.2, 34.1, NA, NA),
  catalogNumber = c("123452", "898828", "LACM82898", "SDNHM62767"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db2",
  abundance = c(1, 2, 3, 19)
)
db_list <- list(db1, db2)
merge_modern_data <- ec_db_merge(db_list = db_list, "modern")
ecodata <- ec_rm_duplicate(merge_modern_data,
  catalogNumber = "catalogNumber",
  abundance = "abundance"
)
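The de-duplication idea can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the helper `dedupe_by_catalog` is hypothetical, and the standardization rule (upper-casing and dropping non-alphanumeric characters) is a guess at what "standardized" might mean.

```r
# Hypothetical helper (not part of the package): keep one record per
# standardized catalog number, preferring records that carry an abundance.
dedupe_by_catalog <- function(df, catalogNumber = "catalogNumber",
                              abundance = "abundance") {
  # Assumed standardization: upper-case, strip non-alphanumeric characters.
  df$cleaned_catalog <- toupper(gsub("[^[:alnum:]]", "", df[[catalogNumber]]))
  # Sort records with an abundance value first so they survive duplicated().
  df <- df[order(is.na(df[[abundance]])), ]
  df <- df[!duplicated(df$cleaned_catalog), ]
  # The output carries cleaned_catalog in place of the original column.
  df[[catalogNumber]] <- NULL
  rownames(df) <- NULL
  df
}

merged <- data.frame(
  catalogNumber = c("LACM-8898", "lacm 8898", "12345", "SDNHM6767"),
  source = c("db1", "db2", "db1", "db2"),
  abundance = c(NA, 8, 1, 23)
)
deduped <- dedupe_by_catalog(merged)
deduped
```

The two spellings of the same catalog number collapse to one row, and the copy with an abundance value is the one retained.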
Remove Duplicate Records from the Merged Data based on occurrenceID
ec_rm_duplicate_occurid(
  data,
  occurrenceID = "occurrenceID",
  abundance = "abundance"
)
data |
merged data frame, the output of ec_db_merge |
occurrenceID |
mandatory field, considered unique for each occurrence record. |
abundance |
mandatory field, created during data extraction by combining the individual count and quantity fields (these may vary from one source to another; we aim to standardize them as "abundance"). |
This function adds a cleaned_occurrenceID column containing standardized occurrenceID values and removes duplicates based on the generated cleaned_occurrenceID column and the abundance column of the data. Mandatory fields are occurrenceID, source and abundance.
A data frame with unique occurrenceID values. The output has a cleaned_occurrenceID field in place of occurrenceID. Where duplicates exist, the record with an abundance value, if any, is the one kept.
db1 <- data.frame(
  species = "A",
  decimalLongitude = c(-120.2, -117.1, NA, NA),
  decimalLatitude = c(20.2, 34.1, NA, NA),
  occurrenceID = c("12345", "898828", "LACM8289", "SDNHM6276"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db1",
  abundance = c(1, NA, 8, 23)
)
db2 <- data.frame(
  species = "A",
  decimalLongitude = c(-120.2, -117.1, NA, NA),
  decimalLatitude = c(20.2, 34.1, NA, NA),
  occurrenceID = c("12345", "898828", "LACM82898", "SDNHM62767"),
  occurrenceStatus = c("present", "", "ABSENT", "Present"),
  basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
  source = "db2",
  abundance = c(1, 2, 3, 19)
)
db_list <- list(db1, db2)
merge_modern_data <- ec_db_merge(
  db_list = db_list,
  "modern"
)
ecodata <- ec_rm_duplicate_occurid(
  merge_modern_data,
  occurrenceID = "occurrenceID",
  abundance = "abundance"
)
Trail Zeros from the Coordinate Values
ec_trail_zero(coord)
coord |
A coordinate value in the numeric format of decimal degree |
A numerical trailed coordinate value.
ec_trail_zero(12.7000000)
ec_trail_zero(45.000000)
A Summary Table of Final Cleaned Spatial and Environmental Variables
ec_var_summary(
  data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
data |
data table after cleaning the records |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
a character vector of column names of the environmental layers |
A summary table with the mean, min and max values of final cleaned spatial and environmental variables
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9, -116.5),
  decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
  BO_sstmean = c(12, 13, 14, 11),
  BO_sstmin = c(9, 6, 10, 10),
  BO_sstmax = c(14, 16, 18, 17)
)
env_layers <- c("BO_sstmean", "BO_sstmin", "BO_sstmax")
ec_var_summary(data,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  env_layers
)
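A table of mean, min and max per variable can be approximated in base R as below. The helper `summarise_ranges` and its output column names are hypothetical and only illustrate the kind of table described; the actual layout returned by ec_var_summary may differ.

```r
# Hypothetical helper (not part of the package): mean, min and max of the
# coordinate columns and each environmental layer.
summarise_ranges <- function(df, latitude = "decimalLatitude",
                             longitude = "decimalLongitude", env_layers) {
  vars <- c(latitude, longitude, env_layers)
  stat <- function(f) {
    vapply(vars, function(v) f(df[[v]], na.rm = TRUE), numeric(1))
  }
  data.frame(
    variable = vars,
    mean = stat(mean),
    min = stat(min),
    max = stat(max),
    row.names = NULL
  )
}

occ <- data.frame(
  decimalLongitude = c(-117, -117.8, -116.9, -116.5),
  decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
  BO_sstmean = c(12, 13, 14, 11),
  BO_sstmin = c(9, 6, 10, 10),
  BO_sstmax = c(14, 16, 18, 17)
)
range_summary <- summarise_ranges(
  occ,
  env_layers = c("BO_sstmean", "BO_sstmin", "BO_sstmax")
)
range_summary
```

Each row gives the span within which the cleaned records fall for one variable, which is the information a range plot needs.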
Check Accepted Synonyms from WoRMS Taxonomy
ec_worms_synonym(
  species_name,
  data,
  scientificName = "scientificName",
  verbose = TRUE
)
species_name |
input species name, e.g. Mexacanthina lugubris |
data |
data table which has information of all occurrence data of the selected species |
scientificName |
default set to "scientificName"; the column in the data extracted from online sources, which may contain various synonyms of species_name. |
verbose |
default value as TRUE |
A table with two columns: column one lists the accepted synonyms, and column two lists the unique species names from the occurrence database together with the number of records tagged under each name.
## Not run:
species_name <- "Mexacanthina lugubris"
data <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
  decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
comparison <- ec_worms_synonym(
  species_name,
  data,
  scientificName = "scientificName",
  verbose = TRUE
)
print(comparison)
## End(Not run)
This data file is considered the raw data file after merging and removing duplicate records from all data sources, e.g. this file is the output of occurrence records of the mollusc species "Mexacanthina lugubris", with all modern records extracted from GBIF, OBIS, iDigBio and InvertEBase
ecodata
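The second column of the returned table, unique names with their record counts, can be reproduced offline with a simple tally. The sketch below does not query WoRMS. The name "Mexacanthina sp." is a made-up placeholder variant, and the column name `n_records` is an assumption.

```r
# Toy occurrence table; the second name is a placeholder variant,
# not a verified synonym.
occ_names <- data.frame(
  scientificName = c(
    "Mexacanthina lugubris", "Mexacanthina lugubris",
    "Mexacanthina lugubris", "Mexacanthina sp."
  )
)

# Count how many records are tagged under each unique name.
name_counts <- as.data.frame(
  table(occ_names$scientificName),
  stringsAsFactors = FALSE
)
names(name_counts) <- c("scientificName", "n_records")
name_counts
```

Comparing such a tally against the accepted synonyms shows which name variants in the occurrence data would need to be reconciled or dropped.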
A data frame with 1115 rows and 19 variables:
index
Type of record (e.g., preserved specimen, fossil)
Presence or absence of the organism
Code of the institution that holds the record
Original recorded date of the event
Full scientific name of the organism
Number of individuals observed
Reported quantity of the organism
Calculated or standardized abundance value
Latitude in decimal degrees
Longitude in decimal degrees
Uncertainty in coordinates (meters)
Named place where the occurrence was recorded
Original text for locality description
Municipality or town of the occurrence
County where the record was observed
State or province name
Country name
Standardized catalog number for de-duplication
used rgbif for GBIF, ridigbio for iDigBio, robis for OBIS and rsymbiota for InvertEBase
This data shows the final cleaned occurrence records
ecodata_cleaned
A data frame with 698 rows and 35 variables:
Index
Type of occurrence record (e.g., preserved specimen, fossil)
Indicates presence or absence of the species
Code of the institution that provided the record
Original text for the event or collection date
Scientific name of the organism
Number of individuals recorded
Reported quantity (unit may vary)
Standardized or calculated abundance value
Latitude in decimal degrees
Longitude in decimal degrees
Spatial uncertainty of coordinates in meters
Named location where the record was collected
Original locality text as provided by the source
Municipality or town of occurrence
County of occurrence
State or province of occurrence
Country of occurrence
Standardized catalog number used for de-duplication
Number of decimal places in the latitude coordinate
Number of decimal places in the longitude coordinate
Flag for low coordinate precision
Flag for invalid or impossible coordinates
Flag for identical latitude and longitude (likely erroneous)
Flag for coordinates at (0,0)
Flag for coordinates placed at a country or region centroid
Flag for coordinates matching GBIF headquarters (artifact)
Flag for coordinates matching institution location
Flag for coordinates outside the study region
Flag for outliers based on clustering of spatial and environmental variables
Mean sea surface temperature from Bio-ORACLE
Maximum sea surface temperature from Bio-ORACLE
Minimum sea surface temperature from Bio-ORACLE
Chlorophyll concentration from Bio-ORACLE
Dissolved oxygen level from Bio-ORACLE
Generated after filtering outlier data points
This data file was created using the GEOLocate tool, and only 4 columns were kept. This georeference information will be merged back into the main data file ecodata
ecodata_corrected
A data frame with 433 rows and 4 variables:
catalog number
latitude values assigned by GEOLocate
longitude values assigned by GEOLocate
uncertainty values assigned by GEOLocate
this file was created manually after extracting the csv file from the GEOLocate online software, which was used to assign coordinate and uncertainty values to records that have locality information
This data file was created by running the ec_flag_outlier function. It contains records with their outlier probability
ecodata_with_outliers
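Merging the georeferenced values back into the main table amounts to a keyed join on the catalog number. The sketch below uses base R `merge()`. All column names (`cleaned_catalog`, `corrected_latitude`, `corrected_longitude`, `corrected_uncertainty`) are hypothetical stand-ins, since the actual field names of the two data sets are not shown here.

```r
# Toy main table: two records lack coordinates.
ecodata_demo <- data.frame(
  cleaned_catalog = c("A1", "A2", "A3"),
  decimalLatitude = c(32.9, NA, NA),
  decimalLongitude = c(-117.0, NA, NA)
)

# Toy georeferenced table (hypothetical column names).
corrected_demo <- data.frame(
  cleaned_catalog = "A2",
  corrected_latitude = 33.5,
  corrected_longitude = -117.8,
  corrected_uncertainty = 500
)

# Left join keeps every main record, matched or not.
merged_geo <- merge(ecodata_demo, corrected_demo,
  by = "cleaned_catalog", all.x = TRUE
)

# Fill coordinates only where they are missing and a correction exists.
fill <- is.na(merged_geo$decimalLatitude) & !is.na(merged_geo$corrected_latitude)
merged_geo$decimalLatitude[fill] <- merged_geo$corrected_latitude[fill]
merged_geo$decimalLongitude[fill] <- merged_geo$corrected_longitude[fill]
merged_geo
```

Records without a match keep their missing coordinates, so they can still be identified and handled in a later step.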
A data frame with 713 rows and 35 variables:
index
Type of occurrence record (e.g., preserved specimen, fossil)
Indicates presence or absence of the species
Code of the institution that provided the record
Original text for the event or collection date
Scientific name of the organism
Number of individuals recorded
Reported quantity (unit may vary)
Standardized or calculated abundance value
Latitude in decimal degrees
Longitude in decimal degrees
Spatial uncertainty of coordinates in meters
Named location where the record was collected
Original locality text as provided by the source
Municipality or town of occurrence
County of occurrence
State or province of occurrence
Country of occurrence
Standardized catalog number used for de-duplication
Number of decimal places in the latitude coordinate
Number of decimal places in the longitude coordinate
Flag for low coordinate precision
Flag for invalid or impossible coordinates
Flag for identical latitude and longitude (likely erroneous)
Flag for coordinates at (0,0)
Flag for coordinates placed at a country or region centroid
Flag for coordinates matching GBIF headquarters (artifact)
Flag for coordinates matching institution location
Flag for coordinates outside the study region
Flag for outliers based on clustering of spatial and environmental variables
Mean sea surface temperature from Bio-ORACLE
Maximum sea surface temperature from Bio-ORACLE
Minimum sea surface temperature from Bio-ORACLE
Chlorophyll concentration from Bio-ORACLE
Dissolved oxygen level from Bio-ORACLE
this file was created manually after extracting the csv file from the GEOLocate online software, which was used to assign coordinate and uncertainty values to records that have locality information
This data was created to get unique combinations of coordinate values, which are used to extract environmental variables from Bio-ORACLE and merge them back into the main data table, ecodata
ecodata_x
A data frame with 705 rows and 6 variables:
species name
Latitude in decimal degrees
Longitude in decimal degrees
Mean sea surface temperature from Bio-ORACLE
Maximum sea surface temperature from Bio-ORACLE
Minimum sea surface temperature from Bio-ORACLE
this file has unique coordinate information with unique values of environmental variables
This is a data dump downloaded from InvertEBase. Because the R package that links to InvertEBase is currently archived and not maintained, we provide an example file.
example_sp_invertebase
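Extracting environmental values once per unique coordinate pair and joining them back avoids repeated lookups for records that share a location. A minimal base-R sketch follows. The `sst_mean` values are placeholders standing in for a real Bio-ORACLE extraction, and the `catalog` column is illustrative.

```r
# Toy occurrence table: the first two records share a location.
occ <- data.frame(
  catalog = c("a", "b", "c"),
  decimalLatitude = c(32.9, 32.9, 33.5),
  decimalLongitude = c(-117, -117, -117.8)
)

# One row per unique coordinate combination.
coords <- unique(occ[, c("decimalLatitude", "decimalLongitude")])

# Placeholder values; a real workflow would extract these from raster layers.
coords$sst_mean <- c(12, 13)

# Join the per-location values back onto every occurrence record.
occ_env <- merge(occ, coords,
  by = c("decimalLatitude", "decimalLongitude"), all.x = TRUE
)
occ_env
```

The join relies on the coordinate values being identical in both tables, which holds here because `coords` is derived directly from `occ`.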
A data frame with 710 rows and 20 variables:
invertEbase
CatalogNumber
type of observations
presence or absence
Institution code
when was this occurrence created
species name
abundance
abundance
abundance
Latitude in decimal degrees
Longitude in decimal degrees
uncertainty of coordinates
location information
verbatim location information
municipality
country
State or Provinces
county
country code
this file was downloaded from InvertEBase for the species "Mexacanthina lugubris", with field names modified to follow the TDWG standard.
Calculate Haversine distance
haversine_kmeans(data, latitude, longitude, k)
data |
a data frame with spatial attributes: latitude and longitude |
latitude |
nested input from ec_flag_outlier |
longitude |
nested input from ec_flag_outlier |
k |
number of clusters required for your data set. Visual inspection normally gives a sense of the number of clusters. Be cautious about using more clusters than expected just to fit all data points, as overfitting can end up including bad data points in the analysis. e.g. k = 3 |
A data frame with centroids and clusters derived from a Haversine distance matrix
data_x <- data.frame(
  scientificName = "Mexacanthina lugubris",
  decimalLongitude = c(-117, -117.8, -116.9),
  decimalLatitude = c(32.9, 33.5, 31.9),
  BO_sstmean = c(12, 13, 14),
  BO_sstmin = c(9, 6, 10),
  BO_sstmax = c(14, 16, 18)
)
result <- haversine_kmeans(
  data_x,
  latitude = "decimalLatitude",
  longitude = "decimalLongitude",
  k = 3
)
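For readers unfamiliar with the Haversine formula, the sketch below computes great-circle distances in kilometres and assembles a pairwise distance matrix in base R. It covers only the distance step, not the clustering. The helper name `haversine_km` and the Earth radius of 6371 km are assumptions for illustration and may differ from what the package uses internally.

```r
# Hypothetical helper (not part of the package): great-circle distance in km
# between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

lat <- c(32.9, 33.5, 31.9)
lon <- c(-117, -117.8, -116.9)

# Pairwise distance matrix: entry [i, j] is the distance from point i to j.
idx <- seq_along(lat)
dist_matrix <- outer(idx, idx, function(i, j) {
  haversine_km(lat[i], lon[i], lat[j], lon[j])
})
round(dist_matrix, 1)
```

A matrix like this can then be handed to a clustering routine that accepts precomputed distances, which avoids treating degrees of latitude and longitude as if they were planar units.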