This file is indexed.

/usr/lib/R/site-library/ensembldb/doc/ensembldb.Rmd is in r-bioc-ensembldb 2.2.2-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
  68
  69
  70
  71
  72
  73
  74
  75
  76
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
---
title: "Generating an using Ensembl based annotation packages"
author: "Johannes Rainer"
graphics: yes
package: ensembldb
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Generating an using Ensembl based annotation packages}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{ensembldb,EnsDb.Hsapiens.v75,BiocStyle,AnnotationHub,ggbio,Gviz,magrittr}
---


# Introduction

The `ensembldb` package provides functions to create and use transcript centric
annotation databases/packages. The annotation for the databases are directly
fetched from Ensembl <sup><a id="fnr.1" class="footref" href="#fn.1">1</a></sup> using their Perl API. The functionality and data is
similar to that of the `TxDb` packages from the `GenomicFeatures` package, but, in
addition to retrieve all gene/transcript models and annotations from the
database, the `ensembldb` package provides also a filter framework allowing to
retrieve annotations for specific entries like genes encoded on a chromosome
region or transcript models of lincRNA genes. From version 1.7 on, `EnsDb`
databases created by the `ensembldb` package contain also protein annotation data
(see Section [11](#org5bd9a97) for the database layout and an overview of
available attributes/columns). For more information on the use of the protein
annotations refer to the *proteins* vignette.

Another main goal of this package is to generate *versioned* annotation
packages, i.e. annotation packages that are build for a specific Ensembl
release, and are also named according to that (e.g. `EnsDb.Hsapiens.v75` for
human gene definitions of the Ensembl code database version 75). This ensures
reproducibility, as it allows to load annotations from a specific Ensembl
release also if newer versions of annotation packages/releases are available. It
also allows to load multiple annotation packages at the same time in order to
e.g. compare gene models between Ensembl releases.

In the example below we load an Ensembl based annotation package for Homo
sapiens, Ensembl version 75. The `EnsDb` object providing access to the underlying
SQLite database is bound to the variable name `EnsDb.Hsapiens.v75`.

```{r  load-libs, warning=FALSE, message=FALSE }
library(EnsDb.Hsapiens.v75)

## Making a "short cut"
edb <- EnsDb.Hsapiens.v75
## print some informations for this package
edb

## For what organism was the database generated?
organism(edb)
 
```

```{r  no-network, echo = FALSE, results = "hide" }
## Disable code chunks that require network connection - conditionally
## disable this on Windows only. This is to avoid TIMEOUT errors on the
## Bioconductor Windows build maching (issue #47).
use_network <- FALSE
 
```


# Using `ensembldb` annotation packages to retrieve specific annotations

One of the strengths of the `ensembldb` package and the related `EnsDb` databases is
its implementation of a filter framework that enables to efficiently extract
data sub-sets from the databases. The `ensembldb` package supports most of the
filters defined in the `AnnotationFilter` Bioconductor package and defines some
additional filters specific to the data stored in `EnsDb` databases. Filters can
be passed directly to all methods extracting data from an `EnsDb` (such as `genes`,
`transcripts` or `exons`). Alternatively it is possible with the `addFilter` or `filter`
functions to add a filter directly to an `EnsDb` which will then be used in all
queries on that object.

The `supportedFilters` method can be used to get an overview over all supported
filter classes, each of them (except the `GRangesFilter`) working on a single
column/field in the database.

```{r  filters }
supportedFilters(edb)
 
```

These filters can be divided into 3 main filter types:

-   `IntegerFilter`: filter classes extending this basic object can take a single
    numeric value as input and support the conditions `=, !`, >, <, >= and <=. All
    filters that work on chromosomal coordinates, such as the `GeneEndFilter` extend
    `IntegerFilter`.
-   `CharacterFilter`: filter classes extending this object can take a single or
    multiple character values as input and allow conditions: `=, !`, "startsWith"
    and "endsWith". All filters working on IDs extend this class.
-   `GRangesFilter`: takes a `GRanges` object as input and supports all conditions
    that `findOverlaps` from the `IRanges` package supports ("any", "start", "end",
    "within", "equal"). Note that these have to be passed using the parameter `type`
    to the constructor function.

The supported filters are:

-   `EntrezFilter`: allows to filter results based on NCBI Entrezgene
    identifiers of the genes.
-   `ExonEndFilter`: filter using the chromosomal end coordinate of exons.
-   `ExonIdFilter`: filter based on the (Ensembl) exon identifiers.
-   `ExonRankFilter`: filter based on the rank (index) of an exon within the
    transcript model. Exons are always numbered from 5' to 3' end of the
    transcript, thus, also on the reverse strand, the exon 1 is the most 5' exon
    of the transcript.
-   `ExonStartFilter`: filter using the chromosomal start coordinate of exons.
-   `GeneBiotypeFilter`: filter using the gene biotypes defined in the Ensembl
    database; use the `listGenebiotypes` method to list all available biotypes.
-   `GeneEndFilter`: filter using the chromosomal end coordinate of gene.
-   `GeneIdFilter`: filter based on the Ensembl gene IDs.
-   `GenenameFilter`: filter based on the names (symbols) of the genes.
-   `GeneStartFilter`: filter using the chromosomal start coordinate of gene.
-   `GRangesFilter`: allows to retrieve all features (genes, transcripts or exons)
    that are either within (setting parameter `type` to "within") or partially
    overlapping (setting `type` to "any") the defined genomic region/range. Note
    that, depending on the called method (`genes`, `transcripts` or `exons`) the start
    and end coordinates of either the genes, transcripts or exons are used for the
    filter. For methods `exonsBy`, `cdsBy` and `txBy` the coordinates of `by` are used.
-   `SeqNameFilter`: filter by the name of the chromosomes the genes are encoded
    on.
-   `SeqStrandFilter`: filter for the chromosome strand on which the genes are
    encoded.
-   `SymbolFilter`: filter on gene symbols; note that no database columns *symbol* is
    available in an `EnsDb` database and hence the gene name is used for filtering.
-   `TxBiotypeFilter`: filter on the transcript biotype defined in Ensembl; use
    the `listTxbiotypes` method to list all available biotypes.
-   `TxEndFilter`: filter using the chromosomal end coordinate of transcripts.
-   `TxIdFilter`: filter on the Ensembl transcript identifiers.
-   `TxNameFilter`: filter on the Ensembl transcript names (currently identical to
    the transcript IDs).
-   `TxStartFilter`: filter using the chromosomal start coordinate of transcripts.

In addition to the above listed *DNA-RNA-based* filters, *protein-specific*
filters are also available: 

-   `ProtDomIdFilter`: filter by the protein domain ID.
-   `ProteinIdFilter`: filter by Ensembl protein ID filters.
-   `UniprotDbFilter`: filter by the name of the Uniprot database.
-   `UniprotFilter`: filter by the Uniprot ID.
-   `UniprotMappingTypeFilter`: filter by the mapping type of Ensembl protein IDs to
    Uniprot IDs.

These can however only be used on `EnsDb` databases that provide protein
annotations, i.e. for which a call to `hasProteinData` returns `TRUE`.

`EnsDb` databases for more recent Ensembl versions (starting from Ensembl 87)
provide also evidence levels for individual transcripts in the `tx_support_level`
database column. Such databases support also a `TxSupportLevelFilter` filter to
use this columns for filtering.

A simple use case for the filter framework would be to get all transcripts for
the gene *BCL2L11*. To this end we specify a `GenenameFilter` with the value
*BCL2L11*. As a result we get a `GRanges` object with `start`, `end`, `strand` and `seqname`
being the start coordinate, end coordinate, chromosome name and strand for the
respective transcripts. All additional annotations are available as metadata
columns. Alternatively, by setting `return.type` to "DataFrame", or "data.frame"
the method would return a `DataFrame` or `data.frame` object instead of the default
`GRanges`.

```{r  transcripts }
Tx <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")))

Tx

## as this is a GRanges object we can access e.g. the start coordinates with
head(start(Tx))

## or extract the biotype with
head(Tx$tx_biotype)
 
```

The parameter `columns` of the extractor methods (such as `exons`, `genes` or
`transcripts)` allows to specify which database attributes (columns) should be
retrieved. The `exons` method returns by default all exon-related columns, the
`transcripts` all columns from the transcript database table and the `genes` all
from the gene table. Note however that in the example above we got also a column
`gene_name` although this column is not present in the transcript database
table. By default the methods return also all columns that are used by any of
the filters submitted with the `filter` argument (thus, because a `GenenameFilter`
was used, the column `gene_name` is also returned). Setting
`returnFilterColumns(edb) <- FALSE` disables this option and only the columns
specified by the `columns` parameter are retrieved.

Instead of passing a filter *object* to the method it is also possible to provide
a filter *expression* written as a `formula`. The `formula` has to be written in the
form `~ <field> <condition> <value>` with `<field>` being the field (database
column) in the database, `<condition>` the condition for the filter object and
`<value>` its value. Use the `supportedFilter` method to get the field names
corresponding to each filter class.

```{r  transcripts-filter-expression }
## Use a filter expression to perform the filtering.
transcripts(edb, filter = ~ genename == "ZBTB16")
 
```

Filter expression have to be written as a formula (i.e. starting with a `~`) in
the form *column name* followed by the logical condition.

Alternatively, `EnsDb` objects can be filtered directly using the `filter`
function. In the example below we use the `filter` function to filter the `EnsDb`
object and pass that filtered database to the `transcripts` method using the `%>%`
from the `magrittr` package.

```{r  transcripts-filter }
library(magrittr)

filter(edb, ~ symbol == "BCL2" & tx_biotype != "protein_coding") %>% transcripts
 
```

Adding a filter to an `EnsDb` enables this filter (globally) on all subsequent
queries on that object. We could thus filter an `EnsDb` to (virtually) contain
only features encoded on chromosome Y.

```{r  filter-Y }
edb_y <- addFilter(edb, SeqNameFilter("Y"))

## All subsequent filters on that EnsDb will only work on features encoded on
## chromosome Y
genes(edb_y)

## Get all lincRNAs on chromosome Y
genes(edb_y, filter = ~ gene_biotype == "lincRNA")
 
```

To get an overview of database tables and available columns the function
`listTables` can be used. The method `listColumns` on the other hand lists columns
for the specified database table.

```{r  list-columns }
## list all database tables along with their columns
listTables(edb)

## list columns from a specific table
listColumns(edb, "tx")
 
```

Thus, we could retrieve all transcripts of the biotype *nonsense\_mediated\_decay*
(which, according to the definitions by Ensembl are transcribed, but most likely
not translated in a protein, but rather degraded after transcription) along with
the name of the gene for each transcript. Note that we are changing here the
`return.type` to `DataFrame`, so the method will return a `DataFrame` with the
results instead of the default `GRanges`.

```{r  transcripts-example2 }
Tx <- transcripts(edb,
                  columns = c(listColumns(edb , "tx"), "gene_name"),
                  filter = TxBiotypeFilter("nonsense_mediated_decay"),
                  return.type = "DataFrame")
nrow(Tx)
Tx
 
```

For protein coding transcripts, we can also specifically extract their coding
region. In the example below we extract the CDS for all transcripts encoded on
chromosome Y.

```{r  cdsBy }
yCds <- cdsBy(edb, filter = SeqNameFilter("Y"))
yCds
 
```

Using a `GRangesFilter` we can retrieve all features from the database that are
either within or overlapping the specified genomic region. In the example
below we query all genes that are partially overlapping with a small region on
chromosome 11. The filter restricts to all genes for which either an exon or an
intron is partially overlapping with the region.

```{r  genes-GRangesFilter }
## Define the filter
grf <- GRangesFilter(GRanges("11", ranges = IRanges(114000000, 114000050),
                             strand = "+"), type = "any")

## Query genes:
gn <- genes(edb, filter = grf)
gn

## Next we retrieve all transcripts for that gene so that we can plot them.
txs <- transcripts(edb, filter = GenenameFilter(gn$gene_name))
 
```

```{r  tx-for-zbtb16, message=FALSE, fig.align='center', fig.width=7.5, fig.height=5 }
plot(3, 3, pch = NA, xlim = c(start(gn), end(gn)), ylim = c(0, length(txs)),
     yaxt = "n", ylab = "")
## Highlight the GRangesFilter region
rect(xleft = start(grf), xright = end(grf), ybottom = 0, ytop = length(txs),
     col = "red", border = "red")
for(i in 1:length(txs)) {
    current <- txs[i]
    rect(xleft = start(current), xright = end(current), ybottom = i-0.975,
         ytop = i-0.125, border = "grey")
    text(start(current), y = i-0.5, pos = 4, cex = 0.75, labels = current$tx_id)
}
 
```

As we can see, 4 transcripts of the gene ZBTB16 are also overlapping the
region. Below we fetch these 4 transcripts. Note, that a call to `exons` will
not return any features from the database, as no exon is overlapping with the
region.

```{r  transcripts-GRangesFilter }
transcripts(edb, filter = grf)
 
```

The `GRangesFilter` supports also `GRanges` defining multiple regions and a
query will return all features overlapping any of these regions. Besides using
the `GRangesFilter` it is also possible to search for transcripts or exons
overlapping genomic regions using the `exonsByOverlaps` or
`transcriptsByOverlaps` known from the `GenomicFeatures` package. Note that the
implementation of these methods for `EnsDb` objects supports also to use filters
to further fine-tune the query.

The functions `listGenebiotypes` and `listTxbiotypes` can be used to get an overview
of allowed/available gene and transcript biotype

```{r  biotypes }
## Get all gene biotypes from the database. The GeneBiotypeFilter
## allows to filter on these values.
listGenebiotypes(edb)

## Get all transcript biotypes from the database.
listTxbiotypes(edb)
 
```

Data can be fetched in an analogous way using the `exons` and `genes`
methods. In the example below we retrieve `gene_name`, `entrezid` and the
`gene_biotype` of all genes in the database which names start with "BCL2".

```{r  genes-BCL2 }
## We're going to fetch all genes which names start with BCL. To this end
## we define a GenenameFilter with partial matching, i.e. condition "like"
## and a % for any character/string.
BCLs <- genes(edb,
	      columns = c("gene_name", "entrezid", "gene_biotype"),
	      filter = GenenameFilter("BCL", condition = "startsWith"),
	      return.type = "DataFrame")
nrow(BCLs)
BCLs
 
```

Sometimes it might be useful to know the length of genes or transcripts
(i.e. the total sum of nucleotides covered by their exons). Below we calculate
the mean length of transcripts from protein coding genes on chromosomes X and Y
as well as the average length of snoRNA, snRNA and rRNA transcripts encoded on
these chromosomes. For the first query we combine two `AnnotationFilter` objects
using an `AnnotationFilterList` object, in the second we define the query using a
filter expression.

```{r  example-AnnotationFilterList }
## determine the average length of snRNA, snoRNA and rRNA genes encoded on
## chromosomes X and Y.
mean(lengthOf(edb, of = "tx", filter = AnnotationFilterList(
                                  GeneBiotypeFilter(c("snRNA", "snoRNA", "rRNA")),
                                  SeqNameFilter(c("X", "Y")))))

## determine the average length of protein coding genes encoded on the same
## chromosomes.
mean(lengthOf(edb, of = "tx", filter = ~ gene_biotype == "protein_coding" &
                                  seq_name %in% c("X", "Y")))
 
```

Not unexpectedly, transcripts of protein coding genes are longer than those of
snRNA, snoRNA or rRNA genes.

At last we extract the first two exons of each transcript model from the
database.

```{r  example-first-two-exons }
## Extract all exons 1 and (if present) 2 for all genes encoded on the
## Y chromosome
exons(edb, columns = c("tx_id", "exon_idx"),
      filter = list(SeqNameFilter("Y"),
                    ExonRankFilter(3, condition = "<")))
 
```


# Extracting gene/transcript/exon models for RNASeq feature counting

For the feature counting step of an RNAseq experiment, the gene or transcript
models (defined by the chromosomal start and end positions of their exons) have
to be known. To extract these from an Ensembl based annotation package, the
`exonsBy`, `genesBy` and `transcriptsBy` methods can be used in an analogous way as in
`TxDb` packages generated by the `GenomicFeatures` package.  However, the
`transcriptsBy` method does not, in contrast to the method in the `GenomicFeatures`
package, allow to return transcripts by "cds". While the annotation packages
built by the `ensembldb` contain the chromosomal start and end coordinates of
the coding region (for protein coding genes) they do not assign an ID to each
CDS.

A simple use case is to retrieve all genes encoded on chromosomes X and Y from
the database.

```{r  transcriptsBy-X-Y }
TxByGns <- transcriptsBy(edb, by = "gene", filter = SeqNameFilter(c("X", "Y")))
TxByGns
 
```

Since Ensembl contains also definitions of genes that are on chromosome variants
(supercontigs), it is advisable to specify the chromosome names for which the
gene models should be returned.

In a real use case, we might thus want to retrieve all genes encoded on the
*standard* chromosomes. In addition it is advisable to use a `GeneIdFilter` to
restrict to Ensembl genes only, as also *LRG* (Locus Reference Genomic)
genes<sup><a id="fnr.2" class="footref" href="#fn.2">2</a></sup> are defined in the database, which are partially redundant with
Ensembl genes.

```{r  exonsBy-RNAseq, message = FALSE, eval = FALSE }
## will just get exons for all genes on chromosomes 1 to 22, X and Y.
## Note: want to get rid of the "LRG" genes!!!
EnsGenes <- exonsBy(edb, by = "gene", filter = AnnotationFilterList(
                                          SeqNameFilter(c(1:22, "X", "Y")),
                                          GeneIdFilter("ENSG", "startsWith")))
 
```

The code above returns a `GRangesList` that can be used directly as an input for
the `summarizeOverlaps` function from the `GenomicAlignments` package <sup><a id="fnr.3" class="footref" href="#fn.3">3</a></sup>.

Alternatively, the above `GRangesList` can be transformed to a `data.frame` in
*SAF* format that can be used as an input to the `featureCounts` function of the
`Rsubread` package <sup><a id="fnr.4" class="footref" href="#fn.4">4</a></sup>.

```{r  toSAF-RNAseq, message = FALSE, eval=FALSE }
## Transforming the GRangesList into a data.frame in SAF format
EnsGenes.SAF <- toSAF(EnsGenes)
 
```

Note that the ID by which the `GRangesList` is split is used in the SAF
formatted `data.frame` as the `GeneID`. In the example below this would be the
Ensembl gene IDs, while the start, end coordinates (along with the strand and
chromosomes) are those of the the exons.

In addition, the `disjointExons` function (similar to the one defined in
`GenomicFeatures`) can be used to generate a `GRanges` of non-overlapping exon
parts which can be used in the `DEXSeq` package.

```{r  disjointExons, message = FALSE, eval=FALSE }
## Create a GRanges of non-overlapping exon parts.
DJE <- disjointExons(edb, filter = AnnotationFilterList(
			      SeqNameFilter(c(1:22, "X", "Y")),
			      GeneIdFilter("ENSG%", "startsWith")))
 
```


# Retrieving sequences for gene/transcript/exon models

The methods to retrieve exons, transcripts and genes (i.e. `exons`, `transcripts`
and `genes`) return by default `GRanges` objects that can be used to retrieve
sequences using the `getSeq` method e.g. from BSgenome packages. The basic
workflow is thus identical to the one for `TxDb` packages, however, it is not
straight forward to identify the BSgenome package with the matching genomic
sequence. Most BSgenome packages are named according to the genome build
identifier used in UCSC which does not (always) match the genome build name used
by Ensembl. Using the Ensembl version provided by the `EnsDb`, the correct genomic
sequence can however be retrieved easily from the `AnnotationHub` using the
`getGenomeFaFile`. If no Fasta file matching the Ensembl version is available, the
function tries to identify a Fasta file with the correct genome build from the
*closest* Ensembl release and returns that instead.

In the code block below we retrieve first the `FaFile` with the genomic DNA
sequence, extract the genomic start and end coordinates for all genes defined in
the package, subset to genes encoded on sequences available in the `FaFile` and
extract all of their sequences. Note: these sequences represent the sequence
between the chromosomal start and end coordinates of the gene.

```{r  transcript-sequence-AnnotationHub, message = FALSE, eval = FALSE }
library(EnsDb.Hsapiens.v75)
library(Rsamtools)
edb <- EnsDb.Hsapiens.v75

## Get the FaFile with the genomic sequence matching the Ensembl version
## using the AnnotationHub package.
Dna <- getGenomeFaFile(edb)

## Get start/end coordinates of all genes.
genes <- genes(edb)
## Subset to all genes that are encoded on chromosomes for which
## we do have DNA sequence available.
genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]

## Get the gene sequences, i.e. the sequence including the sequence of
## all of the gene's exons and introns.
geneSeqs <- getSeq(Dna, genes)
 
```

To retrieve the (exonic) sequence of transcripts (i.e. without introns) we can
use directly the `extractTranscriptSeqs` method defined in the `GenomicFeatures` on
the `EnsDb` object, eventually using a filter to restrict the query.

```{r  transcript-sequence-extractTranscriptSeqs, message = FALSE, eval = FALSE }
## get all exons of all transcripts encoded on chromosome Y
yTx <- exonsBy(edb, filter = SeqNameFilter("Y"))

## Retrieve the sequences for these transcripts from the FaFile.
library(GenomicFeatures)
yTxSeqs <- extractTranscriptSeqs(Dna, yTx)
yTxSeqs

## Extract the sequences of all transcripts encoded on chromosome Y.
yTx <- extractTranscriptSeqs(Dna, edb, filter = SeqNameFilter("Y"))

## Along these lines, we could use the method also to retrieve the coding sequence
## of all transcripts on the Y chromosome.
cdsY <- cdsBy(edb, filter = SeqNameFilter("Y"))
extractTranscriptSeqs(Dna, cdsY)
 
```

Note: in the next section we describe how transcript sequences can be retrieved
from a `BSgenome` package that is based on UCSC, not Ensembl.


# Integrating annotations from Ensembl based  `EnsDb` packages with UCSC based annotations

Sometimes it might be useful to combine (Ensembl based) annotations from `EnsDb`
packages/objects with annotations from other Bioconductor packages, that might
base on UCSC annotations. To support such an integration of annotations, the
`ensembldb` packages implements the `seqlevelsStyle` and `seqlevelsStyle<-` from the
`GenomeInfoDb` package that allow to change the style of chromosome naming.  Thus,
sequence/chromosome names other than those used by Ensembl can be used in, and
are returned by, the queries to `EnsDb` objects as long as a mapping for them is
provided by the `GenomeInfoDb` package (which provides a mapping mostly between
UCSC, NCBI and Ensembl chromosome names for the *main* chromosomes).

In the example below we change the seqnames style to UCSC.

```{r  seqlevelsStyle, message = FALSE }
## Change the seqlevels style form Ensembl (default) to UCSC:
seqlevelsStyle(edb) <- "UCSC"

## Now we can use UCSC style seqnames in SeqNameFilters or GRangesFilter:
genesY <- genes(edb, filter = ~ seq_name == "chrY")
## The seqlevels of the returned GRanges are also in UCSC style
seqlevels(genesY)
 
```

Note that in most instances no mapping is available for sequences not
corresponding to the main chromosomes (i.e. contigs, patched chromosomes
etc). What is returned in cases in which no mapping is available can be
specified with the global `ensembldb.seqnameNotFound` option. By default (with
`ensembldb.seqnameNotFound` set to "ORIGINAL"), the original seqnames (i.e. the
ones from Ensembl) are returned. With `ensembldb.seqnameNotFound` "MISSING" each
time a seqname can not be found an error is thrown. For all other cases
(e.g. `ensembldb.seqnameNotFound = NA`) the value of the option is returned.

```{r  seqlevelsStyle-2, message = FALSE }
seqlevelsStyle(edb) <- "UCSC"

## Getting the default option:
getOption("ensembldb.seqnameNotFound")

## Listing all seqlevels in the database.
seqlevels(edb)[1:30]

## Setting the option to NA, thus, for each seqname for which no mapping is available,
## NA is returned.
options(ensembldb.seqnameNotFound=NA)
seqlevels(edb)[1:30]

## Resetting the option.
options(ensembldb.seqnameNotFound = "ORIGINAL")
 
```

Next we retrieve transcript sequences from genes encoded on chromosome Y using
the `BSGenome` package for the human genome from UCSC. The specified version
`hg19` matches the genome build of Ensembl version 75, i.e. `GRCh37`. Note that
while we changed the style of the seqnames to UCSC we did not change the naming
of the genome release.

```{r  extractTranscriptSeqs-BSGenome, warning = FALSE, message = FALSE }
library(BSgenome.Hsapiens.UCSC.hg19)
bsg <- BSgenome.Hsapiens.UCSC.hg19

## Get the genome version
unique(genome(bsg))
unique(genome(edb))
## Although differently named, both represent genome build GRCh37.

## Extract the full transcript sequences.
yTxSeqs <- extractTranscriptSeqs(bsg, exonsBy(edb, "tx",
					      filter = SeqNameFilter("chrY")))

yTxSeqs

## Extract just the CDS
Test <- cdsBy(edb, "tx", filter = SeqNameFilter("chrY"))
yTxCds <- extractTranscriptSeqs(bsg, cdsBy(edb, "tx",
                                           filter = SeqNameFilter("chrY")))
yTxCds
 
```

At last changing the seqname style to the default value `"Ensembl"`.

```{r  seqlevelsStyle-restore }
seqlevelsStyle(edb) <- "Ensembl"
 
```


# Interactive annotation lookup using the `shiny` web app

In addition to the `genes`, `transcripts` and `exons` methods it is possibly to
search interactively for gene/transcript/exon annotations using the internal,
`shiny` based, web application. The application can be started with the
`runEnsDbApp()` function. The search results from this app can also be returned
to the R workspace either as a `data.frame` or `GRanges` object.


# Plotting gene/transcript features using `ensembldb` and `Gviz` and `ggbio`

The `Gviz` package provides functions to plot genes and transcripts along with
other data on a genomic scale. Gene models can be provided either as a
`data.frame`, `GRanges`, `TxDB` database, can be fetched from biomart and can
also be retrieved from `ensembldb`.

Below we generate a `GeneRegionTrack` fetching all transcripts from a certain
region on chromosome Y.

Note that if we want in addition to work also with BAM files that were aligned
against DNA sequences retrieved from Ensembl or FASTA files representing genomic
DNA sequences from Ensembl we should change the `ucscChromosomeNames` option from
`Gviz` to `FALSE` (i.e. by calling `options(ucscChromosomeNames = FALSE)`).  This is
not necessary if we just want to retrieve gene models from an `EnsDb` object, as
the `ensembldb` package internally checks the `ucscChromosomeNames` option and,
depending on that, maps Ensembl chromosome names to UCSC chromosome names.

```{r  gviz-plot, message=FALSE, fig.align='center', fig.width=7.5, fig.height=2.3 }
## Loading the Gviz library
library(Gviz)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Retrieving a Gviz compatible GRanges object with all genes
## encoded on chromosome Y.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "Y",
                                start = 20400000, end = 21400000)
## Define a genome axis track
gat <- GenomeAxisTrack()

## We have to change the ucscChromosomeNames option to FALSE to enable Gviz usage
## with non-UCSC chromosome names.
options(ucscChromosomeNames = FALSE)

plotTracks(list(gat, GeneRegionTrack(gr)))

options(ucscChromosomeNames = TRUE)
 
```

Above we had to change the option `ucscChromosomeNames` to `FALSE` in order to
use it with non-UCSC chromosome names. Alternatively, we could however also
change the `seqnamesStyle` of the `EnsDb` object to `UCSC`. Note that we have to
use now also chromosome names in the *UCSC style* in the `SeqNameFilter`
(i.e. "chrY" instead of `Y`).

```{r  message=FALSE }
seqlevelsStyle(edb) <- "UCSC"
## Retrieving the GRanges objects with seqnames corresponding to UCSC chromosome names.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                start = 20400000, end = 21400000)
seqnames(gr)
## Define a genome axis track
gat <- GenomeAxisTrack()
plotTracks(list(gat, GeneRegionTrack(gr)))
 
```

We can also use the filters from the `ensembldb` package to further refine what
transcripts are fetched, like in the example below, in which we create two
different gene region tracks, one for protein coding genes and one for lincRNAs.

```{r  gviz-separate-tracks, message=FALSE, warning=FALSE, fig.align='center', fig.width=7.5, fig.height=2.25 }
protCod <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                     start = 20400000, end = 21400000,
                                     filter = GeneBiotypeFilter("protein_coding"))
lincs <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                   start = 20400000, end = 21400000,
                                   filter = GeneBiotypeFilter("lincRNA"))

plotTracks(list(gat, GeneRegionTrack(protCod, name = "protein coding"),
                GeneRegionTrack(lincs, name = "lincRNAs")), transcriptAnnotation = "symbol")

## At last we change the seqlevels style again to Ensembl
seqlevelsStyle <- "Ensembl"
 
```

Alternatively, we can also use `ggbio` for plotting. For `ggplot` we can directly
pass the `EnsDb` object along with optional filters (or as in the example below a
filter expression as a `formula`).

```{r  pplot-plot, message=FALSE, fig.align='center', fig.width=7.5, fig.height=4 }
library(ggbio)

## Create a plot for all transcripts of the gene SKA2
autoplot(edb, ~ genename == "SKA2")
 
```

To plot the genomic region and plot genes from both strands we can use a
`GRangesFilter`.

```{r  pplot-plot-2, message=FALSE, fig.align='center', fig.width=7.5, fig.height=4 }
## Get the chromosomal region in which the gene is encoded
ska2 <- genes(edb, filter = ~ genename == "SKA2")
strand(ska2) <- "*"
autoplot(edb, GRangesFilter(ska2), names.expr = "gene_name")
 
```


# Using `EnsDb` objects in the `AnnotationDbi` framework

Most of the methods defined for objects extending the basic annotation package
class `AnnotationDbi` are also defined for `EnsDb` objects (i.e. methods
`columns`, `keytypes`, `keys`, `mapIds` and `select`). While these methods can
be used analogously to basic annotation packages, the implementation for `EnsDb`
objects also support the filtering framework of the `ensembldb` package.

In the example below we first evaluate all the available columns and keytypes in
the database and extract then the gene names for all genes encoded on chromosome
X.

```{r  AnnotationDbi, message = FALSE }
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## List all available columns in the database.
columns(edb)

## Note that these do *not* correspond to the actual column names
## of the database that can be passed to methods like exons, genes,
## transcripts etc. These column names can be listed with the listColumns
## method.
listColumns(edb)

## List all of the supported key types.
keytypes(edb)

## Get all gene ids from the database.
gids <- keys(edb, keytype = "GENEID")
length(gids)

## Get all gene names for genes encoded on chromosome Y.
gnames <- keys(edb, keytype = "GENENAME", filter = SeqNameFilter("Y"))
head(gnames)
 
```

In the next example we retrieve specific information from the database using the
`select` method. First we fetch all transcripts for the genes *BCL2* and
*BCL2L11*. In the first call we provide the gene names, while in the second call
we employ the filtering system to perform a more fine-grained query to fetch
only the protein coding transcripts for these genes.

```{r  select, message = FALSE, warning=FALSE }
## Use the /standard/ way to fetch data.
select(edb, keys = c("BCL2", "BCL2L11"), keytype = "GENENAME",
       columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE"))

## Use the filtering system of ensembldb
select(edb, keys = ~ genename %in% c("BCL2", "BCL2L11") &
                tx_biotype == "protein_coding",
       columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE"))
 
```

Finally, we use the `mapIds` method to establish a mapping between ids and
values. In the example below we fetch transcript ids for the two genes from the
example above.

```{r  mapIds, message = FALSE }
## Use the default method, which just returns the first value for multi mappings.
mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME")

## Alternatively, specify multiVals="list" to return all mappings.
mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME",
       multiVals = "list")

## And, just like before, we can use filters to map only to protein coding transcripts.
mapIds(edb, keys = list(GenenameFilter(c("BCL2", "BCL2L11")),
                        TxBiotypeFilter("protein_coding")), column = "TXID",
       multiVals = "list")
 
```

Note that, if the filters are used, the ordering of the result does no longer
match the ordering of the genes.


# Important notes

These notes might explain eventually unexpected results (and, more importantly,
help avoiding them):

-   The ordering of the results returned by the `genes`, `exons`, `transcripts` methods
    can be specified with the `order.by` parameter. The ordering of the results does
    however **not** correspond to the ordering of values in submitted filter
    objects. The exception is the `select` method. If a character vector of values
    or a single filter is passed with argument `keys` the ordering of results of
    this method matches the ordering of the key values or the values of the
    filter.

-   Results of `exonsBy`, `transcriptsBy` are always ordered by the `by` argument.

-   The CDS provided by `EnsDb` objects **always** includes both, the start and the
    stop codon.

-   Transcripts with multiple CDS are at present not supported by `EnsDb`.

-   At present, `EnsDb` support only genes/transcripts for which all of their
    exons are encoded on the same chromosome and the same strand.

-   Since a single Ensembl gene ID might be mapped to multiple NCBI Entrezgene IDs
    methods such as `genes`, `transcripts` etc return a `list` in the `"entrezid"` column
    of the resulting result object.


# Getting or building `EnsDb` databases/packages

Some of the code in this section is not supposed to be automatically executed
when the vignette is built, as this would require a working installation of the
Ensembl Perl API, which is not expected to be available on each system. Also,
building `EnsDb` from alternative sources, like GFF or GTF files takes some time
and thus also these examples are not directly executed when the vignette is
build.


## Getting `EnsDb` databases

Some `EnsDb` databases are available as `R` packages from Bioconductor and can be
simply installed with the `biocLite` function from the `BiocInstaller` package. The
name of such annotation packages starts with *EnsDb* followed by the abbreviation
of the organism and the Ensembl version on which the annotation
bases. `EnsDb.Hsapiens.v86` provides thus an `EnsDb` database for homo sapiens with
annotations from Ensembl version 86.

Since Bioconductor version 3.5 `EnsDb` databases can also be retrieved directly
from `AnnotationHub`.

```{r  AnnotationHub-query, message = FALSE, eval = use_network }
library(AnnotationHub)
## Load the annotation resource.
ah <- AnnotationHub()

## Query for all available EnsDb databases
query(ah, "EnsDb")
 
```

We can simply fetch one of the databases.

```{r  AnnotationHub-query-2, message = FALSE, eval = use_network }
ahDb <- query(ah, pattern = c("Xiphophorus Maculatus", "EnsDb", 87))
## What have we got
ahDb
 
```

Fetch the `EnsDb` and use it.

```{r  AnnotationHub-fetch, message = FALSE, eval = FALSE }
ahEdb <- ahDb[[1]]

## retriebe all genes
gns <- genes(ahEdb)
 
```

We could even make an annotation package from this `EnsDb` object using the
`makeEnsembldbPackage` and passing `dbfile(dbconn(ahEdb))` as `ensdb` argument.


## Building annotation packages


### Directly from Ensembl databases

The `fetchTablesFromEnsembl` function uses the Ensembl Perl API
to retrieve the required annotations from an Ensembl database (e.g. from the
main site *ensembldb.ensembl.org*). Thus, to use this functionality to build
databases, the Ensembl Perl API needs to be installed (see <sup><a id="fnr.5" class="footref" href="#fn.5">5</a></sup> for details).

Below we create an `EnsDb` database by fetching the required data directly from
the Ensembl core databases. The `makeEnsembldbPackage` function is then used to
create an annotation package from this `EnsDb` containing all human genes for
Ensembl version 75.

```{r  edb-from-ensembl, message = FALSE, eval = FALSE }
library(ensembldb)

## get all human gene/transcript/exon annotations from Ensembl (75)
## the resulting tables will be stored by default to the current working
## directory
fetchTablesFromEnsembl(75, species = "human")

## These tables can then be processed to generate a SQLite database
## containing the annotations (again, the function assumes the required
## txt files to be present in the current working directory)
DBFile <- makeEnsemblSQLiteFromTables()

## and finally we can generate the package
makeEnsembldbPackage(ensdb = DBFile, version = "0.99.12",
                     maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>",
                     author = "J Rainer")
 
```

The generated package can then be build using `R CMD build EnsDb.Hsapiens.v75`
and installed with `R CMD INSTALL EnsDb.Hsapiens.v75*`.  Note that we could
directly generate an `EnsDb` instance by loading the database file, i.e. by
calling `edb <- EnsDb(DBFile)` and work with that annotation object.

To fetch and build annotation packages for plant genomes (e.g. arabidopsis
thaliana), the *Ensembl genomes* should be specified as a host, i.e. setting
`host` to "mysql-eg-publicsql.ebi.ac.uk", `port` to `4157` and `species` to
e.g. "arabidopsis thaliana".


### From a GTF or GFF file

Alternatively, the `ensDbFromAH`, `ensDbFromGff`, `ensDbFromGRanges` and `ensDbFromGtf`
functions allow to build EnsDb SQLite files from a `GRanges` object or GFF/GTF
files from Ensembl (either provided as files or *via* `AnnotationHub`). These
functions do not depend on the Ensembl Perl API, but require a working internet
connection to fetch the chromosome lengths from Ensembl as these are not
provided within GTF or GFF files. Also note that protein annotations are usually
not available in GTF or GFF files, thus, such annotations will not be included
in the generated `EnsDb` database - protein annotations are only available in
`EnsDb` databases created with the Ensembl Perl API (such as the ones provided
through `AnnotationHub` or as Bioconductor packages).

In the next example we create an `EnsDb` database using the `AnnotationHub`
package and load also the corresponding genomic DNA sequence matching the
Ensembl version. We thus first query the `AnnotationHub` package for all
resources available for `Mus musculus` and the Ensembl release 77. Next we
create the `EnsDb` object from the appropriate `AnnotationHub` resource.  We
then use the `getGenomeFaFile` method on the `EnsDb` to directly look up and
retrieve the correct or best matching `FaFile` with the genomic DNA sequence. At
last we retrieve the sequences of all exons using the `getSeq` method.

```{r  gtf-gff-edb, message = FALSE, eval = FALSE }
## Load the AnnotationHub data.
library(AnnotationHub)
ah <- AnnotationHub()

## Query all available files for Ensembl release 77 for
## Mus musculus.
query(ah, c("Mus musculus", "release-77"))

## Get the resource for the gtf file with the gene/transcript definitions.
Gtf <- ah["AH28822"]
## Create a EnsDb database file from this.
DbFile <- ensDbFromAH(Gtf)
## We can either generate a database package, or directly load the data
edb <- EnsDb(DbFile)


## Identify and get the FaFile object with the genomic DNA sequence matching
## the EnsDb annotation.
Dna <- getGenomeFaFile(edb)
library(Rsamtools)
## We next retrieve the sequence of all exons on chromosome Y.
exons <- exons(edb, filter = SeqNameFilter("Y"))
exonSeq <- getSeq(Dna, exons)

## Alternatively, look up and retrieve the toplevel DNA sequence manually.
Dna <- ah[["AH22042"]]
 
```

In the example below we load a `GRanges` containing gene definitions for genes
encoded on chromosome Y and generate a `EnsDb` SQLite database from that
information.

```{r  EnsDb-from-Y-GRanges, message = FALSE, eval = use_network }
## Generate a sqlite database from a GRanges object specifying
## genes encoded on chromosome Y
load(system.file("YGRanges.RData", package = "ensembldb"))
Y

## Create the EnsDb database file
DB <- ensDbFromGRanges(Y, path = tempdir(), version = 75,
		       organism = "Homo_sapiens")

## Load the database
edb <- EnsDb(DB)
edb
 
```

Alternatively we can build the annotation database using the `ensDbFromGtf`
`ensDbFromGff` functions, that extract most of the required data from a GTF
respectively GFF (version 3) file which can be downloaded from Ensembl
(e.g. from <ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens> for human gene
definitions from Ensembl version 75; for plant genomes etc, files can be
retrieved from <ftp://ftp.ensemblgenomes.org>). All information except the
chromosome lengths, the NCBI Entrezgene IDs and protein annotations can be
extracted from these GTF files. The function also tries to retrieve chromosome
length information automatically from Ensembl.

Below we create the annotation from a gtf file that we fetch directly from Ensembl.

```{r  EnsDb-from-GTF, message = FALSE, eval = FALSE }
library(ensembldb)

## the GTF file can be downloaded from
## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/
gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz"
## generate the SQLite database file
DB <- ensDbFromGtf(gtf = gtffile)

## load the DB file directly
EDB <- EnsDb(DB)

## alternatively, build the annotation package
## and finally we can generate the package
makeEnsembldbPackage(ensdb = DB, version = "0.99.12",
                     maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>",
                     author = "J Rainer")
 
```


# Database layout<a id="org5bd9a97"></a>

The database consists of the following tables and attributes (the layout is also
shown in Figure [165](#orgfd622d5)). Note that the protein-specific annotations
might not be available in all `EnsDB` databases (e.g. such ones created with
`ensembldb` version < 1.7 or created from GTF or GFF files).

-   **gene**: all gene specific annotations.
    -   `gene_id`: the Ensembl ID of the gene.
    -   `gene_name`: the name (symbol) of the gene.
    -   `gene_biotype`: the biotype of the gene.
    -   `gene_seq_start`: the start coordinate of the gene on the sequence (usually
        a chromosome).
    -   `gene_seq_end`: the end coordinate of the gene on the sequence.
    -   `seq_name`: the name of the sequence (usually the chromosome name).
    -   `seq_strand`: the strand on which the gene is encoded.
    -   `seq_coord_system`: the coordinate system of the sequence.
    -   `description`: the description of the gene.

-   **entrezgene**: mapping of Ensembl genes to NCBI Entrezgene identifiers. Note that
    this mapping can be a one-to-many mapping.
    -   `gene_id`: the Ensembl gene ID.
    -   `entrezid`: the NCBI Entrezgene ID.

-   **tx**: all transcript related annotations. Note that while no `tx_name` column
    is available in this database column, all methods to retrieve data from the
    database support also this column. The returned values are however the ID of
    the transcripts.
    
    -   `tx_id`: the Ensembl transcript ID.
    -   `tx_biotype`: the biotype of the transcript.
    -   `tx_seq_start`: the start coordinate of the transcript.
    -   `tx_seq_end`: the end coordinate of the transcript.
    -   `tx_cds_seq_start`: the start coordinate of the coding region of the
        transcript (NULL for non-coding transcripts).
    -   `tx_cds_seq_end`: the end coordinate of the coding region of the transcript.
    -   `gene_id`: the gene to which the transcript belongs.
    
    `EnsDb` databases for more recent Ensembl releases have also a column
    `tx_support_level` providing the evidence level for a transcript (1 high
    evidence, 5 low evidence, NA no evidence calculated).

-   **exon**: all exon related annotation.
    -   `exon_id`: the Ensembl exon ID.
    -   `exon_seq_start`: the start coordinate of the exon.
    -   `exon_seq_end`: the end coordinate of the exon.

-   **tx2exon**: provides the n:m mapping between transcripts and exons.
    -   `tx_id`: the Ensembl transcript ID.
    -   `exon_id`: the Ensembl exon ID.
    -   `exon_idx`: the index of the exon in the corresponding transcript, always
        from 5' to 3' of the transcript.

-   **chromosome**: provides some information about the chromosomes.
    -   `seq_name`: the name of the sequence/chromosome.
    -   `seq_length`: the length of the sequence.
    -   `is_circular`: whether the sequence in circular.

-   **protein**: provides protein annotation for a (coding) transcript.
    -   `protein_id`: the Ensembl protein ID.
    -   `tx_id`: the transcript ID which CDS encodes the protein.
    -   `protein_sequence`: the peptide sequence of the protein (translated from the
        transcript's coding sequence after applying eventual RNA editing).

-   **uniprot**: provides the mapping from Ensembl protein ID(s) to Uniprot ID(s). Not
    all Ensembl proteins are annotated to Uniprot IDs, also, each Ensembl protein
    might be mapped to multiple Uniprot IDs.
    -   `protein_id`: the Ensembl protein ID.
    -   `uniprot_id`: the Uniprot ID.
    -   `uniprot_db`: the Uniprot database in which the ID is defined.
    -   `uniprot_mapping_type`: the type of the mapping method that was used to assign
        the Uniprot ID to an Ensembl protein ID.

-   **protein\_domain**: provides protein domain annotations and mapping to proteins.
    -   `protein_id`: the Ensembl protein ID on which the protein domain is present.
    -   `protein_domain_id`: the ID of the protein domain (from the protein domain
        source).
    -   `protein_domain_source`: the source/analysis method in/by which the protein
        domain was defined (such as pfam etc).
    -   `interpro_accession`: the Interpro accession ID of the protein domain.
    -   `prot_dom_start`: the start position of the protein domain within the
        protein's sequence.
    -   `prot_dom_end`: the end position of the protein domain within the protein's
        sequence.

-   **metadata**: some additional, internal, informations (Genome build, Ensembl
    version etc).
    -   `name`
    -   `value`

-   *virtual* columns:
    -   `symbol`: the database does not have such a database column, but it is still
        possible to use it in the `columns` parameter. This column is *symlinked* to the
        `gene_name` column.
    -   `tx_name`: similar to the `symbol` column, this column is *symlinked* to the `tx_id`
        column.

The database layout: as already described above, protein related annotations
(green) might not be available in each `EnsDb` database.

![img](images/dblayout.png "Database layout.")


# Footnotes

<sup><a id="fn.1" href="#fnr.1">1</a></sup> <http://www.ensembl.org>

<sup><a id="fn.2" href="#fnr.2">2</a></sup> <http://www.lrg-sequence.org>

<sup><a id="fn.3" href="#fnr.3">3</a></sup> <http://www.ncbi.nlm.nih.gov/pubmed/23950696>

<sup><a id="fn.4" href="#fnr.4">4</a></sup> <http://www.ncbi.nlm.nih.gov/pubmed/24227677>

<sup><a id="fn.5" href="#fnr.5">5</a></sup> <http://www.ensembl.org/info/docs/api/api_installation.html>