/usr/bin/rsem-run-ebseq is in rsem 1.2.31+dfsg-1.
This file is owned by root:root, with mode 0o755.
The actual contents of the file can be viewed below.
#!/usr/bin/env perl
use Getopt::Long;
use Pod::Usage;
use FindBin;
use lib $FindBin::RealBin;
use rsem_perl_utils;
use Env qw(@PATH);
@PATH = ("$FindBin::RealBin/EBSeq", @PATH);
use strict;
my $ngvF = "";
my $help = 0;
GetOptions("ngvector=s" => \$ngvF,
"h|help" => \$help) or pod2usage(-exitval => 2, -verbose => 2);
pod2usage(-verbose => 2) if ($help == 1);
pod2usage(-msg => "Invalid number of arguments!", -exitval => 2, -verbose => 2) if (scalar(@ARGV) != 3);
pod2usage(-msg => "ngvector file cannot be named as #! # is reserved for other purpose!", -exitval => 2, -verbose => 2) if ($ngvF eq "#");
my $command = "";
my @conditions = split(/,/, $ARGV[1]);
pod2usage(-msg => "At least 2 conditions are required for differential expression analysis!", -exitval => 2, -verbose => 2) if (scalar(@conditions) < 2);
if ($ngvF eq "") { $ngvF = "#"; }
$" = " ";
$command = "rsem-for-ebseq-find-DE $FindBin::RealBin/EBSeq $ngvF $ARGV[0] $ARGV[2] @conditions";
&runCommand($command);
__END__
=head1 NAME
rsem-run-ebseq
=head1 PURPOSE
Wrapper for EBSeq to perform differential expression analysis.
=head1 SYNOPSIS
rsem-run-ebseq [options] data_matrix_file conditions output_file
=head1 ARGUMENTS
=over
=item B<data_matrix_file>
This file is an m by n matrix, where m is the number of genes/transcripts and n is the total number of samples. Each element in the matrix represents the expected count for a particular gene/transcript in a particular sample. Users can use 'rsem-generate-data-matrix' to generate this file from expression result files.
=item B<conditions>
Comma-separated list of values representing the number of replicates for each condition. For example, "3,3" means the data set contains 2 conditions and each condition has 3 replicates. "2,3,3" means the data set contains 3 conditions, with 2, 3, and 3 replicates for each condition respectively.
=item B<output_file>
Output file name.
=back
=head1 OPTIONS
=over
=item B<--ngvector> <file>
This option provides the grouping information required by EBSeq for isoform-level differential expression analysis. The file can be generated by 'rsem-generate-ngvector'. Turning this option on is highly recommended for isoform-level differential expression analysis. (Default: off)
=item B<-h/--help>
Show help information.
=back
=head1 DESCRIPTION
This program is a wrapper over EBSeq. It performs differential expression analysis and can work on two or more conditions. All genes/transcripts and their associated statistics are reported in one output file. This program neither controls the false discovery rate nor calls differentially expressed genes/transcripts. Please use 'rsem-control-fdr' to control the false discovery rate after this program has finished.
=head1 OUTPUT
=over
=item B<output_file>
This file reports the calculated statistics for all genes/transcripts. It is written as a matrix with row and column names. The row names are the genes'/transcripts' names. The column names are for the reported statistics.
If there are only 2 different conditions among the samples, four statistics (columns) will be reported for each gene/transcript. They are "PPEE", "PPDE", "PostFC" and "RealFC". "PPEE" is the posterior probability (estimated by EBSeq) that a gene/transcript is equally expressed. "PPDE" is the posterior probability that a gene/transcript is differentially expressed. "PostFC" is the posterior fold change (condition 1 over condition 2) for a gene/transcript. It is defined as the ratio between posterior mean expression estimates of the gene/transcript for each condition. "RealFC" is the real fold change (condition 1 over condition 2) for a gene/transcript. It is the ratio of the normalized within-condition mean count of condition 1 to that of condition 2 for the gene/transcript. Fold changes are calculated using EBSeq's 'PostFC' function. The genes/transcripts are reported in descending order of their "PPDE" values.
If there are more than 2 different conditions among the samples, the output format is different. For differential expression analysis with more than 2 conditions, EBSeq enumerates all possible expression patterns (that is, which conditions are equally expressed and which are not). Suppose there are k different patterns. The first k columns of the output file give the posterior probability that each expression pattern is true. Patterns are defined in a separate file, 'output_file.pattern'. Column k+1 gives the maximum a posteriori (MAP) expression pattern for each gene/transcript. Column k+2 gives the posterior probability that not all conditions are equally expressed (column name "PPDE"). The genes/transcripts are reported in descending order of their "PPDE" column values. For details on how EBSeq works for more than 2 conditions, please refer to EBSeq's manual.
=item B<output_file.normalized_data_matrix>
This file contains the median normalized version of the input data matrix.
=item B<output_file.pattern>
This file is only generated when there are more than 2 conditions. It defines all possible expression patterns over the conditions using a matrix with names. Each row of the matrix refers to a different expression pattern and each column gives the expression status of a different condition. Two conditions are equally expressed if and only if their statuses are the same.
=item B<output_file.condmeans>
This file is only generated when there are more than 2 conditions. It gives the normalized mean count value for each gene/transcript in each condition. It is formatted as a matrix with names. Each row represents a gene/transcript and each column represents a condition. The order of genes/transcripts is the same as in 'output_file'. This file can be used to calculate fold changes between the conditions that users are interested in.
=back
=head1 EXAMPLES
1) We're interested in isoform-level differential expression analysis and there are two conditions. Each condition has 5 replicates. We have already collected the data matrix as 'IsoMat.txt' and generated ngvector as 'ngvector.ngvec':
rsem-run-ebseq --ngvector ngvector.ngvec IsoMat.txt 5,5 IsoMat.results
The results will be in 'IsoMat.results'; 'IsoMat.results.normalized_data_matrix' will contain the normalized data matrix.
2) We're interested in gene-level analysis and there are 3 conditions. The first condition has 3 replicates and the other two have 4 replicates each. The data matrix is named 'GeneMat.txt':
rsem-run-ebseq GeneMat.txt 3,4,4 GeneMat.results
Four files, 'GeneMat.results', 'GeneMat.results.normalized_data_matrix', 'GeneMat.results.pattern', and 'GeneMat.results.condmeans', will be generated.
=cut
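
The conditions argument described under ARGUMENTS can be checked before the wrapper is called. The following is a minimal sketch, not part of RSEM: the helper names parse_conditions and total_samples are hypothetical, and the positive-integer check is an added assumption (the wrapper itself only verifies that at least 2 conditions are given).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RSEM): parse a conditions string such as
# "2,3,3" into a list of per-condition replicate counts.
sub parse_conditions {
    my ($spec) = @_;
    my @reps = split(/,/, $spec);
    # Same rule the wrapper enforces: at least 2 conditions.
    die "At least 2 conditions are required\n" if scalar(@reps) < 2;
    # Extra check assumed here: every entry must be a positive integer.
    for my $r (@reps) {
        die "Replicate count '$r' is not a positive integer\n"
            unless $r =~ /^[1-9][0-9]*$/;
    }
    return @reps;
}

# Total number of samples implied by the replicate counts; this should match
# the number of sample columns (n) in the data matrix.
sub total_samples {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

my @reps = parse_conditions("2,3,3");
printf "%d conditions, %d samples in total\n", scalar(@reps), total_samples(@reps);
```

Comparing the total against the number of columns in the data matrix before running the analysis can catch a mistyped conditions string early.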
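
The OUTPUT section notes that 'output_file.condmeans' can be used to calculate fold changes between conditions of interest. The sketch below illustrates one way to do that. It rests on assumptions that should be checked against a real file: the layout (a header line of condition names followed by tab-delimited rows holding a name and one normalized mean per condition) is assumed, the sample data is invented for illustration, and the helper fold_changes is hypothetical. It computes a plain ratio, which is not necessarily identical to what EBSeq's 'PostFC' function reports.

```perl
use strict;
use warnings;

# Illustrative data only, in the ASSUMED layout of 'output_file.condmeans':
# a header line of condition names, then name + one mean per condition.
my $condmeans = <<"END";
\tC1\tC2\tC3
geneA\t10\t20\t40
geneB\t8\t8\t2
END

# Hypothetical helper: fold change of condition $i over condition $j
# (1-based condition indices), computed as a plain ratio of the means.
sub fold_changes {
    my ($text, $i, $j) = @_;
    my @lines = grep { /\S/ } split(/\n/, $text);
    shift @lines;    # drop the header line of condition names
    my %fc;
    for my $line (@lines) {
        my ($name, @means) = split(/\t/, $line);
        next if $means[$j - 1] == 0;    # ratio undefined, skip this row
        $fc{$name} = $means[$i - 1] / $means[$j - 1];
    }
    return %fc;
}

# Fold change of condition 3 over condition 1 for every row.
my %fc = fold_changes($condmeans, 3, 1);
printf "%s\t%.2f\n", $_, $fc{$_} for sort keys %fc;
```

Rows whose denominator is zero are skipped rather than given an arbitrary value; a real analysis may prefer a different convention.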