================================================================================
    PeakAnalyzer Overview 
================================================================================


////////////////////////////////////////////////////////////////////////////////
INSTALLATION
////////////////////////////////////////////////////////////////////////////////

Unpack the PeakAnalyzer.zip file, and move to the PeakAnalyzer folder. This folder 
contains two jar files: 
1. PeakAnalyzer.jar 
2. PeakAnalyzerGui.jar 
PeakAnalyzer.jar is the executable jar file you have to run. The directory also 
includes a "Data" directory which contains annotation files for human and mouse. 

////////////////////////////////////////////////////////////////////////////////
REQUIREMENTS
////////////////////////////////////////////////////////////////////////////////

1. Java 1.5 or later installed. If running PeakAnalyzer under Windows, Java 1.6_10
or later is recommended.
2. A "Data" folder MUST be present in the folder where you run PeakAnalyzer.
3. You have to be connected to the internet in order to retrieve subpeak sequences.
4. R is required in order to generate result plots.

You can download R from http://www.r-project.org/.
The R bin directory must be included in the PATH.

On Linux/Unix, using bash, you need to add the R/bin directory to the PATH by
adding this line to your ~/.bashrc:

export PATH="<location of R installation directory>/R-X.X.X/bin:$PATH"

If you use the C shell, add the following line to your ~/.cshrc file:

setenv PATH <location of R installation directory>/R-X.X.X/bin:${PATH}

On Windows, open the System Properties dialog.
In the Advanced section, click the Environment Variables button. 
Then, in the Environment Variables window, highlight the Path variable in the 
System variables section and click the Edit button. 
Add the path to the R/bin directory to the existing value. 
For example: "C:\Program Files\R\R-X.X.X\bin"

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, double click on "PeakAnalyzer.jar", or open a terminal 
window, navigate to the PeakAnalyzer folder and type:
java -jar PeakAnalyzer.jar

////////////////////////////////////////////////////////////////////////////////
DOCUMENTATION
////////////////////////////////////////////////////////////////////////////////

PeakAnalyzer is a Java GUI application comprising two main utilities:
1. PeakAnnotator - for annotating genomic loci 
2. PeakSplitter - for subdividing broad peaks into individual binding sites  

The following documentation describes the parameters of each GUI window.

First window - "Choose Utility"
==================================
In this window you have to choose which application to run. The options are:
1. Peak Annotation - choose this option if you want to perform bulk annotation of 
genomic locations, such as identifying the location within genes or the closest 
up- or downstream transcription start site.
2. Split Peaks - choose this option if you want to split enrichment areas into 
individual binding sites. This should be done prior to de novo motif analysis.

Peak Annotation window
=========================
In this window you can choose which annotation utility to run. The options are:
1. NDG - For each locus, search for its Nearest Downstream Gene on both the 
forward and reverse strand. If the position of the locus is within a gene, 
the program describes in which part of that gene the locus is situated (for 
example exon, first intron, etc.). 
2. TSS - For each locus, find its closest TSS (transcription start site). In 
order to do this, the program searches for the closest either upstream or 
downstream gene compared to the genomic coordinate of the locus.
3. ODS - Overlapping two data sets (peak files), to identify common and 
unique genomic locations. Uses random regions matched for chromosome and
length to calculate an enrichment over random and p-value.

Nearest Downstream Genes/Nearest TSS window
============================================
In this window, you have to select the input parameters for the NDG/TSS 
utilities. The parameters are:

*** Peak file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities. 
The file lists the genomic coordinates that were found by a peak calling program 
or obtained in some other way. The format should be tab or space delimited, where 
each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
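
Header lines can be stripped before the file is given to PeakAnalyzer. The 
following Python sketch is not part of PeakAnalyzer. The function name 
`read_peaks` and the rule used to recognise a header (a "start" or "end" column 
that is not an integer) are assumptions made for illustration.

```python
def read_peaks(lines):
    """Return (chrom, start, end) tuples, skipping header-like lines."""
    peaks = []
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) < 3:
            continue  # blank or incomplete line
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # non-numeric coordinates: treated as a header line
        peaks.append((fields[0], start, end))
    return peaks


example = [
    "chromosome\tstart\tend",  # header line, will be dropped
    "chr1\t1000\t1500",
    "chr2 200 900",
]
for chrom, start, end in read_peaks(example):
    print(chrom, start, end, sep="\t")
```

The cleaned lines can then be written to a new file and used as the peak file.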

*** Annotation file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the features/genes of interest and their location in the genome, in 
one of two formats:
1. GTF format - can be downloaded from the Ensembl FTP site at: 
		http://www.ensembl.org/info/data/ftp/index.html
		GTF FILES ARE EXPECTED TO CONTAIN THE SUFFIX ".gtf"
2. BED format - that can be downloaded from the UCSC table browser tool. 
		The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

Requirements for BED file format - NDG utility:
The following fields (columns) should be present: 
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount, 
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) should be present: 
chrom, chromStart, chromEnd, strand.

Please note that according to BED format, lower-numbered fields (columns) must 
always be present if higher-numbered fields are used. Hence, although the field 
"name" is not required for TSS, it should be specified in the file. (Just 
inserting any character in column number 4 in the file is sufficient). 
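
The rule about lower-numbered fields can be turned into a simple column count 
check. The sketch below is not part of PeakAnalyzer. It assumes the standard 
UCSC BED column order, in which "score" is column 5 and "strand" is column 6; 
how PeakAnalyzer itself treats the "score" column is not stated in this 
document. The names `BED_COLUMNS`, `REQUIRED_FIELDS`, `min_columns` and 
`has_enough_columns` are hypothetical.

```python
# Standard UCSC BED column order (columns 1 to 12).
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]

# Fields listed as required for each utility in this document.
REQUIRED_FIELDS = {
    "TSS": ["chrom", "chromStart", "chromEnd", "strand"],
    "NDG": ["chrom", "chromStart", "chromEnd", "name", "strand",
            "thickStart", "thickEnd", "blockCount", "blockSizes",
            "blockStarts"],
}


def min_columns(utility):
    """Number of leading BED columns needed so every required field exists."""
    return max(BED_COLUMNS.index(f) for f in REQUIRED_FIELDS[utility]) + 1


def has_enough_columns(line, utility):
    """True if a BED line has at least the columns the utility needs."""
    return len(line.split()) >= min_columns(utility)


print(min_columns("TSS"), min_columns("NDG"))
print(has_enough_columns("chr1\t100\t200\tx\t0\t+", "TSS"))
```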

Sample annotation files for human and mouse are provided with the program, and are 
located in the sub directory "Data" in the PeakAnalyzer folder.

*** Gene type options
When the annotation file is of GTF format, the user has the option to choose the 
type of genes that will be used for annotation, either "Coding genes only" or 
"Coding and non-coding genes". The latter includes genes such as miRNAs and other 
non-coding RNAs.

*** Symbol file
This is an optional parameter for both the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols and can be downloaded, for 
example, from the UCSC table browser. It is necessary when using a BED format 
annotation file, since BED files do not contain gene symbols. For the Ensembl 
GTF annotation files a symbol file is not required. 

*** Output folder
This is a REQUIRED parameter.
An output directory must be specified. Peak Annotation writes its output files 
to this directory. 

*** Prefix
String to add to output file names, for example in case the same peak files are 
analyzed using different parameters.
If you choose to run the program using the same input files several times, 
and you don't use the prefix option, then the output files will be overwritten.

Overlap peak lists window
===========================
You will get to this window if you chose the "ODS" (overlap) utility.

The input parameters are:

*** Peak file1
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling  program 
or obtained in some other way. The format should be tab or space delimited, where 
each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT

*** Peak file2
This is a REQUIRED parameter.
A second peak file to be compared with Peak file1.

*** Output folder
This is a REQUIRED parameter.
An output directory must be specified. Peak Annotation writes its output files 
to this directory. 

*** Prefix
String to add to output file names, for example in case the same peak files are 
analyzed using different parameters.

*** Randomization
Check this box if you would like to calculate the significance of the
intersection and the fold enrichment over random.
If this option is checked, random regions matched to the regions in the first 
peak file will be generated and intersected with the second peak file.
In order to create random data sets, you have to provide a ChrLength file.
*** ChrLength file
File containing the size of each chromosome, for example:
chr1    197195432
chr2    181748087
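
The idea of random regions "matched for chromosome and length" can be sketched 
as follows. This is an illustration only: this document does not describe how 
PeakAnalyzer draws its random regions, and the function names 
`read_chr_lengths` and `random_matched_regions` are hypothetical. The sketch 
assumes every random region must lie entirely within its chromosome.

```python
import random


def read_chr_lengths(lines):
    """Parse 'chrom<whitespace>length' lines into a dict."""
    lengths = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def random_matched_regions(peaks, chr_lengths, rng):
    """For each peak, draw a region on the same chromosome with the same length."""
    regions = []
    for chrom, start, end in peaks:
        length = end - start
        max_start = chr_lengths[chrom] - length
        new_start = rng.randint(0, max_start)  # both bounds are inclusive
        regions.append((chrom, new_start, new_start + length))
    return regions


chr_lengths = read_chr_lengths(["chr1    197195432", "chr2    181748087"])
peaks = [("chr1", 1000, 1500), ("chr2", 200, 900)]
print(random_matched_regions(peaks, chr_lengths, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the random set reproducible.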


Split Peaks window 
====================
You will get to this window if you chose the "Split Peaks" option.

Input parameters:

*** Peak File
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling program 
or obtained in some other way. The format should be tab or space delimited, where 
each locus is described by its "chromosome", "start" and "end" location.
THIS FILE SHOULD BE SORTED BY CHROMOSOME AND START POSITION
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
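
A peak list can be sorted as required with a few lines of Python. This sketch 
is not part of PeakAnalyzer. It assumes that plain alphabetical ordering of 
chromosome names is acceptable (so "chr10" sorts before "chr2"); this document 
does not say which chromosome order Split Peaks expects. The function name 
`sort_peaks` is hypothetical.

```python
def sort_peaks(peaks):
    """Sort (chrom, start, end) tuples by chromosome name, then by start position."""
    return sorted(peaks, key=lambda peak: (peak[0], peak[1]))


unsorted_peaks = [("chr2", 5, 10), ("chr1", 300, 400), ("chr1", 100, 200)]
for chrom, start, end in sort_peaks(unsorted_peaks):
    print(chrom, start, end, sep="\t")
```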

*** WIG type
This is a REQUIRED parameter.
This can be a WIG file OR a WIG folder that contains one WIG file for each 
chromosome, where the WIG file describes the signals (usually number of reads) 
along the genome. Split Peaks supports WIG files in variableStep or bedGraph formats.
The WIG header lines, "track type" and "variableStep" (when the file is in variableStep 
format), are required. 
The files can be zipped or gzipped, so it's not necessary to uncompress them. WIG 
file names for each chromosome (under the WIG folder) should contain the word "chr" +  
chromosome number, for example "my.chr12.wig".

*** Output Folder
This is a REQUIRED parameter.
An output directory must be specified. Split Peaks writes its output files to this directory. 

*** Prefix
String to add to output file names, for example in case the same peak files are 
analyzed using different parameters.

*** Fetch subpeak sequences
Check this box if you would like to fetch the subpeak sequences surrounding the 
summit regions (where, for example, binding sites are located). You have to be 
connected to the internet in order to fetch sequences.

Split and Fetch sequence parameters window
============================================

Split parameters:

*** Separation float
This value determines when a peak will be separated into subpeaks. Local maxima 
regions are found within each peak and the heights of neighboring local maxima are 
compared. The lower height is multiplied by this separation float to yield 
the minimum depth required to separate the two peaks.
For example, a value of 0.5 means that the height of the valley should be less 
than half the height of the lower of its two summits in order for them to be separated.
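
The separation rule can be written as a short test. This is a sketch of the 
rule as described above, not the PeakSplitter source code. It assumes that the 
"depth" of a valley is measured from the lower of the two summits and that the 
comparison is strict. The function name `should_split` is hypothetical.

```python
def should_split(summit_a, summit_b, valley, separation_float):
    """True if the valley between two local maxima is deep enough to split them."""
    lower_summit = min(summit_a, summit_b)
    min_depth = lower_summit * separation_float  # required drop below the lower summit
    depth = lower_summit - valley                # assumed definition of valley depth
    return depth > min_depth


# Summits of 100 and 80 reads with a separation float of 0.5:
print(should_split(100, 80, 30, 0.5))  # valley at 30 reads
print(should_split(100, 80, 50, 0.5))  # valley at 50 reads
```

With summits of 100 and 80 reads and a separation float of 0.5, the valley must 
drop below 40 reads, so the first call reports a split and the second does not.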

*** Minimum height
Height cutoff. Only subpeaks with at least this number of reads in their summit 
region will be reported.

Fetch sequence parameters:

Please note that the sequences are fetched from the latest build of the genome.

*** Organism
The sequences are fetched directly from the Ensembl DAS database. The user has to 
specify the organism, and PeakAnalyzer will fetch the corresponding sequences.

*** Length
Length of sequence to fetch (default 60).
The sequences are fetched around the summit region, so if the length is 60, 30 bp 
will be fetched upstream of the peak summit position, and 30 bp downstream.
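
The window arithmetic can be sketched as follows. The function name 
`summit_window` is hypothetical, and the handling of odd lengths (rounding the 
half-length down) is an assumption; this document only describes the even case.

```python
def summit_window(summit, length=60):
    """Return (start, end) positions spanning length/2 bp on each side of the summit."""
    half = length // 2  # assumption: odd lengths are rounded down
    return summit - half, summit + half


start, end = summit_window(1000, 60)
print(start, end)
```

For a summit at position 1000 and the default length of 60, this gives the 
window from 970 to 1030.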

*** Amount
Number of best subpeak sequences to fetch (those with the highest numbers of reads 
in their summit region). These sequences can be used as input for motif prediction 
tools such as MEME. 
The default number is 300. This is the maximum number of sequences the web-based 
version of MEME will accept (more sequences can be input when run locally).


////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

Peak Annotation outputs:
------------------------

The output of the "NDG" utility is three tab delimited files:
**************************************************************

A. "peakFileName.ndg.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.ndg.test".
This file describes the closest downstream genes for each genomic locus, and 
contains the following fields:
        1. Chromosome
        2. Start
        3. End - These first three columns describe the location of the peak in 
		the genome.
        4. # Overlapped_Genes - Number of transcripts that overlap with the 
		genomic locus.
           More details about these genes are reported in the second output file 
		described below.
        5. Downstream_FW_Gene - ID of the closest downstream gene on the forward 
		strand.
        6. Symbol - Symbol of the closest downstream gene on the forward strand.
        7. Distance - Distance of the peak to its closest downstream gene on the 
		forward strand.
        8. Downstream_REV_gene - ID of the closest downstream gene on the reverse 
		strand.
        9. Symbol - Symbol of the closest downstream gene on the reverse strand.
        10. Distance - Distance of the peak to its closest downstream gene on the 
		reverse strand.

B. "peakFileName.overlap.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.overlap.test".
This file describes the transcripts overlapping the peaks, if any such are found.
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in 
		the genome.
        4. OverlapGene  - Overlapping gene ID
        5. Symbol       - Overlapping gene symbol
        6. Overlap_Begin - In which part of the gene does the peak's start position 
		overlap
        7. Overlap_Center - In which part of the gene does the peak's central 
		position overlap
        8. Overlap_End  - In which part of the gene does the peak's end position 
		overlap

C. "peakFileName.summary.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.summary.test".
This file contains the following fields
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in 
		the genome.
        4. OverlapGene  - Overlapping gene Symbol.
        5. Downstream Gene - Nearest downstream gene.
        6. Distance - Distance between the peak and its nearest downstream gene.

The output of the TSS option is a tab delimited file:
*****************************************************

"peakFileName.tss.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.tss.test"

This file contains the following fields:
        1. Chromosome
        2. Start
        3. End  - These first three columns describe the location of the peak in 
		the genome.
        4. Distance     - The distance from the peak to its closest TSS.
        5. GeneStart    - The start location of the closest gene on the genome.
        6. GeneEnd      - The end location of the closest gene on the genome.
        7. ClosestTSS_ID - ID of the closest gene.
        8. Symbol       - Symbol of the closest gene.
        9. Strand       - Strand of closest gene.
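
The naming pattern used in the NDG and TSS examples above (the utility tag is 
placed between the file name and its suffix) can be reproduced like this. The 
sketch is inferred from the examples only. The function name `output_name` is 
hypothetical, and the behaviour for input files without a suffix is not 
described in this document.

```python
import os


def output_name(peak_file, tag):
    """Insert a utility tag between the base name and the suffix of a peak file."""
    base, suffix = os.path.splitext(os.path.basename(peak_file))
    return f"{base}.{tag}{suffix}"


for tag in ("ndg", "overlap", "summary", "tss"):
    print(output_name("myPeaks.test", tag))
```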

The output of the "Overlap" option is three tab delimited files:
******************************************************************

A. "peakFile1_peakFile2.overlap.txt"

For example, if the input peak files are "myPeaks1.txt" and "myPeaks2.txt", the 
output file will be "myPeaks1_myPeaks2.overlap.txt"

Each line in this file describes an overlap event between two genomic loci, and 
has the following fields:
        1. Chromosome
        2. peakFile1_Start      - Start location of the first genomic locus
        3. peakFile1_End        - End location of the first genomic locus
        4. peakFile1_Name       - Name of the first genomic locus (if it exists in 
		the input file)
        5. peakFile2_Start      - Start location of the second genomic locus
        6. peakFile2_End        - End location of the second genomic locus
        7. peakFile2_Name       - Name of the second genomic locus (if it exists in 
		the input file)

B and C. Unique files - one file for each input peak file, listing the peaks that are unique to that file.
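
The relationship between the overlap file and the two unique files can be 
illustrated with a small sketch. This is not the PeakAnalyzer implementation. 
It assumes that two loci overlap when they are on the same chromosome and share 
at least one position (coordinates treated as inclusive), and it uses a simple 
pairwise comparison that is only suitable for small lists. The function name 
`split_by_overlap` is hypothetical.

```python
def split_by_overlap(peaks1, peaks2):
    """Return (overlaps, unique1, unique2) for two lists of (chrom, start, end)."""
    overlaps = []
    hit1, hit2 = set(), set()
    for i, (c1, s1, e1) in enumerate(peaks1):
        for j, (c2, s2, e2) in enumerate(peaks2):
            # Same chromosome and the two intervals share at least one position.
            if c1 == c2 and s1 <= e2 and s2 <= e1:
                overlaps.append((c1, s1, e1, s2, e2))
                hit1.add(i)
                hit2.add(j)
    unique1 = [p for i, p in enumerate(peaks1) if i not in hit1]
    unique2 = [p for j, p in enumerate(peaks2) if j not in hit2]
    return overlaps, unique1, unique2


peaks1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks2 = [("chr1", 150, 250), ("chr2", 300, 400)]
overlaps, unique1, unique2 = split_by_overlap(peaks1, peaks2)
print(overlaps)
print(unique1)
print(unique2)
```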


PeakSplitter output
-------------------
If you specified text for the "Prefix" parameter, all output file names will 
start with that text.

1. peakFileName.subpeaks.inputFileNameSuffix
For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.subpeaks.test"

This is a tabular file, which contains information about subpeaks, including 
chromosome name, start position of subpeak, end position of subpeak, number of 
reads at the peak summit position, and subpeak summit position relative to the 
start position of the subpeak region.

2. peakFileName(without suffix).bestSubpeaks.fa
For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.bestSubpeaks.fa".

This is a fasta file, containing the sequences of the best subpeaks (those with 
highest number of reads in their summit position).
The fasta file can be uploaded to a motif prediction program such as MEME.

Plots
*****
There is an option to generate summary plots of the data.
To generate them, click the "Generate plot" button, which appears after the 
program has finished running.

Nearest downstream genes (NDG) plots:
1. Peaks overlapping genes 
The positions of peaks within genes are plotted, based
on the location of the central point of the peak region.
Sometimes the central point falls outside any known gene (although the peak itself
overlaps a gene); in this case, the overlapping region is defined as "Intergenic".
2. Distance to NDG
The distance is calculated from the central point of the peak to the TSS
of the nearest downstream gene.
Distance is always a positive value.

Transcription start site (TSS) plot:
1. Distance from TSS
The distance is calculated from the central point of the peak to the TSS
of the nearest gene. 
Since the distance is calculated to nearest TSS rather than the nearest 
downstream TSS, the values can be both positive and negative.
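
A signed distance of this kind can be sketched as follows. This document does 
not state which direction PeakAnalyzer reports as negative, so the sign 
convention here (negative when the peak centre lies upstream of the TSS with 
respect to the gene's strand) is an assumption, as are the integer midpoint and 
the function names `tss_position` and `signed_distance_to_tss`.

```python
def tss_position(gene_start, gene_end, strand):
    """TSS is the gene start on the forward strand and the gene end on the reverse strand."""
    return gene_start if strand == "+" else gene_end


def signed_distance_to_tss(peak_start, peak_end, gene_start, gene_end, strand):
    """Distance from the peak centre to the TSS; negative means upstream (assumed)."""
    center = (peak_start + peak_end) // 2  # assumption: integer midpoint
    offset = center - tss_position(gene_start, gene_end, strand)
    return offset if strand == "+" else -offset


# Peak centred at position 1000.
print(signed_distance_to_tss(900, 1100, 1500, 3000, "+"))
print(signed_distance_to_tss(900, 1100, 1200, 2000, "-"))
```

In the first call the peak lies 500 bp upstream of a forward-strand TSS at 
1500, giving -500. In the second call the peak lies 1000 bp downstream of a 
reverse-strand TSS at 2000, giving 1000.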