================================================================================
PeakAnalyzer Overview
================================================================================

////////////////////////////////////////////////////////////////////////////////
INSTALLATION
////////////////////////////////////////////////////////////////////////////////

Unpack the PeakAnalyzer.zip file, and move to the PeakAnalyzer folder.
This folder contains two jar files:
1. PeakAnalyzer.jar
2. PeakAnalyzerGui.jar

PeakAnalyzer.jar is the executable jar file you have to run.
The directory also includes a "Data" directory which contains annotation files
for human and mouse.

////////////////////////////////////////////////////////////////////////////////
REQUIREMENTS
////////////////////////////////////////////////////////////////////////////////

1. Java 1.5 or later installed. If running PeakAnalyzer under Windows,
   Java 1.6_10 or later is recommended.
2. A "Data" folder MUST be present in the folder where you run PeakAnalyzer.
3. You have to be connected to the internet in order to retrieve subpeak
   sequences.
4. R is required in order to generate result plots. You can download R from
   http://www.r-project.org/. The R bin directory should be defined in the PATH.

On Linux/Unix, using bash, you need to include the R/bin directory in the PATH
by adding this line to your ~/.bashrc:

    export PATH="/R-X.X.X/bin:$PATH"

If you use the C shell, add the following to your .cshrc file:

    setenv PATH /R-X.X.X/bin:${PATH}

On Windows, open the System Properties dialog. In the Advanced section, click
the Environment Variables button. Then, in the Environment Variables window,
highlight the Path variable in the System variables section and click the Edit
button. Add the path to the R/bin directory, or modify the existing entry.
For example: "C:\Program Files\R\R-X.X.X\bin"

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, double click on "PeakAnalyzer.jar", or open a terminal
window, navigate to the PeakAnalyzer folder and type:

    java -jar PeakAnalyzer.jar

////////////////////////////////////////////////////////////////////////////////
DOCUMENTATION
////////////////////////////////////////////////////////////////////////////////

PeakAnalyzer is a Java GUI application comprising two main utilities:
1. PeakAnnotator - for annotating genomic loci
2. PeakSplitter - for subdividing broad peaks into individual binding sites

The following documentation describes the parameters of each GUI window.

First window - "Choose Utility"
==================================
In this window you have to choose which application to run. The options are:
1. Peak Annotation - choose this option if you want to perform bulk annotation
   of genomic locations, such as identifying their location within genes or
   the closest upstream or downstream transcription start site.
2. Split Peaks - choose this option if you want to split enrichment areas into
   individual binding sites. This should be done prior to de novo motif
   analysis.

Peak Annotation window
=========================
In this window you can choose which annotation utility to run. The options are:
1. NDG - For each locus, search for its Nearest Downstream Gene on both the
   forward and the reverse strand. If the locus lies within a gene, the program
   reports in which part of that gene the locus is situated (for example exon,
   first intron, etc.).
2. TSS - For each locus, find its closest TSS (transcription start site). To do
   this, the program searches for the closest gene, either upstream or
   downstream, relative to the genomic coordinate of the locus.
3. ODS - Overlap two data sets (peak files) to identify common and unique
   genomic locations. Random regions matched for chromosome and length are used
   to calculate an enrichment over random and a p-value.

Nearest Downstream Genes/Nearest TSS window
============================================
In this window, you have to select the input parameters for the NDG/TSS
utilities. The parameters are:

*** Peak file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited,
where each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT.

*** Annotation file
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the features/genes of interest and their location in the genome,
in one of two formats:
1. GTF format - can be downloaded from the Ensembl ftp site at:
   http://www.ensembl.org/info/data/ftp/index.html
   GTF FILES ARE EXPECTED TO CONTAIN THE SUFFIX ".gtf".
2. BED format - can be downloaded from the UCSC table browser tool.
   The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

Requirements for BED file format - NDG utility:
The following fields (columns) should be present:
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount,
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) should be present:
chrom, chromStart, chromEnd, strand.

Please note that according to the BED format, lower-numbered fields (columns)
must always be present if higher-numbered fields are used. Hence, although the
field "name" is not required for TSS, it should be specified in the file.
(Inserting any character in column number 4 of the file is sufficient.)
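
As a rough illustration of the column requirements above, the following Python
sketch checks whether a BED line has enough columns for a given utility. The
column counts follow from the standard BED field order (strand is column 6,
blockStarts is column 12). The names used here are hypothetical and are not part
of PeakAnalyzer, and splitting on any whitespace is an assumption.

```python
# Sketch only: names below are illustrative, not PeakAnalyzer's API.
# Standard BED column order; a field can only be used if all
# lower-numbered fields are also present.
BED_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]

# Highest-numbered field each utility needs, per the requirements above.
LAST_REQUIRED_FIELD = {"TSS": "strand", "NDG": "blockStarts"}


def min_columns(utility):
    """Return how many leading BED columns the given utility needs."""
    return BED_FIELDS.index(LAST_REQUIRED_FIELD[utility]) + 1


def has_enough_columns(bed_line, utility):
    """True if a whitespace-delimited BED line has enough columns."""
    return len(bed_line.split()) >= min_columns(utility)
```

Under these assumptions, a six-column line is enough for TSS, while NDG needs
all twelve columns.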
Sample annotation files for human and mouse are provided with the program, and
are located in the "Data" subdirectory of the PeakAnalyzer folder.

*** Gene type options
When the annotation file is in GTF format, the user can choose the type of genes
that will be used for annotation, either "Coding genes only" or "Coding and
non-coding genes". The latter includes genes such as miRNAs and other non-coding
RNAs.

*** Symbol file
This is an optional parameter for both the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols and can be downloaded,
for example, from the UCSC table browser. It is needed when using a BED format
annotation file, since these files do not contain gene symbols. For the Ensembl
GTF annotation files a symbol file is not required.

*** Output folder
This is a REQUIRED parameter.
An output directory must be specified, for Peak Annotation to put the output
files in.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters. If you run the program on the same input
files several times without using the prefix option, the output files will be
overwritten.

Overlap peak lists window
===========================
You will get to this window if you chose the "Overlap" (ODS) utility.
The input parameters are:

*** Peak file1
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited,
where each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT.

*** Peak file2
This is a REQUIRED parameter.
A second peak file to be compared with Peak file1.

*** Output folder
This is a REQUIRED parameter.
An output directory must be specified, for Peak Annotation to put the output
files in.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters.

*** Randomization
Check this box if you would like to calculate the significance of the
intersection and the fold enrichment over random. If this option is checked,
random regions matched to the first peak file will be generated and intersected
with the second. In order to create random data sets, you have to provide a
ChrLength file.

*** ChrLength file
File containing the size of each chromosome, for example:

    chr1 197195432
    chr2 181748087

Split Peaks window
====================
You will get to this window if you chose the "Split Peaks" option.
Input parameters:

*** Peak File
This is a REQUIRED parameter.
The file lists the genomic coordinates that were found by a peak calling program
or obtained in some other way. The format should be tab or space delimited,
where each locus is described by its "chromosome", "start" and "end" location.
THIS FILE SHOULD BE SORTED BY CHROMOSOME AND START POSITION.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT.

*** WIG type
This is a REQUIRED parameter.
This can be a WIG file OR a WIG folder that contains one WIG file for each
chromosome, where the WIG file describes the signals (usually number of reads)
along the genome. Split Peaks supports WIG files in VariableStep or Bedgraph
formats. The WIG header lines "track type" and "variableStep" (when the file is
in VariableStep format) are required. The files can be zipped or gzipped, so it
is not necessary to uncompress them.
WIG file names for each chromosome (under the WIG folder) should contain the
word "chr" + chromosome number, for example "my.chr12.wig".

*** Output Folder
This is a REQUIRED parameter.
An output directory must be specified, for Split Peaks to put the output files
in.

*** Prefix
String to add to output file names, for example in case the same peak files are
analyzed using different parameters.

*** Fetch subpeak sequences
Check this box if you would like to fetch the subpeak sequences surrounding the
summit regions (where, for example, binding sites are located). You have to be
connected to the internet in order to fetch sequences.

Split and Fetch sequence parameters window
============================================
Split parameters:

*** Separation float
This value determines when a peak will be separated into subpeaks. Local maxima
regions are found within each peak, and the heights of neighboring local maxima
are compared. The lower of the two heights is multiplied by the separation float
to yield the valley height below which the two maxima are separated. For
example, a value of 0.5 means that the height of the valley should be less than
half the height of the lower summit in order for the two to be separated.

*** Minimum height
Height cutoff. Only subpeaks with at least this number of reads in their summit
region will be reported.

Fetch sequence parameters:
Please note that the sequences are fetched from the latest build of the genome.

*** Organism
The sequences are fetched directly from the Ensembl DAS database. The user has
to specify the organism, and PeakAnalyzer will fetch the corresponding
sequences.

*** Length
Length of sequence to fetch (default 60). The sequences are fetched around the
summit region, so if the length is 60, 30 bp will be fetched upstream of the
peak summit position, and 30 bp downstream.

*** Amount
Number of best subpeak sequences to fetch (those with the highest numbers of
reads in their summit region). These sequences can be used as input for motif
prediction tools such as MEME. The default number is 300. This is the maximum
number of sequences the web-based version of MEME will accept (more sequences
can be input when MEME is run locally).
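
The "Separation float" and "Length" parameters described above can be sketched
in Python as follows. This is only one reading of the description, not
PeakAnalyzer's actual code: the function names are hypothetical, and the exact
coordinate convention of the fetched window (inclusive or exclusive ends) is an
assumption.

```python
# Sketch only: one reading of the parameter descriptions, not PeakAnalyzer's code.


def should_split(left_summit, valley, right_summit, separation):
    """Decide whether two neighboring local maxima become separate subpeaks.

    The lower of the two summit heights is multiplied by the separation
    float; the valley between the summits must fall below that product.
    """
    threshold = min(left_summit, right_summit) * separation
    return valley < threshold


def sequence_window(summit_position, length=60):
    """Return (start, end) of the region fetched around a summit.

    Half of `length` is taken upstream of the summit and half downstream.
    """
    half = length // 2
    return summit_position - half, summit_position + half
```

With a separation float of 0.5 and summits of 40 and 30 reads, the valley would
have to drop below 15 reads for the two maxima to be reported as separate
subpeaks.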

////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

Peak Annotation outputs:
------------------------

The "NDG" utility produces three tab delimited files:
**************************************************************
A. "peakFileName.ndg.peakFileNameSuffix"
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.ndg.test".
This file describes the closest downstream genes for each genomic locus, and
contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. # Overlapped_Genes - Number of transcripts that overlap with the genomic
   locus. More details about these genes are reported in the second output file
   described below.
5. Downstream_FW_Gene - ID of the closest downstream gene on the forward strand.
6. Symbol - Symbol of the closest downstream gene on the forward strand.
7. Distance - Distance of the peak to its closest downstream gene on the forward
   strand.
8. Downstream_REV_gene - ID of the closest downstream gene on the reverse
   strand.
9. Symbol - Symbol of the closest downstream gene on the reverse strand.
10. Distance - Distance of the peak to its closest downstream gene on the
    reverse strand.

B. "peakFileName.overlap.peakFileNameSuffix"
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.overlap.test".
This file describes the transcripts overlapping the peaks, if any are found.
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. OverlapGene - Overlapping gene ID
5. Symbol - Overlapping gene symbol
6. Overlap_Begin - The part of the gene that the peak's start position overlaps
7. Overlap_Center - The part of the gene that the peak's central position
   overlaps
8. Overlap_End - The part of the gene that the peak's end position overlaps

C. "peakFileName.summary.peakFileNameSuffix"
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.summary.test".
This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. OverlapGene - Overlapping gene symbol.
5. Downstream Gene - Nearest downstream gene.
6. Distance - Distance between the peak and its nearest downstream gene.

The TSS option produces one tab delimited file:
*****************************************************
"peakFileName.tss.peakFileNameSuffix"
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.tss.test".
This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. Distance - The distance from the peak to its closest TSS.
5. GeneStart - The start location of the closest gene on the genome.
6. GeneEnd - The end location of the closest gene on the genome.
7. ClosestTSS_ID - ID of the closest gene.
8. Symbol - Symbol of the closest gene.
9. Strand - Strand of the closest gene.

The "Overlap" option produces three tab delimited files:
******************************************************************
A. "peakFile1_peakFile2.overlap.txt"
For example, if the input peak files are "myPeaks1.txt" and "myPeaks2.txt", the
output file will be "myPeaks1_myPeaks2.overlap.txt".
Each line in this file describes an overlap event between two genomic loci, and
has the following fields:
1. Chromosome
2. peakFile1_Start - Start location of the first genomic locus
3. peakFile1_End - End location of the first genomic locus
4. peakFile1_Name - Name of the first genomic locus (if it exists in the input
   file)
5. peakFile2_Start - Start location of the second genomic locus
6. peakFile2_End - End location of the second genomic locus
7. peakFile2_Name - Name of the second genomic locus (if it exists in the input
   file)

B+C. Unique files - one file for each input peak file, listing the peaks that
are unique to that file.

PeakSplitter output
-----------------
If you specified a string under the "Prefix" parameter, all output file names
will start with that string.

1. peakFileName.subpeaks.inputFileNameSuffix
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.subpeaks.test".
This is a tabular file containing information about the subpeaks: chromosome
name, start position of the subpeak, end position of the subpeak, number of
reads at the subpeak summit position, and the summit position relative to the
start position of the subpeak region.

2. peakFileName(without suffix).bestSubpeaks.fa
For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.bestSubpeaks.fa".
This is a fasta file containing the sequences of the best subpeaks (those with
the highest number of reads at their summit position). The fasta file can be
uploaded to a motif prediction program such as MEME.

Plots
*****
There is an option to generate summary plots of the data. To generate them,
click the "Generate plot" button, which appears after the program has finished
running.

Nearest downstream genes (NDG) plots:
1. Peaks overlapping genes
The positions of peaks within genes are plotted, based on the location of the
central point of each peak region. Sometimes the central point falls outside a
known gene although the peak itself overlaps the gene. In this case, the
overlapping region is defined as "Intergenic".
2. Distance to NDG
The distance is calculated from the central point of the peak to the TSS of the
nearest downstream gene. The distance is always a positive value.

Transcription start site (TSS) plot:
1. Distance from TSS
The distance is calculated from the central point of the peak to the TSS of the
nearest gene. Since the distance is calculated to the nearest TSS rather than
the nearest downstream TSS, the values can be both positive and negative.
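
A signed distance of this kind could be computed as in the Python sketch below.
The sign convention (negative when the peak center lies upstream of the TSS with
respect to the gene's strand, positive when it lies downstream) and the integer
midpoint are assumptions made for illustration. This documentation does not
state which convention PeakAnalyzer uses, and the function names are
hypothetical.

```python
# Sketch only: the sign convention and names are assumptions, not PeakAnalyzer's code.


def peak_center(start, end):
    """Central point of a peak region (integer midpoint, an assumption)."""
    return (start + end) // 2


def signed_tss_distance(start, end, tss, strand):
    """Distance from the peak center to a TSS, signed relative to the strand.

    Assumed convention: negative means the peak center is upstream of the
    TSS, positive means downstream, taking the gene's strand into account.
    """
    offset = peak_center(start, end) - tss
    return -offset if strand == "-" else offset
```

Under this convention, a peak spanning 100-200 has a distance of -350 to a
forward-strand TSS at position 500, and +350 if the gene is on the reverse
strand.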