Useful tips - IJM BioInfo Club

BioInfo Club - useful tips

Diverse tips

** to type a literal Tab or Enter character in the shell
press Ctrl+V first and then the key itself. A literal "Enter" is displayed as "^M"

Useful commands

HEAD / TAIL

$head ctd.txt
shows the first 10 lines

$head -n 2 *.pdb
shows the first 2 lines of each file

$history | tail -n 15
shows the 15 most recent items in your command history

$tail -n +2 *.txt
shows from the second line to the end

$head -n -1 *.txt
shows all lines except the last one (a negative count is a GNU head feature)
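
A minimal sketch of the head and tail variants above, run on a made-up 5-line file. The file name demo.txt and its contents are hypothetical, and the negative count for head is assumed to be available (GNU coreutils).

```shell
# Build a throwaway 5-line file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$tmpdir/demo.txt"

first_two=$(head -n 2 "$tmpdir/demo.txt")       # first 2 lines
last_two=$(tail -n 2 "$tmpdir/demo.txt")        # last 2 lines
from_second=$(tail -n +2 "$tmpdir/demo.txt")    # line 2 to the end (skips a header line)
all_but_last=$(head -n -1 "$tmpdir/demo.txt")   # everything except the last line (GNU head)

printf 'first two:\n%s\n' "$first_two"
printf 'last two:\n%s\n' "$last_two"
printf 'from second:\n%s\n' "$from_second"
printf 'all but last:\n%s\n' "$all_but_last"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Combining the two, head -n 10 file | tail -n +2 would give lines 2 to 10.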

-------
GREP
prints out the lines containing the given pattern
$grep ">" *.fasta

$grep "\-122" ctd.txt
searches for a negative number. The backslash keeps grep from reading -122 as an option.

-c
shows only a count of the matching lines

-v
shows only the lines that do not match the pattern. Inverted search.

-i
ignores case

-E
Uses extended regular expressions (without -E, grep uses basic regular expressions). Patterns should be in quotes, use [] to indicate a character range, use [[:space:]] for \s, [[:digit:]] for \d.

-n
Shows the line number of each match
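
A short sketch of the grep options above on a made-up FASTA file. The file name demo.fasta and the gene names are hypothetical.

```shell
# Build a tiny FASTA file with three records (hypothetical data).
tmpdir=$(mktemp -d)
printf '>geneA\nACGT\n>geneB\nTTGA\n>GENEC\nCCGG\n' > "$tmpdir/demo.fasta"

n_headers=$(grep -c ">" "$tmpdir/demo.fasta")            # count header lines
n_other=$(grep -v -c ">" "$tmpdir/demo.fasta")           # count lines that are NOT headers
n_nocase=$(grep -i -c "genec" "$tmpdir/demo.fasta")      # case-insensitive count
numbered=$(grep -n "geneB" "$tmpdir/demo.fasta")         # match prefixed with its line number
n_seq=$(grep -E -c '^[ACGT]+$' "$tmpdir/demo.fasta")     # extended regex: lines made only of A, C, G, T

echo "headers: $n_headers"
echo "non-headers: $n_other"
echo "case-insensitive hits: $n_nocase"
echo "numbered match: $numbered"
echo "sequence lines: $n_seq"

rm -r "$tmpdir"
```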

-------
AGREP
searches for a nearly exact match.

-d "\>"
uses > as a delimiter between records rather than end-of-line

-B -y
returns only the best match (-B selects best-match mode, -y suppresses the confirmation prompt)
$agrep -B -y -d "\>" CYG FPexcerpt.fta

-2
returns results with up to this many mismatches (here 2) between query and record. Maximum allowed is 8.

-l
only lists filenames that contain a match

-i
case-insensitive search

-------
CUT

$cut -f 1,3 *.txt
returns columns 1 and 3 delimited by tabs

$cut -f 1-3 *.txt
returns columns 1 to 3 delimited by tabs

$cut -c 16-20,30 *.txt
returns characters 16 to 20 and 30 from each line

$grep ">" *.fta | cut -c 2-11
prints out the gene names

$head *.txt | cut -f 5,7 -d ","
returns columns 5 and 7. These are delimited by , in the original file and in the output.
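
A minimal sketch of the cut variants above. The files demo.tsv and demo.csv and the header line are made up for illustration.

```shell
# Build one tab-delimited and one comma-delimited file (hypothetical data).
tmpdir=$(mktemp -d)
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > "$tmpdir/demo.tsv"
printf 'a,b,c\n1,2,3\n' > "$tmpdir/demo.csv"

cols_1_3=$(cut -f 1,3 "$tmpdir/demo.tsv")              # columns 1 and 3, tab-delimited
cols_1_to_3=$(cut -f 1-3 "$tmpdir/demo.tsv")           # columns 1 to 3, tab-delimited
csv_1_3=$(cut -d "," -f 1,3 "$tmpdir/demo.csv")        # columns 1 and 3, comma-delimited
name=$(printf '>gene000001 description\n' | cut -c 2-11)  # characters 2 to 11 of a header line

printf '%s\n' "$cols_1_3" "$cols_1_to_3" "$csv_1_3" "$name"

rm -r "$tmpdir"
```

The character-based form only works when the names sit at fixed positions, as in the gene-name example above.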

-------
SORT

$grep ">" *.fasta | sort

-n
sorts by numerical value rather than alphabetically

-f
ignores case when sorting (lowercase letters are compared as if they were uppercase; the output itself is unchanged)

-r
sorts in reverse order

-k 3
sorts lines based on column 3, with columns delimited by spaces or tabs. The key runs from column 3 to the end of the line; use -k 3,3 to sort on column 3 only.
$head *.txt | sort -k 3

-t ","
uses commas for delimiters

-u
returns a unique representative of repeated items
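
A small sketch of the sort options above on made-up data. The file demo.txt is hypothetical, and the locale is fixed to C only so that the ordering is reproducible on any machine.

```shell
# Fix the locale so the alphabetical ordering is the same everywhere (assumption for this demo).
export LC_ALL=C

tmpdir=$(mktemp -d)
printf 'x 10 b\ny 9 a\nz 100 c\ny 9 a\n' > "$tmpdir/demo.txt"

alpha=$(cut -d ' ' -f 2 "$tmpdir/demo.txt" | sort)         # alphabetical: 10 100 9 9
numeric=$(cut -d ' ' -f 2 "$tmpdir/demo.txt" | sort -n)    # numerical:    9 9 10 100
reverse=$(cut -d ' ' -f 2 "$tmpdir/demo.txt" | sort -n -r) # reversed:     100 10 9 9
by_col3=$(sort -k 3,3 "$tmpdir/demo.txt")                  # sort on column 3 only
unique=$(sort -u "$tmpdir/demo.txt")                       # one copy of each repeated line
nocase=$(printf 'B\na\nC\n' | sort -f)                     # case ignored: a B C
by_comma=$(printf 'b,2\na,10\n' | sort -t "," -k 2,2 -n)   # comma-delimited, numeric on column 2

printf '%s\n' "$alpha" "$numeric" "$reverse" "$by_col3" "$unique" "$nocase" "$by_comma"

rm -r "$tmpdir"
```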

-------
UNIQ
removes identical lines that are in immediate succession and keeps a single line. Identical lines that are not adjacent are not merged, so sort the input first if needed.

-c
counts the number of occurrences of each unique line and writes the count before the line
$cut -c 12-21 ctd.txt | uniq -c

-f 4
ignores the first 4 fields (columns delimited by any number of spaces or tabs) in determining uniqueness

-i
ignores case when determining uniqueness
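
A minimal sketch of the uniq options above, using made-up input piped in with printf. The sed step is only there to strip the leading spaces that uniq -c puts before each count.

```shell
# Without sorting, only adjacent duplicates are collapsed: A A B A gives A B A.
adjacent=$(printf 'A\nA\nB\nA\n' | uniq)

# Sorting first groups all duplicates; -c prefixes each line with its count.
counts=$(printf 'A\nA\nB\nA\n' | sort | uniq -c | sed 's/^ *//')

# -i ignores case: a and A count as the same line, the first one is kept.
nocase=$(printf 'a\nA\nb\n' | uniq -i)

# -f 1 ignores the first field, so "1 x" and "2 x" count as duplicates.
skip_field=$(printf '1 x\n2 x\n3 y\n' | uniq -f 1)

printf '%s\n' "$adjacent" "$counts" "$nocase" "$skip_field"
```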