Unix Tools

man

$ man <command>

Displays the manual page (if available) for the specified command.

head / tail

$ head <file>
$ tail -n 1 <file>

Prints a number of lines from the start (head) or end (tail) of the given file.
The -n option (or --lines=n in the GNU versions) sets the number of lines to show. The default is 10 lines.
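
As a rough sketch of the -n option, the commands below build a throwaway 12-line file (the file and the variable names are made up for illustration) and pull lines from each end. Chaining head into tail is one common way to extract a range from the middle of a file.

```shell
# Create a hypothetical 12-line file (one number per line) to experiment on.
tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$tmp"

first_three=$(head -n 3 "$tmp")                   # lines 1-3
last_two=$(tail -n 2 "$tmp")                      # lines 11-12
default_count=$(head "$tmp" | wc -l | tr -d ' ')  # no -n given, so 10 lines

# Chain the two tools to pull out a range: the last 2 of the first 5 lines.
middle=$(head -n 5 "$tmp" | tail -n 2)            # lines 4-5

printf '%s\n' "$first_three" "$last_two" "$default_count" "$middle"
rm -f "$tmp"
```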

cut

$ cut -f 1,2,3 <file>

cut keeps selected parts of each line and discards the rest. Lines are split into fields at a delimiter, which is a tab by default or a single character given with the -d option. -f lists the fields to keep, so the example above prints only the first three columns.
BEWARE: if the file was created manually in an editor such as Gedit, tabs may have been replaced automatically with multiple spaces. cut will then find no tabs and print each line unchanged instead of splitting it into columns.
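
The delimiter behaviour can be sketched as follows, assuming two small made-up tables (one with real tabs, one with runs of spaces). Squeezing the spaces into tabs with tr is only one possible workaround, and it assumes that no field itself contains a space.

```shell
# Hypothetical three-column table; printf writes real tab characters.
tabbed=$(mktemp)
printf 'chr1\t100\tgeneA\nchr2\t200\tgeneB\n' > "$tabbed"

# Default delimiter is a tab: keep columns 1 and 3.
cols_tab=$(cut -f 1,3 "$tabbed")

# A different single-character delimiter is set with -d, here a comma.
cols_csv=$(printf 'a,b,c\n' | cut -d ',' -f 2)

# The same kind of row with runs of spaces instead of tabs: cut finds
# no delimiter, so it prints the line unchanged.
spaced=$(mktemp)
printf 'chr1    100    geneA\n' > "$spaced"
unsplit=$(cut -f 1 "$spaced")

# Workaround: turn each run of spaces into a single tab before cutting.
fixed=$(tr -s ' ' '\t' < "$spaced" | cut -f 1)

printf '%s\n' "$cols_tab" "$cols_csv" "$unsplit" "$fixed"
rm -f "$tabbed" "$spaced"
```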

grep

$ grep <pattern> <file>              # finds lines in file that match pattern
$ grep -w <pattern> <file>           # matches pattern only if it occurs as a whole word
$ grep -E '<regex>' <file>           # matches regular expression
$ egrep '<regex>' <file>             # matches regular expression
$ cat -n <file> | grep <pattern>     # adds line numbers then matches pattern

Finds and prints lines in a file that match a pattern. The pattern may be simple, such as a single character or word, or a regular expression (e.g. "^#" to match a # at the start of a line, as in a VCF header). Regular expressions are better supported in extended mode, activated with either the -E flag or by running egrep (which is the same as grep -E; recent GNU grep releases mark egrep as obsolescent, so -E is the safer choice). To stop the shell from processing the expression itself, enclose it in quotes.
grep is an extremely versatile tool, and the man page should be read to see all the options available. For example, the search can be inverted, to find only those lines not matching the pattern, by using grep -v <pattern> <file>.
In the last example above, grep is combined with cat -n. cat -n writes each line of the file to a stream with its line number added at the start. grep can read this stream directly, without a file argument, so the result is the matching lines, each with its line number. However, the added numbers interfere with positional patterns such as the "^#" pattern used previously, because the lines no longer start with their original first character. grep's own -n option numbers the matching lines without this problem.
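
To make the anchoring issue concrete, here is a small sketch on a made-up VCF-like file (the contents and the variable names are illustrative only). It uses the -v, -c and -n options of grep.

```shell
# Hypothetical miniature VCF-like file: two header lines, two records.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr2\t200\n' > "$vcf"

header=$(grep '^#' "$vcf")            # lines starting with '#'
records=$(grep -v '^#' "$vcf")        # -v inverts: every other line
n_records=$(grep -c -v '^#' "$vcf")   # -c prints the number of selected lines

# -n prefixes each match with its line number; '^' still anchors to the
# start of the original line.
numbered=$(grep -n '^chr' "$vcf")

# After cat -n every line starts with padding and a number, so '^#'
# matches nothing. grep exits non-zero on no match, hence '|| true'.
via_cat=$(cat -n "$vcf" | grep -c '^#' || true)

printf '%s\n' "$header" "$records" "$n_records" "$numbered" "$via_cat"
rm -f "$vcf"
```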

sort

$ sort <file>
$ sort -r <file>

sort orders the lines of the input file from smallest to largest, comparing them character by character. -r reverses the sorted output (i.e. largest to smallest). The character-by-character comparison can give unexpected results when sorting numerical values (such as the output of cat -n as explained above), because "16" sorts before "6". To compare numbers as whole numbers rather than individual characters, use the -g option (or -n, the standard numeric option, which is sufficient for plain integers). As an example, using some data from a yeast GFF file:

input:
    352 ARS
    196 ARS_consensus_sequence
    6 blocked_reading_frame
    7058 CDS
    16 centromere
sort output:
    16 centromere
    196 ARS_consensus_sequence
    352 ARS
    6 blocked_reading_frame
    7058 CDS
sort -g output:
    6 blocked_reading_frame
    16 centromere
    196 ARS_consensus_sequence
    352 ARS
    7058 CDS
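
The comparison above can be tried out with a sketch like the following, which writes the five count lines to a temporary file (the file and the variable names are hypothetical). It uses -n rather than -g, and adds the -k option for choosing which field to sort on. LC_ALL=C is set on the text sorts so that the ordering does not depend on the local language settings.

```shell
# The five count lines from the example, written to a hypothetical temp file.
counts=$(mktemp)
printf '%s\n' '352 ARS' '196 ARS_consensus_sequence' '6 blocked_reading_frame' \
    '7058 CDS' '16 centromere' > "$counts"

# Character-by-character: "16" sorts before "6".
by_chars=$(LC_ALL=C sort "$counts" | head -n 1)       # 16 centromere

# Numeric: whole numbers are compared.
by_number=$(sort -n "$counts" | head -n 1)            # 6 blocked_reading_frame
largest=$(sort -n -r "$counts" | head -n 1)           # 7058 CDS

# -k 2 sorts on the second field (the feature name) instead of the count.
by_name=$(LC_ALL=C sort -k 2 "$counts" | head -n 1)   # 352 ARS

printf '%s\n' "$by_chars" "$by_number" "$largest" "$by_name"
rm -f "$counts"
```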

uniq

$ uniq <file>
$ uniq -c <file>

Returns one copy of each distinct line in a file, provided the input is sorted: uniq only collapses repeated lines that are adjacent, so it is normally run on the output of sort. Useful if the file is a well-formatted list of one item per line; otherwise it may have to be combined with other tools (such as cut) in order to be useful.
The -c option prefixes each line with the number of times it occurred.
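
A typical combination with cut and sort might look like the sketch below, which assumes a made-up tab-separated table with the feature type in column 3. Note that uniq only compares neighbouring lines, which is why the input is sorted first.

```shell
# Hypothetical feature table: tab-separated, feature type in column 3.
gff=$(mktemp)
printf 'chrI\tSGD\tCDS\nchrI\tSGD\tARS\nchrI\tSGD\tCDS\nchrII\tSGD\tCDS\n' > "$gff"

# Without sorting, only the two neighbouring CDS lines are merged.
unsorted=$(cut -f 3 "$gff" | uniq | wc -l | tr -d ' ')          # 3

# Sorting first groups identical lines together.
distinct=$(cut -f 3 "$gff" | sort | uniq | wc -l | tr -d ' ')   # 2

# Count each type, order by count (largest first), and tidy the top line.
top=$(cut -f 3 "$gff" | sort | uniq -c | sort -n -r | head -n 1 | awk '{print $2, $1}')

printf '%s\n' "$unsorted" "$distinct" "$top"
rm -f "$gff"
```

Counts like those in the sort example above are the sort of output that sort followed by uniq -c produces.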

wc

$ wc <file>
$ wc -l <file>
$ wc -w <file>
$ wc -c <file>

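Counts the number of lines, words, and bytes in a file. The options -l, -w, and -c display their respective individual counts, rather than the default of all three. For plain ASCII text the byte count equals the character count; use -m to count characters in files with multi-byte encodings.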
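
As a brief illustration, the sketch below counts a two-line sample file (a made-up example). Reading the file from standard input keeps the file name out of the output, and tr strips the padding that some versions of wc add, so that the result is a bare number usable in scripts.

```shell
# Hypothetical two-line sample: "one two" and "three".
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

n_lines=$(wc -l < "$sample" | tr -d ' ')   # 2
n_words=$(wc -w < "$sample" | tr -d ' ')   # 3
n_bytes=$(wc -c < "$sample" | tr -d ' ')   # 14 (8 + 6, newlines included)

# wc also reads from a pipe, e.g. to count the lines another tool prints.
n_matches=$(grep 'o' "$sample" | wc -l | tr -d ' ')   # 1 (only "one two")

printf '%s\n' "$n_lines" "$n_words" "$n_bytes" "$n_matches"
rm -f "$sample"
```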

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License