compseq

## Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

## Function

Calculate the composition of unique words in sequences

## Description

compseq calculates the composition of words of a specified length (dimer, trimer etc) in the input sequence(s). The word length is user-specified. The unique sequences (words), their observed count, observed frequency, expected frequency and the ratio of observed to expected frequency are written to the output file. This ratio highlights any words with unusually high (or low) occurrence in the input sequences.

## Algorithm

By default, compseq makes the (false) assumption that each word is equally likely. For a nucleotide sequence, the expected frequency of any dimer is therefore 1/16: this is simply the inverse of the number of possible dimers (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT). Similarly, the expected frequency of any trimer is 1/64, etc. Clearly this assumption does not hold for real sequences, where there will be bias in favour of some monomers and words. There are ways around this (see "Notes").
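The default calculation can be sketched in a few lines of Python. This is an illustration only, not compseq's source code: the function names are hypothetical, and the alphabet size of 4 assumes an unambiguous nucleotide sequence.

```python
def uniform_expected_frequency(word_size: int, alphabet_size: int = 4) -> float:
    """Expected frequency of any single word if all words are equally likely.

    ASSUMPTION: alphabet_size=4 (A, C, G, T); a protein alphabet would differ.
    """
    return 1.0 / (alphabet_size ** word_size)


def obs_exp_ratio(count: int, total: int, expected: float) -> float:
    """Observed frequency (count / total) divided by the expected frequency."""
    observed = count / total
    return observed / expected


if __name__ == "__main__":
    print(uniform_expected_frequency(2))  # dimers: 1/16
    print(uniform_expected_frequency(3))  # trimers: 1/64
    # 'AA' occurs 45 times among 517 dimers in the first usage example.
    print(round(obs_exp_ratio(45, 517, uniform_expected_frequency(2)), 7))
```

With the figures from the dinucleotide example (45 occurrences of AA among 517 words), the ratio comes out at about 1.39, in line with the example output file.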

The normal behaviour of compseq is to count the frequencies of all words that occur by moving a window of length 'word' up by one each time. The '-frame' option allows you to move the window up by the length of the word each time, skipping over the intervening words. You can count only those words that occur in a single frame of the word by setting this value to a number other than zero. If you set it to 1 it will only count the words in frame 1, 2 will only count the words in frame 2 and so on.
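The two counting modes can be sketched as follows. This is a minimal illustration under an assumption: frame n is taken to start at the n-th base of the sequence and to advance by one word length each step. That numbering is inferred from the description above rather than from the compseq source, and `count_words` is a hypothetical name.

```python
from collections import Counter


def count_words(seq: str, word: int, frame: int = 0) -> Counter:
    """Count words of length `word` in `seq`.

    frame == 0: slide the window along one position at a time.
    frame >= 1: start at base `frame` (1-based) and step by `word`.
    ASSUMPTION: the frame numbering is inferred from the documentation.
    """
    seq = seq.upper()
    last_start = len(seq) - word  # last index at which a full word still fits
    if frame == 0:
        starts = range(0, last_start + 1)
    else:
        starts = range(frame - 1, last_start + 1, word)
    return Counter(seq[i:i + word] for i in starts)


if __name__ == "__main__":
    print(count_words("ACGTAC", 2))           # every overlapping dimer
    print(count_words("ACGTAC", 2, frame=1))  # dimers starting at base 1
    print(count_words("ACGTAC", 2, frame=2))  # dimers starting at base 2
```

Applied to a 518-base sequence, this scheme yields 517 dimers, 513 hexamers and 172 trimers in frame 2, which agrees with the Total count lines in the example output files.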

## Usage

Here is a sample session with compseq

To count the frequencies of dinucleotides in a file:

```
% compseq tembl:x65923 -word 2 result3.comp
Calculate the composition of unique words in sequences
```

Example 2

To count the frequencies of hexanucleotides, without outputting the results of hexanucleotides that do not occur in the sequence:

```
% compseq tembl:x65923 -word 6 result6.comp -nozero
Calculate the composition of unique words in sequences
```

Example 3

To count the frequencies of trinucleotides in frame 2 of a sequence and use a previously prepared compseq output to show the expected frequencies:

```
% compseq tembl:x65923 -word 3 result3.comp -frame 2 -in prev.comp
Calculate the composition of unique words in sequences
```

## Command line arguments

```
Calculate the composition of unique words in sequences
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
   -word               integer    [2] This is the size of word (n-mer) to
                                  count.
                                  Thus if you want to count codon frequencies
                                  for a nucleotide sequence, you should enter
                                  3 here. (Integer 1 or more)
  [-outfile]           outfile    [*.compseq] This is the results file.

   Additional (Optional) qualifiers (* if not always prompted):
   -infile             infile     This is a file previously produced by
                                  'compseq' that can be used to set the
                                  expected frequencies of words in this
                                  analysis.
                                  The word size in the current run must be the
                                  same as the one in this results file.
                                  Obviously, you should use a file produced
                                  from protein sequences if you are counting
                                  protein sequence word frequencies, and you
                                  must use one made from nucleotide
                                  frequencies if you are analysing a
                                  nucleotide sequence.
   -frame              integer    [0] The normal behaviour of 'compseq' is to
                                  count the frequencies of all words that
                                  occur by moving a window of length 'word' up
                                  by one each time.
                                  This option allows you to move the window up
                                  by the length of the word each time,
                                  skipping over the intervening words.
                                  You can count only those words that occur in
                                  a single frame of the word by setting this
                                  value to a number other than zero.
                                  If you set it to 1 it will only count the
                                  words in frame 1, 2 will only count the
                                  words in frame 2 and so on. (Integer 0 or
                                  more)
*  -[no]ignorebz       boolean    [Y] The amino acid code B represents
                                  Asparagine or Aspartic acid and the code Z
                                  represents Glutamine or Glutamic acid.
                                  These are not commonly used codes and you
                                  may wish not to count words containing them,
                                  just noting them in the count of 'Other'
                                  words.
*  -reverse            boolean    [N] Set this to be true if you also wish to
                                  also count words in the reverse complement
                                  of a nucleic sequence.
   -calcfreq           boolean    [N] If this is set true then the expected
                                  frequencies of words are calculated from the
                                  observed frequency of single bases or
                                  residues in the sequences.
                                  If you are reporting a word size of 1
                                  (single bases or residues) then there is no
                                  point in using this option because the
                                  calculated expected frequency will be equal
                                  to the observed frequency.
                                  Calculating the expected frequencies like
                                  this will give an approximation of the
                                  expected frequencies that you might get by
                                  using an input file of frequencies produced
                                  by a previous run of this program. If an
                                  input file of expected word frequencies has
                                  been specified then the values from that
                                  file will be used instead of this
                                  calculation of expected frequency from the
                                  sequence, even if 'calcfreq' is set to be
                                  true.
   -[no]zerocount      boolean    [Y] You can make the output results file
                                  much smaller if you do not display the words
                                  with a zero count.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
```

| Qualifier | Type | Description | Allowed values | Default |
|---|---|---|---|---|
| **Standard (Mandatory) qualifiers** | | | | |
| [-sequence] (Parameter 1) | seqall | Sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
| -word | integer | This is the size of word (n-mer) to count. Thus if you want to count codon frequencies for a nucleotide sequence, you should enter 3 here. | Integer 1 or more | 2 |
| [-outfile] (Parameter 2) | outfile | This is the results file. | Output file | `<*>.compseq` |
| **Additional (Optional) qualifiers** | | | | |
| -infile | infile | This is a file previously produced by 'compseq' that can be used to set the expected frequencies of words in this analysis. The word size in the current run must be the same as the one in this results file. Obviously, you should use a file produced from protein sequences if you are counting protein sequence word frequencies, and you must use one made from nucleotide frequencies if you are analysing a nucleotide sequence. | Input file | Required |
| -frame | integer | The normal behaviour of 'compseq' is to count the frequencies of all words that occur by moving a window of length 'word' up by one each time. This option allows you to move the window up by the length of the word each time, skipping over the intervening words. You can count only those words that occur in a single frame of the word by setting this value to a number other than zero. If you set it to 1 it will only count the words in frame 1, 2 will only count the words in frame 2 and so on. | Integer 0 or more | 0 |
| -[no]ignorebz | boolean | The amino acid code B represents Asparagine or Aspartic acid and the code Z represents Glutamine or Glutamic acid. These are not commonly used codes and you may wish not to count words containing them, just noting them in the count of 'Other' words. | Boolean value Yes/No | Yes |
| -reverse | boolean | Set this to be true if you also wish to also count words in the reverse complement of a nucleic sequence. | Boolean value Yes/No | No |
| -calcfreq | boolean | If this is set true then the expected frequencies of words are calculated from the observed frequency of single bases or residues in the sequences. If you are reporting a word size of 1 (single bases or residues) then there is no point in using this option because the calculated expected frequency will be equal to the observed frequency. Calculating the expected frequencies like this will give an approximation of the expected frequencies that you might get by using an input file of frequencies produced by a previous run of this program. If an input file of expected word frequencies has been specified then the values from that file will be used instead of this calculation of expected frequency from the sequence, even if 'calcfreq' is set to be true. | Boolean value Yes/No | No |
| -[no]zerocount | boolean | You can make the output results file much smaller if you do not display the words with a zero count. | Boolean value Yes/No | Yes |
| **Advanced (Unprompted) qualifiers** | | | | |
| (none) | | | | |
| **Associated qualifiers** | | | | |
| **"-sequence" associated seqall qualifiers** | | | | |
| -sbegin1, -sbegin_sequence | integer | Start of each sequence to be used | Any integer value | 0 |
| -send1, -send_sequence | integer | End of each sequence to be used | Any integer value | 0 |
| -sreverse1, -sreverse_sequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
| -sask1, -sask_sequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
| -snucleotide1, -snucleotide_sequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
| -sprotein1, -sprotein_sequence | boolean | Sequence is protein | Boolean value Yes/No | N |
| -slower1, -slower_sequence | boolean | Make lower case | Boolean value Yes/No | N |
| -supper1, -supper_sequence | boolean | Make upper case | Boolean value Yes/No | N |
| -sformat1, -sformat_sequence | string | Input sequence format | Any string | |
| -sdbname1, -sdbname_sequence | string | Database name | Any string | |
| -sid1, -sid_sequence | string | Entryname | Any string | |
| -ufo1, -ufo_sequence | string | UFO features | Any string | |
| -fformat1, -fformat_sequence | string | Features format | Any string | |
| -fopenfile1, -fopenfile_sequence | string | Features file name | Any string | |
| **"-outfile" associated outfile qualifiers** | | | | |
| -odirectory2, -odirectory_outfile | string | Output directory | Any string | |
| **General qualifiers** | | | | |
| -auto | boolean | Turn off prompts | Boolean value Yes/No | N |
| -stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
| -filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
| -options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
| -debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
| -verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
| -help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
| -warning | boolean | Report warnings | Boolean value Yes/No | Y |
| -error | boolean | Report errors | Boolean value Yes/No | Y |
| -fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
| -die | boolean | Report dying program messages | Boolean value Yes/No | Y |
| -version | boolean | Report version number and exit | Boolean value Yes/No | N |
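The effect of the -reverse qualifier can be pictured with the following sketch. It is only an illustration: the documentation says that words in the reverse complement are also counted, and the sketch assumes this means the counts from both strands are simply added together. The function names are hypothetical.

```python
from collections import Counter

# Complement table for unambiguous bases only; ambiguity codes are left as-is.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_both_strands(seq: str, word: int) -> Counter:
    """Count overlapping words on the forward strand and its reverse complement.

    ASSUMPTION: counts from the two strands are summed into one table.
    """
    total = Counter()
    for strand in (seq.upper(), reverse_complement(seq)):
        total.update(strand[i:i + word] for i in range(len(strand) - word + 1))
    return total


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(count_both_strands("AACG", 2))
```

Counting both strands doubles the total number of words, which is worth remembering when comparing results produced with and without -reverse.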

## Input file format

compseq reads one or more sequences, specified as a normal USA (Uniform Sequence Address).

An optional second input file, given with -infile, is the output from a previous compseq run and is used to set the expected word frequencies. Its format is exactly the same as the output file format.

An error is produced if the word size of the current compseq job differs from the word size in the results file being read in.
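Reading such a file and checking its word size might look like the sketch below. It is a hypothetical reader written against the example files in this document, not compseq's own parser. It takes the 'Obs Frequency' column of the earlier run as the new expected frequency, which is what the frame 2 example output suggests (there the expected frequency of AAA, 0.0329457, equals its observed frequency in prev.comp).

```python
SAMPLE = """\
# Output from 'compseq'
Word size 3
Total count 516
#
# Word Obs Count Obs Frequency Exp Frequency Obs/Exp Frequency
#
AAA 17 0.0329457 0.0156250 2.1085271
AAC 5 0.0096899 0.0156250 0.6201550

Other 0 0.0000000 0.0000000 10000000000.0000000
"""


def read_expected_frequencies(lines, word_size):
    """Return {word: frequency} from a previous compseq results file.

    ASSUMPTION: the observed frequency (third column) of the earlier run
    is used as the expected frequency of the current run.
    """
    file_word_size = None
    expected = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and the column header
        fields = line.split()
        if fields[:2] == ["Word", "size"]:
            file_word_size = int(fields[2])
        elif fields[:2] == ["Total", "count"] or fields[0] == "Other":
            continue
        else:
            expected[fields[0]] = float(fields[2])
    if file_word_size != word_size:
        raise ValueError(
            f"word size {word_size} differs from word size "
            f"{file_word_size} in the file of expected frequencies"
        )
    return expected


if __name__ == "__main__":
    print(read_expected_frequencies(SAMPLE.splitlines(), 3))
```

Passing a different word size, for example 2, raises the `ValueError`, mirroring the word size check described above.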

### Input files for usage example

'tembl:x65923' is a sequence entry in the example nucleic acid database 'tembl'

### Database entry: tembl:x65923

```
ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC   X65923;
XX
DT   13-MAY-1992 (Rel. 31, Created)
DT   18-APR-2005 (Rel. 83, Last updated, Version 11)
XX
DE   H.sapiens fau mRNA
XX
KW   fau gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
XX
RN   [1]
RP   1-518
RA   Michiels L.M.R.;
RT   ;
RL   Submitted (29-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   L.M.R. Michiels, University of Antwerp, Dept of Biochemistry,
RL   Universiteisplein 1, 2610 Wilrijk, BELGIUM
XX
RN   [2]
RP   1-518
RX   PUBMED; 8395683.
RA   Michiels L., Van der Rauwelaert E., Van Hasselt F., Kas K., Merregaert J.;
RT   "fau cDNA encodes a ubiquitin-like-S30 fusion protein and is expressed as
RT   an antisense sequence in the Finkel-Biskis-Reilly murine sarcoma virus";
RL   Oncogene 8(9):2537-2546(1993).
XX
DR   H-InvDB; HIT000322806.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..518
FT                   /organism="Homo sapiens"
FT                   /chromosome="11q"
FT                   /map="13"
FT                   /mol_type="mRNA"
FT                   /clone_lib="cDNA"
FT                   /clone="pUIA 631"
FT                   /tissue_type="placenta"
FT                   /db_xref="taxon:9606"
FT   misc_feature    57..278
FT                   /note="ubiquitin like part"
FT   CDS             57..458
FT                   /gene="fau"
FT                   /db_xref="GDB:135476"
FT                   /db_xref="GOA:P35544"
FT                   /db_xref="GOA:P62861"
FT                   /db_xref="HGNC:3597"
FT                   /db_xref="InterPro:IPR000626"
FT                   /db_xref="InterPro:IPR006846"
FT                   /db_xref="InterPro:IPR019954"
FT                   /db_xref="InterPro:IPR019955"
FT                   /db_xref="InterPro:IPR019956"
FT                   /db_xref="UniProtKB/Swiss-Prot:P35544"
FT                   /db_xref="UniProtKB/Swiss-Prot:P62861"
FT                   /protein_id="CAA46716.1"
FT                   /translation="MQLFVRAQELHTFEVTGQETVAQIKAHVASLEGIAPEDQVVLLAG
FT                   APLEDEATLGQCGVEALTTLEVAGRMLGGKVHGSLARAGKVRGQTPKVAKQEKKKKKTG
FT                   RAKRRMQYNRRFVNVVPTFGKKKGPNANS"
FT   misc_feature    98..102
FT                   /note="nucleolar localization signal"
FT   misc_feature    279..458
FT                   /note="S30 part"
FT   polyA_signal    484..489
FT   polyA_site      509
XX
SQ   Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;
     ttcctctttc tcgactccat cttcgcggta gctgggaccg ccgttcagtc gccaatatgc        60
     agctctttgt ccgcgcccag gagctacaca ccttcgaggt gaccggccag gaaacggtcg       120
     cccagatcaa ggctcatgta gcctcactgg agggcattgc cccggaagat caagtcgtgc       180
     tcctggcagg cgcgcccctg gaggatgagg ccactctggg ccagtgcggg gtggaggccc       240
     tgactaccct ggaagtagca ggccgcatgc ttggaggtaa agttcatggt tccctggccc       300
     gtgctggaaa agtgagaggt cagactccta aggtggccaa acaggagaag aagaagaaga       360
     agacaggtcg ggctaagcgg cggatgcagt acaaccggcg ctttgtcaac gttgtgccca       420
     cctttggcaa gaagaagggc cccaatgcca actcttaagt cttttgtaat tctggctttc       480
     tctaataaaa aagccactta gttcagtcaa aaaaaaaa                               518
//
```

### File: prev.comp

```
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       HSFAU

Word size       3
Total count     516

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AAA     17          0.0329457       0.0156250       2.1085271
AAC     5           0.0096899       0.0156250       0.6201550
AAG     18          0.0348837       0.0156250       2.2325581
AAT     4           0.0077519       0.0156250       0.4961240
ACA     5           0.0096899       0.0156250       0.6201550
ACC     6           0.0116279       0.0156250       0.7441860
ACG     2           0.0038760       0.0156250       0.2480620
ACT     7           0.0135659       0.0156250       0.8682171
AGA     12          0.0232558       0.0156250       1.4883721
AGC     7           0.0135659       0.0156250       0.8682171
AGG     16          0.0310078       0.0156250       1.9844961
AGT     10          0.0193798       0.0156250       1.2403101
ATA     2           0.0038760       0.0156250       0.2480620
ATC     3           0.0058140       0.0156250       0.3720930
ATG     7           0.0135659       0.0156250       0.8682171
ATT     2           0.0038760       0.0156250       0.2480620
CAA     10          0.0193798       0.0156250       1.2403101
CAC     6           0.0116279       0.0156250       0.7441860
CAG     13          0.0251938       0.0156250       1.6124031
CAT     5           0.0096899       0.0156250       0.6201550
CCA     12          0.0232558       0.0156250       1.4883721
CCC     13          0.0251938       0.0156250       1.6124031
CCG     8           0.0155039       0.0156250       0.9922481
CCT     10          0.0193798       0.0156250       1.2403101
CGA     2           0.0038760       0.0156250       0.2480620
CGC     10          0.0193798       0.0156250       1.2403101
CGG     9           0.0174419       0.0156250       1.1162791
CGT     4           0.0077519       0.0156250       0.4961240
CTA     5           0.0096899       0.0156250       0.6201550
CTC     11          0.0213178       0.0156250       1.3643411
CTG     10          0.0193798       0.0156250       1.2403101
CTT     11          0.0213178       0.0156250       1.3643411
GAA     11          0.0213178       0.0156250       1.3643411
GAC     6           0.0116279       0.0156250       0.7441860
GAG     10          0.0193798       0.0156250       1.2403101
GAT     4           0.0077519       0.0156250       0.4961240
GCA     7           0.0135659       0.0156250       0.8682171
GCC     18          0.0348837       0.0156250       2.2325581
GCG     8           0.0155039       0.0156250       0.9922481
GCT     10          0.0193798       0.0156250       1.2403101
GGA     13          0.0251938       0.0156250       1.6124031
GGC     17          0.0329457       0.0156250       2.1085271
GGG     7           0.0135659       0.0156250       0.8682171
GGT     9           0.0174419       0.0156250       1.1162791
GTA     6           0.0116279       0.0156250       0.7441860
GTC     9           0.0174419       0.0156250       1.1162791
GTG     8           0.0155039       0.0156250       0.9922481
GTT     5           0.0096899       0.0156250       0.6201550
TAA     7           0.0135659       0.0156250       0.8682171
TAC     3           0.0058140       0.0156250       0.3720930
TAG     4           0.0077519       0.0156250       0.4961240
TAT     1           0.0019380       0.0156250       0.1240310
TCA     10          0.0193798       0.0156250       1.2403101
TCC     6           0.0116279       0.0156250       0.7441860
TCG     7           0.0135659       0.0156250       0.8682171
TCT     10          0.0193798       0.0156250       1.2403101
TGA     4           0.0077519       0.0156250       0.4961240
TGC     9           0.0174419       0.0156250       1.1162791
TGG     14          0.0271318       0.0156250       1.7364341
TGT     5           0.0096899       0.0156250       0.6201550
TTA     2           0.0038760       0.0156250       0.2480620
TTC     10          0.0193798       0.0156250       1.2403101
TTG     7           0.0135659       0.0156250       0.8682171
TTT     7           0.0135659       0.0156250       0.8682171

Other   0           0.0000000       0.0000000       10000000000.0000000
```

## Output file format

The output format consists of:

Header information and comments are preceded by a '#' character at the start of the line.

The Word size and the Total count are then given on separate lines.

The headers of the columns of results are preceded by a '#'.

The results columns are: the sub-sequence word, the observed count, the observed frequency, the expected frequency (which will be read from the input file if one is given, else it is a simple inverse of the number of words of the size specified that can be constructed), and the ratio of the observed to expected frequency.

After a blank line at the end, the result for 'Other' words is given: this is the number of words whose sequence contains IUPAC ambiguity codes or other unusual characters.
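As a rough illustration of how one results row relates to the counts, the sketch below formats a row from a word, its count, the total count and an expected frequency. The function name, the tab separator and the handling of a zero expected frequency are assumptions. The example files show 10000000000.0000000 in the 'Other' row, where the expected frequency is zero, and the sketch imitates that with a fixed sentinel value.

```python
ZERO_EXPECTED_RATIO = 1e10  # ASSUMPTION: sentinel seen in the 'Other' rows


def format_row(word: str, count: int, total: int, expected: float) -> str:
    """Build one results row: word, count, obs freq, exp freq, obs/exp."""
    observed = count / total if total else 0.0
    if expected > 0.0:
        ratio = observed / expected
    else:
        ratio = ZERO_EXPECTED_RATIO  # avoid dividing by zero
    return f"{word}\t{count}\t{observed:.7f}\t{expected:.7f}\t{ratio:.7f}"


if __name__ == "__main__":
    print(format_row("AA", 45, 517, 1 / 16))
    print(format_row("Other", 0, 517, 0.0))
```

The first call reproduces the figures of the AA row in the dinucleotide example (0.0870406, 0.0625000 and 1.3926499).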

### File: result3.comp

```
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       X65923

Word size       2
Total count     517

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AA      45          0.0870406       0.0625000       1.3926499
AC      20          0.0386847       0.0625000       0.6189555
AG      45          0.0870406       0.0625000       1.3926499
AT      14          0.0270793       0.0625000       0.4332689
CA      34          0.0657640       0.0625000       1.0522244
CC      43          0.0831721       0.0625000       1.3307544
CG      25          0.0483559       0.0625000       0.7736944
CT      37          0.0715667       0.0625000       1.1450677
GA      31          0.0599613       0.0625000       0.9593810
GC      43          0.0831721       0.0625000       1.3307544
GG      46          0.0889749       0.0625000       1.4235977
GT      28          0.0541586       0.0625000       0.8665377
TA      15          0.0290135       0.0625000       0.4642166
TC      33          0.0638298       0.0625000       1.0212766
TG      32          0.0618956       0.0625000       0.9903288
TT      26          0.0502901       0.0625000       0.8046422

Other   0           0.0000000       0.0000000       10000000000.0000000
```

### File: result6.comp

```
#
# Output from 'compseq'
#
# Words with a frequency of zero are not reported.
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       X65923

Word size       6
Total count     513

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AAAAAA  6           0.0116959       0.0002441       47.9064327
AAAAAG  1           0.0019493       0.0002441       7.9844055
AAAAGC  1           0.0019493       0.0002441       7.9844055
AAAAGT  1           0.0019493       0.0002441       7.9844055
AAACAG  1           0.0019493       0.0002441       7.9844055
AAACGG  1           0.0019493       0.0002441       7.9844055
AAAGCC  1           0.0019493       0.0002441       7.9844055
AAAGTG  1           0.0019493       0.0002441       7.9844055
AAAGTT  1           0.0019493       0.0002441       7.9844055
AACAGG  1           0.0019493       0.0002441       7.9844055
AACCGG  1           0.0019493       0.0002441       7.9844055
AACGGT  1           0.0019493       0.0002441       7.9844055
AACGTT  1           0.0019493       0.0002441       7.9844055
AACTCT  1           0.0019493       0.0002441       7.9844055
AAGAAG  6           0.0116959       0.0002441       47.9064327
AAGACA  1           0.0019493       0.0002441       7.9844055
AAGATC  1           0.0019493       0.0002441       7.9844055
AAGCCA  1           0.0019493       0.0002441       7.9844055
AAGCGG  1           0.0019493       0.0002441       7.9844055
AAGGCT  1           0.0019493       0.0002441       7.9844055
AAGGGC  1           0.0019493       0.0002441       7.9844055
AAGGTG  1           0.0019493       0.0002441       7.9844055
AAGTAG  1           0.0019493       0.0002441       7.9844055
AAGTCG  1           0.0019493       0.0002441       7.9844055
AAGTCT  1           0.0019493       0.0002441       7.9844055
AAGTGA  1           0.0019493       0.0002441       7.9844055
AAGTTC  1           0.0019493       0.0002441       7.9844055
AATAAA  1           0.0019493       0.0002441       7.9844055
AATATG  1           0.0019493       0.0002441       7.9844055
AATGCC  1           0.0019493       0.0002441       7.9844055
AATTCT  1           0.0019493       0.0002441       7.9844055
ACAACC  1           0.0019493       0.0002441       7.9844055
ACACAC  1           0.0019493       0.0002441       7.9844055


  [Part of this file has been deleted for brevity]

TGAGGC  1           0.0019493       0.0002441       7.9844055
TGCAGC  1           0.0019493       0.0002441       7.9844055
TGCAGT  1           0.0019493       0.0002441       7.9844055
TGCCAA  1           0.0019493       0.0002441       7.9844055
TGCCCA  1           0.0019493       0.0002441       7.9844055
TGCCCC  1           0.0019493       0.0002441       7.9844055
TGCGGG  1           0.0019493       0.0002441       7.9844055
TGCTCC  1           0.0019493       0.0002441       7.9844055
TGCTGG  1           0.0019493       0.0002441       7.9844055
TGCTTG  1           0.0019493       0.0002441       7.9844055
TGGAAA  1           0.0019493       0.0002441       7.9844055
TGGAAG  1           0.0019493       0.0002441       7.9844055
TGGAGG  4           0.0077973       0.0002441       31.9376218
TGGCAA  1           0.0019493       0.0002441       7.9844055
TGGCAG  1           0.0019493       0.0002441       7.9844055
TGGCCA  1           0.0019493       0.0002441       7.9844055
TGGCCC  1           0.0019493       0.0002441       7.9844055
TGGCTT  1           0.0019493       0.0002441       7.9844055
TGGGAC  1           0.0019493       0.0002441       7.9844055
TGGGCC  1           0.0019493       0.0002441       7.9844055
TGGTTC  1           0.0019493       0.0002441       7.9844055
TGTAAT  1           0.0019493       0.0002441       7.9844055
TGTAGC  1           0.0019493       0.0002441       7.9844055
TGTCAA  1           0.0019493       0.0002441       7.9844055
TGTCCG  1           0.0019493       0.0002441       7.9844055
TGTGCC  1           0.0019493       0.0002441       7.9844055
TTAAGT  1           0.0019493       0.0002441       7.9844055
TTAGTT  1           0.0019493       0.0002441       7.9844055
TTCAGT  2           0.0038986       0.0002441       15.9688109
TTCATG  1           0.0019493       0.0002441       7.9844055
TTCCCT  1           0.0019493       0.0002441       7.9844055
TTCCTC  1           0.0019493       0.0002441       7.9844055
TTCGAG  1           0.0019493       0.0002441       7.9844055
TTCGCG  1           0.0019493       0.0002441       7.9844055
TTCTCG  1           0.0019493       0.0002441       7.9844055
TTCTCT  1           0.0019493       0.0002441       7.9844055
TTCTGG  1           0.0019493       0.0002441       7.9844055
TTGCCC  1           0.0019493       0.0002441       7.9844055
TTGGAG  1           0.0019493       0.0002441       7.9844055
TTGGCA  1           0.0019493       0.0002441       7.9844055
TTGTAA  1           0.0019493       0.0002441       7.9844055
TTGTCA  1           0.0019493       0.0002441       7.9844055
TTGTCC  1           0.0019493       0.0002441       7.9844055
TTGTGC  1           0.0019493       0.0002441       7.9844055
TTTCTC  2           0.0038986       0.0002441       15.9688109
TTTGGC  1           0.0019493       0.0002441       7.9844055
TTTGTA  1           0.0019493       0.0002441       7.9844055
TTTGTC  2           0.0038986       0.0002441       15.9688109
TTTTGT  1           0.0019493       0.0002441       7.9844055

Other   0           0.0000000       0.0000000       10000000000.0000000
```

### File: result3.comp

```
#
# Output from 'compseq'
#
# Only words in frame 2 will be counted.
# The Expected frequencies are taken from the file: ../../data/prev.comp
#
# The input sequences are:
#       X65923

Word size       3
Total count     172

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AAA     7           0.0406977       0.0329457       1.2352955
AAC     3           0.0174419       0.0096899       1.8000042
AAG     11          0.0639535       0.0348837       1.8333344
AAT     3           0.0174419       0.0077519       2.2500110
ACA     1           0.0058140       0.0096899       0.6000014
ACC     4           0.0232558       0.0116279       2.0000012
ACG     1           0.0058140       0.0038760       1.4999880
ACT     3           0.0174419       0.0135659       1.2857135
AGA     1           0.0058140       0.0232558       0.2500002
AGC     2           0.0116279       0.0135659       0.8571423
AGG     0           0.0000000       0.0310078       0.0000000
AGT     0           0.0000000       0.0193798       0.0000000
ATA     0           0.0000000       0.0038760       0.0000000
ATC     1           0.0058140       0.0058140       0.9999920
ATG     3           0.0174419       0.0135659       1.2857135
ATT     1           0.0058140       0.0038760       1.4999880
CAA     1           0.0058140       0.0193798       0.3000007
CAC     2           0.0116279       0.0116279       1.0000006
CAG     9           0.0523256       0.0251938       2.0769229
CAT     3           0.0174419       0.0096899       1.8000042
CCA     0           0.0000000       0.0232558       0.0000000
CCC     3           0.0174419       0.0251938       0.6923076
CCG     1           0.0058140       0.0155039       0.3749994
CCT     2           0.0116279       0.0193798       0.6000014
CGA     1           0.0058140       0.0038760       1.4999880
CGC     5           0.0290698       0.0193798       1.5000035
CGG     4           0.0232558       0.0174419       1.3333303
CGT     2           0.0116279       0.0077519       1.5000074
CTA     1           0.0058140       0.0096899       0.6000014
CTC     4           0.0232558       0.0213178       1.0909106
CTG     7           0.0406977       0.0193798       2.1000049
CTT     3           0.0174419       0.0213178       0.8181829
GAA     3           0.0174419       0.0213178       0.8181829
GAC     1           0.0058140       0.0116279       0.5000003
GAG     7           0.0406977       0.0193798       2.1000049
GAT     2           0.0116279       0.0077519       1.5000074
GCA     2           0.0116279       0.0135659       0.8571423
GCC     10          0.0581395       0.0348837       1.6666677
GCG     1           0.0058140       0.0155039       0.3749994
GCT     3           0.0174419       0.0193798       0.9000021
GGA     2           0.0116279       0.0251938       0.4615384
GGC     8           0.0465116       0.0329457       1.4117663
GGG     1           0.0058140       0.0135659       0.4285712
GGT     5           0.0290698       0.0174419       1.6666629
GTA     2           0.0116279       0.0116279       1.0000006
GTC     6           0.0348837       0.0174419       1.9999955
GTG     6           0.0348837       0.0155039       2.2499965
GTT     3           0.0174419       0.0096899       1.8000042
TAA     3           0.0174419       0.0135659       1.2857135
TAC     1           0.0058140       0.0058140       0.9999920
TAG     0           0.0000000       0.0077519       0.0000000
TAT     0           0.0000000       0.0019380       0.0000000
TCA     3           0.0174419       0.0193798       0.9000021
TCC     1           0.0058140       0.0116279       0.5000003
TCG     0           0.0000000       0.0135659       0.0000000
TCT     3           0.0174419       0.0193798       0.9000021
TGA     0           0.0000000       0.0077519       0.0000000
TGC     1           0.0058140       0.0174419       0.3333326
TGG     1           0.0058140       0.0271318       0.2142856
TGT     1           0.0058140       0.0096899       0.6000014
TTA     1           0.0058140       0.0038760       1.4999880
TTC     1           0.0058140       0.0193798       0.3000007
TTG     0           0.0000000       0.0135659       0.0000000
TTT     5           0.0290698       0.0135659       2.1428558

Other   0           0.0000000       0.0000000       10000000000.0000000
```

## Notes

The maximum word size is limited to 4 for proteins, and 6 for nucleotide sequences.

Large word sizes are not appropriate for the compseq algorithm. All possible words will be stored and reported. The algorithm is designed to generate useful information for word sizes expected to occur at least once in the input sequence.

The results are held in an array in memory before being written to a file. For large values of wordsize (over about 7 for nucleic, 5 for protein), you may run out of memory or generate a very large output file.

There is no way for compseq to guess what the true expected frequency should be for each word. It can however read in the result of a previous compseq analysis and use this to set the expected frequencies of the subsequences. In this case, the input sequences under investigation should be representative of those used for the previous compseq analysis. It is down to your biological expertise to ensure the sequences are genuinely "representative". For instance, you might select a group of sequences belonging to the same taxonomic rank such as genus or species.

The file of expected frequencies is specified by name with the -infile qualifier. The word size in the current run must be the same as the one in this results file. Obviously, you should use a file produced from protein sequences if you are counting protein sequence word frequencies, and you must use one made from nucleotide frequencies if you are analysing a nucleotide sequence.

As an alternative to using -infile, the expected frequencies of words may be calculated from the observed frequency of single bases or residues in the sequences. To do this, specify the -calcfreq qualifier. If you are reporting a word size of 1 (single bases or residues) then there is no point in using this option because the calculated expected frequency will be equal to the observed frequency.

Calculating the expected frequencies like this will give an approximation of the expected frequencies that you might get by using an input file of frequencies produced by a previous run of this program. If an input file of expected word frequencies has been specified then the values from that file will be used instead of this calculation of expected frequency from the sequence, even if 'calcfreq' is set to be true.
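One plausible reading of the -calcfreq calculation is sketched below: the expected frequency of a word is taken as the product of the observed frequencies of its individual bases or residues, as if positions were independent. This is an assumption made for illustration; the documentation only states that single base or residue frequencies are used, and the function names are hypothetical.

```python
from collections import Counter
from math import prod


def monomer_frequencies(seq: str) -> dict:
    """Observed frequency of each single base or residue in `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


def expected_from_monomers(word: str, mono: dict) -> float:
    """Expected word frequency as a product of monomer frequencies.

    ASSUMPTION: positions are treated as independent; a symbol that never
    occurs in the sequence contributes a frequency of zero.
    """
    return prod(mono.get(symbol, 0.0) for symbol in word.upper())


if __name__ == "__main__":
    mono = monomer_frequencies("AACG")
    print(mono)                                # A: 0.5, C: 0.25, G: 0.25
    print(expected_from_monomers("AC", mono))  # 0.5 * 0.25
    print(expected_from_monomers("A", mono))   # same as the observed frequency
```

The last call shows why the option is pointless for a word size of 1: the calculated expected frequency is the observed frequency itself.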

## Diagnostic Error Messages

"The word size is too large for the data structure available."
You chose a word size that cannot be stored by the program.
"Insufficient memory - aborting."
You do not have enough memory - use a machine with more memory.
"The word size you are counting (n) is different to the word size in the file of expected frequencies (n)."
The word size of this compseq run differs from the word size used in the earlier compseq run that produced the results file supplying the expected word frequencies.
You appear to be trying to read a corrupted compseq results file.

## Exit status

It always exits with status 0 unless one of the above error conditions is found.

## Known bugs

This program could use a large amount of memory if you were allowed to specify a large word size (7 or above). This could affect the behaviour of other programs on your machine.

If you run out of memory, you may see the program crash with a generic error message that will be specific to your machine's operating system, but will probably be a warning about writing to memory that the program does not own (e.g. "Segmentation fault" on a Solaris machine).

This is not a bug, it is a feature of the way this program grabs large amounts of memory. The maximum word size is restricted to avoid this problem.

| Program name | Description |
|---|---|
| backtranambig | Back-translate a protein sequence to ambiguous nucleotide sequence |
| backtranseq | Back-translate a protein sequence to a nucleotide sequence |
| banana | Plot bending and curvature data for B-DNA |
| btwisted | Calculate the twisting in a B-DNA sequence |
| chaos | Draw a chaos game representation plot for a nucleotide sequence |
| dan | Calculates nucleic acid melting temperature |
| density | Draw a nucleic acid density plot |
| emowse | Search protein sequences by digest fragment molecular weight |
| freak | Generate residue/base frequency table or plot |
| isochore | Plots isochores in DNA sequences |
| mwcontam | Find weights common to multiple molecular weights files |
| mwfilter | Filter noisy data from molecular weights file |
| oddcomp | Identify proteins with specified sequence word composition |
| pepdigest | Reports on protein proteolytic enzyme or reagent cleavage sites |
| pepinfo | Plot amino acid properties of a protein sequence in parallel |
| pepstats | Calculates statistics of protein properties |
| wordcount | Count and extract unique words in molecular sequence(s) |

## Author(s)

Gary Williams formerly at:
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

## History

Completed 2 March 2000
5 April 2001 (version 1.12.0) - the operation of the option '-reverse' has changed. It is now 'False' by default instead of being 'True' by default for nucleic sequences. Too many people were getting confused by the counts being done on both senses, so this is now done on only the forward sense by default.

## Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.