/usr/share/doc/squizz/TODO

PROGRAMS & LIBRARIES :

* Check the return values of all output functions (printf, putc, ...).
* Load aa weights from external data file.
* Allow extra gap characters `.', `?', `~' (in all formats?).
* Must convert all `..' to `. .' in GCG header.
* Make libraries thread-safe (msf+gcg).
* Use valid gap characters according to each format.
* Check nucleic NEXUS format names.
* Distinguish parsing errors from normal end of parsing.
* New `-q' (quiet) flag to suppress format message.
* New `-v' (verbose) flag to display all formats checks and results.
* Fix memory leaks in FASTA and PHYLIP format detection.
* Make some sequence_t fields private.
* Warn for truncated values (names, accessions, ...).
* Check the validity of output sequence names in each format.
* Require at least 2 sequences in alignments, even in non-strict mode.
* Cannot handle matched tokens larger than 16kB.
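
A minimal sketch of the `..' to `. .' conversion listed above, assuming the
goal is to keep header text from containing two adjacent dots. The function
name gcg_split_dots and its buffer-based interface are hypothetical and not
part of squizz.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of squizz): copy `src' into `dst',
 * inserting a space between any two adjacent dots, so that `..'
 * becomes `. .' and `...' becomes `. . .'.
 * Returns the number of characters written (excluding the NUL),
 * or (size_t)-1 if `dst' (of size `dstsz') is too small.
 */
size_t gcg_split_dots(const char *src, char *dst, size_t dstsz)
{
    size_t n = 0;

    for (; *src != '\0'; src++) {
        /* Keep one byte in reserve for the terminating NUL. */
        if (n + 1 >= dstsz)
            return (size_t)-1;
        dst[n++] = *src;

        /* A dot followed by another dot: separate them. */
        if (src[0] == '.' && src[1] == '.') {
            if (n + 1 >= dstsz)
                return (size_t)-1;
            dst[n++] = ' ';
        }
    }

    if (n >= dstsz)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

In the worst case (a string made only of dots) the output is almost twice as
long as the input, so a caller would size `dst' to 2 * strlen(src) bytes or
check for the (size_t)-1 result.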

FORMATS :

* ASN1: Sequence format from NCBI.
* BAMBE: Alignment format (derived from PHYLIP).
* CLUSTAL: Handle position ranges after sequence names.
* EMBL: Restore version parsing with new ID line format.
* FASTA: Add NCBI header format parsing.
* GENAL: Sequence format from GenAl program (may conflict with FASTA).
* GENBANK: Cannot handle sequences bigger than 10Mb.
* MAF: Multiple Alignment Format (from UCSC Genome Bioinformatics).
* MEGA: Only a single dataset is supported.
* NEXUS: Non-interleaved format.
* NEXUS: MESQUITE program seems to generate invalid format.
* PHYLIP: Support for multiple alignments in the same file.
* PHYLIP: Exercise sequence name cleanup.
* PSA/XPSA: Alignment format from pftools package, cf. psa(5), xpsa(5).
* RSF: New `Rich Sequence Format' from GCG.
* SELEX: Alignment format (STOCKHOLM ancestor).
* SPROT: Fix DE line output for new structured data.

DATABANKS :

* EMBL: RT line exceeds 80 characters (JA477869 - rel_pat_pro_01_r110.dat).
* EMBL: RT line exceeds 80 characters (JA477713 - rel_pat_syn_05_r110.dat).
* EMBL: Invalid author name (I13016 - rel_pat_unc_10_r110.dat).
* EMBL: Missing separator in RP line (GQ527172 - rel_std_inv_04_r110.dat).
* EMBL: Missing separator in RP line (AH000025 - rel_std_mus_02_r110.dat).
* EMBL: Missing separator in RP line (EU409559 - rel_std_phg_01_r110.dat).
* EMBL: Missing separator in RP line (D16449 - rel_std_pro_04_r110.dat).
* EMBL: Missing separator in RP line (L05770 - rel_std_pro_11_r110.dat).

* GENBANK: Keyword includes `;' character (JF681370 - gbenv43.seq).
* GENBANK: Keyword includes `;' character (JN802672 - gbenv45.seq).
* GENBANK: Invalid keyword separator `;  ' (U18916 - gbpln51.seq).
* GENBANK: Keyword includes `;' character (JN427016 - nc1226.flat).
* GENBANK: Keyword includes `;' character (JN451920 - nc1227.flat).

* GENBANK_WGS: CDS sequence is missing indentation (CAAE01010487 - wgs.CAAE.gbff).

* GENPEPT: Invalid keyword separator `;  ' (U18916_1 - gppln51.seq).

* IMGT: Flat file has strange characters.
* IMGT: Many trailing spaces.
* IMGT: Primary accession number duplicated if secondary exists.

* REFSEQ: Unexpected `?' character (YP_004935261 - rsnc.1123.2011.gpff).
* REFSEQ: Unexpected `?' character (YP_004935348 - rsnc.1124.2011.gpff).
* REFSEQ: Unexpected `?' character (YP_004935261 - rsnc.1126.2011.gpff).
* REFSEQ: Unexpected `?' character (YP_004935261 - rsnc.1209.2011.gpff).

* UNIPROT: Ref 1 title has internal `"' (B2CI52 - uniprot_trembl.dat).
* UNIPROT: Ref 2 title has internal `"' (Q50565 - uniprot_trembl.dat).
* UNIPROT: Ref 1 title has internal `"' (Q4GX11 - uniprot_trembl.dat).