Do not fold: find out what non-ACTGN characters you have in your .fasta file

While some software packages blissfully ignore unusual characters or whitespace, others will complain. Today I ran two pipelines over the same file and they reported different sequence lengths. So, I ran this handy command:

cat sequence.fa | fold -w1 | sort | uniq -c

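The beauty of the fold command is that it wraps any long continuous sequence at a fixed number of characters, even after every single one. With -w1 each character lands on its own line, so sort and uniq -c can tally how many times each one occurs. And voilà, here are the results: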

And yes, indeed, the two pipelines disagreed by 375 characters. I must have somehow introduced spaces during the concatenation step, when I was building a single sequence out of many smaller ones.
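If you want to narrow things down, here is a minimal sketch of the same idea. It is not what I ran above, just a variation under a few assumptions: the file name demo.fa and its contents are made up for illustration, the expected alphabet is A, C, G, T and N in either case, and header lines start with ">". Note that the original one-liner also counts the characters of the header lines, so this version skips them first.

```shell
# Hypothetical demo input; point "fasta" at your own file instead.
fasta=demo.fa
printf '>seq1 demo\nACGT NNAC\nGGTTAC\n>seq2\nacgtRYAC\n' > "$fasta"

# Tally every character, but only in sequence lines (headers start with ">").
grep -v '^>' "$fasta" | fold -w1 | sort | uniq -c

# Keep only the unexpected characters: delete A, C, G, T, N (either case)
# and newlines, then put each leftover character on its own line and count.
unexpected=$(grep -v '^>' "$fasta" | tr -d 'ACGTNacgtn\n' | fold -w1 | sort | uniq -c)
printf '%s\n' "$unexpected"

# List the sequence lines that contain a space or a tab, with line numbers,
# so you can see where the stray whitespace crept in.
ws_lines=$(awk '!/^>/ && /[ \t]/ { print NR ": " $0 }' "$fasta")
printf '%s\n' "$ws_lines"
```

The tr step is a design choice, not a requirement: it simply removes everything you expect to see, so whatever is left in the tally is worth a closer look. Adjust the character list if your sequences legitimately contain other IUPAC codes.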


About mqm5775 (https://www.biostars.org/u/2884/): Bioinformatics of sequences. Sex chromosomes. Enjoying DNA in my computer. Great ape Y chromosome evolution, specifically heterochromatin variability and analysis of male fertility genes. Creative use of visualization, gene expression of multi-copy gene families, genome assembly. Enthusiastic about learning and applying new technologies: Pacific Biosciences (expert experience), Oxford Nanopore, BioNano Genomics.
