This document discusses de novo genome assembly, which is the process of reconstructing long genomic sequences from many short sequencing reads without the aid of a reference genome. It is challenging due to factors like short read lengths, repetitive sequences that complicate the assembly graph, and sequencing errors. The goals of assembly are to produce contiguous sequences with high completeness and correctness by resolving overlaps between reads into consensus sequences. Metrics like N50, core gene content, and read remapping are used to assess assembly quality.
4. The shortest piece is 48,000,000 bp
Total haploid size
3,200,000,000 bp
(3.2 Gbp)
Total diploid size
6,400,000,000 bp
(6.4 Gbp)
∑ = 6,400,000,000 bp
5. Human DNA ⇢ iSequencer™ ⇢ 46 chromosomal
and 1 mitochondrial
sequences
In an ideal world ...
AGTCTAGGATTCGCTATAG
ATTCAGGCTCTGATATATT
TCGCGGCATTAGCTAGAGA
TCTCGAGATTCGTCCCAGT
CTAGGATTCGCTAT
AAGTCTAAGATTC...
6. The real world ( for now )
Short
fragments
Reads are stored in “FASTQ” files
Genome
Sequencing
10. De novo genome assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome
from the sequence reads only
14. • Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
Shakespearomics
Whoops!
I dropped
them.
15. • Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
Shakespearomics
I am good
with words.
16. • Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
• Majority consensus
Friends, Romans, countrymen, lend me your ears; (1 contig)
Shakespearomics
We have
reached a
consensus !
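The majority-consensus step above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular assembler works: the start offsets in `LAYOUT` are assumed by hand here, whereas a real assembler would derive them from the overlaps.

```python
from collections import Counter

# Reads from the slide, each paired with an assumed start offset (the "layout").
# The offsets are illustrative assumptions, not part of the original slides.
LAYOUT = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),     # read error: "c" where the text has "t"
    (29, "send me your ears;"),  # read error: "s" where the text has "l"
]

def majority_consensus(layout):
    """Return the most common character in each column of the layout."""
    length = max(start + len(read) for start, read in layout)
    columns = [Counter() for _ in range(length)]
    for start, read in layout:
        for i, ch in enumerate(read):
            columns[start + i][ch] += 1
    # Every column is covered by at least one read in this layout.
    return "".join(col.most_common(1)[0][0] for col in columns)

consensus = majority_consensus(LAYOUT)
print(consensus)
```

Each erroneous character is outvoted two to one by the other reads covering the same column, which is why extra depth helps with sequencing errors.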
17. Overlap - Layout - Consensus
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
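As a rough sketch of the overlap and layout steps, the toy below greedily merges the pair of sequences with the longest exact suffix-prefix overlap until no overlaps remain. The reads are error-free variants of the Shakespeare fragments (an assumption made so that exact matching works), and greedy merging is only one simple strategy, chosen here for brevity.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the best-overlapping pair; return the remaining contigs."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Error-free toy reads (assumed for illustration).
reads = [
    "ds, Romans, count",
    "ns, countrymen, le",
    "Friends, Rom",
    "lend me your ears;",
    "trymen, lend me",
]
contigs = greedy_assemble(reads)
print(contigs)
```

Note that this compares every pair of contigs in every round, which is exactly the quadratic cost that real assemblers work hard to avoid.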
22. Do the graph traverse
Size matters not. Look at me. Judge me by my size, do you? (2 supporting reads)
Size matters not. Look at me. Judge me by my soze, do you? (1 supporting read)
26. What makes a jigsaw puzzle hard?
Repetitive
regions
Lots of
pieces
Missing
pieces
Multiple
copies
No corners
(circular
genomes)
No box
Dirty
pieces
Frayed
pieces
27. What makes genome assembly hard
Size of the human genome = 3.2 × 10^9 bp (3,200,000,000)
Typical short read length = 10^2 bp (100)
A puzzle with millions to billions of pieces
Storing in RAM is a challenge ⇢ “succinct data structure”
1. Many pieces (read length is very short compared to the genome)
28. What makes genome assembly hard
Finding overlaps means
examining every pair of reads
Comparisons = N×(N-1)/2
~ N^2
Lots of smart tricks to reduce
this close to ~N
2. Lots of overlaps
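To make the numbers concrete, the sketch below counts the all-against-all comparisons and then shows one possible trick for cutting them down: index every read by its first k characters, and only consider a pair when that k-mer occurs somewhere in the other read. The function names, the choice of k, and the error-free toy reads are assumptions for illustration; real assemblers use more refined indexes.

```python
from collections import defaultdict

def all_pairs(n):
    """Number of unordered read pairs: N x (N - 1) / 2."""
    return n * (n - 1) // 2

def candidate_overlaps(reads, k):
    """Ordered pairs (i, j) where read j's first k-mer occurs somewhere in read i."""
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        prefix_index[read[:k]].append(j)
    candidates = set()
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            for j in prefix_index.get(read[pos:pos + k], ()):
                if i != j:
                    candidates.add((i, j))
    return candidates

# Error-free toy reads (assumed for illustration).
reads = [
    "Friends, Rom",
    "ds, Romans, count",
    "ns, countrymen, le",
    "trymen, lend me",
    "lend me your ears;",
]
candidates = candidate_overlaps(reads, k=3)
print(all_pairs(len(reads)), "pairs in total,", len(candidates), "candidates to check")
print(all_pairs(1_000_000), "pairs for one million reads")
```

The index is built in one pass over the reads and queried in one pass over their k-mers, so the work grows roughly with the total amount of sequence rather than with the square of the read count.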
29. What makes genome assembly hard
ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATA TATATATA TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
3. Lots of sky (short repeats)
How could we possibly assemble this segment of DNA?
All the reads are the same!
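The problem can be shown directly: in a pure AT repeat, every 8 bp read is one of only two possible strings, so a read carries no information about where it came from. The 60 bp stand-in sequence and the read length of 8 are assumptions for illustration.

```python
# A 60 bp stand-in for the repeat on the slide (length assumed for illustration).
genome = "AT" * 30
k = 8  # read length used on the slide

# Every possible read of length k, by start position.
kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
distinct = set(kmers)

# All start positions at which the read "TATATATA" fits equally well.
placements = [i for i, kmer in enumerate(kmers) if kmer == "TATATATA"]

print(len(kmers), "possible reads,", len(distinct), "distinct sequences")
print("TATATATA fits at", len(placements), "different positions")
```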
30. What makes genome assembly hard
Read 1: GGAACCTTTGGCCCTGT
Read 2: GGCGCTGTCCATTTTAGAAACC
4. Dirty pieces (sequencing errors)
1. The error in Read 2 might prevent us from seeing that it
overlaps with Read 1
2. How would we know which is the correct sequence?
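Using the two reads from the slide, the sketch below looks for the longest suffix-prefix overlap, first requiring an exact match and then tolerating one mismatch. The function name and the `min_len` and `max_mismatches` parameters are illustrative assumptions; real overlappers use alignment with more careful scoring.

```python
def overlap_with_mismatches(a, b, min_len=5, max_mismatches=0):
    """Longest n such that the last n bases of a match the first n of b
    with at most max_mismatches substitutions (0 if none of length >= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatches:
            return n
    return 0

read1 = "GGAACCTTTGGCCCTGT"
read2 = "GGCGCTGTCCATTTTAGAAACC"

exact = overlap_with_mismatches(read1, read2)
tolerant = overlap_with_mismatches(read1, read2, max_mismatches=1)
print("exact overlap:", exact)
print("overlap allowing one mismatch:", tolerant)
```

With exact matching the overlap is missed entirely; allowing a single mismatch recovers an 8 bp overlap (GGCCCTGT against GGCGCTGT). The code cannot say which read is correct, which is the second question on the slide and another reason more depth is needed.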
31. What makes genome assembly hard
5. Multiple copies (long repeats)
Our old nemesis the REPEAT !
Gene Copy 1 Gene Copy 2
33. What is a repeat?
A segment of DNA
that occurs more than once
in the genome
34. Major classes of repeats in the human genome
Repeat Class              Arrangement              Coverage (Hg)   Length (bp)
Satellite (micro, mini)   Tandem                   3%              2-100
SINE                      Interspersed             15%             100-300
Transposable elements     Interspersed             12%             200-5k
LINE                      Interspersed             21%             500-8k
rDNA                      Tandem                   0.01%           2k-43k
Segmental Duplications    Tandem or Interspersed   0.2%            1k-100k
37. Long reads can span repeats
Repeat copy 1 Repeat copy 2
long
reads
1 contig
38. Draft vs Finished genomes (bacteria)
250 bp - Illumina - $250
12,000 bp - PacBio - $2500
39. The two laws of repeats
1. It is impossible to resolve repeats of length L
unless you have reads longer than L.
2. It is impossible to resolve repeats of length L
unless you have reads longer than L.
41. Coverage vs Depth
Coverage
Fraction of the genome
sequenced by
at least one read
Depth
Average number of reads
that cover any given region
Intuitively more reads should increase coverage and depth
46. Depth is easy to calculate
Depth = N × L / G
Depth = Y / G
N = Number of reads
L = Length of a read
G = Genome length
Y = Sequence yield (N x L)
47. Example: Coverage of the Tarantula genome
“The size estimate of the tarantula
genome based on k-mer analysis is 6 Gb
and we sequenced at 40x coverage from
a single female A. geniculata”
Sanggaard et al 2014. Nature Comm (5) p1-11
Depth = N × L / G
40 = N x 100 / (6 Billion)
N = 40 * 6 Billion / 100
N = 2.4 Billion
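The depth formula and the tarantula example can be checked with a short script. The function names are illustrative, and the 100 bp read length is the value used in the worked example above.

```python
import math

def depth(num_reads, read_len, genome_len):
    """Depth = N x L / G."""
    return num_reads * read_len / genome_len

def reads_needed(target_depth, read_len, genome_len):
    """Solve Depth = N x L / G for N, rounded up to a whole read."""
    return math.ceil(target_depth * genome_len / read_len)

GENOME_LEN = 6_000_000_000  # 6 Gb tarantula genome size estimate
READ_LEN = 100              # read length used in the worked example

n = reads_needed(40, READ_LEN, GENOME_LEN)
print(f"{n:,} reads needed for 40x depth")
print(f"check: {depth(n, READ_LEN, GENOME_LEN):.1f}x")
```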
48. Coverage and depth are related
Approximate formula for coverage, assuming random reads:
Coverage = 1 - e^(-Depth)
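A quick way to get a feel for this relationship is to tabulate it for a few depths. The function name is illustrative, and the values hold only under the random-read assumption stated above.

```python
import math

def expected_coverage(depth):
    """Expected fraction of the genome covered by at least one read,
    assuming reads are placed uniformly at random: 1 - e^(-depth)."""
    return 1.0 - math.exp(-depth)

for d in (0.5, 1, 2, 5, 10):
    print(f"depth {d:>4}: coverage {expected_coverage(d):.4%}")
```

Coverage rises quickly at first and then flattens: the last few percent of the genome cost far more sequencing than the first half.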
49. Much more sequencing needed in reality
● Sequencing is not random
○ GC and AT rich regions
are underrepresented
○ Other chemistry quirks
● More depth needed for:
○ sequencing errors
○ polymorphisms
54. Completeness : Total size
Proportion of the original genome represented by the assembly
Can be between 0 and 1
Proportion of estimated genome size
… but estimates are not perfect
55. Completeness: core genes
Proportion of coding sequences can be estimated based on
known core genes thought to be present in a wide variety of
organisms.
Assumes that the proportion of assembled genes is equal to the
proportion of assembled core genes.
Completeness ≈ (Number of core genes in assembly) / (Number of core genes in database)
In the past this was done with a tool called CEGMA
There is a new tool for this called BUSCO
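The ratio itself is simple to compute once you have the two gene lists. The sketch below uses made-up gene identifiers and a hypothetical function name; it does not reproduce how CEGMA or BUSCO search for genes, only the final proportion.

```python
def core_gene_completeness(core_genes, genes_found):
    """Fraction of the core gene set that was found in the assembly."""
    core = set(core_genes)
    return len(core & set(genes_found)) / len(core)

# Hypothetical identifiers, for illustration only.
core_db = {"core1", "core2", "core3", "core4"}
found_in_assembly = {"core1", "core3", "core4", "other9"}

completeness = core_gene_completeness(core_db, found_in_assembly)
print(f"core gene completeness: {completeness:.0%}")
```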
56. Correctness
Proportion of the assembly that is free from errors
Errors include
1. Mis-joins
2. Repeat compressions
3. Unnecessary duplications
4. Indels / SNPs caused by assembler
57. Correctness: check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Original Reads
Align
Assembly
Mapped Read
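A toy version of this self-consistency check, using the Shakespeare reads from earlier: any read that does not occur verbatim in the contig is flagged for inspection. Exact substring matching is a deliberate simplification; in practice a read aligner that tolerates mismatches is used, and the evidence includes depth and read-pair information as well.

```python
def inconsistent_reads(contig, reads):
    """Return the reads that do not occur verbatim anywhere in the contig."""
    return [read for read in reads if read not in contig]

contig = "Friends, Romans, countrymen, lend me your ears;"
reads = [
    "ds, Romans, count",
    "ns, countrymen, le",
    "Friends, Rom",
    "send me your ears;",
    "crymen, lend me",
]

flagged = inconsistent_reads(contig, reads)
print(flagged)
```

Here the flagged reads are the ones carrying errors, but a disagreement only says that reads and assembly differ at that position. Many reads disagreeing in the same place would instead point to an error in the assembly.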
59. Why assemble genomes
● Produce a reference for new species
● Genomic variation can be studied by comparing
against the reference
● A wide variety of molecular biological tools become
available or more effective
● Coding sequences can be studied in the context of their
related non-coding (eg regulatory) sequences
● High level genome structure (number, arrangement of
genes and repeats) can be studied
60. Vast majority of genomes remain unsequenced
Eukaryotes
Estimated number of species ~ 9 Million
Whole genome sequences (including incomplete drafts) ~ 3000
Bacteria & Archaea
Estimated number of species Millions to Billions
Genomic sequences ~ 150,000
Every individual has a different genome
Some diseases such as cancer give rise to massive genome level variation
62. Not just genomes
● Transcriptomes
○ One contig for every isoform
○ Do not expect uniform coverage
● Metagenomes
○ Mixture of different organisms
○ Host, bacteria, virus, fungi all at once
○ All different depths
● Metatranscriptomes
○ Combination of above!
64. Take home points
● De novo assembly is the process of reconstructing a long
sequence from many short ones
● Represented as a mathematical “overlap graph”
● Assembly is very challenging (“impossible”) because
○ Sequencing bias underrepresents certain regions
○ Reads are short relative to genome size
○ Repeats create tangled hubs in the assembly graph
○ Sequencing errors cause detours and bubbles in the assembly graph