Applications of Bioinformatics using MATLAB
Bioinformatics uses computational techniques to analyze biological data, and MATLAB offers tools for problems in genomics, proteomics, and systems biology. This section walks through worked examples covering sequence analysis, gene expression, phylogenetics, and database access. Functions such as seqrcomplement, seqpdist, getgenpept, getpdb, and getgeodata require the Bioinformatics Toolbox.
1. DNA Sequence Analysis
DNA sequence analysis is fundamental in bioinformatics. MATLAB provides powerful functions to manipulate and analyze DNA sequences.
1.1 Calculating GC Content
The GC content of a DNA sequence is the percentage of bases that are either guanine (G) or cytosine (C).
% Calculate GC content of a DNA sequence
dnaSeq = 'ATGCTAGCGTACGTTAGC';
gCount = count(dnaSeq, 'G');
cCount = count(dnaSeq, 'C');
gcContent = ((gCount + cCount) / length(dnaSeq)) * 100;
disp(['GC Content: ', num2str(gcContent), '%']);
Output: GC Content: 50%
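Real sequence files often contain lowercase letters or ambiguous bases such as N. The following is a minimal sketch of a more tolerant calculation, assuming that only A, C, G, and T should count toward the denominator; gcPercent is a hypothetical helper name, not a toolbox function.

```matlab
% Sketch: GC content that tolerates lowercase input and ambiguous bases.
% gcPercent is a hypothetical helper, not part of any MATLAB toolbox.
gcPercent = @(s) 100 * sum(ismember(upper(s), 'GC')) / sum(ismember(upper(s), 'ACGT'));

dnaSeq = 'ATGCTAGCGTACGTTAGC';
disp(['GC Content: ', num2str(gcPercent(dnaSeq)), '%']);      % 9 of the 18 bases are G or C
disp(['GC Content: ', num2str(gcPercent('atgcNNgc')), '%']);  % the two N bases are left out of the denominator
```

Whether ambiguous bases should be excluded or counted in the total is a design choice; adjust the denominator if your convention differs.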
1.2 Finding the Reverse Complement
The reverse complement of a DNA sequence is obtained by reversing the sequence and replacing each base with its complement.
% Generate reverse complement of a DNA sequence
dnaSeq = 'ATGCTAGCGTACGTTAGC';
complementSeq = seqrcomplement(dnaSeq);
disp(['Reverse Complement: ', complementSeq]);
Output: Reverse Complement: GCTAACGTACGCTAGCAT
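If the Bioinformatics Toolbox is not available, the same result can be reproduced in base MATLAB. This is a minimal sketch, assuming the sequence contains only uppercase A, T, G, and C; the variable names are illustrative.

```matlab
% Sketch: reverse complement without the Bioinformatics Toolbox.
dnaSeq = 'ATGCTAGCGTACGTTAGC';
bases       = 'ATGC';
complements = 'TACG';                      % complements(k) pairs with bases(k)
[found, idx] = ismember(dnaSeq, bases);    % position of each base in the lookup string
assert(all(found), 'Sequence contains characters other than A, T, G, C.');
revComp = complements(idx(end:-1:1));      % complement each base while reading the sequence backwards
disp(['Reverse Complement: ', revComp]);
```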
1.3 Visualizing Nucleotide Frequency
% Plot nucleotide frequency distribution
dnaSeq = 'ATGCTAGCGTACGTTAGC';
bases = {'A', 'T', 'G', 'C'};
nucleotides = categorical(bases, bases); % second argument keeps the bars in A, T, G, C order
counts = [count(dnaSeq, 'A'), count(dnaSeq, 'T'), count(dnaSeq, 'G'), count(dnaSeq, 'C')];
bar(nucleotides, counts, 'FaceColor', 'cyan');
title('Nucleotide Frequency Distribution');
xlabel('Nucleotide');
ylabel('Frequency');
2. Protein Sequence Analysis
Protein sequences, composed of amino acids, can be analyzed to determine their composition and properties.
2.1 Calculating Amino Acid Composition
% Calculate amino acid composition
proteinSeq = 'MKVILFIVLLFSVLVTG';
aminoAcids = unique(proteinSeq);
counts = arrayfun(@(x) count(proteinSeq, x), aminoAcids);
resultTable = table(aminoAcids', counts', 'VariableNames', {'AminoAcid', 'Count'});
disp(resultTable);
Output (unique returns the residues in alphabetical order):
AminoAcid  Count
F          2
G          1
I          2
K          1
L          4
M          1
S          1
T          1
V          4
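Raw counts are easier to compare across proteins of different lengths when expressed as percentages. The sketch below is one possible variation in base MATLAB; the variable names and the choice to sort by descending count are assumptions for illustration.

```matlab
% Sketch: amino acid composition with percentages, sorted by abundance.
proteinSeq = 'MKVILFIVLLFSVLVTG';
[residues, ~, groupIdx] = unique(proteinSeq(:));   % distinct residues as a sorted column
counts  = accumarray(groupIdx, 1);                 % occurrences of each residue
percent = 100 * counts / numel(proteinSeq);        % share of the whole sequence
compTable = sortrows(table(residues, counts, percent, ...
    'VariableNames', {'AminoAcid', 'Count', 'Percent'}), 'Count', 'descend');
disp(compTable);
```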
2.2 Hydrophobicity Analysis
Hydrophobicity can indicate the likelihood of an amino acid being in the interior of a protein structure.
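% Hydrophobicity analysis using the Kyte-Doolittle hydropathy scale
proteinSeq = 'MKVILFIVLLFSVLVTG';
residues = 'ARNDCQEGHILKMFPSTWYV'; % one-letter codes, in the same order as the values below
kd = [1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2];
[~, idx] = ismember(proteinSeq, residues); % index of each residue in the scale
hydrophobicity = kd(idx);
disp(['Average Hydrophobicity: ', num2str(mean(hydrophobicity))]);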
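A single average hides where the hydrophobic stretches are. A common way to locate them is to average a hydropathy scale over a sliding window. The sketch below uses the Kyte-Doolittle scale and base MATLAB only; the window length of 5 and all variable names are assumptions chosen for this short sequence.

```matlab
% Sketch: Kyte-Doolittle hydropathy profile with a sliding window (base MATLAB only).
proteinSeq = 'MKVILFIVLLFSVLVTG';
residues = 'ARNDCQEGHILKMFPSTWYV';
kdScale  = [ 1.8 -4.5 -3.5 -3.5  2.5 -3.5 -3.5 -0.4 -3.2  4.5 ...
             3.8 -3.9  1.9  2.8 -1.6 -0.8 -0.7 -0.9 -1.3  4.2];
[found, idx] = ismember(upper(proteinSeq), residues);
assert(all(found), 'Sequence contains non-standard residues.');
perResidue = kdScale(idx);                 % hydropathy value of each residue
windowSize = 5;                            % assumed window length; adjust as needed
hydroProfile = movmean(perResidue, windowSize, 'Endpoints', 'discard'); % one value per full window

plot(hydroProfile, '-o');
xlabel('Window start position');
ylabel('Mean hydropathy');
title('Kyte-Doolittle hydropathy profile');
```

Positive values mark hydrophobic windows and negative values mark hydrophilic ones. Longer windows smooth the profile at the cost of positional detail.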
3. Gene Expression Analysis
Gene expression data can be visualized and analyzed using MATLAB’s visualization functions.
3.1 Heatmap of Gene Expression Data
% Gene expression heatmap
genes = {'Gene1', 'Gene2', 'Gene3', 'Gene4'};
samples = {'Sample1', 'Sample2', 'Sample3'};
expressionData = [4.2 5.6 3.1; 7.8 8.4 6.7; 5.5 4.3 7.1; 8.1 9.2 6.8];
heatmap(samples, genes, expressionData, 'Title', 'Gene Expression Heatmap', 'ColorMap', parula);
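When genes differ in their overall expression level, a heatmap of raw values mostly shows which genes are highly expressed. One common preprocessing step is to standardize each gene (row) to zero mean and unit standard deviation before plotting. The following is a minimal sketch of that step, assuming MATLAB R2016b or later for implicit expansion; it is one option, not the only normalization in use.

```matlab
% Sketch: row-wise z-score normalization before plotting a heatmap.
genes = {'Gene1', 'Gene2', 'Gene3', 'Gene4'};
samples = {'Sample1', 'Sample2', 'Sample3'};
expressionData = [4.2 5.6 3.1; 7.8 8.4 6.7; 5.5 4.3 7.1; 8.1 9.2 6.8];

rowMean = mean(expressionData, 2);                % mean expression of each gene
rowStd  = std(expressionData, 0, 2);              % sample standard deviation of each gene
zScores = (expressionData - rowMean) ./ rowStd;   % implicit expansion across columns

heatmap(samples, genes, zScores, 'Title', 'Row-standardized expression (z-scores)');
```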
4. Phylogenetic Tree Construction
Phylogenetic trees are used to depict evolutionary relationships.
% Construct phylogenetic tree
sequences = {'ATCG', 'ATCC', 'AGCG', 'GTCG'};
distances = seqpdist(sequences, 'Method', 'Jukes-Cantor');
tree = seqlinkage(distances, 'average', sequences); % 'average' linkage is UPGMA; leaves are labeled with the sequences
phytreeviewer(tree);
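To see what the Jukes-Cantor distance measures, it can be computed by hand for the same four sequences. The sketch below assumes the sequences are already aligned and of equal length, and applies the standard correction d = -(3/4) ln(1 - (4/3) p), where p is the fraction of mismatched sites. It illustrates the formula and is not a reimplementation of seqpdist.

```matlab
% Sketch: pairwise Jukes-Cantor distances in base MATLAB.
sequences = {'ATCG', 'ATCC', 'AGCG', 'GTCG'};   % assumed aligned, equal length
n = numel(sequences);
jcDist = zeros(n);
for i = 1:n-1
    for j = i+1:n
        p = mean(sequences{i} ~= sequences{j});      % fraction of mismatched sites
        jcDist(i, j) = -0.75 * log(1 - (4/3) * p);   % Jukes-Cantor correction (requires p < 0.75)
        jcDist(j, i) = jcDist(i, j);                 % the matrix is symmetric
    end
end
disp(jcDist);
```

Pairs that differ at one of four sites get a distance of about 0.304, and pairs that differ at two sites get about 0.824, so ATCG sits equally close to each of the other three sequences.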
5. Accessing and Analyzing Data from NCBI, UniProt, PDB, and GEO
MATLAB provides several bioinformatics functions to access and analyze biological data from prominent databases like NCBI, UniProt, PDB, and GEO. These functionalities allow researchers to query and process genomic, proteomic, and structural data with ease.
Example 5.1: Analyzing Protein Data from UniProt
In this example, we will access protein sequence data from UniProt and analyze its features, including sequence length, amino acid composition, and hydrophobicity.
Step 1: Fetching Data from UniProt
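We use the getgenpept function to retrieve the protein record by its accession number. getgenpept queries NCBI's GenPept database, which also indexes UniProtKB/Swiss-Prot entries, so the UniProt accession can be passed directly.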
% Fetching protein data from UniProt
accessionNumber = 'P69905'; % Hemoglobin subunit alpha (UniProt accession)
proteinData = getgenpept(accessionNumber);
disp(['Protein Name: ', proteinData.Definition]);
disp(['Sequence Length: ', num2str(length(proteinData.Sequence)), ' amino acids']);
Output (abridged; the Definition field of the GenPept record may list additional names):
Protein Name: Hemoglobin subunit alpha
Sequence Length: 142 amino acids
Step 2: Analyzing Amino Acid Composition
We calculate the frequency of each amino acid in the protein sequence.
% Analyzing amino acid composition
sequence = upper(proteinData.Sequence); % upper() guards against records that store the sequence in lowercase
aminoAcids = unique(sequence);
counts = arrayfun(@(x) count(sequence, x), aminoAcids);
compositionTable = table(aminoAcids', counts', 'VariableNames', {'AminoAcid', 'Count'});
disp(compositionTable);
Output (first rows, for the 142-residue human hemoglobin alpha sequence):
AminoAcid  Count
A          21
C          1
D          8
E          4
...
Step 3: Hydrophobicity Analysis
We compute the average hydrophobicity of the protein sequence to infer its structural properties.
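% Calculating hydrophobicity with the Kyte-Doolittle hydropathy scale
residues = 'ARNDCQEGHILKMFPSTWYV'; % one-letter codes, in the same order as the values below
kd = [1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2];
[~, idx] = ismember(upper(sequence), residues); % index of each residue in the scale
hydrophobicity = kd(idx);
disp(['Average Hydrophobicity: ', num2str(mean(hydrophobicity))]);
Output (approximate; the value depends on the hydrophobicity scale used): Average Hydrophobicity: 0.048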
Example 5.2: Accessing Structural Data from PDB using MATLAB
We can fetch protein structural data from the Protein Data Bank (PDB) and visualize it.
% Fetching and visualizing a protein structure from PDB
pdbID = '1A3N'; % PDB ID for human deoxyhemoglobin (alpha and beta chains)
pdbStructure = getpdb(pdbID);
disp(['PDB Structure Name: ', pdbStructure.Title]); % the title is a top-level field, not part of Header
molviewer(pdbStructure); % Launch molecular viewer
Example 5.3: Gene Expression Data from GEO using MATLAB
MATLAB allows fetching gene expression data from GEO for analysis.
% Downloading GEO dataset
geoID = 'GSE16873'; % Example GEO Series ID
geoData = getgeodata(geoID);
disp(['Dataset Name: ', geoData.Header.Series.title]);
disp(['Number of Samples: ', num2str(size(geoData.Data, 2))]);
disp('Expression values of the first 10 rows for the first sample:');
disp(geoData.Data(1:10, 1)); % Data is a DataMatrix: rows are probes, columns are samples
Useful MATLAB Functions for Bioinformatics
Practice Questions
Test Yourself
1. Write a MATLAB script to calculate the GC content of a custom DNA sequence and find its reverse complement.
2. Analyze the amino acid composition of a given protein sequence and compute its average hydrophobicity.
3. Visualize gene expression data using a heatmap with at least 5 genes and 3 samples.
4. Build a phylogenetic tree using at least 4 DNA sequences and interpret the evolutionary relationships.
5. Fetch protein sequence data from UniProt using accession number P68871. Analyze its amino acid composition and average hydrophobicity.
6. Download and visualize the structural data of protein with PDB ID 2DN1.
7. Retrieve gene expression data from GEO (use Series ID GSE14520) and analyze the expression levels of the first 10 genes in the first sample.