Applications of Bioinformatics using MATLAB

Bioinformatics uses computational techniques to analyze biological data. MATLAB, together with its Bioinformatics Toolbox, offers tools for problems in genomics, proteomics, and systems biology. This section walks through common bioinformatics applications with worked examples; functions such as seqrcomplement, seqpdist, and getpdb require the Bioinformatics Toolbox.

1. DNA Sequence Analysis

DNA sequence analysis is fundamental in bioinformatics. MATLAB provides powerful functions to manipulate and analyze DNA sequences.

1.1 Calculating GC Content

The GC content of a DNA sequence is the percentage of bases that are either guanine (G) or cytosine (C).

% Calculate GC content of a DNA sequence
dnaSeq = 'ATGCTAGCGTACGTTAGC';
gCount = count(dnaSeq, 'G');
cCount = count(dnaSeq, 'C');
gcContent = ((gCount + cCount) / length(dnaSeq)) * 100;
disp(['GC Content: ', num2str(gcContent), '%']);
Output: GC Content: 50%
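
Real sequence files often mix upper and lower case and contain ambiguity codes such as N. The following is a minimal sketch, not a definitive implementation, of a GC calculation that tolerates both. It assumes that non-ACGT symbols should be left out of the denominator, which is one common convention among several. The input string is a made-up example.

```matlab
% Sketch: GC content that tolerates lowercase input and ignores non-ACGT symbols
dnaSeq = 'atgcNNtagcgtacgttagc';          % hypothetical input with mixed case and N's
seqUpper = upper(dnaSeq);                 % normalize case before comparing
isACGT = ismember(seqUpper, 'ACGT');      % logical mask of unambiguous bases
isGC   = ismember(seqUpper, 'GC');        % logical mask of G or C
% Assumption: ambiguous symbols are excluded from the denominator
gcContent = 100 * sum(isGC) / sum(isACGT);
fprintf('GC Content: %.2f%%\n', gcContent);
```

Because the two N symbols are excluded, this input yields the same GC content as the 18-base sequence used above.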
    

1.2 Finding the Reverse Complement

The reverse complement of a DNA sequence is obtained by reversing the sequence and replacing each base with its complement.

% Generate reverse complement of a DNA sequence
dnaSeq = 'ATGCTAGCGTACGTTAGC';
complementSeq = seqrcomplement(dnaSeq);
disp(['Reverse Complement: ', complementSeq]);
Output: Reverse Complement: GCTAACGTACGCTAGCAT
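
To make the two steps in the definition explicit, here is a small sketch that builds the reverse complement with base MATLAB only, without the toolbox function. It assumes the sequence contains only uppercase A, C, G, and T; the variable names are illustrative.

```matlab
% Sketch: reverse complement using only base MATLAB (no toolbox call)
dnaSeq = 'ATGCTAGCGTACGTTAGC';
bases       = 'ACGT';
complements = 'TGCA';                      % complements(k) pairs with bases(k)
[found, idx] = ismember(dnaSeq, bases);    % idx(k) = position of dnaSeq(k) in bases
assert(all(found), 'Sequence contains characters other than A, C, G, T');
complementSeq = complements(idx);          % complement, same orientation
revComp = fliplr(complementSeq);           % reverse to get the reverse complement
disp(['Reverse Complement: ', revComp]);
```

For this input the result should match the seqrcomplement output shown above.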
    

1.3 Visualizing Nucleotide Frequency

% Plot nucleotide frequency distribution
nucleotides = categorical({'A', 'T', 'G', 'C'});
counts = [count(dnaSeq, 'A'), count(dnaSeq, 'T'), count(dnaSeq, 'G'), count(dnaSeq, 'C')];
bar(nucleotides, counts, 'FaceColor', 'cyan');
title('Nucleotide Frequency Distribution');
xlabel('Nucleotide');
ylabel('Frequency');

2. Protein Sequence Analysis

Protein sequences, composed of amino acids, can be analyzed to determine their composition and properties.

2.1 Calculating Amino Acid Composition

% Calculate amino acid composition
proteinSeq = 'MKVILFIVLLFSVLVTG';
aminoAcids = unique(proteinSeq, 'stable');   % keep order of first appearance
counts = arrayfun(@(x) count(proteinSeq, x), aminoAcids);
resultTable = table(aminoAcids', counts', 'VariableNames', {'AminoAcid', 'Count'});
disp(resultTable);
AminoAcid    Count
M            1
K            1
V            4
I            2
L            4
F            2
S            1
T            1
G            1
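
Composition is often reported as a percentage of sequence length rather than as raw counts. The sketch below shows one way to get both in a single pass with unique and accumarray. The Percent column and the variable names are additions for illustration, not part of the original example.

```matlab
% Sketch: amino acid counts and percentages in one pass
proteinSeq = 'MKVILFIVLLFSVLVTG';
% idx maps every residue to its row in aminoAcids (order of first appearance)
[aminoAcids, ~, idx] = unique(proteinSeq(:), 'stable');
counts  = accumarray(idx, 1);                      % occurrences of each residue
percent = 100 * counts / numel(proteinSeq);        % share of the whole sequence
resultTable = table(aminoAcids, counts, percent, ...
    'VariableNames', {'AminoAcid', 'Count', 'Percent'});
disp(resultTable);
```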
    

2.2 Hydrophobicity Analysis

Hydrophobicity can indicate the likelihood of an amino acid being in the interior of a protein structure.

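% Hydrophobicity analysis using the Kyte-Doolittle hydropathy scale
proteinSeq = 'MKVILFIVLLFSVLVTG';
aaLetters = 'ARNDCQEGHILKMFPSTWYV';
kdScale = [1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2];
[~, idx] = ismember(proteinSeq, aaLetters);   % map each residue to its scale entry
hydrophobicity = kdScale(idx);
disp(['Average Hydrophobicity: ', num2str(mean(hydrophobicity))]);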
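
A single average hides where the hydrophobic stretches are. A common refinement is a sliding-window profile, sketched below under stated assumptions: it uses the Kyte-Doolittle scale, a window of 5 residues (an arbitrary choice for this short sequence), and only standard one-letter residue codes. The variable names are illustrative.

```matlab
% Sketch: sliding-window hydropathy profile (Kyte-Doolittle scale)
proteinSeq = 'MKVILFIVLLFSVLVTG';
aaLetters = 'ARNDCQEGHILKMFPSTWYV';
kdScale = [1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 ...
           3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2];
[~, idx] = ismember(proteinSeq, aaLetters);   % assumes only standard residues
perResidue = kdScale(idx);                    % hydropathy value of each residue

windowSize = 5;                               % assumed window length
kernel = ones(1, windowSize) / windowSize;    % moving-average weights
% 'valid' keeps only the windows that lie fully inside the sequence
hydroProfile = conv(perResidue, kernel, 'valid');

% Plot each window average at the position of the window's center residue
centers = (1:numel(hydroProfile)) + floor(windowSize / 2);
plot(centers, hydroProfile, '-o');
xlabel('Residue position (window center)');
ylabel('Mean hydropathy');
title('Kyte-Doolittle Hydropathy Profile');
```

Peaks in the profile mark runs of hydrophobic residues; longer windows smooth the curve at the cost of resolution.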

3. Gene Expression Analysis

Gene expression data can be visualized and analyzed using MATLAB’s visualization functions.

3.1 Heatmap of Gene Expression Data

% Gene expression heatmap
genes = {'Gene1', 'Gene2', 'Gene3', 'Gene4'};
samples = {'Sample1', 'Sample2', 'Sample3'};
expressionData = [4.2 5.6 3.1; 7.8 8.4 6.7; 5.5 4.3 7.1; 8.1 9.2 6.8];
heatmap(samples, genes, expressionData, 'Title', 'Gene Expression Heatmap', 'ColorMap', parula);
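
When genes differ widely in absolute expression, highly expressed genes can dominate the color scale. One common remedy, sketched here as an optional step rather than something the example above requires, is to standardize each gene (row) to zero mean and unit standard deviation before plotting. The zScores variable and the plot title are illustrative.

```matlab
% Sketch: row-wise z-score scaling before drawing the heatmap
genes = {'Gene1', 'Gene2', 'Gene3', 'Gene4'};
samples = {'Sample1', 'Sample2', 'Sample3'};
expressionData = [4.2 5.6 3.1; 7.8 8.4 6.7; 5.5 4.3 7.1; 8.1 9.2 6.8];

rowMean = mean(expressionData, 2);        % per-gene mean across samples
rowStd  = std(expressionData, 0, 2);      % per-gene sample standard deviation
% Implicit expansion (R2016b or later) applies the column vectors to every column
zScores = (expressionData - rowMean) ./ rowStd;

heatmap(samples, genes, zScores, 'Title', 'Row-Scaled Gene Expression', 'Colormap', parula);
```

After scaling, colors show how each sample deviates from that gene's own average, not absolute expression.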

4. Phylogenetic Tree Construction

Phylogenetic trees are used to depict evolutionary relationships.

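% Construct phylogenetic tree
sequences = {'ATCG', 'ATCC', 'AGCG', 'GTCG'};
% seqpdist assumes amino acid sequences by default, so set the nucleotide alphabet
distances = seqpdist(sequences, 'Method', 'Jukes-Cantor', 'Alphabet', 'NT');
tree = seqlinkage(distances, 'UPGMA', sequences);   % sequences double as leaf labels
phytreeviewer(tree);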
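
To see what a Jukes-Cantor distance measures, it can help to compute it by hand. The model corrects the observed proportion of differing sites p to d = -(3/4) ln(1 - 4p/3). The sketch below applies this to the same four sequences, assuming they are already aligned, have equal length, and contain no gaps. It does not reproduce how seqpdist handles alignment or gaps.

```matlab
% Sketch: Jukes-Cantor distances computed directly from aligned sequences
sequences = {'ATCG', 'ATCC', 'AGCG', 'GTCG'};
n = numel(sequences);
D = zeros(n);                                    % symmetric distance matrix
for i = 1:n
    for j = i+1:n
        % p = proportion of sites at which the two sequences differ
        p = mean(sequences{i} ~= sequences{j});
        % Jukes-Cantor correction for multiple substitutions at one site
        D(i, j) = -0.75 * log(1 - 4 * p / 3);
        D(j, i) = D(i, j);
    end
end
disp(D);
```

Pairs differing at one of four sites get a distance of about 0.30, and pairs differing at two sites about 0.82, so the correction grows faster than the raw mismatch proportion.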

5. Accessing and Analyzing Data from NCBI, UniProt, PDB, and GEO

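The Bioinformatics Toolbox provides functions to retrieve records from public databases such as NCBI (GenBank and GenPept), PDB, and GEO; UniProtKB/Swiss-Prot entries can be reached through their copies in GenPept. These functions need an internet connection and return MATLAB structures, so researchers can query and process genomic, proteomic, and structural data directly in the workspace.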

Example 5.1: Analyzing Protein Data from UniProt

In this example, we will access protein sequence data from UniProt and analyze its features, including sequence length, amino acid composition, and hydrophobicity.

Step 1: Fetching Data from UniProt

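We use the getgenpept function to retrieve the protein record by accession number. getgenpept queries NCBI's GenPept database, which also holds copies of UniProtKB/Swiss-Prot entries, so a UniProt accession can be used here.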

% Fetching protein data from UniProt
accessionNumber = 'P69905'; % Hemoglobin subunit alpha (UniProt accession)
proteinData = getgenpept(accessionNumber);
disp(['Protein Name: ', proteinData.Definition]);
disp(['Sequence Length: ', num2str(length(proteinData.Sequence)), ' amino acids']);
Output:

Protein Name: Hemoglobin subunit alpha

Sequence Length: 142 amino acids
    

Step 2: Analyzing Amino Acid Composition

We calculate the frequency of each amino acid in the protein sequence.

% Analyzing amino acid composition
sequence = upper(proteinData.Sequence);   % normalize case before counting
aminoAcids = unique(sequence);
counts = arrayfun(@(x) count(sequence, x), aminoAcids);
compositionTable = table(aminoAcids', counts', 'VariableNames', {'AminoAcid', 'Count'});
disp(compositionTable);
AminoAcid    Count
A            21
C             1
D             8
E             4
...
    

Step 3: Hydrophobicity Analysis

We compute the average hydrophobicity of the protein sequence to infer its structural properties.

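% Calculating hydrophobicity with the Kyte-Doolittle hydropathy scale
aaLetters = 'ARNDCQEGHILKMFPSTWYV';
kdScale = [1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2];
[found, idx] = ismember(upper(sequence), aaLetters);
hydrophobicity = kdScale(idx(found));   % skip any non-standard residue codes
disp(['Average Hydrophobicity: ', num2str(mean(hydrophobicity))]);

On this scale the average for the 142-residue sequence comes out close to zero (roughly 0.05), which indicates a sequence that is only marginally hydrophobic overall. Other hydrophobicity scales give different values.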
    

Example 5.2: Accessing Structural Data from PDB using MATLAB

We can fetch protein structural data from the Protein Data Bank (PDB) and visualize it.

% Fetching and visualizing a protein structure from PDB
pdbID = '1A3N'; % PDB ID for deoxy human hemoglobin
pdbStructure = getpdb(pdbID);
% The title is stored in the top-level Title field (first row if it spans several)
disp(['PDB Structure Name: ', strtrim(pdbStructure.Title(1, :))]);
molviewer(pdbStructure); % Launch molecular viewer

Example 5.3: Gene Expression Data from GEO using MATLAB

MATLAB allows fetching gene expression data from GEO for analysis.

% Downloading GEO dataset
geoID = 'GSE16873'; % Example GEO Series ID
geoData = getgeodata(geoID);   % structure with Header (metadata) and Data (DataMatrix)
disp(['Dataset Name: ', geoData.Header.Series.title]);
disp(['Number of Samples: ', num2str(size(geoData.Data, 2))]);   % one column per sample
disp('Expression data for the first sample:');
disp(geoData.Data(1:10, 1)); % Display first 10 rows of the first sample

Useful MATLAB Functions for Bioinformatics

Function          Description
seqrcomplement    Generates the reverse complement of a DNA sequence.
seqpdist          Computes pairwise distances between biological sequences.
seqlinkage        Constructs a phylogenetic tree from sequence distances.
heatmap           Creates a heatmap to visualize data.
getgenpept        Fetches protein records from the NCBI GenPept database.
getpdb            Downloads structural data from the Protein Data Bank (PDB).
getgeodata        Retrieves gene expression data from GEO.

Practice Questions

Test Yourself

1. Write a MATLAB script to calculate the GC content of a custom DNA sequence and find its reverse complement.

2. Analyze the amino acid composition of a given protein sequence and compute its average hydrophobicity.

3. Visualize gene expression data using a heatmap with at least 5 genes and 3 samples.

4. Build a phylogenetic tree using at least 4 DNA sequences and interpret the evolutionary relationships.

5. Fetch protein sequence data from UniProt using accession number P68871. Analyze its amino acid composition and average hydrophobicity.

6. Download and visualize the structural data of protein with PDB ID 2DN1.

7. Retrieve gene expression data from GEO (use Series ID GSE14520) and analyze the expression levels of the first 10 genes in the first sample.