Posted: February 26th, 2023

Computer science assignment

Homework #1

Due February 14th, 11:59pm

Each homework submission must include:

• An archive (.zip or .gz) file of the source code containing:

o The makefile used to compile the code on Monsoon (5pts)

o All .cpp and .h files (5pts)

• A full write-up ( of ) file containing answers to the homework’s questions (5pts), including the exact command line needed to execute every subproblem of the homework

The source code must adhere to the following guidelines:

• No external libraries that implement data structures discussed in class are allowed, unless specifically stated as part of the problem definition. Standard input/output and utility libraries (e.g. math.h) are ok.

• All external data sources (e.g. input data) must be passed in as command line arguments (no hardcoded paths within the source code) (5pts).

• Solutions to sub-problems must be executable separately from each other, for example via a special flag passed as a command line argument (5pts).
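The two command-line guidelines above can be sketched as follows. This is only one possible layout, not a required interface: the `--problem` flag, the `Options` struct, and the `parseArgs` function are hypothetical names chosen for illustration, and the assumed invocation is `<program> --problem A <genome_path>`.

```cpp
#include <cstring>
#include <string>

// Hypothetical container for the parsed command line; the field names are
// illustrative and not prescribed by the assignment.
struct Options {
    std::string genomePath;  // path to the input file (never hardcoded)
    char problem = '\0';     // 'A' or 'B'; '\0' means "not specified"
    bool valid = false;      // true only when both pieces were supplied
};

// Parse arguments of the assumed form: <program> --problem A <genome_path>
Options parseArgs(int argc, char* argv[]) {
    Options opts;
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--problem") == 0) {
            // The flag must be followed by a single letter, A or B.
            if (i + 1 < argc && std::strlen(argv[i + 1]) == 1 &&
                (argv[i + 1][0] == 'A' || argv[i + 1][0] == 'B')) {
                opts.problem = argv[++i][0];
            } else {
                return opts;  // malformed flag: valid stays false
            }
        } else {
            // Any other argument is treated as the input path.
            opts.genomePath = argv[i];
        }
    }
    opts.valid = (opts.problem != '\0') && !opts.genomePath.empty();
    return opts;
}
```

A `main` function would call `parseArgs(argc, argv)`, print a usage message when `valid` is false, and otherwise dispatch to the code for the selected sub-problem. The exact command line used this way is what the write-up should list.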

For this homework, you will need to use the most recent human genome assembly located on Monsoon:


• This file contains multiple scaffolds that comprise the human genome

• The genome is in FASTA format (see


o The headers are unique and always begin with the “>” character. These can be discarded for this homework.

o Each line of the genome file is exactly 80 characters long (plus a carriage return character)

o The genomic sequences consist of the following alphabet {A, C, G, T, N}

Problem #1 (of 2): Monsoon account creation and workshop

• (25pts) Navigate to NAU’s High Performance Computing Cluster (Monsoon) account creation page


• Complete the Self-Paced Workshop

• Obtain and submit the validation codes to self-validate your account

• Take a screenshot of the successful ‘confirm user’ command (see example below) and submit it as part of your write-up to complete problem #1 of the assignment.

Problem #2 (of 2): basic text processing

Write code to read, store, and analyze the latest human genome assembly (found at: /common/contrib/classroom/inf503/genomes/human.txt). At minimum, your code must contain:


• A character array to store the entire human genome in a single data structure

• A separate function to read the human genome file

• A function to compute the number of A, C, G, or T characters in the human genome

• Comments describing major code blocks and control structures

A. (20pts) Read in and store the human genome. There will be multiple scaffolds (each with a separate header denoted by “>”). Concatenate the entire genome (discard headers) into a single character array data structure. Collect the statistics listed below as you read the file. Hint: you can keep running totals or store scaffold sizes / names in a separate set of arrays.

• How many scaffolds were there?

• What were the longest and shortest scaffolds? Provide the names and lengths of both.

• What was the average scaffold length?

B. (20pts) Write a function to assess the content of the human genome: count the total number of a given character (A, C, G, or T) in the whole genome.

• What is the ‘big O’ notation of your search (linear / quadratic / cubic / etc)?

• How long does it take (in seconds) to execute this function? Hint: You will need to use system time within your code to get accurate time estimates.

• What was the GC content of the human genome (percent of C’s and G’s in the genome)?
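The counting, timing, and GC-content steps in part B could look roughly like the sketch below. The function names `countBase`, `gcPercent`, and `timedCount` are hypothetical. Two assumptions are worth flagging: the GC percentage here is taken over the entire array, including N characters (dividing by only the A, C, G, and T total is another reasonable reading), and the timing uses `std::chrono::steady_clock` from the standard library as one way to follow the system-time hint.

```cpp
#include <chrono>
#include <cstddef>

// Count occurrences of `target` with a single pass over the array.
std::size_t countBase(const char* genome, std::size_t length, char target) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (genome[i] == target) ++count;
    }
    return count;
}

// GC content as a percentage of all stored characters.
// Assumption: the denominator includes N characters.
double gcPercent(const char* genome, std::size_t length) {
    if (length == 0) return 0.0;  // avoid dividing by zero on empty input
    std::size_t gc = countBase(genome, length, 'G') +
                     countBase(genome, length, 'C');
    return 100.0 * static_cast<double>(gc) / static_cast<double>(length);
}

// Run countBase once and return the elapsed wall-clock time in seconds.
// The count itself is returned through `countOut`.
double timedCount(const char* genome, std::size_t length, char target,
                  std::size_t& countOut) {
    auto start = std::chrono::steady_clock::now();
    countOut = countBase(genome, length, target);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing only the call to `countBase`, rather than the whole program, keeps file reading out of the measurement. Compiling with optimizations on or off can change the reported time noticeably, so the write-up may want to state the build settings used.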

