Programming and Algorithms II
Degree in Bioinformatics
Fall 2018

***************************************************
***  Lab 1:                                     ***
***  Running time of sorting algorithms         ***
***  Speeding up programs using dictionaries    ***
***************************************************

Part 1:

The goal of this lab is to experience the difference between quadratic
and subquadratic running times. We will use three sorting algorithms as
an example.

Build a program that

1. generates a list of random integers of a given length N
2. sorts the list using a sorting algorithm
3. prints the time used by the process.

Then run the program for different sorting algorithms and different N's.
Use at least:

1. selection sort (which you must program)
2. insertion sort (which you must program)
3. the sort method built into Python

To generate a list of random integers you can use the following Python
code:

    import random

    def randomList(n, d):
        "returns a list of n random integers in [0..d-1]"
        L = []
        for i in range(0, n):
            # randint(a, b) returns a random integer in [a..b]
            L.append(random.randint(0, d - 1))
        return L

To time a piece of code, you can use the time module:

    import time

    startTime = time.time()
    # do your thing here ...
    endTime = time.time()
    print(endTime - startTime)

There is also a timeit module. Give timeit.timeit the instruction you
want to time:

    import timeit

    t = timeit.timeit('"-".join(str(n) for n in range(100))', number=10000)
    print(t)

Then make a plot with list size on the x axis and running time on the
y axis. Use list sizes large enough that e.g. selection sort takes on
the order of 2-3 minutes. You don't have to try every single list size
up to that point; try for example powers of 2. On my laptop, selection
sort of a list with about 16000 elements takes about 1 minute. Can you
guess how long it would take for one with 1 million elements?
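As one possible starting point, here is a minimal sketch of the two sorts you must program, plus a small timing harness built on time.time() as above. The function names (random_list, time_sort, etc.) are my own choices, not part of the assignment:

```python
import random
import time

def random_list(n, d):
    """Return a list of n random integers in [0, d-1]."""
    return [random.randint(0, d - 1) for _ in range(n)]

def selection_sort(lst):
    """Sort lst in place: repeatedly select the minimum of the unsorted tail."""
    for i in range(len(lst)):
        m = i
        for j in range(i + 1, len(lst)):
            if lst[j] < lst[m]:
                m = j
        lst[i], lst[m] = lst[m], lst[i]

def insertion_sort(lst):
    """Sort lst in place: insert each element into the sorted prefix."""
    for i in range(1, len(lst)):
        x = lst[i]
        j = i - 1
        while j >= 0 and lst[j] > x:
            lst[j + 1] = lst[j]
            j -= 1
        lst[j + 1] = x

def time_sort(sort_fn, n, d=1000):
    """Return the seconds sort_fn takes on one random list of length n."""
    lst = random_list(n, d)
    start = time.time()
    sort_fn(lst)
    return time.time() - start

if __name__ == "__main__":
    # try doubling sizes; list.sort times Python's built-in sort
    for n in [500, 1000, 2000]:
        print(n, time_sort(selection_sort, n),
              time_sort(insertion_sort, n),
              time_sort(list.sort, n))
```

This is only a sketch: for the actual lab you should grow n until selection sort takes minutes, and collect the (n, time) pairs for your plot.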
Of course, it's nicer if you write a program that tries all these sizes
one after the other, rather than calling the base program by hand many
times.

Can you see the difference between the O(n^2) and O(n log n) sorts?
Take into account that if you choose sizes that are too small, the time
to start up the program may dominate the overall running time (like an
additive O(n) term).

--------

Part 2:

Program Solutions 1, 2, 3, 4 to the keeping-counts problem. If not all
four, program at least 1 and 4.

- Solution 1 keeps an unordered table and searches for elements
  linearly. You may need an ad-hoc function for linear search that
  returns a position or -1, and that "knows" that the list contains
  pairs (x, c) when searching for x.

- Solution 2 keeps an ordered table and searches for elements using
  binary search. You may need an ad-hoc binary-search function for
  lists of pairs (x, c) that compares only x, and a function that
  inserts the pair (x, 1) into a list of pairs, which is similar to a
  piece of insertion sort. (On your first try, you'll probably get an
  index-out-of-bounds error; remember that insertion sort moves "up" an
  element that is already part of the list, while here you are adding a
  new one.)

- Solution 3 sorts the list, then "compresses" all occurrences of every
  x in the sorted list into a single pair (x, c) in the new list.

- Solution 4 uses a dictionary. Note that in Python 3, to convert a
  dictionary into a list you write

      lst = list(d.items())

  (an explicit conversion to list is necessary).

Now run your four solutions for random lists of size n = 1, 2, 4, ...
up to where you can. Start with few distinct elements in every list
(d = 10, for example). On my laptop, with d = 10, the slowest solution
takes about 12 seconds for n = 1 million or so. Do the methods scale in
time as predicted by the theory?

Now, we also know that Solutions 3 and 4 are pretty much independent of
d = the number of distinct elements in the list = the size of the
resulting list.
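As an illustration of the two mandatory solutions, here is a hedged sketch of Solutions 1 and 4 (the function names are my own; each returns a list of (x, c) pairs):

```python
def counts_linear(lst):
    """Solution 1: unordered table of (x, c) pairs, linear search."""
    table = []
    for x in lst:
        pos = -1                       # linear search returning position or -1
        for i, (y, c) in enumerate(table):
            if y == x:
                pos = i
                break
        if pos == -1:
            table.append((x, 1))       # first occurrence of x
        else:
            y, c = table[pos]
            table[pos] = (y, c + 1)
    return table

def counts_dict(lst):
    """Solution 4: dictionary mapping element -> count."""
    d = {}
    for x in lst:
        d[x] = d.get(x, 0) + 1
    return list(d.items())             # explicit conversion, as noted above
```

Solutions 2 and 3 follow the descriptions above; only the search and insertion machinery changes, not the interface.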
But the running times of Solutions 1 and 2 do depend on d. Now fix, for
example, n = 500,000 and try d = 10, 20, 40, ... up to where you can,
and see whether you observe this dependence in the running time.

Throughout the lab: aim for clean code. Use functions a lot. Document
them.
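The d-sweep above can be driven by a small loop; here is a minimal sketch (names are my own, and the dictionary-based counter stands in for whichever of your four solutions you want to measure):

```python
import random
import time

def random_list(n, d):
    """Return a list of n random integers in [0, d-1]."""
    return [random.randint(0, d - 1) for _ in range(n)]

def counts_dict(lst):
    """Dictionary-based counting; swap in each of your solutions here."""
    d = {}
    for x in lst:
        d[x] = d.get(x, 0) + 1
    return list(d.items())

def sweep_d(count_fn, n, d_values):
    """Return [(d, seconds)] pairs for count_fn on random lists of fixed length n."""
    results = []
    for d in d_values:
        lst = random_list(n, d)
        start = time.time()
        count_fn(lst)
        results.append((d, time.time() - start))
    return results

if __name__ == "__main__":
    # small n for the demo; the lab suggests n = 500,000 and doubling d
    for d, t in sweep_d(counts_dict, 100_000, [10, 20, 40]):
        print(d, t)
```

With Solution 1 or 2 plugged in, the times should grow with d; with Solutions 3 and 4 they should stay roughly flat.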