import hashlib

stringa1 = "1 am an innocent string".encode('utf8')
r = hashlib.md5(stringa1).hexdigest()
print('md5: {}'.format(r))

stringa1 = "l am an innocent string".encode('utf8')
r = hashlib.md5(stringa1).hexdigest()
print('md5: {}'.format(r))

stringa1 = "I am an innocent string".encode('utf8')
r = hashlib.md5(stringa1).hexdigest()
print('md5: {}'.format(r))

md5: 24e0f1fe1d8407d2faff6ed758b18270
md5: bcef484a5550fc09f0eb1acbcc3d9089
md5: 1d9014f6be7aec56a8d0f4d10404546a


!md5sum ./Lesson_01_introduction.ipynb

2cef471efb258f70189297aa9992e6a4  ./Lesson_01_introduction.ipynb


def md5(fname):
    """hash function appropriate for big data"""
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

hash_result = md5("Lesson_01_introduction.ipynb")
print(hash_result)

2cef471efb258f70189297aa9992e6a4


!cd ~/didattica/corso_programmazione_1819/programmingCourseDIFA
!mkdir -p snakemake_lesson
!cd snakemake_lesson

!rm *.txt

%cd ~/didattica/corso_programmazione_1819/programmingCourseDIFA
%cd snakemake_lesson

!pwd

/home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA
/home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson
/home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson


%%file Snakefile

rule all:
    shell:
        "echo 'hello world' > result.txt"

Writing Snakefile


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1

[Mon Mar 18 13:28:35 2019]
rule all:
    jobid: 0

[Mon Mar 18 13:28:35 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132835.224533.snakemake.log

!ls

result.txt  Snakefile


%%file Snakefile

rule all:
    output:
        "result.txt"
    shell:
        "echo 'hello world' > {output}"

Overwriting Snakefile


!snakemake

Building DAG of jobs...
Nothing to be done.
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132839.150574.snakemake.log


!rm result.txt


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1

[Mon Mar 18 13:28:40 2019]
rule all:
    output: result.txt
    jobid: 0

[Mon Mar 18 13:28:40 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132840.589923.snakemake.log


%%file Snakefile

rule all:
    input:
        "partial_1.txt",
        "partial_2.txt"
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"

Overwriting Snakefile


!snakemake

Building DAG of jobs...
MissingInputException in line 2 of /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/Snakefile:
Missing input files for rule all:
partial_2.txt
partial_1.txt


%%file Snakefile

rule all:
    input:
        "partial_1.txt",
        "partial_2.txt"
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        "partial_1.txt",
        "partial_2.txt"
    run:
        for filename in output:
            with open(filename, 'w') as file:
                print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!rm result.txt


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	create_partials
	2

[Mon Mar 18 13:28:47 2019]
rule create_partials:
    output: partial_1.txt, partial_2.txt
    jobid: 1

Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:28:47 2019]
Finished job 1.
1 of 2 steps (50%) done

[Mon Mar 18 13:28:47 2019]
rule all:
    input: partial_1.txt, partial_2.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:28:47 2019]
Finished job 0.
2 of 2 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132847.277140.snakemake.log

!ls

partial_1.txt  partial_2.txt  result.txt  Snakefile


!cat partial_1.txt

the result of partial_1.txt


!cat partial_2.txt

the result of partial_2.txt


!cat result.txt

the result of partial_1.txt
the result of partial_2.txt


!rm result.txt


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1

[Mon Mar 18 13:28:52 2019]
rule all:
    input: partial_1.txt, partial_2.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:28:52 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132852.354044.snakemake.log


!rm *.txt


%%file Snakefile

rule all:
    input:
        "partial_1.txt",
        "partial_2.txt",
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        out="partial_{number}.txt"
    run:
        filename = output.out
        with open(filename, 'w') as file:
            print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	2	create_partials
	3

[Mon Mar 18 15:52:33 2019]
rule create_partials:
    output: partial_2.txt
    jobid: 2
    wildcards: number=2

Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 15:52:34 2019]
Finished job 2.
1 of 3 steps (33%) done

[Mon Mar 18 15:52:34 2019]
rule create_partials:
    output: partial_1.txt
    jobid: 1
    wildcards: number=1

Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 15:52:34 2019]
Finished job 1.
2 of 3 steps (67%) done

[Mon Mar 18 15:52:34 2019]
rule all:
    input: partial_1.txt, partial_2.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 15:52:34 2019]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T155233.983477.snakemake.log


%%file Snakefile

numbers = [1, 2, 3, 4]

rule all:
    input:
        expand("partial_{number}.txt", number=numbers)
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        out = "partial_{number}.txt"
    run:
        filename = output.out
        with open(filename, 'w') as file:
            print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!rm result.txt

!ls

partial_1.txt  partial_2.txt  Snakefile


!snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	2	create_partials
	3

[Mon Mar 18 13:28:58 2019]
rule create_partials:
    output: partial_4.txt
    jobid: 4
    wildcards: number=4

Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:28:59 2019]
Finished job 4.
1 of 3 steps (33%) done

[Mon Mar 18 13:28:59 2019]
rule create_partials:
    output: partial_3.txt
    jobid: 3
    wildcards: number=3

Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:28:59 2019]
Finished job 3.
2 of 3 steps (67%) done

[Mon Mar 18 13:28:59 2019]
rule all:
    input: partial_1.txt, partial_2.txt, partial_3.txt, partial_4.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:28:59 2019]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132858.929586.snakemake.log


!cat result.txt

the result of partial_1.txt
the result of partial_2.txt
the result of partial_3.txt
the result of partial_4.txt


!rm partial_1.txt
!rm result.txt


!snakemake --dag | dot -Tsvg > dag.svg

Building DAG of jobs...


!cp ./dag.svg ../immagini/snakemake_dag.svg
!convert ../immagini/snakemake_dag.svg ../immagini/snakemake_dag.png


!snakemake --detailed-summary > provenance.tsv

Building DAG of jobs...


import pandas as pd
pd.read_csv("provenance.tsv", index_col=0, sep='\t')


!rm *.txt


!snakemake --cores 6

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	4	create_partials
	5

[Mon Mar 18 13:29:13 2019]
rule create_partials:
    output: partial_3.txt
    jobid: 3
    wildcards: number=3

[Mon Mar 18 13:29:13 2019]
rule create_partials:
    output: partial_4.txt
    jobid: 4
    wildcards: number=4

[Mon Mar 18 13:29:13 2019]
rule create_partials:
    output: partial_1.txt
    jobid: 1
    wildcards: number=1

[Mon Mar 18 13:29:13 2019]
rule create_partials:
    output: partial_2.txt
    jobid: 2
    wildcards: number=2

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:13 2019]
Job counts:
	count	jobs
	1	create_partials
	1
Finished job 4.
1 of 5 steps (20%) done
[Mon Mar 18 13:29:13 2019]
Finished job 2.
2 of 5 steps (40%) done
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:13 2019]
Finished job 3.
3 of 5 steps (60%) done
[Mon Mar 18 13:29:13 2019]
Finished job 1.
4 of 5 steps (80%) done

[Mon Mar 18 13:29:13 2019]
rule all:
    input: partial_1.txt, partial_2.txt, partial_3.txt, partial_4.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:29:13 2019]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132913.290945.snakemake.log


!rm *.txt


%%file Snakefile

numbers = [1, 2, 3, 4]

rule all:
    input:
        expand("partial_{number}.txt", number=numbers)
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        out = "partial_{number}.txt"
    resources: 
        memory = 6
    run:
        filename = output.out
        with open(filename, 'w') as file:
            print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!snakemake --cores 6 --resources memory=12

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Provided resources: memory=12
Job counts:
	count	jobs
	1	all
	4	create_partials
	5

[Mon Mar 18 13:29:18 2019]
rule create_partials:
    output: partial_2.txt
    jobid: 2
    wildcards: number=2
    resources: memory=6

[Mon Mar 18 13:29:18 2019]
rule create_partials:
    output: partial_3.txt
    jobid: 3
    wildcards: number=3
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:18 2019]
Finished job 2.
1 of 5 steps (20%) done
[Mon Mar 18 13:29:18 2019]
Finished job 3.
2 of 5 steps (40%) done

[Mon Mar 18 13:29:18 2019]
rule create_partials:
    output: partial_1.txt
    jobid: 1
    wildcards: number=1
    resources: memory=6

[Mon Mar 18 13:29:18 2019]
rule create_partials:
    output: partial_4.txt
    jobid: 4
    wildcards: number=4
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:19 2019]
Finished job 1.
3 of 5 steps (60%) done
[Mon Mar 18 13:29:19 2019]
Finished job 4.
4 of 5 steps (80%) done

[Mon Mar 18 13:29:19 2019]
rule all:
    input: partial_1.txt, partial_2.txt, partial_3.txt, partial_4.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:29:19 2019]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132918.612842.snakemake.log


%%file Snakefile

numbers = [i for i in range(int(config['number']))]

rule all:
    input:
        expand("partial_{number}.txt", number=numbers)
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        out = "partial_{number}.txt"
    resources: 
        memory = 6
    run:
        filename = output.out
        with open(filename, 'w') as file:
            print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!rm *.txt


!snakemake --cores 6 --resources memory=12 --config number=4

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Provided resources: memory=12
Job counts:
	count	jobs
	1	all
	4	create_partials
	5

[Mon Mar 18 13:29:23 2019]
rule create_partials:
    output: partial_3.txt
    jobid: 4
    wildcards: number=3
    resources: memory=6

[Mon Mar 18 13:29:23 2019]
rule create_partials:
    output: partial_2.txt
    jobid: 3
    wildcards: number=2
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:23 2019]
Finished job 4.
1 of 5 steps (20%) done

[Mon Mar 18 13:29:23 2019]
rule create_partials:
    output: partial_0.txt
    jobid: 1
    wildcards: number=0
    resources: memory=6

[Mon Mar 18 13:29:23 2019]
Finished job 3.
2 of 5 steps (40%) done

[Mon Mar 18 13:29:23 2019]
rule create_partials:
    output: partial_1.txt
    jobid: 2
    wildcards: number=1
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:23 2019]
Finished job 1.
3 of 5 steps (60%) done
[Mon Mar 18 13:29:23 2019]
Finished job 2.
4 of 5 steps (80%) done

[Mon Mar 18 13:29:23 2019]
rule all:
    input: partial_0.txt, partial_1.txt, partial_2.txt, partial_3.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:29:23 2019]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132923.287973.snakemake.log


%%file config.yaml
number: 4

Writing config.yaml


%%file Snakefile

configfile: "./config.yaml"

numbers = [i for i in range(int(config['number']))]

rule all:
    input:
        expand("partial_{number}.txt", number=numbers)
    output:
        "result.txt"
    shell:
        "cat {input} > {output}"
        
rule create_partials:
    output:
        out = "partial_{number}.txt"
    resources: 
        memory = 6
    run:
        filename = output.out
        with open(filename, 'w') as file:
            print("the result of {}".format(filename), file=file)

Overwriting Snakefile


!rm *.txt


!snakemake --cores 6 --resources memory=12

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Provided resources: memory=12
Job counts:
	count	jobs
	1	all
	4	create_partials
	5

[Mon Mar 18 13:29:28 2019]
rule create_partials:
    output: partial_2.txt
    jobid: 3
    wildcards: number=2
    resources: memory=6

[Mon Mar 18 13:29:28 2019]
rule create_partials:
    output: partial_0.txt
    jobid: 1
    wildcards: number=0
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:28 2019]
Finished job 3.
1 of 5 steps (20%) done

[Mon Mar 18 13:29:28 2019]
rule create_partials:
    output: partial_3.txt
    jobid: 4
    wildcards: number=3
    resources: memory=6

[Mon Mar 18 13:29:28 2019]
Finished job 1.
2 of 5 steps (40%) done

[Mon Mar 18 13:29:28 2019]
rule create_partials:
    output: partial_1.txt
    jobid: 2
    wildcards: number=1
    resources: memory=6

Job counts:
	count	jobs
	1	create_partials
	1
Job counts:
	count	jobs
	1	create_partials
	1
[Mon Mar 18 13:29:29 2019]
Finished job 2.
3 of 5 steps (60%) done
[Mon Mar 18 13:29:29 2019]
Finished job 4.
4 of 5 steps (80%) done

[Mon Mar 18 13:29:29 2019]
rule all:
    input: partial_0.txt, partial_1.txt, partial_2.txt, partial_3.txt
    output: result.txt
    jobid: 0

[Mon Mar 18 13:29:29 2019]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /home/enrico/didattica/corso_programmazione_1819/programmingCourseDIFA/snakemake_lesson/.snakemake/log/2019-03-18T132928.691830.snakemake.log


import requests

url_base = ("https://raw.githubusercontent.com/UniboDIFABiophysics"+
            "/programmingCourseDIFA/master/snakemake_exercise/")
filename = "transazioni_00.tsv"

response = requests.get(url_base+filename)

# Throw an error for bad status codes
response.raise_for_status()

with open(filename, 'wb') as handle:
    handle.write(response.content)

format	file size	write time	read time	size after compression
extended csv	430 MB	1m 15s	9.78s	132 MB
csv with gzip	132 MB	6m 57s	16.2s
hdf5 uncompressed	463 MB	155ms	299ms	154 MB
npy uncompressed	462 MB	638ms	393ms	154 MB

	date	rule	version	log-file(s)	input-file(s)	shellcmd	status	plan
output_file
result.txt	-	all	-	NaN	partial_1.txt,partial_2.txt,partial_3.txt,part...	cat partial_1.txt partial_2.txt partial_3.txt ...	missing	update pending
partial_1.txt	-	create_partials	-	NaN	NaN	-	missing	update pending
partial_2.txt	Mon Mar 18 13:28:47 2019	create_partials	-	NaN	NaN	-	rule implementation changed	no update
partial_3.txt	Mon Mar 18 13:28:59 2019	create_partials	-	NaN	NaN	-	ok	no update
partial_4.txt	Mon Mar 18 13:28:59 2019	create_partials	-	NaN	NaN	-	ok	no update

Data pipeline and Snakemake¶

Data types¶

Text data¶

text data - csv (and tsv)¶

Text data - JSON¶

text data - JSON lines¶

text data - INI¶

text data - YAML¶

text data - XML¶

notes on XML¶

Binary formats¶

Binary formats - lossless images (bmp, png, gif, tiff)¶

Binary formats - DICOM¶

Binary format - numpy format (NPY)¶

Binary formats - HDF5¶

Binary formats - databases (sqlite)¶

Performances of various formats¶

File hashing¶

Data Pipelines¶

1 - metadata¶

2 - raw data¶

3 - source code¶

4 - source data¶

5 - usage data¶

6 - intermediate data¶

7 - temporary data¶

Snakemake¶

Configurations¶

Exercise¶

suggestions¶