Project author: FedericoSerini

Project description:
Project 2 - Data Engineering Nanodegree
Primary language: Jupyter Notebook
Project URL: git://github.com/FedericoSerini/DEND-Project-2-Song-Play-Analysis-With-NoSQL.git


Project 2: Song Play Analysis With NoSQL

Project passed

Summary


Preamble

This project is not focused on the ETL process, but on data modeling
with Cassandra and how it differs from relational database data modeling.

Let's start with the PRIMARY KEY. In Cassandra, primary keys work slightly differently from RDBMS ones:
the PRIMARY KEY is made up of either just the PARTITION KEY
or the PARTITION KEY plus one or more CLUSTERING COLUMNS.
A PRIMARY KEY identifies exactly one record:
if we insert data with the same PRIMARY KEY, the existing data is overwritten with the latest record state.

In WHERE conditions the PARTITION KEY must be included, and you can use your CLUSTERING COLUMNS
to order your data, but an explicit ORDER BY is not necessary:
by default, rows within a partition are returned in ascending (ASC) clustering order.
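
The sketch below illustrates both points with the Python driver: two INSERTs with the same PRIMARY KEY leave only the latest state, and the SELECT restricts the PARTITION KEY and returns rows in ascending clustering order. The keyspace demo and the table plays are hypothetical, used only for illustration, and a local unauthenticated Cassandra instance is assumed.

  from cassandra.cluster import Cluster

  # Connect to a local Cassandra instance (no authentication, as in this project)
  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect()

  # Hypothetical keyspace, used only for this illustration
  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS demo
      WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
  """)
  session.set_keyspace('demo')

  # song_id is the PARTITION KEY, played_at is a CLUSTERING COLUMN
  session.execute("""
      CREATE TABLE IF NOT EXISTS plays (
          song_id INT,
          played_at INT,
          title TEXT,
          PRIMARY KEY ((song_id), played_at)
      )
  """)

  # Two INSERTs with the same PRIMARY KEY: the second overwrites the first
  session.execute("INSERT INTO plays (song_id, played_at, title) VALUES (1, 10, 'first state')")
  session.execute("INSERT INTO plays (song_id, played_at, title) VALUES (1, 10, 'latest state')")

  # The PARTITION KEY must be restricted in WHERE; rows come back in
  # ascending clustering order unless the table or query says otherwise
  for row in session.execute("SELECT played_at, title FROM plays WHERE song_id = 1"):
      print(row.played_at, row.title)   # prints: 10 latest state

  cluster.shutdown()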

In Cassandra the denormalization process is a must: you cannot apply normalization as in an RDBMS
because you cannot use JOINs. The more the data is denormalized, the faster the queries will
run; in fact, Cassandra has been optimized for fast writes rather than fast reads.

To reach complete denormalization you have to follow the pattern 1 Query - 1 Table.
This leads to data duplication, but that does not matter: the denormalization process
by its very nature produces data duplication.

Following the CAP theorem, Cassandra embraces the AP guarantees (Availability and Partition tolerance).

It provides only AP because of its structure: data is distributed across nodes, so if
a node goes down another one can satisfy the client request. However, due to
the high number of nodes, the data might not be up to date on every node, which is why
Cassandra offers only Eventual Consistency.


ETL process

Although this is not an ETL-focused project, we must take care of how data is
ingested into our database and how to read our input properly.
Our ETL process consists of reading every file in /event_data; this data has to be
aggregated into a file called event_datafile_new.csv and, after the aggregation,
the file has to be parsed and persisted to the database.
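
A minimal sketch of the aggregation step, assuming the script is run from the project root and that every CSV under /event_data shares the same header; any column selection or row filtering done in the notebook is omitted here:

  import csv
  import glob
  import os

  # Collect every event CSV, which is partitioned by date under /event_data
  file_paths = glob.glob(os.path.join('event_data', '**', '*.csv'), recursive=True)

  header_written = False
  with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as out_file:
      writer = csv.writer(out_file)
      for path in sorted(file_paths):
          with open(path, 'r', newline='', encoding='utf8') as in_file:
              reader = csv.reader(in_file)
              header = next(reader, None)
              if header is None:        # skip empty files
                  continue
              # Write the shared header only once
              if not header_written:
                  writer.writerow(header)
                  header_written = True
              for row in reader:
                  writer.writerow(row)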


How to run

First of all, you need a Cassandra instance up and running

Here you can find the Binary packages for your preferred operating system

After downloading the package, if you do not know how to move on, just follow this Documentation

You also have to install Python and Jupyter Notebook

Note:

In this example we will not use any authorization mechanism

After installing your database, Python, and Jupyter on your local machine,

open your terminal and type

jupyter notebook

This will start the service; after the service has started, just drag and drop the notebook
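
Once Cassandra is up, a quick way to check from the notebook that the database is reachable (no authorization mechanism, as noted above) is a small sketch like this:

  from cassandra.cluster import Cluster

  # Default contact point for a local, unauthenticated Cassandra instance
  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect()

  # system.local exists on every node and exposes the server version
  row = session.execute("SELECT release_version FROM system.local").one()
  print('Connected to Cassandra', row.release_version)

  cluster.shutdown()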


Project structure

This is the project structure; if a bullet item starts with a /,
the resource is a folder:

  • /event_data - The directory of CSV files partitioned by date
  • /images - A folder with images used in the Project1B Project_Template notebook
  • Project1B Project_Template.ipynb - A notebook that illustrates the project step by step
  • event_datafile_new.csv - The aggregated CSV composed of all the event_data files

CQL queries

Query 1: Give me the artist, song title and song's length in the music app history that was heard
during sessionId = 338 and itemInSession = 4

  CREATE TABLE IF NOT EXISTS song_data_by_session (
      session_id INT,
      session_item_number INT,
      artist_name TEXT,
      song_title TEXT,
      song_length DOUBLE,
      PRIMARY KEY ((session_id, session_item_number))
  )

In this case session_id and session_item_number together form the composite PARTITION KEY
and are enough to make a record unique for our request.
Our complete PRIMARY KEY is composed of session_id and session_item_number.

  SELECT artist_name, song_title, song_length
  FROM song_data_by_session
  WHERE session_id = 338 AND session_item_number = 4
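
As a complement, here is a hedged sketch of how a single row could be written to and read back from this table with the Python driver; the keyspace name sparkify and the inserted values are placeholders, not taken from the project:

  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('sparkify')   # keyspace name is an assumption

  # Parameterised INSERT: one row per (session_id, session_item_number)
  session.execute(
      "INSERT INTO song_data_by_session "
      "(session_id, session_item_number, artist_name, song_title, song_length) "
      "VALUES (%s, %s, %s, %s, %s)",
      (338, 4, 'Placeholder Artist', 'Placeholder Song', 123.45)   # placeholder values
  )

  # The query for sessionId = 338 and itemInSession = 4 then hits exactly one row
  row = session.execute(
      "SELECT artist_name, song_title, song_length "
      "FROM song_data_by_session "
      "WHERE session_id = 338 AND session_item_number = 4"
  ).one()
  print(row.artist_name, row.song_title, row.song_length)

  cluster.shutdown()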

Query 2: Give me only the following: name of artist, song (sorted by itemInSession)
and user (first and last name) for userid = 10, sessionid = 182

  CREATE TABLE IF NOT EXISTS song_user_data_by_user_and_session_data (
      user_id INT,
      session_id INT,
      session_item_number INT,
      artist_name TEXT,
      song_title TEXT,
      user_first_name TEXT,
      user_last_name TEXT,
      PRIMARY KEY ((user_id, session_id), session_item_number)
  )

In this case user_id and session_id are the COMPOUND PARTITION KEY;
this allows us to have a unique PRIMARY KEY for our query. For this request we have to
order by session_item_number but not filter on it, so we declare session_item_number as a CLUSTERING COLUMN.
Our complete PRIMARY KEY is composed of user_id, session_id and session_item_number.

  SELECT artist_name, song_title, user_first_name, user_last_name
  FROM song_user_data_by_user_and_session_data
  WHERE user_id = 10 AND session_id = 182
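
Since session_item_number is the CLUSTERING COLUMN, the rows of this SELECT already come back sorted by it in ascending order; an explicit ORDER BY on the clustering column is only needed to reverse that order. A sketch with the Python driver (the keyspace name sparkify is an assumption):

  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('sparkify')   # keyspace name is an assumption

  # Default order: ascending by the CLUSTERING COLUMN session_item_number
  rows = session.execute(
      "SELECT session_item_number, artist_name, song_title, "
      "user_first_name, user_last_name "
      "FROM song_user_data_by_user_and_session_data "
      "WHERE user_id = 10 AND session_id = 182"
  )
  for row in rows:
      print(row.session_item_number, row.artist_name, row.song_title)

  # ORDER BY is allowed only on clustering columns, and only because the whole
  # partition key (user_id, session_id) is restricted; DESC reverses the order
  rows_desc = session.execute(
      "SELECT session_item_number, song_title "
      "FROM song_user_data_by_user_and_session_data "
      "WHERE user_id = 10 AND session_id = 182 "
      "ORDER BY session_item_number DESC"
  )

  cluster.shutdown()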

Query 3: Give me every user name (first and last) in my music app history who listened
to the song ‘All Hands Against His Own’

  CREATE TABLE IF NOT EXISTS user_data_by_song_title (
      song_title TEXT,
      user_id INT,
      user_first_name TEXT,
      user_last_name TEXT,
      PRIMARY KEY ((song_title), user_id)
  )

In this case song_title is the PARTITION KEY and user_id
is the CLUSTERING COLUMN. The request asks to retrieve user names
by song title, so we have to set song_title as the PARTITION KEY; but since
many users can listen to the same song, we may have many INSERTs with the
same key, and Cassandra overwrites data with the same key. We therefore add user_id
as a CLUSTERING COLUMN: it keeps each record unique even though we do not query on it.
Our complete PRIMARY KEY is composed of song_title and user_id.

  SELECT user_first_name, user_last_name
  FROM user_data_by_song_title
  WHERE song_title = 'All Hands Against His Own'
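
To see why the CLUSTERING COLUMN matters here, two different users listening to the same song must produce two distinct rows rather than an overwrite. A sketch with placeholder users (the keyspace name sparkify is an assumption, and the values are for illustration only):

  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('sparkify')   # keyspace name is an assumption

  insert_cql = (
      "INSERT INTO user_data_by_song_title "
      "(song_title, user_id, user_first_name, user_last_name) "
      "VALUES (%s, %s, %s, %s)"
  )

  # Placeholder users: same song_title (same partition), different user_id
  # (different clustering value), so both rows are kept instead of overwritten
  session.execute(insert_cql, ('All Hands Against His Own', 1, 'First', 'Listener'))
  session.execute(insert_cql, ('All Hands Against His Own', 2, 'Second', 'Listener'))

  rows = session.execute(
      "SELECT user_first_name, user_last_name "
      "FROM user_data_by_song_title "
      "WHERE song_title = %s",
      ('All Hands Against His Own',)
  )
  for row in rows:
      print(row.user_first_name, row.user_last_name)

  cluster.shutdown()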