deagar committed on
Commit
e19a510
·
1 Parent(s): 0643282

updated assessment ipynb

Files changed (1)
  1. notebooks/assesment.ipynb +38 -0
notebooks/assesment.ipynb CHANGED
@@ -0,0 +1,38 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PySpark Data Engineering Assessment\n",
+ "\n",
+ "## Tasks\n",
+ "\n",
+ "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
+ " - a Pandas DataFrame\n",
+ " - a Spark DataFrame\n",
+ "\n",
+ "2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
+ "\n",
+ "3. Run basic aggregations:\n",
+ " - Find the average Fare by Pclass\n",
+ " - Find survival rate by Sex and Pclass\n",
+ " - etc.\n",
+ "\n",
+ "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
+ "\n",
+ "5. Bonus tasks:\n",
+ " - Create a temporary Spark SQL table/view, query it with SQL syntax.\n",
+ " - Provide quick EDA (e.g., distribution of Ages).\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
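
The cleaning and aggregation tasks listed in the notebook cell above could be approached roughly as follows. This is a minimal sketch on the Pandas side only, not part of the commit. The sample rows are invented for illustration (the notebook itself points at `../data/titanic.csv`), the `Survived` column name is an assumption, and the helper names `clean`, `avg_fare_by_pclass`, and `survival_rate` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows, invented for illustration only. The real
# notebook is expected to read ../data/titanic.csv instead.
# The "Survived" column name is an assumption.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Fare
1,1,female,38,80.0
0,1,male,54,50.0
1,3,female,26,8.0
0,3,male,,7.0
0,3,male,22,
0,3,male,30,10.0
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Task 2: drop rows with nulls in Age or Fare."""
    return df.dropna(subset=["Age", "Fare"])


def avg_fare_by_pclass(df: pd.DataFrame) -> pd.Series:
    """Task 3: average Fare for each Pclass."""
    return df.groupby("Pclass")["Fare"].mean()


def survival_rate(df: pd.DataFrame) -> pd.Series:
    """Task 3: mean of the 0/1 Survived flag for each (Sex, Pclass) group."""
    return df.groupby(["Sex", "Pclass"])["Survived"].mean()


# Task 1 (Pandas half): read the CSV text into a DataFrame, then clean it.
df = clean(pd.read_csv(io.StringIO(SAMPLE_CSV)))
fare_by_class = avg_fare_by_pclass(df)
rate_by_group = survival_rate(df)

print(fare_by_class)
print(rate_by_group)
```

The Spark half of the tasks would follow the same shape with a Spark DataFrame in place of the Pandas one. It is left out here so the sketch runs without a Spark installation.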