updated assessment ipynb
notebooks/assesment.ipynb +38 -0
@@ -0,0 +1,38 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PySpark Data Engineering Assessment\n",
+    "\n",
+    "## Tasks\n",
+    "\n",
+    "1. Read the CSV data (in `../data/titanic.csv`) into:\n",
+    " - a Pandas DataFrame\n",
+    " - a Spark DataFrame\n",
+    "\n",
+    "2. Perform some data cleaning (e.g., drop rows with nulls in `Age` or `Fare`).\n",
+    "\n",
+    "3. Run basic aggregations:\n",
+    " - Find the average Fare by Pclass\n",
+    " - Find survival rate by Sex and Pclass\n",
+    " - etc.\n",
+    "\n",
+    "4. Write the cleaned Spark DataFrame to a Parquet file.\n",
+    "\n",
+    "5. Bonus tasks:\n",
+    " - Create a temporary Spark SQL table/view, query it with SQL syntax.\n",
+    " - Provide quick EDA (e.g., distribution of Ages).\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
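The notebook added here contains only the task list, so the following is a minimal sketch of how the tasks might be approached, not the expected solution. It assumes the CSV uses the usual Titanic column names; `Age`, `Fare`, `Pclass`, and `Sex` appear in the task text, while `Survived` is an assumption. The helper names (`clean_pandas`, `aggregate_pandas`, `run_spark`) and the output path `titanic_clean.parquet` are hypothetical.

```python
import pandas as pd

CSV_PATH = "../data/titanic.csv"  # path given in the task list


def clean_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    """Task 2 (pandas): drop rows with nulls in Age or Fare."""
    return pdf.dropna(subset=["Age", "Fare"])


def aggregate_pandas(pdf: pd.DataFrame):
    """Task 3 (pandas): average Fare by Pclass, survival rate by Sex and Pclass.

    Assumes a 0/1 `Survived` column, so its mean is the survival rate.
    """
    avg_fare = pdf.groupby("Pclass")["Fare"].mean()
    survival = pdf.groupby(["Sex", "Pclass"])["Survived"].mean()
    return avg_fare, survival


def run_spark(csv_path: str = CSV_PATH, out_path: str = "titanic_clean.parquet"):
    """Tasks 1 to 4 plus the SQL bonus with Spark (requires pyspark)."""
    # Imported lazily so the pandas helpers work without pyspark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("assessment").getOrCreate()

    # Task 1: read the CSV into a Spark DataFrame, inferring column types.
    sdf = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Task 2: same cleaning rule as the pandas helper.
    clean = sdf.dropna(subset=["Age", "Fare"])

    # Task 3: basic aggregations.
    clean.groupBy("Pclass").agg(F.avg("Fare").alias("avg_fare")).show()
    clean.groupBy("Sex", "Pclass").agg(
        F.avg("Survived").alias("survival_rate")
    ).show()

    # Task 4: write the cleaned data to Parquet (hypothetical output path).
    clean.write.mode("overwrite").parquet(out_path)

    # Bonus: register a temporary view and query it with SQL.
    clean.createOrReplaceTempView("titanic")
    spark.sql(
        "SELECT Pclass, COUNT(*) AS n FROM titanic GROUP BY Pclass ORDER BY Pclass"
    ).show()
    return clean
```

For the pandas half, `clean_pandas(pd.read_csv(CSV_PATH))` covers tasks 1 and 2, and calling `.describe()` on the cleaned `Age` column gives a quick look at the age distribution for the EDA bonus. Calling `run_spark()` needs a local Spark installation.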