Posted: September 19th, 2022

Data science and Python

Need help with questions 7-13 of this homework, please.

{
“cells”: [
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### BZAN 6357 Frameworks and Methods\n”,
“# Homework 1”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Requirement\n”,
"1. Rename this notebook as `HW-1 <name>.ipynb`, replacing the part `<name>` (including the pointy brackets) with your name\n",
“1. Each question is clearly marked by **Q. ##**. Provide your code answer in the designated _Code_ cell following each question\n”,
"1. If a question is conceptual, add a _Markdown_ cell below the question. Clearly mark your answer cell, such as:\n",
"    **Answer:** (followed by your answer)\n",
“1. Feel free to insert additional cells as needed\n”,
"1. This HW also includes demonstrations of useful data science skills for specific tasks. These are purely informative materials; you do not need to answer them, but feel free to study and use them any way you want in solving your own hands-on problems\n",
“1. **You must run all cells following their original order. Also run your answers in the _Markdown_ cells!**\n”,
"1. Save the notebook before closing it. Submit your completed notebook file (.ipynb) to the Blackboard homework submission item. **No need to zip**"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Points\n”,
“Q1-6 are conceptual, Q7-13 are hands-on\n”,
“\n”,
“| Q | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11| 12| 13|Total|\n”,
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
“|Pts|0.5|0.3|0.3|0.3|0.3|0.3|0.4|0.5|0.5|0.8|0.8|0.6|0.4|6|”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“# Conceptual Questions”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"---\n",
"**DKD** Chapter 1 (p.15)\n",
"\n",
"---"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 1 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### For each of the following meetings (a, b, c, d, e), explain which phase (or phases) in the CRISP-DM process is represented:\n”,
“(This Q same as DKD Q3)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“a. Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is.”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“b. The data mining project manager meets with the data warehousing manager to discuss how the data will be collected.”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“c. The data mining consultant meets with the Vice President for Marketing, who says that he would like to move forward with customer relationship management.\n”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“d. The data mining project manager meets with the production line supervisor, to discuss implementation of changes and improvements.\n”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“e. The analysts meet to discuss whether the neural network or decision tree models should be applied.\n”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 2 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“CRISP-DM is not the only standard process for data mining. Study an alternative methodology \”Roadmap\” (discussed in PML, p11; also discussed in class), discuss the similarities and differences with CRISP-DM.\n”,
“\n”,
“(This Q adapted from DKD Q5)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"---\n",
"**DKD** Chapter 2 (p.48-50)\n",
"\n",
"---"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 3 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Describe the possible negative effects of proceeding directly to mine data that has not been preprocessed.\n”,
“\n”,
“(This Q same as DKD Q1)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 4 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Explain why a birthdate variable would be preferred to an age variable in a database.\n”,
“\n”,
“(This Q same as DKD Q5)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"---\n",
"**DKD** Chapter 3 (p.88-90)\n",
"\n",
"---"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 5 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"Why do we need to perform exploratory data analysis? Why should we not simply proceed directly to the modeling phase and start applying our high-powered data mining software?\n",
“\n”,
“(This Q same as DKD Q2)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 6 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"Why should perfectly correlated (i.e., collinear) input variables not be included together when training models? What should be done if you detect collinear input vars?\n",
“\n”,
“(This Q not from DKD)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“# Hands-on Questions”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Import packages”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"import pandas as pd\n",
"import numpy as np\n",
"from numpy import NaN as NA\n",
"import numpy.random as random\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.preprocessing import StandardScaler, MinMaxScaler\n",
"from sklearn.model_selection import train_test_split\n",
"from statsmodels.stats.weightstats import ttest_ind\n",
"from statsmodels.stats.proportion import proportions_ztest\n",
"\n",
"# inline plot\n",
"%matplotlib inline\n",
"\n",
"# seaborn style\n",
"sns.set_style('ticks')"
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# this is so that results are replicable, and your work is graded based on your results\n",
"# Run this cell. Do not change anything!\n",
"np.random.seed(30)\n",
"random_state = 30"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Read data\n”,
">Be sure to use data file \"bzan6357_churn.csv\". Keep the data file in your jupyter notebook directory."
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"df_churn = pd.read_csv('bzan6357_churn.csv')"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Understanding data”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"df_churn.dtypes"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [

"column \"phone\" (i.e., the column that stores customers' phone numbers) is best treated as an ID column. Why? It is an \"object\"-type (text) column, and its values are unique for each customer (p.s. two customers having the same phone number would be considered a critical operational failure!)"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"## *caution about `area_code`*\n",
"\n",
"column `area_code` stores the three-digit area code of the phone line. Its data-type is `int`, but the area-code numbers do _not_ represent a higher or lower level; they are arbitrarily assigned by the telecom provider.\n",
"\n",
">Thus, the best way to use and model `area_code` is to treat it as a categorical variable\n",
"\n",
"The following code snippet transforms `area_code` into a \"categorical\" variable; note the type of `area_code` has been updated"
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"df_churn['area_code'] = df_churn['area_code'].astype('category')\n",
"# re-print data-types\n",
"df_churn.dtypes"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Preprocessing data”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“This is not a question. These materials demonstrate some useful skills about data preprocessing.\n”,
“### *quickly transform binary string column (`object`, text) to dummy column (`int` or `float`)*”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"df_churn1 = df_churn.copy()\n",
"\n",
"# Note the following columns are to be transformed:\n",
"# \"churn\", \"vmail_plan\"\n",
"# they have these possible values: \"yes\" / \"no\"\n",
"\n",
"# Version 1 below creates boolean columns (below two lines are commented out)\n",
"#df_churn1['churn'] = df_churn1['churn'] == 'yes'\n",
"#df_churn1['vmail_plan'] = df_churn1['vmail_plan'] == 'yes'\n",
"\n",
"# Version 2 below creates float columns\n",
"df_churn1['churn'] = (df_churn1['churn'] == 'yes').astype(float)\n",
"df_churn1['vmail_plan'] = (df_churn1['vmail_plan'] == 'yes').astype(float)"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“**sanity check**: verify transform successful”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"display(df_churn1.dtypes[['churn','vmail_plan']])\n",
"print('='*40)\n",
"print('Original values look like:')\n",
"display(df_churn[['churn','vmail_plan']].head(3))\n",
"print('-'*30)\n",
"print('Transformed values look like:')\n",
"display(df_churn1[['churn','vmail_plan']].head(3))"
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# sanity check data-types\n",
"df_churn1.dtypes"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
” ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
” ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
” ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"---\n",
"**DKD** Chapter 2 (p.48-50)\n",
"\n",
"---"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Use the \”bzan6357_churn.csv\” dataset imported above for the following exercises.\n”,
“>Run all of the previous cells before attempting the following questions”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 7 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Transform the `day_mins` column using Z-score standardization (\”z-standardization\”). Use dataframe `df_churn1`. Afterwards, replace the values in column `day_mins` of `df_churn1` with the z-standardized values\n”,
“\n”,
“(This Q adapted from DKD Q37)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
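For Q7, here is a minimal sketch of z-score standardization. Everything except the two standardization lines is a synthetic stand-in (the real `bzan6357_churn.csv` is not attached to this post); on the real `df_churn1`, only those two lines are needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_churn1 -- toy values, not the real churn data
rng = np.random.default_rng(30)
df_churn1 = pd.DataFrame({"day_mins": rng.normal(180.0, 50.0, size=200)})

# Z-score standardization: (x - mean) / std, overwriting the column in place
col = df_churn1["day_mins"]
df_churn1["day_mins"] = (col - col.mean()) / col.std()

print(df_churn1["day_mins"].agg(["mean", "std"]).round(3))
```

Note that `StandardScaler` (already imported at the top of the notebook) gives nearly the same result, but it divides by the population std (`ddof=0`) while `Series.std()` uses the sample std (`ddof=1`).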
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 8 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Transform the `night_mins` column using Z-score standardization, and replace the values in column `night_mins` of `df_churn1` with the z-standardized values. Then use package `matplotlib` to make a histogram plot of the _standardized_ variable; interpret the graph, and use your own words to describe the range of the standardized values\n”,
“\n”,
“(This Q adapted from DKD Q41)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
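For Q8, the same standardization plus a `matplotlib` histogram; again the dataframe here is a synthetic stand-in, so only the standardization and plotting lines carry over to the real `df_churn1`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for df_churn1 -- toy values, not the real churn data
rng = np.random.default_rng(30)
df_churn1 = pd.DataFrame({"night_mins": rng.normal(200.0, 50.0, size=500)})

# Z-standardize and overwrite the column
col = df_churn1["night_mins"]
df_churn1["night_mins"] = (col - col.mean()) / col.std()

# Histogram of the standardized variable
fig, ax = plt.subplots()
ax.hist(df_churn1["night_mins"], bins=25)
ax.set_xlabel("night_mins (z-standardized)")
ax.set_ylabel("frequency")
plt.close(fig)
```

For the interpretation part, the standardized values are unit-free: the bulk of a roughly bell-shaped variable falls within about -3 to +3 standard deviations of a mean of 0.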
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 9”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Transform the `intl_calls` column using Min-Max normalization (\”range-standardization\”), and replace the values in column `intl_calls` of `df_churn1` with the transformed values. Then make a histogram plot of the _standardized_ variable; interpret the graph, and use your own words to describe the range of the standardized values, and visually estimate the mean of the standardized values\n”,
“\n”,
“(This Q not from DKD)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
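For Q9, a sketch of Min-Max normalization on a synthetic stand-in column (toy counts; the real `intl_calls` values will differ):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for df_churn1 -- toy call counts, not the real churn data
rng = np.random.default_rng(30)
df_churn1 = pd.DataFrame({"intl_calls": rng.poisson(4.5, size=500).astype(float)})

# Min-Max normalization: (x - min) / (max - min), overwriting the column
col = df_churn1["intl_calls"]
df_churn1["intl_calls"] = (col - col.min()) / (col.max() - col.min())

fig, ax = plt.subplots()
ax.hist(df_churn1["intl_calls"], bins=20)
ax.set_xlabel("intl_calls (min-max normalized)")
ax.set_ylabel("frequency")
plt.close(fig)
```

After this transform the range is exactly [0, 1]; the mean lands wherever the distribution's center sits within that range, which is what the question asks you to estimate visually.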
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 10 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“If you are asked to transform categorical var `area_code` into a group of flag vars, how many flag vars will you need? Write codes to create and add the flag vars to the main dataset. Use dataframe `df_churn1` when you create flag vars; create a new dataframe `df_churn2` when you merge the datasets\n”,
“\n”,
“(This Q not from DKD)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
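For Q10, `pd.get_dummies` is one way to create the flag vars. The dataframe below is a synthetic stand-in, and the three area-code values (408/415/510) are assumptions for illustration; on the real `df_churn1` only the `get_dummies` and `concat` lines are needed:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_churn1; 408/415/510 are assumed toy area codes
rng = np.random.default_rng(30)
df_churn1 = pd.DataFrame({
    "area_code": pd.Categorical(rng.choice([408, 415, 510], size=100),
                                categories=[408, 415, 510]),
    "day_mins": rng.normal(180.0, 50.0, size=100),
})

# One flag (dummy) column per category of area_code
flags = pd.get_dummies(df_churn1["area_code"], prefix="area_code")

# Merge the flags into a new dataframe
df_churn2 = pd.concat([df_churn1, flags], axis=1)
print(list(flags.columns))
```

If your conceptual answer is that k categories need only k-1 flags (to avoid perfect collinearity among the dummies), pass `drop_first=True` to `get_dummies`.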
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Exploratory Data Analysis (EDA)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“This is not a question. These materials demonstrate some useful skills about EDA.\n”,
“\n”,
“### *quickly detect collinear input vars using elegant visualization*\n”,
“This code snippet uses `seaborn` package, **heatmap** object, and leverages the \”mask\” feature to effectively visualize all the information contained in the correlation matrix by only showing its lower half\n”,

"\n",
">You are encouraged but not required to fully understand how to implement the half-matrix corr heatmap"
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"\n",
"# get the correlation matrix for visualization and for a closer look later\n",
"corr = df_churn1.drop(labels=['phone'], axis=1).corr()\n",
"\n",
"# prepare tools for making a half-matrix correlation heatmap\n",
"mask = np.triu(np.ones_like(corr, bool))\n",
"f, ax = plt.subplots(1,1, figsize=(8, 8))\n",
"cmap = sns.diverging_palette(220, 10, as_cmap=True)\n",
"\n",
"# plot heatmap\n",
"sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0.1, square=True, ax=ax, \n",
"            linewidths=.5, cbar_kws={'shrink': .75})"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"**_[Observe and Think]_**\n",
"- five dark red results indicate possible collinear pairs of vars\n",
"- they are:\n",
"    - day_mins and day_charge\n",
"    - eve_mins and eve_charge\n",
"    - intl_mins and intl_charge\n",
"    - night_mins and night_charge\n",
"    - vmail_plan and vmail_message"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### *get the exact correlations to drill down the findings*\n”,
“> Note, you already have the full correlation matrix saved by doing the visualization”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
"# Run this cell. Do not change anything!\n",
"collinear_results = pd.Series(\n",
"    {'day_mins_charge': corr.loc['day_mins','day_charge'],\n",
"     'eve_mins_charge': corr.loc['eve_mins','eve_charge'],\n",
"     'intl_mins_charge': corr.loc['intl_mins','intl_charge'],\n",
"     'night_mins_charge': corr.loc['night_mins','night_charge'],\n",
"     'vmail_plan_msg': corr.loc['vmail_plan','vmail_message'],}\n",
")\n",
"print('Possible collinear results are:')\n",
"display(collinear_results.round(3))"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [

"- For the first four pairs of vars (perfectly correlated), one should retain only one in each pair; the \"_mins\" vars are perhaps easier to interpret;\n",
"\n",
"- For the last pair of vars, the correlation is nearly max but not quite (0.957); one may retain both vars, and later use other analysis to compare their relative importance.\n",
"\n",
"> the next homework will demonstrate a few smart tricks on how to quickly drop and retain variables"
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 11 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Complete the following three parts.\n”,
“(This Q not from DKD)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“(1) Use package `seaborn` to make a panel of four scatterplots (2 by 2 layout, two subplots in 1st row and two in 2nd row). Plot `day_mins` on the x-axis of all subplots and share x across all. Use `day_charge` and `eve_mins`, respectively, on the y-axis of the upper subplots; use `night_mins` and `intl_mins`, respectively, on the y-axis of the lower subplots. In each scatterplot, use different colors for subsets with `churn` values 0 and 1. Use dataframe `df_churn1`”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
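For Q11 part (1), a sketch of the 2-by-2 `seaborn` panel. The dataframe is a synthetic stand-in (the `day_charge = day_mins * 0.17` line is a toy way to mimic the collinearity seen in the heatmap above); the `plt.subplots` / `sns.scatterplot` pattern is what carries over to the real `df_churn1`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for df_churn1 with the columns the question needs
rng = np.random.default_rng(30)
n = 300
df_churn1 = pd.DataFrame({
    "day_mins": rng.normal(180.0, 50.0, n),
    "eve_mins": rng.normal(200.0, 50.0, n),
    "night_mins": rng.normal(200.0, 50.0, n),
    "intl_mins": rng.normal(10.0, 3.0, n),
    "churn": rng.integers(0, 2, n).astype(float),
})
df_churn1["day_charge"] = df_churn1["day_mins"] * 0.17  # toy collinear column

# 2x2 panel: day_mins on the shared x-axis, churn as the hue (color)
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
y_cols = ["day_charge", "eve_mins", "night_mins", "intl_mins"]
for ax, y in zip(axes.ravel(), y_cols):
    sns.scatterplot(data=df_churn1, x="day_mins", y=y, hue="churn", ax=ax)
plt.close(fig)
```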
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"(2) Observe each scatterplot. Disregarding the colors, use your intuition to estimate the overall correlation coefficient in each scatterplot (i.e., between each pair of vars):\n",
“\n”,
“1. correlation between \”day_charge\” and \”day_mins\”: \n”,
“1. correlation between \”eve_mins\” and \”day_mins\”: \n”,
“1. correlation between \”night_mins\” and \”day_mins\”: \n”,
“1. correlation between \”intl_mins\” and \”day_mins\”: ”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“(3) Develop codes to get the exact correlation coefficients of these pairs of vars. Compare the results with your estimates above”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
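For Q11 part (3), `DataFrame.corrwith` gets all four coefficients in one call. The dataframe below is the same synthetic stand-in idea as above; with the real `df_churn1`, only the `corrwith` line is needed:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_churn1 (day_charge made perfectly collinear as a toy)
rng = np.random.default_rng(30)
n = 300
df_churn1 = pd.DataFrame({
    "day_mins": rng.normal(180.0, 50.0, n),
    "eve_mins": rng.normal(200.0, 50.0, n),
    "night_mins": rng.normal(200.0, 50.0, n),
    "intl_mins": rng.normal(10.0, 3.0, n),
})
df_churn1["day_charge"] = df_churn1["day_mins"] * 0.17

# Exact correlation of each y-variable with day_mins
exact = df_churn1[["day_charge", "eve_mins", "night_mins", "intl_mins"]].corrwith(
    df_churn1["day_mins"])
print(exact.round(3))
```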
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 12 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"Perform a train-test split using the original dataset `df_churn` with a test_size of 0.4. Assign the sets to `df_train` and `df_test`. Then conduct a t-test for equal means on variable \"eve_calls\": begin by stating the null hypothesis, state the value of $\\alpha$ (alpha) you choose, conduct the t-test, print the results properly, clearly interpret the results, and make a conclusion\n",
"\n",
">**Very important!** Use the `random_state` variable in `train_test_split` that is defined at the top of this jupyter notebook; otherwise, your results will not match. Also, be sure to specify `shuffle=True`."
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
},
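For Q12, a sketch of the split plus the t-test using `train_test_split` and the `ttest_ind` that the notebook imports from `statsmodels`. The one-column dataframe here is a synthetic stand-in; on the real `df_churn` the same calls apply:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.stats.weightstats import ttest_ind

# Synthetic stand-in for df_churn; random_state mirrors the variable set at the top
rng = np.random.default_rng(30)
df_churn = pd.DataFrame({"eve_calls": rng.normal(100.0, 20.0, size=500)})
random_state = 30

# 60/40 split, shuffled, reproducible
df_train, df_test = train_test_split(df_churn, test_size=0.4,
                                     shuffle=True, random_state=random_state)

# H0: the mean of eve_calls is equal in the train and test sets (alpha = 0.05)
tstat, pvalue, dof = ttest_ind(df_train["eve_calls"], df_test["eve_calls"])
print(f"t = {tstat:.3f}, p-value = {pvalue:.4f}")
print("Reject H0" if pvalue < 0.05 else "Fail to reject H0 (means appear equal)")
```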
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## Q. 13 ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
"Use the train and test sets from the previous question, and conduct a z-test for equal proportions on variable \"vmail_plan\": begin by stating the null hypothesis, state the value of $\\alpha$ (alpha) you choose, conduct the test, print the results properly, clearly interpret the results, and make a conclusion"
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“## Write your codes here\n”
]
}
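For Q13, a sketch of the two-sample proportions z-test using `proportions_ztest` (imported at the top of the notebook). The yes/no column below is a synthetic stand-in for `vmail_plan`; with the real `df_train`/`df_test` from Q12, only the count/nobs/test lines are needed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.stats.proportion import proportions_ztest

# Synthetic stand-in for df_churn with a yes/no vmail_plan column
rng = np.random.default_rng(30)
df_churn = pd.DataFrame({"vmail_plan": rng.choice(["yes", "no"], size=500, p=[0.3, 0.7])})
random_state = 30
df_train, df_test = train_test_split(df_churn, test_size=0.4,
                                     shuffle=True, random_state=random_state)

# H0: the proportion of vmail_plan == "yes" is equal in both sets (alpha = 0.05)
count = np.array([(df_train["vmail_plan"] == "yes").sum(),
                  (df_test["vmail_plan"] == "yes").sum()])
nobs = np.array([len(df_train), len(df_test)])
zstat, pvalue = proportions_ztest(count, nobs)
print(f"z = {zstat:.3f}, p-value = {pvalue:.4f}")
print("Reject H0" if pvalue < 0.05 else "Fail to reject H0 (proportions appear equal)")
```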
],
“metadata”: {
“kernelspec”: {
“display_name”: “Python 3”,
“language”: “python”,
“name”: “python3”
},
“language_info”: {
“codemirror_mode”: {
“name”: “ipython”,
“version”: 3
},
“file_extension”: “.py”,
“mimetype”: “text/x-python”,
“name”: “python”,
“nbconvert_exporter”: “python”,
“pygments_lexer”: “ipython3”,
“version”: “3.6.4”
},
“varInspector”: {
“cols”: {
“lenName”: 16,
“lenType”: 16,
“lenVar”: 40
},
“kernels_config”: {
“python”: {
“delete_cmd_postfix”: “”,
“delete_cmd_prefix”: “del “,
“library”: “var_list.py”,
“varRefreshCmd”: “print(var_dic_list())”
},
“r”: {
“delete_cmd_postfix”: “) “,
“delete_cmd_prefix”: “rm(“,
“library”: “var_list.r”,
“varRefreshCmd”: “cat(var_dic_list()) ”
}
},
“types_to_exclude”: [
“module”,
“function”,
“builtin_function_or_method”,
“instance”,
“_Feature”
],
“window_display”: false
}
},
“nbformat”: 4,
“nbformat_minor”: 2
}
