Description
Problem 1
(k-means, 55pts) The dataset provided for this problem NBAstats.csv is stats from NBA players. The
players are indexed by their names, and they are labeled by 5 different positions: {center (C), power forward
(PF), small forward (SF), shooting guard (SG), point guard (PG)} and there are 27 attributes, e.g., age,
team, games, games started, minutes played and so on (that makes total of 29 columns in the data matrix).
Make sure that you standardize the data (zero-mean and standard deviation = 1) before you analyze the
data.
1. (30pts) Write a function cluster = mykmeans(X, k) that clusters data X ∈ R
n×p
(n number of
objects and p number of attributes) into k clusters.
2. (10pts) For this problem, use all features except team. Use your code to group the players into
k = {3, 5} clusters. Report the centers found for each clusters for each k, distribution of positions in
each cluster and your brief observation.
3. (5pts) Some of the attributes are perhaps redundant in terms of Linear Algebra. Report which ones
are redundant and explain why.
4. (10pts) For this problem, use the following set of attributes {2P%, 3P%, FT%, TRB, AST, STL,
BLK} to perform k-means clustering with k = {3, 5}. Report the centers found for each clusters for
each k, distribution of positions in each cluster and your brief observation.
Instructor: W. H. Kim (won.kim@uta.edu), TA: Priyank Arora (priyank.arora@mavs.uta.edu) Page 1 of 2
CSE4334/5334 Data Mining Assignment 1
Problem 2
(k-NN, 45pts) The dataset provided for this problem NBAstats.csv is stats from 475 NBA players. The
players are labeled by 5 different positions: {center (C), power forward (PF), small forward (SF), shooting
guard (SG), point guard (PG)} and there are 27 attributes, e.g., age, team, games, games started, minutes
played and so on. Use the first 375 players as training data and remaining 100 players as testing data. Make
sure that you standardize the data (zero-mean and standard deviation = 1) before you analyze the data.
1. (25pts) Write a function class = myknn(X, test, k) that performs k-nearest neighbor (k-NN) classification where X ∈ R
n×p
(n number of objects and p number of attributes) is training data, test is
testing data, and k is a user parameter.
2. (10pts) For this problem, use all features except team. Use your k-NN code to perform classification.
Set k = {1, 5, 10, 30} and report their accuracies and your observation.
3. (15pts) For this problem, use the following set of attributes {2P%, 3P%, FT%, TRB, AST, STL,
BLK} to perform k-NN classification with k = {1, 5, 10, 30}. Report accuracies for each k and your
observation.
Instructor: W. H. Kim (won.kim@uta.edu), TA: Priyank Arora (priyank.arora@mavs.uta.edu) Page 2 of 2