
STAT5009 Decision Methods & Predictive Analytics

Take-Home Project

Objective

This take-home project is one of three assessments (along with Tests 1 & 2) in this unit. It is worth 60% of the overall mark.

The main objective is to allow the participants in the unit to demonstrate their grasp of the fundamentals and practical use of statistical and machine learning methods for prediction/classification that have been discussed during the unit, and those that participants will research on their own.

The project should focus on the analysis of a substantive dataset that participants may obtain from online sources, or from their place of work.

Up to now, in both lectures and tutorials, we have analyzed data and fitted predictive models as if the steps to do so were clear, well laid out, and led invariably to a ‘correct’ answer. Reality, however, is messier. There is no linear path from problem and data to solution, and one of the pedagogical objectives of the project is to give participants some sense of that.

Participants should work (with some exceptions) in teams of 2 people. Analysis and reporting are to be carried out in R/RStudio using R Markdown.

Assessment

Proposal (2–3 pages)       10%   TW7
Peer Review of Proposal     5%   TW8
Written Report             35%   TW12
Oral Presentation          10%   TW12

Details of each of these assessments are shown below, and rubrics and other reference material will be available on Blackboard.


Participants may wish to use data from their own workplaces, as long as confidentiality requirements do not prevent them from writing a report that will be read by the lecturer or from speaking about their analysis to the other participants in the unit.

There are many public sources of data available, including open data websites such as OpenDataSoft. The appendix also contains a list of data websites compiled by an academic at the University of Idaho.

The idea is to find a dataset that is sufficiently complex to allow you to demonstrate your familiarity with the methods studied in the unit, as well as with methods that we have not covered. There may be several response variables for which prediction/classification methods have to be used. In addition, you will find yourself more motivated if you select a dataset from a field that is of interest to you.

Project Proposal

The project proposal is a short (2–3 page) Word document produced using R Markdown that contains:

1. Title

2. Data & Analyses

a. Objective: What do you plan on predicting/classifying and why?

b. Where do the data come from? Have these data been analyzed before?

c. Describe context and variables and their types; show some plots/tables

d. What analyses do you propose to carry out?

e. How will you evaluate the predictive models?
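As one illustration of item (e), a hold-out evaluation could be sketched as follows. This is a minimal sketch, not a prescribed method: the built-in `mtcars` data, the 70/30 split, the linear model, and the choice of RMSE and MAE are all assumptions made for the example.

```r
# Hold-out evaluation sketch using the built-in mtcars data
# (an illustrative stand-in, not a suggested project dataset).
set.seed(1)                                   # make the split reproducible
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))  # hold out roughly 30% of rows
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)       # fit on the training rows only
pred <- predict(fit, newdata = test)          # predict the held-out rows

rmse <- sqrt(mean((test$mpg - pred)^2))       # root mean squared error
mae  <- mean(abs(test$mpg - pred))            # mean absolute error
print(c(RMSE = rmse, MAE = mae))
```

The key point is that the model never sees the held-out rows during fitting, so the reported errors estimate performance on new data.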

Your proposal will be marked by one of your classmates.

Peer Review of Proposal

You will be provided with a rubric and some general guidelines to help you evaluate a proposal written by one of your classmates.

Project Report

The project report should be written as a formal technical report. It can be written wholly in R Markdown and then converted to Word, or some combination of R Markdown for technical appendices and Word for the main body. There is no prescribed structure, but it should contain the following elements:

1. Problem Statement and Background

• What is the problem you are trying to solve? Where do the data come from? Include background material as appropriate.

2. Methods

• What are the methods you used for exploratory analysis and for prediction/classification? Provide background information on methods that we did not cover in the unit.

• What hyper-parameter choices did you make and why?

• What data cleaning/wrangling did you have to do before analysis?

• Include methods that didn’t work as well as those that did.
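One way to justify a hyper-parameter choice is to compare candidate values by cross-validation. The sketch below assumes 5 folds, the built-in `mtcars` data, and polynomial degree as the hyper-parameter; all of these are illustrative choices rather than requirements of the project.

```r
# k-fold cross-validation sketch for choosing a hyper-parameter
# (here: the polynomial degree of hp in a regression for mpg).
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels
degrees <- 1:4                                        # candidate values

cv_rmse <- sapply(degrees, function(d) {
  fold_mse <- sapply(1:k, function(f) {
    train <- mtcars[folds != f, ]                     # fit on k - 1 folds
    valid <- mtcars[folds == f, ]                     # validate on the rest
    fit <- lm(mpg ~ poly(hp, d), data = train)
    mean((valid$mpg - predict(fit, newdata = valid))^2)
  })
  sqrt(mean(fold_mse))                                # cross-validated RMSE
})
names(cv_rmse) <- degrees
print(cv_rmse)

best_degree <- degrees[which.min(cv_rmse)]            # lowest CV error wins
```

Reporting the full `cv_rmse` vector, not only `best_degree`, lets a reader see how sensitive the result is to the choice.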

3. Results

• Provide a detailed description of your results. What are the performance measures you used to assess predictive/classification accuracy?

• If the data have been analyzed before, how well did your methods perform compared to those that others used?

• Use informative and interesting visualizations for EDA and for displaying your results.
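For a classification problem, common performance measures can be read off a confusion matrix. The sketch below uses two small made-up label vectors (`actual` and `predicted` are hypothetical, purely for illustration) and computes accuracy, precision, and recall in base R.

```r
# Confusion-matrix sketch with hypothetical labels (not real results).
lv <- c("no", "yes")
actual    <- factor(c("yes", "yes", "yes", "no", "no",
                      "no", "no", "yes", "no", "no"), levels = lv)
predicted <- factor(c("yes", "no", "yes", "no", "no",
                      "yes", "no", "yes", "no", "no"), levels = lv)

cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["yes", "yes"]   # predicted yes, actually yes
fp <- cm["yes", "no"]    # predicted yes, actually no
fn <- cm["no", "yes"]    # predicted no, actually yes
tn <- cm["no", "no"]     # predicted no, actually no

accuracy  <- (tp + tn) / sum(cm)   # share of all cases classified correctly
precision <- tp / (tp + fp)        # share of predicted "yes" that are correct
recall    <- tp / (tp + fn)        # share of actual "yes" that are found
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

Which measure matters most depends on the problem: with imbalanced classes, accuracy alone can be misleading, so precision and recall are worth reporting alongside it.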

4. Conclusions and Lessons Learned

• What would you have done differently? What other methods could you have used?

Depending on the complexity of the problem you have decided to tackle, the main body of the report will be 10–20 pages long, including important plots and tables. The appendix should contain the R Markdown file and the resulting output from your data wrangling, exploratory data analysis, and quantitative analysis. If you use any external resources such as books or websites – and you are encouraged to do so! – please make sure that you cite them appropriately.

A rubric will be provided on Blackboard to guide you as you write the report. If you are working in a team, please provide a breakdown of the effort of each member, and what each individual worked on.


You will be required to add the following statement to your report:

1. This assignment is my/our own original work, except where I/we have appropriately cited the original source (appropriate citation of original work will vary from discipline to discipline).

2. This assignment has not previously been submitted in any form for this or any other unit, degree or diploma at any university or other institute of tertiary education.

3. I/we acknowledge that it is my responsibility to check that the file I/we have submitted is: a) readable, b) the correct file and c) fully complete.

Oral Presentation

The last lecture/workshop slot will be devoted to oral presentation of your work.

Depending on the number of presentations, each presentation will be between 8 and 12 minutes long, plus some time for questions. A rubric will be made available on Blackboard.


The course textbook and supporting materials should be your starting point for help on exploratory data analysis and predictive methods. There is plenty of online help on using R. For example, the website Stack Overflow has a subsection devoted to R that is very useful. A searchable archive of the R help list may be found at this website. And, of course, you are welcome to contact me for guidance.

Good luck.

Appendix A Additional Data Sources


(Hyperlinked and compiled by Stephen Sauchi Lee, University of Idaho.)

1. 200,000+ Jeopardy questions

2. Awesome Public Datasets on github, curated by caesar0301.

3. AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.

4. Canada Open Data, pilot project with many government and geospatial datasets.

5. Causality Workbench data repository.

6. CDC Data — Medical data from the Centers for Disease Control and Prevention

7. — US government source of data about the nation’s people and economy

8. CKAN — Open-source data portal platform

9. Corral Big Data repository at Texas Advanced Computing Center, supporting data centric science.

10. CrowdFlower Data for Everyone library.


11. Data Market — Portal for shared business data

12. Data Planet, The largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases.

13. Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly (Jan 2011).

14. — Source of machine readable datasets generated by the US government

15., publicly available data from UK (also London datastore.)

16., central guide for education data resources including high value data sets, data visualization tools, resources for the classroom, applications created from open data and more.

17., open government data from US, EU, Canada, CKAN, and more.

18. DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.

19. DataMarket, visualize the world’s economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.

20., a clearinghouse of datasets available from the City & County of San Francisco, CA.

21. Dataverse Network — Repository for research datasets

22. Delve, Data for Evaluating Learning in Valid Experiments

23. Donors Choose: data related to their projects

24. EconData, thousands of economic time series, produced by a number of US Government agencies.

25. Enron Email Dataset, data from about 150 users, mostly senior management of Enron.

26. Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana – the trusted and comprehensive resource for European cultural heritage content.

27. FEDSTATS, a comprehensive source of US statistics and more

28. FIMI repository for frequent itemset mining, implementations and datasets.

29. Financial Data Finder at OSU, a large catalog of financial data sets.

30. FiveThirtyEight: data and code related to their articles


31. Free SVG Maps — Website for free geographic maps

32. GDELT: The Global Database of Events, Language, and Tone, described by the Guardian as “a big data history of life, the universe and everything.”

33. GeoDa Center, geographical and spatial data.

34. Google ngrams datasets, text from millions of books scanned by Google.

35. Google Public Data Explorer — Google’s public data portal to explore, visualize, and communicate large datasets

36. Grain Market Research, financial data including stocks, futures, etc.

37. Guardian DataBlog — Data journalism and data visualization from the Guardian

38. HitCompanies Datasets, comprehensive data on random 10,000 UK companies sampled from HitCompanies, updated automatically using AI/Machine Learning.

39. ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.

40. IMDb Datasets — Webpage for access to IMDb datasets

41. Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.

42. Investor Links, includes financial data

43. Jake Hofman Data Links — Jake Hofman’s bookmarked computational social science data resources

44. Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.

45. Kaggle – home of Data Science

46. KDD Cup center, with all data, tasks, and results.

47. KDnuggets Data Repositories List — Data repository list maintained by KDnuggets, a popular data mining website

48. Kevin Chai list of datasets, for text, SNA, and other fields.

49. KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.

50. Datasets — Webpage for access to datasets


51. Linked Data — Linkage site for distributed data

52. Linking Open Data project, aimed at making data freely available to everyone.

53. Million Song Dataset

54. MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.

55. ML Data, the data repository of the EU Pascal2 networks.

56. — A public repository for machine learning data

57. NASDAQ Data Store, provides access to market data.

58. National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.

59. National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.

60. NetworkRepository: Interactive Data Repository, has many collections of graph and networks from social science, machine learning, scientific computing, and other areas.

61. Open Data Census, assesses the state of open data around the world.

62. Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.

63. OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.

64. Peter Skomoroch (LinkedIn) Data Links — Peter Skomoroch’s bookmarked machine learning data resources

65. PubGene(TM) Gene Database and Tools, genomic-related publications database

66. Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.

67. qunb, a platform to find and visualize quantitative data.

68. RealClimate Data — Aggregator for selected sources of code and data related to climate science

69. Reddit Open Data — Forum on the social news site reddit for open APIs and datasets

70. Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits


71. Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.

72. SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.

73. Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management web site.

74. StateMaster — Reference site for data on US states

75. StatLib, CMU Datasets Archive.

77. The Upshot: data related to their articles

78. Time Series Data Library

79. UCI Datasets — The UC Irvine Machine Learning Repository, a popular source of machine learning datasets

80. UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.

81. UCR Time Series Data Archive, offering datasets, papers, links, and code.

82. UFO reports: geolocated and time-standardized UFO reports for close to a century

83. UK’s Met Office Data — Climate station records from the UK’s National Weather Service

84. UK’s Office for National Statistics — Source of datasets generated by the UK’s Office for National Statistics

85. United States Census Bureau.

86. Visual Analytics Benchmark Repository.

87. Web Data Commons, structured data from the Common Crawl, the largest public web corpus.

88. Wikipedia Database — Webpage for access to complete Wikipedia database dumps

89. Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources

90. Wolfram Alpha disease and patient level data.


91. Wolfram|Alpha — Computational knowledge engine or answer engine

92. World Bank Catalog — World Bank data

93. Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition

94. Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.

95. Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities

