Big Data — Question 1 (20 Marks)
A. Outline the steps that a read request goes through, starting when a client opens a file on HDFS (given a pathname).
An HDFS client reading data from a file on HDFS:
a. Receives all file data through the NameNode
b. Receives the location of file blocks from the NameNode
c. Receives both the data and block locations from the NameNode
d. Receives both the data and block locations from a DataNode
Choose the single correct answer and explain the effect of this design on the performance and scalability of the system, as well as the requirements it imposes on the affected components [5 points]. Also discuss why the non-selected options are wrong [3 points].
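The separation of metadata and data traffic at the heart of this question can be illustrated with a minimal sketch. This is not the real HDFS API; all class and method names below are illustrative assumptions. The key point is that the NameNode answers only metadata lookups (block locations), while the bytes flow directly from DataNodes to the client.

```python
# Sketch of the HDFS read path: metadata from the NameNode,
# data directly from DataNodes. Names are illustrative, not Hadoop's API.

class NameNode:
    def __init__(self, block_map):
        # block_map: path -> list of (block_id, [hosts holding a replica])
        self.block_map = block_map

    def get_block_locations(self, path):
        # Metadata-only lookup; no file bytes flow through the NameNode.
        return self.block_map[path]

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def hdfs_read(namenode, datanodes, path):
    data = b""
    for block_id, hosts in namenode.get_block_locations(path):
        # Read each block directly from (ideally the closest) replica holder.
        data += datanodes[hosts[0]].read_block(block_id)
    return data

namenode = NameNode({"/f": [("b1", ["dn1"]), ("b2", ["dn2"])]})
datanodes = {"dn1": DataNode({"b1": b"hello "}),
             "dn2": DataNode({"b2": b"world"})}
print(hdfs_read(namenode, datanodes, "/f"))  # b'hello world'
```

Because the NameNode never touches file bytes, it can serve many concurrent clients with cheap lookups, while aggregate read bandwidth scales with the number of DataNodes.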
Considering the functionality and operation of NameNodes and DataNodes, one would expect:
a. The RAM and disk capacity of all machines (i.e., both those hosting a NameNode and those hosting a DataNode) to be of the same (commodity) grade
b. The disk capacity of machines hosting NameNodes to be higher than that of machines hosting DataNodes
c. The RAM capacity of machines hosting NameNodes to be higher than that of machines hosting DataNodes
d. The disk and RAM capacity of machines hosting DataNodes to be higher than that of machines hosting NameNodes
Choose the single correct answer and explain the reasoning behind your choice.
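One relevant back-of-envelope calculation: the NameNode keeps the entire namespace in memory, and a commonly cited rule of thumb is roughly 150 bytes of heap per namespace object (file or block). The exact figure varies by Hadoop version, so treat the constant below as an assumption.

```python
# Rough NameNode heap estimate for a large namespace.
# BYTES_PER_OBJECT is an assumed rule of thumb (~150 B per file/block object).

BYTES_PER_OBJECT = 150
files = 100_000_000          # 100 M files
blocks_per_file = 1.5        # assumed average blocks per file

objects = files * (1 + blocks_per_file)      # file objects + block objects
heap_gb = objects * BYTES_PER_OBJECT / 1024**3
print(f"~{heap_gb:.0f} GB of NameNode heap")  # ~35 GB
```

Even modest namespaces thus demand tens of gigabytes of NameNode RAM, whereas DataNodes mainly need large, cheap disks.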
You are given a dataset in text format where each line contains two integer values (as strings) separated by space, the first denoting an item’s weight and the second denoting its volume. We want to design a MapReduce job that will return all distinct combinations of these two values that exist in the dataset; in other words, it will be implementing the statement “SELECT DISTINCT Weight, Volume FROM Dataset”. Provide the pseudocode for the map(key, value) and reduce(key, value) functions (alternatively, explain in plain English what these functions will do), assuming that the default text input format is used.
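One possible answer, written as Python-style pseudocode and assuming the default text input format (key = byte offset, value = one line of text), is sketched below together with a tiny local simulation of map → shuffle → reduce:

```python
# DISTINCT (weight, volume) as a MapReduce job.

def map_fn(key, value):
    # value is a line such as "12 7"; emit the (weight, volume) pair as the
    # key with a null value, so the shuffle groups all duplicates together.
    weight, volume = value.split()
    yield ((int(weight), int(volume)), None)

def reduce_fn(key, values):
    # All duplicates of a (weight, volume) pair arrive in one reducer call;
    # emit the pair exactly once, ignoring the (null) values.
    yield key

def run_job(lines):
    # Local stand-in for the framework: map, group by key, reduce.
    shuffled = {}
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            shuffled.setdefault(k, []).append(v)
    return sorted(k for key in sorted(shuffled)
                    for k in reduce_fn(key, shuffled[key]))

print(run_job(["12 7", "3 4", "12 7"]))  # [(3, 4), (12, 7)]
```

The framework's grouping on the composite key does the deduplication; the reducer only has to emit each key once.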
Extend your solution to also use a combiner; i.e., provide the pseudocode for its reduce(key, value) function (or explain in plain English what this function will do). Also discuss any amendments you may need to perform to your map(…) and/or reduce(…) functions so that they work correctly when this combiner is defined.
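A possible combiner, continuing the sketch above: it deduplicates locally on each mapper before the shuffle. Note the one amendment it forces: unlike the final reducer, the combiner must emit (key, value) pairs in the same format the map function produces, because its output re-enters the shuffle and may pass through further combiner or reducer calls.

```python
# Combiner for the DISTINCT job: emit each locally seen
# (weight, volume) key once, keeping the map output format (key, None).

def combine_fn(key, values):
    yield (key, None)
```

The reduce function itself needs no change, since it already ignores the values; only the output shape of the local deduplication step matters.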
Is it possible for reducers to start executing before all mappers are finished? Also, is it possible for reducers to start producing output while some mappers are still executing? What would be the performance repercussions in each case? Explain your answer.
Spark optimises its processing on the RDD graph by breaking it down into stages, based on the dependencies among RDDs. Given a parent and a child RDD, the dependency between them is:
a. Narrow when each partition of the parent RDD is used by at most one partition of the child RDD and wide when each partition of the parent RDD is used by multiple child RDD partitions
b. Narrow when each partition of the parent RDD is used by multiple child RDD partitions and wide when each partition of the parent RDD is used by at most one partition of the child RDD
c. Ignored if all partitions of one or the other RDD reside on a single host.
d. Considered only for expensive operations such as joins and group-by’s.
Choose the single correct answer, then discuss how these dependencies are used to create the stages [3 points] and explain how and why this design makes sense from a performance/efficiency point of view [6 points].
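The stage-creation rule in question can be illustrated with a toy scheduler: walk the lineage and cut a new stage at every wide (shuffle) dependency, so that all narrow dependencies inside a stage can be pipelined on the same partition without moving data. The code below is an illustrative simulation, not Spark's actual internals.

```python
# Toy stage splitter: a new stage starts at each wide (shuffle) dependency.

def split_into_stages(ops):
    # ops: list of (operator name, dependency kind) in lineage order,
    # where the kind is "narrow" or "wide" w.r.t. the operator's parent.
    stages, current = [], []
    for name, dep in ops:
        if dep == "wide" and current:
            stages.append(current)   # shuffle boundary ends the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

lineage = [("textFile", "narrow"), ("map", "narrow"),
           ("groupByKey", "wide"), ("mapValues", "narrow"),
           ("join", "wide"), ("filter", "narrow")]
print(split_into_stages(lineage))
# [['textFile', 'map'], ['groupByKey', 'mapValues'], ['join', 'filter']]
```

Pipelining within a stage avoids materialising intermediate results, while the explicit stage boundaries mark exactly the points where an expensive shuffle (and checkpoint for recovery) is unavoidable.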
Consider a Log-Structured Merge Tree (LSM)-based data store. Indicate whether each of the following statements is true or false and briefly explain your answer [2 points each]:
a. The data store is generally expected to exhibit lower read latency as more items are added.
b. The data store is generally expected to exhibit high write throughput.
c. Each compaction results in fewer items being retrievable from the data store.
d. LSM-based data stores can only be implemented using a master-workers architecture.
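A minimal LSM sketch may help when reasoning about these statements. It assumes the standard design: writes go to an in-memory memtable (hence the high write throughput of statement b), which is flushed to immutable runs; reads check the memtable and then the runs newest-first (hence more runs can mean higher read latency, relevant to statement a); compaction merges runs, discarding obsolete versions while keeping every live key readable (relevant to statement c). Class and method names are illustrative.

```python
# Toy LSM store: memtable + immutable runs + compaction.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable, self.runs = {}, []
        self.limit = memtable_limit

    def put(self, k, v):
        self.memtable[k] = v                       # in-memory write: cheap
        if len(self.memtable) >= self.limit:
            self.runs.append(dict(self.memtable))  # flush an immutable run
            self.memtable = {}

    def get(self, k):
        if k in self.memtable:
            return self.memtable[k]
        for run in reversed(self.runs):            # newest run wins
            if k in run:
                return run[k]
        return None

    def compact(self):
        # Merge runs, keeping only the newest version of each key;
        # obsolete versions disappear, but every live key stays readable.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [merged]

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
db.compact()
print(db.get("a"), db.get("b"), db.get("c"))  # 3 2 4
```

Note how compaction removes the stale version a=1 yet all three keys remain retrievable with their latest values.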
Would you consider BigTable/HBase row stores or column stores? Explain your answer.
Consider a distributed data store which uses Consistent Hashing with an order-preserving hash function to assign data items to storage nodes. Indicate whether the following statements are true or false and explain your answer, giving examples if necessary [2 points each].
a. An item would be hashed to the same storage node regardless of fluctuations in the number of nodes in the system.
b. The storage load would generally be balanced across nodes.
c. Range queries are expected to perform well.
d. Point (equality) queries are expected to perform better than with a random hash function.
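The key property behind statements b and c can be sketched in a few lines: with an order-preserving hash, nodes own contiguous key ranges (modelled below with a bisect over sorted boundaries, an illustrative simplification of the hash ring), so adjacent keys land on the same or neighbouring nodes and a range scan touches few nodes; by the same token, a skewed key distribution skews the storage load.

```python
# Order-preserving placement: node i owns keys up to node_boundaries[i].
import bisect

node_boundaries = [100, 200, 300]   # assumed range split across 3 nodes

def node_for(key):
    return bisect.bisect_left(node_boundaries, key)

# A range query over [110, 190] stays on a single node:
touched = {node_for(k) for k in range(110, 191)}
print(touched)  # {1}
```

With a random hash function the same range query would scatter across essentially all nodes, which is the trade-off these statements probe.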
Consider a distributed data store where each item has N replicas (i.e., N instances of the item stored on nodes across the system). Let r be the number of replicas that must be contacted when reading an item, and w be the number of replicas that must be contacted when writing an item. Which of the following must hold in order for the system to provide strong consistency guarantees (linearizability) in the face of errors during reads and writes when using quorum-based replication? Discuss any assumptions and explain your answer.
a. Set r and w so that r + w > N and w + w > N.
b. Set r and w so that either r ≥ N/2 or w ≥ N/2.
c. Set r and w so that r + w > N and r > w.
d. None of the above.
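The intersection properties the options hinge on can be checked exhaustively for small N: r + w > N forces every read quorum to overlap every write quorum (so a read sees the latest acknowledged write), and w + w > N forces any two write quorums to overlap (needed to order concurrent writes). The snippet below verifies that these overlaps hold exactly under those inequalities.

```python
# Exhaustive check of quorum intersection for a small replica set.
from itertools import combinations

def quorums(n, size):
    # All possible quorums of the given size over n replicas.
    return list(combinations(range(n), size))

def always_overlap(qs1, qs2):
    # True iff every quorum in qs1 intersects every quorum in qs2.
    return all(set(a) & set(b) for a in qs1 for b in qs2)

N = 5
for r in range(1, N + 1):
    for w in range(1, N + 1):
        rw = always_overlap(quorums(N, r), quorums(N, w))
        ww = always_overlap(quorums(N, w), quorums(N, w))
        assert rw == (r + w > N)   # read/write overlap iff r + w > N
        assert ww == (w + w > N)   # write/write overlap iff 2w > N
print("overlap holds exactly when r + w > N and w + w > N")
```

This is the combinatorial core of the answer; whether quorum overlap alone suffices for linearizability under failed (partial) writes is the part the question asks you to discuss.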