Introduction Of Big Data Management And Business Intelligence Assignment
Big Data Management is the practice of governing, administering, and organizing large volumes of structured as well as unstructured data. Its main goal is to ensure that data quality remains high; it also makes the data more accessible for big data analytics and business intelligence. Corporations, government agencies, and other organizations employ big data management strategies to help them cope with fast-growing data pools, which often amount to many terabytes of data held in a variety of file formats. If data management is done effectively, it can help companies locate valuable information within large volumes of data drawn from different sources (Anwar et al. 2021). As a crucial part of the data management process, companies must decide which data should be kept for future use, which should be disposed of, and which should be analyzed to improve current business processes.
Literature Review
Critical evaluation of three big data processing paradigms
The paradigms involved in big data management include:
- Spreadsheet UI v/s Workflow- It is very important for data practitioners to know the type of data they will operate on so that they can judge whether a workflow-like user interface or a spreadsheet interface is needed. The workflow interface provides a canvas for placing icons or components that represent configured data tasks and connecting those components with lines that represent the dependencies and lineage within the workflow. Because of this abstraction layer, the content of the data is not viewable until the predefined workflow or job is run and the output of the system is browsed. It is important to note that this paradigm tends to assume, at creation time, which transformations and joins are required; multiple iterations and test phases may be needed to validate that the output meets the requirements (A-Ru-Han et al. 2022). A spreadsheet, on the other hand, gives users a direct view of the data itself and presents each attribute as a column, generally with visually embedded cues for data sparsity, mismatches, uniqueness, and other anomalies. This paradigm helps reduce the number of iterations and accelerates the data preparation cycle in which data transformation and validation are very important elements of the use case.
- Code-based approach v/s Clicks- With point-and-click and drag-and-drop technologies, ease of use has become an important differentiator in data preparation. However, the code-based approach has gained popularity among technical users who prefer its flexibility, lower software costs, and scope for customization. A code-based approach requires highly skilled resources, and any change in the code needs to be taken through development, testing, quality assurance, and production.
- Full data v/s Sample- This paradigm concerns use cases that require a complete data population, such as master data migration, regulatory reporting, and fraud analysis. Equally, there are use cases that are known to perform best on a subset or sample of the data, such as marketing segmentation and predictive analysis. A sample-based approach may increase the risk of missing data, which can have a large effect depending on the use case, because the size and sophistication of the sample differ from product to product (Asch et al. 2018). Some tools allow the sample size to be selected according to the resources available for processing. The full-data perspective, on the other hand, works with all the data records and column attributes in the dataset, which enables a detailed approach to data quality profiling. Full datasets have an important impact on data accuracy and on delivering reliable information, depending on the user's understanding and the case scenario (a short sketch contrasting sample-based and full-data profiling follows this list).
- Stand-Alone Application v/s Vendor Add-on- Another factor is whether the solution exists as a stand-alone application or as an add-on to an analytics application or ETL environment. An integrated solution can provide benefits, but there is a risk that its capabilities are too limited to meet the specific needs that arise.
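The trade-off between sample-based and full-data preparation can be illustrated with a minimal sketch in Python using pandas; the file name and column names below are hypothetical and used only for illustration.

import pandas as pd

# Hypothetical transaction file; column names are assumed.
df = pd.read_csv("transactions.csv")

# Full-data profiling: every record contributes to the quality metrics.
full_null_rate = df["amount"].isna().mean()
full_unique_customers = df["customer_id"].nunique()

# Sample-based profiling: faster, but rare anomalies may be missed.
sample = df.sample(frac=0.05, random_state=42)   # 5% sample of the records
sample_null_rate = sample["amount"].isna().mean()

print(f"null rate (full): {full_null_rate:.4f}")
print(f"null rate (5% sample): {sample_null_rate:.4f}")
print(f"unique customers (full): {full_unique_customers}")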
The three processing paradigms themselves, batch, real-time, and hybrid, can be characterized as follows:

- Batch processing: tasks run without interaction from the user; the operator starts the run, and the inputs (historically tapes and punched cards) are processed as serial operations.
- Because there is no interaction during a batch run, managers have little freedom to respond to critical exceptions until the run completes.
- Batch jobs are triggered by dependencies; for example, a customer placing an order online generates a request that waits for the next scheduled run.
- Abnormalities are normally only detected once the batch job completes and are corrected from the beginning of the next job.
- Real-time processing: transactions are handled according to priority as they arrive, which makes scheduling more difficult than in batch processing.
- A validated transaction is executed immediately, and committed transactions are serialized so that their effects are reflected consistently.
- Real-time processing relies on advanced computing technology to contain and analyze large volumes of data as they arrive.
- Efficient flow mechanisms must be scheduled so that concurrent tasks do not impose excessive overhead on stakeholders.
- Hybrid processing: hybrid computing systems traditionally combine analog and digital components, especially for complex mathematical computation.
- The term hybrid is also used to describe thin-client setups in which the local machine relies on programs served from a server rather than its own hard drive.
Implementation and development of big data solution
Different datasets with the .csv extension were used to analyze the data in the Gephi software. These datasets were created while users played the imaginary game named "Catch the Pink Flamingo". The datasets include ad-click.csv, which records the clicks users made on the advertisements that appeared during the game. The buy-clicks.csv file shows the data on the items that users purchased. The game-clicks.csv dataset holds the records of the clicks performed while playing the game (Barns et al. 2020). The level-events.csv dataset records the events that occurred during play, especially at the end or beginning of a new level of the game. The team-assignments.csv dataset records an entry every time a user joined a team, and the team.csv dataset contains information on the teams involved in the game.
The user-session.csv file records the sessions that each user plays across the whole flamingo game. Last but not least, users.csv represents the database built from the users of the game. Agile and iterative techniques allow these datasets to be loaded and analyzed quickly. The chat_leave_team_chat.csv dataset receives a new record whenever a user leaves a team.
After that, a bar plot of the team-level values was created from the datasets using the following syntax.
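A minimal sketch of how such a bar plot could be produced with pandas and matplotlib is given below, assuming team.csv contains a teamLevel column (the column name is an assumption).

import pandas as pd
import matplotlib.pyplot as plt

# Load the team data; the teamLevel column name is assumed for illustration.
teams = pd.read_csv("team.csv")

# Count how many teams sit at each team level and plot the counts as bars.
level_counts = teams["teamLevel"].value_counts().sort_index()
level_counts.plot(kind="bar")
plt.xlabel("Team level")
plt.ylabel("Number of teams")
plt.title("Distribution of team levels")
plt.show()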
It also creates an edge labelled "leaves" from the user node. The chat_respond_team_chat.csv dataset comes into play when one player's chat responds to a particular post by another player, at which point a new record is created. The chat_join_team_chat.csv dataset gains a new record, appended to the existing file, as soon as a new user joins the respective team (Brinch et al. 2018). The chat_mention_team_chat.csv dataset comes into play when users are notified about being mentioned, and a new record is added to this database. In addition, an edge labelled "mentions" is created between the user node and the chat text node.
Explanation of data analysis
K-mean Clustering- K-means clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-means clustering classifies data without any training on labelled data. Once the groups are defined and the algorithm has run, any new data point can easily be assigned to the group found to be most relevant (Cui et al. 2020). Real-world applications of K-means clustering include customer profiling, market segmentation, computer vision, astronomy, and search engines. Clustering helps identify two qualities in the data: meaningfulness and usefulness. Meaningful clusters expand domain knowledge, while useful clusters serve as building blocks for the data pipeline.
from sklearn.cluster import KMeans

# X is the numeric feature matrix prepared from the game datasets
kmeans = KMeans(n_clusters=4)        # partition the records into 4 clusters
kmeans.fit(X)
y_kmeans = kmeans.predict(X)         # cluster label assigned to each record
Decision Tree classification- A decision tree builds classification or regression models in the form of a tree structure. It breaks the dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is developed incrementally. The result is a tree with decision nodes and leaf nodes.
from sklearn.tree import DecisionTreeClassifier

# X_train/y_train and X_test/y_test are the training and test splits of the game data
dt = DecisionTreeClassifier(max_depth=6, random_state=1)  # limit depth to reduce overfitting
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(y_pred[0:5])   # first five predicted labels
The chat data was analyzed using the Neo4j software so that improvements could be made to the user's game experience.
As part of preprocessing, the developer first read the data and, to do so, split the whole dataset into two separate sheets, one an "edges" sheet and the other a "nodes" sheet. The nodes sheet contains more than one attribute column: the first column holds the Name attribute, and every Id has its own attributes. In the edges sheet, every edge has the following columns: a) Weight, b) Type, c) Target, d) Source. The data transformation operation for this task consists of two main phases (Fer et al. 2018). In the first phase, the developer built a nodes sheet in which every node is listed with a unique Id. In the second phase, the developer established an edges sheet in which every connection between nodes is represented as a connection between Ids. Before starting the processing, the developer made sure that every character within the "Character Interaction" sheet has a unique name and that this unique attribute is used consistently in every reference to the character within the sheet.
After setting up the worksheet in Excel, the developer listed every character from the Character Interaction sheet in a single column of the nodes sheet (Hye-Sun et al. 2019). This produced a list of every character in the short story. After that, all duplicate values were eliminated from the "Name" column, which left a list of the unique characters in the short story.
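A minimal sketch of how such node and edge sheets could be assembled with pandas is shown below; the input file and column names are assumptions made for illustration.

import pandas as pd

# Hypothetical chat interaction file; column names are assumed.
chat = pd.read_csv("chat_mention_team_chat.csv")

# Nodes sheet: one row per unique user, with a unique Id and a Name attribute.
nodes = chat[["userId"]].drop_duplicates().rename(columns={"userId": "Id"})
nodes["Name"] = "user_" + nodes["Id"].astype(str)

# Edges sheet: one row per interaction, expressed as Source -> Target Ids.
edges = chat.rename(columns={"userId": "Source", "mentionedUserId": "Target"})
edges["Type"] = "Directed"
edges["Weight"] = 1

# Write the two sheets that Gephi (or a spreadsheet) can import directly.
nodes.to_csv("nodes.csv", index=False)
edges[["Source", "Target", "Type", "Weight"]].to_csv("edges.csv", index=False)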
3.2 Correlation analysis
In this section, correlation analysis was used statistically to evaluate the strength of the linear relationship between the game variables and to calculate their association. The analysis evaluates the degree of change in one variable caused by a change in another. A positive correlation was established in this case. For a dataset such as this game's, the developer uses this analysis to investigate quantitative data gathered through research processes such as live polls and surveys (Khanra et al. 2020). The developer attempts to recognize the connections, trends, and patterns between two datasets or variables. According to the figure referred to above, the correlation indicates the value of the team level in relation to gaming performance.
From the above image, the relationship between the different user Ids and their chat sessions can be observed. Using Neo4j, the chat session data has been used for a graphical representation that shows different variables for different users. For the user id row, the value of user id indicates a higher experience value in comparison with the userSessionId, therefore establishing a positive correlation. The first user shows a negative correlation value; the reason is that this user avoided any form of team joining while playing the game, which produced the negative value (Mazumdar et al. 2019). This also highlights a lower experience of joining a team in a specific operation. The 3rd user id likewise shows a negative correlation with team level, for the same reason mentioned above. It can clearly be seen that the count hits mostly establish positive correlations, except for the 3rd and 6th ones. A positive value indicates that the user surpassed the restricted hits for the game, creating a higher experience for that relation, while the negative values indicate not surpassing the restricted hits of the game, which corresponds to a lower user experience with the hits of the game.
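A minimal sketch of how such a correlation matrix could be computed from the session data with pandas is given below; the file name and column names are assumptions.

import pandas as pd

# Hypothetical merged session data; column names are assumed for illustration.
sessions = pd.read_csv("user-session.csv")

# Pearson correlation between the numeric game variables.
corr = sessions[["userId", "userSessionId", "teamLevel", "count_hits"]].corr()
print(corr)

# Correlation of each variable with team level, sorted from strongest to weakest.
print(corr["teamLevel"].sort_values(ascending=False))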
4.0 Big Data Design
4.1 Enabling big data design and storage for processing
Big data is generally stored in a data lake. Whereas data warehouses are built on relational databases and contain only data that is structured in nature, data lakes can support different types of data and are based on databases or platforms designed for big data (Nadal et al. 2019). Therefore, it can be said that big data helps companies make informed decisions by understanding the desires of their customers. The analysis has the potential to help achieve rapid growth by making use of real-time data.
Batch processing
Batch processing refers to processing database files that have accumulated over a significant period. For instance, imagine completing all of a major financial firm's transactions in a single weekend. Such data includes many records for a single day, which may be saved as a document, a database, or another type of information (Pappas et al. 2018).
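As an illustration, a batch job over an accumulated transaction file might be processed in chunks; the file and column names below are assumptions, not part of the original solution.

import pandas as pd

# Hypothetical end-of-week batch job: aggregate a large transaction file in chunks.
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    # Sum transaction amounts per account within each chunk.
    partial = chunk.groupby("account_id")["amount"].sum()
    for account, amount in partial.items():
        totals[account] = totals.get(account, 0.0) + amount

# Write the consolidated batch output once the whole file has been processed.
pd.Series(totals, name="weekly_total").to_csv("weekly_totals.csv")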
Real time processing
Real-time data analysis refers to processing information within a very short time window to provide near-instantaneous results. Because the processing takes place as the information arrives, it requires a continuous supply of incoming data in order to produce a continuous output.
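A minimal sketch of processing records as they arrive is shown below, assuming a simple in-memory stream of game events (the event structure is hypothetical).

from collections import deque

def rolling_hit_rate(events, window=100):
    """Yield the hit rate over the last `window` events as each new event arrives."""
    recent = deque(maxlen=window)
    for event in events:                     # events is any iterable / stream of dicts
        recent.append(1 if event.get("hit") else 0)
        yield sum(recent) / len(recent)      # near-instantaneous output per input

# Example usage with a hypothetical stream of game-click events.
stream = ({"hit": i % 3 == 0} for i in range(1000))
for rate in rolling_hit_rate(stream):
    pass  # in a real system, the rate would be pushed to a dashboard or alerting rule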
Hybrid processing
Hybrid data management is an approach to distributing, collecting, and analyzing data that allows businesses to promote innovation. To reach this degree of data administration, they will need a hybrid cloud solution that can handle all forms of information in an organization.
4.2 Recommendations and Results
Some recommendations can be made to improve the results of big data analysis. A roadmap should be created before implementing structured, unstructured, or semi-structured data. The platforms should exist as a single technology stack rather than as individual pieces, and the technology strategy must therefore address a roadmap for integrating the big data platform. The strategy must also anticipate and address challenges regarding storage, capacity, and scaling for the future. Companies must ensure staff are trained before running experiments on big data. Data exploration, multi-platform techniques, and real-time monitoring have recently become very popular and should be implemented to produce better results (Park et al. 2022). Also, for the smooth adoption of big data, it is important to appoint a dedicated owner for each big data platform. A clear governance model can help ensure the completeness, integrity, and quality of the data.
4.3 Design choices for big data solutions
Advancements in big data have evolved hand in hand with growth in data volume. Managing very large volumes of information is a global challenge for data engineers as well as data scientists, and this challenge has driven new technologies such as NoSQL, Spark, Hadoop, and many more. The crucial issue of managing big data is one of resources (Vayena et al. 2018): the bigger the volume of information, the more resources are needed in terms of disks, processors, and memory. The purpose of performance optimization is either to decrease resource utilization or to use the available resources more effectively. This involves some major design choices through which big data analysis can be managed effectively. They are as follows:
- a) Designing on the basis of data volume: Before beginning to build any data process, one needs to understand the volume of data to be handled. Whether the dataset starts small or large, the developer needs to take performance optimization into consideration; applications and processes that perform effectively for big data generally carry too much overhead for small data.
- b) Decreasing data volume early in the process: When working with large datasets, reducing the size of the data as early as possible is the most effective way to achieve solid performance. This can be done in several ways: i) selecting data types economically, for example using an integer type rather than a wider type where the values allow it (Wan et al. 2021); ii) aggregating data, which is an efficient way to reduce data volume when lower granularity is not required; iii) leveraging complex data structures to reduce data duplication.
- c) Partitioning the data properly along with the processing logic: Enabling data parallelism is the most efficient way to process data quickly. As the volume of data grows, the number of parallel processes grows with it, so adding more hardware scales the whole data process without requiring code changes. For data engineers, a common approach is partitioning the data (Zhang and Wang, 2021). There are various approaches to data partitioning that can be effective here. Efficient partitioning provides the following outcomes: i) it allows downstream data processing steps, such as aggregation and joins, to happen within the same partition; for example, partitioning by time period is generally a sound strategy if the data processing logic is self-contained within a period; ii) the size of each partition should be even, to ensure that a similar amount of time is taken to process each partition. A short sketch illustrating points b) and c) follows this list.
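A minimal sketch, in pandas, of reducing data volume through type downcasting and aggregation and then partitioning by time period; the file and column names are assumptions made for illustration.

import pandas as pd

# Hypothetical event file; column names are assumed.
events = pd.read_csv("game-clicks.csv", parse_dates=["timestamp"])

# b) Reduce volume early: downcast wide numeric types and aggregate to a coarser grain.
events["clickCount"] = pd.to_numeric(events["clickCount"], downcast="integer")
events["date"] = events["timestamp"].dt.date
daily = events.groupby(["date", "userId"], as_index=False)["clickCount"].sum()

# c) Partition by time period so each partition can be processed independently.
daily["date"] = pd.to_datetime(daily["date"])
for period, part in daily.groupby(daily["date"].dt.to_period("M")):
    part.to_csv(f"clicks_{period}.csv", index=False)  # one file per month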
4.4 Design choices for governing ethics
Formal governance takes into consideration the formal policies, accountabilities, and standards for the data, while informal governance takes into consideration the culture, which is determined by the beliefs of the organizational actors. In addition, based on norms, values, and shared beliefs, the features of ethical governance can address ethical challenges, which involve establishing ethical norms, procedures, and rules and putting them into practice. Unethical practices can become legitimized and accepted within the organizational culture (Wan et al. 2021). To restrain this issue, the developer needs to be watchful in the governance of the data, and governments need to impose sanctions against unethical practices. Training and education are a means of constructing a proper set of shared notions, beliefs, and values and ultimately motivating the developer's actions around data practices. The developer needs to construct new procedures and rules to reinforce and regulate client behaviour in a proper way. These notions can make the flow of the data more transparent for clients (Zhang and Wang, 2021). However, where developers share information, improper utilization of that information by one organization within the partnership can have negative impacts on the other corporations within it.
These notions can also support data security policies where the data is subject to conflicting regulations and laws in different places. Accessibility can be an effective ethical design principle for big data analytics, and it needs to be incorporated into the development process of the data products being constructed. In addition, transparency and persuasion can be effective principles, and these are necessary to implement so that clients can make informed choices regarding their data. This involves giving users clear paths to opt in to the analysis easily.
Conclusions
The crucial need for big data, the management of new kinds of information, low-cost commodity hardware, and analytical software have together created a unique moment in the history of data analytics. These trends have made systems capable of analyzing different datasets very quickly and cost-effectively. It can be concluded from this report that Big Data Management will truly leap forward in the coming generation. Based on the above analysis, the game-clicks.csv dataset stores, in csv format, records of the actions made while playing the game. The level-events.csv file contains a record of what transpired throughout the game's events, especially at the finish or beginning of a new level. The team-assignments.csv dataset is a record produced each time a person joined a team, and the team.csv file contains information on the game's teams. The user-session.csv file keeps track of the user's sessions throughout the whole flamingo game. Last but not least, the users.csv file reflects the database that was built from the game's users.
References
Anwar, M.J., Gill, A.Q., Hussain, F.K. and Imran, M. 2021, "Secure big data ecosystem architecture: challenges and solutions", EURASIP Journal on Wireless Communications and Networking, vol. 2021, no. 1.
A-Ru-Han, B., Liu, Y., Dong, J., Zheng-Peng, C., Zhen-Jie Chen and Wu, C. 2022, "Evolutionary Game Analysis of Co-Opetition Strategy in Energy Big Data Ecosystem under Government Intervention", Energies, vol. 15, no. 6, pp. 2066.
Asch, M., Moore, T., Badia, R., Beck, M., Beckman, P., Bidot, T., Bodin, F., Cappello, F., Choudhary, A., de Supinski, B. and Deelman, E., 2018. Big data and extreme-scale computing: Pathways to convergence-toward a shaping strategy for a future software and data ecosystem for scientific inquiry. The International Journal of High Performance Computing Applications, 32(4), pp.435-479.
Barns, S., 2020. Re-engineering the city: Platform ecosystems and the capture of urban big data. Frontiers in Sustainable Cities, 2, p.32.
Brinch, M. 2018, "Understanding the value of big data in supply chain management and it business processes: Towards a conceptual framework", International Journal of Operations and Production Management, vol. 38, no. 7, pp. 1589-1614.
Cui, Y., Kara, S. and Chan, K.C., 2020. Manufacturing big data ecosystem: A systematic literature review. Robotics and computer-integrated Manufacturing, 62, p.101861.
Fer, I., Ryan, K., Moorcroft, P.R., Richardson, A.D., Cowdery, E.M. and Dietze, M.C. 2018, "Linking big models to big data: efficient ecosystem model calibration through Bayesian model emulation", Biogeosciences, vol. 15, no. 19, pp. 5801-5830.
Hye-Sun, K., Hwa-Young, J. and Hae-Jong Joo 2019, "The big data visualization technology based ecosystem cycle on high speed network", Multimedia Tools and Applications, vol. 78, no. 20, pp. 28903-28916.
Khanra, S., Dhir, A., Islam, A.N. and Mäntymäki, M., 2020. Big data analytics in healthcare: a systematic literature review. Enterprise Information Systems, 14(7), pp.878-912.
Mazumdar, S., Seybold, D., Kritikos, K. and Verginadis, Y. 2019, "A survey on data storage and placement methodologies for Cloud-Big Data ecosystem", Journal of Big Data, vol. 6, no. 1, pp. 1-37.
Nadal, S., Romero, O., Abelló, A., Vassiliadis, P. and Vansummeren, S., 2019. An integration-oriented ontology to govern evolution in big data ecosystems. Information systems, 79, pp.3-19.
Pappas, I.O., Mikalef, P., Giannakos, M.N., Krogstie, J. and Lekakos, G. 2018, "Big data and business analytics ecosystems: paving the way towards digital transformation and sustainable societies", Information Systems and eBusiness Management, vol. 16, no. 3, pp. 479-491.
Park, W.H., Siddiqui, I.F., Chakraborty, C., Qureshi, N.M.F. and Shin, D.R., 2022. Scarcity-aware spam detection technique for big data ecosystem. Pattern Recognition Letters, 157, pp.67-75.
Sagar, L.C. and Sinha, M. 2021, "A study on emerging trends in Indian startup ecosystem: big data, crowd funding, shared economy", International Journal of Innovation Science, vol. 13, no. 1, pp. 1-16.
Vayena, E., Dzenowagis, J., Brownstein, J.S. and Sheikh, A., 2018. Policy implications of big data in the health sector. Bulletin of the World Health Organization, 96(1), p.66.
Wan, F., Li, R. and Qi, L. 2021, "Research on Military Logistics Ecosystem Based on Big Data", Journal of Physics: Conference Series, vol. 1813, no. 1.
Zhang, X. and Wang, Y. 2021, "Research on intelligent medical big data system based on Hadoop and blockchain", EURASIP Journal on Wireless Communications and Networking, vol. 2021, no. 1.