The R programming language finds another application in detecting fraudulent credit card transactions. In this project, various Machine Learning algorithms are used to differentiate counterfeit transactions from genuine ones. The credit card fraud detection project in R makes use of multiple algorithms such as Logistic Regression, Decision Trees, Gradient Boosting Classifiers, and Artificial Neural Networks.

Assignment Questions

The paper should be 7 pages, double-spaced, in 12-point font, excluding the title and reference pages, using APA format, with at least 5 recent, scholarly, peer-reviewed references. As in any scholarly writing, students should not merely copy information from another author, but use evidence to support the contentions they have drawn from their findings and critically analyze related literature. Each paper needs to be an analytical paper, not a summary of readings.

Below is the background about the research paper.

The Card Transactions dataset is used in this credit card fraud detection project in R; this dataset contains fraudulent as well as authentic transactions. The project has the following steps: importing the datasets containing the credit card transactions, exploring the data, manipulating and structuring the data, modeling the data, fitting a Logistic Regression model, and finally, implementing the Decision Tree, Artificial Neural Network, and Gradient Boosting models.

Outline

Based on the previously utilized dataset from your previous research paper, write a statistical analysis report that goes further in depth. Describe what you observe, what approaches you chose to take and why, and report any inferences that you come up with. If any data manipulation, statistical testing, or linear modeling has been performed, please include your functions and scripts in an appendix.

Running Head: DATA VISUALIZATION FUNCTION

Data Visualization Function

Student Name

Institution Affiliation

Data Visualization

Data visualization refers to the graphical representation of data and information. It translates information into a visual context to make data easier to understand. According to Marastats (2019), data visualization is useful for cleaning data, detecting outliers and unusual groups, identifying clusters and trends, spotting patterns, evaluating model output, and presenting results. Its main value, however, is that it makes trends, patterns, and outliers in a large data set easier to identify.

It is one step of the data science process: after data has been collected, processed, and modeled, it must be visualized before conclusions can be drawn. Rather than leaving data in a spreadsheet or tabular format, data visualization aims to locate, manipulate, identify, format, and deliver it efficiently. It transforms large data sets into images that represent the information accurately and effectively (Childers & Taylor, 2021).

Data visualization is important for business because it helps identify the factors that affect customer behavior and indicates the areas that need more attention. It helps stakeholders understand when and where to place specific materials and how to predict sales volumes. It also supports faster decisions, since visual information is absorbed quickly. In addition, data visualization clarifies the steps to be taken and improves the ability to convey information to an audience. According to Friedman (2008), the main goal of data visualization is to represent information clearly and effectively through graphics.

Big data and data analysis projects have made visualization more important. Organizations use machine learning to collect large amounts of data that can be hard to understand and explain. Visualization makes it easier to digest this data quickly and to present it to stakeholders in an accessible form.

In his 1983 book The Visual Display of Quantitative Information, Edward Tufte describes graphical displays and sets out principles for making them effective, including this passage: “Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.”

Big data visualization uses more complex presentations and requires powerful computers to collect, process, and present data in a graphical form that humans can easily understand. Despite its benefits to an organization, data visualization has some disadvantages. A specialist needs to be hired to identify the best data sets and visualization styles. Such projects also require IT involvement, since they depend on powerful computer systems and efficient storage for processing (Khalid & Zebaree, 2021).

Data analysis techniques in R

Data analysis can be done using a variety of statistical software. R is statistical software with a variety of built-in libraries that build visualizations with minimal code and considerable flexibility. It makes data processing and analysis possible in a short time.

R also provides packages that speed up data processing. Packages are the basic units of reproducible R code; they include R functions, sample data, and documentation that describes how to use them. Installed packages are stored in a directory called a library. R comes with a set of packages, and others can be downloaded and installed when needed. After installation, a package is loaded into the session before it is used (Khalid & Zebaree, 2021).
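The install-then-load workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the package name "ggplot2" is only a stand-in for any CRAN package, and the installation call is commented out so that the sketch has no side effects.

```r
# Sketch only: "ggplot2" stands in for any CRAN package (an assumption).
pkg <- "ggplot2"

# Directories (libraries) in which installed packages are stored
lib_dirs <- .libPaths()

# Check whether the package is installed, without attaching it
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Download and install on demand (commented out to avoid side effects)
# if (!is_installed) install.packages(pkg)

# Packages that ship with R, such as stats, are loaded the same way
library(stats)

print(lib_dirs)
print(is_installed)
```

Once a package is installed, library(pkg, character.only = TRUE) attaches it when the name is held in a variable, as it is here.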

One of R’s most powerful aspects is that its packages are free and cover many types of analysis. One of them is text analysis. It is often estimated that about 80% of the world’s data is unstructured, and R can be used to analyze such data. The tm and sentiment packages are used for working with unstructured text. The tm package installs in the usual way. The sentiment package is harder to install because it was archived in 2012. A single R object can also hold different data types.

Data from databases and APIs arrives partly as text (character vectors) and partly as numbers. It is the analyst’s job to identify the statistical type of each variable and then choose the R data type that best represents it. Variables can be treated as numeric, nominal (categorical), or ordinal. The corresponding vectors in R are of class numeric or integer, factor, and ordered factor, so each statistical type of variable has a counterpart in R. Ordered factors are used less often, but they can be created with ordered() or with factor() and its ordered argument. Functions and data structures are objects in R and can themselves be operated on as data.
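As a brief illustration of how statistical variable types map onto R classes, the sketch below builds a numeric, a nominal, and an ordinal variable. The survey values are invented for the example and do not come from any data set discussed in this paper.

```r
# Hypothetical survey data, invented for illustration
age    <- c(23, 35, 41, 29)                              # numeric
city   <- factor(c("Lagos", "Paris", "Lima", "Paris"))   # nominal (factor)
rating <- c("low", "high", "medium", "low")              # raw text

# Two ways to create an ordinal variable (ordered factor)
lv <- c("low", "medium", "high")
r1 <- factor(rating, levels = lv, ordered = TRUE)
r2 <- ordered(rating, levels = lv)

print(class(age))
print(class(city))
print(class(r1))

# Ordered factors support comparisons that plain factors do not
below_high <- r1 < "high"
print(below_high)

# A data frame holds columns of different types in one object
survey <- data.frame(age = age, city = city, rating = r1)
str(survey)
```

Declaring the level order explicitly matters, because by default R would sort the levels alphabetically (high, low, medium).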

Newly developed packages and functions in R

R packages continue to advance. New functions and packages are developed regularly to solve different tasks. Organizations also use these functions and packages to assess how products are being targeted and how customers behave. Some recently developed packages and functions are listed below.

Beta functions

The beta package comprises various functions, among them dBeta, pBeta, pBinom, and Beta.2p. The dBeta command returns the beta density values that correspond to an input vector and two shape parameters, shape1 and shape2. The package also provides functions for calculating moments, for alternative parameterizations, and for fitting Beta distributions to vectors of values. Others estimate classification accuracy, diagnostic performance, and consistency using the “Livingston and Lewis approach” as the fundamental method, which makes extensive use of the Beta distribution easier.
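The idea of a density function that takes an input vector and two shape parameters can be illustrated with dbeta() from base R's stats package. Using the base function here is an assumption made for the sake of a runnable sketch; it is not a demonstration of the package's own functions. The mean and variance are computed from the standard closed-form expressions for the Beta distribution.

```r
# Base R stand-in for a beta density call (assumption: not the package's API)
x <- c(0.1, 0.5, 0.9)
a <- 2   # shape1
b <- 2   # shape2

# Density of Beta(a, b) evaluated at each element of x
dens <- dbeta(x, shape1 = a, shape2 = b)
print(dens)

# First two moments from the closed-form expressions
beta_mean <- a / (a + b)
beta_var  <- (a * b) / ((a + b)^2 * (a + b + 1))
print(c(mean = beta_mean, variance = beta_var))
```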

cNorm

cNorm is an R package used to generate continuous test norms and to analyze model fit in biometrics and psychometrics. It was developed to produce continuous norms across age or grade when assessing performance, and it estimates percentile curves as a function of the explanatory variable. One advantage is that norm tables are derived from the entire normative sample rather than from a single class level or cohort.

The limits of the model can be evaluated analytically or graphically to determine where it deviates from the data. This makes it easier to indicate the points at which a test score cannot be reliably interpreted. cNorm does not require distributional assumptions, and in most cases it models the data more effectively than parametric methods.

biodata

biodata is a function used for the generation of correlated artificial binary data. It is also used to summarize data in intervals (bins) and to calculate the mean, with its 95% confidence interval, of a selected variable in a data frame. Other numeric variables are summarized by their interval means.

Binning has three options. By default, the data is binned into 40 intervals. Alternatively, the user can select a binning interval, or the user can supply custom breaks to use as the binning intervals.
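The three binning options and the per-bin mean with a 95% confidence interval can be sketched in base R. The helper summarise_bins() and the simulated data frame are hypothetical and written only for this illustration; they are not the function described above, and the confidence interval uses the usual t-based formula as an assumption.

```r
# Simulated data, invented for illustration
set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- 2 * df$x + rnorm(200)

# Hypothetical helper: bin `by`, then summarise `target` within each bin
summarise_bins <- function(data, by, target, breaks = 40) {
  bins   <- cut(data[[by]], breaks = breaks, include.lowest = TRUE)
  groups <- split(data[[target]], bins, drop = TRUE)
  stats  <- lapply(groups, function(v) {
    n <- length(v)
    m <- mean(v)
    # t-based half-width of the 95% CI; undefined for a single observation
    half <- if (n > 1) qt(0.975, df = n - 1) * sd(v) / sqrt(n) else NA_real_
    c(n = n, mean = m, lower = m - half, upper = m + half)
  })
  data.frame(bin = names(stats), do.call(rbind, stats), row.names = NULL)
}

default_bins <- summarise_bins(df, "x", "y")                              # 40 intervals
width_bins   <- summarise_bins(df, "x", "y", breaks = seq(0, 10, by = 2)) # chosen width
custom_bins  <- summarise_bins(df, "x", "y", breaks = c(0, 2.5, 5, 10))   # custom breaks

print(custom_bins)
```

Passing a single number to breaks asks cut() for that many equal-width intervals, while passing a vector fixes the interval boundaries, which is why one argument covers all three options.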

rTRNG

rTRNG is a package for parallel random number generation in R. It relies on Tina’s Random Number Generator (TRNG) library for sequential and parallel Monte Carlo simulations. Its random number generators can be manipulated with jump and split operations: jumping ahead and splitting the sequence divide it into the required sub-sequences. This enables techniques suited to parallel algorithms, such as block splitting and leapfrogging.
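Block splitting and leapfrogging can be illustrated conceptually in base R by partitioning one already-generated stream among several workers. This sketch does not use the rTRNG API; a dedicated parallel generator reaches the same sub-sequences by jumping and splitting its internal state rather than by generating the whole stream first. The number of workers p is an arbitrary choice.

```r
# One sequential stream of random numbers (base R generator, for illustration)
set.seed(42)
stream <- runif(12)
p <- 3   # hypothetical number of parallel workers

# Block splitting: worker s receives the s-th contiguous block
block_split <- split(stream, rep(seq_len(p), each = length(stream) / p))

# Leapfrogging: worker s receives elements s, s + p, s + 2p, ...
leapfrog <- lapply(seq_len(p), function(s) {
  stream[seq(s, length(stream), by = p)]
})

print(block_split)
print(leapfrog)
```

In both schemes every draw is assigned to exactly one worker, so the combined parallel result uses the same random numbers as the sequential run.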

Logic in developing TM and sentiment function and newly developed aspects

The data analyzed here comes from a newspaper; it is saved to a text file, loaded, and processed. The first step is to remove nonessential characters such as numbers and web addresses so that only the actual words in the text file are processed. Sentiment analysis is then performed by classifying the comments with a Bayesian method. Negative, positive, and neutral polarity is determined, and the results are combined into a single data frame.

Polarities and emotions can then be processed for those comments, which includes removing common English stop words. The sentiment package identifies important, frequently occurring words and their likely emotion associations. With minimal time and little work, important topics are extracted automatically from unstructured text in R using the sentiment package (Habowski & Waterman, 2021). In addition, a table of comments is produced with polarity and emotion attached. The comments can be sorted by either polarity or emotion for further analysis. The package can be used to select comments effectively for quality assurance analysis, but it is only an introduction to further analysis.
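The cleaning steps described above (removing web addresses, numbers, and stop words, then finding frequent words) can be sketched in base R. The sample comments and the short stop-word list are invented for illustration, and the sketch stops before classification: it does not reproduce the tm or sentiment packages or their Bayesian classifier.

```r
# Invented sample comments standing in for the newspaper text
comments <- c(
  "Visit http://example.com for 20 great deals!",
  "The service was terrible and the staff were rude.",
  "Great paper, great value."
)

# A deliberately tiny stop-word list (assumption; real lists are much longer)
stop_words <- c("the", "and", "was", "were", "for", "a")

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)  # web addresses
  x <- gsub("[0-9]+", " ", x)                           # numbers
  x <- gsub("[[:punct:]]+", " ", x)                     # punctuation
  trimws(gsub("[[:space:]]+", " ", x))                  # extra whitespace
}

# Split into words, drop stop words, and count frequencies
words <- unlist(strsplit(clean_text(comments), " ", fixed = TRUE))
words <- words[nzchar(words) & !(words %in% stop_words)]
freq  <- sort(table(words), decreasing = TRUE)

print(freq)
```

The frequency table is the kind of input from which important words and topics are then identified.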

Data visualization in R

About 80% of the time in data analysis is spent preparing and cleaning data: importing data sets, screening data, and assigning labels. Data visualization is a technique for representing data graphically. Using elements such as charts, graphs, maps, histograms, and scatter plots, it makes data more understandable and easier to use for prediction.

Data visualization makes it easier to recognize trends and exceptions in data. Information and results can be conveyed quickly in graphical or pictorial form. It helps to interpret data quickly and to examine the associations between variables and their effects on trends and insights.

R provides a wide set of tools and built-in functions to perform data analysis and build data visualizations. According to Tableau, “[Data Visualization is] one of the most useful professional skills to develop. The better you can convey your points visually, the better you can leverage that information” (Tableau, 2019, pp. 16-17). Base, grid, and lattice graphics are the kinds of data visualization that can be performed in R.

R includes built-in functions in the graphics package for data visualization. Some of these functions are described below.

The plot() Function

This function is used to plot R objects graphically.

plot(x, y, type, main, sub, xlab, ylab, asp, col, ...)

The line above shows the basic syntax of the plot() function.
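A minimal call using several of the arguments in the syntax above might look like the following. The data values, titles, and color are arbitrary choices made for the example.

```r
# Arbitrary example data
x <- seq(0, 10, by = 0.5)
y <- x^2

# type = "b" draws both points and connecting lines
plot(x, y,
     type = "b",
     main = "Quadratic growth",     # main title
     sub  = "Illustrative data",    # subtitle below the x-axis
     xlab = "x",
     ylab = "x squared",
     col  = "blue")
```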

Barplot

This function presents data as rectangular bars, drawn horizontally or vertically, with bar lengths proportional to the values of the variable in the data set.
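As a short sketch, the built-in mtcars data set (chosen here only for convenience) can be used to draw vertical and horizontal bars with barplot().

```r
# Number of cars for each cylinder count in the built-in mtcars data
counts <- table(mtcars$cyl)

# Vertical bars; barplot() returns the bar midpoints on the category axis
mids <- barplot(counts,
                main = "Cars by number of cylinders",
                xlab = "Cylinders",
                ylab = "Number of cars",
                col  = "steelblue")

# The same data as horizontal bars
barplot(counts, horiz = TRUE, main = "Cars by number of cylinders")
```

The returned midpoints are useful for adding labels above the bars with text().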

Histogram

A histogram divides the values of a variable into contiguous ranges (bins) and plots the frequency of the values that fall within each range.
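A histogram of this kind can be drawn with hist(), again using the built-in mtcars data purely as an example. The requested number of breaks is treated by R as a suggestion rather than an exact count.

```r
# Histogram of fuel efficiency in the built-in mtcars data
h <- hist(mtcars$mpg,
          breaks = 10,                 # suggested number of bins
          main   = "Distribution of fuel efficiency",
          xlab   = "Miles per gallon",
          col    = "grey80")

# The returned object holds the bin boundaries and the count in each bin
print(h$breaks)
print(h$counts)
```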

ggplot2 package

The ggplot2 package is based on the grammar of graphics, a set of rules for describing and building graphs. ggplot2 applies the grammar of graphics by breaking graphs into components such as layers and scales. The package comprises coordinates, faceting, themes, layers, and data (Habowski & Waterman, 2021). It is one of R’s most sophisticated and important packages for data visualization. It creates versatile, elegant, high-quality plots with few adjustments, and it makes both single-variable and multivariate graphs simple to create.
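The components named above can be seen in a single plot specification. The following sketch assumes that ggplot2 is installed and uses the built-in mtcars data; the particular variables, labels, and theme are arbitrary choices.

```r
library(ggplot2)   # assumes the package has been installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +        # data and aesthetic mapping
  geom_point(aes(colour = factor(cyl))) +          # layer: points
  scale_colour_discrete(name = "Cylinders") +      # scale: colour legend
  facet_wrap(~ gear) +                             # faceting: one panel per gear count
  coord_cartesian() +                              # coordinates
  theme_minimal() +                                # theme
  labs(title = "Fuel efficiency by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

print(p)
```

Each component is added with +, so a plot can be changed by replacing one component (for example the theme) without touching the others.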

In conclusion, current technology has made it easier to analyze large data sets efficiently and quickly through data visualization. R is one of the statistical software environments that support this. It is free software for statistical computing and graphics that compiles and runs on UNIX platforms, macOS, and Windows, and it contains various packages and functions for data analysis.

References

Habowski, A. N., & Waterman, M. L. (2021). GECO: Gene expression clustering optimization app for non-linear data visualizations of patterns. BMC Bioinformatics, 22(1), 1-13.

Cheng, H., Xie, K., Wen, C., & He, J. B. (2021). Fast Visualization of Massive Data based on Improved Hilbert R-tree and Stacked LSTM Models. IEEE Access.

Childers, A. F., & Taylor, D. G. (2021). Making data collections and analysis fun, fast, and flexible with Classroom Stats. PRIMUS, 31(1), 91-98.

Khalid, Z. M., & Zebaree, S. R. (2021). Big Data Analysis for Data Visualization: A Review. International Journal of Science and , 5(2), 64-75.

Allen, M., Poggiali, D., Whitaker, K., Marshall, T. R., van Langen, J., & Kievit, R. A. (2021). Raincloud plots: a multi-platform tool for robust data visualization [version 1; peer review: 2 approved].
