Journal of Natural Sciences and Engineering, Vol. 3, (2020)
DOI number: 12.34567/JONSAE2020123

Using Exploratory Data Analysis and Big Data Analytics for Detecting Anomalies in Cloud Computing

Ibrahim Muzaferija1, Zerina Mašetić1
1 International Burch University, Sarajevo, Bosnia and Herzegovina
ibrahim.muzaferija@stu.ibu.edu.ba
zerina.masetic@ibu.edu.ba

Abstract – While leveraging cloud computing for large-scale distributed applications allows seamless scaling, many companies struggle to keep up with the amount of data generated, both in processing it efficiently and in detecting anomalies, which is a necessary part of managing modern applications. As a record of user behavior, web logs are a natural subject for anomaly detection research. Many anomaly detection methods based on automated log analysis have been proposed, but not in the context of big data applications, where anomalous behavior needs to be detected in the understanding phases before a system is modeled for such use. Big Data Analytics often ignores anomalous points because of the high volume of data. To address this problem, we propose complementing Big Data Analytics with Exploratory Data Analysis, which helps in gaining insight into data relationships without classical hypothesis modeling. In that way, we can gain a better understanding of the patterns and spot anomalies. Results show that Exploratory Data Analysis facilitates anomaly detection and the CRISP-DM Business Understanding phase, making it one of the key steps in the Data Understanding phase.

Keywords - Cloud Computing, Big Data, Data Mining, Anomaly Detection

1. Introduction

With the constant growth and advancement of the Internet, more and more systems are connected to one another, constantly generating and exchanging data.
That data is referred to as Big Data and is constantly targeted by cyber-attacks, as it contains sensitive and valuable information. The term “big data” refers to data that is so large, complex, or rapid that it cannot be processed using traditional computing and data management tools. Big Data provides opportunities to improve research, operational efficiency, and decision-support applications with increased value for digital applications [1]. At the same time, Big Data poses challenges in storing, transporting, processing, mining, and serving the data. Data that is high in volume, velocity, variety, and veracity must be processed with advanced analytical tools and algorithms to reveal meaningful information and provide value. Cloud computing represents the use of distributed and shared resources such as computing, storage, networking, and analytical software, and provides fundamental support for addressing the challenges of Big Data. Cloud computing serves both as a technological enabler and as a producer of big data [1].

Anomalies are unusual events or behaviors that deviate from the normal. In efforts to increase cloud computing reliability, anomaly detection is a frequent problem in threat detection and identification, as reported by the Cloud Security Alliance (CSA) [2]. The CSA, which represents the world’s leading organization dedicated to securing cloud computing environments, conducts annual research with the aim of raising awareness of threats, risks, and vulnerabilities in the cloud environment. In its latest (2019) report [3], the CSA re-examined the risks to cloud security and took a new approach, examining problems in configuration and authentication rather than the traditional focus on vulnerabilities and malware. The report highlights the following threats:

1. Data Breaches
2. Misconfiguration and inadequate change control
3. Lack of cloud security architecture and strategy
4. Insufficient identity, credential, access, and key management
5. Account hijacking
6. Insider threat
7. Insecure interfaces and APIs
8. Weak control plane
9. Metastructure and applistructure failures
10. Limited cloud usage visibility
11. Abuse and nefarious use of cloud services

In this research, we aim to address the threats that can be traced in user logs (numbered 1, 4, 5, 6, 8, 9 and 11) by using Big Data Analytics and Exploratory Data Analysis to discover anomalies and contribute to increased security in Cloud Computing applications.

2. Literature Review

Anomaly detection in cloud infrastructure and big data environments has been the topic of many research studies. Since the first introduction of cloud infrastructure in 2006 [4], cloud computing has greatly impacted industry. The rapid development of Internet and Big Data technologies has resulted in increased development of services on cloud computing, such as online banking services, electronic news services, government information systems, mobile services, etc. These systems handle sensitive and confidential data, making anomaly detection mechanisms one of their core security requirements.

In the review paper by Arif Sari [4], [5], different techniques and mechanisms used in the detection of anomalous activities within the cloud environment are described: threshold detection, statistical analysis, rule-based measures, data mining, and machine learning. We aim to apply statistical techniques and EDA (Exploratory Data Analysis) in order to discover anomalies.

In the “Big Data processing for Anomaly Detection” survey [6], Ariyaluran et al. present a comparative analysis of, and the relationship between, three different domains: anomaly detection, machine-learning algorithms, and real-time big data processing.
This paper aims to contribute complementary techniques for anomaly detection. Once anomalies are detected, we can use Machine Learning and real-time anomaly detection for future improvements.

In their research, Dalal and Rele [6], [7] emphasize the steps in creating effective and reliable mechanisms for threat detection. They highlight the importance of the first CRISP-DM (Cross-Industry Standard Process for Data Mining) phase, named “Develop Business Understanding”, where reasons for defects and answers for maintenance are taken into consideration. They discuss the phase “Analyze Data and Data Dependencies”, where the aim is to analyze, combine, and compare the data with the present situation, without proposing EDA as a baseline for data understanding. Our work employs EDA in order to complement the methodology. They also highlight the step named “Engage with Subject Matter Experts (SME’s)” for better dataset examination and analysis of the anomaly situation, along with a grouping of the threat factors. By employing these methods, we aim to set transparent expectations and bring clarity to our results. In this research, we work closely with the application development technical lead, who serves as the SME and helps clarify the log data, as well as the threats, anomalies, and our results.

3. Methodology

The research is implemented using a portion of the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [8], a common standard used by data scientists and data mining experts to build analytical and machine learning models. Before analytical and machine learning models can be created, we need to construct a clean dataset of user behavior with anomalies labeled for future modeling. To do so, in this research we focus on the first three phases: Business Understanding, Data Understanding, and Data Preparation, as highlighted in red in the figure below.
Modeling and subsequent phases are researched in our extended study of anomaly detection in cloud computing.

Figure 1. CRISP-DM workflow

In the Business Understanding phase, the goal is to determine business objectives, assess the situation from a business perspective, discuss with subject matter experts, determine data mining goals, and produce a project plan. In the Data Understanding phase, we collect and select raw data, describe and explore the data, consult with subject matter experts, and verify data quality. In the Data Preparation phase, which is often the most time-consuming, we select and clean the data, format it, and construct a clean dataset.

We approach these phases using Big Data Analytics and Exploratory Data Analysis (EDA). Big Data Analytics examines large amounts of data in a non-traditional manner, that is, using distributed and shared resources to support the data quantity and complexity [8], [9]. Exploratory Data Analysis [10] is an approach to analyzing data in order to summarize its main characteristics and uncover the underlying structure using statistical and visual methods.

3.1. Data Collection and Selection

Logs of the cloud-based enterprise web application are produced by multiple servers and services and streamed to the Elasticsearch [11] service, an open-source search and analytics engine for all types of data. Elasticsearch is distributed, fast, and scalable, which makes it well suited for big data ingestion, enrichment, storage, analysis, and visualization.

Figure 2. Raw data access from Kibana

Raw data is accessed by locally restoring an Elasticsearch cluster snapshot covering a period of three months.
The cluster contains around 20 GB of semi-structured data collected from different application services and levels, indexed by timestamp. Application logs are mapped to 175 attributes and accessed using Kibana [12], the Elastic Stack service for data analysis and visualization.

Attribute selection is part of the Business Understanding and Data Understanding phases and is carried out in consultation with the application development technical lead, i.e., the subject matter expert (referred to below as the SME). The attributes describing the user’s application usage that are most relevant for anomaly detection are selected for further analysis. The following table displays statistical information for the selected attributes.

Table 1. Selected data statistical information

Attribute name | Description | Data type | Range | Missing
timestamp | Timestamp | Date Time | [2020-01-05 21:17, 2020-03-26 21:06] | 0.0 %
account_id | Account ID, unique company account identifier | Nominal | f6afd09c-****-****-****-c30a935ccc37, ... | 8.87 %
client_country | User country | Nominal | BA, US, ... | 9.53 %
company_name | Company name | Nominal | Company A, Company B, ... | 10.17 %
platform | Application platform | Nominal | BrowserMNC, BackendMNC, ... | 0.0 %
principal_id | User email | Nominal | developer@**.com, ... | 9.64 %
remote_address | User IP address | Nominal | [0.0.0.0 - 255.255.255.255] | 9.12 %
user_agent | User-agent | Nominal | Mozilla/5.0 (Windows NT 10.0; Win64; x64) …, ... | 0.0 %
error_message | Error message | Nominal | validation error, auth error, ... | 99.96 %
message | Log message | Nominal | Profiling, FrontTimings, ... | 0.18 %
level | Log level | Nominal | Info, error | 0.0 %
path | Parameterized resource request | Nominal | PUT /customer/***/ticket/***, ... | 99.78 %
resource | Request | Nominal | (GET) /invoices, ... | 0.0 %
status_code | Response code | Nominal | 200, 404, ... | 10.17 %

Once the relevant data is selected, we use the Elastic Stack service Logstash [13] to collect the data, that is, to obtain the initial dataset in CSV format for further work.

3.2. Data Cleansing and Engineering

To gain insight into data quality, graphical and statistical methods were used to detect anomalies, faults, outliers, missing values, etc. Moreover, we engineer new attributes in order to increase interpretability or decrease data complexity. Exploratory Data Analysis helps us understand the relations between attributes and allows us to spot tendencies, as well as to identify the cleaning steps we have to take.

First, we apply filters to remove log data from automated services, such as health-checks and other application services that don't reflect the user’s interactions. Next, we remove attributes that contain a high fraction of missing values, because such attributes carry little information. Values of the “status_code” attribute are mapped to the corresponding descriptions for better interpretability.

We engineer new attributes: “resource_method”, “resource_base” and “user_os”. The “resource_method” and “resource_base” attributes are created from the values of the “resource” attribute by using regular expressions to extract the relevant information. The “user_os” attribute is created in a similar manner, using regular expressions to extract the relevant information from the “user_agent” attribute. Creating these attributes allows us to focus on the most relevant information and decrease the cardinality of the original attributes.

3.3. Dataset Creation

The clean dataset contains 16 attributes describing the application usage, and 522,763 rows with a timestamp attribute range from 6th January to 26th March (81 days).
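The cleansing and attribute-engineering steps of Section 3.2 can be sketched roughly as follows. This is a minimal illustration rather than the procedure actually used in the study: the regular expressions, the `AUTOMATED_BASES` set of health-check endpoints, the operating-system patterns, and the 0.5 missing-value threshold are all assumptions introduced here, and the sketch engineers the attributes before filtering so that the filter can use “resource_base”.

```python
import re
from http import HTTPStatus

# All patterns, names and thresholds below are illustrative assumptions;
# the paper does not publish its regular expressions or filters.
RESOURCE_RE = re.compile(r"^\(?(?P<method>[A-Z]+)\)?\s+/(?P<base>[^/?\s]+)")
OS_PATTERNS = [  # checked in order; the first match wins
    ("Windows", re.compile(r"Windows NT")),
    ("Android", re.compile(r"Android")),
    ("iOS", re.compile(r"iPhone|iPad")),
    ("OS X", re.compile(r"Mac OS X")),
    ("Linux", re.compile(r"Linux")),
]
AUTOMATED_BASES = {"health", "healthcheck"}  # hypothetical health-check endpoints


def describe_status(code):
    """Map a status code to 'code phrase', e.g. 200 -> '200 OK'; None if unusable."""
    try:
        return f"{int(code)} {HTTPStatus(int(code)).phrase}"
    except (TypeError, ValueError):
        return None


def engineer(record):
    """Return a copy of one log record with the engineered attributes added."""
    out = dict(record)
    match = RESOURCE_RE.match(record.get("resource") or "")
    out["resource_method"] = match.group("method") if match else None
    out["resource_base"] = match.group("base") if match else None
    agent = record.get("user_agent") or ""
    out["user_os"] = next(
        (name for name, pattern in OS_PATTERNS if pattern.search(agent)), "Unknown"
    )
    out["status_code"] = describe_status(record.get("status_code"))
    return out


def drop_sparse_attributes(records, max_missing=0.5):
    """Remove attributes whose fraction of missing values exceeds max_missing."""
    if not records:
        return []
    keys = {key for record in records for key in record}
    keep = [
        key for key in sorted(keys)
        if sum(r.get(key) in (None, "") for r in records) / len(records) <= max_missing
    ]
    return [{key: r.get(key) for key in keep} for r in records]


def clean(records):
    """Engineer attributes, drop automated traffic, then drop sparse attributes."""
    engineered = [engineer(r) for r in records]
    user_rows = [r for r in engineered if r["resource_base"] not in AUTOMATED_BASES]
    return drop_sparse_attributes(user_rows)
```

Calling `clean` on a list of dictionaries, one per log line with keys such as `resource`, `user_agent` and `status_code`, returns the filtered records with the three engineered attributes added and the mostly empty attributes removed.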
Data is imported to RapidMiner [14], a data science software platform that provides an integrated environment for data preparation, visualization, machine learning, text mining, and predictive analytics. It is open source and used for commercial applications, as well as for research, education, training, and rapid prototyping. In this phase, we continue with Exploratory Data Analysis in order to discover patterns beyond formal modeling or hypothesis testing tasks. Our aim is to use the business understanding to improve our understanding of the data and of the relationships between attributes in order to spot anomalous trends.

As the application is B2B based, we analyze the company data first: company account histogram, statistics, and distribution. Next, we analyze the behavior of users in the company context and in the general context. By analyzing the “user” and “user_domain” attributes, we spot trends in usage and behavior in the company context. Analysis of application resource requests allows us to understand usage in the general context.

Figure 3. Counts of application resource requests

From the figure above, we can spot trends and further analyze the resource usage. Each resource request represents a user action, which makes these requests highly valuable in the context of anomaly detection. Moreover, granular analysis facilitates business understanding, as we gain deeper insight into user-generated data.

Next, we analyze the application errors, which are often among the most informative attributes for anomaly detection. Anomalies and cyber-attacks often cause application errors, so analyzing error data allows us to quickly distinguish between application anomalies, user anomalies, and possible threats.

Figure 4. Application error logs histogram

Figure 5. Application logs status codes histogram

Application status codes are highly correlated with application resource usage. By analyzing status codes, we gain insight into application performance and usage trends. Anomalies are most visible when analyzing the status codes.

Dataset creation is concluded with the creation of an “anomaly” attribute, which indicates whether a specific application log instance is anomalous. The criteria for creating this attribute are drawn from the discoveries of EDA and confirmed through consultations with the SME. By addressing the CRISP-DM phases of Business Understanding, Data Understanding, and Data Preparation with the application of Exploratory Data Analysis, we are able to discover anomalies in application usage and user behavior.

4. Results and Discussion

As the web application has a business-to-business context, we approach the analysis of log data from a company perspective. We find that the companies using the application can be segmented by application usage into three categories: heavy, medium, and light users, as shown in Figure 6. Heavy users are the companies responsible for application development and support. Medium users are the companies with frequent application usage, while light users are the companies that are onboarding to the application or are in the initial phases of application usage. Distinguishing companies by their level of usage helps us create a better business understanding. Because the level of application usage per company is unbalanced, we can expect an increased number of anomalies for heavy users, while companies with medium and light usage may show fewer anomalies. The percentage of anomalies varies between companies with no specific pattern.

Figure 6. Application usage per company

When analyzing the histogram of application resource methods through the “resource_method” attribute, we find an anomalous request pattern, as shown in Figure 7. Consultations with the SME revealed that the anomalous request method corresponds to a service whose use has ceased, so the behavior of this service can be identified as an anomaly.

Figure 7. Application resource methods histogram anomaly

When analyzing individual users, we perform segmentation per company using the domain name in the user’s email address. The histogram of user domains contributes to business understanding, as we can spot user trends for each company. In the figure below, we present the user domain histogram focused on anomalous application usage from unknown domains. We discover that usage from unknown domains tends to increase during the monthly peaks of application usage.

Figure 8. User domain histogram focused on unknown domains

Consultations with the SME clarified that unknown domains such as “gmail.com”, “hotmail.com”, and “outlook.com” are used by quality assurance developers, and they were marked as such. This further decreased the number of visits from unknown domains. Moreover, consultations showed that the remaining users from unknown domains are companies in the trial phase, that is, the application demonstration phase, and are still eligible for anomaly detection. Application usage from other user domains is distributed as expected: two development companies account for most of the traffic, while the others are medium and light users.

Figure 9. Log message histogram anomalies

In the figure above, we present the log message histogram with the anomalies it reveals.
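The per-domain segmentation described earlier in this section can be sketched as follows. It is a minimal illustration under stated assumptions: the `KNOWN_COMPANY_DOMAINS` values are hypothetical placeholders, timestamps are assumed to be ISO-formatted strings, and the email address is read from the “user” attribute; only the three quality assurance domains come from the text.

```python
from collections import Counter

# Hypothetical placeholders: the real company domains are masked in the paper.
KNOWN_COMPANY_DOMAINS = {"company-a.example", "company-b.example"}
# Domains that the SME attributed to quality assurance developers.
QA_DOMAINS = {"gmail.com", "hotmail.com", "outlook.com"}


def user_domain(email):
    """Return the lower-cased domain part of an email address, or None."""
    if not email or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].strip().lower() or None


def domain_category(domain):
    """Classify a domain as 'company', 'qa', 'unknown' or 'missing'."""
    if domain is None:
        return "missing"
    if domain in KNOWN_COMPANY_DOMAINS:
        return "company"
    if domain in QA_DOMAINS:
        return "qa"
    return "unknown"


def domain_histogram(records):
    """Count log records per user domain (the data behind a domain histogram)."""
    return Counter(user_domain(r.get("user")) for r in records)


def unknown_share_by_month(records):
    """Fraction of records per month (YYYY-MM) that come from unknown domains."""
    totals, unknown = Counter(), Counter()
    for r in records:
        month = r["timestamp"][:7]  # assumes ISO timestamps such as 2020-01-06T06:18:00
        totals[month] += 1
        if domain_category(user_domain(r.get("user"))) == "unknown":
            unknown[month] += 1
    return {month: unknown[month] / totals[month] for month in totals}
```

Comparing the monthly share of unknown-domain requests with the overall monthly request counts is one way to check whether unknown-domain usage rises during usage peaks; the same counting approach applies to other nominal attributes such as “message”.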
We find that anomalies are caused by application development or, more specifically, integration attempts with other companies using the application.

In the figure below, we present results from the correlation analysis of the dataset. The correlation matrix shows increased correlation between attributes such as “platform” and “message”. These results help us to identify and discard highly correlated attributes and decrease the dataset complexity.

Figure 10. Correlation matrix

The correlation matrix also shows that the attributes “status_code” and “level” are correlated to some degree. This indicates that application errors can be traced to application status codes. In the figure below, the status code histogram focused on error status codes is depicted. We can spot the error trends and identify the error sources.

Figure 11. Status code histogram focused on error status codes

With the application of EDA, the resulting anomalies are used to create a labeled dataset for anomaly detection purposes. The dataset can serve as a baseline for creating various analytical and machine learning anomaly detection models, such as frequency threshold detection, supervised anomaly prediction, unsupervised anomaly detection, etc. In Table 2, we present the statistical information of the final dataset.

Table 2. Dataset statistical information

Attribute name | Type | Missing | Least / Min | Most / Max | Range
timestamp | Date and time | 0 | Jan 6, 2020 6:18 AM | Mar 26, 2020 9:06 PM | 80d 14h 48min
account_id | Nominal | 3 | 58710 (3) | 12345 (131,132) | 12345, c84c286[...]ffea5, [52 more]
company_name | Nominal | 3 | Company XYZ (3) | Company A (131,132) | Company A, Company B, [52 more]
country | Nominal | 3 | XX (29) | US (399,465) | US, BA, IN, [12 more]
platform | Nominal | 0 | Backend (45%) | Browser (55%) | Browser, Backend
user | Nominal | 6 | fk***@*.com (4) | fs***@*.com (48,738) | fs***@*.com, de***@*.com, [209 more]
remote_address | Nominal | 3 | 184.*.*.22 (3) | 77.*.*.171 (41,561) | 77.*.*.171, 144.*.*.229, [302 more]
user_agent | Nominal | 0 | Mozilla/[...]4.1 (3) | Mozilla/[...]ri/537.36 (77,449) | Mozilla/[...]36, Mozilla/[...].0, [114 more]
error_msg | Nominal | 467,225 | Getaddr[...].com (1) | ESOCKET[...]UT (89) | ESOCKET[...]UT, 502, [3 more]
level | Nominal | 0 | error (159) | info (467,225) | Info, Error
message | Nominal | 0 | Integ[...]led (159) | Profiling (264,851) | Profiling, frontTimings, [1 more]
status_code | Nominal | 93 | 405 Method [...]ed (1) | 200 OK (453,461) | 200 OK, 204 No Content, [8 more]
resource_method | Nominal | 0 | PUT (97) | GET (373,123) | GET, POST, [3 more]
resource_base | Nominal | 0 | produ[...]ile (8) | endpoints (98,191) | endpoints, customers, [17 more]
user_domain | Nominal | 6 | C*** (272) | A*** (351,885) | A***, M***, [9 more]
user_agent_os | Nominal | 0 | Unknown (3) | Windows (411,762) | Windows, OS X, [2 more]
anomaly | Binominal | 0 | True (882) | False (466,502) | False, True

5. Conclusion

This study has shown that the use of Exploratory Data Analysis contributes to and complements the implementation of the CRISP-DM methodology phases of business understanding, data understanding, and data preparation. Moreover, we demonstrate that Exploratory Data Analysis is an efficient method for detecting anomalies in big data.
Summarizing data characteristics and discovering underlying patterns in the data and its distribution brings value to both the data understanding and data preparation phases. We confirm the benefits of a method proven in previous studies: consultations with the SME play a crucial role in the business understanding phase and make a valuable contribution to the data understanding phase. Furthermore, consultations in the data understanding and data preparation phases facilitate the workflow and can help us increase the value of the data.

Future efforts can be directed at implementing the subsequent CRISP-DM phases, that is, modeling, evaluation, and deployment. Modeling data using Machine Learning techniques enables complex pattern discovery, as suitable for big data datasets, and further improves anomaly detection, as underlying mathematical relationships can be leveraged. While this has been shown in the majority of studies conducted in the field of anomaly detection and supervised machine learning, we propose the use of unsupervised machine learning to find new anomalies. This would enable the creation of an extended labeled dataset, which can then be used to create a supervised machine learning model for anomaly detection and prediction.

6. References

[1] “Big Data and cloud computing: innovation opportunities and challenges.” [Online]. Available: https://www.tandfonline.com/doi/full/10.1080/17538947.2016.1239771. [Accessed: 04-Sep-2020]
[2] “Cloud Security Alliance (CSA).” [Online]. Available: https://cloudsecurityalliance.org/. [Accessed: 04-Sep-2020]
[3] “Top Threats to Cloud Computing: Egregious.” [Online]. Available: https://cloudsecurityalliance.org/artifacts/top-threats-to-cloud-computing-egregious-eleven/. [Accessed: 04-Sep-2020]
[4] “About AWS.” [Online]. Available: https://aws.amazon.com/about-aws/. [Accessed: 04-Sep-2020]
[5] A. Sari, “A Review of Anomaly Detection Systems in Cloud Networks and Survey of Cloud Security Measures in Cloud Storage Applications,” Journal of Information Security, vol. 6, no. 2, pp. 142–154, Mar. 2015.
[6] “Real-time big data processing for anomaly detection: A Survey,” Int. J. Inf. Manage., vol. 45, pp. 289–307, Apr. 2019.
[7] “Cyber Security: Threat Detection Model based on Machine learning Algorithm - IEEE Conference Publication.” [Online]. Available: https://ieeexplore.ieee.org/document/8724096. [Accessed: 04-Sep-2020]
[8] “DMME: Data mining methodology for engineering applications – a holistic extension to the CRISP-DM model,” Procedia CIRP, vol. 79, pp. 403–408, Jan. 2019.
[9] “A Reference Model for Big Data Analytics.” [Online]. Available: https://www.researchgate.net/publication/327728739_A_Reference_Model_for_Big_Data_Analytics. [Accessed: 04-Sep-2020]
[10] “Exploratory data analysis.” [Online]. Available: https://psycnet.apa.org/record/2011-23865-003. [Accessed: 04-Sep-2020]
[11] “Open Source Search: The Creators of Elasticsearch, ELK Stack & Kibana.” [Online]. Available: https://www.elastic.co/. [Accessed: 04-Sep-2020]
[12] “Kibana.” [Online]. Available: https://www.elastic.co/kibana. [Accessed: 04-Sep-2020]
[13] “Logstash.” [Online]. Available: https://www.elastic.co/logstash. [Accessed: 04-Sep-2020]
[14] “RapidMiner.” [Online]. Available: https://rapidminer.com/. [Accessed: 04-Sep-2020]

Dublin Core
Title: Journal of Natural Sciences and Engineering
Identifier: 2637-2835
DOI: 10.14706
Publisher: International Burch University
Description: Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary, and methodological work, with a preference for empirical research, a critical approach, and problem-solving methods in manuscripts.
Language: English

Dublin Core
Title: Using Exploratory Data Analysis and Big Data Analytics for Detecting Anomalies in Cloud Computing
Author: Ibrahim Muzaferija, Zerina Mašetić
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021320

Journal of Natural Sciences and Engineering, Vol. 3, (2020)
DOI number: 10.14706/JONSAE2019114

Feedback System Using Sentiment Analysis

Abdulrahman Almonajed1, Dino Kečo1
1 International Burch University, Sarajevo, Bosnia and Herzegovina
abdulrahman.almonajed@stu.ibu.edu.ba
dino.keco@ibu.edu.ba

Abstract – Today, when judging the quality of an online item, feedback plays a very important role. Based on the feedback, we can decide whether the desired item is good or not, get a picture of the seller, and so on. Many companies that have online shops display the most positive feedback while hiding negative feedback or displaying only a small part of it.
In this research, we help people by automating the process of deciding whether feedback is positive or negative, which frees up their time for other work and saves the cost of hiring people to process the feedback. Since feedback on online articles is very important today, determining whether feedback is positive or negative should be made as quick and easy as possible. We show a very simple and fast way to classify feedback as positive or negative; the main question of this research is therefore how to facilitate and speed up the process of determining the polarity of feedback. We use natural language processing (NLP) through the Python library TextBlob. The algorithm used is Naïve Bayes, which gave an accuracy of around 80%.

Keywords - feedback, online article, sentiment analysis

1. Introduction

These days, the number of online stores is growing very fast [1]. Today we can buy almost anything we want online. Through online shopping we can also save a lot of money, because things can often be found much cheaper than in local stores. By shopping online, we can "escape" arrogant sellers, as well as annoying sellers who follow us while we shop and "force" us to buy their products. We can also save a lot of time by avoiding traffic jams and queues at the store, and save money by not paying for parking or fuel. We can even buy things that are not available in our city or country.

For leading companies such as Amazon, Alibaba, eBay, and so on, feedback from every user is very important. They receive thousands of feedback messages a day, which are very difficult to read and analyze, which is why they need to automate the process. Understanding and analyzing the feedback can improve the user experience and the products, but it can also help online shop owners find out which sellers are not doing their job properly, whether they are cheating, etc.
There are also online applications where we can book an apartment, rent a car, etc., such as our BTT (Balkan Tourist Travel) application. This kind of web application is now well known in our region, so we decided to create one to facilitate tourism in Bosnia and Herzegovina. The application is intended for the tourists who visit our country in large numbers. The BTT application will make it easier for them to book everything they need during their stay in our country with a few clicks. The main goal of the application is to avoid numerous calls and misunderstandings between our people and tourists. On BTT we value feedback, so users can leave feedback on everything they have used in our application. By doing so, we give our customers the opportunity to express their opinions, which will help us to achieve the best possible service.

In this research, we use the BTT application to apply and test our classification method. To simplify development, we use only one part of the BTT web application. We will perform all the tests and modifications needed to achieve the best possible results, and if the results are satisfactory, we will include all the other parts of the application.

Customers' opinions are important not only to large companies but also to small companies that are just getting started [2]. Therefore, determining whether an opinion is positive or negative should be automated as soon and as well as possible. This research addresses this problem by determining whether a customer's opinion is positive or negative in a very quick and easy way. The biggest problem this research solves is the hard work of reading users' opinions, which can be praise or criticism, and determining whether they are positive or negative, or spending extra money to hire people to do that.
Later, the system will also help identify whether a comment is spam, which can reduce the time spent determining feedback polarity, determine the language of the comment, and so on.

2. Literature Review

Sentiment analysis, which is used in this research, has been studied in detail over the last few years. There are many research papers on sentiment analysis, but we present only the ones that are useful for our research.

In paper [3], authors Akanksha Sharma and Dr. Ashim Saha performed a comparative study of different approaches used for sentiment analysis of customer reviews, and stated that this process helps the owners of online shops to make the right decisions regarding their items. They divided the feedback into three categories: positive, negative, and neutral. The classification in our research is similar, ranging from -1 to 1, where 1 represents positive, 0 represents neutral, and -1 represents negative. Their research is very similar to ours: they gathered feedback from e-shops, analyzed it, and finally classified it. The authors mentioned Support Vector Machine (SVM), Naive Bayes, the Lexicon Method, etc. In their comparison, SVM performed best.

Research paper [4] also performed sentiment analysis on user feedback from online shops. Michael Gamon, the author, used over 40,000 feedback items that he collected from two different sources, Global Support Services and Knowledge Base Surveys. The feedback was rated on a scale from 1 to 4, where 1 represented dissatisfied and 4 very satisfied. He used a linear Support Vector Machine (SVM) for feedback classification with 10-fold cross-validation. As a result, he defined two classification tasks. The first determines whether the feedback falls under 1 or 4 on the scale; the second determines whether the feedback falls under 1 or 2 versus 3 or 4. He used 10-fold cross-validation on both tasks with the first 2000 feedback items in his dataset. The first task (whether the feedback falls under 1 or 4 on the scale) proved to be more accurate: the precision was 85.47% for the first task and 69.23% for the second.

Prashali et al. [5] collected the data for their classification from the Kaggle website. The data was in Excel format and contained 186 feedback items. Their dataset was composed of students' feedback on a teaching program, and the goal of their research was to see how to improve the teaching and learning program. Their results were expressed, as in our research, between -1 and 1, where 1 represents positive, 0 neutral, and -1 negative. They used the polarity from sentiment analysis to determine whether the feedback is classified as positive, negative, or neutral.

In paper [6], the authors wrote that owners of online stores should analyze every piece of feedback they receive in the shortest time possible, since it affects their further business and their cooperation with the sellers on their online shop. Robots can manipulate the star ratings of items in online shops; for that reason, feedback on online shops must be analyzed using natural language processing (NLP). In this way, false feedback can be deleted and received feedback can be analyzed quickly. Swati N. Manke and Nitin Shivale classify their results into two categories, positive and negative.

Peter D. Turney, in his research paper [7], applied semantic orientation to determine whether feedback is positive or negative. He used 410 samples of feedback, which he acquired from 4 different domains (banks, automobile, movie, and travel destination).
He used an unsupervised learning algorithm to classify feedback as positive or negative. The precision of his algorithm averaged 74%; the highest precision was on automobile feedback (84%), while the lowest was on movie feedback (66%). The considerable difference between the precision for automobiles and movies arises because the meaning of some words depends on the context. An adjective may have a negative meaning in the automobile domain and the exact opposite meaning in the movie domain. For example, the adjective “unpredictable” has a negative meaning for an automobile but a positive one for a movie. To assess whether feedback is positive or negative, the author followed 3 steps:
● Extract the sentences that contain adjectives and adverbs,
● Predict the semantic orientation of each extracted sentence,
● Categorize the feedback as positive or negative according to the semantic orientation of the sentences.

In [8], the authors used a model to analyze the text of feedback written by users. The number of stars in the star rating given by the user was also taken into account in determining the results. Joachim Büschken and Greg M. Allenby tested their model on a hotel dataset and a restaurant dataset, both of which contained feedback and star ratings. Their model was built on Latent Dirichlet Allocation (LDA). The restaurant dataset contains 696 samples (feedback and star rating) from different Italian restaurants. In the hotel dataset, feedback and star ratings were collected from two hotels, one in New York and the second near JFK airport. The number of samples collected from the hotel in New York is 3,212, while the second hotel contributed 1,255, which sums to 4,467 feedback items from the hotels. The authors conclude that bag-of-sentences is better than bag-of-words for user speech analysis.
Saleem Abuleil and Khalid Alsamara, in their research paper [9], wrote about analyzing user feedback using Natural Language Processing (NLP). The authors presented feedback in two formats, rating (structured data) and textual (unstructured data). Their research was applied to feedback written in the Arabic language. In Arabic, adjectives take the form of describing another person or thing in a sentence. The authors converted unstructured data (text) into structured data (numerical). They categorized their results into two classes, positive and negative feedback.

In research paper [10], the authors write about measuring customer satisfaction using sentiment analysis. For classification, they used a support vector machine (SVM) sentiment classifier, mainly because SVM gave the best results in research paper [11]. The data set was collected through the Twitter API. It contained the following:
● Likes (lists of users that liked a specific tweet)
● Followers (lists of users that follow a specific tweet)
● Mentions (lists of users that were mentioned in a specific tweet)
● Replies (lists of replies to a specific tweet), and
● Re-tweets (lists of users that shared a specific tweet)
They stored the data in the MySQL Database Management System. The authors classified their results into two classes, positive and negative. Their algorithm gave a precision of around 87%.

3. Methods and Materials

The data used in this research is taken from the BTT web application. The number of feedback samples is more than 1000. The application consists of multiple feedback sites, but this research is based on feedback from the rent-a-car section. The amount of data tested in this research depends on the number of feedback items in the BTT web application.
Right now, there are more than 1000 feedback items for the rent-a-car section; if new feedback is added, the system will cover it automatically the next time it runs. We used only the cars' feedback from the BTT web application. We take the data in HTML format, where we have only the feedback, without the other attributes from the table that relate to the feedback for business logic. The attributes that we do not use are ID, user, and car_ID, since they tell us nothing about whether the feedback is positive or negative. This means that only one column is left, since the table contains 4 columns (ID, name, carID, and feedback), as can be seen in Figure 1.

Figure 1. Feedback in MySQL

As mentioned before, we use only one column of the table, the feedback column. Figure 2 shows only the feedback from the table in the HTML web page, from where we take the feedback.

Figure 2. Feedback on HTML page

A. Data preprocessing

Since this research is based on working with text, the text must be analyzed and processed before determining whether it is positive or negative. The system is based solely on English text. We apply natural language processing (NLP) to further analyze and process the feedback. For the whole process, we use the Python programming language with its library TextBlob. TextBlob is used to determine whether the given feedback is positive or negative. TextBlob is a Python library for basic text tasks such as sentiment analysis, translation, language detection, and so on, all of which can be classified as NLP tasks. TextBlob allows us to treat its objects like regular Python strings when processing the desired task [12].
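Because the feedback is taken from an HTML page, the markup has to be stripped and the text normalized before any sentiment scoring. The following is only a minimal sketch of such cleaning, written with Python's standard library rather than TextBlob; the function name clean_feedback and the sample rows are hypothetical and are not taken from the BTT code. Spell checking is deliberately left out.

```python
import html
import re


def clean_feedback(raw: str) -> list[str]:
    """Turn one raw HTML feedback cell into a list of lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)        # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove non-letters
    text = text.lower()                        # lowercase
    return text.split()                        # whitespace tokenization


# Hypothetical sample rows, standing in for cells of the feedback column.
rows = [
    "<td>This car is NOT good!</td>",
    "<td>   </td>",
    "<td>Fast &amp; clean car, 10/10</td>",
]

# Empty elements are dropped after cleaning.
cleaned = [tokens for tokens in map(clean_feedback, rows) if tokens]
print(cleaned)
```

Regular expressions are enough for simple table cells like these; for arbitrary HTML, a real parser would be the safer choice.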
The processing steps applied in this research are: removing HTML tags, removing non-letters, removing whitespace and empty elements, lowercasing, tokenization, spell checking and correcting misspelled words, etc. To reduce the number of words in the feedback and make the classification as accurate as possible, stopwords are usually removed [13]. However, when we check the list of nltk's stopwords, we can see that it is not always a good idea to remove stopwords from the dataset. For example, the stopword “not” can change the meaning of a sentence completely: the sentence “This car is not good” becomes “car good” after removing stopwords. The original sentence is negative, while the sentence after removing stopwords is positive. Of course, removing stopwords does not always change the meaning of a sentence. Because of that, it is good to know the sentences inside the dataset before removing stopwords. Figure 3 shows an example of how removing stopwords can change the meaning of a sentence.

Figure 3. Example of removing stopwords

Figure 4 demonstrates that removing stopwords can sometimes cause an issue. The first sentence has a polarity result of -35, which means it is negative, while after removing stopwords from the sentence the meaning changes and the polarity result becomes 70, which means the sentence is positive.

Figure 4. Polarity result before and after removing stopwords

4. Results

After applying the above processing steps to each piece of feedback taken from the BTT web application, we begin determining whether it is positive or negative. Here we come to sentiment analysis, which is provided by the Python library mentioned above. From TextBlob's sentiment analysis, we use the polarity part, which gives a result between -1 and 1.
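Before the scores are interpreted, the stopword pitfall from the preprocessing step can be made concrete. The sketch below is illustrative only: the stopword and negation sets are small hand-picked assumptions (they are not NLTK's list), and remove_stopwords with its keep_negations flag is a hypothetical helper, not part of TextBlob or NLTK.

```python
# Hand-picked sets for illustration only; NOT the NLTK stopword list.
STOPWORDS = {"this", "is", "the", "a", "an", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, keep_negations=False):
    """Drop stopwords; optionally keep negation words such as "not"."""
    kept = []
    for tok in tokens:
        is_protected = keep_negations and tok in NEGATIONS
        if tok in STOPWORDS and not is_protected:
            continue  # plain stopword: drop it
        kept.append(tok)
    return kept


tokens = "this car is not good".split()
print(remove_stopwords(tokens))                       # ['car', 'good']
print(remove_stopwords(tokens, keep_negations=True))  # ['car', 'not', 'good']
```

Keeping negation words is one possible compromise between shortening the text and preserving its meaning; whether it helps on a given dataset would still need to be checked against the actual sentences.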
A polarity of -1 indicates a very negative result, in our case very bad feedback, and 1 indicates a positive one. Figure 5 shows the implementation of TextBlob's sentiment polarity and the resulting polarity score.

Figure 5. Implementation and result of TextBlob's sentiment property

The result in Figure 5 is not very readable: we can see the polarity results, but not which feedback each result belongs to. We therefore merged the polarity score with the feedback to make the result more readable and understandable. Figure 6 shows how we combined the feedback and the polarity score.

Figure 6. Feedbacks' polarity result

Table 1 shows the total accuracy of our algorithm.

Table 1. Result
Algorithm | Approximate result
Naïve Bayes | ~ 80%

5. Discussion

Considering the research papers related to our research, which are mentioned in Section 2, we have noticed that it is much faster and easier to determine whether feedback is positive or negative using the Python library TextBlob. As mentioned before, it is not always a good idea to remove stopwords from the text, as this can change the meaning of the sentences. In some studies, the Naïve Bayes algorithm did not give the best result. There may be several causes, such as a huge dataset with unnecessary samples or information, removed stopwords, improper preprocessing, and so on. Table 2 lists the algorithms and results of several previous studies.

Table 2. Conspectus of previous works
Author(s) | Algorithm | Result
Joachim Büschken and Greg M. Allenby | LDA (Latent Dirichlet Allocation) | 60-70%
Michael Gamon | SVM (Support Vector Machine), two classes | 85.47% for the first class, while the second class was 69.23%
Al-Otaibi Shaha, Alnassar Allulo, Alshahrani Asma, Al-Mubarak Amany, Albugami Sara, Almutiri Nada, Albugami Aisha | SVM (Support Vector Machine) | Around 87%
Peter D. Turney | PMI-IR (Pointwise Mutual Information - Information Retrieval) | Around 74%

6. Conclusion

To conclude, the feedback has been divided into two groups, positive and negative. Feedback, as on every web site, helps users who are visiting an online shop for the first time to determine which product is of good quality. In this research, we showed that removing stopwords is not always a good idea, because it can change the meaning of a sentence. The research will also make it easier for online shop owners to determine which feedback is positive and which is negative. In this way, the owner will be able to recognize quality sellers in a very easy and simple way.

In the near future, we plan to improve this research by adding a 'minus'. A minus will be added to a seller for every negative feedback on their items. In that way, we will be able to isolate bad sellers with bad items. If a seller receives a certain number of minuses, they will be warned. If a seller's item receives a certain number of minuses, it will be deleted automatically. A method for recognising whether feedback is spam will also be implemented. This process will run before the sentiment analysis, since we want to perform the sentiment analysis only on "real" feedback. This will speed up the process because large numbers of spam feedback items will not be analysed. Methods for translating foreign-language feedback into English will be added as well. This research will be open-source so that every company or person who owns a shop that receives feedback will be able to use it.

REFERENCES
[1] S. CK and G. Edwin, “Online Shopping - An Overview,” June 2014. [Online]. Available: https://www.researchgate.net/publication/264556861_Online_Shopping_-_An_Overview.
[2] A. Fundin and B. Bergman, “Exploring the customer feedback process,” June 2003. [Online]. Available: https://www.researchgate.net/publication/240260148_Exploring_the_customer_feedback_process.
[3] A. Sharma and A. Saha, “A comparative Study of different Approaches Used for Sentiment Analysis From Customer Reviews,” 14 Dec 2018. [Online]. Available: https://poseidon01.ssrn.com/delivery.php?ID=5751021240710700920281070710870230680370490040060050301200900180960751141181190270711060980510290180320020670021111090060881061220260940480650751111250880150870891260690020340740060171160050860910251130010930131.
[4] M. Gamon, “Sentiment classification on customer feedback: Noisy data, large feature vectors, and the role of linguistic analysis,” January 2004. [Online]. Available: https://www.researchgate.net/publication/215470705_Sentiment_classification_on_customer_feedback_data_Noisy_data_large_feature_vectors_and_the_role_of_linguistic_analysis.
[5] S. S. Prashali, R. K. Asmita, S. P. Rutuja and U. W. Yamini, “Sentiment Analysis of Feedback Data,” March 2019. [Online]. Available: https://www.ijtsrd.com/papers/ijtsrd23090.pdf.
[6] S. N. Manke and N. Shivale, “A Review on Opinion Mining and Sentiment Analysis based on Natural Language Processing,” International Journal of Computer Applications, pp. 29-32, 2015.
[7] P. D. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews,” July 2002. [Online]. Available: https://www.aclweb.org/anthology/P02-1053.pdf.
[8] J. Büschken and G. M. Allenby, “Sentence-Based Text Analysis for Customer Reviews,” 2016. [Online]. Available: https://www.ku.de/fileadmin/160102/WiSe2015_2016/mksc.2016.0993-ePDF3.pdf.
[9] S. Abuleil and K.
Alsamara, “Using NLP Approach for Analyzing Customer Reviews,” 2018. [Online]. Available: https://www.slideshare.net/cscpconf/using-nlp-approach-for-analyzing-customer-reviews-86265367.
[10] S. Al-Otaibi, A. Alnassar, A. Alshahrani, A. Al-Mubarak, S. Albugami, N. Almutiri and A. Albugami, “Customer Satisfaction Measurement using Sentiment Analysis,” International Journal of Advanced Computer Science and Application, pp. 106-117, 2018.
[11] J. Brynielsson, F. Johansson, C. Jonsson and A. Westling, “Emotion classification of social media posts for estimating people's reactions to communicated alert messages during crises,” 2014. [Online]. Available: https://docplayer.net/11592731-Emotion-classification-of-social-media-posts-for-estimating-peoples-reactions-to-communicated-alert-messages-during-crises.html.
[12] S. Loria, “textblob Documentation,” 26 April 2020. [Online]. Available: https://buildmedia.readthedocs.org/media/pdf/textblob/latest/textblob.pdf.
[13] S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O'Reilly, 2009.

Title: Journal of Natural Sciences and Engineering
Identifier: 2637-2835
DOI: 10.14706
Publisher: International Burch University
Description: Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences.
It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary and methodological work, with a preference for empirical research, a critical approach and problem-solving methods in manuscripts.
Language: English

Title: Feedback System Using Sentiment Analysis
Author: Abdulrahman Almonajed, Dino Kečo
Keywords: feedback, online article, sentiment analysis
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021319

https://eprints.ibu.edu.ba/files/original/173680acbb933aed28bb44102ca00405.pdf

Journal of Natural Sciences and Engineering, Vol. 2, No.2 (2020) DOI number: 10.14706/JONSAE2021311

Understanding Forms and Models of Cloud Computing Technologies Adopted in the Selected Institutions in Southwestern Nigeria

Gbonjubola Oluwafunmilayo BINUYO1
1 African Institute for Science Policy and Innovation, Obafemi Awolowo University, Nigeria
gobinuyo@gmail.com

Abstract - The study examined the forms and models of cloud computing technology adopted in selected institutions from four states in Southwestern Nigeria. Three institutions (federal, state and privately owned) were purposively selected in each state, making twelve institutions. The questionnaire was filled in by ten (10) IT personnel, ten (10) lecturers and five (5) students from each of the selected institutions, making 300 respondents. The questionnaire elicited information on the forms and models of cloud computing technology adopted and the extent of use of the adopted cloud computing technologies in the selected institutions. Secondary data were obtained from relevant literature. The data collected were analysed with descriptive and inferential statistics.
The study concludes that the forms of cloud computing technology adopted by the selected institutions in Southwestern Nigeria are infrastructure-as-a-service (IaaS), software-as-a-service (SaaS) and platform-as-a-service (PaaS), with software-as-a-service (SaaS) the form most often used by the institutions. The models of cloud computing technology adopted by the selected institutions are private, public, hybrid and community cloud computing. The adopted forms and models of cloud computing technology are used for different business functions such as payroll, procurement, human resources, accounting and finance, CRM, application development, and project management.

Keywords - Cloud computing, Institutions and Nigeria

1. Introduction

The aim of this study is to explicate the forms and models of cloud computing technology adopted in the selected institutions, and to determine the extent of use of the forms of cloud computing technology and the business functions deployed on the cloud computing technology adopted by the selected institutions in Southwestern Nigeria.

Scholars have defined cloud computing from their own perspectives. Cloud computing depends on a subscription service for accessing networked storage space and computer resources [1]. By implication, it is a paid service for securing online information and communications technology services. As cited in [1], not all establishments are leapfrogging to adopt cloud computing technologies, especially established institutions in developing countries such as Nigeria [2]. Globally, higher institutions are struggling to reach the level of information and communications technology (ICT) required to enhance the quality of education and R&D activities, especially in developing countries [3]. The yearly educational report of the Republic of Yemen indicates that the educational sector faces hindrances to delivering the required quality of education to the populace of the country.
Among the hindrances to delivering good quality education in the Republic of Yemen are inadequate infrastructure resources, insufficient budget allocation to ICT, and the absence of ICT technical and teaching personnel [4].

At present, the majority of activities are conducted online. These activities include online document editing and writing, email checking, online interaction and collaboration, among others. Therefore, it is imperative globally for educational systems to keep up with advances in ICT in order to render quality education [3]. Also, given the high cost of providing and maintaining the needed hardware and software, educational systems need to adopt low-cost advanced technology such as cloud computing. Cloud computing addresses the challenge of the high cost of the computer software and hardware needed to render quality education to the populace by providing ICT resources on a pay-per-use basis [3].

There have been diverse empirical studies on cloud computing technologies adopted in institutions [5-11]. There are also some theoretical review studies on the same phenomenon [4, 12-15]. However, scholars have noted that there is a dearth of empirical studies on cloud computing technology in institutions, especially Nigerian institutions [13,15,16]. There is also a dearth of information on the forms and models of cloud computing technology adopted in universities in Nigeria, because cloud computing research is nascent in Nigeria [16]; hence the need for this study. The remaining part of this paper is ordered as follows: review of related literature, research method, study results and discussion, and conclusion and recommendations.

2. Literature Review

There is increasing empirical research interest in cloud computing in both developing and developed economies.
This research interest has engendered vast intellectual and financial investment in cloud R&D [16]. Given that, it is important to know that cloud computing can take the form of a service model or a deployment model [16-18].

(a) Cloud computing as a service model: It is a service model when it entails Software, Platform and Infrastructure [17]. Cloud computing as a service is discussed below:

(i) Software as a service (SaaS) has been defined as a distribution model in which users access applications that run on the provider's servers over the Internet and are charged per usage [18]. In other words, it is a remote online application accessed by users/customers via the network using a simple web browser [18]. In general, SaaS refers to any online service (cloud service) that users can access remotely or subscribe to and pay for on a per-usage basis. These types of cloud services include accounting, invoicing, performance monitoring, communications, sales tracking and planning, among others. Using SaaS is like renting software rather than purchasing it [18]. Unlike mainstream traditional software, which is limited by its license and by the number of devices that can use it, SaaS offers users the opportunity to subscribe to the software instead of purchasing it.

(ii) Platform as a service (PaaS) allows clients or customers to hire software, hardware, repository and network capacity through the Internet. PaaS is of great interest to application developers because it provides for easy changes and upgrades to the features of the operating system in use, and also allows an application to be developed by developers distributed over different geographical locations across international boundaries. Costs can be reduced by using infrastructure services from a single cloud computing service provider rather than owning and maintaining several hardware facilities that often perform identical functions.
Examples of PaaS include Salesforce, IBM Bluemix, Cloudbees and Microsoft Azure, among others.

(iii) Infrastructure-as-a-Service (IaaS): This service delivery model enables clients to rent the equipment used in service operations and to control the deployed applications and operating systems, among others. However, updating and patching the operating system at the IaaS level are the responsibility of the users within the contractual period [19].

(b) Cloud computing as a deployment model entails public, private, community and hybrid cloud [17, 20]. These models are discussed below:

(i) Public Cloud: The most common cloud computing services are skewed towards the public cloud deployment model because, as the name implies, they are publicly and openly available. Even though they can exist in private clouds, SaaS offerings such as cloud storage and online office applications, and IaaS and PaaS offerings such as cloud-based web application development environments and hosting, are related to the public cloud model. Public clouds are also deployed when organisations or individuals do not require the level of infrastructure and security present in the private cloud model [21]. Large organisations or enterprises may still deploy public clouds in situations where privacy is not required, such as online document collaboration, webmail or storage of non-sensitive documents.

(ii) Private Cloud: It does not allow cloud resources to be shared with unknown third parties. It is otherwise known as an internal cloud, strictly for the internal use of an establishment [22]. Private cloud resources may be located either on or off the premises of the organization; hence, this model does not come with the benefit of reduced investment or expenditure in IT infrastructure or equipment.
(iii) Community cloud: This type of model is solely for a group or collection of users within an organisation having a shared or common goal [23]. Here, IT resources are provided as a service to group of users in order to enable an elastic collaborative use of computing resource. It is often limited to selected or limited set of employees within an organisation such as security department, head of departments, a team or sub-unit in an organisation. (iv) Hybrid cloud: This model integrates two different deployment models such as public, private and community models. Organisations often combine two differing models to form a hybrid cloud in a bid to maximise efficiencies. In hybrid cloud, the combined clouds retain their identities but are bound together by standardized or proprietary technology [24]. Given cloud computing as service and deployment models, however, measuring the contribution of Nigerian scholars to the number and impact of cloud computing study was needed [16]. Content analysis and bibliometric was deployed in papers extracted from Scopus database within the specified time and country (2016 and Nigeria). The analysis of the extracted papers shows that majority of cloud computing study in Nigeria tend towards Education and Saas model of cloud computing [16]. In support of that assertion, [11] studied the effect and challenges of adopting cloud computing technology in government owned universities in the Southwestern Nigeria. In the study, one hundred (100) IT (information technology) personnel, fifty (50) para-IT personnel and fifty (50) students making two hundred (200) respondents in total were selected in each of the selected ten (10) universities using stratified sampling techniques with the aid of questionnaire. Out of the two thousand (2,000) questionnaire administered, one thousand, seven hundred and forty-two (1742) were retrieved which represents a respondent rate of 87.1%. Microsoft excel was used to analyse the data descriptively. 
The outcome of the study implies that the adoption of cloud computing has an important effect on enhanced availability, cost effectiveness, low environmental impact and reduced investment in physical assets, among others. The main issues challenging the use of the cloud were data insecurity, regulatory compliance concerns, lock-in and privacy concerns. Cloud computing is an avenue to efficient and optimized IT (information technology) services at least cost, which is induced by pay-as-you-use (PAYU) payments to cloud service providers [3]. There are other benefits attached to the use of cloud computing, among them a high return on investment [25]. Despite these benefits, many sectors, especially higher education, are skeptical about adopting cloud computing technology [3, 25]. On the contrary, cloud computing technology is being widely adopted by higher institutions, mainly for financial reasons [4]. Beyond financial reasons, the technical reasons for which an IT manager or decision maker adopts cloud computing can be attributed to organizational, environmental, technological and individual factors [4]. Cloud computing is a feasible means of meeting the technological needs of an organisation efficiently, effectively and at reduced investment in physical assets, with least environmental impact and IT complexity [1, 11]. [1] examined the behavioural intent to adopt cloud computing technology in large and small organizations using an Enhanced Technology Acceptance Model (ETAM). [1] concluded that attitude and adopters’ ability to use cloud computing (self-efficacy) were better predictors of intention to adopt cloud computing technology.
Perceived usefulness and perceived ease of use of cloud computing were better predictors of attitude towards adopting cloud computing technology, while perceived ease of use and the relevance of cloud computing to adopters’ work (job relevance) were the predictors of perceived usefulness. Recently, [15] systematically reviewed empirical studies on cloud computing technologies. The review showed that empirical studies on the actual usage/utilization of cloud computing technology are scarce. The study also identified challenges and benefits attributed to cloud computing adoption, and showed empirically that universities in the selected area are willing to adopt cloud computing technologies. Meanwhile, [14] had earlier concluded from the reviewed literature on cloud computing technology adoption in organisations that the factors that determine the adoption of cloud computing technologies vary. [14] further noted that most of the reviewed studies operationalised the intention to adopt cloud computing in a binary form rather than the actual use of the technology. Similarly, [13] showed from a systematic literature review of empirical studies on cloud computing technology adoption in universities that several universities have utilized different types of cloud computing service models. [25] examined the perception of IT and non-IT personnel of the factors associated with the poor adoption of cloud computing technologies in African enterprises, with Nigeria as a case study. The study concluded that fear of the unknown, such as job loss, cyber threats, privacy issues and data theft, hindered the adoption of cloud computing technology. In addition, [26] showed that top management support, competitive pressure and compatibility are factors attributed to the adoption of cloud computing technologies.
Based on the aforementioned studies, this paper adopts the Technology Acceptance Model (TAM) as a focusing device for its analysis. The Technology Acceptance Model explains the perceived usefulness of a technology, the perceived ease of use of a technology and the attitude toward using a technology [27]. These three constructs are the key determinants in the technology acceptance model. First, perceived usefulness (PU) explains that people tend to use or not use a technology based on their perception of its usefulness. Second, perceived ease of use (PEOU) explains that potential users are of the opinion that a given technology is useful and requires less effort to use. Third, the attitude of a user toward a technology is a major determinant of whether the user will actually use or reject the innovation [27]. Based on that, the applicable research method is adopted for this study.

3. Research Method

This study deployed a multi-stage sampling technique in data collection. Four states were randomly selected from the six in Southwestern Nigeria. Three universities were purposively selected from each of the selected states. The purposive selection ensured one federal, one state and one privately owned university from each of the four states, making twelve universities in total. Furthermore, a questionnaire was administered to and filled in by personnel in the purposively selected institutions: ten (10) IT personnel, ten (10) lecturers and five (5) students were considered from each of the selected institutions, making three hundred (300) respondents.
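The sample size that follows from this multi-stage design can be checked with a few lines of Python. This is only an illustrative sketch: the counts are taken from the description above, while the variable names are assumptions introduced here and are not part of the study.

```python
# Sketch of the multi-stage sample size; counts come from the text,
# variable names are illustrative only.
STATES_SELECTED = 4
OWNERSHIP_TYPES = ("federal", "state", "private")  # one university of each type per state
RESPONDENTS_PER_UNIVERSITY = {"IT personnel": 10, "lecturers": 10, "students": 5}

universities = STATES_SELECTED * len(OWNERSHIP_TYPES)
per_university = sum(RESPONDENTS_PER_UNIVERSITY.values())
total_respondents = universities * per_university

print(universities, per_university, total_respondents)  # 12 25 300
```

The product of twelve universities and twenty-five respondents per university gives the three hundred questionnaires administered.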
The institutions were selected on the basis that they use cloud computing technologies, while the purposive selection of respondents within the institutions was based on referral to personnel with expertise in the subject matter. The questionnaire elicited information on the forms and models of cloud computing technology adopted. The respondents were asked to tick the forms and models of cloud computing adopted in their institutions. The forms of cloud computing considered in this study are Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS), while the models of cloud computing are private, public, hybrid and community cloud computing. Furthermore, respondents were to rank on a five-point scale the extent of use of the adopted cloud computing technologies in the selected institutions: no use (A), little use (B), moderate use (C), highly use (D) and often use (E), where A is the lowest and E is the highest. The respondents were further asked to indicate (multiple responses allowed) the types of cloud computing technologies deployed in the institutions, such as Gmail-Based Institution Email Service, Dropbox, Docusign, Skydrive, Netsuite, Cisco-WebEx, Amazon Elastic or Web Services, Learning Management Systems (LMS), Microsoft Azure Cloud, Integrated Development Environments (IDEs), Cloud based APIs, and Cloud based .NET Platforms. In addition, the respondents were asked to rank the extent of use of the adopted cloud computing technologies for business functions on a five-point scale: not applicable (A), little use (B), moderate use (C), highly use (D) and often use (E), where A is the lowest and E is the highest. The business functions considered were payroll, application development, project management, accounting and financing, CRM/sales management, procurements, human resources, and messaging and collaboration.
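A five-point scale of this kind lends itself to a simple frequency tabulation. The sketch below shows one way such a percentage distribution could be computed; the function name `percent_distribution` and the ten sample answers are hypothetical and are not data from the study.

```python
from collections import Counter

# Five-point scale from the questionnaire, lowest (A) to highest (E).
SCALE = {"A": "No use", "B": "Little use", "C": "Moderate use",
         "D": "Highly use", "E": "Often use"}

def percent_distribution(responses):
    """Return {level: percentage} for a list of A-E answers, rounded to 1 dp."""
    counts = Counter(responses)
    n = len(responses)
    return {level: round(100 * counts[level] / n, 1) for level in SCALE}

# Hypothetical answers from ten respondents (NOT data from the study).
sample = ["C", "C", "D", "B", "C", "A", "D", "E", "C", "B"]
distribution = percent_distribution(sample)
print(distribution)
```

Levels that no respondent chose are reported as 0.0 rather than omitted, which keeps every row of a results table the same width.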
Data collected were analysed with descriptive statistics such as frequencies and crosstabulation.

4. Results and Discussion

Table 1 shows the three categories of institutions selected for this study: federal-owned, state-owned and privately owned institutions. The table also shows the number of questionnaires administered to the selected institutions and the number retrieved. Out of the three hundred (300) questionnaires administered, 169 (56.3%) were retrieved and used for the analysis of this study. Meanwhile, [16] found that the majority of cloud computing studies in Nigeria tend towards education and the SaaS model of cloud computing; this study further contributes to that body of work.

Table 1: Number of Institutions Selected

| Categories of the institutions | Administered: Frequency | Administered: Percentage | Retrieved: Frequency | Retrieved: Percentage |
|---|---|---|---|---|
| Federal owned institution | 100 | 33.3 | 57 | 19 |
| State owned institution | 100 | 33.3 | 63 | 21 |
| Private owned institution | 100 | 33.3 | 49 | 16.3 |
| Total | 300 | 100 | 169 | 56.3 |

Table 2 presents the forms and models of cloud computing technology adopted in the selected institutions. The table shows that the majority (78.3%) of the institutions adopt software-as-a-service, while 65.1% and 54.3% also adopt platform-as-a-service and infrastructure-as-a-service respectively. The adoption of these forms of cloud computing corroborates the reports of previous scholars on the forms of cloud computing technology adopted in institutions [17], [28], [29] and [30]. Hence, the adoption of these technologies will reduce the operating costs of the selected institutions, such as the costs of hardware, storage facilities and maintenance, among others. Table 2 also reports the models of cloud computing technology adopted by the selected institutions in the study area.
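Returning to Table 1, the retrieval percentages can be reproduced from the reported frequencies. The following is a minimal sketch, assuming (as the table's totals suggest) that each retrieved count is expressed as a share of all 300 administered questionnaires; the dictionary and variable names are illustrative.

```python
# Questionnaires administered and retrieved per category, as reported in Table 1.
administered = {"Federal": 100, "State": 100, "Private": 100}
retrieved = {"Federal": 57, "State": 63, "Private": 49}

total_administered = sum(administered.values())
total_retrieved = sum(retrieved.values())

# Assumption: Table 1 expresses retrieved questionnaires as a share of ALL administered ones.
share_of_total = {k: round(100 * v / total_administered, 1) for k, v in retrieved.items()}
overall_rate = round(100 * total_retrieved / total_administered, 1)

print(share_of_total)
print(total_retrieved, overall_rate)
```

Under that assumption the three shares add up to the overall response rate of 56.3% reported in the text.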
Table 2 further shows that the selected institutions adopt private cloud computing (53.5%), public cloud computing (54.3%), hybrid cloud computing (51.9%) and community cloud computing (51.2%). This is in line with the positions of previous scholars on the models of cloud computing technologies adopted by institutions [20-23, 31]. In addition, this study corroborates [13] that several universities have utilized different types of cloud computing service models. By implication, universities in the study area adopted different forms and models of cloud computing based on their discretion, cost reduction, need and necessity, industrial revolution, and technology push and demand, among others. In support of the theory adopted for this study, it can be inferred that the selected universities adopted cloud computing technology based on perceived usefulness, perceived ease of use and the attitude of users toward a technology, which [27] identifies as elements of the technology acceptance model.

Table 2: Forms and Models of Cloud Computing Technology Adopted

| Category | Characteristics | Frequency | Percent (%) |
|---|---|---|---|
| Forms of Cloud Computing | Software-as-a-Service (SaaS) | 101 | 78.3 |
| Forms of Cloud Computing | Platform-as-a-Service (PaaS) | 84 | 65.1 |
| Forms of Cloud Computing | Infrastructure-as-a-Service (IaaS) | 70 | 54.3 |
| Models of Cloud Computing | Private Cloud | 69 | 53.5 |
| Models of Cloud Computing | Public Cloud | 70 | 54.3 |
| Models of Cloud Computing | Hybrid Cloud | 67 | 51.9 |
| Models of Cloud Computing | Community Cloud | 66 | 51.2 |

*Multiple response is applicable

Table 3 presents the level of institutional use of the forms of cloud computing technology adopted by the selected institutions. Table 3 shows that the largest share (38.8%) of the selected institutions that adopted infrastructure-as-a-service use the technology moderately, followed by 24.8% of the institutions that highly use infrastructure-as-a-service.
Concerning the use of software-as-a-service by the selected institutions, Table 3 further shows that the largest shares (34.9% and 32.6%) of the selected institutions moderately and highly use software-as-a-service respectively. Concerning the use of platform-as-a-service, Table 3 shows that the largest shares (26.4% and 41.1%) of the selected institutions little use and moderately use platform-as-a-service respectively. By implication, Table 3 shows that software-as-a-service (SaaS) is the form most used by the selected institutions in Southwestern Nigeria. This might be a result of the distinctive character of SaaS, which covers any cloud service that users can access remotely or subscribe to on a pay-per-usage basis [18]. Among the SaaS cloud services that can be subscribed to or used remotely are accounting, invoicing, performance monitoring, communications, tracking sales and planning [18]. In addition, this study corroborates [16] that the majority of cloud computing studies in Nigeria tend towards the SaaS model of cloud computing.

Table 3: Level of Institutional Use of Cloud Computing Technology (level of cloud computing usage, %)

| Forms of cloud computing | A | B | C | D | E |
|---|---|---|---|---|---|
| IaaS | 14 | 7 | 38.8 | 24.8 | 0.8 |
| SaaS | 1.6 | 14 | 34.9 | 32.6 | 3.9 |
| PaaS | 10.9 | 26.4 | 41.1 | 3.9 | 1.6 |

*Multiple response is applicable
Key: A = No use; B = Little use; C = Moderate use; D = Highly use; E = Often use

Table 4 shows the cloud computing technologies adopted by the selected institutions in the study area. The table shows that the most widely adopted cloud computing technologies in the selected institutions are cloud based APIs (55.8%), cloud based .NET Platforms (51.9%), Cisco-WebEx (48.8%), integrated development environments (IDEs) (43.4%) and Amazon Elastic or Web Services (31.8%).
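Returning to Table 3, the reading of each row (which usage levels dominate for each form) can be made explicit with a short sketch. The percentages are those reported in Table 3; the helper `top_levels` and the combined "D or E" share are illustrative choices introduced here and are not computed in the paper.

```python
# Percentages per usage level (A-E) as reported in Table 3.
LEVELS = {"A": "No use", "B": "Little use", "C": "Moderate use",
          "D": "Highly use", "E": "Often use"}
table3 = {
    "IaaS": {"A": 14.0, "B": 7.0, "C": 38.8, "D": 24.8, "E": 0.8},
    "SaaS": {"A": 1.6, "B": 14.0, "C": 34.9, "D": 32.6, "E": 3.9},
    "PaaS": {"A": 10.9, "B": 26.4, "C": 41.1, "D": 3.9, "E": 1.6},
}

def top_levels(row, k=2):
    """Return the k usage levels with the largest reported percentages."""
    ranked = sorted(row, key=row.get, reverse=True)
    return ranked[:k]

for form, row in table3.items():
    print(form, [LEVELS[level] for level in top_levels(row)])

# Illustrative only: share of institutions reporting "Highly use" or "Often use".
high_use = {form: round(row["D"] + row["E"], 1) for form, row in table3.items()}
print(high_use)
```

Under this illustrative grouping SaaS has the largest combined share of high and frequent use, which is consistent with the interpretation given for Table 3.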
Other cloud computing technologies adopted by the institutions include Gmail-Based Institution Email Service (26.4%), Microsoft Azure Cloud (18.6%), Learning Management Systems (LMS) (16.3%), Skydrive (12.4%), Netsuite (8.5%), Dropbox (7.8%) and Docusign (0.8%). This shows that the selected institutions exhibit some level of cloud computing technology adoption. Perhaps the necessity of adopting low-cost advanced technology such as cloud computing prompted the selected institutions to adopt cloud technologies. Meanwhile, [3] had postulated earlier that cloud computing technologies address the challenge of the high cost of both the computer software and hardware needed to render quality education to the populace, by providing ICT resources on a pay-per-use basis. By implication, the selected institutions adopted cloud computing technologies so as to provide high-quality education that is affordable and accessible at least cost for the stakeholders in the institutions.

Table 4: Cloud Computing Technology Adopted by the Selected Institutions

| Characteristics | Frequency | Percent (N=111) |
|---|---|---|
| Gmail-Based Institution Email Service | 34 | 26.4 |
| Dropbox | 10 | 7.8 |
| Docusign | 1 | 0.8 |
| Skydrive | 16 | 12.4 |
| Netsuite | 11 | 8.5 |
| Cisco-WebEx | 63 | 48.8 |
| Amazon Elastic or Web Services | 41 | 31.8 |
| Learning Management Systems (LMS) | 21 | 16.3 |
| Microsoft Azure Cloud | 24 | 18.6 |
| Integrated Development Environments (IDEs) | 56 | 43.4 |
| Cloud based APIs | 72 | 55.8 |
| Cloud based .NET Platforms | 67 | 51.9 |

*Multiple response is applicable

Table 5 shows the extent of use of cloud computing technology in business functions in the selected institutions in the study area. The selected institutions highly use (30.2%) and often use (11.6%) cloud computing technology in their payroll function. In addition, the table shows that the selected institutions highly use (34.1%) and often use (25.6%) cloud computing technology in their application development function.
Furthermore, Table 5 shows that the selected institutions moderately use (25.6%) and highly use (22.5%) cloud computing technology in their project management functions. The table shows that the selected institutions moderately use (33.3%) cloud computing technology in their accounting and financing functions. Also, the institutions little use (27.9%) and moderately use (31.8%) cloud computing technology in their CRM/sales management function. The table shows that the selected institutions moderately use (39.5%) cloud computing technology in their procurements function. In addition, the selected institutions moderately use (37.2%) cloud computing technology in their human resources function. Lastly, the selected institutions little use (34.9%) and moderately use (32.6%) cloud computing technology in their messaging and collaboration function. By implication, the payroll functions of the selected institutions have been digitised and can be carried out from anywhere in the world (telecommuting). Not only that, the selected institutions have deployed cloud computing technologies in their project management, accounting and financing, CRM/sales management, procurements, human resources, and messaging and collaboration functions.

Table 5: Extent of Use of Cloud Computing Technology in Business Function (extent of use, %)

| Business Function | A | B | C | D | E |
|---|---|---|---|---|---|
| Payroll | 17.8 | 9.3 | 18.6 | 30.2 | 11.6 |
| Application Development | 10.1 | 7 | 8.5 | 34.1 | 25.6 |
| Project Management | 16.3 | 15.5 | 25.6 | 22.5 | 3.9 |
| Accounting and Financing | 17.1 | 24 | 33.3 | 7 | 0.8 |
| CRM/Sales Management | 21.7 | 27.9 | 31.8 | 3.1 | - |
| Procurements | 22.5 | 21.7 | 39.5 | 2.3 | - |
| Human Resources | 20.2 | 23.3 | 37.2 | 3.9 | 1.6 |
| Messaging and Collaboration | 11.6 | 34.9 | 32.6 | 7 | 3.1 |

*Multiple response is applicable
Key: A = Not applicable; B = Little use; C = Moderate use; D = Highly use; E = Often use

5.
Conclusion

The study concludes that the forms of cloud computing technology adopted by the selected institutions in Southwestern Nigeria are infrastructure-as-a-service (IaaS), software-as-a-service (SaaS) and platform-as-a-service (PaaS), with software-as-a-service (SaaS) the form most often used by the institutions. Also, the models of cloud computing technology adopted by the selected institutions in Southwestern Nigeria are private, public, hybrid and community cloud computing. The adopted forms and models of cloud computing technology are used for different business functions such as payroll, procurement, human resources, accounting and finance, CRM, application development, and project management.

6. Limitations and future work

This study is limited to universities in Southwestern Nigeria; further studies may consider all universities in Nigeria. The study did not consider factors influencing the adoption of cloud computing technologies; further studies may consider them. The study used only a quantitative method in data collection and descriptive analysis; further studies may consider mixed methods in data collection and analysis.

7. Acknowledgement

The author appreciates the contributions of the scholars who in one way or another contributed to the scholarship of this paper.

REFERENCES

[1] O. T. Arogundade, et al., “Investigation of Factors Affecting Cloud Computing Adoption in Nigeria”. Journal of Natural Science, Engineering and Technology, 2016, 15(2), 73-94.
[2] A. Ume, A. Bassey, H. Ibrahim, “Impediments facing the introduction of cloud computing among organizations in developing countries: Finding the answer”. Asian Transactions on Computers, 2012, 2, 12-20.
[3] S. Okai, M. Uddin, A. Arshad, R. Alsaqour, and A. Shah, “Cloud Computing Adoption Model for Universities to Increase ICT Proficiency”, SAGE, 2014, 1-10.
DOI: 10.1177/2158244014546461 [4] S. Abdulnoor, M. D. Sulfeeza, and M. S. Siti, “Empirical Studies on Cloud Computing Adoption: A Systematic Literature Review”. Journal of Theoretical and Applied Information Technology, 2017, 6809-6832. [5] N. Sultan, “Cloud Computing for Education: A New Dawn?” International Journal of Information Management, 2010, 30, 109– 116. [6] T. Ercan, “Effective Use of Cloud Computing in Educational Institutions,” Procedia Social and Behavioral Sciences, 2010, 2, 938–942 [7] M. Mircea and A. Adreescu, “Using Cloud Computing in Higher Education: A Strategy to Improve Agility in the Current Financial Crisis”. IBIMA, 2011, 1-15. DOI:10.5171/2011.875547 [8] F. E. Mehmet and B. K. Serhat, B. K. Cloud Computing for Distributed University Campus”, International Conference on the Future of Education, Pixel Publishing International, 2011 [9] Y. G. Abdulsalam and U. Z. Fatima "Cloud Computing: Solution to ICT in Higher Education in Nigeria", Advances in Applied Science Research, 2011, 2 (6):364-369, Pelagia Research Library. [10] J. Anjali, and U. S. Pandey “Role of Cloud Computing in Higher Education", International Journal of Advanced Research in Computer Science and Software Engineering, 2013, 3(7), 966-972. �Journal of Natural Sciences and Engineering, Vol. 2, No.2 (2020) DOI number: 10.14706/JONSAE2021311 [11] C. A. Oyeleye, T. M. Fagbola, and C. Y. Daramola, “The Impact and Challenges of Cloud Computing Adoption on Public Universities in Southwestern Nigeria. (IJACSA)” International Journal of Advanced Computer Science and Applications, 2014, 5(8), 13-19. [12] S. O. Olabiyisi, T. M. Fagbola, R. S. Babatunde “An Exploratory Study of Cloud and Ubiquitous Computing Systems”. World Journal of Engineering and Pure and Applied Sciences, 2012, 2(5):148-155. [13] M. S. Ibrahim, N. Salleh, and S. Misra, “Empirical Studies of Cloud Computing in Education: A Systematic Literature Review”. Springer International Publishing Switzerland, 2015, 725-737. 
DOI: 10.1007/978-3-319-21410-8_55 [14] H. Hassan, M. H. Mohd-Nasir, and N. Khairudin, “Cloud Computing Adoption in Organisations: Review of Empirical Literature”. SHS Web Conferences. 2017, 34. 1-6. DOI: 10.1051/shsconf/20173402001. [15] M. B. Ali, T. Wood-Harper, M.R.A. Mohamad, “Benefits and Challenges of Cloud Computing Adoption and Usage in Higher Education. Stanford University”, 2018, 1-22. http://dx.doi.org/10.4018/IJEIS.2018100105. [16] A. A. Ezenwoke, and E. Igbekele, “Cloud Computing Research in Nigeria: A Bibliometric and Content Analysis”. Asian Journal of Scientific Research. 2019, 12(1), 41-53 [17] M. Ahronovitz, D. Amrhein, P. Anderson, A. Andrade "Cloud Computing Use Cases White Paper", 4th ed. 2010. Accessed from http://www.cloud-council.org/Cloud_Computing_Use_Cases_Whitepaper-4_0.pdf accessed 4th November, 2020. [18] K. Hashizume, "An Analysis of Security Issues for Cloud Computing", Journal of Internet Services and Applications. 2012, 4(5): 3-13. [19] M. Murphy, L. Abraham, M. Fenn, and S. Goasguen, (2009), "Autonomic Clouds on the Grid", Journal of Grid Computing, pp. 1-18. [20] D. Catteddu, and G. Hogben, "Cloud Computing: Benefits, risks and recommendations for information security". 2009, 3-11. [21] A. Mansour, "The Adoption of Cloud Computing Technology in Higher Education Institutions: Concerns and Challenges (Case Study of Islamic University of Gaza)" 2013. [22] Q. Zhang, L. Cheng, and R. Boutaba, “Cloud Computing: State-of-the-art and Research Challenges", Journal of Internet Services and Applications, 2010, 1(1): 7-18. [23] K. Sharma, S. Thakur, A. Kalia, J. Thakur, and S. Kumar, "Emerging Cloud Computing Paradigm: Vision, Research Challenges and Development Trends", International Journal of Research and Engineering and Technology, 2014, 3(5): 11-34. EISSN:2319- 1163, ISSN: 2321-7308, [24] Cloud Security Alliance (CSA) "Security Guidance for Critical Areas of Focus in Cloud Computing V2.1". 2009, 2-7. 
[25] G. A. Oguntala, R. A. Abd-Alhameed, and J. O. Odeyemi, “Systematic Analysis of Enterprise Perception Towards Cloud Adoption in the African States: The Nigerian Perspective”. African Journal of Information Systems, 2017, 9(4), 213-231.
[26] S-K. Yoo, and B-Y. Kim, “A decision-making model for adopting a cloud computing system”. Sustainability, 2018, 1-15. DOI: 10.3390/su10082952
[27] F. D. Davis, “A technology acceptance model for empirically testing new end-user information systems: Theory and results”. Doctoral dissertation. Cambridge, MA: MIT Sloan School of Management, 1985.
[28] P. Buxmann, L. Sonja, and H. Thomas, "Software as a Service", WIRTSCHAFTSINFORMATIK, 2008, 50(6): 500-503.
[29] M. Anandarajan, and B. Arinze, (2010), "Factors that Determine the Adoption of Cloud Computing: A Global Perspective", International Journal of Enterprise Information Systems, IJEIS, 6(4): 55-68.
[30] R. Miller, (2011), "Understanding the Different Levels of Cloud Computing", http://www.businessservicemanagementhub.com/2011/03/16/understanding-the-different-levels-of-Cloud-computing/ accessed 7th October, 2020.
[31] F. Shimba, "Cloud Computing: Strategies for Cloud Computing Adoption". Masters Dissertation at the School of Computing, Dublin. Dublin Institute of Technology, 2010.
Title: Journal of Natural Sciences and Engineering
Identifier: 2637-2835
DOI: 10.14706
Publisher: International Burch University
Description: Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary and methodological work, with a preference for empirical research, a critical approach and problem-solving methods in manuscripts.
Language: English

Title: Understanding Forms and Models of Cloud Computing Technologies Adopted in the Selected Institutions in Southwestern Nigeria
Author: Gbonjubola Oluwafunmilayo Binuyo
Abstract: The study examined the forms and models of cloud computing technology adopted in the selected institutions from four states in Southwestern Nigeria. The three purposively selected institutions in each state were federal, state and privately owned, making twelve institutions.
However, the administered questionnaire was filled in by the ten (10) IT personnel, ten (10) lecturers and five (5) students from each of the selected institutions making 300 respondents. The questionnaire elicited information on the forms and models of cloud computing technology adopted and the extent of use of the adopted cloud computing technologies in the selected institutions. Secondary data were obtained from relevant literature. Data collected were analysed with descriptive and inferential statistics. The study concludes that the forms of cloud computing technology adopted by the selected institutions in Southwestern Nigeria are infrastructure-as-a-service (IaaS), software-as-a-service (SaaS) and platform-as-a-service (PaaS) while software-as-a-service (SaaS) is often used by the institutions. Also, the models of adopted cloud computing technology are private, public, hybrid and community cloud computing by the selected institutions in Southwestern Nigeria. The adopted forms and models of cloud computing technology are used for different business functions such as payroll, procurement, human resources, accounting and finance, CRM, application development, and project management. Keywords Keywords. Cloud computing, Institutions and Nigeria Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706/JONSAE2021318 https://eprints.ibu.edu.ba/files/original/4185b962c7b2e1090b65243b0dbbab63.pdf c9e429afa68c8df8f76a72e2686eb35b PDF Text Text Journal of Natural Sciences and Engineering, Vol. 2, No.2 (2020) DOI number: 10.14706/JONSAE2021311 Contemporary housing trends in Sarajevo Emina Mehic1 1-International Burch University, Sarajevo, Bosnia and Herzegovina emina.mehic@stu.ibu.edu.ba Abstract – Within the last 20 years, there has been witnessed a significant increase of the urban population of Sarajevo, as a result of economic and social migrations. 
Consequently, this has caused an increasing demand for new housing which is mainly profit-oriented without any beneficial social, environmental or cultural implication. Primary objective of this research is to analyze the current situation and to assess the quality of the buildings not only as a housing solution, but as a complex that unites the community who inhabits it. This research will be conducted in a qualitative manner in analysis and statistical approach over the data related to the urbanization, building standards and positive effects of the building. Newly built parts of settlements Otoka and Stup will be used as case studies, since these parts of the city are most influenced by the mass production of the new housing solutions. This paper stresses out the correlation between high demand for the new housing and decreased quality of the housing without respecting minimum spatial and environmental standards, without basic amenities, social infrastructure and recreational and cultural activities. There is a need for improvements in contemporary housing design that will reflect with positive impacts on social, environmental, economic and cultural aspects of urban living. Keywords - Contemporary housing trends, qualitative analysis, Otoka, Stup 1. Introduction City of Sarajevo is becoming a large construction site, meaning that more and more residential buildings and buildings in general are being built. For the past couple of years, the fast appearance of the entire residential settlements is noticeable. The parts of the city that are affected the most are Otoka and Stup. One of the most characteristic housing solutions are definitely residential settlements called Stup Nukleus, a newly built residential and business complex in Stup, municipality of Ilidža and Nova Otoka in Otoka, municipality of Novi Grad. With the urbanization of the capital city of Sarajevo extending rapidly. 
It is not surprising that more and more investors are seeking an opportunity for profit. To understand why interest is so high in these specific parts of the city, the history and urban plans of Sarajevo give us a more precise point of view.
Otoka is a settlement in Sarajevo, the capital city of Bosnia and Herzegovina, located in the municipality of Novi Grad. Otoka borders the following settlements: Buća Potok (north side), Čengić vila (east side), Aneks (south-east side) and Švrakino Selo (south side). Its residential core is a chain of high-rise buildings (streets: Žrtava Fašizma, Brčanska, Aleja Lipa). [2] The majority of residential buildings in this part of the city were built by the government in the early 1970s, when Otoka was considered one of the most prominent, modern and cleanest parts of the Sarajevo suburbs. The residential design of this part of the city was also advanced compared to other buildings of the time. As shown in Figure 1, these were built during the socialist regime, when significant attention was paid to the environmental aspects of the settlement. There were designated areas for parks, elementary schools, preschools and shopping. [1] Originally, residential settlements were built on the left side of the Miljacka river, which before the 1970s was mainly empty fields. There were no plans for extensive construction on the other side of the river, since the idea was to maintain Otoka Meandar as the green “lungs” of the city, containing recreational areas and walking paths. The area to the north, between two major traffic axes, Bulevar Meše Selimovića and Džemala Bijedića street, was treated as an industrial site. After the war of the 1990s, new buildings were erected in the Meandar area. “Stadium Otoka” was built in 1993 and was additionally upgraded and renovated in 2011.
“Istiklal Mosque” was built in 2001. Besides these two, the Vistafon multipurpose hall and the Olympic pool, two large-scale projects, were built in this period. Even though these are mainly sport and recreational buildings that provide opportunities for social interaction and entertainment, the green lungs of the city were seriously jeopardized. In the meantime, as the mentioned buildings were constructed, the industrial zones slowly started decaying as market needs and industry demand changed. The industrial companies that owned the area were destroyed in the shady privatization processes that followed the war. Industries that survived the war and privatization were relocated outside of the city. This created an opportunity to transform the entire industrial zone into a residential settlement. [6] [7]
Examining the urbanization plans, we can conclude that the first residential zone was expected to be at most 6-8 floors high, but today the height has almost doubled and we can see 12-13 story buildings. The building blocks that we are examining in Nova Otoka were initially planned with a maximum height of 21 meters, but with the change of the regulatory plan in 2017 their height increased to 42 m. However, even though the height of the buildings was increased, the distances between them and the number of accompanying facilities remained the same. Another important issue is the vehicular congestion that occurs on a daily basis in this part of the city, because Otoka, as mentioned, is the geographical center of the city. It is a connection point between the hill settlements and the valley, with a tram connection and the main road.
Furthermore, when the Otoka settlement was originally built, vehicle traffic was directed through neighbourhood lanes planned in a ring around the complex, which made conditions safer overall and decreased congestion.
Stup, shown in Figure 3, is a settlement in Sarajevo, the capital city of Bosnia and Herzegovina, located in the municipality of Ilidža. Geographically, it is located in the western part of the city, further from the city centre. It is bounded by the river Miljacka on the south and by the river Dobrinja on the north. Neighboring settlements are Briješće, Alipašin most, Alipašino Polje, Olimpijsko selo, Nedžarići, Zračna luka Butmir, Ilidža, Pejton, Otes and Azići. This part of the city was quite rural, since it was considered to be on the outskirts of the city, so mainly low-rise single-family houses and industrial buildings were located in this area. These were mainly owner-occupied houses, and there were no larger-scale buildings. Once the regulatory plan was provided, the Stup area was separated into zones. One of the zones, Stup Nukleus, was designated as a residential settlement zone comprising recreational and green areas. However, there were multiple missteps during the implementation of the plan itself. The Institute for Development Planning of Sarajevo Canton did not specifically state the dimensions of the individual buildings, but rather provided zones for approved buildings with the pertaining area coverages and building indexes. The developer, however, chose to ignore the regulations and building indexes and built on the entire buildable area. This has caused very high building density; for instance, there are several cases of a 6-meter distance between two 13-story buildings. Regarding the history of Stup Nukleus, in 1992 the site was owned by a farming cooperative. After the war, the area became privately owned.
Construction of the Stup Nucleus residential settlement began in 2011. In November 2017, the Municipality of Ilidža drafted a Study on the socio-economic justification for the establishment of a public institution in the Stup II settlement, which plans for the construction of the school to begin this year, but it never happened. The closest school to this settlement is currently Aleksa Santic Elementary School, located in Aerodromsko naselje, which is more than one kilometer away, and access to it is very dangerous because of the heavy traffic, especially for younger children. Regarding the vehicular connection of Stup, it is connected to the main traffic axis, Džemala Bijedića street, and it contains one of the biggest road loops, connecting the city to the main roads that lead to Mostar, Zenica or Tuzla. With this in mind, we can now combine the general characteristics of both settlements to create a detailed analysis of new building construction trends and the future of building in the capital city of Sarajevo. [3]
Figure 3. Stup in Yugoslavia, as a spacious new settlement near the industrial zone [www.klix.ba]
2. Methodology
The case study will show the quality, trends, potential problems and possible improvements for contemporary housing trends in Sarajevo. This will allow us to gather all the information relevant to our research. The results will be used to give recommendations for the design of residential housing in the future.
3. Case study
Urbanistic criteria: Figure 4 below shows the regulatory plan of Stup Nukleus. Based on the urban typology and the proposed regulation plan, we will be able to draw some conclusions and find relevant data that will affect the evaluation of the results. [9]
Figure 4.
Regulation plan of Stup Nukleus [Institute for Planning Development of Sarajevo Canton]
Stup Nukleus was built in three separate phases, although the majority of it was built during the first phase. The construction process started in 2011, and the first phase consisted of 5 buildings with heights varying between 5 and 12 stories. The smallest distance between these buildings is 6 meters, between the 10-story building and the 7-story building, which creates a big issue in terms of vistas, daylight and extreme, almost inhuman density. [4] The buildings take up around 7.471 m2 of the site area, which is 20.245 m2. We can conclude that more than a third of the actual site is covered by the buildings. This brings us to the calculation of the Urban Density Index (expressed through floor area ratio), which in this case equals 0,36902939. This is quite high considering that the buildings are over 10 stories high, creating an image of very high physical concentration and spatial congestion.
Figure 5. Regulation plan of Stup Nukleus in the first phase of development [Institute for Planning Development of Sarajevo Canton]
The second phase of the Stup Nukleus development, represented in Figure 7, contained as many as 11 buildings ranging from 6 to 13 floors high. The smallest distance between these buildings is 7,5 m. The total area covered by the buildings is 18.455 m2 out of 51.056 m2 of the total site area. The Urban Density Index (expressed through floor area ratio) for the second phase of Stup Nukleus is 0,3614658414, which is slightly smaller than in the first phase. [3]
Figure 6. Regulation plan of Stup Nukleus in the second phase of development [Institute for Planning Development of Sarajevo Canton]
Figure 7.
Completion of Stup Nucleus I [Tibra Pacific]
However, the situation on site is considerably worse than in the first phase, because the number of extremely high buildings is much greater than before and some parts of the site are simply incapable of receiving any daylight. There are also cases where the buildings face each other so closely that they create privacy issues.
Figure 8. Construction of Stup Nucleus 2 in third phase of Stup Nucleus development [Tibra Pacific]
The third phase presents a similar situation, as shown on the regulatory plan below; it covers 4.508 m2. This phase is still in development, so it is hard to get the exact value for the UDI; it contains 3 buildings and the highest one is 9 floors high.
Figure 9. Regulation plan of Stup Nukleus in the third phase of development [Institute for Planning Development of Sarajevo Canton]
In Nova Otoka, on the other hand, we can notice 5 new buildings, two of them with the same height of 12 floors, which, as mentioned before, doubled after the change of the regulatory plan. The covered area of Nova Otoka is 10.601 m2 out of the total area of 26.930 m2, and one more building, located further away rather than between these buildings, has an area of 2.031 m2. [10] It is important to notice that the UDI in Nova Otoka is 0,3936502042. It is high, but there is a factory in between the buildings that occupies the rest of this field. This technically means that the building density here is close to ~0,86. For the general size of the site this is high, and the buildings take up a large portion of the space.
Figure 10. Regulation plan of Otoka [Institute for Planning Development of Sarajevo Canton]
Journal of Natural Sciences and Engineering, Vol.
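The index values quoted in the urbanistic criteria can be re-derived from the reported areas. The sketch below is only an illustration: it assumes the index is computed as the area covered by buildings divided by the total site area, which is what the quoted numbers correspond to. The function name `coverage_ratio` and the dictionary layout are hypothetical and not taken from the paper.

```python
# Re-derive the density index values reported in the case study.
# Assumption: index = area covered by buildings / total site area,
# which reproduces the figures quoted in the text.

def coverage_ratio(covered_m2: float, site_m2: float) -> float:
    """Return the share of the site covered by buildings."""
    if site_m2 <= 0:
        raise ValueError("site area must be positive")
    return covered_m2 / site_m2

# Areas in m2 as reported in the text: (covered area, total site area).
phases = {
    "Stup Nukleus, phase 1": (7471, 20245),
    "Stup Nukleus, phase 2": (18455, 51056),
    "Nova Otoka": (10601, 26930),
}

ratios = {name: coverage_ratio(c, s) for name, (c, s) in phases.items()}

for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption both Stup Nukleus phases come out at roughly 0,36 and Nova Otoka at roughly 0,39. A floor area ratio in the strict sense would divide the total floor area of all stories by the site area, so it would be considerably higher for buildings of this height.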
2, No.2 (2020) DOI number: 10.14706/JONSAE2021311
Environmental, social and cultural criteria: Based on the documentation and a geographical analysis of the site where Stup Nukleus is located, we can conclude that there is no park in close proximity to the complex, nor are there any amenities for children or any similar project planned. The closest park intended for recreational and leisure purposes is 1.9 km away, a 25-minute walk from the complex. On the other hand, based on the analysis of the Nova Otoka site, we conclude that there is only one very small park within the complex, and the amenities for children are quite limited. The closest larger park intended for recreational and leisure purposes is 3.3 km away, a 37-minute walk from the complex. Considering the social aspects of the mentioned complexes, we can notice a very bad trend: a lack of care for social interaction. It is important to mention the better position of Otoka, which inherited social and cultural facilities from socialist Yugoslav construction, compared to Stup, which had no such predispositions. After careful examination of the site, we have concluded that Stup Nukleus has 5 privately owned coffee shops and 3 restaurants, which is not enough given the population and building density. [8] Besides these private commercial activities, there is no entertainment, recreational or cultural amenity of any sort in either of the sites examined in this case study. [5]
Architectural criteria: Stup Nukleus is commonly considered to be one of the worst complexes built in Sarajevo in the last two decades. The main issue we discovered, based on the interviews, was the insufficient distance between buildings. [11] We select a sample apartment from these buildings. The example that we used is a 2-bedroom apartment with a total area of 58 m2.
The selected type is the most common and most frequently repeated apartment type in the entire complex. Regarding the layout and the dimensions of the rooms, the living room with the kitchen and dining area is accessible from the lobby. The total area of these spaces is 17,80 m2. From this space you can access the balcony of 9,20 m2. To the left of the lobby there is a bathroom with an area of 4,06 m2. The master bedroom is 14,37 m2 with access to the loggia. To the right of the front door is a pantry with an area of 1,80 m2, while the smaller bedroom, which has an area of 8,39 m2, is accessed from the living room. Some of these apartments are above the 7th floor, which brings us to the next point: the disadvantages of the Stup Nukleus buildings. The main disadvantage is the insufficient amount of natural light. This issue is closely connected to the distance between buildings. The floors above the 7th floor do have access to natural light. Other parts are poorly designed and get at most 3 hours of daylight. Looking at the window-to-space ratio, we can notice that there is a lack of windows throughout the apartment; the rooms are small and it is really hard to fit any larger piece of furniture in them. The floor plan also shows that the kitchen and bathroom are too small. Besides that, this apartment faces north, the side that gets the least light. There is also an issue with air circulation: air from the kitchen has to pass through the living and dining room to reach the only window in the left part of the apartment.
Figure 11. 58 m2 apartment in Stup Nucleus as average size apartment [www.olx.ba]
On the other hand, as mentioned before, Nova Otoka is a project from the same construction company as the Stup Nukleus complex, and it is considered to be more contemporary and of a higher standard than Stup.
Since Nova Otoka was completed only recently, we were able to find more information about the technical execution of the construction and about the building layout itself. This apartment is located on the west side of the complex, on the 12th floor, meaning there is just one floor above it. [3] Floor facilities: a two-floor basement, a ground floor and 12 residential floors. The basement floors are designed as parking spaces, the ground floor contains offices, and the 12 floors above the ground floor are planned as housing units. The complex contains 12 floors, but the last 2 floors are two-story penthouses. Apartment sizes in this complex vary from 32,49 m2 to 133,63 m2, with an average apartment area of 65 m2. This apartment is located on the first floor of the 12-story building A. It faces south towards the main road, which is very busy and has high vehicle density during the day; the Otoka settlement in particular is known as the place where traffic jams start, which makes the vehicle concentration very high.
In order to make a better comparison, we select an apartment of similar size from the “Nova Otoka” complex, which has 57,16 m2. The apartment consists of a living room with a connected kitchen and dining room, with a total area of 22,64 m2.
The master bedroom, with an area of 13,15 m2, is directly next to the children's bedroom of 7,10 m2. The lobby has an area of 3,72 m2; across from the bedroom there is a toilet with an area of 4,24 m2, and the living room opens onto a balcony of 6,31 m2.
Figure 12. 57 m2 apartment in Nova Otoka as average size apartment [www.olx.ba]
Additionally, and more significantly, because of their shape the rooms inside the buildings are simply not practical: placing a bed in the middle of the room leaves only around 70 cm of accessible space. This is a new practice and it is proving to be bad and non-functional. Residents will always lack space for wardrobes.
4. Conclusion
Evaluating the situation and the data presented above, we can state that most of these new complexes, such as Stup Nukleus and Nova Otoka, are built mainly for profit, without any concern for the environmental, social or cultural benefits of such developments. There is a lack of care for providing smart residential building solutions or even basic social, recreational and cultural infrastructure, resulting in an inhuman, unsocial and quite hostile built environment without any sense of identity. A significant improvement could be made by adding areas like parks and playgrounds for children. Instead, the developers are opting for a rather cruel profit machine that brings money exclusively to the investors. Scale, more precisely building density and the distances between buildings, has a significant influence on the overall quality of the studied complexes. One of the main issues, especially noted in Stup Nukleus, is an evident lack of daylight between the buildings, especially where the distance between two buildings is not more than 8 meters. This causes privacy issues and issues with vistas, which can also lead to further psychological issues.
With regard to the regulatory plan, it is very important to state that the density and the height of the buildings do not comply with the regulations and laws that are in place. The case study has shown that the layout of the bedrooms within the buildings is highly questionable, based on their position and size. The versatility, flexibility and functionality of certain spaces, bedrooms foremost, are dubious due to their limited size. [4] Furthermore, it is important to conclude that improvements are needed, along with persistence on the part of the government in enforcing the initially set regulatory plans. Moreover, there is an evident need for a clear set of residential standards in terms of room size, layout, orientation, etc. These standards should be applied as regulatory mechanisms, which would help prevent future mistakes. The investors, in turn, need to keep in mind all aspects of living, rather than just providing profitable housing solutions without any amenities. Lastly, the final users of the housing should be more aware of the consequences and implications of inadequate residential settlements, instead of focusing just on the price per m2.
5. REFERENCES
[1] Bošnjak, Katarina. “URBANI IDENTITET SARAJEVA.” AABH, 5 Nov. 2016, aabh.ba/urbani-identitet-sarajeva/.
[2] “Općina Novi Grad Sarajevo.” Općina Novi Grad Sarajevo, 2015, www.novigradsarajevo.ba/index.php?option=com_content&view=article&id=17&Itemid=21.
[3] Sarajevo, Canton. “Building Regulations and Laws for Canton Sarajevo”, 2017, propisi.ks.gov.ba
[4] Bachelard, G. (1994). The Poetics of Space. Boston: Beacon Press.
[5] Finci, J. (1962). Development of Disposition and Function in Residential Culture of Sarajevo. Sarajevo: NP Oslobodjenje.
[6] Grabrijan, D., & Neidhardt, J. (1957). Architecture of Bosnia and the Way to Modernity. Ljubljana.
[7] Ernst, J. Z., Vukicevic, B., Jakulj, T., & Ilich, W. (2017, August 22).
Sarajevo Paradox: Survival throughout History and Life after the Balkan War. Retrieved from Columbia University: http://www.columbia.edu/cu/ece/research/intermarium/vol6no3/ernst.pdf
[8] Federalni zavod za statistiku. (n.d.). Retrieved from http://fzs.ba/index.php/popis-stanovnistva/popisstanovnistva-2013/preliminarni-rezultati-popisa-2013/
[9] “PACIFIC’ d.o.o. Kiseljak.” TIBRA, 2019, tibra-pacific.com/tibra_new/.
[10] Otoka, Nova. “NOVA OTOKA.” NOVA OTOKA, 1 Aug. 2015, www.novaotoka.com/en/home.php
[11] Općina Ilidža https://www.opcinailidza.ba/uploads/files/shares/REGULACIONI%20PLANOVI/Regulacioni%20plan 20Stup%20Nukleus.pdf
Dublin Core The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see, http://dublincore.org/documents/dces/. Title A name given to the resource Journal of Natural Sciences and Engineering Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706 Publisher An entity responsible for making the resource available International Burch University Description An account of the resource Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on the behalf of Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content regarding by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor, and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online.
The journal welcomes theoretical, applied, interdisciplinary and methodological work, with preference on empirical research, critical approach and problem-solving methods in manuscripts. Language A language of the resource English Dublin Core The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see, http://dublincore.org/documents/dces/. Title A name given to the resource Contemporary housing trends in Sarajevo Author Author Emina Mehić Abstract A summary of the resource. Within the last 20 years, there has been witnessed a significant increase of the urban population of Sarajevo, as a result of economic and social migrations. Consequently, this has caused an increasing demand for new housing which is mainly profit-oriented without any beneficial social, environmental or cultural implication. Primary objective of this research is to analyze the current situation and to assess the quality of the buildings not only as a housing solution, but as a complex that unites the community who inhabits it. This research will be conducted in a qualitative manner in analysis and statistical approach over the data related to the urbanization, building standards and positive effects of the building. Newly built parts of settlements Otoka and Stup will be used as case studies, since these parts of the city are most influenced by the mass production of the new housing solutions. This paper stresses out the correlation between high demand for the new housing and decreased quality of the housing without respecting minimum spatial and environmental standards, without basic amenities, social infrastructure and recreational and cultural activities. There is a need for improvements in contemporary housing design that will reflect with positive impacts on social, environmental, economic and cultural aspects of urban living. Keywords Keywords. 
Contemporary housing trends, qualitative analysis, Otoka, Stup Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706/JONSAE2021317 https://eprints.ibu.edu.ba/files/original/2ec0ac41ff0e0b71ab32b474c716b5ce.pdf c6d7c6996b753e2359491dfe742709ae Dublin Core The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see, http://dublincore.org/documents/dces/. Title A name given to the resource Journal of Natural Sciences and Engineering Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706 Publisher An entity responsible for making the resource available International Burch University Description An account of the resource Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on the behalf of Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content regarding by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor, and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary and methodological work, with preference on empirical research, critical approach and problem-solving methods in manuscripts. Language A language of the resource English Lesson Plan A resource that gives a detailed description of a course of instruction. Dublin Core The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. 
For more information see, http://dublincore.org/documents/dces/. Title A name given to the resource FPGA-based Implementation of IIR Filter for Real-Time Noise Reduction in Signal Author Author Aladin Kapić1, Rijad Sarić1, Slobodan Lubura1, 2, Dejan Jokić Abstract A summary of the resource. Filtering of unwanted frequencies represents the main aspect of digital signal processing (DSP) in any modern communication system. The main role of the filter is to perform attenuation of certain frequencies and pass only frequencies of interest. In a DSP system, sampled or discrete-time signals are processed by digital filters using different mathematical operations. Digital filters are commonly categorized as Finite Impulse Response (FIR) and Infinite Impulse Response (IIR). This research focuses on the full VHDL implementation of digital second-order lowpass IIR filter for reducing the noisy frequencies on the FPGA board. The initial step is to determine, from continuous time domain function, the transfer function in the complex {s} domain, then map transfer function in complex {z} domain and finally calculate the difference equation in discrete-time domain of the system with adequate coefficients. Prior to the FPGA implementation, the IIR filter is tested in MATLAB using a signal with mixed frequencies and signal with randomly generated noise. The digital implementation is completed by using fixed-point binary vectors and clocked processes. Keywords Keywords. digital signal processing; IIR filter; digital design; FPGA; VHDL; Bode diagram Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706/JONSAE2021316 https://eprints.ibu.edu.ba/files/original/c91f0c7eb4c25c44efb93b1215302dc1.pdf 11fd7f01ac78fa3f60095d139353018c PDF Text Text Journal of Natural Sciences and Engineering, Vol. 
1, (2019) DOI number: 10.14706/JONSAE2019114 Quantitative estimation of cooling load capabilities of residential buildings using machine learning Nedret Bećirović, Ismail Bejtović, Jasmin Kevrić International Burch University, Sarajevo, Bosnia and Herzegovina nedret.becirovic@stu.ibu.edu.ba ismailbejtovic@hotmail.com jasmin.kevric@ibu.edu.ba Abstract – Based on previous research on energy efficiency of the buildings, particularly their cooling load capabilities we will develop a collection of machine learning methods for detecting buildings with best cooling load capabilities. This collection will study the influence of 8 input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) on one output parameter, that is cooling load of buildings. The results of this study support the practicability of using machine-learning software to estimate building parameters as a convenient and accurate approach, as long as the methods chosen are well suited for the type of data in question. Keywords – cooling load, energy efficiency, machine learning, neural network. 1. Introduction Considering growing electrical energy consumption in the residential sector [1] and Global Warming it is noticeable that energy consumption for cooling will surpass energy consumption for heating in the foreseeable future. Heating and cooling load are two very important parameters in the efficient building design. These two parameters are closely related to the materials that the building is made of, so construction decisions made early on have a great impact on the final result. There has been a considerable body of research [2] on this field and on this dataset but with no focus on the cooling load itself. Various software for simulation of energy consumption has been used over the years often in conjunction with architectural design. 
The accuracy of the simulation often varies from one software package to another [3]. Therefore, this work is envisaged as an addition to the existing software solutions. It is often the case that building parameters are compared separately with cooling and heating load, and a simple correlation is sought [4]. Multiple regression analysis was very popular for the prediction of energy consumption until it was shown that a simple Neural Network performs much better than Multiple Linear Regression Analysis on a large database [5]. For architects it is very important to single out and rank the parameters that have the strongest impact, since normality assumptions do not hold for very complicated problems. For example, glazing area will have minimal impact on the cooling load, while surface area and overall height are the parameters with the strongest impact. This work is done in the hope that it will help future architects and energy advisors in designing smart buildings, and the field of energy efficiency in general. Further studies could help with choosing suitable materials for construction.
2. Data
This study is based on a non-Gaussian dataset from the UCI database, generated with the CAD software Ecotect. The dataset represents 12 different building forms, where each form is composed of 18 building blocks of the same volume (3.5 x 3.5 x 3.5 m), so all buildings have the same volume, 771.75 m3, but different heights and surface areas. The materials used in these 18 blocks are all contemporary and have the best U-values, which are well defined for walls, floors, etc., with variations in glazing area and orientation [2]. Twelve building forms with three glazing area variations, five glazing area distributions each and four orientations give (12x3x5x4) 720 building samples. In addition, the 12 building forms are considered without glazing but with four orientations (4x12 = 48). In all, this gives 768 different building types.
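The sample count and the common building volume can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are hypothetical, and the block edge length is assumed to be in meters.

```python
from itertools import product

# Building volume: 18 cubic blocks with an edge of 3.5 (assumed to be meters).
block_edge = 3.5
blocks_per_building = 18
volume = blocks_per_building * block_edge ** 3

# Glazed samples: every combination of building form, glazing area,
# glazing area distribution and orientation.
forms = range(12)
glazing_areas = range(3)
glazing_distributions = range(5)
orientations = range(4)
glazed = list(product(forms, glazing_areas, glazing_distributions, orientations))

# Unglazed samples: each building form in each orientation.
unglazed = list(product(forms, orientations))

total_samples = len(glazed) + len(unglazed)
print(volume, len(glazed), len(unglazed), total_samples)
```

Enumerating the combinations rather than multiplying the counts makes it easy to see which parameter settings are present when a reduced dataset is constructed later.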
[2] Since the parameters with the strongest impact are identified, a new dataset can be constructed in which some parameters are locked in value and others are varied. Data mining here means identifying the parameter that has the greatest influence on the result. Statistical tools will be used, but input from builders, architects, masons, etc. will also add great value to the study. They can also provide knowledge about the feasibility of building parameters, such as how much a particular building feature costs in the real world.

This is a well understood, relatively large dataset of 768 buildings, each described by 8 parameters. It is not a skewed dataset and is not treated as such, meaning that the data were not sifted through. Some light pruning, or trimming, of the data is an essential part of the random and best-first search methods. The data are, however, skewed in another way: the dataset is non-Gaussian, and it is of great importance to find any bias that may have influenced it, using classical statistical analysis that visually reveals an outlying parameter. No parameters were found that should be given more or less weight in the neural network model.

Finding a dataset of real buildings, or extracting data from buildings with a great cooling load, was also a goal of this work. Glazing area did not have much importance in this dataset for finding the cooling load. New, modern types of materials are changing the paradigm of the builders' philosophy, and the focus of this work shifted back to the study of the virtual buildings, i.e. our dataset. It would be best to actively follow research in the field, particularly any report on the construction of buildings based on research using this or a similar dataset. The dataset has been normed, quantified and classified in a very understandable and logical way by Tsanas and Xifara (see Table 1).

Table 1.
Mathematical representation of the input and output variables to facilitate the presentation of the subsequent analysis and results.

Mathematical representation   Name                        Number of possible values
x1                            Relative compactness        12
x2                            Surface area                12
x3                            Wall area                   7
x4                            Roof area                   4
x5                            Overall height              2
x6                            Orientation                 4
x7                            Glazing area                4
x8                            Glazing area distribution   6
y2                            Cooling load                636

3. Methods

Classical statistical tools such as histograms and scatter plots are applied to the dataset first. Seeing the data on a graph is a great help in understanding it, and it indicates the direction the study has to take. Improving a model can take two different directions: making the model simpler or adding complexity. Making a simpler model involves feature reduction, pruning branches and removing learners from an ensemble. Adding complexity means fine-tuning, which involves model combination or adding more data sources [6].

Out of many software tools, WEKA was chosen because it is easy to use and easily accessible. The first step was to search for the computational intelligence method best suited to an artificial dataset. The choice of algorithm is based on the form of the dataset and on trial and error. Getting a good result from the start with the random forest method indicated which direction to take. For the analysis of the available dataset, five different regression algorithms were used:
• Linear Regression
• Random Forest
• REPTree
• SMOreg
• Multilayer Perceptron

These algorithms are recommended for these types of datasets [7]. Regression analysis was used to model the relationship between the dependent variable (cooling load) and the independent variables (the 8 attributes in our dataset), because the class attribute (cooling load) takes a large number of different values. Ten-fold cross-validation was used to gain insight into how the model will behave on an unknown dataset.
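The ten-fold procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not WEKA's implementation: the function names, the fixed shuffle seed, the per-fold averaging of the mean absolute error, and the toy "predict the training mean" learner are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, disjoint, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=10):
    """Average the held-out mean absolute error over k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train only on the k-1 folds that are not held out.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical stand-in learner: always predicts the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

folds = k_fold_indices(768, k=10)
toy_error = cross_validate(list(range(50)), [5.0] * 50, fit_mean, predict_mean)
print([len(fold) for fold in folds], toy_error)
```

Each sample is held out exactly once, so every prediction that enters the error estimate comes from a model that never saw that sample during training.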
All of the above are regression algorithms with the same goal, but they work in different ways.

Linear regression models are linear predictor functions whose parameters are estimated from the data. They are most often fitted using the least squares approach, but they may be fitted in many other ways [8].

Random forest is an ensemble method that builds a multitude of decision trees and outputs the mean prediction of the individual trees. The algorithm applies bootstrap aggregating, or bagging, to its tree learners. Compared with a single decision tree, a random forest tends to provide more accurate predictions because of the decreased bias and variance. The more decision trees are used, the more computational power is required [9].

Reduced Error Pruning Tree (REPTree) is a fast decision tree learner that creates multiple trees in different iterations and selects the best one. REPTree builds a regression tree using information gain or variance reduction and prunes it using reduced-error pruning. For numeric attributes it sorts the values only once [10].

SMOreg uses a support vector machine for regression. Its parameters are learned here with the RegSMOImproved optimizer, but other algorithms, such as Platt's SMO, can be used [11].

Multilayer perceptron is a class of feedforward artificial neural networks. It consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. It is by far the most popular architecture because of its structural flexibility, good representational capabilities, and the availability of a large number of training algorithms [12].

Feature selection is a key part of the applied machine learning process, just as model selection is, and it should be considered part of the model selection process. Otherwise, bias can inadvertently be introduced into the models, resulting in overfitting. When accuracy estimation methods such as cross-validation are used, feature selection must be included within the inner loop. This means that feature selection is performed on the prepared fold right before the model is trained [7].

The dataset used in this work is small both in number of features and in number of samples, and it does not suffer from the “curse of dimensionality” [13, p. 4]. Feature selection and feature extraction methods are not recommended for datasets with such a small number of features [13], but identifying which variables are most important is essential in this type of study. Choosing this particular approach is a rudimentary form of data mining. Four combinations of attribute evaluator and search method are used:
• CfsSubsetEval and BestFirst
• ClassifierAttributeEval and Ranker
• ClassifierSubsetEval and BestFirst
• CorrelationAttributeEval and Ranker

CfsSubsetEval creates subsets of attributes, taking into account the predictive ability of each feature and the level of redundancy among features. Preferred subsets contain features that are highly correlated with the class and have low intercorrelation. The best-first search method is used with CfsSubsetEval.

ClassifierAttributeEval evaluates the worth of an attribute by using a user-specified classifier. For example, if linear regression is used on our dataset, linear regression also needs to be chosen as the classifier for the attribute evaluator. The Ranker search method is used with ClassifierAttributeEval.

ClassifierSubsetEval evaluates attribute subsets on the training data or on a separate hold-out test set. Like ClassifierAttributeEval, it uses a classifier to estimate how good the subsets are. The best-first search method is used with ClassifierSubsetEval.

CorrelationAttributeEval evaluates the worth of an attribute by measuring the correlation between it and the class. Each value of an attribute is treated as an indicator. The Ranker method is used with CorrelationAttributeEval.
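The combination of a correlation-based evaluator with a ranking search can be illustrated with a small pure-Python sketch. It is an analogue of the idea, not WEKA's code: the function names and the four-sample toy data are invented for the example, and attributes are sorted by signed Pearson correlation in descending order.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(columns, target):
    """Score every attribute against the class, then sort from highest to lowest."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data, NOT taken from the paper's dataset.
columns = {
    "overall_height": [3.5, 3.5, 7.0, 7.0],
    "roof_area": [220.5, 220.5, 110.25, 110.25],
    "orientation": [2, 3, 2, 3],
}
cooling_load = [15.0, 16.0, 28.0, 29.0]

ranking = rank_by_correlation(columns, cooling_load)
for name, score in ranking:
    print(f"{name}: {score:+.4f}")
```

In this toy example the strongly negative attribute ends up at the bottom of the list even though its magnitude matches the top one, so a low rank under signed sorting does not by itself mean a weak relationship.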
The best-first search method searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility. BestFirst may start with the empty set of attributes and search forward, start with the full set of attributes and search backward, or start at any point and search in both directions. The Ranker search method ranks attributes by their individual evaluations and is used together with single-attribute evaluators.

4. Results and Discussion

Classical statistical tools such as probability distributions were applied first in order to get a sense of the data. Table 2 presents the results of the attribute evaluators used with random forests. Random forest with the classifier subset evaluator and the best-first search method gave the best results of all the combinations. Best-first search is a heuristic, or informed, search: it evaluates the second step before taking the first, and then chooses which way to go. For this combination of methods only attribute no. 2 (surface area) is not selected. Since the volume of the buildings is fixed, it is logical that surface area shows little variation and therefore has little impact on the result.

Table 2.
Results for combination of random forest and search methods

Random Forest (CC = correlation coefficient, MAE = mean absolute error, RMSE = root mean squared error, RAE = relative absolute error, RRSE = root relative squared error)

Attribute evaluator and search method    CC       MAE      RMSE     RAE        RRSE       Selected attributes
CfsSubsetEval and BestFirst              0.9711   1.4319   2.2692   16.6687%   23.8241%   3, 5, 6, 7
ClassifierAttributeEval and Ranker       0.9582   1.0079   1.6320   11.7324%   17.1345%   1, 2, 3, 4, 5, 6, 7, 8
ClassifierAttributeEval and Ranker       0.959    2.0323   2.6933   23.3581%   28.2775%   1, 2, 4, 5
ClassifierSubsetEval and BestFirst       0.9854   0.9967   1.6196   11.6030%   17.0046%   1, 3, 4, 5, 6, 7, 8
CorrelationAttributeEval and Ranker      0.9852   1.0079   1.6320   11.7324%   17.1345%   1, 2, 3, 4, 5, 6, 7, 8
CorrelationAttributeEval and Ranker      0.9841   1.0859   1.6904   12.6408%   17.7479%   5, 1, 3, 7

The relationship between the volume of a built form and the surface area of its enclosure is called compactness. Roundness is a similar feature. R. Buckminster Fuller, engineer and architect, claimed that round houses have the best energy efficiency. An attempt to extract this feature was made, but with no results.

Surface area, attribute no. 2, directly reflects the compactness of the building and, by similarity, its roundness. The classifier subset evaluator removed this feature and gave the best correlation coefficient, suggesting that compactness has no impact on the cooling load. The use of geometric compactness for such evaluative purposes has been criticized on multiple grounds: it does not capture the specific morphology of the building shape, it disregards the transparent parts of the structure, and it does not correlate with orientation (attribute no. 6) [14]. The high correlation coefficient obtained with all attributes except surface area included indicates that compactness does not affect the thermal load.
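The five measures reported in Table 2 can be computed as in the following sketch. It uses the standard definitions and, as a simplifying assumption, takes the mean of the actual values as the baseline predictor for the two relative errors; the function name and the four toy values are illustrative only.

```python
import math

def regression_metrics(actual, predicted):
    """Correlation, MAE, RMSE, RAE and RRSE for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the trivial baseline that always predicts the mean of the actual values.
    base_abs = sum(abs(mean_a - a) for a in actual)
    base_sq = sum((mean_a - a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }

# Toy values, NOT results from the paper.
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because RAE and RRSE divide by the baseline's error, values well below 100% mean the model clearly outperforms always predicting the mean. The sketch assumes both sequences vary; constant inputs would cause a division by zero.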
Our model gave results similar to those of Tsanas and Xifara [2] on the same dataset, with a slightly better correlation coefficient. The attribute rankings are shown in Table 3 for the classifier attribute evaluator and ranker, and in Table 4 for the correlation attribute evaluator and ranker.

Table 3. Ranking of attributes according to attribute evaluator and ranker

ClassifierAttributeEval and Ranker
Mathematical representation   Name                        Ranked
x1                            Relative compactness        6.8134
x2                            Surface area                6.8134
x4                            Roof area                   5.5105
x5                            Overall height              5.2827
x3                            Wall area                   2.3935
x7                            Glazing area                0.1718
x8                            Glazing area distribution   0.0306

Table 4. Ranking attributes according to correlation attribute evaluator and ranker

CorrelationAttributeEval and Ranker
Mathematical representation   Name                        Ranked
x5                            Overall height              0.8958
x1                            Relative compactness        0.6343
x3                            Wall area                   0.4271
x7                            Glazing area                0.2075
x8                            Glazing area distribution   0.0505
x6                            Orientation                 0.0143
x2                            Surface area                -0.673
x4                            Roof area                   -0.8625

Further study is to be done with different numbers of cross-validation folds for the above-mentioned algorithms. The results would stand stronger if another dataset were available to test our approach. The “K-nearest neighbor” algorithm gave poor results. It is a “data sensitive” algorithm, vulnerable when faced with large amounts of data. Different datasets against which to test the methods would be a great boost to this work.

Parameter tuning is an iterative process, and Weka makes it easy to carry out without the need to understand how the parameters work. Especially when dealing with feature selection, bias can inadvertently be introduced into models, with unforeseen consequences, most often overfitting [7] [15]. The numerical values calculated by software simulations lie very close to previous results.
Agreement with similar studies on the same dataset is characteristic of the machine learning field, and reaching the same results with different methods is an achievement [16].

5. Conclusion

The results of the previous study [17] were reproduced, and further work on the cooling load resulted in a slightly better correlation coefficient than in an article with high scientific impact [2]. Trial and error is at the core of machine learning. Choosing the right algorithms is a trade-off between speed, accuracy, and complexity. Starting with simple combinations and then adding complexity is the core of working with machine learning, while constantly keeping in mind what type of data is being dealt with. Empirical study gives answers as to which algorithm to use and which parameters to choose. Knowing beforehand which method will work best is almost impossible. Constantly iterating over different combinations of similar methods with a systematic workflow, using Weka, is a way forward. New and easily accessible software packages make it easier to spot and exploit new research areas that were previously inaccessible due to low computing capability.

REFERENCES

[1] Y-T. Chen, “The Factors Affecting Electricity Consumption and Sector – A Case of Taiwan”, 2017.
[2] A. Tsanas, A. Xifara, “Accurate quantitative estimation of energy performance of residential building using statistical machine learning tools”, Science Direct, 2012, p 9.
[3] A. Yezioro, “An applied artificial intelligence approach towards assessing building performance simulation tools”, Energy and Buildings, 2007, p 40.
[4] T. Catalina, J. Virgone, “Cooling energy demand evaluation by means of regression models”, Proceedings of the Eleventh International Conference Enhanced Building Operations, New York City 2011, pp 6.
[5] D. Datta, S. A. Tassou, D.
Marriot, “Application of Neural Networks for the Prediction of the Energy Consumption”, 1997.
[6] Mathworks, “Mastering Machine Learning: A Step-by-Step Guide with MATLAB.” Available at: https://www.mathworks.com/campaigns/offers/mastering-machine-learning-withmatlab.confirmation.html?ab_test=b_version.
[7] J. Brownlee, “Machine Learning Mastery With Weka”, Wellington: Jason Brownlee 2019.
[8] X. Yan, X. Su, “Linear Regression Analysis: Theory and Computing”, World Scientific, 2009.
[9] D. Natingga, “Data Science Algorithms in a Week”, 2017.
[10] S. Kalmegh, “Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News.”, IJISET - International Journal of Innovative Science, Engineering and Technology, 2015, Vol. 2 Issue 2.
[11] S. K. Shevade, “Improvements to the SMO Algorithm for SVM Regression”, IEEE Transactions on Neural Networks, 2000, vol. 11, no. 5-6.
[12] P. Thomas, M. C. Suhner, “A new Multilayer Perceptron Pruning Algorithm for Classification and Regression Applications”, Neural Processing Letters, Springer Verlag, 2015, p 31.
[13] M. S. Raza, U. Qamar, “Understanding and Using Rough Set Based Feature Selection – Concepts, Techniques and Applications”, Springer, 2017.
[14] W. Pessenlehner, A. Mahdavi, “Building Morphology, Transparence and Energy Performance”, Eight International IBPSA Conference, Netherlands, Eindhoven, 2003.
[15] M. Kosinski, Y. Wang, “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images”, Journal of Personality and Social Psychology, 2018.
[16] J. Christian, “Statistician: Machine Learning Is Causing A Crisis in Science”, Available: https://futurism.com/machine-learning-crisis-science.
[17] A. Bajek, A. Hasandić, “Energy Efficiency of the buildings.” Sarajevo: International Burch University 2017.

Dublin Core
The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections.
For more information see, http://dublincore.org/documents/dces/.

Title: Journal of Natural Sciences and Engineering
Identifier: 2637-2835
DOI: 10.14706
Publisher: International Burch University
Description: Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary and methodological work, with a preference for empirical research, a critical approach and problem-solving methods in manuscripts.
Language: English

Title: Quantitative estimation of cooling load capabilities of residential buildings using machine learning
Author: Nedret Bećirović, Ismail Bejtović, Jasmin Kevrić
Abstract:
Based on previous research on energy efficiency of the buildings, particularly their cooling load capabilities, we will develop a collection of machine learning methods for detecting buildings with best cooling load capabilities. This collection will study the influence of 8 input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) on one output parameter, that is cooling load of buildings. The results of this study support the practicability of using machine-learning software to estimate building parameters as a convenient and accurate approach, as long as the methods chosen are well suited for the type of data in question.
Keywords: cooling load, energy efficiency, machine learning, neural network.
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021315

https://eprints.ibu.edu.ba/files/original/27eac86735340248e4eda9d6b63e242a.pdf dfc7fdc2237adebaad2030ec2e8f4107 PDF Text Text

Leveraging Raspberry Pi as a server for the integration of the NETCONF protocol within IoT systems based on YANG

Dalibor Đumić1, Slobodan Lubura2
1 International Burch University, Sarajevo, Bosnia and Herzegovina
2 University of East Sarajevo, East Sarajevo, Bosnia and Herzegovina
dalibor.dumic@stu.ibu.edu.ba slobodan.lubura@ets.ues.rs.ba

Abstract – This paper presents the idea of using a Raspberry Pi as a server for integrating the Network Configuration Protocol (NETCONF), an emerging network management protocol, into IoT systems based on YANG. The practical realization of this idea requires implementing the NETCONF protocol together with REpresentational State Transfer (RESTful) web services. This realization opens additional possibilities in domotics systems, which are discussed in this paper.
Keywords – Django, domotics, Internet of Things, NETCONF, Raspberry Pi, RESTful web services, YANG

1. Introduction

Every home network contains heterogeneous devices that are expected to be connected. These devices differ from one another: they can be based on different hardware platforms, their controller services can be of a different nature, and the software components that enable network access can vary [1]. For example, when we compare IoT-based wearable technology, such as a smartwatch or wristband, with smart home devices, such as a washing machine or air conditioner, we notice different capabilities in terms of memory usage, processing speed, and power consumption [2]. IoT devices can therefore be broadly classified by their key characteristics:
● communication flows in the system,
● memory management,
● data manipulation and processing,
● power control and consumption.

For example, a smart coffee machine is not always powered on. It performs its task only when required, that is, when a user turns it on via a user interface such as a mobile application, either when the user wants a coffee or when the user is on the way home and wants the coffee to be ready on arrival. Devices of this kind consume less power for communication. Home automation systems contain many actuators that must be managed by systems connected to the Internet via network protocols [3].

The focus of this paper is the practical implementation of the methodology proposed in [4]. This methodology was developed through an empirical study of the NETCONF protocol, which is used as the network protocol connecting the gateway to the Internet. The gateway performs the management of sensors and devices in a home network and is based on RESTful technologies.

The paper is organized into five sections. Section 1 introduces IoT systems and the purpose of this paper.
In Section 2, the NETCONF protocol and its features are introduced. The proposed integration of the NETCONF protocol in the IoT is detailed in Section 3. The results of the proposed integration are presented in Section 4. The benefits of the proposed integration and the main conclusions are discussed in Section 5.

2. The Network Configuration Protocol (NETCONF) and its features

A. NETCONF

The Network Configuration Protocol (NETCONF) is a network management protocol that provides mechanisms for installing, manipulating, and deleting the configuration of the devices in a network. Its purpose is to manage network devices, retrieve their configuration data, and upload or manipulate new configuration data [5]. Devices on the network can thus take different states according to their configuration. To switch between a device's states, configuration datastores are used. By definition, a configuration datastore contains the set of configuration information needed to bring a device from its initial default state into a chosen operational state. NETCONF currently supports event notifications and the following configuration datastores [6]:
● "running" – this configuration is always present and is the currently active configuration,
● "startup" – this configuration is used at the next startup,
● "candidate" – this configuration can replace the currently running configuration through an explicit commit.

NETCONF operations make it possible to manipulate the device configuration. The operations are invoked as Remote Procedure Calls (RPCs) from the client to the server.
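As an illustration of an operation expressed as an RPC, the sketch below builds the XML for a request that reads one of the datastores listed above, using only Python's standard library. The function name and the fixed message-id are assumptions made for the example; the namespace is the NETCONF base namespace, and the request is only constructed here, not sent to a server.

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
DATASTORES = ("running", "startup", "candidate")

def build_get_config(datastore, message_id="101"):
    """Return a <get-config> request for the given datastore as an XML string."""
    if datastore not in DATASTORES:
        raise ValueError(f"unknown datastore: {datastore!r}")
    # Serialize the NETCONF namespace as the default namespace (no prefix).
    ET.register_namespace("", NETCONF_NS)
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": message_id})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    source = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    ET.SubElement(source, f"{{{NETCONF_NS}}}{datastore}")
    return ET.tostring(rpc, encoding="unicode")

request_xml = build_get_config("running")
print(request_xml)
```

Whether a server accepts "startup" or "candidate" as a source depends on the capabilities it announces, so a real client would check those before issuing such a request.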
Some of the operations are [6]:
● “commit” – commits the "candidate" configuration to "running",
● “copy-config” – copies one configuration datastore to another,
● “edit-config” – changes the contents of a configuration datastore,
● “get-config” – retrieves a configuration datastore,
● “lock” – prevents changes to a datastore by another party, and
● “unlock” – releases a lock on a datastore.

The configuration data stored on devices and the protocol messages exchanged between devices are encoded in Extensible Markup Language (XML) on both the client and the server side. Any script or application that runs as part of a network manager can act as the client. The server is typically a network device. A device on the network must support at least one NETCONF session.

The main NETCONF message exchange between client and server in a single NETCONF session [7] is illustrated in Figure 1. At the start, the device and the controller create a NETCONF session and share their lists of capabilities by sending <hello> messages. A capability describes a supported data model. After the session has started, the two sides exchange <rpc> and <rpc-reply> messages. The <rpc> message encloses a NETCONF command that is sent from the controller to the device. The <get> command in the <rpc> message is used to get the running configuration and state information of the device (3). The <edit-config> request is used to write a specific configuration to the device (5). The <rpc-reply> message is sent from the device to the controller in response to an <rpc> message. The response data for the invoked method is encoded as one or more child elements enclosed in the <rpc-reply> message.

Figure 1. NETCONF messages

The information that a client retrieves from the server consists of two parts: configuration data and state data [6].
The purpose of the configuration data is to describe the actions that will change a system from its previous state into the state described in the configuration data, while the purpose of the state data is to provide information such as read-only status data and collected statistics. The YANG data modeling language is used to specify NETCONF data models and operations.

B. YANG

To perform the NETCONF operations, a YANG module has to be defined as a hierarchy of data such as configuration data, state data, RPCs, and notifications. Defining the YANG module completes the description of all data sent between the NETCONF client and server. Each YANG module consists of statements, some of which are listed in Table 1 [8].

Table 1. YANG statements

Statement      Description
augment        Extends existing data hierarchies
choice         Defines mutually exclusive alternatives
container      Defines a layer of the data hierarchy
extension      Allows new statements to be added to YANG
feature        Indicates parts of the model that are optional
grouping       Groups data definitions into reusable sets
key            Defines the key leafs for lists
leaf           Defines a leaf node in the data hierarchy
leaf-list      A leaf node that can appear multiple times
list           A hierarchy that can appear multiple times
notification   Defines notification
rpc            Defines input and output parameters for an RPC
typedef        Defines a new type
uses           Incorporates the contents of a "grouping"

With the help of XML parsers and XSLT scripts, a YANG module can be translated into an equivalent XML syntax. Every YANG module has access to a set of built-in types and to a type mechanism through which additional types may be defined. The modeler of the YANG module can add constraints to the model to prevent impossible or illogical data.
The purpose of these constraints is to provide information about the data being sent from the server and to help a client understand which data the server will accept, so that the client avoids sending incorrect data to the server. Table 2 briefly describes some common YANG constraints [9].

Table 2. YANG constraints

Statement      Description
length         Limits the length of a string
mandatory      Requires the node to appear
max-elements   Limits the number of instances in a list
min-elements   Limits the number of instances in a list
must           XPath expression must be true
pattern        Regular expression must be satisfied
range          Value must appear in range
reference      Value must appear elsewhere in the data
unique         Value must be unique within the data
when           Node is only present when XPath expression is true

Generally speaking, a YANG module is a single data model that contains three types of statements:
● module-header statements – they describe the module and provide information about it,
● revision statements – they provide information about the history of the module,
● definition statements – they form the body of the module, where the data model is defined.

To use a YANG module, it first has to be defined, or modeled, for the specific problem domain. After that, the YANG module can be loaded, compiled, or coded into the server. A NETCONF server may implement any number of YANG modules [10].

3. Proposed Methodology

After the empirical study of the NETCONF protocol and its features, the implementation of the proposed integration was divided into two parts, server-side and client-side, as shown in Figure 2.

Figure 2. Both client and server sides are communicating over the Internet [4]

A.
Server-side

To implement the proposed integration, the following requirements are defined for the server hardware:
● small physical dimensions, because it has to be hidden in the home installation and not visible;
● the ability to boot the Linux operating system, since Linux is open-source;
● General Purpose Input Output (GPIO) pins for interfacing with the sensors and devices;
● an Ethernet port and/or WiFi module; and
● an ARM-based CPU for fast computing.

A single-board computer that matches these characteristics well is the Raspberry Pi 3 B+, which is based on a 1.4 GHz 64-bit quad-core ARM Cortex-A53 processor. An advantage of the Raspberry Pi is that its GPIO module can be used from several programming languages, such as C, C#, Python, and Java. Since the integration is implemented in the Python programming language, the Raspberry Pi is a very good match [11]. The server is connected via appropriate connection lines to the rooms of the home, as shown in Figure 3.

Figure 3. Raspberry Pi as server connected to sensors and devices in each room via GPIO line [4]

To build the server, Netopeer2, a set of tools implementing network configuration based on the NETCONF protocol, is installed [12][13]. Each room in a home has sensors and relays for controlling devices. For each room a custom YANG module is created, and each custom YANG module handles data such as temperature, humidity, open or closed status, turned off or turned on status, etc. Thanks to the custom YANG modules, the server can easily manage the information related to the sensors and relays in the home. The structure of the simplest custom YANG module for a room is shown in the Appendix.

B. Client-side

On the client side, any device that supports the NETCONF protocol can communicate with the server. The challenge, however, is to develop an application by means of RESTful services.
It should send RPC commands such as “edit-config” or “get-config” directly to the NETCONF server in order to retrieve information about the rooms in the user's home. Finally, its interface must be user-friendly and rich in data charts, data graphs, toggle buttons, etc.

The very first step is to develop a script that “talks” to the NETCONF server. In Python, it is possible to communicate with the server via the NETCONF protocol by using the ncclient library. The ncclient library provides an easy way to write client-side scripts around the NETCONF protocol and to develop applications on top of it [14]. The next step was to develop a web application and merge it with the script based on the ncclient library. There are many high-level Python web frameworks, and one of them is Django. Django encourages rapid development and clean, pragmatic design [15]. Combining Django and ncclient yields a user-friendly web application that fulfills its main purpose: to collect information about conditions such as temperature and humidity in the rooms of the user's home and to control the devices in those rooms, all over the NETCONF protocol.

4. Results

On the client side we have an application with both front end and back end developed in the Django framework, whose back end is merged with the ncclient module for interfacing with the server, as shown in Figure 4.

Figure 4. Developed client application

On the server side we have a Raspberry Pi computer booting Linux, which runs the Netopeer2 and sysrepo modules for enabling the NETCONF protocol and interfacing with the data through YANG modules. The Raspberry Pi is connected to several sensors and actuators, as shown in Figure 5:

Figure 5.
Raspberry Pi running as the NETCONF server

The URL of the recorded video of the methodology proposed in this paper can be found in the reference section [16]. A clip from the recorded video is shown in Figure 6, where two processes can be seen running in parallel: sysrepo and netopeer2.

Figure 6. Testing the proposed methodology

An overview of both client and server sides is shown in Figure 7.

Figure 7. Used technologies on both client and server sides

The complete overview of the proposed integration is shown in Figure 8.

Figure 8. Overview of the complete integrated system

5. Conclusion

The empirical study of the NETCONF protocol revealed its capabilities. The NETCONF protocol allows an unlimited number of YANG modules with different data structures. This characteristic is of crucial importance for using the protocol in home automation and similar systems. The proposed integration is therefore no longer a challenge. Thanks to the Django web framework and the ncclient Python library, it is possible to develop a rich web application that can run on many devices, such as single-board computers, desktop computers, notebooks, and even tablets.
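The client-side script described in the Client-side subsection (a Python script built on ncclient that issues “get-config” and “edit-config” against a room module) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names, port 830, the “running” datastore, and the disabled host-key check are assumptions, and only the namespace and leaf names are taken from the room1 module in the Appendix.

```python
import xml.etree.ElementTree as ET

# Namespace of the room1 YANG module listed in the Appendix.
ROOM_NS = "urn:sysrepo:room1"
# Standard NETCONF base namespace (RFC 6241).
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"


def room_filter() -> str:
    """Subtree filter that selects the room-data container."""
    return f'<room-data xmlns="{ROOM_NS}"/>'


def ac_status_config(on: bool) -> str:
    """Build an edit-config payload that sets the ac-status leaf."""
    value = "true" if on else "false"
    return (
        f'<config xmlns="{NETCONF_NS}">'
        f'<room-data xmlns="{ROOM_NS}"><ac-status>{value}</ac-status></room-data>'
        "</config>"
    )


def parse_room_data(reply_xml: str) -> dict:
    """Turn a get-config reply into {leaf name: text value}."""
    root = ET.fromstring(reply_xml)
    tag = f"{{{ROOM_NS}}}room-data"
    room = root if root.tag == tag else root.find(f".//{tag}")
    if room is None:
        return {}
    # Strip the "{namespace}" prefix that ElementTree puts on each tag.
    return {child.tag.split("}", 1)[1]: child.text for child in room}


def read_room_and_switch_ac(host: str, username: str, password: str, on: bool) -> dict:
    """Hypothetical helper: read the room data, then switch the AC."""
    # Imported lazily so the payload helpers work without ncclient installed.
    from ncclient import manager

    # Assumptions: default NETCONF-over-SSH port, "running" datastore,
    # host-key verification disabled for a lab setup only.
    with manager.connect(host=host, port=830, username=username,
                         password=password, hostkey_verify=False) as m:
        reply = m.get_config(source="running", filter=("subtree", room_filter()))
        m.edit_config(target="running", config=ac_status_config(on))
        return parse_room_data(reply.data_xml)
```

In a Django back-end, a view could call a helper like `read_room_and_switch_ac` and pass the returned dictionary to a template; keeping the XML construction in small pure functions makes that part testable without a running NETCONF server.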
APPENDIX

Implemented module for a room in the YANG language:

    module room1 {
        namespace "urn:sysrepo:room1";
        prefix r1;

        description "The room yang module.";

        revision 2019-09-14 {
            description "Initial revision.";
        }

        container room-data {
            description "Room 1 info.";

            leaf temperature {
                description "Actual temperature inside the room.";
                type uint8 {
                    range "0..125";
                }
            }

            leaf humidity {
                description "Actual humidity inside the room.";
                type uint8 {
                    range "0..100";
                }
            }

            leaf ac-status {
                description "Informs whether the AC is switched on or off.";
                type boolean;
            }
        }
    }

ACKNOWLEDGMENT

Many thanks to the experts from the RT-RK Institute for Computer Based Systems in Banja Luka, who contributed to and influenced the development of this research from the early stages of the project.

REFERENCES

[1] M. Tooba, A. Muhammad and A. M. Martinez-Enriquez, "Smart Solution for Heterogeneous Device Interoperability in IoT," 2018 Seventeenth Mexican International Conference on Artificial Intelligence (MICAI), Guadalajara, Mexico, 2018, pp. 70-75.
[2] Van den Abeele, F., Hoebeke, J., Moerman, I., & Demeester, P. (2015). Integration of Heterogeneous Devices and Communication Models via the Cloud in the Constrained Internet of Things. International Journal of Distributed Sensor Networks.
[3] Vijay S., Banga M.K. (2018) Management of IoT Devices in Home Network via Intelligent Home Gateway Using NETCONF. In: Kumar N., Thakre A. (eds) Ubiquitous Communications and Network Computing. UBICNET 2017. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 218. Springer, Cham
[4] D. Đumić, S. Došlić, M. Antić, B. Milić, “Integration of the NETCONF Protocol in the Internet of Things by means of RESTful Web Services”, 6th International Conference on Electrical, Electronic and Computing Engineering IcETRAN, pp. 983 - 987, ETRAN Society, June 2019
[5] R. Enns, M. Bjorklund, J. Schoenwaelder and A.
Bierman, “Network Configuration Protocol (NETCONF)”, Internet Engineering Task Force (IETF), ISSN: 2070-1721, June 2011. [Online]. Available: https://tools.ietf.org/html/rfc6241
[6] H. Ji, B. Zhang, G. Li, X. Gao and Y. Li, "Challenges to the New Network Management Protocol: NETCONF," 2009 First International Workshop on Education Technology and Computer Science, Wuhan, Hubei, 2009, pp. 832-836, doi: 10.1109/ETCS.2009.189.
[7] M. Dallaglio, N. Sambo, F. Cugini and P. Castoldi, "Management of sliceable transponder with NETCONF and YANG," 2016 International Conference on Optical Network Design and Modeling (ONDM), Cartagena, 2016, pp. 1-6.
[8] M. Dallaglio, N. Sambo, F. Cugini, P. Castoldi, “Management of sliceable transponder with NETCONF and YANG”, International Conference on Optical Network Design and Modeling, pp. 1 – 6, IEEE, May 2016
[9] P. Shafer, “An Architecture for Network Management using NETCONF and YANG”, Internet Engineering Task Force (IETF), ISSN: 2070-1721, June 2011, [Online]. Available: https://tools.ietf.org/id/draft-ietf-netmod-arch-07.html
[10] M. Bjorklund, “YANG – A Data Modeling Language for the Network Configuration Protocol (NETCONF)”, Internet Engineering Task Force (IETF), ISSN: 2070-1721, October 2010. [Online]. Available: https://tools.ietf.org/html/rfc6020
[11] The Raspberry Pi Foundation. “Raspberry Pi 3 Model B+”, [Online], Available: https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
[12] Czech Educational and Research Network (CESNET), “Netopeer2 – The NETCONF Toolset”, [Online], Available: https://github.com/CESNET/Netopeer2
[13] sysrepo - YANG-based datastore for Unix/Linux application, [Online], Available: http://www.sysrepo.org/static/doc/html/start_page.html
[14] S. Bhushan, L.
Poulopoulos, Python library for NETCONF clients, [Online], Available: http://ncclient.readthedocs.org/
[15] Django Software Foundation, [Online], Available: https://docs.djangoproject.com/en/3.0/
[16] NETCONF Protocol + Raspberry Pi + Django = Home Automation || Yugoscientiz © 2019, [Online], Available: https://www.youtube.com/watch?v=ZoiYGt2NbCA

Dublin Core

The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see http://dublincore.org/documents/dces/.

Title: Journal of Natural Sciences and Engineering
Identifier: 2637-2835
DOI: 10.14706
Publisher: International Burch University
Description: Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences. It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary, and methodological work, with a preference for empirical research, a critical approach, and problem-solving methods in manuscripts.
Language: English

Dublin Core

Title: Leveraging Raspberry Pi as a server for the integration of the NETCONF protocol within IoT systems based on YANG
Author: Dalibor Đumić1, Slobodan Lubura
Abstract: Herein the idea of leveraging Raspberry Pi as a server for the integration of an incipient network management protocol, the Network Configuration Protocol (NETCONF), within IoT systems based on YANG is presented. The practical realization of this idea requires the implementation of the NETCONF protocol together with REpresentational State Transfer (RESTful) web services. This practical realization opens new possibilities in domotics systems, which are discussed in this paper.
Keywords: Django, domotics, Internet of Things, NETCONF, Raspberry Pi, RESTful web services, YANG
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021314

https://eprints.ibu.edu.ba/files/original/277ccb2bcc1a93885c5603d23beeeaa1.pdf

Journal of Natural Sciences and Engineering, Vol. 1, (2020)
DOI number: 12.34567/JONSAE2020123

Student Attendance Pattern Detection and Prediction

Ibrahim Muzaferija1, Zerina Mašetić2, Samed Jukić3, Dino Kečo4
1 International Burch University, Sarajevo, Bosnia and Herzegovina
ibrahim.muzaferija@stu.ibu.edu.ba zerina.masetic@ibu.edu.ba samed.jukic@ibu.edu.ba dino.keco@ibu.edu.ba

Abstract – Since the early beginnings of education systems, attendance has always played a crucial role in student success, as well as in the overall interest of the matter. The most productive way of increasing the student attendance rate is to understand why it decreases, try to predict when it is going to happen, and act on the causing factors in order to prevent it.
Many benefits of a predicted and increased attendance rate can be achieved, including better lecture organization (e.g., lecture time and duration, lecture class choice). This paper describes the steps in the extraction of knowledge from the university's student database and the creation of a model that predicts whether the student will attend the class or not. Results show that the attendance patterns are best reflected by a decision tree algorithm, C4.5, which yields a model that is interpretable and able to predict attendance with an AUC of 0.81.

Keywords - Data Mining, Educational Data Mining, Machine Learning

1. Introduction

Data mining (DM) is an approach to discovering useful information in data. It uses statistical and machine learning (ML) techniques to operate on large volumes of data and discover hidden patterns and relationships that describe the behavior of the systems that produced the data. The relationships and patterns discovered provide helpful insight for decision making and for making predictions, thus helping to solve numerous problems. In recent years, there has been an increase in the use of ML techniques in many fields, such as education, economics, business, statistics, medicine, and sport. The main objective of this paper is to apply ML techniques in the educational field to analyze student behavior and to predict whether the student will attend the class.

Traditionally, educational institutions collect large volumes of data related to students, classes, faculty members, and educational processes. However, the collected data is often not analyzed enough to provide significant results. In general, it is used for producing simple reports that contribute little to the decision-making process in the institutions.

Currently, educational systems aim to enhance the teaching and learning process by carefully analyzing collected data and discovering patterns related to student behavior and final outcomes. The aims are to identify which students will perform well, so that they can be awarded scholarships, and, more importantly, to identify the students who may fail, so that some form of help and assistance may be offered to them. Besides identifying students by their performance, it is also important to discover which aspects of teaching and learning systems facilitate student learning and success. One of the aspects closely related to student performance is student attendance: students who have a higher attendance rate also have a higher success rate in the end [1].

The paper is structured in seven sections:
1. Introduction section;
2. The previous work section describes the previous efforts on the topic;
3. The methods and materials section describes data cleaning and processing steps;
4. The model creation section describes model selection and creation methodology;
5. The results section provides model results and evaluation;
6. In the discussion section, a comparison between this study and previous studies is made;
7. The conclusion section provides recommendations for future work in the area of educational data mining.

2. Previous Work

Gurmeet Kaur and Williamjit Singh [2] applied machine learning methods from the WEKA tool in order to predict the performance of students from the College of Science and Technology – Khan Younis. Their work concluded with two classification algorithms, Naive Bayes and J48, which provided accuracies of 63.59% and 63.53%, respectively.

C. Anuradha and T. Velmurugan [3] conducted a comparative analysis of the evaluation of classification algorithms in the prediction of students' performance.
The dataset was obtained from the college database and contains 19 attributes that describe the student, the student's family, and the living environment, as well as previous performance. Their goal was to compare algorithms in predicting students’ performance in end-of-semester examinations. The results show that Bayesian classifiers, as well as JRip and J48, had the highest accuracy, which is very close to 70%.

Abeer Badr El-Din Ahmed and Ibrahim Sayed Elaraby [4] describe the importance of Educational Data Mining (EDM) and Knowledge Discovery in Databases (KDD) in achieving the main goal of higher education institutions, that is, providing quality education to students. They used classification algorithms to identify those students who needed special attention, in order to reduce the failing ratio by taking appropriate action at the right time, resulting in a decrease of the failing ratio by more than 15%.

Anal Acharya and Devadatta Sinha [5] used a dataset that contains a large number of features describing a student. By applying feature selection algorithms such as Correlation-Based Feature Selection (CBFS) and Information Gain Attribute Evaluation (IGATE), they reduced the number of features and performed cross modeling with five machine learning algorithms: Decision Trees (DT), Bayesian Networks (BN), Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Multi-Layer Perceptron (MLP). Features related to gender, university, time, and family had the highest information gain, and the models created using decision tree algorithms provided 10-15% more reliable performance than the other classification algorithms.

The study conducted by Havan Agrawal and Harshil Mavani [6] confirms that past performance has a significant influence on current performance.
Further, they used neural network algorithms and confirmed that the accuracy of the algorithms increases with dataset size, meaning that with a larger dataset the algorithms generalize the problem better.

In this paper, we address the problem with a selection of the best-performing machine learning algorithms for EDM, as proposed by Anal Acharya and Devadatta Sinha [5] and Gurmeet Kaur and Williamjit Singh [2], such as Logistic Regression, Decision Tree, Rule-based, and k-NN. Moreover, a larger number of data samples is obtained in order to improve the algorithms' generalizing ability, in contrast to the number of data samples used in the previous study conducted by Gurmeet Kaur and Williamjit Singh [2].

3. Methods and Materials

The research is based on the CRISP-DM [7] methodology, as it describes common approaches used by data mining experts; the paper follows a simplified version of the processing model, shown below.

Figure 1. Data processing workflow

A. Data selection

Initial data was obtained from International Burch University’s Student Academic System [8] and contains 2nd-year student attendance data from the academic years 2016/2017 and 2017/2018. Although the dataset doesn't contain all the details about the students and their classes (such as the day of the week on which the class was held, exact start and end times of classes, professor ID, etc.), it is enough to extract the patterns of student attendance behavior and create a model that predicts it.

The data was obtained as an SQL file, and after importing the file into a local database, RapidMiner [9] was used to fetch the tables and store them in CSV format. All further operations were performed in RapidMiner, as it has the Weka [10] extension. The following table displays whether or not an attribute of the original dataset was copied over to the data mining dataset.
All the selected attributes were considered relevant to the task of predicting student attendance in classes.

Table 1. Initial dataset attribute selection

Table | Attribute | Accepted | Notes
students | student_id | x | No need for additional IDs
students | student_number | x | No need for additional IDs
student_courses | student_id | ✓ | Student ID
student_courses | course_code | ✓ | Course ID
student_courses | branch | x | Same values in other tables
student_courses | year | x | Same values in other tables
student_courses | semester | x | Same values in other tables
student_attendance | student_id | ✓ | Student ID
student_attendance | attendance_id | ✓ | Class attendance ID
course_attendance | attendance_id | ✓ | Class attendance ID
course_attendance | course_code | ✓ | Course ID
course_attendance | branch | ✓ | Branch
course_attendance | year | ✓ | Year
course_attendance | semester | ✓ | Semester number
course_attendance | course_date | ✓ | Starting date of the week in which class was held
course_attendance | type | ✓ | Type of the class
course_attendance | topic | x | Not relevant / High cardinality
course_attendance | duration | ✓ | Duration of the class

B. Data Cleansing

In order to get an insight into data quality, graphical and statistical methods were used to detect anomalies, faults, outliers, missing values, etc. First, the dataset was divided into four parts: 1st semester of 2016, 2nd semester of 2016, 1st semester of 2017, and 2nd semester of 2017. After examination, the data related to both semesters of the year 2016 contained no anomalies and were consistent, and were therefore labeled as clean data. In contrast, the 2nd semester of the year 2017 contained incomplete data due to a university system failure (class attendance from the last 2 weeks is missing), and the 1st-semester data of that year were not consistent (having a huge number of recorded attendances in the 14th week and almost none in the 15th week). The dataset also contained automatic attendance values that were irrelevant for creating a model, and those samples were removed. Some attendance samples recorded before and after the semester were marked as outliers.
Samples related to midterm and final exams showed a decrease in recorded attendances due to the nature of exam weeks: instead of multiple lectures in those weeks, only one session, the exam, was held. Those samples were not relevant to predicting lecture attendance and were discarded.

C. Deriving Data

From the course_date attribute, containing the starting date of the week in which the class was held, the week attribute was derived, containing the week number in the semester. The attribute attended is added to the table student_attendance and contains the value 1, which reflects that the student attended the class. Later, when joining tables, this attribute will have missing values, which indicate that the student did not attend the class.

The dataset contains only the records of students that attended the class and no records of students that did not attend. In order to populate the attribute attended with whether or not the student attended the class, joining the tables is necessary. First, by performing an inner join of the student_courses and course_attendance tables, matching course_code from one table with course_code from the other, a new table is created containing a matched list of students per course attendance ID.

Next, by performing a left join of the previously created table and the student_attendance table, matching both attendance_id and student_id from one table with attendance_id and student_id from the other, a new table is created containing attendance values where the student attended the class and missing values where the student was absent. Finally, missing values were replaced with 0, indicating that the student was absent.

D. Dataset Creation

During the data preparation phase, the attributes considered most relevant were selected to shape the model's prediction capabilities.
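The derivation steps from the Deriving Data subsection (week number from course_date, inner join on course_code, left join on student_id and attendance_id, missing values replaced with 0) can be sketched in plain Python as follows. The paper performs these steps in RapidMiner; the function names, the tuple layouts, and the semester_start parameter below are hypothetical and serve only to illustrate the logic.

```python
from datetime import date


def week_in_semester(course_date: date, semester_start: date) -> int:
    """1-based week number of course_date, counted from semester_start."""
    return (course_date - semester_start).days // 7 + 1


def label_attendance(student_courses, course_attendance, student_attendance):
    """Return (student_id, attendance_id, attended) rows, attended in {0, 1}.

    student_courses:    iterable of (student_id, course_code)
    course_attendance:  iterable of (attendance_id, course_code)
    student_attendance: iterable of (student_id, attendance_id); only
                        students who were present have a record here
    """
    # Index class sessions by course so the join below is a simple lookup.
    sessions_by_course = {}
    for attendance_id, course_code in course_attendance:
        sessions_by_course.setdefault(course_code, []).append(attendance_id)

    # Inner join on course_code: every session each enrolled student
    # was expected to attend.
    expected = [
        (student_id, attendance_id)
        for student_id, course_code in student_courses
        for attendance_id in sessions_by_course.get(course_code, [])
    ]

    # Left join against the recorded attendances; a missing match
    # becomes 0 (absent), a match becomes 1 (attended).
    present = set(student_attendance)
    return [(s, a, 1 if (s, a) in present else 0) for s, a in expected]
```

The key point is that absences are never stored explicitly: they only appear once the list of expected (student, session) pairs is compared against the recorded attendances.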
Then, using the RapidMiner tool, all data was cleaned and exported as a CSV dataset to be used in training and testing the model. The final dataset contains about 58,000 attendance samples from the 2nd semester of the year 2016. The following table displays qualitative and quantitative aspects of all the attributes present in the final dataset. The goal attribute (or prediction class) is “attended”, which indicates whether the student attended the class (marked as 1) or not (marked as 0).

Table 2. Final dataset attribute description

Attribute | Data type | Range | Missing values | Distinct values | Unique values | Statistics
id | integer | [1,58019] | 0 | 58019 | 58019 | -
attended | integer | 0,1 | 0 | 2 | 0 | Least: 1 (21327), Most: 0 (36692)
course_code | nominal | MAN 201, (...) | 0 | 85 | 0 | Least: IRES 305 (5), Most: MAN 201 (6784)
branch | nominal | A,B,C,D,E,F | 0 | 6 | 0 | Least: D (1628), Most: A (37368)
type | nominal | recitation, lecture, lab | 0 | 3 | 0 | Least: recitation (1954), Most: lecture (46511)
duration | integer | [1,4] | 0 | 4 | 0 | Min: 1, Max: 4, Average: 1.684
week | integer | [1,15] | 0 | 15 | 0 | Min: 1, Max: 15, Average: 7.861

4. Model Creation

This machine learning problem is a classification problem [11]. In order to reach the business goal, a complete understanding of the data is required to generate the model. Several modeling algorithms exist for classification problems, and they are shown in the table below. In order to correctly create, evaluate, and validate the model, one of the key steps is the separation of the data into training, testing, and validation sets.

Table 3.
Machine Learning algorithms

Type | Name
Functions | Logistic Regression
Trees | ID3 (Decision Tree)
Trees | C4.5 (J48)
Trees | Random Forest
Rules | One-Rule
Rules | PRISM
Memory-Based | k-NN

The most convenient method for separating training and testing data is cross-validation [12], as it splits the data into folds and repeats training and testing with different folds. The cross-validation is conducted using five folds of training data. Validation data is not used in cross-validation, in order to provide reliable testing results at the end.

5. Results

All the decision tree algorithms had the minimal gain set to “0.01”, in order to prevent premature pruning of the tree branches, and the pruning confidence threshold set to “0.25”. Other model settings were kept at their default values because they are preselected for optimal model performance. After applying the repeated training and testing procedure known as cross-validation [13], building the models with different algorithms yielded promising results, as shown by metrics such as accuracy, the area under the curve (AUC), precision, recall, fallout, and F-measure [14]. Moreover, the models were evaluated on the held-out validation data, and the results match the cross-validation testing results presented below.

Table 4. Evaluations of created models

Algorithm | Accuracy | AUC | Precision | Recall | Fallout | F-Measure
Logistic Regression | 75.37% | 0.803 | 71.09% | 55.63% | 13.16% | 62.41%
ID3 | 68.38% | 0.697 | 56.20% | 63.31% | 28.68% | 59.54%
C4.5 | 77.41% | 0.812 | 73.04% | 61.12% | 13.12% | 66.55%
Random Forest | 66.48% | 0.700 | 56.41% | 38.73% | 17.39% | 45.92%
One-Rule | 74.60% | 0.500 | 69.25% | 55.65% | 14.39% | 61.69%
PRISM | 64.15% | 0.500 | 71.90% | 4.07% | 0.93% | 7.70%
k-NN | 70.42% | 0.672 | 58.13% | 69.86% | 29.25% | 63.45%

The machine learning algorithm that creates the most accurate model is the decision tree algorithm known as C4.5.
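As a cross-check of how the metrics in Table 4 relate to raw counts, the following sketch recomputes accuracy, precision, recall, fallout, and F-measure for the class “attended = 1” from the C4.5 confusion-matrix counts reported in Table 5. The function name and the dictionary keys are illustrative choices, not part of the original study; the formulas are the standard binary-classification definitions.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fallout": fp / (fp + tn),  # false positive rate
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Counts for the C4.5 model, positive class = attended (1), from Table 5:
# predicted 1 / true 1 = 13036, predicted 1 / true 0 = 4814,
# predicted 0 / true 1 = 8291,  predicted 0 / true 0 = 31878.
c45 = binary_metrics(tp=13036, fp=4814, fn=8291, tn=31878)
for name, value in c45.items():
    print(f"{name}: {value:.2%}")
```

The recomputed values agree with the C4.5 row of Table 4 to within rounding (the precision differs only in the last reported digit, 73.03% versus 73.04%).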
The reason is its enhanced method of tree pruning, which reduces misclassification errors due to noise and excessive detail in the training data set, as described in the study conducted by Anuja Priyam et al. [15]. The accuracy of the model is fairly satisfying, considering that previous works reported accuracies of less than 70%. In contrast to the previously mentioned studies, our data set contains more examples and thus produces a more accurate prediction model. This process allows the extraction of relevant information from the model and helps draw the lines of action for this business problem.

Table 5. Confusion matrix for C4.5 model

 | true 0 | true 1 | class precision
predicted 0 | 31878 | 8291 | 79.36%
predicted 1 | 4814 | 13036 | 73.03%
class recall | 86.88% | 61.12% |

With regard to interpretability, the decision tree generated by the C4.5 algorithm is easy to interpret: the size of the generated tree is 357 and the number of leaves is 230. The most important attribute in the dataset, as taken from the model, is the course code.

Furthermore, it is wrong to assume that a student attending classes has the same cost, from a business perspective, as one that never goes to class. Students that attend classes are beneficial, and students that miss classes have a cost. With that in mind, the model needs to help in finding the solutions that decrease the overall cost. There are four possibilities:
1. We predicted the student would attend class and he did;
2. We predicted the student would not attend class and he did not;
3. We predicted the student would attend class, but he did not;
4. We predicted the student would not attend class, but he did.

Point 1 is the best scenario, so it needs to have a negative cost (to be a benefit). Point 2 is the worst case, so it needs to have the highest cost. Point 3 is also negative, but not as negative as the previous one.
Point 4 is positive, but not as good as the first point. With that information, it is possible to build a cost matrix for the class “Attended”:

Table 6. Cost matrix for the model

 | Actual T | Actual F
Prediction T | -15 | 15
Prediction F | -5 | 5

Building the cost matrix doesn’t affect the model’s performance, but it aids the final outcome of prediction by introducing the business bias and aiming to increase the business value.

6. Discussion

A possible issue with the study conducted by Gurmeet Kaur and Williamjit Singh [2] is the small number of instances (as low as 52) contained in the dataset and used to build the model. In order to make a model more accurate and better able to generalize, Havan Agrawal and Harshil Mavani [6] propose using a higher number of instances, which made the model described in this paper more accurate. Moreover, cross-validation, as one of the extra steps taken in model construction, increased the model’s overall ability to generalize and provided higher accuracy than the models in previous studies.

While conducting the research, it was noticed that the quantity and quality of data play a crucial role in the final outcome and performance. We strongly advise using a high number of instances in future studies, and a continuous stream of attendance data in deployed models to continuously train the model, as the trends responsible for the dynamic behavior of student attendance change over time.

The feature engineering task in the data preparation step has yielded significant model improvement compared to the models from previous studies, which were built without deriving new attributes. Moreover, the induction of external data has also improved the performance of the model as outliers were removed.

7. Conclusion

This study has shown that patterns in student attendance exist and can be used to predict whether the student will attend the class. The importance of student data quantity and quality is presented, as well as the methods for cleaning and transforming the data.
The creation of a machine learning model should include cross-validation as one of the key steps, and we advise using multiple algorithms to achieve the best results. When there is a business value to achieve, it is recommended to use a cost matrix to further adjust the model and increase the business value. The model for predicting student attendance can be used to address the causing factors and increase the attendance ratio, which will subsequently increase the passing ratio, i.e., the number of students that graduate. Future work can include an increase in the number of data set examples, as well as an increase in dimensionality by adding attributes for external factors of students’ attendance, such as the professor who held the lecture and the weather on the day.

REFERENCES

[1] A. S. N. Kim, S. Shakory, A. Arman, C. Popovic, and L. Park, “Understanding the impact of attendance and participation on academic achievement,” 2019. [Online]. Available: https://doi.org/10.1037/stl0000151. [Accessed: 14-Feb-2020].
[2] “Prediction Of Student Performance Using Weka Tool,” Vidya Publications. [Online]. Available: http://ijoes.vidyapublications.com/paper/Vol17/02-Vol17.pdf. [Accessed: 26-Nov-2018].
[3] “A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Students Performance.” [Online]. Available: http://www.indjst.org/index.php/indjst/article/view/74555/58051. [Accessed: 26-Nov-2018].
[4] A. B. El-Din Ahmed and Ibrahim Sayed Elaraby, “Data Mining: A prediction for Student’s Performance Using Classification Method,” HR PUB. [Online]. Available: http://www.hrpub.org/download/20140105/WJCAT3-13701793.pdf. [Accessed: 26-Nov-2018].
[5] “Early Prediction of Students Performance using Machine Learning Techniques,” Semantic Scholar. [Online]. Available: https://pdfs.semanticscholar.org/6447/4a9172a97cdf5d39c6fdcc21fc0c61fc7df3.pdf.
[Accessed: 26-Nov-2018].
[6] “Student Performance Prediction using Machine Learning.” [Online]. Available: http://www.ece.uvic.ca/~rexlei86/SPP/otherswork/V4I3-IJERTV4IS030127.pdf. [Accessed: 26-Nov-2018].
[7] “IBM Knowledge Center.” [Online]. Available: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.crispdm.help/crisp_overview.htm. [Accessed: 19-Dec-2018].
[8] International Burch University, “Home,” International Burch University. [Online]. Available: https://www.ibu.edu.ba/. [Accessed: 19-Dec-2018].
[9] “Lightning Fast Data Science Platform for Teams | RapidMiner©,” RapidMiner, 19-Jan-2016. [Online]. Available: https://rapidminer.com/. [Accessed: 19-Dec-2018].
[10] “Weka 3 - Data Mining with Open Source Machine Learning Software in Java.” [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 19-Dec-2018].
[11] “[No title].” [Online]. Available: https://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf. [Accessed: 10-Nov-2019].
[12] “[No title].” [Online]. Available: https://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf. [Accessed: 10-Nov-2019].
[13] “3.1. Cross-validation: evaluating estimator performance — scikit-learn 0.21.3 documentation.” [Online]. Available: https://scikit-learn.org/stable/modules/cross_validation.html. [Accessed: 10-Nov-2019].
[14] L. Egghe, “The measures precision, recall, fallout and miss as a function of the number of retrieved documents and their mutual interrelations,” Inf. Process. Manag., vol. 44, no. 2, pp. 856–876, Mar. 2008.
[15] “Comparative Analysis of Decision Tree Classification Algorithms” [Online]. Available: https://inpressco.com/wp-content/uploads/2013/03/Paper17334-3371.pdf. [Accessed: 05-July-2020].

Dublin Core

Title: Student Attendance Pattern Detection and Prediction
Author: Ibrahim Muzaferija1, Zerina Mašetić2, Samed Jukić3, Dino Kečo4
Keywords: Data Mining, Educational Data Mining, Machine Learning
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021313

https://eprints.ibu.edu.ba/files/original/2ce3ee80befbef6a1b69b1e7067b0262.pdf

Journal of Natural Sciences and Engineering, Vol. 3, (2020)
DOI number: 10.14706/JONSAE2019114

Overview of Human Lineage Genetic Marker Studies in Bosnia and Herzegovina: Y chromosome story

Aldin Pirić1, Sabahudin Ćordić1, Lejla Smajlović-Skenderagić1, Serkan Dogan1, Damir Marjanović1,2
1 International Burch University, Sarajevo, Bosnia and Herzegovina
2 Institute for Anthropological Research, Zagreb, Croatia
aldin.piric@stu.ibu.edu.ba sabahudin.cordic@stu.ibu.edu.ba l.smajlovic.skenderagic@ibu.edu.ba serkan.dogan@ibu.edu.ba damir.marjanovic@ibu.edu.ba

Abstract – Modern Bosnia and Herzegovina is a state consisting of multiple ethnicities and regions, located in the Western Balkans, with a very complex history. The earliest historical findings show that its area has been inhabited since the Paleolithic.
From that time, this part of Europe, especially the region of modern Bosnia and Herzegovina, could be recognized as a crossroads for different human migrations and a meeting point for different cultures, religions and gene pools. Mitochondrial DNA is used for maternal lineage testing, while the Y chromosome is used for paternal lineage testing. Therefore, these markers are referred to as lineage markers. Lineage markers are often used for parental lineage monitoring in population genetics, human genetics, as well as in forensic genetics. The main intention of this paper is to construct a short overview of the Y chromosome studies performed in Bosnia and Herzegovina within the last two decades.

Keywords - Bosnia and Herzegovina, lineage markers, molecular markers, population genetic studies, Y chromosome

1. Introduction

Existing archeological artifacts show that the territory of Bosnia and Herzegovina has been populated since the Neolithic [1]. However, some archeological findings imply that the first inhabitants settled here in the Paleolithic era [2]. In the early Bronze Age, Indo-European tribes known as the Illyrians settled in various regions of the modern territory of Bosnia and Herzegovina [3]. The tribes were governed by the Romans for more than five centuries [4]. During that time, many residents of the Roman Empire, including Roman soldiers, settled in the region [1]. After the fall of the Roman Empire, this area remained a borderline between the Eastern and Western empires, which encouraged various tribes, such as the Avars, the Slavs, and others, to invade this region on a massive scale. Additionally, two important events, along with several other historical episodes, significantly impacted the structure of the B&H human population.
The first was the series of large migration waves from the north-east (especially intense during the 6th and 7th centuries), which moved different Gothic, Avar and Slavic clans into the area. The second was the expansion of the Ottoman Empire into this part of the Balkans in the fifteenth century [5]. All these historical episodes left their imprint on the population structure of modern B&H inhabitants and created fascinating genetic diversity within it. Therefore, it is not surprising that the modern B&H population is one of the most genetically studied regional populations, especially by means of so-called lineage genetic markers.

Unlike autosomal markers, Y-linked and mitochondrial markers are not reshuffled in each generation, but are instead passed down from one generation to the next, with the only differences being introduced by mutations. For these reasons, these markers are often used for parental lineage monitoring in population genetics, human genetics, as well as in forensic genetics. Mitochondrial DNA is used for maternal lineage testing, while the Y chromosome is used for paternal lineage testing. Therefore, these markers are referred to as lineage markers [6].

Previously published papers presented a short historical overview of earlier human population studies in Bosnia and Herzegovina, conducted within the last three centuries [7,8]. However, the use of lineage markers was only briefly noted in those papers. The expansion of human population studies based on these genetic markers, as well as the significance of the results obtained, prompted us to pay more attention to this part of B&H population genetics. Therefore, this paper elaborates extensively on the use of Y chromosome DNA markers in the analysis of the B&H human gene pool, including the most recent data published after the previously mentioned papers.

2. Human Y Chromosome as Genetic Marker

The Y chromosome has been described in many ways, for example as a “nonrecombining desert” and a “gene-poor chromosome”. Compared to other chromosomes, the Y chromosome has a low number of genes, with half of its sequence consisting of repeated elements. Moreover, it lacks the ability to recombine and is in continuous decay. The Y chromosome is inherited patrilineally, i.e., from father to son, meaning that all males from the same patrilineal lineage would have an identical profile, apart from differences introduced by mutation. The relatively small degree of molecular diversity between markers located on this chromosome comes from the absence of recombination along 95% of its length, with random mutation as the only possible source of polymorphisms [6].

Under the Denver convention criteria, the human Y chromosome is classified as a group G chromosome, that is, in the category of the shortest chromosomes in the human set, which also includes chromosomes 21 and 22. It contains about 50 million base pairs, which makes up around 1.8% of the total human genome. The Y chromosome contains important information used in determining the paternal lineage of a specific male. This is possible because the Y chromosome contains highly polymorphic regions. The human Y chromosome is present as a single copy in normal males, inherited from the father, and, as already mentioned, 95% of its length does not undergo recombination. Only 5% of this chromosome can potentially interact with the X chromosome; this interacting region is called the pseudoautosomal region of the Y chromosome [6].

The Y chromosome plays an important role in forensic analyses of rape cases, in particular those involving more than one man, and especially in mixed samples containing an overwhelming amount of female DNA.
Y-STR (Short Tandem Repeat) and Y-SNP (Single Nucleotide Polymorphism) markers are useful in parentage testing or in testing more distant kinship through the male line, when the children are male, and in identification when only kin from the father’s line are known. In addition, the Y chromosome is increasingly used in human migration studies because it does not undergo recombination during the transfer of genetic material between generations [9]. In fact, after the first Y chromosome polymorphism was published [10], an entire decade passed before binary, and later STR, markers located in the non-recombining region of the Y chromosome (NRY) found wider application in phylogenetic studies that monitor human migration patterns through the construction of phylogenetic trees [11]. SNP patterns can be used to determine lineages, which are referred to as haplogroups. Haplogroups can also be inferred from readily available Y-STR genotyping data. A vast amount of forensic Y-STR data is available for use in population genetic studies [12].

3. Overview of the Y Chromosome Population Genetic Studies in Recent B&H Inhabitants

STR and SNP variation in autosomal and Y-chromosome markers was studied so that the molecular genetic diversity of B&H could be placed within regional and European frames, and also to provide the necessary reference for statistical calculations used in forensic genetics. To keep these calculations as relevant as possible, the data are still periodically updated. Initial results were obtained by observing 28 Y-chromosome biallelic markers in the B&H population [13]. The study was designed on the basis of regional data and included 256 male individuals.
The results showed an extremely close genetic relationship between the three populations (the three main Bosnian and Herzegovinian ethnic groups) and their close relationship to other populations in the Balkans. Further elaboration of this issue required additional studies with a multidisciplinary approach, the application of additional molecular markers, expansion of the sample and structural investigation of each ethnic group, as well as the analysis of ancient genetic material from archeological skeletal samples.

In the same year (2005), the very first Y-STR population data set for the B&H human population was published [14]. The 100 tested males were voluntary donors. The PowerPlex® Y System was used to amplify 12 Y-STR loci via PCR: DYS19, DYS385a, DYS385b, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438 and DYS439. Of the 81 Y-STR haplotypes detected (in a total of 100 samples), 69 were unique, 7 appeared twice, 4 appeared three times and only 1 appeared five times. Statistical analysis covered gene diversity, major allele frequency, the most frequent haplotypes, allele frequency distribution and observed haplotype diversity [3] for the 12 PowerPlex® Y loci.

Four years later, with the intent of improving the existing database and obtaining more specific results for local populations for a variety of DNA markers, a group of authors decided to analyze additional individuals from the Canton Sarajevo area. Estimation of genetic diversity at the 12 Y-chromosomal STR loci included in the PowerPlex® Y System was used to extend the existing database and to create a more realistic view of the genetic structure of the regional Bosnian and Herzegovinian human population, in particular regarding the diversity among isolated and non-isolated local populations.
In addition, that study aimed to estimate the genetic distinctiveness of the Canton Sarajevo population relative to the general B&H population as well as to populations of geographically neighboring countries. Y-STR haplotypes were generated for a sample of 100 unrelated, healthy male individuals living in Canton Sarajevo (Bosnia and Herzegovina) using the PowerPlex® Y System kit [15]. Within this pool, a total of 81 different haplotypes was detected, 71 of them unique. Of the remaining 10 haplotypes, six occurred twice, two occurred three times, one occurred five times and one occurred six times. The results suggested that the local population of Canton Sarajevo, with respect to the detected haplotype and gene diversity, may be considered a projection of the general B&H population. Since this is the largest regional population in Bosnia and Herzegovina, with a pronounced migration influx, this is quite a logical outcome.

Four years later, in 2013, the Y chromosome diversity of the B&H population was examined again, this time with an increased number of STR loci. Sampling was performed using buccal swabs from unrelated, healthy men originating from all regions of Bosnia and Herzegovina. A total of 100 samples was obtained. DNA samples were typed for 23 Y-STR loci, including 6 new loci (DYS481, DYS533, DYS576, DYS549, DYS643, and DYS570) that are part of the new PowerPlex® Y 23 amplification kit. The absolute frequency of the generated haplotypes was calculated, and the results showed that only two samples shared an identical Y 23 haplotype. DYS418 was identified as the most polymorphic locus, with 14 detected alleles, and the least polymorphic loci were DYS437, DYS389I, DYS393, and DYS391.
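The haplotype counts reported above for the Canton Sarajevo sample can be turned into the summary statistics that recur throughout this overview. The following is a minimal sketch, assuming the commonly used definitions (Nei's unbiased haplotype diversity, discrimination capacity as the share of distinct haplotypes, and match probability as the sum of squared haplotype frequencies); the cited studies may have used different estimators or software, and the function name `haplotype_stats` is purely illustrative.

```python
def haplotype_stats(multiplicities):
    """Summarise a haplotype sample.

    `multiplicities` maps "times a haplotype was observed" to
    "number of distinct haplotypes observed that many times".
    The formulas below are assumed standard definitions, not taken
    from the cited papers.
    """
    # Total number of typed individuals.
    n = sum(times * count for times, count in multiplicities.items())
    # Number of different haplotypes in the sample.
    distinct = sum(multiplicities.values())
    # Sum of squared haplotype frequencies (random match probability).
    sum_p2 = sum(count * (times / n) ** 2 for times, count in multiplicities.items())
    return {
        "sample_size": n,
        "distinct_haplotypes": distinct,
        # Nei's unbiased estimator: n / (n - 1) * (1 - sum of p_i^2).
        "haplotype_diversity": n / (n - 1) * (1 - sum_p2),
        "discrimination_capacity": distinct / n,
        "match_probability": sum_p2,
    }


# Counts reported in the text for the Canton Sarajevo sample [15]:
# 71 unique haplotypes, 6 seen twice, 2 seen three times,
# 1 seen five times and 1 seen six times.
canton_sarajevo = {1: 71, 2: 6, 3: 2, 5: 1, 6: 1}
stats = haplotype_stats(canton_sarajevo)
print(stats)
```

The same function can be applied to the 2005 data set (69 unique haplotypes, 7 seen twice, 4 seen three times, 1 seen five times). For a sample in which all 100 haplotypes are unique, these definitions give a haplotype diversity of 1, a discrimination capacity of 1.00 and a match probability of 0.01.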
Decreasing the number of repeated haplotypes is very important in forensic DNA analysis, and this study showed that it can be achieved by increasing the number of highly polymorphic Y-STR markers [16].

Whit Athey’s Haplogroup Predictor was used to determine Y chromosome haplogroup frequencies based on Y chromosome marker frequencies from the same 100 individuals [17]. According to those results, the most frequent haplogroup is I2a, with a frequency of 49%, followed by R1a and E1b1b, each accounting for 17% of all haplogroups within the population. The remaining haplogroups encountered in this study are J2a (5%), I1 (4%), R1b (4%), J2b (2%), G2a (1%) and N (1%). These results confirmed the preliminary B&H population data published about 10 years earlier. They also confirmed the prediction that the B&H population is part of the Western Balkan area, which served as a Last Glacial Maximum (LGM) refugium for the Paleolithic human European population. Furthermore, these results corroborated the hypothesis that this region was an important stopping point on the “Middle East-Europe highway” during the Neolithic farmer migrations. Finally, since these results were almost completely in accordance with previously published data on B&H and neighboring populations generated by Y chromosome single nucleotide polymorphism (Y-SNP) analysis, it was concluded that in silico analysis of Y-STRs is a reliable method for approximating the Y chromosome haplogroup diversity of an examined population.

In the meantime, the same set of STR loci was employed to explore the distribution and polymorphisms of 23 short tandem repeat (STR) loci on the Y chromosome in the Turkish population recently settled in Sarajevo, Bosnia and Herzegovina, and to investigate its genetic relationships with the homeland Turkish population and neighboring populations [18]. This study included 100 healthy unrelated male individuals from the Turkish population living in Sarajevo.
Amplification was performed using the PowerPlex Y 23 amplification kit. The studied population was compared to other populations using pairwise genetic distances, which were represented in a multi-dimensional scaling plot. Haplotype and allele frequencies of the sample population were calculated, and the results showed that all 100 samples had unique haplotypes. The most polymorphic locus was DYS458, and the least polymorphic was DYS391. The observed haplotype diversity was 1.0000 ± 0.0014, with a discrimination capacity of 1.00 and a match probability of 0.01. Rst values showed that the observed population was closely related in both dimensions to the Lebanese and Iraqi populations, while it was more distant from the Bosnian, Croatian, and Macedonian populations. The authors concluded that the Turkish population living in Sarajevo can be regarded as a representative Turkish population, because the results were the same as those published for the population from Turkey. This study also showed that geographically close populations were genetically related to each other.

Haplogroup prediction for this population was addressed in a separate study [19]. The 23 loci from the previously obtained Y-STR haplotypes of 100 unrelated healthy Turkish males who had recently settled in Sarajevo were used to determine haplogroups with Whit Athey’s Haplogroup Predictor software. In total, 90 of the studied haplotypes were assigned with a Bayesian probability greater than 92.2%, while the probabilities for the remaining 10 haplotypes ranged between 51.4% and 84.3%. Seventeen haplogroups with differing frequencies were found. Y-haplogroup J2a was the most prevalent, accounting for 26% of all samples, while haplogroups R1b, G2a, and R1a were less prevalent, each ranging from 10% to 15% of all samples. These 4 haplogroups together account for 63% of all Y chromosomes.
In total, 11 haplogroups (E1b1b, G1, I1, I2a, I2b, J1, J2b, L, Q, R2, and T) each had a frequency between 2% and 5%, whereas the remaining haplogroups, namely E1b1a and N, were each found in only 1% of all samples. The results showed that a large percentage of the Turkish paternal line is linked with West Asia, Europe Caucasus, Western Europe, Northeast Europe, Middle East, Russia, Anatolia, and Black Sea Y chromosome lineages. The conclusion is that the analyzed Turkish population can serve as a representative sample for the Turkish population residing in Turkey, because the results were consistent with data published earlier in the literature for the Turkish population in Turkey.

In 2016 and 2017, similar studies were performed on the human population residing in Tuzla, Bosnia and Herzegovina. Tuzla Canton is one of the most populated regions in Bosnia and Herzegovina, so its genetic analysis could provide evidence of past demographic events. The first study included a total of 100 unrelated healthy adult males genotyped at the 23 Y-STR loci of the PowerPlex Y23 kit [20]. It reported haplotype diversity, allele frequencies and Rst-based genetic distances calculated between the new dataset and those from Bosnia and Herzegovina and other places. The distances were then visualized through a multidimensional scaling plot and neighbor-joining phylogenetic tree analyses. The discrimination capacity of the PowerPlex Y23 kit appeared to be high, because all 100 individuals had unique haplotypes, and the newly incorporated loci seem very informative. However, no significant difference was found between the study population and the general population of Bosnia and Herzegovina, or between the population of Tuzla and neighboring populations [20].

In the second study, in silico haplogroup assignments were made for the same 100 unrelated male individuals from Tuzla Canton, Bosnia and Herzegovina (B&H), based on the 23-loci Y-STR data and using four different algorithms [21]. The dominant haplogroups were I, R and E, with their sublineages I2a, R1a, and E1b1b. This is in agreement with the published Y-SNP data for the B&H population. In general, the results of this study not only constitute a concordance study of the four most popular haplogroup assignment algorithms, but also provide, for the first time, detailed knowledge about the differentiation that can be found within the population of B&H based on Y haplogroups.

Those studies initiated the publication of a few more papers that included Y-STR data from the B&H human population. The first one was published in 2015 and focused on the clustering of the European human population based on Y-STR data [22]. Three overall clusters were formed as a result of autosomal STR loci analyses, namely European, Asian and African. However, Y-STR analyses highlighted the formation of new sub-clusters: the European cluster was easily divided into four distinct groups, represented as four branches of the phylogenetic tree, while the Asian population cluster consisted of two sub-clusters. Given these clustering trends, evident in both phylogenetic trees, it was concluded that the clusters were indeed formed as a consequence of geographical proximity, which triggered a mixing of gene pools and in turn resulted in neighboring populations that exhibit strong genetic similarities.
Overall, this study effectively highlights that Y-STRs can be more informative than autosomal STRs in structural population studies, because they not only enable continental clustering but are also a useful tool for additional regional studies. The formation of four sub-clusters of European populations once again demonstrates the great potential of Y-chromosomal markers in a wide spectrum of genetic analyses.

The second one was published in 2018 and focused on the analysis of the Balkan human population based on Y-STR data [12]. This study aimed to provide insight into genetic relations in the Balkan populations using in silico analysis of Y-STR haplotypes, predicting haplogroups and performing network analysis of the same haplotypes. The population dataset was obtained using 23, 17, 12, 9 and 7 Y-STR loci for 13 populations: Bosnia and Herzegovina (B&H), Croatia, Slovenia, Greece, Macedonia, Romany (Hungary), Hungary, Serbia, Montenegro, Albania, Kosovo, Romania and Bulgaria. The overall dataset consists of 2179 samples with 1878 different haplotypes. In four of the thirteen analyzed Balkan populations, I2a was recognized as the major haplogroup. Each population with I2a as the major haplogroup (B&H, Croatia, Montenegro and Serbia) was from the former Yugoslavia. The remaining two major populations from the former Yugoslavia, Macedonia and Slovenia, had E1b1b and R1a, respectively, as the most prevalent haplogroups. The E1b1b haplogroup was the most prevalent in the populations of Macedonia, Romania, Albania and Kosovo. Compared with the E1b1b and R1b haplogroup clusters, the I2a haplogroup cluster is more compact, which indicates a larger degree of homogeneity among the haplotypes that belong to that haplogroup. This study indicates that an effective approach to utilizing publicly available Y-STR datasets may lie in combining haplogroup prediction and network analysis.

4. Conclusion

Describing something that has lasted for two decades as "a beginning" is quite unusual. However, that is the truth in the case of Y chromosome human population-genetic studies in Bosnia and Herzegovina. There are still many interesting features hidden within the existing diversity of local human populations in this small but intriguing country that are waiting to be discovered and described. Several preliminary hypotheses were completely changed, such as the origin of the R1b haplogroup within this region, or significantly questioned, such as the origin of the notably high frequency of the I2a haplogroup in Bosnia (as a Balkan LGM refugium marker or a “Slavic migration marker” increased by founder effect) [23]. Those, and many other Y chromosome stories, are just waiting to be told.

REFERENCES
[1] Malcolm, N. (1996). Bosnia: A short history. NYU Press.
[2] Hadžibegović, I., & Imamović, M. (1994). Bosna i Hercegovina od najstarijih vremena do kraja Drugog svjetskog rata.
[3] Wilkes, J. (1995). The Illyrians. Wiley-Blackwell.
[4] Klaić, V. (1990). Povijest Bosne do propasti kraljevstva. Svjetlost Sarajevo.
[5] Marjanovic D., et al. Doc Praehistorica, 23 (2006) 21.-6.
[6] Marjanović, D., et al. (2018). Forensic genetics: Theory and application.
[7] Marjanović, D., Pojskić, N., Kapur, L., Haverić, S., Durmić-Pašić, A., Bajrović, K., & Hadžiselimović, R. (2008). Overview of human population-genetic studies in Bosnia and Herzegovina during the last three centuries: history and prospective. Collegium antropologicum, 32(3), 981-987.
[8] Lasić, L. (2016). Historical Overview of the Human Population-Genetic Studies in Bosnia and Herzegovina: Small Country, Great Diversity. Collegium antropologicum, 40(2), 145-149.
[9] Semino, O., Passarino, G., Oefner, P. J., Lin, A. A., Arbuzova, S., Beckman, L. E., ... & Marcikić, M. (2000).
The genetic legacy of Paleolithic Homo sapiens sapiens in extant Europeans: A Y chromosome perspective. Science, 290(5494), 1155-1159.
[10] Casanova, M., Leroy, P., Boucekkine, C., Weissenbach, J., Bishop, C., Fellous, M., ... & Siniscalco, M. (1985). A human Y-linked DNA polymorphism and its potential for estimating genetic and evolutionary distance. Science, 230(4732), 1403-1406.
[11] Underhill, P. A., Myres, N. M., Rootsi, S., Metspalu, M., Zhivotovsky, L. A., King, R. J., ... & Kutuev, I. (2010). Separating the post-Glacial coancestry of European and Asian Y chromosomes within haplogroup R1a. European Journal of Human Genetics, 18(4), 479.
[12] Šehović, E., Zieger, M., Spahić, L., Marjanović, D., & Dogan, S. (2018). A glance of genetic relations in the Balkan populations utilizing network analysis based on in silico assigned Y-DNA haplogroups. Anthropological Review, 81(3), 252-268.
[13] Marjanovic, D., Fornarino, S., Montagna, S., Primorac, D., Hadziselimovic, R., Vidovic, S., ... & Andjelinovic, S. (2005). The peopling of modern Bosnia‐Herzegovina: Y‐chromosome haplogroups in the three main ethnic groups. Annals of Human Genetics, 69(6), 757-763.
[14] Marjanovic, D., Bakal, N., Pojskic, N., Kapur, L., Drobnic, K., Primorac, D., ... & Hadziselimovic, R. (2005). Population data for the twelve Y-chromosome short tandem repeat loci from the sample of multinational population in Bosnia and Herzegovina. Journal of Forensic Science, 50(1), JFS2004289-2.
[15] Ćenanović, M., Pojskić, N., Kovačević, L., Džehverović, M., Čakar, J., Musemić, D., & Marjanović, D. (2010). Diversity of Y-short tandem repeats in the representative sample of the population of Canton Sarajevo residents, Bosnia and Herzegovina. Collegium antropologicum, 34(2), 545-550.
[16] Kovačević, L., Fatur-Cerić, V., Hadžić, N., Čakar, J., Primorac, D., & Marjanović, D. (2013). Haplotype data for 23 Y-chromosome markers in a reference sample from Bosnia and Herzegovina. Croatian medical journal, 54(3), 286-290.
[17] Doğan, S., Ašić, A., Doğan, G., Besic, L., & Marjanovic, D. (2016). Y-Chromosome Haplogroups in the Bosnian-Herzegovinian Population Based on 23 Y-STR Loci. Human biology, 88(3), 201-210.
[18] Dogan, S., Primorac, D., & Marjanović, D. (2014). Genetic analysis of haplotype data for 23 Y-chromosome short tandem repeat loci in the Turkish population recently settled in Sarajevo, Bosnia and Herzegovina. Croatian medical journal, 55(5), 530.
[19] Doğan, S., Doğan, G., Ašić, A., Bešić, L., Klimenta, B., Hukić, M., ... & Marjanović, D. (2016). Prediction of the Y-Chromosome Haplogroups within a recently settled Turkish Population in Sarajevo, Bosnia & Herzegovina. Collegium antropologicum, 40(1), 1-7.
[20] Babić, N., Dogan, S., Čakar, J., Pilav, A., Marjanović, D., & Hadžiavdić, V. (2017). Molecular diversity of 23 Y-chromosome short tandem repeat loci in the population of Tuzla Canton, Bosnia and Herzegovina. Annals of human biology, 44(5), 419-426.
[21] Dogan, S., Babic, N., Gurkan, C., Goksu, A., Marjanovic, D., & Hadziavdic, V. (2016). Y-chromosomal haplogroup distribution in the Tuzla Canton of Bosnia and Herzegovina: A concordance study using four different in silico assignment algorithms based on Y-STR data. Homo, 67(6), 471-483.
[22] Dogan, S., Ašić, A., Buljubašić, S., Bešić, L., Avdić, M., Ferić, E., ... & Marjanović, D. (2015). Overview of European population clustering based on 23 Y-STR loci. Genetika, 47, 901-908.
[23] Primorac, D., Marjanović, D., Rudan, P., Villems, R., & Underhill, P. A. (2011). Croatian genetic heritage: Y-chromosome story. Croatian medical journal, 52(3), 225-234.

Dublin Core
Title: Overview of Human Lineage Genetic Marker Studies in Bosnia and Herzegovina: Y chromosome story
Author: Aldin Pirić1, Sabahudin Ćordić1, Lejla Smajlović-Skenderagić1, Serkan Dogan1, Damir Marjanović1,2
Identifier: 2637-2835
DOI: 10.14706/JONSAE2021312

https://eprints.ibu.edu.ba/files/original/6e51695a244d6e444275de0abb1daba0.pdf

Journal of Natural Sciences and Engineering, Vol. 3, (2020)
DOI number: 10.14706/JONSAE2020114

Sentiment Analysis on Twitter Data using Big Data

Obada Almonajed, Samed Jukić
1 International Burch University, Sarajevo, Bosnia and Herzegovina
almonajed.obada@ibu.edu.ba samed.jukic@ibu.edu.ba

Abstract – With the increasing number of users and the growing amount of data on the Internet, especially on social media sites, sentiment analysis has become one of the important and essential fields. Collecting people's feelings and sentiments and classifying the data has attracted most businesses and companies. Recently, Twitter sentiment analysis has attracted much attention because of Twitter's growth and popularity. The solution for handling enormous amounts of data from social media goes by a new term, Big data.
Big data is not only about having a large amount of data, but also about processing and using that data. In this paper, we collect live data from Twitter using Apache Spark and apply machine learning algorithms provided by the Apache Spark machine learning library to classify each Twitter message. Naive Bayes and Logistic Regression are used for testing the model. The Naive Bayes algorithm gave better results, with an average accuracy of around 75%, while the Logistic Regression algorithm reached around 69%.

Keywords – big data, sentiment analysis, twitter, apache spark, social media, machine learning.

1. Introduction

One of the best things about social media is in its name: it is social. It connects people across the world by letting them share and receive information. The main purpose of social media is to connect people and allow them to share thoughts and opinions. It also allows them to read the news, watch videos, read stories, and view and share photos. Social media is becoming an integral part of our lives. It is a way of connecting and building relationships with others, of hearing what people say and responding. The most popular platforms are Facebook, Twitter, YouTube, Instagram, and Snapchat.

Because social media connects people, it is now very important for businesses, which use it to increase brand exposure and customer reach. Publishing to social media is very simple. For example, a company can create a page on Facebook and post new products, sales announcements, and brand marketing as images, text, or video. Regardless of its size, a business needs to recognize the value and trends of a platform in order to understand and use it well. People can talk about a business without its knowledge, so it is important for a company to monitor social media conversations about its brand.
Based on reviews, the company can adjust to the current market situation and serve its customers better. To analyze the text written by customers, a sentiment analysis tool is used. Sentiment analysis, or opinion mining, determines the emotional tone of a message or text. Its main use is to understand how people feel and think about something. The tool is very useful for companies and can affect decision making. Using machine learning, companies can analyze social media content to see the meaning behind the messages.

An enormous number of people across the world use social media. To collect, store, and process such data, we use Big data. Big data is not only about storing a large amount of data but also about the ability to analyze it. It allows us to obtain and analyze real-time data from social media. For this paper, Apache Spark, one of the fastest big data platforms, is used. Compared with Hadoop, it can be up to one hundred times faster [1]. The Apache Spark framework provides native bindings for Java, Python, and Scala, and supports machine learning and SQL. The purpose of the paper is to collect data from Twitter and to classify the sentiment of each message as positive or negative using machine learning and Apache Spark.

2. Literature Review

Pang et al. [2] found that the unigram model performs best, although the difference is small: the accuracy using unigrams is 82.9%, while the accuracy using the mix of unigrams and bigrams is 82.7%, both obtained with the SVM algorithm. Dave et al. [3], however, report the opposite result, with bigrams outperforming unigrams using the SVM and baseline algorithms.
SVM achieved 87.2% precision in the first test and 85.8% precision in the second test for bigrams.

Pak et al. [4] gathered around 300,000 tweets from Twitter. Each tweet is classified into one of three classes: positive, negative, or neutral. They assumed that the emoticon in a message represents the actual sentiment of the text. Thus, if the ':(' emoticon is included in the message, the message is treated as negative regardless of its content. Likewise, if a tweet contains ':)', the message is treated as positive. As learning algorithms they used multinomial Naïve Bayes, SVM, and Conditional Random Fields, and Naïve Bayes gave the best results. To improve the precision of the classifier, they removed some n-grams that do not indicate any sentiment.

The authors of [5] studied the use of Apache Flume and Apache Hive, which are built on top of Hadoop, for analyzing Twitter data. In [6], the authors described a recommendation system that summarizes users' feedback, comments, and reviews about different subjects using the Hadoop framework. Similarly, the authors of [7] built a recommendation system that recommends services. The authors of [8] built a Hadoop-based framework that extracts and analyzes customers' feedback about a product from social networks for social customer relationship management.

Go et al. [9] analyzed Twitter sentiment using several machine learning algorithms: Naïve Bayes, Maximum Entropy (MaxEnt), and Support Vector Machine (SVM). They included emoticons in the training data and used two classes for tweet classification, positive and negative.
After training, they concluded that emoticons have a negative effect when the MaxEnt and SVM algorithms are applied to the data, but do not influence Naive Bayes. A specific feature of their study is that they explore the use of unigrams, bigrams, the combination of unigrams and bigrams, and parts of speech. They conclude that the mix of unigrams and bigrams beats every other model, and that part-of-speech tags were not useful at all.

3. Methodology

A. Sentiment analysis

Sentiment analysis reveals whether customers are satisfied with a new service or not. Firms mainly use Twitter to get customer feedback. Short messages are written that reveal whether people like or dislike something new. Firms use that information in their decisions so that they can improve a service and increase sales. Applying sentiment analysis to content means looking for the opinion in the text. Is the product review positive or negative? Are customers satisfied with the product or not? Do positive opinions outnumber negative ones? Questions of this kind can be answered with sentiment analysis, which shows how customers view the company's product or service. In short, sentiment analysis is used to distinguish agree/disagree, like/dislike, and for/against [10]. For example, in the sentence 'I recommend this product to everyone.', the word 'recommend' indicates that the writer is happy, so the sentiment is positive. In this paper, positive and negative words are collected and used to train the machine to classify the messages.

For collecting, storing, and classifying such data, we use Big data tools. Big data is data that exceeds the processing capacity of conventional database systems [11]. Big data implies that there is a large amount of data to collect.
If users want to obtain data from social sites quickly and continuously, they should use Big data tools. As the volume of data keeps growing, it becomes harder to manage, and Big data is the solution. For years Hadoop was the leading open source framework for Big data; more recently Apache Spark has become the leading and most popular framework. Hadoop and Spark perform almost the same tasks, but Spark is preferable, especially with regard to speed, because the way it processes data is faster.

B. Data and Findings

For the experiment, we used one document. The document contains example messages together with their outputs (classes), either positive or negative. It is used to train and test the system, because the program relies on supervised learning, that is, learning from examples. We use a well-known training dataset, the Stanford Twitter Sentiment Corpus (STS) [12]. Each tweet in this dataset has the following data: the ID of the user, the timestamp of the tweet, the username of the user who posted the tweet, and the tweet itself. Next to each tweet there is a class, either positive or negative. The document contains about 1 million samples of positive and negative tweets. The following figure shows an example of the dataset:

Figure 1. Samples of the Dataset

C. Process

First, we need to install Spark and include it in the Scala project. After that, we need to initialize a Spark Context, which tells Spark how to access a cluster. The Spark Context takes a parameter known as SparkConf, or Spark Configuration. SparkConf allows the user to configure common properties that are passed to the Spark Context, such as the application name, master URL, memory size, key-value pairs, and other properties.

Figure 2. Configuration

After configuring the application, we started with the online collection of tweets.
For online, real-time data, Spark Streaming is required. Spark Streaming receives live data from Twitter and divides it into batches, to which the user can later apply actions and processing. The next figure shows the implementation of Spark Streaming.

Figure 3. Spark Streaming

A user can fetch tweets from a specific user, all tweets that start with a particular word, or all tweets that contain a particular hashtag ('#'). In our system, we collect all tweets containing a particular hashtag, which is passed to the system as an argument. Once everything is configured, we can collect data from Twitter and save it to a file. In our system, we save the data to a text file. The next figures show how to fetch data and how to save data into text files.

Figure 4. Fetching Tweets

Figure 5. Save data in text format

D. Spark Machine Learning Library

The next and most important step is to classify each tweet as positive or negative. For this we use the Spark machine learning library, which contains different algorithms. To be analyzed, the data has to be converted to vectors. For that, we use a well-known and very useful technique called hashing, which translates text data to numeric data. In Spark, the most commonly used hashing transformer is HashingTF. Before the data caught from Twitter can be analyzed, every item must be hashed, as shown in the figure below.

Figure 6. Hashing data

We compared two algorithms, Naive Bayes and Logistic Regression. Logistic Regression is a binary classifier, which means it classifies data into one of two groups, while Naive Bayes can be used for multiple groups. We used 10-fold cross-validation. Cross-validation splits a dataset into more than one part. It is used to ensure that every sample is used both for training and for testing.
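The hashing step described in this subsection can be made concrete with a small sketch. It is written in plain Python rather than the Scala/Spark code of the missing Figure 6, and it is only an illustration of the hashing idea: the bucket count `NUM_FEATURES`, the whitespace tokenizer, and the use of CRC32 as the hash are assumptions chosen for readability and are not Spark's HashingTF implementation.

```python
import zlib
from collections import Counter

# Assumed, deliberately tiny vector size so the output is easy to read.
NUM_FEATURES = 16


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hash_features(text, num_features=NUM_FEATURES):
    """Map every token to a bucket index and count the tokens per bucket."""
    buckets = Counter(
        zlib.crc32(token.encode("utf-8")) % num_features
        for token in tokenize(text)
    )
    # Dense term-frequency vector of fixed length num_features.
    return [buckets.get(i, 0) for i in range(num_features)]


vector = hash_features("I love it :) I really love it")
print(vector)
```

Every message is mapped to a vector of the same length regardless of its vocabulary, which is what allows a classifier to be trained on arbitrary tweets. Different words can collide in the same bucket, so real pipelines use a much larger number of buckets than this sketch does.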
The training data is always larger than the testing data. For example, with 1000 samples a simple split could take 800 for training and 200 for testing. With 10-fold cross-validation, nine folds are used for training and one fold for testing in each round.

Table 1. Cross validation example

Fold       Role
1-fold     Training
2-fold     Training
3-fold     Training
4-fold     Training
5-fold     Training
6-fold     Training
7-fold     Training
8-fold     Training
9-fold     Training
10-fold    Testing

Next, the testing fold is moved to another position in the dataset, as in Table 2, where the testing data is now the first fold, at the beginning of the dataset. The testing fold is moved by one position in each round of cross-validation, so every sample appears in both the testing and the training part.

Table 2. Cross validation example 2

Fold       Role
1-fold     Testing
2-fold     Training
3-fold     Training
4-fold     Training
5-fold     Training
6-fold     Training
7-fold     Training
8-fold     Training
9-fold     Training
10-fold    Training

The accuracy has to be calculated for each fold, so that at the end the performance can be determined and it becomes clear whether the classifier and the data are good or not. Cross-validation and accuracy are very important because they indicate how well the learner will be able to make correct predictions for new data. As mentioned before, we used two machine learning algorithms, Naïve Bayes and Logistic Regression. The results showed that Naive Bayes is better at predicting the sentiment of the text. More details about the results are given in the next section.

4. Results

To train and test the system we use the Stanford Twitter Sentiment Corpus (STS) dataset, which is available online. It contains more than one million samples.
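The fold rotation illustrated in Tables 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' Spark code: the helper name `k_fold_indices` is hypothetical, the samples are assumed to be already shuffled, and the folds are visited from first to last, whereas the tables start with the last fold as the test set (the order does not affect the result).

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that each fold is the test set once."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


splits = list(k_fold_indices(1000, k=10))
for round_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train_idx)} training, {len(test_idx)} testing")
```

With 1000 samples and ten folds, every round trains on 900 samples and tests on the remaining 100, and each sample is tested exactly once across the ten rounds.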
After testing on our data, the accuracy obtained in each fold is shown in the table below:

Table 3. 10-fold cross validation

k-fold     Naive Bayes (%)   Logistic Regression (%)
1-fold     77.3              68.8
2-fold     70.4              73.4
3-fold     75.7              74.3
4-fold     77.2              67.7
5-fold     76.4              64.6
6-fold     73.6              66.5
7-fold     69.8              75.3
8-fold     79.1              65.8
9-fold     77.3              67.2
10-fold    74.5              71.05

The accuracy of the classifier is the number of true positives plus the number of true negatives, divided by the total number of test samples:

Figure 7. Formula to Calculate the Accuracy

The corresponding code in our program:

Figure 8. Code to Calculate the Accuracy

'predictionAndLabel' holds, for each test sample, the prediction of the system together with the actual label. A real example from our system is shown in the following figure, which lists the prediction of the system next to the actual label of the data.

Figure 9. Prediction and Actual

Example: for the sentence 'Good project, I liked it.', the Naïve Bayes algorithm returned 1.0, which means positive, while the Logistic Regression algorithm returned 0.0, which means negative sentiment. Another example: for 'I love it :)', the prediction of Naïve Bayes is 1.0 and that of Logistic Regression is also 1.0, which is positive and correct.

The average accuracy of both algorithms, Naive Bayes and Logistic Regression, after cross-validation is shown in the following table.

Table 4. Accuracy

                    Naive Bayes   Logistic Regression
Average Accuracy    75.13%        69.465%

From this table we can see that the average accuracy of Naive Bayes is around 75 percent. The accuracy of Logistic Regression is a bit lower, around 69 percent. The difference, around 6 percentage points, is not large. Based on these results, we consider that the Naive Bayes algorithm provides good results.
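The accuracy calculation and the averaging behind Table 4 can be reproduced with a short plain-Python sketch. It is not the code of the missing Figure 8: the `(prediction, label)` pairs below are made-up illustrative values, and only the per-fold numbers are taken from Table 3.

```python
def accuracy(prediction_and_label):
    """Fraction of pairs whose predicted class equals the actual class."""
    correct = sum(1 for predicted, actual in prediction_and_label if predicted == actual)
    return correct / len(prediction_and_label)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical (prediction, label) pairs: 1.0 = positive, 0.0 = negative.
pairs = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
print(accuracy(pairs))  # 3 of 4 predictions are correct -> 0.75

# Per-fold accuracies (%) copied from Table 3.
naive_bayes = [77.3, 70.4, 75.7, 77.2, 76.4, 73.6, 69.8, 79.1, 77.3, 74.5]
logistic_regression = [68.8, 73.4, 74.3, 67.7, 64.6, 66.5, 75.3, 65.8, 67.2, 71.05]

nb_avg = mean(naive_bayes)
lr_avg = mean(logistic_regression)
print(round(nb_avg, 3), round(lr_avg, 3), round(nb_avg - lr_avg, 3))
```

Averaging the ten fold values gives approximately 75.13 for Naive Bayes and 69.465 for Logistic Regression, which agrees with Table 4, and the gap between the two is a little under 6 percentage points.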
Logistic Regression, with its slightly lower accuracy, can be considered a good algorithm as well. Once the system has been trained, it can be used to catch data from Twitter and predict its sentiment using either algorithm, Naïve Bayes or Logistic Regression. To get better results, Naive Bayes should be used rather than Logistic Regression. Finally, the data is best saved in a text file, so that companies can easily keep track of users' opinions about their products and about the company in general.

5. Discussion

In this paper we showed how text classification can be done quickly and easily with Spark. We used Spark both as the Big data platform and for applying machine learning algorithms. We used two well-known machine learning algorithms, Naive Bayes and Logistic Regression. With these algorithms we achieved average accuracies of about 75% and 69%, respectively, on data sets that contained different types of sentences and emoticons. We have also shown how emoticons can help improve the model's accuracy if used correctly. Using more data in the training and testing sets of our cross-validation method would lead to better results.

In this section we compare the various methods and the performance of the algorithms. The research papers related to our work, already mentioned in Section 2, show that the text should always be classified with several different methods before deciding which method is best for the goal at hand. The following table summarizes different supervised machine learning approaches to Twitter sentiment analysis.

Table 5.
Summary of previous work

Paper: Pak and Paroubek [4]
Methods: Supervised Machine Learning
Algorithms: Multinomial Naive Bayes, Support Vector Machine (SVM), and Conditional Random Field (CRF)
Datasets: Tweets collected using Twitter API
Results: Multinomial Naive Bayes with bigrams achieved superior performance compared with unigrams and trigrams.

Paper: Go et al. [9]
Methods: Supervised Machine Learning
Algorithms: Naive Bayes, Maximum Entropy (MaxEnt), and Support Vector Machine (SVM)
Datasets: Tweets collected using Twitter API
Results: Maximum Entropy (MaxEnt) with both unigrams and bigrams achieved a precision of 83%, compared with Naive Bayes with a precision of 82.7%.

Paper: Pang et al. [2]
Methods: Supervised Machine Learning
Algorithms: Support Vector Machine (SVM), Naive Bayes, and MaxEnt
Datasets: IMDb
Results: With Support Vector Machine (SVM), the accuracy using unigrams is 82.9% and the accuracy using the mix of unigrams and bigrams is 82.7%. They showed that SVM is superior to Naive Bayes and Maximum Entropy (MaxEnt): the accuracy using unigrams is 81.0% with Naive Bayes and 80.4% with MaxEnt, and the accuracy using both unigrams and bigrams is 80.6% with Naive Bayes and 80.8% with MaxEnt.

Some earlier studies used a larger set of sentiment groups, such as satisfaction, sadness, frustration, fear and surprise. In our research, by contrast, we classified the tweets into two groups only, positive or negative, with no third group. Most studies applied ML algorithms to tweets for sentiment analysis without the use of Big data, whereas we combined Big data with the machine learning algorithms. Table 5 shows that Go et al. obtained better accuracy with the Naive Bayes algorithm than we did. They applied an additional procedure related to emoticons, which we omitted: they deleted any tweet that contains both positive and negative emoticons.
This may happen if a tweet covers two subjects. Although we do not know the accuracy of the model in the research of Pak and Paroubek, their work is sound because they followed the steps necessary to determine whether a text is positive or negative. These steps included the removal of any URLs and usernames (usernames follow the "@" symbol) and the shortening of any character that repeats more than twice, turning a string such as OOMMMGGG into OOMMGG, which is done with a regular expression.

6. Conclusion

This paper showed how Spark, used as a Big data platform, can classify the text of tweets as positive or negative in a simple yet very fast way. Using the common algorithms Naïve Bayes and Logistic Regression, we achieved good accuracy on large data sets that contained many different emoticons and sentences. By training and applying cross-validation to our dataset, we determined that Naïve Bayes performs better than Logistic Regression, with a highest fold accuracy of around 79%. That is the most relevant result regarding the usage of Big Data. We have also shown that Spark is fast, easy to use and understand, and powerful with large data sets. For that reason, we conclude that it is well suited to Twitter sentiment analysis. Sentiment analysis is not limited to Twitter; it can be applied to any type of document or data. In the near future we plan to use richer data sets for training, Spark Graphs for better data visualization, and real-time data rather than offline data. This can be achieved easily: the classification methods only have to be applied right after each tweet is received from Twitter.

The related work mentioned in Section 2 shows that sentiment analysis on Twitter data can be used in many different areas.
From those papers, we can conclude that the main goal was to determine product quality, that is, to make it easier for companies to check whether customers find an item good or not. Politicians and companies also want to know what people write about them in real time, so they request monitoring tools that reveal the opinions, feelings and sentiments that their potential customers publish. The method can also be used in film production, since many Twitter users write their opinions about the films they have watched, about the actors, and so on.

REFERENCES

[1] P. P. Chitturi, Apache Spark for Data Science Cookbook, Packt Publishing Ltd, 2016.
[2] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning," 2002. [Online]. Available: https://www.cs.cornell.edu/home/llee/papers/sentiment.pdf.
[3] K. Dave, S. Lawrence and D. M. Pennock, "Mining the Peanut Gallery: Opinion Extraction and," 2003. [Online]. Available: https://www.kushaldave.com/p451-dave.pdf.
[4] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," 2010. [Online]. Available: https://pdfs.semanticscholar.org/6b7f/c158541d5a7be2b2465f7d8a42afa97d7ae9.pdf?_ga=2.121841355.1543760336.1572899814-899645452.1571167125.
[5] Sanggeta, "Twitter Data Analysis Using FLUME & HIVE on Hadoop," February 2016. [Online]. Available: http://www.irdindia.in/journal_ijraet/pdf/vol4_iss2/27.pdf.
[6] J. P. Verma, B. Patel and A. Patel, "Big Data Analysis: Recommendation System with," 2015. [Online]. Available: https://www.researchgate.net/profile/Jaiprakash_Verma/publication/282686173_Big_Data_Analysis_Recommendation_System_with_Hadoop_Framework/links/57f4afb708ae280dd0b77681.pdf.
[7] K. R. Shrote and A. V. Deorankar, "Review based service recommendation for big data," February 2016. [Online].
Available: https://ieeexplore.ieee.org/document/7538334.
[8] F. Z. Ennaji, A. E. Fazziki, M. Sadgal and D. Benslimane, "Social intelligence framework: Extracting and analyzing opinions for social CRM," November 2015. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7507229.
[9] A. Go, R. Bhayani and L. Huang, "Twitter Sentiment Classification using Distant Supervision," 2009. [Online]. Available: https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf.
[10] L. Bing, Opinions, Sentiment, and Emotion in Text, Cambridge University Press, 2015.
[11] E. Dumbill, Big Data Now, O'Reilly, 2012.
[12] Kazanova, "Sentiment140 dataset with 1.6 million tweets," 2017. [Online]. Available: https://www.kaggle.com/kazanova/sentiment140.
[13] C. t. W. projects, "Twitter," 2007. [Online]. Available: https://en.wikipedia.org/wiki/Twitter.
[14] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," 2010. [Online]. Available: https://pdfs.semanticscholar.org/6b7f/c158541d5a7be2b2465f7d8a42afa97d7ae9.pdf?_ga=2.121841355.1543760336.1572899814-899645452.1571167125.

Dublin Core: The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see http://dublincore.org/documents/dces/.
Title (a name given to the resource): Journal of Natural Sciences and Engineering
Identifier (an unambiguous reference to the resource within a given context): 2637-2835
DOI (digital object identifier): 10.14706
Publisher (an entity responsible for making the resource available): International Burch University
Description (an account of the resource): Journal of Natural Sciences and Engineering (JONSAE) is a peer-reviewed, biannually published international journal focusing on empirical and theoretical research in all branches of Engineering and Natural Sciences.
It is published on behalf of the Faculty of Engineering and Natural Sciences of International Burch University and aims to provide the best content by publishing original research papers, review articles, special issues, feature articles, and book reviews. All manuscript submissions are subject to initial appraisal by the Editor, and, if found suitable for further consideration, to peer review by independent, anonymous referees. All peer review is double-blind and submission is online. The journal welcomes theoretical, applied, interdisciplinary and methodological work, with a preference for empirical research, a critical approach and problem-solving methods in manuscripts.
Language (a language of the resource): English

Dublin Core: The Dublin Core metadata element set is common to all Omeka records, including items, files, and collections. For more information see http://dublincore.org/documents/dces/.
Title (a name given to the resource): Sentiment Analysis on Twitter Data using Big Data
Author: Obada Almonajed, Samed Jukić
Abstract (a summary of the resource): Abstract – With the increasing number of users and the growing volume of data on the Internet, especially on social media sites, sentiment analysis has become an important field. Collecting people's feelings and sentiments and classifying that data has attracted many businesses and companies. Recently, Twitter sentiment analysis has attracted much attention because of Twitter's growth and popularity. The approach to handling the enormous amounts of data from social media is known as Big data. Big data is not only about having a large amount of data, but also about processing and using that data.
Keywords:
Keywords–big data, sentiment analysis, twitter, apache spark, social media, machine learning Identifier An unambiguous reference to the resource within a given context 2637-2835 DOI Digital object identifier 10.14706/JONSAE2021311 Publisher An entity responsible for making the resource available Faculty of Engineering and Natural Sciences, IBU