Analysis of multi-threading and cache memory latency masking on processor performance using thread synchronization technique


Introduction
Humans carry out a wide range of operations simultaneously, or concurrently, as it will be referred to in this study.
Breathing, blood circulation, digestion, thinking, and walking, for instance, can all happen concurrently. All the senses, vision, touch, smell, taste, and hearing, can be used at the same time. Computers also have the ability to perform operations concurrently. It is typical for personal computers to compile a program, send a file to a printer, and receive electronic mail messages over a network simultaneously.
Operating systems on single-processor computers create the illusion of concurrent execution by rapidly switching between activities, but on these computers, only one instruction can execute at a time. Most programming languages do not allow for concurrent activities. Instead, control statements provide sequential control, allowing one action to be performed at a time, with execution moving on to the next action after the previous one has finished (Hassanein et al., 2020). Java allows for concurrency through the language and APIs (Carvalho et al., 2023).
Java specifies that an application should contain separate threads of execution, where each thread has its own method-call stack and program counter, while also sharing application-wide resources such as memory. This capability is known as multithreading (Adam, 2022). A thread is a unit of control within a process. When a thread runs, it executes a function in the program. The process associated with a running program begins with one running thread, known as the main thread, which executes the "main" function of the program. In a multithreaded program, the main thread creates other threads that execute other functions.
These other threads can also create additional threads, and so on. Threads are created using constructs provided by the programming language or the functions provided by an application programming interface (API). Each thread has its own stack of activation records and its own copy of the CPU registers, including the stack pointer and the program counter, which collectively describe the state of the thread's execution (Eslamimehr et al., 2018).
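As a minimal sketch of these constructs (using the standard Java API; the class and variable names are our own illustration, not from the paper), the main thread can create additional threads that each execute their own function:

```java
public class ThreadCreationSketch {
    public static volatile boolean ran1 = false;
    public static volatile boolean ran2 = false;

    public static void main(String[] args) throws InterruptedException {
        // the main thread creates two more threads, each executing its own function
        Thread worker1 = new Thread(() -> ran1 = true);
        Thread worker2 = new Thread(() -> ran2 = true);
        worker1.start();   // each thread runs with its own stack and register state
        worker2.start();
        worker1.join();    // the main thread waits for both to terminate
        worker2.join();
        System.out.println("both worker threads have run");
    }
}
```

Once `start()` is called, the two functions may run in any order relative to each other; only the `join()` calls impose an ordering with respect to the main thread.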

Multithreaded Servers
A server with multiple threads is referred to as a multithreaded server (Vayadande et al., 2022). When a client sends a request, a thread is created that allows the user to interact with the server. Several threads need to be created to handle multiple requests from multiple clients simultaneously (Figure 1). Multithreaded (MT) CPUs are designed to conceal the impact of memory latency by running multiple instruction streams in parallel (Cai et al., 2022). An MT CPU has multiple thread contexts and interleaves the execution of instructions from different threads. As a result, if one thread is delayed by a memory access, other threads can continue to make progress (Rafi et al., 2022). The threads in a multithreaded process share the data, code, resources, and address space of their process. Using threads in a program significantly reduces the overhead involved in creating and managing units of execution, as well as in sharing per-process state information. Since thread creation has lower overhead, this study focuses on single-process multithreaded programs (Cheikh et al., 2020).
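A per-client-thread server in this style might be sketched as follows (our own illustration in Java; the class name, the echo protocol, and the use of an ephemeral port are assumptions, not details from the paper):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class MultithreadedServerSketch {
    private final ServerSocket serverSocket;

    public MultithreadedServerSketch() throws IOException {
        this.serverSocket = new ServerSocket(0);  // bind to any free port
    }

    public int port() {
        return serverSocket.getLocalPort();
    }

    // accept loop: a new thread is created for each client request
    public void start() {
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket client = serverSocket.accept();
                    new Thread(() -> handle(client)).start();  // per-client thread
                }
            } catch (IOException e) {
                // socket closed: stop accepting
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
    }

    // each client thread serves its own request independently of the others
    private void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line = in.readLine();
            out.println("echo: " + line);
        } catch (IOException e) {
            // client disconnected
        }
    }
}
```

Because each request is handled on its own thread, a client blocked on I/O does not prevent the server from serving the others.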
The operating system must determine how to allocate the central processing units (CPUs) among the processes and threads in the system. In some systems, the operating system chooses a process to run, and the selected process chooses which of its threads will execute. Alternatively, the operating system schedules the threads directly. At any given moment, multiple processes, each containing one or more threads, may be executing, while some threads are waiting for an I/O request to complete.
The scheduling policy determines which of the ready threads is chosen for execution. Generally, each ready thread is given a time slice (referred to as a quantum) of the CPU. If a thread decides to wait for something, it voluntarily gives up the CPU. Otherwise, when a hardware timer determines that a running thread's quantum has ended, an interrupt occurs, and the thread is preempted to allow another ready thread to run. If there are multiple CPUs, multiple threads can execute simultaneously. On a computer with a single CPU, threads appear to execute concurrently, even though they actually take turns running and may not receive equal time.
Therefore, some threads may appear to run at a faster rate than others do. The scheduling policy may also consider a thread's priority and the type of processing it performs, giving certain threads priority over others. It is assumed that the scheduling policy is fair, meaning that every ready thread eventually gets a chance to execute. The correctness of a concurrent program should not depend on the threads being scheduled in a specific order. Switching the CPU from one process or thread to another, known as a context switch, requires saving the state of the old process or thread and loading the state of the new one (Muthukrishnan et al., 2023). Since there may be several hundred context switches per second, context switches can potentially introduce significant overhead.

Thread States
A state diagram, occasionally referred to as a state machine diagram, is a form of behavioral diagram in the Unified Modeling Language (UML) that displays the transitions between the states of an entity. A state machine diagram represents the actions of a single entity, detailing the series of events that the entity experiences throughout its existence and its reactions to them. At any given moment, a thread is considered to be in one of several thread states, depicted in the UML state diagram in Figure 2. Occasionally, a running thread shifts to the waiting state while it waits for another thread to accomplish a task. A waiting thread shifts back to the running state only when another thread notifies it to resume execution.

Timed Waiting State
A runnable thread can enter the timed waiting state for a specified period. It returns to the runnable state when that time ends or when the event it is waiting for occurs. Threads in the timed waiting and waiting states cannot use a processor, even if one is available. A runnable thread can transition to the timed waiting state if it specifies a wait interval when waiting for another thread to complete a task. Such a thread returns to the runnable state when another thread notifies it or when the timed interval ends, whichever happens first. Another way to place a thread in the timed waiting state is to put a runnable thread to sleep. A sleeping thread remains in the timed waiting state for a specified period (known as the sleep interval), after which it returns to the runnable state. Threads sleep when they temporarily have no work to do.
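Both routes into the timed waiting state described above can be sketched with the standard Java API (a minimal illustration; the class name and the intervals are our own assumptions):

```java
public class TimedWaitingSketch {
    public static long sleptMs;

    public static void main(String[] args) throws InterruptedException {
        // a worker thread that takes a while to finish its task
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        // timed waiting with a wait interval: wait for the worker, at most 500 ms
        worker.join(500);

        // timed waiting via sleep: give up the processor for a fixed sleep interval
        long start = System.nanoTime();
        Thread.sleep(100);
        sleptMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("slept for about " + sleptMs + " ms");
    }
}
```

`join(500)` returns early if the worker finishes first, while `sleep(100)` keeps the calling thread in the timed waiting state for at least the sleep interval.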

Blocked State
A runnable thread moves to the blocked state when it tries to carry out a task that cannot be completed immediately and must temporarily pause until that task is done. For instance, when a thread issues an input/output request, the operating system blocks the thread from running until that I/O request is completed. At that point, the blocked thread returns to the runnable state so that it can resume execution.

Terminated State
A runnable thread enters the terminated state (sometimes called the inactive state) when it successfully completes its task or otherwise stops, perhaps due to an error. Entering a terminate pseudo-state indicates that the lifeline of the state machine has ended; a terminate pseudo-state is represented as a cross. In the UML state diagram of Figure 2, the terminated state is followed by the UML final state (the bull's-eye symbol) to indicate the end of the state transitions.
Latency concealment: provide each processor with useful work to do while it awaits the completion of memory access requests. Latency concealment allows communication to be fully overlapped with computation, leading to high efficiency and hardware utilization. Multithreading may be a practical mechanism for latency concealment.
Multithreading is a useful mechanism for tolerating latency. A multithreaded computation typically begins with a sequential thread, followed by some supervisory overhead to set up (schedule) various independent threads, then computation and communication (remote accesses) for the individual threads, and finally a synchronization step to stop the threads before starting the next unit (Figure 3).
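This sequential-setup, parallel-compute, synchronize pattern can be sketched as follows (our own Java illustration; the class name, thread count, and workload of summing 0..999 are assumptions for demonstration):

```java
public class ForkJoinPatternSketch {
    public static long total;

    public static void main(String[] args) throws InterruptedException {
        // sequential phase: partition the work among independent threads
        int nThreads = 4;
        int slice = 250;
        long[] partial = new long[nThreads];
        Thread[] workers = new Thread[nThreads];
        // supervisory overhead: set up (schedule) the threads
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                long sum = 0;
                for (int j = id * slice; j < (id + 1) * slice; j++) {
                    sum += j;                     // computation for this thread
                }
                partial[id] = sum;
            });
            workers[i].start();
        }
        // synchronization step: stop all threads before the next unit
        for (Thread t : workers) {
            t.join();
        }
        total = 0;
        for (long p : partial) {
            total += p;                           // combine the partial results
        }
        System.out.println("sum of 0..999 = " + total);
    }
}
```

The `join()` barrier at the end is the synchronization step named in the text: no thread's partial result is read until every thread has stopped.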

Statement of the Problem
Though the human mind can perform functions simultaneously, individuals struggle to switch between parallel streams of thought. Occasionally, a running thread transitions to a waiting state while it waits for another thread to carry out a task. A waiting thread only returns to the running state when it is notified by another thread to continue execution.
Another method to put a thread in the timed waiting state is to make a running thread go to sleep. A sleeping thread remains in the timed waiting state for a specified period (known as a sleep interval), after which it goes back to the running state. Threads sleep when they temporarily have no work to do.
For instance, a word processor might have a thread that periodically creates a backup (i.e., writes a duplicate) of the current document to the disk for recovery purposes. If the thread did not sleep between consecutive backups, it would need to have a loop that continuously checks whether it should write a copy of the document to the disk. This loop would consume CPU time without accomplishing any productive work, thereby decreasing system performance (Abellan et al., 2015).
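The backup-thread scenario can be sketched as follows (a toy illustration; the class name, the 50 ms sleep interval, and the counter standing in for a disk write are our own assumptions):

```java
public class BackupTaskSketch {
    public static volatile int backups = 0;

    public static void main(String[] args) throws InterruptedException {
        // backup thread: write a copy, then sleep instead of busy-waiting
        Thread backup = new Thread(() -> {
            try {
                while (true) {
                    backups++;            // stand-in for writing the document to disk
                    Thread.sleep(50);     // sleep interval between consecutive backups
                }
            } catch (InterruptedException e) {
                // interrupted: the editor is shutting down
            }
        });
        backup.setDaemon(true);
        backup.start();

        Thread.sleep(200);                // the main thread does the "editing" work
        backup.interrupt();               // stop the backup thread
        backup.join();
        System.out.println("backups taken: " + backups);
    }
}
```

While the backup thread sleeps it consumes no CPU time, unlike the busy-wait loop the text warns against.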
Moreover, when multiple threads share an object and one or more of them modify it, unpredictable outcomes may arise. For instance, if one thread is in the process of updating a shared object and another thread also attempts to update it, it is unclear which thread's update will take effect. In such cases, the behavior of the program cannot be relied upon.
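This unpredictability can be demonstrated directly (our own Java sketch; the counter, iteration counts, and use of `synchronized` are illustrative assumptions):

```java
public class RaceSketch {
    public static int shared = 0;       // updated without synchronization
    public static int safeShared = 0;   // updated under a lock
    public static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable unsafe = () -> {
            for (int i = 0; i < 100_000; i++) {
                shared++;               // read-modify-write: updates can be lost
            }
        };
        Runnable safe = () -> {
            for (int i = 0; i < 100_000; i++) {
                synchronized (lock) {
                    safeShared++;       // only one thread updates at a time
                }
            }
        };
        Thread a = new Thread(unsafe), b = new Thread(unsafe);
        Thread c = new Thread(safe), d = new Thread(safe);
        a.start(); b.start(); c.start(); d.start();
        a.join(); b.join(); c.join(); d.join();
        // shared may be anything up to 200000; safeShared is always 200000
        System.out.println(shared + " (unsafe) vs " + safeShared + " (safe)");
    }
}
```

The unsynchronized counter frequently loses updates because `shared++` is three operations (read, add, write) that two threads can interleave.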

Related Works
In recent years, some augmented reality tasks have imposed higher requirements on real-time performance when processing data, and mobile data traffic continues to grow. It is not difficult to find that data requested by users is highly repetitive, which leads to a large amount of redundant data transmission. The caching problem has therefore attracted the attention of researchers as a method to solve the delay problem (Vayadande et al., 2022). Caching is a new strategy to improve the performance and the service quality of mobile edge networks.
It includes offloading tasks to the mobile edge cloud and storing computation results in local storage located at the edge of the network. This technology avoids redundant and repetitive processing of the same task, thereby simplifying the offloading process and improving the utilization of network resources (Wu et al., 2022). These studies serve as a basis for knowledge development, create guidelines for policy and practice, provide evidence of an effect, and engender new ideas and directives for this particular study. They provide a solid basis for understanding how concurrent applications synchronize access to shared memory; these concepts are important to understand even if an application does not use such tools explicitly.
As a new method to alleviate the unprecedented network traffic, mobile edge caching has been widely used in the wired internet, and it has been shown to reduce delay and energy consumption (Jin et al., 2018). To date, many research works have focused on optimizing caching methods to solve the delay and energy consumption problems in computation offloading. In Kumar et al. (2023), the authors considered horizontal cooperation between mobile edge nodes for joint caching and proposed a new transformation method to solve the edge caching problem and improve the cache hit rate of the network.
In Ma et al. (2020) and Pramudita et al. (2020), the authors designed a heterogeneous collaborative edge cache framework by jointly optimizing node selection and cache replacement in mobile networks. The joint optimization problem is expressed as a Markov Decision Process (MDP), and a Deep Q Network (DQN) is used to solve it, which alleviates the offloading traffic load. These studies provide substantial practical and theoretical contributions and help to identify the scope of this study.
In Chen et al. (2021), an approach based on users' interests was presented, in which media objects are organized logically according to those interests over the network. This approach assumes that a peer holding a media object that other peers are interested in might also have other media objects of interest to them. In the case of a cache miss, the request is forwarded to a peer that has the same interests (Cai et al., 2022). This part of the literature review examines factors influencing deadlock and analytical modeling; the experiments described here do not attempt to manipulate and test the factors that influence conformity, but they help in understanding the results obtained and in considering the implications of the findings.
Thus, the overhead and network traffic are reduced. In Cheikh et al. (2021) and Dai et al. (2020), new approaches based on machine learning techniques were proposed to improve the performance of traditional cache replacement policies such as least recently used (LRU), greedy dual-size (GDS), and least frequently used (LFU).
However, these approaches have some disadvantages, such as the additional computational overhead needed to prepare the target outputs in the training phase when looking for future requests. Back-propagation neural networks (BPNN) have been applied in Neural Network Proxy Cache Replacement (NNPCR) (Candelario et al., 2023) and NNPCR-2 (Fu et al., 2020) for making cache replacement decisions. However, BPNN performance is influenced by the optimal selection of the network topology and its parameters. These findings provide background context for the statistical analysis and contribute further to understanding deadlock effects on system performance.
Murthy & Rani (2022) and Fu et al. (2022) describe a model for multiprocessor IPC based on cache miss rates (Pramudita et al., 2020). Their model of cache delays for single-threaded workloads uses a similar approach. The fundamental distinction of this work is that the model here is of a multithreaded processor that hides individual threads' memory latencies; this has not been addressed in the Murthy & Rani (2022) study or elsewhere. This section provides context for the research, identifies gaps in the existing literature, and ensures novelty.
Several recent studies have been performed on facemask detection during COVID-19, and those are presented in the literature. Deep learning-based approaches were developed by researchers to study the issue of facemask detection (Majumdar et al., 2021; Murthy & Rani, 2022). To address the issue of masked face recognition, a reliable solution was developed based on occlusion removal as well as deep learning-based features (Kochol, 2023). AlexNet and VGG16 convolutional neural network designs are used as transfer learning for the development of new models (Madhi et al., 2018). However, all these studies focus on specific optimizations or characterizations and stop short of providing a quantitative framework. They serve as a basis for knowledge development and for future research.
Technology has been developed that prevents the spread of viruses and uses deep learning technology to ensure that people are wearing facemasks correctly.The Celeb dataset was used to develop a model to automatically remove mask objects from the face and synthesize the corrupt regions while maintaining the original face structure (Ma et al., 2022).
A multithreading strategy with VGG-16 and a triplet-loss FaceNet for masked face recognition is proposed, built on MobileNet and a Haar cascade to detect facemasks (Raul et al., 2023). The embedding unmasking model (EUM) method was designed, which aimed to improve upon existing facial recognition models.
The self-restrained triplet (SRT) technique was utilized, which allowed the EUM to produce embeddings corresponding to those of the unmasked faces (Srijone et al., 2021). The margin cosine loss (MFCosface) masked face recognition algorithm, which depends on a wide-margin cosine loss design, was proposed for detecting and identifying criminals. An attention-aware mechanism was improved by incorporating important facial features that help in recognition (Almutairi et al., 2023). This paragraph summarizes details of the papers' methodology and data that are relevant to this study, which has helped to find the "sweet spot" for what this study has to focus on.
An attention-based component using the convolutional block attention module (CBAM) was designed, which depends on the highlighted area around the eyes (Chen et al., 2021). The near-infrared to visible (NIR-VIS) masked face recognition problem was analyzed in terms of the training approach and data model (Yang et al., 2021). A method called heterogeneous semi-siamese training (HSST) was designed, which attempts to maximize the joint information between face representations using semi-siamese networks. The literature here introduces the background, defines the gaps this work aims to fill, and summarizes details of methodology and data relevant to this study.

Multithreaded Issues (Thread Safety)
Multiple threads can execute the same block of code simultaneously or in parallel, and that code can be executed in a "thread safe" or "not thread safe" manner. This research identifies the issues related to thread safety and explores methods to prevent them.

Thread Security
Thread security refers to the ability of multiple threads to access the same resources without causing unpredictable outcomes such as race conditions or deadlocks. For instance, when implementing a Singleton pattern for initializing a database connection, the idea is to create a database connection object only once and utilize that object whenever there is a need to interact with the database.

1.  package main
2.
3.  type singleton struct{}
4.
5.  // instance holds the single database connection object
6.  var instance *singleton
7.
8.  func getInstance() *singleton {
9.      // check whether the connection already exists
10.     if instance == nil {
11.         // first caller creates the connection
12.         instance = &singleton{} // <- not thread safe
13.     }
14.
15.     return instance
16. }

A critical section is any piece of code that has the potential of being executed concurrently by more than one thread while sharing data or resources used by the application. The above code works, but it has a race condition at line 10. If multiple threads run line 10 simultaneously, the threads "compete" to read the instance variable; it is nil at that moment, so multiple threads initialize a connection with the database, which might consume the database's entire connection pool. Output under a race condition is difficult to predict and inconsistent.

Mutual Exclusion (Thread Safety Using a Mutex)
There are several methods for avoiding race conditions to achieve thread safety. The first is locking with a mutex. A mutex, as the name implies (mutual exclusion), gives only one thread exclusive access permission and blocks others from accessing the resource: every thread can use the resource, but only one at a time.

1.  package main
2.
3.  import "sync"
4.
5.  type singleton struct{}
6.
7.  var (
8.      instance *singleton
9.      mu       sync.Mutex
10. )
11. func getInstance() *singleton {
12.     mu.Lock()
13.     defer mu.Unlock()
14.
15.     // only the lock holder can inspect and create the instance
16.     if instance == nil {
17.         instance = &singleton{}
18.     }
19.
20.     return instance
21. }

For instance, when Thread A executes line 12, it attempts and successfully obtains the lock, then proceeds to line 17 and creates the singleton database connection object. Meanwhile, when Thread B executes line 12, it must wait to obtain the lock since Thread A is holding it. After Thread A returns in line 20, it releases the obtained lock via line 13 (the defer keyword moves the execution of the statement to the very end of the function). Only then can Thread B obtain the lock in line 12 and verify whether the instance variable is nil; since it has already been assigned by Thread A, Thread B will not initialize the singleton object again. It then releases the obtained lock, as depicted in Figure 4.

Creating and executing threads
The recommended way to develop multithreaded Java applications is by utilizing the Runnable interface from the java.lang package. A Runnable object represents a task that can be executed concurrently with other tasks. The Runnable interface declares the sole method run, which contains the code that specifies the task a Runnable object should carry out. When a thread backed by a Runnable is started, the thread invokes the run method of the Runnable object, which runs in the newly created thread (Kumar et al., 2022). The program utilizes the Runnable interface (line 5) so that multiple print tasks can execute concurrently. The variable sleepTime (line 7) holds a random integer value from 0 to 5 seconds, generated in the constructor of the print task (line 16). Each thread running a print task sleeps for the time specified by sleepTime, then prints its task's name and a message indicating that it has finished sleeping.
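The listing that the surrounding line numbers refer to is not reproduced in this text; a sketch of such a print task (our reconstruction from the description, with the names and layout as assumptions, so the line numbers in the discussion correspond to the original listing, not to this sketch) might look like:

```java
import java.security.SecureRandom;

public class PrintTask implements Runnable {
    private static final SecureRandom generator = new SecureRandom();
    private final int sleepTime;    // random sleep time for this task, in ms
    private final String taskName;

    public PrintTask(String taskName) {
        this.taskName = taskName;
        this.sleepTime = generator.nextInt(5000);  // 0 to 5 seconds
    }

    public int getSleepTime() {
        return sleepTime;
    }

    @Override
    public void run() {
        try {
            System.out.printf("%s going to sleep for %d ms%n", taskName, sleepTime);
            Thread.sleep(sleepTime);              // enter the timed waiting state
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // restore the interrupted status
        }
        System.out.printf("%s done sleeping%n", taskName);  // back in runnable state
    }
}
```

Each PrintTask can be handed to its own Thread (or to an executor), so several tasks sleep and print concurrently.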

Runnable and the run Method
A print task is executed when a thread calls the run method of the print task. Lines 24-25 display a message indicating the name of the currently executing task and that the task is going to sleep for sleepTime milliseconds. Line 26 calls the sleep method of the Thread class to put the thread into a timed waiting state for the specified amount of time. During this time, the thread relinquishes the processor, allowing another thread to execute. When the thread wakes up, it reenters the runnable state.
When the print task is assigned to a processor again, line 35 prints a message indicating that the task has finished sleeping, and then the run method terminates. Note that the catch block at lines 28-32 is necessary because the sleep method may throw a checked exception if the sleeping thread is interrupted (Almutairi et al., 2023).

Experimental Evaluation
The structure was a large-scale, cache-coherent multiprocessor system. It comprises multiple multiprocessor clusters connected through a scalable, low-delay interconnection network. Physical memory was distributed among the processing nodes in the different clusters, and the distributed memory formed a global address space. For each memory block, a directory kept a record of the remote nodes caching it. When a write occurred, point-to-point messages were sent to invalidate remote copies of the block.
Acknowledgement messages were used to notify the originating node when an invalidation was completed. Multithreading requires that the processor be designed to handle multiple contexts simultaneously on a context-switching basis. Figure 4 shows the typical architecture environment using multiple-context processors (P) and memory (M). Four machine parameters are defined to analyze the performance of this network: latency (L), number of threads (N), context switching overhead (C), and the interval between switches (R). Two levels of local cache were used per processing node, and loads and writes were separated with write buffers to implement weaker memory consistency models (Figure 4). The multithreaded processor suspends the current context and switches to another, so after a fixed number of cycles it is busy again doing useful work even while a remote reference is outstanding. Only when all contexts are suspended does the processor go idle (Cai et al., 2022).
The goal here is to maximize the fraction of the time the processor is busy. Processor efficiency is a performance metric given by:

Efficiency = busy / (busy + switching + idle)

where busy, switching, and idle represent the time spent in each state, measured over a substantial period. The basic idea behind a multithreaded machine is to interleave the execution of multiple contexts to greatly reduce the idle term without increasing the switching term too much. The instruction queue unit has a cache, which stores certain instructions according to the instruction specified by the program counter. The minimum required buffer size depends on N, the number of thread slots, and C, the number of cycles required to access the instruction cache.

Processor efficiencies
The single-threaded processor executes its context until a remote reference is issued (R cycles), then waits until the reference completes (L cycles). There is no context switching and hence no switching overhead. This behavior can be modeled as an alternating renewal process with a cycle of R + L, where R and L correspond to the durations during which the processor is busy and idle, respectively. The efficiency of the single-threaded processor is therefore:

E(single) = R / (R + L)

The equation shows the performance degradation of such a processor in a parallel system with large memory latency (Wu et al., 2022).

Saturation region:
In this saturated region, the processor operates with maximum utilization. The cycle of the renewal process in this case is R + C, and the efficiency is simply:

E(sat) = R / (R + C)

That is to say, saturation efficiency does not depend on the latency, and it does not change as the number of contexts increases.
Linear region: when the number of contexts is lower than the saturation point, there may be no ready context after a context switch, so the processor goes through idle cycles. The time to switch to a ready context, execute that context until a remote reference is issued, and service the reference is R + C + L. Assume N is below the saturation point, with the other contexts at various stages of overlapping their latencies in the processor. Efficiency is then given by:

E(lin) = N * R / (R + C + L)

i.e., the efficiency increases linearly with the number of contexts until the saturation point is reached, and beyond that it remains constant (Candelario et al., 2023).
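The three regimes of this model can be computed directly (our own numeric sketch; the parameter values R = 16, C = 2, and L = 128 cycles are illustrative assumptions, not taken from the experiments):

```java
public class EfficiencyModelSketch {
    // single-threaded: busy for R cycles, then idle for L cycles
    public static double singleThreaded(double R, double L) {
        return R / (R + L);
    }

    // saturation: a ready context always exists, so each cycle is R + C
    public static double saturation(double R, double C) {
        return R / (R + C);
    }

    // linear region: N contexts overlap their latencies, capped at saturation
    public static double linear(int N, double R, double C, double L) {
        return Math.min(N * R / (R + C + L), saturation(R, C));
    }

    public static void main(String[] args) {
        double R = 16, C = 2, L = 128;   // illustrative cycle counts
        System.out.printf("single-threaded: %.3f%n", singleThreaded(R, L));
        for (int n = 1; n <= 10; n++) {
            System.out.printf("N=%2d: %.3f%n", n, linear(n, R, C, L));
        }
        // beyond the saturation point (about (R+C+L)/(R+C) contexts here),
        // efficiency stays flat at R / (R + C)
    }
}
```

With these numbers, efficiency climbs from roughly 0.11 at one context toward the saturation ceiling of R / (R + C), illustrating why added contexts stop helping once the latency is fully hidden.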

Result and Discussion
This study experimentally evaluated the performance of a multithreaded architecture using a network model consisting of clusters of multiprocessors (P) with physical memory (M) distributed among the nodes in different clusters, as shown in Figure 4. The distributed memory forms a global address space. Acknowledgement messages were used to notify the originating node when an invalidation was complete. Two levels of local cache were used per processing node, and loads and writes were separated with write buffers to implement weaker memory consistency models. In multiprocessors using an ownership-based cache coherence protocol, if a cache line is to be modified, prefetching directly with ownership can significantly reduce the write latency and the network traffic needed to obtain ownership.
Caching helps to hide latency by increasing the cache hit rate for read operations (Tang et al., 2021). When a cache miss occurs, cycles are wasted, and the benefit of using a distributed cache is to reduce such misses. The test results show that, despite the lock contention, the processors progressed at approximately their relative speeds. Thus, a clear advantage of using multithreading is demonstrated in the test.
To support this point, Figure 5 shows the time distribution across the frames (methods) exposed by the X server to a client thread using Unix domain sockets; since the X server runs at much lower throughput than its client, several frames (methods) have been pointed out. For the multithreaded runs, higher latency was not observed because the server kept pace with its client. This further strengthens the claim that multithreading can speed up performance through parallelism: a program that fully uses two CPUs can run in almost half the time.

Conclusion
The motivation of this study was to improve the performance of several important modern applications that use large amounts of memory and are known to cause frequent CPU stalls. The lack of practical experience with multithreaded architectures suggests that we do not have a complete understanding of how these processors work and of the best way to design them. Having more hardware contexts increases the possibility of latency hiding; on the other hand, not having enough buffers can make memory latency so high that multithreading cannot hide it. Threads that exhibit poor memory locality, such as database applications, are often blocked while waiting for a response from the memory hierarchy: these threads are memory intensive. This observation suggests a better way of handling pipeline conflicts. To evaluate the effect of memory latency on processor performance, a dual-core MT machine with four thread contexts per core was used. The workload includes nine benchmarks from the SPEC CPU2000 suite, chosen so that the workload contains programs with both good and bad cache locality. Two copies of each benchmark were run, resulting in a workload of 18 threads (each benchmark is a single-threaded process running in its own thread). Runtime-oriented simulations and benchmarks were run in the Solaris™ 9 operating environment, which does not include special support for chip multithreading (CMT) processors.
From an operating system perspective, the dual-core multithreaded machine looks the same as a regular eight-way multiprocessor. The Solaris™ scheduler assigns threads to the hardware contexts as if it were assigning them to processors in a multiprocessor system, selecting eight threads at a time to schedule. When modeling the multithreaded environment, statistics are collected while multiple threads in the kernel are active, but the request-modeling engine counts a CPU service request only when a single thread is active on the CPU.
This requires calculations on the collected statistics. Execution time is reduced by a third when using threads compared to executing the functions sequentially; that is the power of multithreading. Multithreading is useful for I/O-bound processes, such as reading files from a network or a database, because each thread can execute an I/O-bound operation concurrently. Note that using multithreading for CPU-bound processes can slow performance, since competition for resources ensures that only one thread can run at a time while overhead is incurred in managing multiple threads.
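The advantage for I/O-bound work can be sketched with a toy experiment (our own illustration; the I/O is simulated with sleeps, and the task count and durations are assumptions):

```java
public class IoBoundSketch {
    public static long sequentialMs;
    public static long threadedMs;

    // simulate an I/O-bound task (e.g., a network or database read) with a sleep
    static void simulatedIo() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int tasks = 4;

        long t0 = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            simulatedIo();                        // one wait after another
        }
        sequentialMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        Thread[] workers = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            workers[i] = new Thread(IoBoundSketch::simulatedIo);
            workers[i].start();                   // all the waits overlap
        }
        for (Thread t : workers) {
            t.join();
        }
        threadedMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println(sequentialMs + " ms sequential vs "
                           + threadedMs + " ms threaded");
    }
}
```

Because the threads spend their time waiting rather than computing, the four waits overlap and the threaded run takes roughly the duration of one task instead of four.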

Acknowledgement
I would like to express my sincere gratitude to my supervisor, Professor Akinwale A.T., Federal University of Agriculture Abeokuta, Nigeria, for his guidance and support throughout the research process. His expertise and insights were invaluable in shaping the research and helping to overcome challenges.
I am also grateful to IAEC-University for providing me with resources and support to complete this study.
Finally, I would like to thank my family and friends for their encouragement and support throughout the research process. Without their support, I would not have been able to complete this research study.

Author Contributions
The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.

Conflicts of Interest
No conflict of interest.

Ethics Approval
Not applicable.
Figure 5. Time taken to download the content of all URLs. Source: Yang et al. (2021).