My team research page can be found here.

Pervasive AI Hardware Design

We are pioneering a brand new Pervasive AI Hardware architecture using the principle of Learning Automata. We sent our first ASIC microchip (mignon.v1, seen in the picture) in February 2020, and we continue to develop more architectures suited to the needs of energy frugality and performance. Our design is supported by a suite of design exploration and automation solutions implemented on various other platforms, such as FPGAs and microcontrollers. Our early results have demonstrated AI hardware that is up to 1000 times more energy efficient than state-of-the-art solutions. Our research is currently funded by two grants from LRF-ICON and EPSRC IAA. More updates on this will follow soon.

Embedded Genomics
Using powerful hardware/software co-design, we are developing algorithms and implementation solutions for embedded genomics. The aim is to make whole genome sequencing highly energy efficient, portable and low-cost to enable personalised healthcare. We are liaising with industry to translate our research (please write to us if you are interested) and have already published a journal paper in IEEE TCBB (see publications); a number of other papers are soon to follow. The project is funded by grants from the Royal Society and EPSRC IAA.

Power-infused, Real-Power AI Hardware
A large amount of energy is lost in traditional systems at the system boundaries, such as from batteries or energy harvesters to power managers and then to the integrated circuits. We are working to minimise this loss through new power-infused integrated hardware solutions. We are developing highly power-elastic AI hardware that can operate in tandem with variable power scavengers, with the natural capability to operate over a dynamic power domain (which we call Real-Power AI Hardware). The aim is to enable a new generation of AI hardware that can operate in pervasive environments. The project is funded by grants from EPSRC DTPs.

Hardware/Software Co-design for Acceleration

Currently we are developing new hardware/software interaction models to accelerate big data applications, such as Whole Genome Assembly. Our key aim is to design an ecosystem of software runtime and hardware interaction models that can be seamlessly integrated for high-performance and energy-efficient accelerated applications. Our system software and runtime developments are based around OpenCL to make the most of the natural heterogeneity in modern hardware platforms. A number of developments are currently in progress.

Real-Power Computing

The computing paradigm for emerging ubiquitous systems, particularly energy-harvested ones, has clearly shifted from that of traditional systems. The energy supply of these systems can vary temporally and spatially within a dynamic range, making computation extremely challenging. Such a paradigm shift requires disruptive approaches to designing computing systems that can provide continued functionality under an unreliable supply power envelope and operate with autonomous survivability (i.e. the ability to automatically guarantee retention and/or completion of a given computation task). In this research, we use the concept of Real-Power Computing, inspired by the above trends and tenets. We show how computation systems must be designed with power-proportionality to achieve sustained computation and survivability when operating at extreme power conditions. We are developing multiple approaches to address the unique challenges of this new paradigm.

Parallelisation-Aware Runtime for Many-Core Systems

What property of many-core applications best describes their resource allocation needs? Our research from the PRiME programme required us to answer this basic question. To find a suitable answer, we needed to go back to the basics of understanding how an application achieves performance and how performance manifests itself in energy consumption. In essence, we have been carrying out research along the lines of extended Amdahl's and Gustafson's models to understand these parametric relationships. Building on these, we are developing a whole new generation of runtimes that are capable of determining parallelisation and power consumption using processor performance counters, to simplify the rather complicated problem of runtime resource allocation in single and concurrent applications. See our publications for more details.
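
For readers unfamiliar with the two baseline models, the sketch below gives their classic textbook forms. It does not reproduce the extended models from our publications; the function names and the parallel fraction p are illustrative assumptions.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Classic Amdahl speedup: fixed workload, parallel fraction p, n cores."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Classic Gustafson speedup: workload scales with n, parallel fraction p."""
    return (1.0 - p) + p * n


# Compare the two models for a hypothetical application that is 90% parallel.
for cores in (1, 2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.9, cores), 2),
          round(gustafson_speedup(0.9, cores), 2))
```

Amdahl's model saturates as the core count grows because the sequential fraction dominates, whereas Gustafson's grows linearly because the parallel workload is assumed to scale with the cores.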

Significance-Driven Approximate Circuits Design

Approximate arithmetic has recently emerged as a promising paradigm, which relaxes the need for perfectly precise outputs in many imprecision-tolerant applications. By leveraging imprecise logic designs, this paradigm can offer substantial energy, performance and area advantages. In this work, we propose an energy-efficient multiplier design using a novel significance-driven approach. Fundamental to this approach is an algorithmic lossy compression of the partial product matrix based on bit significance. The more significant bits are treated with progressively higher precision, while bits with lower significance are compressed using variable logic clusters (VLCs) for vertical reduction of the product terms. As such, the complexity of computation, in terms of logic cell counts and length of the critical paths, is drastically reduced. A number of multipliers (from 4-bit to 128-bit) using this approach have been designed in SystemVerilog and synthesised using Synopsys Design Compiler. Our post-synthesis experiments with a 128-bit multiplier showed that up to 63% less energy consumption and 59% performance improvement can be achieved when compared with traditional multipliers. These gains are achieved at a low loss of accuracy due to significance-driven bit processing, with up to 17% inaccuracies for small-valued operands and exponentially reduced imprecision for higher values.
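
The idea of lossy, significance-driven compression of partial products can be illustrated with a heavily simplified software model. The sketch below is not the VLC design described above: pairing adjacent rows, using a bitwise OR as the compression, and the single `cutoff` column are all assumptions made for illustration.

```python
def approx_multiply(a: int, b: int, n_bits: int = 8, cutoff: int = 8) -> int:
    """Approximate a * b for unsigned operands that fit in n_bits.

    Partial-product rows are taken in pairs. Within a pair, columns below
    `cutoff` are merged with a bitwise OR (lossy), while columns at or above
    `cutoff` are added exactly. With cutoff=0 the result is the exact product.
    """
    low_mask = (1 << cutoff) - 1
    # Row i is the multiplicand shifted by i if bit i of the multiplier is set.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n_bits)]
    total = 0
    for i in range(0, n_bits, 2):
        r0 = rows[i]
        r1 = rows[i + 1] if i + 1 < n_bits else 0
        exact_high = (r0 & ~low_mask) + (r1 & ~low_mask)
        merged_low = (r0 & low_mask) | (r1 & low_mask)
        total += exact_high + merged_low
    return total


exact = 3 * 3
approx = approx_multiply(3, 3, n_bits=4, cutoff=4)
print(exact, approx)
```

Because an OR never exceeds the corresponding sum, this model can only underestimate the product, and raising `cutoff` trades accuracy for fewer exact additions.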

Power Budget-Aware Distributed Energy Minimization in Many-core Systems

Power budget constrained energy minimization of parallel applications is an emerging challenge for current and future generations of many-core computing systems. Traditionally, energy minimization in these systems is carried out using offline training based dynamic voltage/frequency scaling (DVFS) and concurrency control. In this paper, we will demonstrate that the scalability of such energy minimization can be limited with architectural changes. To address these limitations, we propose a scalable and adaptive energy minimization approach that suitably applies DVFS and dynamic core allocations with a given overall energy budget, annotated in the application through a modified OpenMP library. Fundamental to this approach is a dynamic allocation algorithm that suitably allocates concurrent threads and adjusts their energy budgets based on their online workload profiles. With the given concurrent threads and their energy budgets, the core processor VFS controls are then adapted through an iterative learning control (ILC) algorithm guided by the feedback from the CPU performance counters. The proposed approach is validated on an Intel Xeon E5-2630 platform with up to 24 CPUs running NAS parallel benchmark applications. We show that our proposed approach can effectively adapt to architectural allocations and minimize energy consumption by up to 11% compared to the existing approaches for a given energy budget.

OpenMP based Adaptive Energy Minimisation for Many-core Systems

Energy minimization of parallel applications is an emerging challenge for current and future generations of many-core computing systems. In this paper, we propose a novel and scalable energy minimization approach that suitably applies DVFS in the sequential part and jointly considers DVFS and dynamic core allocations in the parallel part. Fundamental to this approach is an iterative learning based control algorithm that adapts the voltage/frequency scaling and core allocations dynamically based on workload predictions and is guided by the CPU performance counters at regular intervals. The adaptation is facilitated through performance annotations in the application codes, defined in a modified OpenMP runtime library. The proposed approach is validated on an Intel Xeon E5-2630 platform with up to 24 CPUs running NAS parallel benchmark applications. We show that our proposed approach can effectively adapt to different architecture and core allocations and minimize energy consumption by up to 17% compared to the existing approaches for a given performance requirement.
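
To give a flavour of feedback-driven frequency adaptation, the sketch below applies a simple proportional learning update once per control interval. It is an illustrative assumption, not the control law from the paper: the gain, the frequency limits and the toy performance model (performance proportional to frequency) are all hypothetical.

```python
def learning_step(freq_ghz: float, measured_perf: float, target_perf: float,
                  gain: float = 0.5, f_min: float = 1.2, f_max: float = 2.6) -> float:
    """One proportional learning update of the frequency setting.

    The relative performance error from the last interval is fed back to
    scale the frequency, which is then clamped to the (hypothetical) range.
    """
    error = (target_perf - measured_perf) / target_perf
    new_freq = freq_ghz * (1.0 + gain * error)
    return min(f_max, max(f_min, new_freq))


def toy_performance(freq_ghz: float) -> float:
    """Hypothetical stand-in for a performance-counter reading."""
    return 1.0 * freq_ghz


freq = 2.6      # start at the highest setting
target = 1.5    # required performance, in the toy model's arbitrary units
for interval in range(30):
    freq = learning_step(freq, toy_performance(freq), target)
print(round(freq, 3))
```

In this toy setting the frequency settles at the lowest value that still meets the target, which is the behaviour an energy-minimising governor aims for.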

Transfer Learning-based Adaptive Power Governor for Embedded Systems

Embedded systems execute applications with different performance requirements. These applications exercise the hardware differently depending on the types of computation being carried out, generating workloads that vary with time. We will demonstrate that energy minimization with such workload and performance variations within (intra) and across (inter) applications is particularly challenging. To address this challenge, we propose an online energy minimization approach, capable of minimizing energy through adaptation to these variations. At the core of the approach is an initial learning phase using a reinforcement learning algorithm that selects the appropriate voltage/frequency scalings (VFS) based on workload predictions to meet the applications' performance requirements. The adaptation is then facilitated and expedited through learning transfer, which uses the interaction between the system application, runtime and hardware layers to adjust the power control levers. The proposed approach is implemented as a power governor in Linux and validated on an ARM Cortex-A8 running different benchmark applications. We show that with intra- and inter-application variations, our proposed approach can effectively minimize energy consumption by up to 33% compared to existing approaches. Scaling the approach further to multi-core systems, we also show that it can minimize energy by up to 18% with 2X reduction in the learning time when compared with a recently reported approach.
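
The summary above does not name the reinforcement learning algorithm, so the sketch below swaps in standard tabular Q-learning purely as an illustration. The state (a workload level), the action (a VFS setting index) and the reward are hypothetical choices.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update.

    q is a list of rows, one per state, each holding one value per action
    (here: one value per hypothetical VFS setting).
    """
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


# Two workload levels (states) and two VFS settings (actions).
q_table = [[0.0, 0.0], [1.0, 2.0]]
updated = q_update(q_table, state=0, action=1, reward=1.0, next_state=1,
                   alpha=0.5, gamma=0.5)
print(updated)
```

A transfer-learning step could then, for example, initialise the table for a new application from one learnt earlier rather than from zeros; that too is an assumption about how such a transfer might look.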

Software-based Online Testing and Fault Tolerance for On-Demand Reliable Systems

Commercial off-the-shelf (COTS) components are increasingly being employed in embedded systems due to their high performance at low cost. With emerging reliability requirements, designing these components using traditional hardware redundancy incurs large overheads and time-demanding re-design and validation. To reduce the design time under shorter time-to-market requirements, software-only reliable design techniques can provide an effective and low-cost alternative. This paper presents a novel, architecture-independent software modification tool, SMART (Software Modification Aided transient eRror Tolerance), for effective error detection and tolerance. To detect transient errors in processor datapath, control flow and memory at reasonable system overheads, the tool incorporates selective and non-intrusive data duplication and dynamic signature comparison. In addition, to mitigate the impact of the detected errors, it facilitates further software modification implementing software-based check-pointing. Because the software-based source-to-source modification is automatic and tailored to a given reliability requirement, the tool requires no re-design effort and no hardware- or compiler-level intervention. We evaluate the effectiveness of the tool using a Xentium processor based system as a case study of COTS based systems. Using various benchmark applications with a single-event upset (SEU) based error model, we show that up to 91% of the errors can be detected or masked with reasonable performance, energy and memory footprint overheads.

On-Chip Architecture Exploration

Using analytical and simulation results, this work carried out comparative analyses between network on chip (NoC) and shared-bus AMBA using real-application traffic from an MPEG-2 video decoder in a cycle-accurate, realistic simulation environment. The comparisons were carried out in terms of performance and reliability in the presence of soft errors. The performance results demonstrated that, despite higher channel latency, NoC has a bandwidth advantage and outperforms shared-bus AMBA, requiring a lower frequency to decode the video bitstream at a given frame rate. On the other hand, the reliability comparison demonstrated that, due to higher register usage, NoC interconnects suffer from up to 24 times more soft errors than shared-bus AMBA.

SystemC-based Simulated Fault Injection

In this work, a new SystemC-based fault injection technique was proposed with improved fault representation in all data and signal registers. The technique has been demonstrated to be minimally intrusive since it only requires replacing the original data or signal types with fault injection enabler types. The proposed simulation technique was compared with recently reported SystemC-based techniques. It offers fast simulation speed, better fault representation and flexibility, while maintaining simplicity and minimum intrusion. To demonstrate and validate the fault injection capabilities of the proposed technique, a behavioural SystemC description of an MPEG-2 decoder setup was used. The validation shows that up to 98.9% fault representation within data and signal registers can be achieved, unlike previously reported SystemC-based fault injection techniques.
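
The type-replacement idea can be pictured with a small analogue in Python rather than SystemC. The class below is a hypothetical illustration, not the simulator's actual enabler type: it behaves like a plain register but also exposes a bit-flip hook for injecting faults.

```python
class FaultInjectableRegister:
    """A fixed-width register that can have single-bit faults injected."""

    def __init__(self, value: int = 0, width: int = 8):
        self.width = width
        self._mask = (1 << width) - 1
        self._value = value & self._mask

    def write(self, value: int) -> None:
        """Store a value, truncated to the register width."""
        self._value = value & self._mask

    def read(self) -> int:
        """Return the current (possibly corrupted) value."""
        return self._value

    def inject_bit_flip(self, bit: int) -> None:
        """Flip one bit to emulate a single-event upset."""
        if not 0 <= bit < self.width:
            raise ValueError("bit index outside register width")
        self._value ^= 1 << bit


reg = FaultInjectableRegister(0b1010, width=4)
reg.inject_bit_flip(0)
print(bin(reg.read()))
```

Code that previously used a plain integer only needs to go through `read` and `write`, which mirrors the low-intrusion property described above.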

The fault injection simulator will be online soon.

System-level Low Power and Reliable Design of MPSoCs

This work presented the first study that established the relationship between application-level correctness (i.e. the impact of soft errors at the application level) and power consumption minimization of MPSoCs through voltage scaling. A novel voltage scaling technique has been proposed using this relationship, which can be used to produce optimized low power designs with acceptable application-level correctness (expressed in terms of peak signal-to-noise ratio) in the presence of a specified soft error rate and timing constraints. Furthermore, the work has presented a detailed investigation of the effect of MPSoC architecture allocation and application task mapping on the trade-offs between application-level correctness and power minimization through voltage scaling. An MPEG-2 video decoder has been used as a case study to validate the proposed voltage scaling technique. The work further envisages that any MPSoC application can benefit from the proposed technique if low power and reliability are of concern under real-time performance constraints.
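
Peak signal-to-noise ratio, the correctness metric mentioned above, follows the standard definition PSNR = 10 log10(MAX^2 / MSE). The helper below is a minimal sketch assuming 8-bit samples (MAX = 255) stored in flat lists; the function name is illustrative.

```python
import math


def psnr(reference, decoded, max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 10 grey levels after a hypothetical soft error.
print(round(psnr([100, 100], [110, 90]), 2))
```

A higher PSNR means the decoded output is closer to the error-free reference, so a voltage setting can be judged acceptable when the PSNR stays above a chosen threshold.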

Random Task and Resource Graph (RTRG) Tool

RTRG is a random task and resource graph tool to facilitate task mapping-related research in embedded systems. The tool now supports generation of task maps in .rtg format, with or without resource mapping. All task graphs generated by this tool include costs of tasks and dependencies. Working versions of the C/C++ executable for Linux and Windows are available. The RTRG tool can generate random task graphs with the following user-specified inputs:

1. The number of computation tasks (N, compulsory).

2. The lowest and highest number of dependencies (optional, by default 0 to N/2).

3. The lowest and highest costs of computation tasks (optional, by default 1 to 40).

4. The lowest and highest costs of communication tasks (optional, by default 1 to 10).

5. The probability distribution for random generation of items 2, 3 and 4 (separately specified and optional, by default all uniform distribution).
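
The inputs listed above can be pictured with a short generator. The sketch below is not the RTRG tool or its .rtg format: it uses only uniform distributions, treats the dependency count as the number of predecessors per task, and all names are illustrative assumptions.

```python
import random


def random_task_graph(n, dep_range=None, comp_cost=(1, 40), comm_cost=(1, 10),
                      seed=None):
    """Generate a random acyclic task graph with task and dependency costs.

    Returns (tasks, edges): tasks maps task id -> computation cost, and
    edges maps (predecessor, successor) -> communication cost.
    """
    rng = random.Random(seed)
    lo, hi = dep_range if dep_range is not None else (0, n // 2)
    tasks = {i: rng.randint(*comp_cost) for i in range(n)}
    edges = {}
    for i in range(1, n):
        # A task can depend only on earlier tasks, which keeps the graph acyclic.
        k = min(rng.randint(lo, hi), i)
        for pred in rng.sample(range(i), k):
            edges[(pred, i)] = rng.randint(*comm_cost)
    return tasks, edges


tasks, edges = random_task_graph(10, seed=1)
print(len(tasks), len(edges))
```

Passing a fixed `seed` makes a generated graph reproducible, which is convenient when comparing task mapping algorithms on the same input.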

For further information, go to the RTRG page.

On-Chip Pipelined Communication

In this work, pipelined NoC channels and switches were designed to enable cycle-accurate on-chip communication for NIRGAM. The work involved SystemC-based modelling, simulation and validation using different architecture allocations and applications. Currently, a 32-bit implementation is complete. A synthesis of the NoC switch and router was carried out as part of an MSc project, which I supervised.

Low Power Congestion Controller on Ad Hoc Wireless LANs

A novel, low power congestion controller was developed for multi-hop wireless LANs based on a classical time-delay control model. The congestion control scheme is first derived as a continuous-time model using internal model control (Smith's Predictor) principles. Building on the continuous-time model, a discrete and scalable digital-filter based solution is developed to enable low-level, low-power and yet fast circuit-level control.

Copyright 2018 - RA Shafik