List of Posters

Hossein Jafari and Lijun Qian.
Multiple data streams fusion using Dempster-Shafer theory
Abstract: Because emerging technologies such as IoT, wearable devices, and distributed sensors generate huge amounts of data in real time, processing and making decisions from aggregated data streams that evolve over time is an important challenge. The problem becomes more difficult because of uncertainties and conflicts in the data due to imperfect sensor readings and varying sensing ranges. In this study, we try to provide insights and a potential solution by applying the combination rule of the Dempster-Shafer theory of evidence (DST).
From the DST perspective, measurements from each data source act as evidence of an underlying stochastic process. For each measurement, distances from predefined references and the related mass functions are calculated. Then DST combination rules are applied to combine the mass functions across all data sources. Finally, the total combined mass values or the associated pignistic probability values can be compared to a threshold for change detection and decision making in real time. We applied the proposed method to study the "comfort zone" based on environmental monitoring data sets, namely temperature and humidity data collected by our community-based sensing test bed (using smartphone-paired sensors). It is demonstrated that the proposed methods based on DST perform well when the data contain uncertainties and conflicts, and that they can be applied to smart buildings for optimizing and controlling the usage of air conditioning systems for heating, cooling, and ventilation, which account for a big portion of the electricity bill, while keeping occupants comfortable.
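The combination and decision steps described in this abstract can be sketched as follows. This is a minimal illustration of Dempster's rule and the pignistic transform, not the authors' implementation; the two-hypothesis frame, the mass values, and the 0.5 threshold are all assumptions made for the example.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1].
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that falls on contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def pignistic(m):
    """Spread the mass of each focal set evenly over its elements."""
    bet = {}
    for s, w in m.items():
        for h in s:
            bet[h] = bet.get(h, 0.0) + w / len(s)
    return bet


# Hypothetical frame: the room is comfortable (C) or uncomfortable (U).
C, U = frozenset({"C"}), frozenset({"U"})
BOTH = C | U  # mass on the whole frame expresses ignorance

# Hypothetical mass functions derived from two sensors.
m_temperature = {C: 0.6, U: 0.1, BOTH: 0.3}
m_humidity = {C: 0.5, U: 0.2, BOTH: 0.3}

fused = combine(m_temperature, m_humidity)
bet = pignistic(fused)
decision = "comfortable" if bet["C"] > 0.5 else "uncomfortable"
```

With more than two sources, `combine` can be applied repeatedly, since Dempster's rule is associative.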
Xin Zhan and Peng Li.
Stability Assurance in Distributed On-chip Regulation: Theoretical Foundation, Over-design Reduction and Performance Optimization

Abstract: While distributed on-chip voltage regulation offers an appealing solution to power delivery, designing power delivery networks (PDNs) with distributed on-chip voltage regulators and guaranteed stability is challenging because of the complex interactions between the active regulators and the bulky passive network. The recently developed hybrid stability technique provides an efficient stability checking and design approach, giving rise to a highly desirable localized design of the PDN. However, the theoretical conservativeness of the hybrid stability criteria can lead to pessimism in stability evaluation and hence to overdesign. We address this challenge by proposing an optimal frequency-dependent system partitioning technique that significantly reduces the conservativeness of stability checking. With theoretical rigor, we show how to partition the PDN system by employing optimal frequency-dependent impedance splitting between the passive sub-network and the voltage regulators while maintaining the desired theoretical properties of the partitioned system blocks upon which the hybrid stability principle is anchored. We demonstrate the proposed design approach using an automated design flow, which has been shown to greatly reduce the conservativeness of PDN stability checking and to lead to significantly improved regulation performance and power efficiency.
Venkateshwar Kottapalli and Sunil Khatri.
A Practical Methodology to Validate the Statistical Behavior of Bloom Filters

Abstract: Bloom filters are commonly used to test for set membership. A Bloom filter consists of a series of hash functions, whose combined signature is used to construct a composite hash. Applications are broad and include networking, transactional memory, and server load balancing. Several realizations, both in hardware and software, have been proposed in academia and industry, but a reliable means to empirically test the statistical behavior of Bloom filters is lacking. In this paper, we present an efficient, practical methodology to validate the statistical performance of Bloom filters. This paper does not focus on the design of a new Bloom filter. Our scheme is based on the use of the NIST test suite, which is a widely accepted means to test for randomness in the cryptography community. In particular, we first test the randomness of the hash functions that comprise the Bloom filter in isolation, and then test the randomness of the Bloom filter, which is constructed by combining the hash functions. We also prove that randomness is a sufficient condition for the uniformity of the hash function and the Bloom filter output. Finally, we test our approach using two classes of Bloom filters and, based on the results, provide practical guidelines for designing Bloom filters.
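To make the idea concrete, here is a minimal sketch of a Bloom filter together with the frequency (monobit) test, which is the simplest test in the NIST randomness suite. The salted SHA-256 hashing, the filter size, and the choice of this particular test are assumptions for illustration; the abstract does not say which hash functions or which NIST tests the authors use.

```python
import hashlib
import math


class BloomFilter:
    """Bit array plus k hash functions, here derived by salting SHA-256."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # One bit position per hash function; the index i acts as the salt.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


def monobit_p_value(bits):
    """Frequency test: p-value for 'ones and zeros are equally likely'."""
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(len(bits))
    return math.erfc(s_obs / math.sqrt(2))


def hash_output_bits(items, salt=0):
    """Concatenate the output bits of one salted hash over many inputs."""
    bits = []
    for item in items:
        digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        for byte in digest:
            bits.extend(bool((byte >> k) & 1) for k in range(8))
    return bits


bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)

# Test one hash function in isolation on hypothetical inputs "0" .. "63".
p_value = monobit_p_value(hash_output_bits(str(i) for i in range(64)))
```

A small p-value (NIST suggests a 0.01 significance level) would indicate that the bit stream is biased; a full validation would run the remaining tests of the suite as well.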
Chance Tarver, Raul Martinez and Joseph Cavallaro.
High-Order, Sub-band Digital Predistortion: WARPLab Testing

Abstract: To meet increasing data-rate demands despite spectrum scarcity, noncontiguous transmission schemes such as carrier aggregation in the LTE-Advanced standard are becoming more popular. However, these schemes pose a major challenge in radio transmitter and power amplifier (PA) design. The noncontiguous carriers of such a scheme intermodulate due to the nonlinear nature of the PA. This potentially causes severe unwanted emissions, which may interfere with neighboring channel signals or desensitize the receiver in frequency division duplexing (FDD) transceivers. We implement a low-complexity, sub-band DPD solution that corrects the distortion caused by up to ninth-order nonlinearities in the PA using the WARPLab environment for the WARPv3 SDR platform. This is done by targeting the most problematic spurious intermodulation distortion components at the PA output and training an injection at that spur to reduce its magnitude. This sub-band method allows for substantially reduced processing complexity compared with full-band predistortion solutions. Using these techniques, we are able to suppress IMD spurs in WARP by over 20 dB.
Kaipeng Li, Bei Yin, Michael Wu, Yujun Chen, Joseph Cavallaro and Christoph Studer.
GPU accelerated massive MIMO uplink data detector

Abstract: We present a reconfigurable GPU-based uplink detector for massive MIMO software-defined radio (SDR) systems. To enable high throughput, we implement a configurable linear minimum mean square error (MMSE) soft-output detector and reduce its complexity without sacrificing error-rate performance. To take full advantage of the GPU computing resources, we exploit the algorithm's inherent parallelism and make use of efficient CUDA libraries and the GPU's hierarchical memory resources. We furthermore use multi-stream scheduling and multi-GPU workload deployment strategies to pipeline streaming-detection tasks with little host-device memory copy overhead. Our flexible design is able to switch between a high-accuracy Cholesky-based detection mode and a high-throughput conjugate gradient (CG)-based detection mode, and supports various antenna configurations. Our current GPU implementation exceeds 250 Mb/s detection throughput for a 128×16 antenna system when running on two of the latest GPUs concurrently. To further scale up our design, we also propose novel methods, with initial simulation results, for distributed massive MIMO data detection in a cluster-based manner.
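The CG-based detection mode mentioned above can be illustrated with a small CPU-only sketch: linear MMSE detection reduces to solving a symmetric positive definite system, which conjugate gradient handles iteratively. This is a real-valued toy under stated assumptions (a hypothetical 4-antenna, 2-user channel, BPSK symbols, hard decisions), not the authors' CUDA design, which works on complex signals and produces soft outputs.

```python
def matvec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, iterations):
    """Approximately solve a @ x = b for symmetric positive definite a."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - a @ x, with x = 0 initially
    p = list(r)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iterations):
        if rs_old < 1e-24:  # already converged
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def mmse_equalize(h, y, noise_var, iterations):
    """Linear MMSE estimate: solve (H^T H + noise_var * I) x = H^T y."""
    users = len(h[0])
    ht = [[row[j] for row in h] for j in range(users)]  # transpose of H
    gram = [[dot(ht[i], ht[j]) + (noise_var if i == j else 0.0)
             for j in range(users)] for i in range(users)]
    matched = [dot(ht[i], y) for i in range(users)]  # matched-filter output
    return conjugate_gradient(gram, matched, iterations)


# Hypothetical channel (rows: base-station antennas, columns: users).
H = [[1.0, 0.2], [0.3, 0.9], [0.8, -0.4], [-0.1, 1.1]]
sent = [1.0, -1.0]
received = matvec(H, sent)  # noiseless, to keep the example deterministic

estimate = mmse_equalize(H, received, noise_var=0.01, iterations=2)
detected = [1.0 if v >= 0 else -1.0 for v in estimate]
```

Fewer CG iterations trade accuracy for speed, which is the kind of trade-off the two detection modes in the abstract expose; a Cholesky factorization would instead solve the same system exactly.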
Aida Vosoughi.
Resilient Distributed Cooperative Spectrum Sensing for Cognitive Radio Ad Hoc Networks

Abstract: Cooperation among cognitive radios for spectrum sensing is deemed essential for environments with deep shadows. However, the existing cooperative spectrum sensing schemes for mobile ad hoc networks are high-overhead and vulnerable to spectrum sensing data falsification attacks. We propose a novel trust-aware consensus-inspired cooperative sensing scheme based on iterative broadcast and update that is fully distributed, low-cost and resilient to malicious behavior in the cooperative network. Unlike the existing schemes, our method does not require any network discovery by the nodes for cooperation; therefore, it offers significantly lower overhead with no degradation in sensing performance.
The proposed distributed trust management scheme is integrated into cooperation to mitigate different types of attacks in practical scenarios where the primary user, the secondary users of the spectrum, and the malicious users behave dynamically. We evaluate our proposed trust-aware distributed cooperative sensing scheme by extensive Monte Carlo simulations modeling realistic scenarios of mobile cognitive radio ad hoc networks in TV white space. We show that the proposed scheme reduces the harm of a set of collusive attackers by up to two orders of magnitude in terms of missed-detection and false-alarm error rates. In addition, in a hostile environment, integration of trust management considerably relaxes the sensitivity requirements on the cognitive radio devices.
He Zhou, Jiang Hu, Sunil Khatri, Frank Liu, Cliff Sze and Mohammadmahdi Yousefi.
Accelerating Bayesian Control of Markovian Genetic Regulatory Networks on a GPU

Abstract: A recently developed approach to precision medicine is the use of Markov Decision Processes (MDPs) on Gene Regulatory Networks (GRNs). Due to very limited information on the system dynamics of GRNs, the MDP must repeatedly conduct an exhaustive search for a non-stationary policy, and thus entails exponential computational complexity. This has hindered its practical application to date. With the goal of overcoming this obstacle, we investigate acceleration techniques using the Graphics Processing Unit (GPU) platform, which allows massive parallelism. Our GPU-based acceleration techniques are applied to two different MDP approaches: the optimal Bayesian robust (OBR) policy and the forward search sparse sampling (FSSS) method. Simulation results demonstrate that our techniques achieve a speedup of two orders of magnitude over sequential implementations. In addition, we present a study on the memory utilization and error trends of these techniques.
Yingyezhe Jin and Peng Li.
AP-STDP: A Novel Self-Organizing Mechanism for Efficient Reservoir Computing

Abstract: The Liquid State Machine (LSM) exploits the computation capability of recurrent spiking neural networks by incorporating a randomly generated reservoir, which is often fixed. This standard choice relaxes the challenging need for training the complex recurrent reservoir. The fixed reservoir is used as a generic kernel to map the temporal input signals to the internal network dynamics, and a readout layer is trained to extract the information embedded in the network dynamics to facilitate pattern classification.

However, the question of how to effectively tune the reservoir for given computational tasks remains to be answered. In this paper, we propose a novel Activity-based Probabilistic Spike-Timing-Dependent Plasticity (AP-STDP) mechanism for self-organizing reservoirs. Compared to conventional STDP mechanisms, the proposed rule improves tuning efficiency, prevents the saturation of synaptic memory, and boosts performance. We assess the internal representation ability of the proposed self-organizing mechanism via principal component analysis (PCA) and show that the proposed method is advantageous over other STDP algorithms. Using the spoken English letters from the TI46 speech corpus for performance benchmarking, we demonstrate that AP-STDP consistently outperforms other STDP mechanisms regardless of reservoir size, and is able to boost the performance of isolated spoken English letter recognition by 2.7% with a small reservoir size.
Monther Abusultan and Sunil Khatri.
Ternary-valued Digital Circuit Implementation using Flash Transistors

Abstract: This paper presents a method to use floating gate (flash) transistors to implement low power ternary-valued digital circuits targeting handheld and IoT devices. Since the threshold voltage of flash devices can be modified at a fine granularity during programming, our approach has several advantages. For one, speed binning at the factory can be controlled with precision. Secondly, an IC can be re-programmed in the field, to negate effects such as aging, which has been a significant problem in recent times, particularly for mission-critical applications. We present the circuit topology that we use in our flash-based ternary logic digital circuit approach, and, through circuit simulations, show that our approach yields significantly improved power (~11%), energy (~29%) and area (~83%) characteristics while operating at a clock rate that is 36% as compared to a traditional CMOS standard cell based approach, when averaged over 20 designs. Unlike CMOS, our ternary-valued, flash-based implementation provides in-field configuration flexibility.
Ahmad Al Kawam, Sunil Khatri and Aniruddha Datta.
A CPU-GPU heterogeneous algorithm for NGS Read Alignment

Abstract: Next Generation DNA Sequencing (NGS) has unleashed a wealth of genomic information by producing immense amounts of data. It is enabling humanity to learn more about the origins of life and the genetic basis of diseases like cancer. Before biological conclusions can be drawn, sequencing data have to go through a computationally expensive analysis process. Genomic analysis is typically carried out using traditional computing platforms, which have become a limiting factor in the process. Consequently, we are leading a pioneering effort to design hardware systems optimized to carry out genomic analysis. Our work investigates the use of cutting-edge computing architectures, memory technologies, and custom-designed chips to accelerate analysis without jeopardizing biological accuracy. Our current work tackles the NGS read alignment process, in which millions of DNA fragments, called reads, are mapped to a reference genome. The massive scale of the problem makes it an attractive target for acceleration. To resolve this bottleneck, we implement a read alignment tool designed to run on a heterogeneous system composed of a GPU and a multicore CPU. We introduce novel techniques to the alignment process and construct a computational pipeline of interleaved CPU and GPU stages. Our design exploits the GPU’s massive parallelism to hide memory access latency. We overlap CPU and GPU computations with data transfers to maximize throughput. We compared our tool with popular alignment tools, and the results show substantial speedups, reducing the runtime from a few hours to a few minutes.
Qian Wang, Youjie Li and Peng Li.
Liquid State Machine based Pattern Recognition on FPGA with Firing-Activity Dependent Power Gating and Approximate Computing

Abstract: This paper presents an FPGA architecture and implementation of the Liquid State Machine, a spiking neural network model, for real world pattern recognition problems. The proposed architecture consists of a parallel digital reservoir with fixed synapses, and a readout stage that is tuned by a biologically plausible supervised learning rule. When evaluated using the TI46 speech corpus, a widely adopted speech recognition benchmark, the presented FPGA neuromorphic processors demonstrate highly competitive recognition performance and provide a runtime speedup of 88X over the 2.3 GHz AMD Opteron™ Processor. A number of critical design issues such as interconnection of liquid neurons, storage of synaptic weights and design of arithmetic blocks are addressed in this work. More importantly, it is shown that the unique computational structure and inherent resilience of the liquid state machine can be leveraged for highly efficient FPGA implementation. For this, it is demonstrated that the proposed firing-activity based power gating and approximate arithmetic computing with runtime adjustable precision can lead to up to 30.2% reduction in power and energy dissipation without greatly impacting speech recognition performance.
Wuxi Li, Shounak Dhar and David Pan.
UTPlaceFR: A Routability-Driven FPGA Placer with Physical and Congestion Aware Packing

Abstract: FPGA packing and placement without routability consideration could yield unroutable results for high-utilization designs. Conventional FPGA packing and placement approaches are shown to have severe difficulties in yielding good routability. In this paper, we propose an FPGA packing and placement framework that simultaneously optimizes wirelength and routability. A novel physical and congestion aware packing algorithm and several congestion aware detailed placement techniques are proposed. Compared with the top 3 winners of the ISPD’16 FPGA placement contest, we achieve 3.2%, 7.2%, and 28.2% better routed wirelength with similar or shorter runtime.
Derong Liu, Bei Yu, Salim Chowdhury and David Z. Pan.
Incremental Layer Assignment for Critical Path Timing

Abstract: With VLSI technology nodes scaling into the nanometer regime, interconnect delay plays an increasingly critical role in timing. For layer assignment, most works deal with via counts or total net delays, ignoring the critical paths of each net and resulting in potential timing issues. In this work, we propose an incremental layer assignment framework targeting delay optimisation for the critical path of each net. A set of novel techniques is presented: self-adaptive quadruple partition based on KxK division benefits the run-time; semidefinite programming is utilised for each partition; a post mapping algorithm guarantees integer solutions while satisfying edge capacities. The effectiveness of our work is verified on ISPD global routing benchmarks.
Jiaojiao Ou, Bei Yu and David Z. Pan.
Concurrent Guiding Template Assignment and Redundant Via Insertion for DSA-MP Hybrid Lithography

Abstract: Directed Self-Assembly (DSA) is a very promising emerging lithography technique for 7nm and beyond, where a coarse guiding template produced by conventional optical lithography can “magically” generate fine pitch vias/contacts through a self-assembly process. A key challenge for DSA-friendly layout is the guiding template assignment to cover all vias under consideration. Meanwhile, redundant via insertion has been widely adopted to improve yield and reliability of the circuit. In this paper, we propose a comprehensive framework for concurrent DSA guiding template assignment and redundant via insertion with consideration of multiple patterning (MP) in guiding template generation. We first formulate the problem as an integer linear program (ILP), and then propose a novel approximation algorithm to achieve a good trade-off between performance and runtime. The experimental results demonstrate the effectiveness of the proposed algorithms. To the best of our knowledge, this is the first work on concurrent guiding template assignment and redundant via insertion for DSA-MP hybrid lithography.
Yujie Wang, Pu Chen, Jiang Hu and Jeyavijayan Rajendran.
The Cat and Mouse in Split Manufacturing

Abstract: Split manufacturing of integrated circuits eliminates vulnerabilities introduced by an untrusted foundry by manufacturing only a part of the design at an untrusted high-end foundry and the remaining part at a trusted low-end foundry. Most researchers have focused on attacks and defenses for hierarchical designs and/or use a relatively high-end trusted foundry, leading to high cost.
We propose an attack and a defense for split manufacturing of industry-standard/relevant flattened designs. Our attack uses a network-flow model and outperforms previous attacks. We also develop a defense technique using placement perturbation, while considering overhead. The effectiveness of our techniques is demonstrated on benchmark circuits.
Prasenjit Biswas and Duncan M. Walker.

Abstract: Scan-based delay test achieves high fault coverage due to its improved controllability and observability. This is particularly important for our K Longest Paths Per Gate (KLPG) test approach, which has additional necessary assignments on the paths. At the same time, some percentage of the flip-flops in the circuit will not be scan cells, increasing the difficulty of test generation. In particular, there is no direct control over the outputs of those non-scan cells. All the non-scan cells that cannot be initialized are considered "uncontrollable" in the test generation process. They behave like “black boxes” and thus may block a potential path propagation, resulting in path delay test coverage loss. It is common for the timing-critical paths in a circuit to pass through nodes influenced by the non-scan cells. In our work, we have extended the traditional Boolean algebra by including the uncontrolled state as a legal logic state, so that we can improve path coverage. Many path pruning decisions can be taken much earlier and many of the paths lost due to uncontrollable non-scan cells can be regained, increasing path coverage and potentially CPU time. We have extended the existing traditional algebra to an 11-value algebra (Zero (stable), One (stable), Unknown, Uncontrollable, Rise, Fall, Zero/Uncontrollable, One/Uncontrollable, Unknown/Uncontrollable, Rise/Uncontrollable, Fall/Uncontrollable). The logic descriptions for the NOT, AND, NAND, OR, NOR, XOR, XNOR, PI, Buff, Mux, TSL, TSH, TSLI, TSHI, TIE1 and TIE0 cells in the ISCAS89 benchmark circuits have been extended to the 11-value truth table. With 10% non-scan flip-flops, improved path recovery has been observed in comparison to that with the traditional algebra. The more longest paths we want to cover, the more path recovery advantage we achieve using our algebra.
Mallika Pokharel and Duncan M. H. Walker.
At-Speed Test Compaction

Abstract: Testing circuits in functional mode, with multiple functional cycles following the scan-in, increases the delay correlation between scan and functional test. In this work we exploit those multiple functional cycles by compacting additional K longest paths per gate (KLPG) tests into the cycles, reducing the number of scan patterns. The challenge is that the at-speed delay test in a cycle must have its necessary value assignments set up in previous (preamble) cycles, and have the captured results propagated to a scan cell in later (coda) cycles. We extend our prior work on dynamic compaction to support multi-cycle test. We retain the greedy nature of our original algorithm, in that each path delay test is compacted into the first test pattern that it fits, and into the first at-speed cycle within that pattern that it fits.
Kinshuk Sharma and Sunil Khatri.
Design a robust C-Element for asynchronous circuit design

Abstract: Metastability causes unpredictable behavior in circuits and can sometimes cause circuit failure. Although much work has been done to reduce the possibility of metastability in synchronous designs, metastability in asynchronous designs has not been given significant attention. For asynchronous designs, metastability resolution is often assumed using some handshake protocol. However, metastability might still persist in various asynchronous circuit elements for special timing cases. One such circuit is the C-Element, which is used in asynchronous circuit design to generate the AND function for events on its inputs. The C-Element is vulnerable to metastability conditions at its output for opposite edge transitions at its inputs or for a single input glitch. In this work, a robust design of a C-Element is proposed which aims at reducing the possibility of metastability at its output by using independent paths for the pull-up and pull-down transitions. Three different existing circuit topologies for a C-Element have been studied and modified with the proposed design. Experimental results show significant improvements in signal integrity at the output of the C-Element.
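For readers unfamiliar with the C-Element, its ideal logical behavior can be captured in a few lines. This is a purely behavioral sketch of a two-input Muller C-element, assuming idealized, glitch-free digital inputs; it cannot exhibit metastability and says nothing about the transistor-level design proposed in the abstract.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # when they differ, the previous output is held.
        if a == b:
            self.output = a
        return self.output


c = CElement()
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
trace = [c.update(a, b) for a, b in inputs]  # [0, 0, 1, 1, 0]
```

The hold state is what makes the element an "AND for events": the output rises only after both inputs have risen and falls only after both have fallen.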
Meng Li, Jin Miao, Kai Zhong and David Pan.
Practical Public PUF Enabled by Solving Max-Flow Problem on Chip

Abstract: The execution-simulation gap (ESG) is a fundamental property of public physical unclonable function (PPUF), which exploits the time gap between direct IC execution and computer simulation. ESG needs to consider both advanced computing schemes, including parallel and approximate computing schemes, and IC physical realization. In this paper, we propose a novel PPUF design, whose execution is equivalent to solving the hard-to-parallel and hard-to-approximate max-flow problem in a complete graph on chip. Thus, the max-flow problem can be used as the simulation model to bound the ESG rigorously. To enable an efficient physical realization, we propose a crossbar structure and adopt a source degeneration technique to map the graph topology on chip. The difference in asymptotic scaling between execution delay and simulation time is examined in the experimental results. The measurability of the output difference is also verified to prove the physical practicality.
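The simulation side of this gap amounts to computing a maximum flow in software. A minimal sketch using the Edmonds-Karp algorithm on a small complete directed graph follows; the algorithm choice and the capacity values are assumptions for illustration, and the abstract does not state which max-flow solver the authors use as their simulation model.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest paths in the residual graph."""
    if source == sink:
        raise ValueError("source and sink must differ")
    n = len(capacity)
    residual = [row[:] for row in capacity]
    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update reverse edges.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical complete directed graph on 4 nodes; entry [u][v] is the
# capacity of edge u -> v (think of it as a crossbar cell's strength).
capacity = [
    [0, 3, 2, 1],
    [1, 0, 2, 3],
    [1, 1, 0, 2],
    [1, 1, 1, 0],
]
flow = max_flow(capacity, source=0, sink=3)  # 6
```

On a complete graph with n nodes this runs in polynomial but rapidly growing time, which is the scaling the chip is meant to outpace.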
Jeffrey M. Dudek, Kuldeep S. Meel and Moshe Y. Vardi.
Constrained Sampling and Counting

Abstract: Constrained sampling and counting are two fundamental problems in data analysis. In constrained sampling the task is to sample randomly, subject to a given weighting function, from the set of solutions to a set of given constraints. This problem has numerous applications, including probabilistic reasoning, machine learning, statistical physics, and constrained-random verification. A related problem is that of constrained counting, where the task is to count the total weight, subject to a given weighting function, of the set of solutions of the given constraints. This problem has applications in machine learning, probabilistic reasoning, and planning, among other areas. Both problems can be viewed as aspects of one of the most fundamental problems in artificial intelligence, which is to understand the structure of the solution space of a given set of constraints.
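The two problem definitions above can be made concrete with a brute-force sketch. It enumerates every assignment, so it takes time exponential in the number of variables and is only a reference for what is being computed; it does not implement the hashing-based techniques this project develops. The constraints and the weighting function are hypothetical.

```python
import random
from itertools import product


def solutions(num_vars, constraints):
    """Yield every Boolean assignment that satisfies all constraints."""
    for assignment in product([False, True], repeat=num_vars):
        if all(c(assignment) for c in constraints):
            yield assignment


def weighted_count(num_vars, constraints, weight):
    """Constrained counting: total weight of the solution set."""
    return sum(weight(s) for s in solutions(num_vars, constraints))


def weighted_sample(num_vars, constraints, weight, rng):
    """Constrained sampling: draw a solution with probability ~ weight."""
    sols = list(solutions(num_vars, constraints))
    return rng.choices(sols, weights=[weight(s) for s in sols], k=1)[0]


# Hypothetical constraints over x0, x1, x2:
#   (x0 or x1) and not (x1 and x2)
constraints = [
    lambda s: s[0] or s[1],
    lambda s: not (s[1] and s[2]),
]


def weight(s):
    # Hypothetical weighting: each variable set to True doubles the weight.
    return 2 ** sum(s)


count = weighted_count(3, constraints, weight)  # 12
sample = weighted_sample(3, constraints, weight, random.Random(0))
```

With a constant weight of 1, `weighted_count` reduces to plain model counting and `weighted_sample` to uniform sampling over the solutions.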

This project focuses on the development of new algorithmic techniques for constrained sampling and counting, based on universal hashing -- a classical algorithmic technique in theoretical computer science. Many of the ideas underlying the proposed approach go back to the 1980s, but they have never been reduced to practice. Recent progress in Boolean reasoning is enabling us to reduce these algorithmic ideas to practice, and obtain breakthrough results in constrained sampling and counting, providing a new algorithmic toolbox in machine learning, probabilistic reasoning, and the like.
Syed Ali Hasnain and Roozbeh Jafari.
Urban Heartbeat and Context Detection

Abstract: Sensors and actuators are finding their way into our lives and our surroundings at a very fast pace. These heterogeneous sensors deployed in any environment can prove to be useful in providing insights into the behavior and trends of the environment. Time series data from sensors can capture patterns that can lead to such insights and gain of useful knowledge. In this work, we capture a part of that knowledge and propose a novel concept called Urban Heartbeat using that data. We first develop techniques to find couplings between sensors using multiple operators. The concept of couplings between sensors serves as an abstraction layer and helps in fusing different modalities. Next, we define an algorithm that can be used to find quasi-periodic patterns from time series data that has spatiotemporal deviations. We then introduce the concept of Urban Heartbeat which uses data from heterogeneous sensors to find useful contextual information and trends in the environment. We define heartbeat as what is normal for the environment. Urban Heartbeat can be used to gain behavioral knowledge that is important in fields like health care, industrial applications and context aware computing applications such as resource optimization, user intent prediction and anomaly detection. Urban Heartbeat can be used not only to differentiate between normal and abnormal trends thereby giving us the ability to detect anomalies, but also in making predictions about user or environment behavior with meaningful earliness. We also show how we build heartbeat for a lab environment, and learn useful information about subjects and make predictions about their behavior in the lab.
Ryan Guerra and Clayton Shepard.
Argos and Its Many Eyes: Scalable Software-Defined Radio Platform for Commercial Massive-MIMO

Abstract: Wireless systems designers are increasingly turning to many-antenna or "Massive" MIMO systems to increase the achievable spatial diversity gain in next-generation cellular systems and support throughput of over 70 bits/sec/Hz. However, achieving the order-of-magnitude higher spectral efficiency offered by many-antenna systems requires integrating hundreds of independent radio chains into large, complex, and expensive base station systems.

Skylark Wireless has re-imagined the way that large radio systems are architected and has developed an agile and scalable Software-Defined Radio (SDR) platform based on many-antenna research pioneered at Rice University. By addressing bottlenecks to scalability at all layers of the network stack and developing new, low-cost SDR hardware, Skylark is enabling the next generation of commercial many-antenna wireless systems with hundreds of integrated radios.
Qingyue Liu and Peter Varman.
A Block Migration Model for NVM Wear Leveling

Abstract: Emerging NVM technologies have a limit on the number of writes that can be made to any cell, similar to the erasure limits in NAND Flash. This motivates the need for wear-leveling techniques to distribute the writes evenly among the cells. Unlike NAND Flash, cells in NVM can be rewritten without the need for erasing the entire containing block, allowing different solutions to the problem. Previous approaches for NVM propose low overhead techniques for local wear leveling within a region of memory based on a fixed rotation of words within the region. However, a systematic approach for global wear leveling across the device or multiple devices has not been addressed. In this work we propose a hierarchical block migration model, which combines existing local wear-leveling methods with global wear leveling, using lexicographic smoothness to direct the remapping of blocks. We also design a block remapping algorithm to minimize the number of block migrations while maximizing the evenness of the write distribution. Our results show that we can distribute the writes uniformly across the NVM with a limited number of block movements.
Ellis Giles, Peter Varman and Kshitij Doshi.
Lightweight Atomic Consistency for Storage Class Memory

Abstract: Non-volatile byte-addressable memory, commonly called Storage Class Memory, has the potential to revolutionize system architecture by providing instruction-grained direct access to vast amounts of persistent data. We describe both hardware-software and software-only methods to achieve lightweight atomic persistence to storage class memory in light of power or application failure.

Our software-only approach simultaneously propagates updates made within an atomic region along two paths: a foreground path through the cache hierarchy that is used for value communication within and between wraps, and an asynchronous background path to SCM to log the updates. By creating these two paths, we decouple value communication for transaction execution from logging (in SCM) for recovery.

We also present a hardware-software approach with a non-intrusive memory controller that uses backend operations for achieving lightweight failure atomicity. By moving synchronous persistent memory operations to the background, the performance overheads are minimized. Our solution avoids costly software intervention by decoupling isolation and concurrency-driven atomicity from failure atomicity and durability, and does not require changes to the front-end cache hierarchy. Two implementation alternatives – one using a hardware structure, and the other extending the memory controller with a firmware managed volatile space – are described. Our results show the performance is significantly better than traditional approaches.
Mark Garrett, Liwen Shih and Trinh Le.
Profiling a cellular automata based finite difference model on GPU accelerators for seismic elastodynamics

Abstract: A cellular automata (CA) based finite difference model has been developed to model seismic elastodynamics, and it has been shown to be equivalent to the centered-difference finite-difference (FD) method. The CA approach assigns stresses to the cell faces, while the FD approach assigns stresses collocated with displacement components at a single node. The CA method lends itself easily to distributed computing, which may yield performance advantages over traditional FD methods. Initial performance profiling has been inconclusive as to whether CA has a clear advantage over FD using a similar implementation. More research is needed to adapt the CA paradigm to a high-performance algorithm for distributed computing.
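For readers unfamiliar with the centered-difference FD method mentioned above, the following minimal sketch shows one time step for the 1D scalar wave equation. It is a deliberately simplified stand-in, not the elastodynamic CA or FD model of the poster; the function name fd_step and the fixed-end boundary treatment are assumptions.

```python
def fd_step(u_prev, u_curr, courant):
    """Advance the 1D scalar wave equation by one time step.

    Uses second-order centered differences in time and space:
        u_next[i] = 2*u_curr[i] - u_prev[i]
                    + courant**2 * (u_curr[i+1] - 2*u_curr[i] + u_curr[i-1])
    where courant = c * dt / dx. End points are held at zero (fixed ends).
    """
    n = len(u_curr)
    u_next = [0.0] * n
    c2 = courant * courant
    for i in range(1, n - 1):
        laplacian = u_curr[i + 1] - 2 * u_curr[i] + u_curr[i - 1]
        u_next[i] = 2 * u_curr[i] - u_prev[i] + c2 * laplacian
    return u_next


# A unit displacement at the center node, initially at rest.
initial = [0, 0, 1, 0, 0]
print(fd_step(initial, initial, courant=0.5))
```

Each node is updated from its immediate neighbors only, which is the locality that makes both the FD and CA formulations natural candidates for GPU or distributed execution.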
Venkatesh Gangineni and Liwen Shih.
Implementation of Associative Memory using Quantum Calculation
Abstract: In this paper an attempt is made to implement associative memory with quantum neural networks. Quantum algorithms for the storage of patterns and the retrieval of information are presented. It is demonstrated that the quantum model offers an exponential improvement over the conventional model. The idea of quantum computation is based on quantum mechanics. It was first introduced by the Nobel Prize-winning physicist Richard Feynman in 1982. For some time, quantum computation was mainly of theoretical interest only. More recently, the breakthrough made by Peter Shor brought quantum computation to everyone's attention. Using Shor's algorithm, a quantum computer would be able to break encryption codes much more quickly than any classical computer could. This paper combines quantum computation with classical artificial neural network theory to obtain a self-organizing quantum neural network that can perform pattern classification through a quantum competitive process. It does not need to pre-store given patterns, yet it can classify input patterns with a higher classification speed than classical neural networks. The quantum neural network is based on quantum mechanics.
Himanshu Aggrawal and Aydin Babakhani.
An Ultra-Wideband Impulse Receiver for sub-100fsec Time-Transfer and sub-30μm Localization
Abstract: An ultra-wideband impulse receiver capable of detecting sub-200psec pulses is presented. The chip detects a specific zero-crossing of an incoming pulse and mitigates the undesired effects of ringing. The time detection sensitivity of the chip is limited by the jitter of the incoming pulse rather than the pulse width. A mean RMS jitter of 94fsec is recorded, which translates to a localization accuracy of sub-30μm. The chip is fabricated in an IBM 130nm SiGe BiCMOS process technology.
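The link between timing jitter and localization accuracy can be checked with a short calculation, assuming one-way propagation at the vacuum speed of light (an assumption; the abstract does not state the propagation model). The function name timing_to_distance_um is illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition of the metre


def timing_to_distance_um(timing_uncertainty_s):
    """Distance light travels during the given time, in micrometres."""
    return SPEED_OF_LIGHT_M_PER_S * timing_uncertainty_s * 1e6


# 94 fsec of RMS jitter corresponds to roughly 28 micrometres of path length.
print(f"{timing_to_distance_um(94e-15):.1f} um")
```

Under this assumption the 94fsec figure maps to a distance just under 30μm, consistent with the sub-30μm claim.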
Babak Jamali and Aydin Babakhani.
Sub-picosecond Wireless Synchronization Based on a Millimeter-Wave Impulse Receiver

Abstract: This work presents a wireless synchronization receiver using sub-8psec pulses. A novel self-mixing technique is introduced to detect low-power picosecond impulses and extract the repetition rate with a low timing jitter. The chip is fabricated in a 0.13μm SiGe BiCMOS process and achieves a time transfer accuracy of 376fsec. The receiver, which is integrated with a broadband on-chip antenna, successfully detects a picosecond pulse train with a 3.1GHz repetition rate and generates an output locked to this rate with a phase noise of -89 dBc/Hz at 100 Hz frequency offset. The chip consumes 146 mW from a 2.5V supply and occupies an area of 1.89 mm².
Robert Likamwa and Lin Zhong.
Rethinking the sensing system stack for Continuous Mobile Vision
Abstract: The future of computing is in allowing our devices to see what we see: wearable systems that continuously interpret vision data for real-time analytics. Today's system software and imaging hardware are ill-suited for this goal of "continuous mobile vision." Current systems -- highly optimized for photography -- fail to achieve sufficient energy efficiency for wearable batteries to last an entire day.

This poster provides a rethinking of the vision system stack that includes application frameworks, operating system and sensor hardware to improve efficiency by two orders of magnitude. This cross-layer rethinking contributes: (1) a split-process application framework that eliminates redundancy in data movement and processing across multiple concurrent applications, (2) operating system optimizations for energy-proportional image capture, and (3) a mixed-signal image sensor architecture that processes data in the analog domain to eliminate the efficiency bottleneck of analog-digital conversion. Together, the improvements to system stack efficiency will open the door to integrating our devices with our real-world environments and, ultimately, our own lives.
Hamed Rahmani and Aydin Babakhani.
Fully-Integrated Energy Harvesting System for Neural Recording in 180nm SOI CMOS
Abstract: Rising demand for continuous monitoring of the human body and for health care devices in recent years has resulted in the development of implantable biopotential sensors. Neural recording is one of the most popular areas, and much effort has gone into designing implantable neural recording sensors. Infection risks and mobility concerns constrain the implanted sensors to operate without any transcutaneous wire connection, which raises serious challenges for powering and data telemetry. A lack of information on many physiological processes, such as human speech production dynamics, has hindered research on the basis of these processes, and this stems mainly from the limitations of available measurement systems. We present a fully-integrated implantable energy harvesting system for neural recording and data telemetry in CMOS 180nm technology. In addition, we show a novel solution for increasing the data rate of implanted medical systems through analysis of the wireless power and data link.
Min Hong Yun, Songtao He and Lin Zhong.
Forget Synchrony for Low Latency
Abstract: Modern systems have an interactive latency of over 50 ms. While humans generally cannot perceive a latency of 50 ms, touchscreen latency manifests spatially. For example, when a user draws a line on a screen at 50 cm/sec, a latency of 10 ms would lead to a visible, annoying gap of 0.5 cm between the fingertip and the line. In this poster, we report a promising design, Asynchrony, that reduces the mean latency by 48%.
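The spatial effect of latency follows from a simple model, gap = finger speed × latency, which is an assumption here rather than something the abstract spells out. The function name trailing_gap_cm is hypothetical.

```python
def trailing_gap_cm(speed_cm_per_s, latency_ms):
    """Distance the finger moves ahead of the drawn line during the latency."""
    return speed_cm_per_s * latency_ms / 1000.0


# Gap at a drawing speed of 50 cm/sec for two latencies mentioned in the abstract.
for latency_ms in (10, 50):
    print(f"{latency_ms} ms -> {trailing_gap_cm(50, latency_ms)} cm")
```

Because the gap scales linearly with latency under this model, the reported 48% reduction in mean latency would shrink the visible gap by the same proportion at a given drawing speed.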
Peiyu Chen and Aydin Babakhani.
A 30GHz Impulse Radiator with On-Chip Antennas for High-Resolution 3D Imaging
Abstract: This work reports a 30-GHz impulse radiator utilizing an injection-locked asymmetric cross-coupled voltage-controlled oscillator (VCO) with on-chip bow-tie antennas. The impulse radiator converts a digital trigger signal to a radiated impulse with a variable pulse width down to 60psec and a peak EIRP of 15.2dBm without using any lens. Coherent spatial pulse combining is demonstrated by using two widely spaced radiators. A timing jitter of 216fsec is measured for the combined signal. The impulse radiator is capable of producing 3D images with a depth resolution of 33μm at a target distance of 25cm in air. The chip is implemented in a 0.13μm SiGe BiCMOS process technology. The total die area is 2.85 mm² with a maximum power consumption of 106mW.