
1. Introduction

Recently, computing systems in space and other extreme environments with high-energy radiation (e.g., high-energy physics, high altitudes) have been turning to field-programmable gate arrays (FPGAs) to meet performance and power constraints not met by other computing technologies [1]. One challenge for FPGAs in these environments is susceptibility to radiation-induced single-event upsets (SEUs), which can alter the functionality of a design by changing bits in memories and flip-flops. Although radiation-hardened FPGAs exist, some of those devices are still susceptible to SEUs and commonly have prohibitive costs compared to commercial-off-the-shelf (COTS) devices [2].

To mitigate these issues, designers often use triple-modular redundancy (TMR) on COTS FPGAs. TMR is a well-known form of hardware redundancy that replicates a design into three independent modules with a voter at the outputs to detect and correct errors. Research has shown that TMR with frequent configuration scrubbing (i.e., reconfiguring faulty resources) provides an effective level of fault tolerance for many FPGA-based space applications [3].

One key disadvantage of TMR is the 3x resource overhead, which often requires large FPGAs that may exceed cost or power constraints for embedded systems. Although resource sharing is a common strategy for reducing this overhead, designers often apply TMR to register-transfer-level (RTL) code, where the productivity challenge of exploring resource sharing and scheduling options is often impractical. Furthermore, RTL code is not available for the common case of using encrypted or presynthesized IP cores, making such exploration impossible.

In this paper, we automate this exploration during high-level synthesis (HLS) by integrating resource sharing and TMR into a scheduling and binding heuristic called the Force-Directed Fault-Tolerance-Aware (FD-FTA) heuristic that provides attractive tradeoffs between performance, area, and redundancy. More specifically, the heuristic explores the impact of varying hardware redundancy on the capability to correct an error, which we measure as an error-correction percentage. Our heuristic is motivated by the observation that, for many situations, error correction is not as critical as error detection. By allowing a designer to specify an error-correction percentage constraint that is appropriate for their application, our heuristic can explore numerous options between performance and area that would be impractical to explore manually.

Although other FPGA work has also approached fault tolerance through high-level synthesis, that earlier work mainly focused on different fault models or reliability goals. For example, Golshan et al. [4] introduced an HLS approach for minimizing the impact of SEUs on the configuration stream, which is complementary to our approach. Shastri et al. [5] presented a conceptually similar HLS approach, but that work focused on minimizing the number of coarse-grained resources without considering their FPGA implementation costs, while also suffering from long execution times and the need for manual parameter tuning. This paper presents an extension of [5] that addresses previous limitations with an automated approach, showing resource savings of up to 74% compared to the earlier approach, while also reducing heuristic execution time by 96x on average.

Similarly, high-level synthesis for ASICs has introduced conceptually similar techniques [6], but whereas ASIC approaches must deal with transient errors, FPGAs must pay special attention to SEUs in configuration memory that remain until scrubbing or reconfiguration (referred to as semipermanent errors for simplicity). Due to these semipermanent errors, FPGA approaches require significantly different HLS strategies.

Compared to the common strategy of applying TMR to existing RTL code, our heuristic has average resource savings of 34% and displays significant improvements as the latency constraint and the benchmark size increase. For a latency constraint of 2x the minimum-possible latency, our heuristic shows average resource savings of 47%, with a maximum of 80% resource savings in the largest benchmark.

The paper is organized as follows. Section 2 describes related work on studies pertaining to areas such as FPGA reliability and fault-tolerant HLS. Section 3 defines the problem and the assumptions of our fault model. Section 4 describes the approach and implementation of the FD-FTA heuristic. Section 5 explains the experiments used to evaluate the FD-FTA heuristic, and Section 6 presents conclusions from the study.

2. Related Work

With growing interest in FPGAs operating in extreme environments, especially within space systems, a number of studies have been done on assessing FPGA reliability in these environments. Some of these studies rely on analyzing FPGA reliability through models. For example, Ostler et al. [7] investigated the viability of SRAM-based FPGAs in Earth-orbit environments by presenting a reliability model for estimating the mean time to failure (MTTF) of SRAM FPGA designs in specific orbits and orbital conditions. Similarly, Heron et al. [8] introduced an FPGA reliability model and presented a case study of its application on an XC2V3000 under a number of soft IP cores and benchmarks. In addition to reliability models, a number of studies have been done on emulating and simulating faults in FPGAs (e.g., [9, 10]). Rather than relying on costly testing in a radiation beam, such approaches facilitate cost-effective testing of fault-tolerant designs. Although many studies analyze SRAM-based technologies, other studies have also considered antifuse FPGAs and flash FPGAs. For example, McCollum compares the reliability of antifuse FPGAs to ASICs [11]. Wirthlin highlights the effects of radiation in all three types of FPGAs and explores the challenges of deploying FPGAs in extreme environments, such as in space systems and in high-energy physics experiments [2]. In this paper, we complement these earlier studies by presenting an HLS heuristic that transparently adds redundancy to improve the reliability of SRAM-based FPGA designs. As described in Section 3, the fault model for our heuristic makes several assumptions based on the findings of these earlier works.

In our approach, we leverage TMR to apply fault tolerance to FPGA designs. TMR is a well-studied fault-tolerance strategy in FPGAs that has been examined in different use cases and under additional modifications. Morgan et al. [12], for example, compared TMR with several alternative fault-tolerant designs in FPGAs and showed that TMR was the most cost-effective technique for increasing reliability in a LUT-based architecture. Other works have introduced modifications that are complementary to our TMR design and may be incorporated into our heuristic. For example, Bolchini et al. [13] present a reliability scheme that uses TMR for fault masking and partial reconfiguration for reconfiguring erroneous segments of the design. Our work focuses on automatically applying the TMR architecture from high-level synthesis and could incorporate partial reconfiguration regions for enhanced reliability at the cost of producing device-specific architectures.

Work by Johnson and Wirthlin [14] approaches the issue of voter placement for FPGA designs using TMR and compares three algorithms for automated voter insertion based on strongly connected component (SCC) decomposition. Compared to the naive approach of placing voters after every flip-flop, these algorithms are designed to insert fewer voters through feedback analysis. Since FPGA TMR designs involve both applying redundancy and inserting voters, our paper focuses specifically on the problem of automated TMR and can potentially be used alongside these voter insertion algorithms.

Although scheduling and binding in high-level synthesis is a well-studied problem [15-17], many of those studies do not consider fault tolerance or error-correction percentage. More recent works have treated reliability as a primary concern in the high-level synthesis process but focus on different reliability goals in ASIC designs. For example, Tosun et al. [18] introduce a reliability-centric HLS approach that focuses on maximizing reliability under performance and area constraints using components with different reliability characterizations. Our work, in contrast, focuses on minimizing area under performance and reliability constraints using components with the same reliability characterizations. Antola et al. [6] present an HLS heuristic that applies reliability by selectively replicating parts of the datapath for self-checking as an error-detection measure. Our heuristic differs by applying reliability through TMR with resource sharing. Our heuristic additionally varies the reliability through an error-correction percentage constraint that enables selective resource sharing between TMR modules.

Other work has notably used HLS to apply reliability through means other than replication. For example, Chen et al. [19] introduce an HLS approach that uses both TMR and a gate-level hardening technique called gate sizing on different resources to minimize both soft error rate and area overhead. In another example, Hammouda et al. [20] propose a design flow that automatically generates on-chip monitors to enable runtime checking of control flow and I/O timing behavior errors in HLS hardware accelerators. Our heuristic focuses specifically on scheduling and binding and may be complementary to several of these approaches. Although our heuristic could also potentially be applied to ASIC design with modification, we have tailored scheduling and binding to the FPGA architecture, which notably has different HLS challenges compared to ASICs [21], and have compared its effectiveness with other FPGA-specific approaches. Our heuristic is especially applicable to FPGAs given the rise of FPGAs in space missions, which commonly implement fault-tolerant logic through TMR and configuration scrubbing [3].

Compared to ASICs, far fewer reliability-centric high-level synthesis studies have targeted FPGAs. Golshan et al. [4] introduced a TMR-based HLS process for datapath synthesis, placement, and routing that targets SRAM-based FPGAs. Although conceptually similar to our work, that study focused on mitigating the impact of SEUs in the configuration bitstreams by enforcing self-containment of SEUs within a TMR module and by minimizing potential SEU-induced bridging faults that connect two separate nets in a routing resource. By contrast, our work focuses on minimizing the resources needed for TMR and can intentionally neglect self-containment of SEUs within a TMR module for additional resource savings. Dos Santos et al. [22] investigated another TMR-based HLS design flow on SRAM-based FPGAs and compared the reliability of these designs with their respective unhardened equivalents. In contrast, our heuristic focuses on making tradeoffs between redundancy and area, which can include full TMR. Shastri et al. [5] introduced a TMR-based HLS heuristic that focused solely on minimizing coarse-grained resources under latency and redundancy constraints. By contrast, our work considers each resource's implementation costs and uses improved scheduling and binding algorithms for better scalability with larger benchmarks and increased latency constraints. At a 2x normalized latency constraint, our heuristic provides average resource savings of 34% compared to TMR applied to existing RTL and achieves resource savings of 74% relative to the approach of [5] on the largest benchmark.

3. Problem Definition

Although there are different optimization goals that could be explored while varying the amount of error correction, in this paper we focus on the problem of minimum-resource, latency- and error-constrained scheduling and binding, which for brevity we simply refer to as the problem. To explain the problem, we introduce the following terms:

(i) Fault: a resource with an SEU-induced error

(ii) Module: one instance of the dataflow graph (DFG), analogous to a module in TMR

(iii) Error: any fault where one or more of the three modules output an incorrect value

(iv) Undetectable error: any fault where all three modules output the same incorrect value

(v) Detectable error: any fault where one or more modules output different values

(vi) Uncorrectable error: any fault where two or more modules output incorrect values

(vii) Correctable error: any fault where two or more modules output a correct value

(viii) Error-correction % (EC%): the percentage of total possible errors that are correctable by a given solution.

Figure 1 illustrates several example error types, where all operations with the same color are bound to a single resource. If the black resource experiences a fault, this binding results in an undetectable error because all three modules will produce the same incorrect value. If there is a fault in the medium-gray resource, this binding causes an uncorrectable error because two modules (2 and 3) will produce incorrect values. A fault in the light-gray resource results in a correctable error because modules 2 and 3 both produce correct outputs. Note that both gray resources result in detectable errors because at least one module outputs a different value than the other modules. We consider the error correction to be 100% if all errors can be classified in this way as correctable errors, although failures that occur in other parts of the system may still cause incorrect outputs.
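This classification can be expressed as a short Python sketch. This is our illustration, not the paper's implementation: the function names are hypothetical, the model assumes a single fault at a time, and operations sharing a resource are assumed to produce identical incorrect values under a fault, as in Figure 1.

```python
def classify_faults(binding):
    """Classify the error caused by a fault in each resource.

    binding: dict resource -> set of module ids (1..3) whose operations
    are bound to that resource.
    """
    classes = {}
    for res, modules in binding.items():
        if len(modules) == 3:
            classes[res] = "undetectable"   # all modules agree on a wrong value
        elif len(modules) == 2:
            classes[res] = "uncorrectable"  # detectable: one module disagrees
        else:
            classes[res] = "correctable"    # two good modules outvote the fault
    return classes

def error_correction_pct(binding):
    """EC%: fraction of possible faults whose errors are correctable."""
    classes = classify_faults(binding)
    return 100.0 * sum(c == "correctable" for c in classes.values()) / len(classes)

# Figure 1's binding: black serves all three modules, medium-gray serves
# modules 2 and 3, and light-gray serves only module 1.
binding = {"black": {1, 2, 3}, "gray": {2, 3}, "light": {1}}
print(classify_faults(binding))
print(error_correction_pct(binding))
```

With one of three resources correctable, this binding has an EC% of roughly 33%.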

The input to the problem is a dataflow graph (DFG) D, a latency constraint L expressed in number of cycles, and an error constraint E specified as the minimum acceptable EC%. The output is a solution X, which is a combination of a schedule S and binding B for a redundant version of D. Given these inputs and outputs, we define the problem as follows:

Minimize NumResources(X)

Subject to Latency(X.S) ≤ L, ErrorCorrection(X.B) ≥ E, ErrorDetection(X.B) = 100%. (1)

In other words, the goal of the problem is to find a schedule and binding that minimizes the number of required resources, where the schedule does not exceed the latency constraint L, the binding does not violate the error constraint E, and all errors are detectable. We provide an informal proof that this problem is NP-hard as follows. If we remove both error constraints from the problem definition, the problem is equivalent to minimum-latency and resource-constrained scheduling followed by binding, which are both NP-hard problems [15]. The correctable and detectable error constraints only make the problem harder by expanding the solution space with replicated versions of the input.

Note that a more complete definition of this problem would include other FPGA resources (e.g., DSP units, block RAM), as opposed to solely using LUTs. However, because there is no effective relative cost metric for different FPGA resources, comparison between solutions with different types of FPGA resources is difficult. For example, it is not clear whether a solution with 100 DSPs and 10,000 LUTs is preferable to a solution with 10 DSPs and 100,000 LUTs. An alternative to this approach would be minimizing one resource while using constraints on the other resources. This approach, however, may exclude solutions that minimize multiple selected resources. For ease of explanation and comparison, the presented heuristic implements coarse-grained resources using only LUTs and focuses on minimizing the design's overall LUT count.

We assume that scrubbing occurs frequently enough so that there cannot be more than one faulty resource at a time, which is often true due to the low frequency of SEUs in many contexts. For example, in the Cibola Flight Experiment [23], the experiment's Virtex FPGAs experienced an average SEU rate of 3.51 SEUs/day compared to their scrubbing cycle of 180 ms. With this assumption, based on our definitions, the total number of possible faults (and errors) is equal to the total number of resources used by the solution. Due to the likely use of SRAM-based FPGAs, we assume that all faults persist until scrubbing removes the fault. This contrasts with earlier work that focuses on transient faults [24-26]. We assume the presence of an implicit voter at the output of the modules, potentially using strategies from [14].
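To see why the single-fault assumption is reasonable at these rates, the following sketch estimates the chance of two or more SEUs landing within one 180 ms scrub cycle at the Cibola rate of 3.51 SEUs/day. The Poisson arrival model is our illustration; the paper itself only cites the raw rates.

```python
import math

# Assumption (ours): SEU arrivals follow a Poisson process at the
# observed Cibola rate, so the count per 180 ms scrub window is
# Poisson-distributed with mean lam.
rate_per_sec = 3.51 / 86400                # SEUs per second
lam = rate_per_sec * 0.180                 # expected SEUs per scrub window
p_multi = 1 - math.exp(-lam) * (1 + lam)   # P(>= 2 SEUs in one window)
print(f"lambda = {lam:.2e}, P(>=2 faults per scrub cycle) = {p_multi:.2e}")
# lambda ≈ 7.3e-06, so P(>= 2) ≈ lambda^2 / 2 ≈ 2.7e-11
```

At roughly one chance in 10^10 per window, multiple simultaneous faults are negligible, supporting the single-fault model.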

One potential challenge with error correction is the possibility of two modules producing incorrect outputs that have the same value, which we refer to as aliased errors. Although we could extend the problem definition to require no instances of aliased errors, this extension is not a requirement for many usage cases (e.g., [26]). In addition, by treating aliased errors as uncorrectable errors, good solutions will naturally tend to favor bindings that have few aliased errors. To further minimize aliased errors, our presented heuristic favors solutions with the highest EC% when there are multiple solutions that meet the error constraint with equivalent resources.

4. Force-Directed Fault-Tolerance-Aware (FD-FTA) Heuristic

To solve the problem of minimum-resource, latency- and error-constrained scheduling and binding, we introduce the Force-Directed Fault-Tolerance-Aware (FD-FTA) heuristic, which performs scheduling and binding during high-level synthesis while simultaneously applying TMR and resource sharing to reduce overhead. By using an error-correction constraint combined with a latency constraint, the heuristic explores various tradeoffs between area, performance, and redundancy. As described in Algorithm 1, the heuristic first triplicates the DFG, schedules the triplicated DFG under a given latency constraint, and then binds the scheduled operations under a given EC% constraint.

The heuristic is divided into two key parts: scheduling and binding. We discuss the scheduling algorithm in Section 4.1 and the binding algorithm in Section 4.2.

4.1. Scheduling. In high-level synthesis, scheduling is the process of assigning each operation to a specific cycle or control state. The resulting schedule for the entire application is then implemented using a finite-state machine. In this section, we discuss the limitations of previous fault-tolerance-aware schedulers (Section 4.1.1) and then present a heuristic that adapts Force-Directed Scheduling (FDS) [27] to address those limitations (Section 4.1.2).

4.1.1. Previous Fault-Tolerance-Aware Scheduling. The previous work on fault-tolerance-aware scheduling from [5] used Random Nonzero-Slack List Scheduling. As a variant of minimum-resource, latency-constrained (MR-LC) list scheduling, this scheduling algorithm is a greedy algorithm that makes scheduling decisions on a cycle-by-cycle basis based on a resource bound and operation slack (i.e., the difference between the latest possible start cycle and the cycle under consideration). To minimize resource usage, the algorithm iterates over each cycle and schedules operations ordered from lowest to highest slack up to the resource bound. If there are still zero-slack operations in a particular cycle once the resource bound is reached, those operations are scheduled and the resource bound is updated to match this increased resource usage. This process continues until all of the operations are scheduled. Unlike MR-LC list scheduling, Random Nonzero-Slack List Scheduling schedules nonzero-slack operations with a 50% probability up to the resource bound. With this randomness, the scheduling algorithm is intended to produce different schedules for each TMR module, which increases the likelihood of intermodule operation bindings following scheduling. To escape local optima, this previous work used the scheduling algorithm in a multipass approach that would collect the best result from a series, or phase, of scheduling and binding runs. The heuristic would then continue to run these phases until the best result of a phase showed no significant improvement compared to the previous phase's result.
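A minimal Python sketch of this scheduling idea may help; it is our reconstruction from the description above, not the code of [5]. The function name is hypothetical, and we assume unit-latency operations of a single resource type.

```python
import random

def random_nonzero_slack_list_schedule(ops, preds, latency, seed=0):
    """Sketch of Random Nonzero-Slack List Scheduling (after [5]).

    ops: operation ids; preds: dict op -> set of predecessor ops.
    Returns (schedule, resource bound) for unit-latency operations.
    """
    rng = random.Random(seed)
    succs = {o: {s for s in ops if o in preds[s]} for o in ops}
    asap, alap = {}, {}
    def asap_of(o):          # earliest feasible cycle (longest path from inputs)
        if o not in asap:
            asap[o] = 1 + max((asap_of(p) for p in preds[o]), default=0)
        return asap[o]
    def alap_of(o):          # latest feasible cycle under the latency constraint
        if o not in alap:
            alap[o] = min((alap_of(s) - 1 for s in succs[o]), default=latency)
        return alap[o]
    schedule, bound = {}, 1
    for cycle in range(1, latency + 1):
        ready = [o for o in ops if o not in schedule
                 and all(p in schedule and schedule[p] < cycle for p in preds[o])]
        ready.sort(key=lambda o: alap_of(o) - cycle)   # lowest slack first
        used = 0
        for o in ready:
            if alap_of(o) - cycle == 0:
                schedule[o] = cycle        # zero slack: must schedule now,
                used += 1                  # raising the bound if needed
                bound = max(bound, used)
            elif used < bound and rng.random() < 0.5:
                schedule[o] = cycle        # nonzero slack: 50% probability
                used += 1
    return schedule, bound

# Example: diamond DFG A -> {B, C} -> D under a latency constraint of 4.
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
schedule, bound = random_nonzero_slack_list_schedule(["A", "B", "C", "D"], preds, 4)
```

Different seeds yield different legal schedules, which is exactly the property [5] exploits to diversify the three TMR modules.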

Algorithm 1: Force-Directed Fault-Tolerance-Aware (FD-FTA) heuristic.
Input: Dataflow graph D, latency constraint L, error-correction constraint E
Output: Solution X that minimizes resources
begin
    Step 1. Triplicate D into D_FT;
    Step 2. Schedule the operators of D_FT with constraint L to obtain S;
    Step 3. Bind the operators of S with constraint E to obtain B;
    Step 4. Return solution X from S and B;
end

Algorithm 2: Force-Directed Scheduling.
Input: Dataflow graph representation of the design
Output: Operator assignments to cycles
while there are unscheduled operations do
    Step 1. Evaluate time frames;
    Step 2. Update distribution graphs;
    Step 3. Calculate self-forces for every feasible cycle;
    Step 4. Calculate total force from self-force, predecessor force, and successor force;
    Step 5. Schedule operation with lowest force;
end

There are several key disadvantages of using the previous heuristic with this scheduling algorithm that we address in this paper. One primary disadvantage is the heuristic's lengthy execution time from its multipass approach, which relies on randomness in the schedule to find an improved solution. Using a user-defined percentage, the heuristic continues to another phase with double the number of scheduling and binding runs if the current phase's result is not significantly better than the previous phase's. The execution time can therefore be largely influenced by the randomness and the data dependencies between operations. In addition to long execution times, the heuristic requires fine-tuning of starting parameters to avoid premature exiting, which worsens productivity and can be error prone. The heuristic also has the disadvantage of exploring a restricted solution space by favoring the scheduling of operations in earlier cycles. This scheduling tendency results from the use of a fixed 50% scheduling probability for nonzero-slack operations in each cycle. By contrast, our proposed heuristic performs the scheduling and binding process once, using a more complex force-directed scheduler, which generally reduces the execution time and improves quality without the need for manually tuning heuristic parameters.

In terms of complexity, both heuristics consist of scheduling followed by binding. As such, the proposed heuristic consists of O(cn^2) complexity for the force-directed scheduler [27] described in the next subsection and O(cn^2) for the binder described in Section 4.2 [28]. The overall complexity is therefore O(cn^2), where c is the latency constraint and n is the number of operations. In contrast, the previous heuristic consists of O(n) complexity for its variant list scheduler and O(n^2) for its clique-partitioning binder. Since the previous heuristic will generally limit the number of phases or total number of scheduling and binding runs in its multipass approach, the overall complexity is O(pn^2), where p is a user-defined limit on total iterations and n is the number of operations. The proposed heuristic therefore has a lower complexity than the previous heuristic when c < p, which is commonly true, as the user will generally set p such that p >> c to avoid premature exiting.

Additionally, the previous heuristic suffers from the same disadvantages as general MR-LC list scheduling. Notably, it is a local scheduler that makes decisions on a cycle-by-cycle basis using operation slack as a priority function. Such an approach has a tendency towards locally optimal operation assignments that often leads to suboptimal solutions when the latency constraint is significantly larger than the minimum-possible latency. Similarly, the heuristic's use of slack may produce suboptimal results in scenarios where operation mobility underestimates resource utilization. By contrast, Force-Directed Scheduling is a global stepwise-refinement algorithm that selects operation assignments from any cycle based on their impact on operation concurrency.

4.1.2. Force-Directed Scheduling. Force-Directed Scheduling is a latency-constrained scheduling algorithm that focuses on reducing the number of functional units by balancing the concurrency of the operations assigned to the units. Algorithm 2 presents an overview of the algorithm. A more detailed description can be found in [27].

As demonstrated in Algorithm 2, Force-Directed Scheduling balances operation concurrency by using time frames, distribution graphs, and force values. Time frames refer to the possible cycles to which an operation may be assigned, such that the resulting schedule does not violate the latency constraint. Intuitively, the assignment of an operation to a specific cycle may impact the time frames of other operations if there are data dependencies. Distribution graphs refer to the probability that a given operation type is assigned to a specific cycle, for each cycle within the latency constraint. For each cycle, the algorithm assigns probabilities to the distribution graphs by finding all operations of the same type that can be scheduled at a given cycle and then summing their individual probabilities.

To better illustrate these structures, Figure 2 shows the time frames and distribution graphs of a DFG that consists of two operations and is subject to a latency constraint of 4 cycles. In Figure 2(a), the DFG is shown scheduled with an as-soon-as-possible (ASAP) schedule and an as-late-as-possible (ALAP) schedule. The ASAP schedule focuses on scheduling an operation at the earliest possible cycle given data dependencies. Notice how operation B cannot be scheduled on cycle 1 because it is dependent on operation A. Similarly, the ALAP schedule focuses on scheduling an operation at the latest possible cycle. Force-Directed Scheduling uses these two schedules to form the time frame of each operation, which represents the cycle bounds of an operation assignment.

Figure 2(b) shows the distribution graphs of the DFG before any operation has been scheduled. Assuming operations A and B are of the same type (or can share a resource), each operation has a uniform 33% chance of being scheduled in any cycle of its 3-cycle time frame. The distribution graph therefore shows a 33% probability of that operation type being scheduled in cycles 1 and 4 and a 66% probability for the cycles where the time frames of the operations overlap.
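These time frames and distribution graphs can be reproduced with a short Python sketch. This is our illustration with hypothetical function names; operations are assumed to have unit latency. For the two-operation DFG of Figure 2 under a latency constraint of 4, it yields the 33%/66% values described above.

```python
def time_frames(ops, preds, latency):
    """ASAP/ALAP time frames for unit-latency operations.
    ops: list of op ids; preds: dict op -> set of predecessors."""
    succs = {o: {s for s in ops if o in preds[s]} for o in ops}
    asap, alap = {}, {}
    def _asap(o):
        if o not in asap:
            asap[o] = 1 + max((_asap(p) for p in preds[o]), default=0)
        return asap[o]
    def _alap(o):
        if o not in alap:
            alap[o] = min((_alap(s) - 1 for s in succs[o]), default=latency)
        return alap[o]
    return {o: (_asap(o), _alap(o)) for o in ops}

def distribution_graph(frames, latency):
    """Sum each operation's uniform probability over its time frame."""
    dg = [0.0] * (latency + 1)             # dg[c] holds cycles 1..latency
    for lo, hi in frames.values():
        for c in range(lo, hi + 1):
            dg[c] += 1.0 / (hi - lo + 1)
    return dg[1:]

# The two-operation DFG of Figure 2: A feeds B, latency constraint of 4.
frames = time_frames(["A", "B"], {"A": set(), "B": {"A"}}, 4)
print(frames)                              # {'A': (1, 3), 'B': (2, 4)}
print(distribution_graph(frames, 4))       # ≈ [0.33, 0.67, 0.67, 0.33]
```

The overlap of the two 3-cycle frames in cycles 2 and 3 is what produces the doubled probability there.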

Force-Directed Scheduling abstracts each distribution graph as a series of springs (one for each cycle) connected to each operation of the DFG. Therefore, any spring displacement exerts a "force" on each operation. In this abstraction, the spring's strength is the distribution graph's value in a particular cycle, and the displacement is a change in probability in a cycle due to a tentative assignment. A tentative operation assignment will therefore cause displacements in the cycle it is assigned to, whose probability is now 100%, and in the cycles it can no longer be assigned to, whose probabilities are now 0%. Using this abstraction, each operation assignment has an associated total force value that reflects the assignment's impact on the time frames and distribution graphs of the operation and its preceding and succeeding operations. While there are unscheduled operations, the algorithm schedules the operation assignment with the lowest force value, reflecting the lowest impact on operation concurrency.

Figure 3 depicts several examples of how an operation assignment may impact the time frames of other operations using the DFG from Figure 2. In each of these examples, operation A is scheduled in a specific cycle, which limits the operation's time frame to that cycle. In the first diagram, operation A is scheduled in cycle 1 and has no impact on operation B's time frame. In contrast, scheduling operation A in cycle 2 or 3 reduces operation B's time frame, since operation A must be scheduled in a cycle before operation B.

Figure 4 depicts the corresponding distribution graph after such cycle assignments. Notice that an operation assignment in a particular cycle changes the probability of that operation type being scheduled to 100%. In (a), operation A contributes a 100% probability in cycle 1, while operation B contributes a uniform 33% probability over its time frame from cycle 2 to cycle 4. In (b), operation A still contributes a 100% probability in its scheduled cycle, while operation B contributes a uniform 50% probability over its reduced time frame. Noticeably, in (c), operation A's assignment in cycle 3 will effectively force operation B's later assignment to cycle 4 due to data dependencies. It should be noted that a distribution graph may have probabilities over 100%, which can represent multiple operation assignments of that type in the same cycle. The main goal of Force-Directed Scheduling is to balance operation concurrency such that the distribution graph for each operation type will have a relatively uniform probability over each cycle after all operations are scheduled. Using the equations for force, the assignment of operation A to cycle 1 would have the lowest force value of the three possible assignments since it balances the operation probability across the most cycles.
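The force computation for this example can be sketched as follows. This is our illustration: for brevity it computes only the self-force (spring strength times probability displacement over the operation's time frame), omitting the predecessor and successor forces of [27], which does not change the winner in this particular example.

```python
def self_force(frame, dg, cycle):
    """Self-force of tentatively assigning an operation to `cycle`.

    frame: (lo, hi) time frame; dg: distribution-graph values for cycles
    1..L. The force is the sum over the frame of DG(i) * delta_p(i),
    where delta_p is the change in the operation's probability at cycle i.
    """
    lo, hi = frame
    p_old = 1.0 / (hi - lo + 1)            # uniform probability before assignment
    force = 0.0
    for c in range(lo, hi + 1):
        p_new = 1.0 if c == cycle else 0.0
        force += dg[c - 1] * (p_new - p_old)
    return force

# Figure 2's initial distribution graph: [1/3, 2/3, 2/3, 1/3] for cycles 1..4.
dg = [1 / 3, 2 / 3, 2 / 3, 1 / 3]
forces = {c: self_force((1, 3), dg, c) for c in (1, 2, 3)}
print(forces)   # cycle 1 gives the most negative (lowest) force
```

Assigning A to cycle 1 moves probability into the least-loaded cycle, yielding a force of -2/9 versus +1/9 for cycles 2 and 3, so FDS prefers cycle 1, matching the discussion above.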

In [27], the authors describe an adjustment for optimization in scenarios with resources with different realization costs. This adjustment involves scaling the force values by a cost factor reflecting the relative realization costs. In our proposed heuristic, we scale the force values by the LUT requirements of the resource.

4.2. Binding. In high-level synthesis, binding refers to the process of assigning operations and memory accesses to hardware resources. When following scheduling, the binding process yields a mapping of operations to hardware resources at specific cycles.

4.2.1. Singleton-Share Binding. Our proposed heuristic, referred to as Singleton-Share Binding, involves a two-stage process that first performs binding in each TMR module separately and then merges nonconflicting bindings between TMR modules. By binding operations from different TMR modules to the same resource, the heuristic makes tradeoffs between the circuit's capability to correct an error and the total area needed by the circuit. Using the aforementioned EC% constraint, a designer can limit the total amount of intermodule binding that can occur.

Algorithm 3 displays the pseudocode of Singleton-Share Binding.

In the first stage, Singleton-Share Binding considers each TMR module separately and binds the module's operations to resources using the Left Edge Algorithm (LEA) as originally described in [29]. Although originally introduced as a method for packing wire segments, this algorithm has found popularity in resource binding, where it binds as many operations to a resource as possible before moving on to another resource.
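A minimal Python sketch of LEA-style binding may clarify the first stage. This is our illustration (the function name and interval representation are our assumptions): operations are viewed as lifetime intervals over cycles and packed greedily onto as few resources as possible.

```python
def left_edge_bind(intervals):
    """Left Edge Algorithm sketch: bind operations to resources.

    intervals: dict op -> (start, end) cycles during which the op
    occupies a resource. Returns dict resource index -> list of ops,
    filling each resource left to right before opening a new one.
    """
    # Sort by left edge (start cycle), the defining step of LEA.
    unbound = sorted(intervals, key=lambda o: intervals[o][0])
    bindings = []
    while unbound:
        ops, last_end = [], -1
        for o in list(unbound):
            start, end = intervals[o]
            if start > last_end:           # fits after the previous op
                ops.append(o)
                last_end = end
                unbound.remove(o)
        bindings.append(ops)               # one resource per pass
    return {i: ops for i, ops in enumerate(bindings)}

# Four operations scheduled over cycles 1..3; "a", "b", "d" share one
# resource, while the overlapping "c" needs a second.
print(left_edge_bind({"a": (1, 1), "b": (2, 2), "c": (1, 2), "d": (3, 3)}))
```

Because the stage-1 binder runs per module, any resource it leaves with a single operation becomes a "singleton" candidate for the stage-2 merging described next.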

In the second stage, the binding algorithm consolidates singleton bindings (i.e., resources with only one mapped operation) and attempts to merge each singleton binding with a nonconflicting binding from another module. By merging these bindings, a single resource will contain operation bindings from different TMR modules, which reduces the circuit's error-correcting capability for the benefit of a lower area cost. This binding process will therefore reduce the total number of resources used in the design at the cost of reducing the design's EC%. As mentioned before, the heuristic maintains 100% error detection by limiting the resource sharing to only two modules. This approach allows the design to detect errors, albeit uncorrectable ones, in the event of a fault in a shared resource. Algorithm 3 shows the pseudocode for the second stage, which first merges bindings between singleton and nonsingleton bindings and then merges bindings between singleton binding pairs. Bindings may only be merged if the resulting binding does not involve multiple operations scheduled on the same cycle or operations from all three modules.

The key requirement of the second stage is ensuring that the EC% does not fall under the error constraint E. Since the number of uncorrectable errors equals the number of resources that have been shared across two modules, this constraint satisfaction is expressed in

\[ \frac{resources_{used} - resources_{shared}}{resources_{used}} \times 100\% \ge E \quad (2) \]

which can be reduced to

\[ resources_{shared} \le resources_{used} \times \left(1 - \frac{E}{100}\right). \quad (3) \]

These equations, however, do not directly reveal a limit on the number of shares that can be performed by the second-stage binder. Since sharing a resource also reduces the number of resources, the number of shares i and the initial number of bindings are related by

\[ i \le (usage_{stg1} - i) \times \left(1 - \frac{E}{100}\right). \quad (4) \]

Algorithm 3: Singleton-Share Binding.

Input: Operator assignments to control-steps, S
Output: Operator bindings to resources, B_stg2

begin
  /* Stage 1 Binding */
  B_stg1 ← ∅;
  foreach module m do
    B_stg1 ← B_stg1 ∪ LeftEdgeAlgorithm(S_m);
  end
  /* Stage 2 Binding */
  shares ← 0;
  limit ← GetLimit(B_stg1);
  singleton, non-singleton ← SeparateRes(B_stg1);
  /* Stage 2: Merge singletons with non-singletons */
  foreach resource s of singleton do
    foreach resource n of non-singleton do
      if shares = limit then break;
      if IsMergeable(s, n) then
        add operator of s to n and remove s from singleton;
        shares ← shares + 1;
      end
    end
  end
  /* Stage 2: Merge singletons with other singletons */
  foreach pair of singleton (s, n) do
    if shares = limit then break;
    if IsMergeable(s, n) then
      add operator of s to n and remove s and n from singleton;
      add n to non-singleton;
      shares ← shares + 1;
    end
  end
  B_stg2 ← singleton ∪ non-singleton;
  return B_stg2;
end

Since the number of shares must be an integer, we define the integer limit as the maximum number of shares performed by the second-stage binder while meeting the error constraint. Solving (4) for this limit yields

\[ limit \le usage_{stg1} \times \frac{100 - E}{200 - E}. \quad (5) \]
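A minimal sketch of the integer limit from (5), checked against the worked example discussed with Figure 5 (six stage-1 resources and a 50% EC constraint permit two shares); the `shareLimit` helper name is hypothetical:

```cpp
#include <cassert>

// Maximum number of second-stage shares that keeps EC% >= E, from (5):
// limit <= usage_stg1 * (100 - E) / (200 - E), taken as an integer floor
// (integer division floors for these non-negative operands).
int shareLimit(int usageStg1, int ecConstraint) {
    return usageStg1 * (100 - ecConstraint) / (200 - ecConstraint);
}
```

As a sanity check against (2): with six initial resources and two shares, the design uses four resources of which two are shared, giving (4 - 2)/4 x 100% = 50% EC, exactly meeting a 50% constraint.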

Figure 5 depicts an example of the binding process. In these subfigures, each nonsingleton resource is represented by a node with a solid color, whereas each singleton resource is represented by a node with a pattern. The format A_n on each node refers to an operation A from the original DFG performed on module n. Figure 5(a) therefore displays three modules, each with distinct singleton and nonsingleton resources, and represents a possible scheduling and binding after the first stage of binding.

Using a 50% EC constraint and (5), the second stage of binding may share up to two resources while meeting the EC% constraint. As the first step, the binding process attempts to merge singleton bindings with nonsingleton bindings. In Figure 5(a), the only candidates for sharing are the singleton bindings of either node B_1 or node A_3 with the nonsingleton binding containing nodes A_2, C_2, and D_2 (the light-gray resource). The singleton binding containing B_2 is ineligible since all nonsingleton bindings already have a conflicting operation scheduled during cycle 2. Similarly, no other nonsingleton binding can be merged with a singleton, as each contains conflicting operations during cycles 1 and 2. Merging the singleton binding of node B_1 with the candidate nonsingleton binding, the binding process produces Figure 5(b). Notice that node B_1 is now part of the nonsingleton binding of A_2, C_2, and D_2, as represented by the same color.

Since there are no more candidate pairs, the binding process then attempts to merge singleton binding pairs. In Figure 5(b), there is only one candidate pair: the singletons of B_2 and A_3. Merging these bindings, the binding process produces a nonsingleton binding containing these two nodes, as seen in Figure 5(c) with the solid black color. At this point, the binding process completes, as the number of shared resources has met the limit. Instead of using the initial six resources, the final design now produces a scheduling and binding on four resources with a 50% EC. If the limit had not been reached, the binding process would continue until there were no more candidate pairs.

For register binding, our heuristic provides a register for each functional unit to avoid relatively expensive multiplexers on the FPGA's LUT-based architecture, while also providing register sharing for compatible variables originating from the same functional unit. Due to the FPGA's register-rich fabric, register sharing can be more expensive than register duplication and is rarely justified [30].

5. Experiments

In this section, we evaluate the effectiveness of the FD-FTA heuristic on a variety of benchmarks. Section 5.1 describes the experimental setup. Section 5.2 shows a comparison in resource savings with TMR applied to existing RTL. Section 5.3 shows a comparison in resource savings with a previous approach. Section 5.4 shows a comparison in execution time with the previous approach. Section 5.5 illustrates a comparison of clock frequencies using the different approaches.

5.1. Experimental Setup. To evaluate our heuristic, we implemented Algorithm 1 in C++, while also using Vivado 2015.2 to provide LUT counts for each resource and the clock frequency of the final solution. Our target architecture was a Xilinx Virtex-7 xc7vx485tffg1761-2, which uses Xilinx 7 Series Configurable Logic Blocks with 6-input LUTs.

5.1.1. Benchmarks. Table 1 summarizes the benchmarks, showing the number of nodes and edges and the operation types. To represent signal-processing applications, we used DFGs for 5x5 convolution, an 8-point radix-2 butterfly FFT, a 16-point radix-2 butterfly FFT, and a 4-point radix-4 dragonfly FFT. Of these, the radix-4 FFT efficiently expands the complex arithmetic involved into equivalent real operations, as in [31]. For the radix-2 FFTs, we use resources capable of directly performing complex operations. To represent fluid-dynamics and similar applications, we used two DFGs that solve 5-dimensional linear equations (Ax = B) using the Jacobi iterative method and the successive overrelaxation (SOR) iterative method [32]. We also used two DFGs that solve Laplace's equation (\(\nabla^2 f = 0\)) using the Jacobi and SOR methods. We supplement these real benchmarks with synthetic benchmarks (small0-small7, medium0-medium4) that we created using DFGs generated through random directed acyclic graphs.
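For illustration, lowering one complex product into real adds and multiplies, the kind of expansion applied to the radix-4 FFT's complex arithmetic, might look like the following sketch; the `C` struct and `cmul` helper are hypothetical names, not taken from the paper:

```cpp
#include <cassert>

struct C { double re, im; };

// One complex multiply expanded to real operations: four real
// multiplications and two real additions/subtractions.
C cmul(C a, C b) {
    return { a.re * b.re - a.im * b.im,    // real part
             a.re * b.im + a.im * b.re };  // imaginary part
}
```

Counting these real operations (rather than opaque complex resources) is what lets the expanded DFG expose more nodes for scheduling and binding.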

5.1.2. Baselines. Two baselines were used to evaluate our proposed heuristic: a common approach in which designers apply TMR to existing RTL (referred to as TMR-RTL for simplicity) and the approach from [5] (referred to as the previous approach). We use the TMR-RTL baseline to motivate the benefits of our heuristic over common use cases and the previous-approach baseline to demonstrate improvements over the state of the art.

We model the TMR-RTL approach based on two observations: (1) designers often deal with existing RTL code and (2) designers may not be willing or able to perform an extensive exploration of fault-tolerance tradeoffs. Because most existing RTL is implemented without considering the effects of TMR, the code is generally not written to minimize area. To approximate this use case, we use an ASAP schedule and LEA bindings. Although a designer could certainly use an approach that reduces area, we have observed that ASAP schedules are common in RTL cores, likely due to ease of implementation.

5.2. Comparison with TMR Applied to Existing RTL. In this section, we evaluate the effectiveness of our FD-FTA heuristic compared to TMR-RTL under different EC% and latency constraints. Unlike a simple triplicated RTL circuit, an HLS-generated RTL circuit may use more extensive resource sharing during high-level synthesis to automatically target different latency and redundancy constraints. Applying triplication to an existing circuit, in contrast, limits the designer to the circuit's existing architecture and leaves few opportunities to target different constraints, especially when using encrypted or presynthesized IP cores. Table 2 presents the resource savings, where the rows correspond to different benchmarks and the columns correspond to different EC% and latency constraints. We vary the latency constraint from the DFG's minimum-possible latency (as determined by an ASAP schedule) up to 2x the minimum latency, in increments of 0.2x. The EC% constraint is explored at 100% and 70%. On average, the FD-FTA heuristic shows savings compared to TMR-RTL for each pair of normalized latency and EC% constraints. Although relatively small at the lowest latency constraint, these average savings become significant at higher latency constraints, reaching 45% and 49% at a 2.0x normalized latency constraint for the 100% and 70% EC constraints, respectively. For individual benchmarks, the FD-FTA heuristic shows significant savings at each set of constraints except for benchmarks with low operation counts and for certain benchmarks at low normalized latency constraints.

From Table 2, we observe results consistent with expectations. For a normalized latency constraint of 1.0x and an EC constraint of 100%, the FD-FTA heuristic yielded no savings for about half of the benchmarks and up to 65% savings for the remainder. The lack of savings for certain benchmarks is expected, given that some benchmarks have no flexibility to schedule operations on different cycles when constrained by the minimum-possible latency. In such cases, designers should use the lowest-complexity scheduling algorithm, as all algorithms will produce similar schedules. For the other benchmarks, the smaller savings correspond to benchmarks with little to moderate flexibility, whereas the 65% savings in the linsor benchmark corresponds to a DFG with large operation flexibility. In these cases, we start to see a trend of growing resource savings as operation flexibility increases. Additionally, the FD-FTA heuristic displayed a minimal degradation, between -2% and -1%, on some benchmarks. Further investigation of each approach's output reveals that nonoptimal bindings by the FD-FTA heuristic resulted in multiplexers with more inputs than their counterparts in the TMR-RTL approach.

For a normalized latency constraint of 2.0x and an EC constraint of 100%, the FD-FTA heuristic yielded savings up to 75%, with savings above 50% for about half of the benchmark set. We expect such high savings due to the large operation flexibility granted by the latency constraint, which enables the FD-FTA heuristic's global scheduling approach to balance the operation concurrency over twice as many cycles as the minimum-possible latency. The TMR-RTL approach, in contrast, maintains a static schedule despite the larger latency constraint. Like any basic triplication approach, the TMR-RTL approach is unable to target new constraints without manual intervention and may suffer from a higher area cost due to fewer opportunities for resource sharing. Despite these savings, there are a few benchmarks where the FD-FTA heuristic experienced little to no savings. These benchmarks (e.g., lapjacobi, lapsor, and small1) are, however, very small in operation count and represent cases where the FD-FTA heuristic's attempt to balance operation concurrency yields output similar to the TMR-RTL approach. Overall, these results support our expectation that the FD-FTA heuristic performs better with increasing latency constraints due to larger scheduling flexibility.

Comparing the results from nonsynthetic and synthetic benchmarks, Table 2 shows that the trends found in each individual benchmark set are similar to the trends found when both sets are considered together. For nonsynthetic benchmarks, the FD-FTA heuristic yielded no savings for about half of the benchmarks and up to 65% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FD-FTA heuristic generally showed increased savings, with the exception of lapsor and fftRadix4, which displayed minimal degradation between -2% and -1% under certain latency constraints. At a latency constraint of 2.0x, the FD-FTA heuristic yielded savings up to 75%. Similarly, for the synthetic benchmarks, the FD-FTA heuristic also yielded no savings for about half of the benchmarks and up to 30% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FD-FTA heuristic also generally showed increased savings, but at a slower rate than the nonsynthetic benchmarks. At a latency constraint of 2.0x, the FD-FTA heuristic showed savings up to 62%. With similar trends, both benchmark sets also showed averages within 5% of their respective total averages.

To better illustrate these trends, Figure 6 shows the LUT savings of certain benchmarks that exemplify the different behaviors observed. In this graph, four benchmarks are displayed with EC constraints of 70% and 100%. Each benchmark entry is shown with six bars representing the LUT savings compared to the TMR-RTL approach for the six normalized latency constraints.

As one may expect, the impact of a latency constraint on HLS varies considerably with the DFG structure. For lapsor and other small benchmarks, increasing the latency constraint generally had a minimal impact on resource savings, since at small DFG operation counts the FD-FTA heuristic is likely to produce a schedule similar to TMR applied to existing RTL. Similarly, the linsor benchmark also experienced small increases in savings as the latency constraint increased, despite its large initial savings. This behavior is caused by the benchmark's innate structure, which enables much operation-scheduling flexibility even for the minimum-possible latency constraint. Due to this large flexibility, the FD-FTA heuristic's scheduling algorithm can balance the operation concurrency more easily and experiences diminished benefits from the additional flexibility granted by an increased latency constraint. In contrast, the fft8 and conv5x5 benchmarks show significant increases in savings when the latency constraint is increased. We attribute this behavior to minimal scheduling flexibility at the minimum-possible latency constraint, which is alleviated with larger latency constraints.

When comparing the savings of 70% EC with 100% EC, Table 2 shows that the experiments under the 70% EC constraint had additional savings of up to 27% compared to experiments under the 100% EC constraint with the same latency constraint. Similar to previous observations, the decreased EC% largely had no impact on experiments with a normalized latency constraint of 1x, primarily due to the lack of operation flexibility. For this baseline latency constraint, the singletons of each module are likely scheduled on the same cycle and are therefore not eligible for singleton sharing, as seen in a majority of the benchmarks. Otherwise, the 70% EC iterations generally showed additional savings compared to their counterpart 100% EC iterations as the latency constraint increased. Notably, these additional savings remain relatively similar once savings occur. Such savings are expected, as the EC% reflects the number of resources that may be shared across modules, which places an upper bound on the additional savings due to EC.

Despite the overall positive impact of decreasing the EC%, there are a few cases where decreasing the EC% also decreased the LUT savings. For those cases, the savings due to sharing were offset by the cost of increasing the number of inputs on the interconnection multiplexers for the remaining resources. This trend, however, is largely noticeable only in the smaller DFGs, where the cost of input multiplexers may be comparable to the cost of resources. Similarly, there are also cases where decreasing the EC% showed a sudden increase in resource savings for a single latency constraint but not for others, as seen in the lapjacobi benchmark. In these scenarios, the specific latency constraint with increased resource savings may present the only scenario where Singleton-Share Binding is capable of sharing singletons. In contrast, the heuristic may present incompatible singleton bindings due to inflexible schedules at lower latency constraints, or due to suboptimal schedules at higher latency constraints caused by the force-directed global scheduling approach. Overall, the impact of the EC% constraint is counterintuitive, as it relies primarily on the heuristic's distribution of singletons among cycles within each module. This distribution cannot be easily determined from the benchmark or constraints, as the scheduling and binding process may vary the singleton distribution at the slightest change in constraints or DFG structure. The impact is further affected by the shared resource type and by whether the sharing has increased the size of the resource's input multiplexer.

5.3. Comparison with Previous Approach. In this section, we evaluate the effectiveness of our FD-FTA heuristic compared to the approach of previous work described in [5] under different EC% and latency constraints. Table 3 presents the resource savings of the FD-FTA heuristic compared to the previous approach and uses the same presentation style as Table 2. On average, the FD-FTA heuristic shows savings compared to the previous work for each pair of normalized latency and EC% constraints. Although minimal at the lowest latency constraint, these savings become significant at higher latency constraints, reaching 21% and 23% average savings at a 2.0x normalized latency constraint for 100% and 70% EC, respectively. For individual benchmarks, the FD-FTA heuristic commonly shows significant savings at each set of constraints except for benchmarks with low operation counts and for most benchmarks at low normalized latency constraints.

Like the previous experiment, Table 3 shows similar results for benchmarks under the minimum-possible latency constraint. For the normalized latency constraint of 1.0x and a 100% EC constraint, the FD-FTA heuristic generally has a small degradation, ranging from -1% down to -5% in fft8. Similar to the results in Table 2, the FD-FTA heuristic has a tendency to make nonoptimal bindings that result in larger input multiplexers than the previous approach. Although present in all iterations of the experiment, such behavior is usually amortized by better utilization of fewer resources, especially at larger latency constraints. For the few cases where the FD-FTA heuristic achieved savings under these constraints, the iterations involved benchmarks with some flexibility to schedule operations on different cycles. In the case of the largest savings, in the small5 benchmark, the solution produced by the FD-FTA heuristic used far fewer multipliers than the previous approach's solution. This better utilization of the costly multipliers ultimately contributed to a smaller design.

Table 3 shows some clear trends in the results of this experiment: general increases in resource savings as both the latency constraint and the number of operations in the benchmark increase. The experiment displays these trends in a majority of benchmarks and constraint sets, with a few exceptions. These trends are expected, however, as the FD-FTA heuristic was designed to address disadvantages of the previous approach's scheduling algorithm. As mentioned in Section 4.1, the previous approach's scheduling algorithm had a number of scalability issues. The first trend, of savings increasing with the latency constraint, can be attributed to the MR-LC scheduling algorithm on which the previous approach is based. Since MR-LC tries to minimize resources on a local, per-cycle basis, this algorithm keeps a maximum operation count that limits the number of operations that may be scheduled in a cycle. These counts are only increased if the number of zero-slack operations in a cycle exceeds its current resource count. This behavior noticeably causes problems for latency constraints larger than the minimum-possible latency, since all operations are guaranteed to have nonzero slack in the first few cycles. In this case, the algorithm will schedule at most one operation per cycle until it encounters numerous zero-slack operations at later cycles. This lack of initial operation concurrency will likely cause large operation concurrency at later cycles, forcing the design to instantiate additional resources to meet that concurrency. The result is a large number of resources that are underutilized in early cycles. By contrast, the FD-FTA heuristic's scheduling algorithm aims to balance the operation concurrency among all cycles such that all resources maintain high utilization, and it therefore uses fewer overall resources.

Comparing the results from nonsynthetic and synthetic benchmarks, Table 3 shows that the trends found in each individual benchmark set are similar to the trends found when both sets are considered together. For nonsynthetic benchmarks, the FD-FTA heuristic yielded no savings for most benchmarks and up to 19% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FD-FTA heuristic generally showed increased savings, with the exception of lapjacobi and linsor, which decreased in savings. At a latency constraint of 2.0x, the FD-FTA heuristic yielded savings up to 67%. Similarly, for the synthetic benchmarks, the FD-FTA heuristic also yielded no savings for most benchmarks and up to 27% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FD-FTA heuristic also generally showed increased savings, but at a slower rate than the nonsynthetic benchmarks. At a latency constraint of 2.0x, the FD-FTA heuristic showed savings up to 58%. With similar trends, both benchmark sets also showed averages within 3% of their respective total averages.

To better illustrate these trends, Figure 7 shows the resource savings of certain benchmarks that exemplify the different behaviors observed. In this graph, four benchmarks are displayed with EC constraints of 70% and 100%. Each benchmark entry is shown with six bars representing the LUT savings compared to the previous approach for different normalized latency constraints.

Similar to Figure 6, the impact of a latency constraint on HLS varies considerably with the DFG structure and with the effectiveness of the previous approach. For lapsor and other small benchmarks, increasing the latency constraint generally had a small impact on resource savings, since the FD-FTA heuristic is likely to produce a schedule similar to the previous approach. This behavior is mainly caused by the small operation count, which limits the number of possible schedules. For a 70% EC constraint and a 2.0x normalized latency constraint, the FD-FTA heuristic even displays a notable loss in savings. It should be cautioned that, for such small DFGs, any "significant" gain or loss of savings may be only the difference of one additional adder.

For the fft8 and conv5x5 benchmarks, we observe a significant increase in savings as the latency constraint increases. This trend represents the general case, where the previous approach has difficulty finding an optimal schedule at larger operation counts. There are exceptions to this trend, as seen in the linsor benchmark, where the previous approach performs better than the FD-FTA heuristic under most latency constraints. In fact, the FD-FTA heuristic does progressively worse as the latency constraint increases.

The second trend, of increased savings with increased operation count, can be attributed to the previous approach's iterative nature and the randomness in its scheduling algorithm. Since that approach largely relies on finding an optimal schedule randomly over many iterations, it is much more likely to find an optimal schedule in small DFGs than in larger DFGs, given good starting parameters. As a result, as the size of the DFG increases, the previous approach finds less optimal solutions and encounters its exit condition much earlier. Although a designer could modify the starting parameters to find a better solution at different sizes, such an approach requires much fine-tuning and is not scalable. In contrast, the FD-FTA heuristic uses a global scheduling approach that balances operation concurrency over all possible cycles, which allows the heuristic to scale with the size of the DFG at the expense of additional complexity.

5.4. Heuristic Execution Time Comparison. This section compares the average heuristic execution times of the FD-FTA heuristic and the previous approach. In this context, execution time refers to the duration of the scheduling and binding process and does not include the placement and routing of the design on an FPGA. We provide this comparison to demonstrate the scalability of these heuristics, which may limit the practical use cases of a heuristic. For FPGAs, long execution times may be an acceptable tradeoff, because placement and routing generally dominate compile times. Relatively short execution times may make a heuristic amenable to fast recompilation in areas such as FPGA overlays [33].

Figure 8 presents the results of this experiment, comparing the execution times of the two approaches with a 100% EC constraint and a 2x normalized latency constraint. The figure is arranged such that the horizontal axis represents benchmarks ordered by operation count from smallest to largest, and the vertical axis represents execution time in milliseconds on a logarithmic scale. It should be noted that each tick of the horizontal axis does not represent the same increase in operation count between benchmarks. Similarly, benchmarks of the same operation count may vary in execution time for both heuristics based on the overall DFG structure.

Generally, the results of this experiment match our expectations for execution time. The figure shows that, for each benchmark, the previous approach's execution time is orders of magnitude longer than the FD-FTA heuristic's, which achieves an average speedup of 96x, a median speedup of 40x, and a maximum speedup of 570x in the fft16 benchmark. Our switch to the proposed FD-FTA heuristic was intended to address the previous approach's long execution time.

Based on the experiments on resource savings, these results suggest that both heuristics and the TMR-RTL approach have Pareto-optimal solutions within the design space of execution times, resource requirements, and latency constraints. Although the TMR-RTL approach does not require any scheduling and binding, results from Table 3 indicate that a much smaller design may be achievable with the other approaches, especially at larger latency constraints. Similarly, the approach of previous work may take vastly longer than the other approaches, but it may achieve significantly smaller designs for latency constraints at the minimum-possible latency, with a negligible difference in execution time for small DFGs. Notably, the FD-FTA heuristic achieves significantly more savings than the previous work at increased latency constraints and on larger DFGs.

5.5. Clock Results. This section compares the clock frequencies of designs generated by the proposed heuristic, the TMR-RTL approach, and the previous approach. Since increased resource sharing can increase the size of input multiplexers and potentially lengthen the critical path, we include this experiment to explore the impact of each approach on a design's clock frequency. Since synthesis tools perform placement and routing with a pseudorandom process, we determine an approach's clock frequency over multiple compilation runs through a script that performs a binary search over the possible clock-constraint space and determines the theoretical maximum clock frequency based on a target frequency and slack values. Due to the length of this process, we provide this comparison only on a subset of nonsynthetic benchmarks and on a reduced set of constraints. For constraints, we test each selected benchmark at 1x and 2x normalized latency constraints with a 100% EC constraint. Since the TMR-RTL approach does not depend on the latency constraint, we repeat its value for both latency constraints for comparison purposes.
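The binary search over the clock-constraint space might be sketched as follows, with a stand-in `compileSlack` callback in place of an actual place-and-route run; all names here are hypothetical illustrations of the script's strategy, not the authors' script:

```cpp
#include <cassert>
#include <functional>

// Binary search for the smallest achievable clock period. compileSlack(p)
// stands in for a compilation run that returns the worst timing slack (ps)
// when the design is constrained to period p; nonnegative slack means the
// constraint was met. Halve the interval until it is within tolPs.
double minAchievablePeriod(const std::function<double(double)>& compileSlack,
                           double loPs, double hiPs, double tolPs = 1.0) {
    while (hiPs - loPs > tolPs) {
        double mid = 0.5 * (loPs + hiPs);
        if (compileSlack(mid) >= 0.0)
            hiPs = mid;  // constraint met: try a tighter period
        else
            loPs = mid;  // constraint missed: relax the period
    }
    return hiPs;         // tightest period known to meet timing
}
```

The reported maximum frequency is then simply the reciprocal of the returned period. In practice each probe is a full place-and-route run, which is why the paper limits this experiment to a benchmark subset.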

To obtain these results, we convert the solution's netlist produced by the C++ code into RTL code through scripts. We then use Vivado 2015.2 for compilation and for determining clock frequencies after placement and routing. For resources, we use Xilinx IP cores for each respective resource and configure them to be implemented in LUTs with single-cycle latencies.

Figure 9 shows the results of this experiment. In this figure, each approach has a bar for each benchmark and normalized latency constraint pairing. On the horizontal axis, the results are clustered by benchmark and then by normalized latency constraint relative to the minimum-possible latency. On the vertical axis, the clock frequency is reported in MHz.

For each benchmark and latency constraint pairing, Figure 9 presents similar results for each approach. Compared to the TMR-RTL approach, the previous approach and the FD-FTA heuristic had average clock degradations of 7.08% and 6.03%, respectively. This clock overhead is unexpectedly small, since sharing increases the steering logic, and the FD-FTA heuristic and previous approach incorporate much more sharing than the TMR-RTL approach. Upon further analysis of the timing reports, each approach had different critical paths. For the TMR-RTL approach, the critical path was generally a path from an input register to the output register of a multiplier. For the other approaches, the critical path was usually a path from the output register of one resource to the output register of a multiplier. These results suggest that the larger routing problem associated with using more resources in the TMR-RTL approach had a similar impact on a design's clock frequency as the steering logic of the other approaches. The one exception to this trend was the conv9x9 benchmark, where the TMR-RTL approach had much higher clock frequencies than the other two approaches. In terms of increasing the latency constraint, both the FD-FTA heuristic and the previous approach generally showed minor increases in clock frequency.

6. Conclusions

This paper introduced the Force-Directed Fault-Tolerance-Aware (FD-FTA) heuristic to solve the minimum-resource, latency- and error-constrained scheduling and binding problem. Compared to TMR applied to existing RTL (TMR-RTL) and to a previous approach, this heuristic provides attractive tradeoffs by efficiently performing redundant operations on shared resources. The FD-FTA heuristic improves upon previous work with better scheduling and binding algorithms that scale to larger benchmarks and larger latency constraints. Compared to TMR-RTL implementations, the FD-FTA heuristic had average savings of 34%, and average savings of 47% at a 2x normalized latency constraint. Although skewed by small benchmarks, which have few opportunities for sharing, the FD-FTA heuristic had savings up to 80% under a 2x normalized latency constraint, with most savings occurring in the larger benchmarks. Compared to the previous approach, the FD-FTA heuristic had average savings of 19%, and average savings of 22% at a 2x normalized latency constraint. The FD-FTA's comparisons with the previous approach are also skewed by small benchmarks and showed up to 74% resource savings under a 2x normalized latency constraint, with most savings occurring in the larger benchmarks. Additionally, the FD-FTA heuristic showed a 96x average execution-time improvement compared to the previous approach and produced FPGA designs with a 6% average clock degradation compared to the TMR-RTL implementations. Future work includes support for pipelined circuits and alternative strategies for FPGA fault tolerance.

https://doi.org/10.1155/2017/5419767

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported in part by the I/UCRC Program of the National Science Foundation, under Grant nos. EEC-0642422, IIP-1161022, and CNS-1149285.

References

[1] N. Wulf, A. D. George, and A. Gordon-Ross, "Memory-aware optimization of FPGA-based space systems," in Proceedings of the IEEE Aerospace Conference (AERO '15), pp. 1-13, IEEE, March 2015.

[2] M. Wirthlin, "High-reliability FPGA-based systems: space, high-energy physics, and beyond," Proceedings of the IEEE, vol. 103, no. 3, pp. 379-389, 2015.

[3] M. J. Wirthlin, "FPGAs operating in a radiation environment: lessons learned from FPGAs in space," Journal of Instrumentation, vol. 8, no. 2, Article ID C02020, 2013.

[4] S. Golshan, H. Kooti, and E. Bozorgzadeh, "SEU-aware high-level data path synthesis and layout generation on SRAM-based FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 6, pp. 829-840, 2011.

[5] A. Shastri, G. Stitt, and E. Riccio, "A scheduling and binding heuristic for high-level synthesis of fault-tolerant FPGA applications," in Proceedings of the 26th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '15), pp. 202-209, July 2015.

[6] A. Antola, V. Piuri, and M. Sami, "High-level synthesis of data paths with concurrent error detection," in Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 292-300, Austin, Tex, USA, November 1998.

[7] P. S. Ostler, M. P. Caffrey, D. S. Gibelyou et al., "SRAM FPGA reliability analysis for harsh radiation environments," IEEE Transactions on Nuclear Science, vol. 56, no. 6, pp. 3519-3526, 2009.

[8] O. Heron, T. Arnaout, and H.-J. Wunderlich, "On the reliability evaluation of SRAM-based FPGA designs," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), pp. 403-408, August 2005.

[9] O. Boncalo, A. Amaricai, C. Spagnol, and E. Popovici, "Cost effective FPGA probabilistic fault emulation," in Proceedings of the 32nd NORCHIP Conference (NORCHIP '14), pp. 1-4, October 2014.

[10] A. Janning, J. Heyszl, F. Stumpf, and G. Sigl, "A cost-effective FPGA-based fault simulation environment," in Proceedings of the 8th International Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC '11), pp. 21-31, September 2011.

[11] J. McCollum, "ASIC versus antifuse FPGA reliability," in Proceedings of the IEEE Aerospace Conference, pp. 1-11, March 2009.

[12] K. S. Morgan, D. L. McMurtrey, B. H. Pratt, and M. J. Wirthlin, "A comparison of TMR with alternative fault-tolerant design techniques for FPGAs," IEEE Transactions on Nuclear Science, vol. 54, no. 6, pp. 2065-2072, 2007.

[13] C. Bolchini, A. Miele, and M. D. Santambrogio, "TMR and partial dynamic reconfiguration to mitigate SEU faults in FPGAs," in Proceedings of the 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT '07), pp. 87-95, September 2007.

[14] J. M. Johnson and M. J. Wirthlin, "Voter insertion algorithms for FPGA designs using triple modular redundancy," in Proceedings of the 18th Annual ACM SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '10), pp. 249-258, ACM, Monterey, Calif, USA, February 2010.

[15] P. G. Paulin and J. P. Knight, "Scheduling and binding algorithms for high-level synthesis," in Proceedings of the 26th ACM/IEEE Design Automation Conference, pp. 1-6, June 1989.

[16] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, "A formal approach to the scheduling problem in high level synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 10, no. 4, pp. 464-475, 1991.

[17] P. Coussy, D. D. Gajski, M. Meredith, and A. Takach, "An introduction to high-level synthesis," IEEE Design and Test of Computers, vol. 26, no. 4, pp. 8-17, 2009.

[18] S. Tosun, N. Mansouri, E. Arvas, M. Kandemir, and Y. Xie, "Reliability-centric high-level synthesis," in Proceedings of the Design, Automation and Test in Europe (DATE '05), vol. 2, pp. 1258-1263, March 2005.

[19] X. Chen, W. Yang, M. Zhao, and J. Wang, "HLS-based sensitivity-inductive soft error mitigation for satellite communication systems," in Proceedings of the 22nd IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS '16), pp. 143-148, July 2016.

[20] M. B. Hammouda, P. Coussy, and L. Lagadec, "A unified design flow to automatically generate on-chip monitors during high-level synthesis of hardware accelerators," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, pp. 384-397, 2017.

[21] S. Hadjis, A. Canis, J. H. Anderson et al., "Impact of FPGA architecture on resource sharing in high-level synthesis," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '12), pp. 111-114, ACM, Monterey, Calif, USA, February 2012.

[22] A. F. Dos Santos, L. A. Tambara, F. Benevenuti, J. Tonfat, and F. L. Kastensmidt, "Applying TMR in hardware accelerators generated by high-level synthesis design flow for mitigating multiple bit upsets in SRAM-based FPGAs," in Applied Reconfigurable Computing, pp. 202-213, Springer, 2017.

[23] H. Quinn, D. Roussel-Dupre, M. Caffrey et al., "The Cibola flight experiment," ACM Transactions on Reconfigurable Technology and Systems, vol. 8, no. 1, article no. 3, 2015.

[24] A. Orailoglu and R. Karri, "Automatic synthesis of self-recovering VLSI systems," IEEE Transactions on Computers, vol. 45, no. 2, pp. 131-142, 1996.

[25] K. Kyriakoulakos and D. Pnevmatikatos, "A novel SRAM-based FPGA architecture for efficient TMR fault tolerance support," in Proceedings of the International Conference on Field Programmable Logic and Applications, pp. 193-198, August 2009.

[26] T. Inoue, H. Henmi, Y. Yoshikawa, and H. Ichihara, "High-level synthesis for multi-cycle transient fault tolerant datapaths," in Proceedings of the IEEE 17th International On-Line Testing Symposium (IOLTS '11), pp. 13-18, July 2011.

[27] P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASIC's," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 6, pp. 661-679, 1989.

[28] F. J. Kurdahi and A. C. Parker, "REAL: a program for register allocation," in Proceedings of the 24th ACM/IEEE Design Automation Conference, pp. 210-215, June 1987.

[29] A. Hashimoto and J. Stevens, "Wire routing by optimizing channel assignment within large apertures," in Proceedings of the 8th Design Automation Workshop (DAC '71), pp. 155-169, ACM, Atlantic City, NJ, USA, June 1971.

[30] A. Canis, J. Choi, M. Aldham et al., "LegUp: high-level synthesis for FPGA-based processor/accelerator systems," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '11), pp. 33-36, ACM, Monterey, Calif, USA, March 2011.

[31] J. A. Vite-Frias, R. D. J. Romero-Troncoso, and A. Ordaz-Moreno, "VHDL core for 1024-point radix-4 FFT computation," in Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGAs (ReConFig '05), pp. 4-24, Puebla, Mexico, September 2005.

[32] J. Hu, Solution of partial differential equations using reconfigurable computing [Ph.D. thesis], The University of Birmingham, 2010.

[33] J. Coole and G. Stitt, "Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts," IEEE Micro, vol. 34, no. 1, pp. 42-53, 2014.

David Wilson, (1) Aniruddha Shastri, (2) and Greg Stitt (1)

(1) Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA

(2) National Instruments Corp., 11500 N Mopac Expwy, Austin, TX 78759, USA

Correspondence should be addressed to David Wilson; [emailprotected]

Received 31 March 2017; Revised 2 July 2017; Accepted 11 July 2017; Published 21 August 2017

Academic Editor: Michael Hubner

Caption: Figure 1: An example binding onto three color-coded resources. For this binding, a single fault can result in an undetectable, uncorrectable (but detectable), or correctable error, depending on the resource.

Caption: Figure 2: Illustrations of different structures used in Force-Directed Scheduling. The diagram in (a) shows the ASAP and ALAP schedules of a DFG and the respective time frames of the operations. The chart in (b) shows the distribution graph assuming both operations use the same resource type.
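The distribution graph in Figure 2(b) follows the standard force-directed-scheduling construction [27]: each operation contributes a uniform probability of 1/(time-frame length) to every control step in its [ASAP, ALAP] frame. A minimal sketch (the operation names and frame values are illustrative, not the figure's exact DFG):

```python
def distribution_graph(frames, num_steps):
    """Sum each operation's uniform scheduling probability over its
    [ASAP, ALAP] time frame, per the force-directed formulation."""
    dg = [0.0] * num_steps
    for asap, alap in frames.values():
        prob = 1.0 / (alap - asap + 1)     # uniform over the frame
        for step in range(asap, alap + 1):
            dg[step] += prob
    return dg

# Operation "A" may be scheduled in steps 0-2; "B" is fixed at step 1.
dg = distribution_graph({"A": (0, 2), "B": (1, 1)}, num_steps=3)
# dg peaks at step 1, where both operations can overlap: [1/3, 4/3, 1/3]
```

The scheduler uses such peaks as "forces": assigning an operation to a lightly loaded step flattens the graph, which tends to reduce the number of concurrent resources needed.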

Caption: Figure 3: Illustrations of the DFG's time frames after a tentative assignment of operation A.

Caption: Figure 4: Distribution graphs for a cycle 1 assignment (a), a cycle 2 assignment (b), and a cycle 3 assignment (c) of operation A.

Caption: Figure 5: Sample binding for a latency constraint of 4 cycles and an error constraint of 50%. The binding in (a) depicts the first stage of binding, whereas the bindings in (b) and (c) represent bindings following different steps in the second stage. The binding in (b) depicts the merging of a singleton with a nonsingleton, and the binding in (c) depicts the merging of a pair of singletons.

Caption: Figure 6: Resource savings of the FD-FTA heuristic under different normalized latency constraints for selected benchmarks and EC% constraints when compared to TMR-RTL.

Caption: Figure 7: Resource savings of the FD-FTA heuristic under different normalized latency constraints for selected benchmarks and EC% constraints when compared to the previous approach.

Caption: Figure 8: Execution times of the three approaches on the benchmark set under a 2x normalized latency constraint and a 100% EC constraint.

Caption: Figure 9: Clock frequency of the three approaches on a subset of nonsynthetic benchmarks under 1x and 2x normalized latency constraints and a 100% EC constraint.

Table 1: Benchmark summary.

Benchmark   Description                                        Nodes    Edges    Operation types
conv5x5     5x5 convolution kernel.                            49       190      Add, Mult
conv9x9     9x9 convolution kernel.                            161      932      Add, Mult
(unnamed)   8-point radix-2 DIT FFT based on butterfly         32       116      Add, Sub, Mult
            architecture.
fft16       16-point radix-2 DIT FFT based on butterfly        88       648      Add, Sub, Mult
            architecture.
fftrad4     4-point radix-4 FFT based on dragonfly             40       232      Add, Sub, Mult
            architecture, efficiently decomposing complex
            operations into real operations.
linsor      Single-iteration kernel to solve a system of       64       2657     Add, Sub, Mult, Div
            linear equations (Ax = B) in 5 variables, using
            the successive-over-relaxation (SOR) method.
linjacobi   Single-iteration kernel to solve a system of       60       230      Add, Sub, Mult, Div
            linear equations in 5 variables, using the
            Jacobi method.
lapsor      Single-iteration kernel to solve Laplace's         7        16       Add, Sub, Mult
            equation (∇²φ = 0), using the successive-
            over-relaxation (SOR) method.
lapjacobi   Single-iteration kernel to solve Laplace's         5        9        Add, Sub, Mult
            equation using the Jacobi method.
small0-7    8 randomly generated DFGs with <20 operations.     4-19     2-36     Add, Mult
medium0-4   5 randomly generated DFGs with <130 operations.    88-127   346-642  Add, Sub, Mult, Div

Table 2: Resource savings (LUT%) of the FD-FTA heuristic compared to TMR-RTL for 100% and 70% EC.

               Normalized latency and EC% constraints
                 1.0x        1.2x        1.4x        1.6x        1.8x        2.0x
Benchmark     100%   70%  100%   70%  100%   70%  100%   70%  100%   70%  100%   70%
lapjacobi       0%    0%    0%    0%    5%   30%    2%    5%    2%    2%    5%    4%
lapsor          0%    0%   -1%   -1%   -1%   -1%   -1%    2%   -2%    0%   -2%    0%
conv5x5        18%   18%   52%   52%   58%   65%   64%   67%   62%   71%   68%   71%
(unnamed)       0%    0%   12%   13%   13%   29%   19%   25%   24%   32%   38%   37%
fftRadix4       0%    0%   -1%   15%   21%   29%   21%   39%   28%   43%   32%   43%
fft16           0%    0%   16%   28%    9%   18%   11%   25%  -10%    5%   18%   27%
linjacobi      27%   28%   50%   52%   47%   59%   61%   67%   63%   68%   62%   70%
linsor         65%   69%   68%   68%   70%   72%   70%   72%   71%   72%   70%   74%
conv9x9        19%   19%   50%   50%   66%   70%   69%   74%   76%   78%   75%   80%
Average        14%   15%   27%   31%   32%   41%   35%   42%   35%   41%   40%   45%
small0          1%    1%    1%    1%   10%   13%   10%   13%    2%   11%   48%   57%
small1          0%    1%    0%    1%    0%    1%    3%   30%    3%   30%    2%    4%
small2          0%    0%    0%    0%   20%   22%   20%   22%   19%   30%   39%   40%
small3         30%   30%   30%   30%   57%   57%   53%   57%   45%   58%   58%   57%
small4         30%   49%   30%   49%   38%   38%   49%   57%   40%   49%   57%   57%
small5         27%   44%   35%   54%   37%   45%   28%   30%   38%   39%   45%   46%
small6          5%    5%    5%    5%   11%   20%   29%   38%   29%   28%   31%   49%
small7          0%    0%    0%    0%   31%   35%   39%   47%   50%   57%   54%   58%
medium0         3%    3%   37%   42%   50%   52%   50%   55%   57%   61%   52%   60%
medium1         3%    3%   32%   44%   49%   54%   51%   55%   58%   62%   60%   64%
medium2         0%    0%   38%   42%   51%   54%   58%   63%   60%   63%   62%   67%
medium3         6%    6%   33%   39%   45%   50%   49%   55%   52%   57%   52%   59%
medium4         0%    0%   33%   37%   43%   51%   49%   51%   52%   57%   57%   60%
Average         8%   11%   21%   27%   34%   38%   37%   44%   39%   46%   48%   52%
Total average  11%   13%   24%   28%   33%   39%   37%   43%   37%   44%   45%   49%

Table 3: Resource savings (LUT%) of the FD-FTA heuristic compared to previous work for 100% and 70% EC.

               Normalized latency and EC% constraints
                 1.0x        1.2x        1.4x        1.6x        1.8x        2.0x
Benchmark     100%   70%  100%   70%  100%   70%  100%   70%  100%   70%  100%   70%
lapjacobi       0%    0%    0%    0%    2%   28%   -2%   -2%   -5%   -6%   -2%  -42%
lapsor         19%   19%   -1%   -1%    1%   -5%    1%   -2%    0%  -38%    0%  -33%
conv5x5         0%    0%   40%   41%   49%   56%   54%   59%   53%   64%   59%   63%
(unnamed)      -5%   -5%   10%    9%   12%   27%   17%   22%   22%   29%   36%   35%
fftRadix4       0%   -2%   -6%   11%   12%   22%    9%   30%   12%   31%   13%   27%
fft16          -3%   -1%   15%   28%    9%   18%   11%   25%   -9%    6%   18%   27%
linjacobi      -1%    0%   29%   32%   24%   41%   43%   52%   44%   53%   42%   54%
linsor          0%    7%   -7%  -12%   -8%   -1%  -18%  -33%  -20%  -33%  -46%  -25%
conv9x9        -1%    0%   38%   38%   57%   63%   61%   67%   69%   72%   67%   74%
Average         1%    2%   13%   16%   18%   28%   20%   24%   18%   20%   21%   20%
small0          1%    1%    1%    1%    0%    4%    0%    4%  -21%  -10%   26%   40%
small1          0%    1%    0%    1%    0%    1%    2%    2%    2%    2%    0%  -35%
small2          0%    0%    0%    0%   11%   13%   11%   13%   -1%   12%   13%   15%
small3          0%    0%    0%    0%   34%   30%   25%   24%    1%   19%   20%   -4%
small4         10%   35%   10%   35%   11%   11%   15%   -5%  -18%  -25%   -4%   -4%
small5         27%   45%   28%   48%   22%   31%   -2%   -1%    3%    0%   -1%  -25%
small6          2%    2%    2%    2%   -1%   10%   11%   22%    1%   -1%  -11%  -19%
small7          0%    0%    0%    0%   28%   32%   34%   42%   43%   52%   46%   51%
medium0         0%    0%   33%   39%   44%   46%   44%   48%   49%   53%   40%   50%
medium1         0%   -2%   27%   40%   42%   48%   43%   47%   49%   53%   48%   53%
medium2         0%    0%   37%   41%   50%   53%   55%   60%   56%   60%   58%   63%
medium3         0%    0%   27%   33%   38%   43%   41%   48%   42%   49%   39%   49%
medium4         0%    0%   31%   35%   39%   48%   43%   47%   46%   52%   48%   52%
Average         3%    6%   15%   21%   24%   28%   25%   27%   19%   24%   25%   22%
Total average   2%    4%   14%   19%   22%   28%   23%   26%   19%   23%   23%   21%

COPYRIGHT 2017 Hindawi Limited
