## Chalmers University of Technology



## MCC092 Exercises 2021

Lena Peterson and Kjell Jeppson
with input from many others

Preliminary version
will be updated during course
Revised 0 times so far during the course 2021

## Contents

1 Introduction ..... 5
2 Background material ..... 7
2.1 Logic ..... 7
2.2 Capacitance and charge ..... 7
2.3 Power and energy ..... 7
3 Logic functions, static CMOS gates, ILAs ..... 9
3.1 Realizing logical functions ..... 9
3.2 Iterative logic arrays ..... 11
4 The MOS transistor ..... 13
5 The CMOS inverter ..... 15
5.1 Static characteristics ..... 15
5.2 Dynamic characteristics ..... 17
5.3 Tapered buffers etc. ..... 19
6 Delay for complex gates and paths ..... 21
6.1 Gate delay ..... 21
6.2 Path delay ..... 21
7 Wire delay ..... 25
8 Layout ..... 31
9 Sequential circuits ..... 37
10 Power, energy and scaling ..... 45
10.1 Power and energy ..... 45
10.2 Technology scaling and other scalings too ..... 49
11 Adders ..... 53
12 Solutions ..... 57
12.1 Introduction ..... 57
12.2 Background material ..... 57
12.3 Logic functions ..... 57
12.4 The MOS transistor ..... 62
12.5 The CMOS inverter ..... 64
12.6 Delay for complex gates and paths ..... 69
12.7 Wire delay ..... 74
12.8 Layout ..... 81
12.9 Sequential circuits ..... 86
12.10 Power, energy and scaling ..... 90
12.11 Adders ..... 99

## Appendices

A Templates and graphs to draw on

## Chapter 1

## Introduction

This is a set of exercises for the course MCC092. For the 2017 instance of the course we compiled problems previously used in exams and in various sets of exercises through the years into one self-contained document. This document was further updated for 2019 by adding more problems from recent exams, as well as other problems and examples we have used in class. Still, it is not yet complete and will be updated as we go. We have relied heavily on our old solutions from previous years. However, over the years our practice in how to express things in the course has evolved, so quite likely there are still quite a few errors, some due to this evolution and some just plain typos. We would be extremely happy if you students would report any errors you find to us. We promise to mention you in the list of contributors.

## Chapter 2

## Background material

This part is still missing, but it would be very good to have.
Here we have compiled some problems on topics we expect you to have seen before. However, it may have been a long time ago, so a brush-up may be good.

### 2.1 Logic

### 2.2 Capacitance and charge

### 2.3 Power and energy

## Chapter 3

## Logic functions, static CMOS gates, ILAs

### 3.1 Realizing logical functions

Exercise 3.1: Task tests understanding of the implementation of static CMOS gates from a logical expression. Solution on page 57.

Design static CMOS gates realizing the following Boolean expressions:
a) $Z=\overline{A \cdot B \cdot C \cdot D}$
b) $Z=\overline{A \cdot B \cdot C+D}$
c) $Z=\overline{(A+B+C) \cdot D}$

Exercise 3.2: Task tests understanding of the implementation of static CMOS gates from a logical expression, and of the difference between implementing with given gates and as compound gates. Solution on page 58.

Realize the following Boolean expressions using:

1. NAND, NOR, and-or-invert (AOI) and/or or-and-invert (OAI) gates.
2. Compound gates (that is, implement one static CMOS gate that realizes the full function).

The functions are:
a) $Z=A^{\prime} B+A B^{\prime}$ (XOR)
b) $Z=A B+A^{\prime} B^{\prime}$ (XNOR)
c) $Z=A B^{\prime} C^{\prime}+A^{\prime} B C^{\prime}+A^{\prime} B^{\prime} C+A B C$, which corresponds to the sum function, $\mathrm{SUM}=A \oplus B \oplus C$, in a binary adder.

Which solution is more efficient in terms of the number of transistors used?
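
Before drawing transistors it can help to check a sum-of-products expansion against the truth table. The following is a minimal sketch in plain Python; the helper names `minterms` and `xor3` are ours and not part of the course material. It lists the input combinations for which a function evaluates to 1, here for the three-input XOR used as the sum function.

```python
from itertools import product

def minterms(func, n_inputs):
    """Return the input tuples (in counting order) for which func evaluates to 1."""
    return [bits for bits in product((0, 1), repeat=n_inputs) if func(*bits)]

def xor3(a, b, c):
    """Three-input XOR, i.e. the sum bit of a full adder."""
    return a ^ b ^ c

# Each tuple is (A, B, C); a 0 corresponds to a primed literal in the product term.
for a, b, c in minterms(xor3, 3):
    print(a, b, c)
```

Each printed row maps to one product term, so the number of rows is also the number of product terms in the unminimized expression.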

Exercise 3.3: Task tests understanding the realization of static CMOS gates from a Karnaugh map describing the logic function. Solution on page 59.

Realize the three logic functions described by the Karnaugh maps in Figure 3.1 as static CMOS gates. For each Karnaugh map also identify the logic function.


Figure 3.1: Three Karnaugh maps for problem 3.3.

Exercise 3.4: Tasks test understanding of logic functions and static CMOS gates. Tasks were part of problem 1 in exam 2019-08-29. Solution on page 59.

Consider the static CMOS cell shown in Figure 8.1 on page 31. The cell comprises one compound gate and one inverter.
a) What is the logical function of the cell, $\mathrm{Z}=\mathrm{f}(\mathrm{A}, \mathrm{B}, \mathrm{C}, \mathrm{D}, \mathrm{E})$ ?
b) The n-net can be simplified if we connect inputs $A$ and $D$ together and $B$ and $E$ together, so that there are now only three inputs to the cell: A, B and C. Draw the simplified n-net and explain why it is possible to simplify it this way.

Exercise 3.5: Task tests understanding of identifying logic gates. Exam 2016-12-22 Task 1(a). Solution on page 60.

David Harris, the author of our textbook, holds quite a few patents. One of these is for static CMOS gates that have multiple outputs. The schematic for one such gate is shown in Figure 3.2, taken from the patent (but the gate in the figure is not one of the patented gates). What are the four logical functions for the four outputs Y1 through Y4 indicated in the schematic in Figure 3.2?


Figure 3.2: A schematic of the gate with multiple outputs, taken from a patent application. $V_{\mathrm{DD}}$ is at the top of all pMOS transistors, although this is not so clear from the schematic. There is no contact when lines cross.

### 3.2 Iterative logic arrays

Exercise 3.6: Task tests understanding of logic design, iterative logic arrays and compound gates. Exam 2014-10-22. Solution on page 60.

As a designer of a datapath you are assigned the task to design a comparator for two positive unsigned numbers A and B. The instructions are that there should be two one-bit output signals for the result. One of these should have the two possible values $A=B$ and $A \neq B$. If the first signal has the value $A \neq B$ then the other one-bit signal should indicate whether $A<B$ or $A>B$.
a) Perform the logical design of a one-bit cell that can be used in such a comparator. The cell should generate a two-bit output that indicates if $A=B, A>B$ or $A<B$. Clearly explain how you encode the two one-bit result signals. Include a truth table for the two one-bit output signals of your cell.
b) Implement your one-bit cell in static CMOS. There should be one compound gate for each of the two one-bit output signals. In addition you may have to use inverters at the outputs and/or inputs. Assume that the two one-bit input data signals, A and B, but not their inverses, are available as inputs to the cell.
c) Show how to connect eight instances of your cell to form an 8-bit comparator. Clearly show how the inputs of the first cell should be connected. (The "first" cell is either the least or the most significant bit, depending on your design.)

Exercise 3.7: Task tests understanding of logic design, iterative logic arrays and compound gates. Solution on page 62.

Design an iterative logic array that determines if two unsigned 8-bit words are equal or not.
a) Design the one-bit equal cell and draw its schematic.
b) Show how eight instances of the cell you designed in a) should be connected to form an eight-bit equal circuit.
c) Suggest one simplification we could implement to reduce the number of transistors (and also speed up the circuit).

## Chapter 4

## The MOS transistor

Exercise 4.1: Task tests understanding of MOS regions of operation. Solution on page 62.


Figure 4.1: Regions of operation for an n-channel MOSFET.
a) Figure 4.1 shows the regions of operation for an n-channel MOSFET. What would be the equations for the borderlines between the different regions?
b) Draw a similar "regions of operation" diagram for a p-channel MOSFET.

Exercise 4.2: Task tests understanding of MOS model parameters. Solution on page 63.

The model parameters for an n-channel MOSFET, $k_{N}$ and $V_{\mathrm{TN}}$, in a certain MOSFET technology are given by $k_{N}=900 \mu \mathrm{~A} / \mathrm{V}^{2}$ and threshold voltage $V_{\mathrm{TN}}=0.30 \mathrm{~V}$.
a) Calculate the gate voltage overdrive $V_{\mathrm{GT}}$ if the supply voltage, $V_{\mathrm{DD}}$, is 1.2 V and $V_{\mathrm{GS}}=V_{\mathrm{DD}}$.
b) Calculate the saturation current, $I_{\mathrm{DSAT}}$ when $V_{\mathrm{GS}}=V_{\mathrm{DD}}$.
c) Calculate the saturation voltage, $V_{\mathrm{DSAT}}$.

Exercise 4.3: Exercise tests understanding of MOS capacitance. Solution on page 63.
a) Calculate the gate capacitance for a 1 mm wide MOSFET in the 65 nm CMOS process if its insulator capacitance per unit area, $C_{\text {ox }}$, is given as $20 \mathrm{fF} / \mu \mathrm{m}^{2}$.
b) What would be the gate capacitance of the MOSFET in task 4.4b) if the effective gate length, $L_{\text {eff }}$, in the process is 45 nm and $C_{\mathrm{ox}}$ is $10 \mathrm{fF} / \mu \mathrm{m}^{2}$?
c) Repeat task b) for the MOSFET width in task 4.4c).

Exercise 4.4: Exercise tests understanding of MOS effective resistance. Solution on page 63.
a) Calculate the effective resistances for two MOSFETs delivering maximum currents of $500 \mu \mathrm{~A}$ and $750 \mu \mathrm{~A}$, respectively, at a supply voltage of 1 V .
b) If the effective resistance of a MOSFET in a certain technology is specified as $2 \mathrm{k} \Omega \mu \mathrm{m}$, what would be the effective resistance of a $5 \mu \mathrm{~m}$ wide MOSFET?
c) What if the MOSFET in b) was 280 nm wide? What would its effective resistance be then?
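
The arithmetic in this exercise can be sketched in a few lines of Python. Two assumptions are made here and are not stated in the exercise text: that the effective resistance is taken as $V_{\mathrm{DD}}$ divided by the maximum current, and that a width-normalized resistance (in $\Omega \mu \mathrm{m}$) is divided by the width to get the resistance of a given device. The function names are ours.

```python
def effective_resistance(vdd, i_max):
    """Assumed definition: R_eff = V_DD / I_max, in ohms (volts / amperes)."""
    return vdd / i_max

def resistance_for_width(r_ohm_um, width_um):
    """Scale a width-normalized resistance (ohm * um) to a device of the given width (um)."""
    return r_ohm_um / width_um

# Illustrative calls: a device delivering 500 uA at 1 V, and a 2 kOhm*um
# technology figure applied to a 5 um wide device.
print(effective_resistance(1.0, 500e-6))
print(resistance_for_width(2000.0, 5.0))
```

Note the direction of the scaling: a wider MOSFET delivers more current and therefore has a lower effective resistance.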

## Chapter 5

## The CMOS inverter

We use the CMOS inverter as a model for all static CMOS gates. Therefore it is essential to understand the inverter in detail.

### 5.1 Static characteristics

Exercise 5.1: Task tests understanding of voltage transfer curves, nMOS and pMOS transistors, and the inverter bias point. Solution on page 64.

In Figure 5.1 you see the graphs of $I_{\mathrm{DS}}$ vs $V_{\text {OUT }}$ for the nMOS and pMOS FETs in a CMOS inverter. Match the three nMOS characteristics, (a) through (c), shown to the left in Figure 5.1, with the pMOS characteristics, (d) through (f), that correspond to the same input voltage. Then mark the corresponding bias points in the three voltage transfer curves, (g) through (i), to the right. For more details on the red, blue, and green VTC diagrams refer to Figure 5.3.

Exercise 5.2: Task tests understanding of voltage transfer curves, nMOS and pMOS transistors, and the inverter bias point. It is a variation of Exercise 5.1. Solution on page 64.

For three different input voltages, the output voltage of an inverter is swept from $V_{\mathrm{SS}}$ to $V_{\mathrm{DD}}$ while measuring the two MOSFET currents, $I_{\text {DSN }}$ and $I_{\text {DSP }}$. The resulting current-voltage characteristics are shown in Figure 5.2.
a) Match the two MOSFET currents for each of the three inverter input voltages, and find the bias points where the two currents are equal.
b) In the diagram of CMOS inverter regions shown in Figure 5.3, mark each of these three bias points with B, C, or D, depending on the region of operation in the $V_{\text {OUT }}$ vs $V_{\text {IN }}$ graph to which they belong.

Exercise 5.3: Task tests understanding of inverter switching voltage and VTC. Solution on page 64.
a) Add a secondary axis representing the "short-circuit" current through the inverter as shown in Figure 5.3. Sketch the "short-circuit" current through the inverter vs. $V_{\text {IN }}$ based on your knowledge about the current through the current-limiting MOSFET.
b) Use the square-law MOSFET model and Kirchhoff's current law to derive the expression for the switching voltage of an electrically symmetrical CMOS inverter ($k_{n}=k_{p}$, $V_{T N}=-V_{T P}$).
c) What happens to the switching voltage if $k_{n}=4 k_{p}$ ?
d) Derive an expression for the inverter switching voltage ( $V_{\text {IN }}=V_{\text {OUT }}$ ) in the general case based on the square-law MOSFET models.
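
A derived expression for the switching voltage can be checked numerically. The sketch below assumes the plain square-law model with both transistors saturated at $V_{\text {IN }}=V_{\text {OUT }}=V_{\mathrm{SW}}$, so that $\frac{k_{n}}{2}\left(V_{\mathrm{SW}}-V_{T N}\right)^{2}=\frac{k_{p}}{2}\left(V_{\mathrm{DD}}-V_{\mathrm{SW}}+V_{T P}\right)^{2}$ with $V_{T P}<0$. The function name and the example values are ours and serve only as illustration.

```python
from math import sqrt

def switching_voltage(vdd, vtn, vtp, kn, kp):
    """Switching voltage of a CMOS inverter under the square-law model.

    vtp is the (negative) pMOS threshold voltage. Taking the positive root of
    the current balance gives r*(V_SW - vtn) = vdd - V_SW + vtp, r = sqrt(kn/kp).
    """
    r = sqrt(kn / kp)
    return (vdd + vtp + r * vtn) / (1.0 + r)

# Symmetrical inverter, then an inverter with kn = 4*kp (illustrative values).
print(switching_voltage(1.2, 0.3, -0.3, 1.0, 1.0))
print(switching_voltage(1.2, 0.3, -0.3, 4.0, 1.0))
```

A stronger n-channel device pulls the switching voltage down towards $V_{T N}$, which is the trend task c) asks about.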


Figure 5.1: To the left, (a) to (c), are three nMOS characteristics each for one input voltage. In the middle, (d) to (f), are three pMOS characteristics for the same input voltages but not in the order that correspond to that of the nMOS characteristics. To the right, (g) through (i), are three transfer curves in which to indicate the bias points for cases (a) through (c).

Exercise 5.4: Task tests understanding of noise margins. Solution on page 66.

## The solution for this problem is not included yet, because this problem is part of prelab 1.

To account for voltage fluctuations, i.e. noise, the valid high and low output voltages are usually defined within certain ranges like $0 \leq V_{\mathrm{OUT}} \leq V_{\mathrm{OL}, \text { max }}$, and $V_{\mathrm{OH}, \text { min }} \leq V_{\mathrm{OUT}} \leq V_{\mathrm{DD}}$. Since CMOS is a robust technology, the input voltage can vary within ranges larger than those defined for valid output voltages without causing invalid output voltages, $0 \leq V_{\mathrm{IN}} \leq V_{\mathrm{IL}, \max }$, and $V_{\mathrm{IH}, \min } \leq V_{\mathrm{IN}} \leq V_{\mathrm{DD}}$. These regions are usually defined from the two points, $\left(V_{\mathrm{OL}, \max }, V_{\mathrm{IH}, \min }\right)$ and $\left(V_{\mathrm{OH}, \min }, V_{\mathrm{IL}, \max }\right)$, on the VTC where the amplifications are equal to minus one, $A_{v}=-1$.
a) Derive expressions for the low and high noise margins, NML and NMH, as defined in Figure 5.4, using the following expressions for $\left(V_{\mathrm{OL}, \max }, V_{\mathrm{IH}, \min }\right)$ and $\left(V_{\mathrm{OH}, \min }, V_{\mathrm{IL}, \max }\right)$:

$$
\begin{array}{r}
\left(V_{\mathrm{OL}, \max }=\frac{V_{\mathrm{DD}}+V_{\mathrm{TP}}-V_{\mathrm{TN}}}{8}, V_{\mathrm{IH}, \min }=V_{\mathrm{SW}}+\frac{V_{\mathrm{DD}}+V_{\mathrm{TP}}-V_{\mathrm{TN}}}{8}\right) \\
\left(V_{\mathrm{OH}, \min }=V_{\mathrm{DD}}-\frac{V_{\mathrm{DD}}+V_{\mathrm{TP}}-V_{\mathrm{TN}}}{8}, V_{\mathrm{IL}, \max }=V_{\mathrm{SW}}-\frac{V_{\mathrm{DD}}+V_{\mathrm{TP}}-V_{\mathrm{TN}}}{8}\right) .
\end{array}
$$

b) What are the explicit noise margin values in terms of fraction of $V_{\mathrm{DD}}$ if $V_{\mathrm{TN}}=-V_{\mathrm{TP}}=V_{\mathrm{DD}} / 5$ ?




Figure 5.2: The CMOS inverter with three input voltages and corresponding curves for NMOS and PMOS transistors.


Figure 5.3: The CMOS inverter voltage transfer curve (VTC) with regions marked.

Exercise 5.5: Problem tests understanding of voltage transfer curves and noise margin. Slightly adapted from exam 2016-12-22 Tasks 2(a) and (c). Solution on page 66.

Figure 5.5 shows the voltage transfer curves for three of the four logical gates in the schematic in Figure 3.2, simulated with all transistors having the same widths and with all four inputs, A-D, connected together, so that there is now only one input. The VTCs have been simulated in a DC analysis in Cadence in the usual 65 nm process. For simplicity you can assume that $V_{T N} \approx-V_{T P}$.
a) Which logical output of Y1-Y4 corresponds to each of the three VTCs named X (red), Y (green) and W (purple) in the graph in Figure 5.5? Motivate!
b) Figure 5.5 also contains a plot of the derivative of the VTC for the X output (the red dash-dotted line). From the data given in Figure 5.5, calculate the noise margin for the X output. The graph in Figure 5.5 is repeated in larger scale in Figure A.1 for your convenience.

### 5.2 Dynamic characteristics

Exercise 5.6: Task tests RC circuits. Solution on page 67.

Analyze the RC circuit and show that the time needed for the exponential decay of the voltage across the capacitor to $50 \%$ of the initial voltage, $V_{\mathrm{DD}}$, is given by $t_{\mathrm{d}}=R C \ln (2)$ !
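
The result can be checked numerically once it has been derived. This is a minimal sketch assuming the standard discharge solution $V(t)=V_{\mathrm{DD}} e^{-t / R C}$; the component values are illustrative and not taken from the exercise.

```python
from math import exp, log

def half_swing_delay(r, c):
    """Time for an RC discharge to reach 50 % of its initial voltage: R*C*ln(2)."""
    return r * c * log(2)

def capacitor_voltage(t, vdd, r, c):
    """Voltage across a capacitor discharging from vdd through r."""
    return vdd * exp(-t / (r * c))

# Illustrative values only: 2 kOhm, 1.3 fF, 1 V.
R, C, VDD = 2e3, 1.3e-15, 1.0
t_d = half_swing_delay(R, C)
print(t_d, capacitor_voltage(t_d, VDD, R, C))
```

Evaluating the voltage at $t_{\mathrm{d}}$ should return half of the initial voltage regardless of the values chosen for $R$ and $C$.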


Figure 5.4: The definition of noise margins.


Figure 5.5: Voltage transfer curves (VTC) for the three outputs X, Y and W and the derivative of the VTC for output X.

Exercise 5.7: Task tests understanding of inverter R,C and delay calculation. Solution on page 67.
a) What do we mean by an ideal inverter concerning its parasitic output capacitance?
b) Calculate the propagation delay of an ideal inverter driving an identical inverter! Assume the following MOSFET data: n-channel MOSFETs can sink $500 \mu \mathrm{~A} / \mu \mathrm{m}$ of channel width at $V_{\mathrm{DD}}=1 \mathrm{~V}$, and their input capacitances are $1.3 \mathrm{fF} / \mu \mathrm{m}$. The p-channel MOSFET is made twice as wide as the n-channel device to obtain the same driving capability.

Exercise 5.8: Task tests understanding fanout-of-four delay. Exam 2012-10-26 Problem 2. Solution on page 68.
a) Calculate the FO4 delay of a $0.35 \mu \mathrm{~m}$ CMOS process with $V_{\mathrm{DD}}=3.3 \mathrm{~V}$ if the effective resistance in the timing model, $R_{\text {eff }}$, is $6 \mathrm{k} \Omega \mu \mathrm{m}$ and the inverter input capacitance is $6 \mathrm{fF} / \mu \mathrm{m}$. Assume $p_{\mathrm{inv}}=1$. (3 p)
b) What is the FO4 delay in a 65 nm process if we assume $V_{\mathrm{DD}}=1.2 \mathrm{~V}, R_{\mathrm{eff}}=2 \mathrm{k} \Omega \mu \mathrm{m}, C_{\mathrm{G}}=4.5 \mathrm{fF} / \mu \mathrm{m}$, and $p_{\text {inv }}=1 / 3$ ?

Exercise 5.9: Task tests understanding of inverter R,C and delay calculation. Solution on page 67.
a) Assume that we, for simplicity, introduce a modified effective resistance $R^{\prime}=R \ln (2)$, how large would this resistance be for the MOSFET in exercise 5.8?
b) How does the use of $R^{\prime}$ modify our delay model?

Exercise 5.10: Task tests understanding of inverter operation and inverter switching voltage. Exam 2013-08-26 Problem 2. Slightly modified. Solution on page 68.


Figure 5.6: Two CMOS inverters. (A) is a regular CMOS inverter; (B) is a pseudo-NMOS inverter intended for use as an amplifier. To the right is a diagram showing the MOSFET regions of operation in CMOS.

In Figure 5.6 you see the circuit diagram for two CMOS inverters, (A) and (B). Inverter (A) is an ordinary CMOS inverter that switches at an input voltage of $V_{\mathrm{DD}} / 2$. Inverter (B) is a pseudo-NMOS inverter intended for use as an amplifier. Its load p-channel MOSFET, M2, is biased at an unknown gate voltage $V_{\mathrm{B}}$ determined by a current mirror. A current mirror takes current $I_{\mathrm{B}}$ from a constant-current source and mirrors it to the inverter. Except for the biasing arrangement, the two CMOS inverters are identical. That is, transistors M1 and M2, respectively, are the same MOSFETs in both inverters. The rightmost diagram in the figure above shows the MOSFET regions of operation in CMOS.
a) Relate the two current gain factors $k_{1}$ and $k_{2}$ of MOSFETs M1 and M2 to each other, considering that inverter (A) flips at $V_{\mathrm{DD}} / 2$ and assuming symmetrical threshold voltages, $V_{\mathrm{TN}}=-V_{\mathrm{TP}}=V_{\mathrm{DD}} / 5$.
b) Assuming $V_{\mathrm{B}}=0.6 V_{\mathrm{DD}}$, what is the switching voltage of the pseudo-NMOS inverter (B)?
c) Calculate current $I_{\mathrm{B}}$ if $V_{\mathrm{DD}}=1.2 \mathrm{~V}$, and $k_{2}=600 \mathrm{~mA} / \mathrm{V}^{2}$ !
d) For what output voltage range are both MOSFET devices saturated in inverter (A)? Refer to the right-hand diagram showing the CMOS regions of MOSFET operation!
e) For what output voltage range are both devices saturated in the pseudo-NMOS inverter? Refer to the right-hand diagram showing the CMOS regions of MOSFET operation!

### 5.3 Tapered buffers etc.

Exercise 5.11: Task tests understanding of inverter delay calculation and minimization. Solution on page 68.

## Missing solution

Four is something of a magic number: if the number of loading inverters becomes much larger than four, it is often more efficient to insert an extra inverter with a better driving capability as a buffer between the original inverter and the capacitive load.
a) What driving capability should the inserted buffer inverter have to minimize the delay?
b) For what number of loading inverters does the inserted buffer shorten the propagation delay?
c) How does the parasitic output capacitance influence these critical numbers?

Exercise 5.12: Task tests understanding of inverter delay calculation and minimization. Solution on page 69.

For how big a capacitive load would the insertion of a non-inverting, two-inverter buffer give the shortest propagation delay?

Exercise 5.13: Task tests understanding of inverter delay calculation and minimization. Solution on page 69.
a) Determine the number of buffer inverters needed to minimize the delay if the load capacitance is 1000 times larger than the inverter input capacitance?
b) What would be the optimum tapering factor?
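
The trade-off between the number of stages and the tapering factor can be explored with a short sweep. This sketch assumes the usual normalized model in which a chain of $N$ inverters with equal stage effort $f=F^{1 / N}$ has delay $N\left(f+p_{\mathrm{inv}}\right)$ in units of $\tau$, where $F$ is the ratio of load capacitance to the input capacitance of the first inverter. Here $N$ counts every inverter in the chain, including the first one; the function names and the two $p_{\mathrm{inv}}$ values are ours.

```python
def chain_delay(n_stages, fanout_total, p_inv):
    """Normalized delay (units of tau) of n_stages inverters with equal stage effort."""
    stage_effort = fanout_total ** (1.0 / n_stages)
    return n_stages * (stage_effort + p_inv)

def best_stage_count(fanout_total, p_inv, max_stages=20):
    """Stage count between 1 and max_stages that gives the smallest chain delay."""
    return min(range(1, max_stages + 1),
               key=lambda n: chain_delay(n, fanout_total, p_inv))

# Sweep for a load 1000 times the input capacitance, without and with parasitics.
for p_inv in (0.0, 1.0):
    n = best_stage_count(1000, p_inv)
    print(p_inv, n, 1000 ** (1.0 / n), chain_delay(n, 1000, p_inv))
```

The sweep shows that including the parasitic delay favours fewer stages and a larger tapering factor. The minimum is shallow, so neighbouring stage counts give nearly the same delay.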

Exercise 5.14: Task tests understanding of how to design a tapered buffer. Exam 2015-01-09 Problem X. Solution on page 69.

In a chip you are responsible for designing a driver for an output pad. The capacitance of the pad is (around) $1024 C$, where $C$ is the input capacitance of the minimum inverter in the process. The setup is shown in Figure 5.7. The polarity of the output signal is not important in this particular case. Assume that the parasitic output capacitance is half of the inverter input capacitance (that is, $p_{\text {inv }}=0.5$). Tau in this particular process is 4 ps. (Tasks c) and d) of this exam problem will appear when we get to power and energy.)


Figure 5.7: The setup for the driver that you are to design.
a) If minimum delay is the main design goal, how many inverters do you choose to use in the box with the question mark? How would you size them? Draw a figure of your entire inverter chain with the inverter sizes clearly marked. Motivate your design choices, but proofs are not required.
b) For your design what is the delay?

## BONUS QUESTION

e) What if in a similar design situation as in task a) the polarity at the output were of importance and the obvious design choice was an odd number of inverters? Would you add one more inverter or would you remove one? Discuss your considerations.

## Chapter 6

## Delay for complex gates and paths

### 6.1 Gate delay

Exercise 6.1: Task tests understanding of calculating logical effort, $g$, and parasitic delay, $p$, for simple gates. Solution on page 69.

Imagine that you have started working with a CMOS process where for the pMOS transistors the maximum saturation current is only $1 / 3$ of that for an nMOS transistor of the same width (as usual we assume that we use the minimum length for all transistors).
a) Find the parasitic delay, $p$, and the logical effort, $g$, for a 2-input NAND and a 2-input NOR gate in this process. Assume that $p$ for the inverter (sometimes called $p_{\text {inv }}$) is 0.5.
b) Imagine that the NAND2 gate in this process, which you analysed in task a), is connected to one 3-to-1 scaled inverter at its output to create an AND gate. If the pMOS transistors in the inverter gate are twice as wide as the ones in the NAND2 gate, what is the normalized delay of the NAND2 gate?
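
The way logical effort depends on the pMOS/nMOS strength ratio can be sketched generically. The code below assumes the standard sizing argument: the reference inverter has nMOS width 1 and pMOS width $\mu$, and each gate is sized so that its worst-case pull-up and pull-down resistances match the inverter's. The parameter name `mu` and the function names are ours; the printed example uses the common textbook ratio $\mu=2$, not the ratio of this exercise.

```python
from fractions import Fraction

def nand_logical_effort(n_inputs, mu):
    """g of an n-input NAND: series nMOS of width n, parallel pMOS of width mu."""
    return Fraction(n_inputs + mu, 1 + mu)

def nor_logical_effort(n_inputs, mu):
    """g of an n-input NOR: parallel nMOS of width 1, series pMOS of width n*mu."""
    return Fraction(1 + n_inputs * mu, 1 + mu)

# Textbook case mu = 2 for two-input gates.
print(nand_logical_effort(2, 2), nor_logical_effort(2, 2))
```

Calling the same functions with the ratio given in the exercise shows how a weaker pMOS transistor shifts the balance between NAND and NOR gates.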

Exercise 6.2: Task tests understanding of calculating logical effort, $g$, and parasitic delay $p$, for complex gates. Solution on page 71.

In Figure 6.1 you see two complex gates that we have found in the textbook. Again the task is to calculate the parasitic delay, $p$, and the logical effort, $g$. Note that in these gates the logical effort will not be the same for all the inputs, so you have to calculate one $g$ value per input. In contrast, the parasitic delay, $p$, is related to the output, so there cannot be more than one value for $p$ for one specific circuit topology. As usual, assume that the n-transistor current is twice the p-transistor current for the same transistor width.

Exercise 6.3: Task tests understanding of transistor scaling for same worst-case resistance. Exam 2016-12-22 Task 1(c). Solution on page 72.

Refer to Figure 3.2. What if your task was to ensure that the worst-case resistance is the same for the p-net and the n-net for all four logic functions, Y1 through Y4? How would you size the transistors then? Assume that the drive strength of an nMOS transistor is twice that of a pMOS transistor with the same width. There are multiple solutions; you only have to give one.

### 6.2 Path delay

Exercise 6.4: Problem tests understanding of how to calculate and optimize the path delay. Solution on page 72.


(a) The first complex gate: textbook Figure 1.19, a CMOS compound gate for the function $Y=\overline{(A+B+C) \cdot D}$.

(b) The second complex gate, annotated with

$$
\begin{aligned}
\mathrm{G}_{2:1} & =\mathrm{G}_{2}+\bar{K}_{2} \mathrm{G}_{1} \\
& =\mathrm{A}_{2} \mathrm{~B}_{2}+\left(\mathrm{A}_{2}+\mathrm{B}_{2}\right) \mathrm{A}_{1} \mathrm{~B}_{1}
\end{aligned}
$$

Figure 6.1: Two complex gates for which to calculate logical effort and parasitic delay. Both are taken from the textbook.
a) In Figure 6.2 you see a path through a circuit made up of NAND and NOR gates. Add the missing data for the 3-input NAND gate and calculate the delay from A to B. In this process $p_{\text {inv }}$ is 1.
b) Again consider the path shown in Figure 6.2. Now your task is to find the optimal sizes for the 3-input NAND and 2-input NOR gates to minimize the delay. What are the sizes and what is the delay when these sizes are applied?


Figure 6.2: A path of NAND and NOR gates for which to calculate the path delay.

Exercise 6.5: These three tasks test understanding of logical effort and path effort, and delay minimization. Exam 2016-12-22 Problem 3. Solution on page 73.

In Figure 6.3 you see a block diagram schematic of a register file with 16 32-bit words and a 4-to-16 decoder that selects one of the 16 registers according to the address A[3:0].

In this problem your task is to size the decoder circuitry shown in detail in Figure 6.4.
a) How should the 4-input NAND gate (labelled y in Figure 6.4) and inverter (labelled z in Figure 6.4) be sized for minimum delay with the assumptions given in Figure 6.4? Assume that an nMOS transistor has twice the current of a pMOS transistor of the same width.
b) What is the resulting delay, including parasitic delays, with the sizing from your result in task a)? Assume that the inverter's output capacitance is the same as its input capacitance.
c) What if we had a wider register file of 16 64-bit words? Would it be faster to use two inverters in place of inverter z? (One would also have to invert the address bits of course, but that could easily be achieved by swapping the lines for each address bit and its inverse.) Motivate your reply.


Figure 6.3: Block diagram of 4-to-16 decoder for addressing register file.

Each address-line inverter has input capacitance 10C


Figure 6.4: Detail of 4-to-16 decoder for addressing register file.

## Chapter 7

## Wire delay

Exercise 7.1: Task tests understanding of wire approximation. Solution on page 76.


Figure 7.1: A two-node RC system.

## Solution to be added for b and c

Shown in Figure 7.1 is a simplified version of the two-pole network from the book chapter.
a) Write down the two nodal equations for $V_{1}(t)$ and $V_{2}(t)$.
b) Find a method to convert these two nodal equations into a second-order linear differential equation for $V_{1}(t)$ and $V_{2}(t)$.
c) Use the characteristic equation you found in task b), and identify $(s+a)(s+b)=s^{2}+(a+b) s+a b=0$ to find a simple way of determining the dominating time constant $1 / a$ if $1 / a \gg 1 / b$.
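
For task c), one way to see the approximation is to divide the characteristic equation by $a b$ and rewrite it in terms of the two time constants. This is a sketch of the algebra only, not the official solution, and the symbols $\tau_{a}=1 / a$ and $\tau_{b}=1 / b$ are introduced here for convenience:

$$
\frac{s^{2}}{a b}+\left(\frac{1}{a}+\frac{1}{b}\right) s+1=\tau_{a} \tau_{b} s^{2}+\left(\tau_{a}+\tau_{b}\right) s+1=0
$$

When $\tau_{a} \gg \tau_{b}$, the coefficient of $s$ is approximately $\tau_{a}$. The dominating time constant can therefore be read off from the first-order coefficient of the normalized polynomial, without solving for the roots.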

Exercise 7.2: Task tests ability to calculate wire parameters. Solution on page 74.

In a certain CMOS process the wire sheet resistance is $0.2 \Omega / \square$ and the wire capacitance is $0.4 \mathrm{fF} / \mu \mathrm{m}^{2}$.
a) For a 200 nm wide wire calculate the resistance and capacitance for a wire that is $25 \mu \mathrm{~m}$ long.
b) Calculate the critical wire length for a wire that is 100 nm wide when the wire is driven by an inverter (that is, a repeater) with time constant $t_{\text {rep }}=4.6 \mathrm{ps}$.
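
The wire parameters follow directly from the geometry. As a minimal sketch, the helpers below assume that the resistance is the sheet resistance times the number of squares (length divided by width) and that the capacitance is the area term plus an optional per-length fringe term; the fringe argument is included because the next exercise specifies such a term. Function and parameter names are ours.

```python
def wire_resistance(sheet_res_ohm_sq, length_um, width_um):
    """Wire resistance in ohms: sheet resistance times number of squares."""
    return sheet_res_ohm_sq * length_um / width_um

def wire_capacitance(cap_area_ff_um2, length_um, width_um, cap_fringe_ff_um=0.0):
    """Wire capacitance in fF: bottom-plate (area) term plus optional fringe term."""
    return (cap_area_ff_um2 * width_um + cap_fringe_ff_um) * length_um

# Task a): 0.2 ohm/square, 0.4 fF/um^2, 25 um long, 0.2 um wide.
print(wire_resistance(0.2, 25.0, 0.2))
print(wire_capacitance(0.4, 25.0, 0.2))
```

Keeping all lengths in micrometres and all capacitances in femtofarads avoids unit conversions; values given in aF must be divided by 1000 before being passed in.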

Exercise 7.3: Task tests ability to calculate wire parameters. Solution on page 75.

In another CMOS process, the wire fringing field capacitance along the wire sidewalls cannot be neglected. This capacitance is $35 \mathrm{aF} / \mu \mathrm{m}$ (including both sidewalls). The bottom-plate capacitance is $30 \mathrm{aF} / \mu \mathrm{m}^{2}$. The wire sheet resistance is $0.10 \Omega / \square$.
a) Calculate the wire resistance and wire capacitance for a wire that is 10 mm long and $1 \mu \mathrm{~m}$ wide.
b) Calculate the delay from the input of a driver inverter to the input of an identical receiver inverter at the other end of the 10 mm wire from task a), if the inverter can deliver $600 \mu \mathrm{~A}$ at $V_{\mathrm{DD}}=1.2 \mathrm{~V}$ and its input and output capacitances are both 3.25 fF.
c) For this process and wire width, what is the critical wire length?

Exercise 7.4: Task tests understanding inverter and wire delay calculations. Adapted from exam problem from 2008. Solution on page 75.

An inverter is driving another identical inverter across a rather long RC wire. The inverter input and output capacitances are both $C$, and its resistance is $R$. The wire resistance is $4R$ and the wire capacitance is $8C$.
a) Draw a circuit diagram for the two inverters and the wire using a suitable model for the wire.
b) Use the diagram to calculate the wire RC delay from the output of the driver inverter to the input of the receiver inverter.
c) What if the receiver inverter were a NAND2 gate with input capacitance $2C$? What would the change in delay be then?
d) What if, instead, the driver inverter were a NAND2 gate with input capacitance $2C$? How would the delay change then?
e) What if, instead, a branch wire ($R$, $2C$) were added at the midpoint of the wire? How much would the delay increase then?
f) If the inverters in the original setup (task b) were properly scaled for minimum delay, what would be the resulting delay then?
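
A circuit diagram like the one asked for in task a) can be turned into a delay estimate with the Elmore sum. The sketch below assumes a single $\pi$-segment for the wire (half the wire capacitance at each end) and works in normalized units where $R=1$ and $C=1$; it returns the Elmore time constant including the driver resistance. Whether the course model multiplies this by $\ln (2)$, or counts the driver's share separately, is not settled here. The function name and ladder encoding are ours.

```python
def elmore_delay(ladder):
    """Elmore time constant of an RC ladder.

    ladder is a list of (series_resistance, node_capacitance) pairs ordered
    from the driver towards the far end; each capacitance is charged through
    the sum of all resistances upstream of it.
    """
    delay = 0.0
    r_upstream = 0.0
    for r, c in ladder:
        r_upstream += r
        delay += r_upstream * c
    return delay

# Normalized units (R = 1, C = 1):
# node 1: driver resistance R, driver output C plus near half of the wire (4C)
# node 2: wire resistance 4R, far half of the wire (4C) plus receiver input C
ladder = [(1.0, 1.0 + 4.0), (4.0, 4.0 + 1.0)]
print(elmore_delay(ladder))
```

Variations such as a heavier receiver load can be tried by changing the capacitance of the last node. A branch off the midpoint needs the wire split into two segments, with the branch capacitance added at the midpoint node.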

Exercise 7.5: Task tests calculating wire and gate delay, logical effort. Exam 2016-08-22 Problem 3. Solution on page 76.

Figure 7.2 shows part of a clock-distribution network that comprises an inverter that acts as the clock driver, some wiring, and three identical NAND gates that act as clock gaters.


Figure 7.2: Clock-distribution network with a driver driving three receivers over different wires.
a) Calculate the clock skews at the inputs of all three clock gaters: $\mathrm{A}, \mathrm{B}$, and C . The clock driver has a driver resistance $R$ and an input capacitance $C$. The identical NAND2 gates all have an input capacitance of 2C. (6 p)
b) Calculate the clock skews at the outputs of the three NAND2 gates: A, B and C.

Exercise 7.6: Task tests calculating wire delay and sizing path to drive wire. Exam 2015-10-29 Problem 4. Solution on page 77.

Figure 7.3 shows part of a clock-gated network for distributing a system clock on a chip. The largest available inverter in the cell library ( $\mathrm{X} 200, C_{\mathrm{IN}}=72 \mathrm{fF}, R=100 \Omega$ ) was chosen to drive the wire network.


Figure 7.3: Clock-network for distribution of clock on chip.
a) Calculate the delay from the input of the X200 inverter (point B) to the input of one of the X50 receivers (point C). You may assume that the input of the X200 inverter is driven by an infinitely strong driver. (4 p)
b) There is a two-inverter buffer inserted between the clock-gating NAND gate and the X200 inverter. Design this buffer - that is, determine the sizes and/or input capacitances of the two buffer inverters for minimum delay from the CLK input of the NAND gate (point A) to the X200 inverter input (point B).
c) Calculate the resulting delay from the CLK input of the NAND gate (point A) to the X200 inverter input (B). (3 p)

Exercise 7.7: Task tests calculating wire parameters and resulting delays. Exam 2014-08-25 Problem 2. Solution on page 78.

In Figure 7.4 you see the general layout of a static random-access memory (SRAM). The word lines (WL) select the particular word that is to be read or written. The bit lines (BL) carry the bit values out when reading and supply the bit values that are to be written when writing. The bit lines are routed in metal 1 (blue) and the word lines are routed in metal 2 (purple). In each memory cell the word line is connected to two minimum-size nMOS transistors for accessing that particular memory cell.
A word line is $0.1 \mu \mathrm{~m}$ wide, which is the minimum width for an M2 wire in this particular process. A word line has a capacitance of $0.1 \mathrm{fF} / \mu \mathrm{m}$ to ground and an inter-wire capacitance of $0.02 \mathrm{fF} / \mu \mathrm{m}$ to one adjacent word line. The M2 layer has a resistance of $0.1 \Omega / \square$. A minimum-size nMOS transistor has a gate capacitance $C_{\mathrm{g}}=0.1 \mathrm{fF}$. $V_{\mathrm{DD}}$ is 1 V.
a) Calculate the resistance for one WL.
b) Calculate the total capacitance for one WL including the capacitance of the access transistors.
c) Draw a circuit diagram for the WL and the inverter that is driving it. Assume the driver is an inverter with the same equivalent resistance as the WL resistance. Calculate the delay. For simplicity neglect the parasitic capacitance of the driver.
d) Calculate the energy required for accessing the memory when reading the memory once.
e) Estimate what would happen to the delay and energy computed in c) and d) if one could make the memory cells half as high and half as wide. Assume that the inter-wire capacitances are pure plate capacitances and the driver is re-sized so that its equivalent resistance still is the same as that of the wire. Reflect on the result.


Figure 7.4: General layout of SRAM memory bank with 256 128-bit words. The word lines (WL) run horizontally in metal-2 (purple) and the bit lines (BL) vertically in metal-1 (blue).

Exercise 7.8: Task tests calculating wire delay and sizing inverter to drive wire. Related to lab 4. Exam 2016-12-22 Problem 5. Solution on page 79.

In Figure 7.5 you see a driver inverter loaded by four identical receiver inverters across an H-tree wire interconnect.
a) Calculate the FO4 delay for the driver inverter when loaded as shown in Figure 7.5.
b) Determine the inverter resistance, $R_{\text {eff }}$, that minimizes the FO4 delay as calculated in a). You may assume that the inverter output capacitance, $C_{\mathrm{D}}$, is equal to the inverter input capacitance, $C_{\mathrm{G}}$.


Figure 7.5: A driver inverter loaded by four identical receiver inverters across an H-tree wire interconnect.

Exercise 7.9: Problem tests calculating gate and wire delay, repeater insertion with suboptimal solution and energy calculations. Exam 2021-01-07 Problem 5. The problem had different data for different students. The students were allowed to use all sources. Solution on page 80.


Figure 7.6: A long wire with inserted repeaters. Note that the size of the driver, the receiver and all the repeaters is the same.

You have a rather long wire on a chip and want to investigate whether you need any repeaters for this wire (as shown in Fig. 7.6). You want the delay to be short, but on the other hand you do not want to waste a lot of energy in the repeaters.

You are working in our usual 65-nm process, which has $R C=7.2 \mathrm{ps}$ (corresponding to $\tau=0.7 R C=5 \mathrm{ps}$). For simplicity, assume $p_{\mathrm{inv}}=1$.

The data for your wire are: sheet resistance: $R_{S H}=0.15 \Omega / \square$, wire capacitance per length: $c=0.4 \mathrm{fF} / \mu \mathrm{m}$, wire width: $W=1 \mu \mathrm{~m}$, wire length: $L=5550 \mu \mathrm{~m}$
a) Calculate the resistance and capacitance for the wire, $R_{W}$ and $C_{W}$.
b) Using the values from task a), calculate the wire effort and the optimal Elmore delay.
c) Calculate the optimal resistance for the repeaters that you need to use to get the optimal Elmore delay. Also calculate the corresponding gate capacitance of these optimal repeaters.
d) Assume that you use the optimal number of wire segments. How much energy will be consumed in the repeaters (including the output capacitance of the driver and the input capacitance of the receiver) and how much due to the wire itself when you switch the wire with repeaters on and off once?
e) Investigate a sub-optimal repeater insertion with fewer wire segments (between 2 less and half the optimal number of segments - your choice!). Assume that you use the same type of repeaters as you found in c). For this sub-optimal solution, what is the Elmore delay and what is the energy for the repeaters? Compare with the results in tasks b) and d). What is your conclusion about what to do for your wire?
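
As a starting point for task a), the lumped wire parameters follow directly from the data given above: the resistance is the sheet resistance times the number of squares, $L/W$, and the capacitance is the capacitance per length times $L$. The Python sketch below only illustrates this arithmetic; the function and variable names are our own and are not part of the original exercise.

```python
# Minimal sketch for Exercise 7.9 a): lumped wire resistance and capacitance.
# All names are illustrative; the numbers are the wire data given in the exercise.

R_SHEET_OHM_PER_SQ = 0.15   # sheet resistance, ohm per square
C_PER_UM_F = 0.4e-15        # wire capacitance per length, F per micrometre
WIDTH_UM = 1.0              # wire width, micrometres
LENGTH_UM = 5550.0          # wire length, micrometres


def wire_resistance(r_sheet, length_um, width_um):
    """R_W = R_SH * (L / W), i.e. sheet resistance times number of squares."""
    return r_sheet * length_um / width_um


def wire_capacitance(c_per_um, length_um):
    """C_W = c * L."""
    return c_per_um * length_um


r_w = wire_resistance(R_SHEET_OHM_PER_SQ, LENGTH_UM, WIDTH_UM)
c_w = wire_capacitance(C_PER_UM_F, LENGTH_UM)

print(f"R_W = {r_w:.1f} ohm")
print(f"C_W = {c_w * 1e15:.0f} fF")
print(f"R_W * C_W = {r_w * c_w * 1e12:.0f} ps")
```

The product $R_W C_W$ is printed only as a convenient intermediate quantity for the later tasks.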

## Chapter 8

## Layout

Exercise 8.1: Problem tests the understanding of continuous-line-of-diffusion layout. Solution on page 81.
a) Using Euler paths, determine if it is possible to lay out the gate shown in Figure 6.1 a) with a single-line-of-diffusion approach. If it is possible, determine one order of the input signals in the layout that will work.
b) Using Euler paths, determine if it is possible to lay out the gate shown in Figure 6.1 b) with a single-line-of-diffusion approach. Assume that the repeated signals, A2 and B2, will also be repeated in the diffusion line. If it is possible, determine one order of the input signals that will work.
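
The first step in tasks like these is to check whether a transistor network has an Euler path at all. A connected network has one exactly when zero or two of its nodes touch an odd number of transistors. The Python sketch below checks only this necessary condition for a single network. It does not check that the pull-up and pull-down networks share the same input ordering, which a real single-line-of-diffusion layout also requires. The example network is a hypothetical AOI21 pull-down network, not the gate in Figure 6.1.

```python
from collections import defaultdict


def has_euler_path(transistors):
    """Return True if the network has an Euler path.

    transistors: list of (node_a, node_b, gate_label) tuples, one per
    transistor, where node_a and node_b are the source/drain nodes.
    """
    degree = defaultdict(int)
    neighbours = defaultdict(set)
    for node_a, node_b, _gate in transistors:
        degree[node_a] += 1
        degree[node_b] += 1
        neighbours[node_a].add(node_b)
        neighbours[node_b].add(node_a)

    if not degree:
        return True  # an empty network is trivially fine

    # The network must be connected: depth-first search from any node.
    start = next(iter(degree))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if len(seen) != len(degree):
        return False

    # Euler path condition: zero or two nodes of odd degree.
    odd_nodes = sum(1 for d in degree.values() if d % 2 == 1)
    return odd_nodes in (0, 2)


# Hypothetical pull-down network for Y = NOT(A*B + C).
aoi21_pull_down = [("Y", "n1", "A"), ("n1", "GND", "B"), ("Y", "GND", "C")]
print(has_euler_path(aoi21_pull_down))

# Hypothetical counter-example: three transistors meeting in one node.
star = [("X", "P", "A"), ("X", "Q", "B"), ("X", "R", "C")]
print(has_euler_path(star))
```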

Exercise 8.2: Problem tests understanding of Euler paths, layout of gates with continuous-line-of-diffusion. Part of problem 1 in exam 2019-08-29. The rest of the problem is in problem 3.3. Solution on page 81.

Consider the static CMOS cell shown in Figure 8.1. The cell comprises one compound gate and one inverter and is to be implemented in our usual 65 nm process.


Figure 8.1: Circuit diagram for a 5-input cell comprising one compound cell and one inverter.
a) Determine if it is possible to use a continuous-line-of-diffusion layout for the compound gate with the transistor order shown in Figure 8.1. If it is possible to do so, give one order of the inputs that will work. (2 p)
b) Draw the layout of the cell in Figure 8.1 in one of the two supplied templates in Figure 8.2. The layout of the compound gate should match the schematic in Figure 8.1 exactly.


Figure 8.2: Two templates for drawing the layout in task 2 c). Select one of them for your solution.

Exercise 8.3: Problem tests understanding of layout, parasitic delay and logical effort of multi-stage gates. Adapted from problem 1 in exam 2016-08-22. Solution on page 82.


Figure 8.3: Template for 4-input AND gate.

For a 4-input NAND gate we have previously found that the logical efforts for all inputs are 2 and the parasitic delay is $4 \times p_{\mathrm{inv}}$. In this problem you will lay out and model a 4-input AND gate using such a NAND gate. You can assume that $p_{\mathrm{inv}}=1$.
a) In the cell layout template in Figure 8.3 you see the layout of one inverter to the right. To the left of that inverter is a continuous-line-of-diffusion template. Draw the layout for the 4-input NAND gate there; also connect the output of the NAND gate to the inverter input, thus forming a 4-input AND gate. Draw the layout such that you minimize the number of diffusion areas connected to the output node of the NAND gate.
b) The parasitic delay, $p$, of a static CMOS gate is due to the capacitances of the diffusion areas connected to the gate output. Find the value for $p_{\mathrm{NAND4}}$ for your layout from task a). Assume that the capacitance of a diffusion area of a particular width is the same if it is shared between two transistors as if it is not shared. (2p)
c) For the 4-input AND gate, formed by the inverter and the 4-input NAND gate, find the logical effort, $g_{\mathrm{AND4}}$ (the same for all four inputs), and the parasitic delay, $p_{\mathrm{AND4}}$, for the entire gate. Use the $p_{\mathrm{NAND4}}$ value from task b) and find the relative transistor widths from the layout.

Exercise 8.4: Problem tests going from schematic to layout and from layout to schematic and logical function. Exam 2015-08-24 Task 1. Solution on page 83.
a) Figure 8.4 a shows the circuit schematic for a compound gate that implements a 5-input logical function. Draw the corresponding layout in the template supplied in Figure 8.4 b. The layout should correspond exactly to the schematic - logical equivalence is not enough. Label all inputs and outputs. Indicate clearly any contacts to metal.


Figure 8.4: A 5-input static CMOS gate.
b) In Figure 8.5 is the layout for a compound static CMOS implementation of a 4-input logical function. Draw the circuit schematic for the gate and find the Boolean expression for the logical function.

Exercise 8.5: Problem tests going from layout to schematic and identifying the logical function. Exam 2015-01-05 Problem 1. Solution on page 84.

Figure 8.6 shows the layout of another four-input standard cell.
a) Draw the corresponding transistor diagram. Make sure that your transistor diagram matches the layout exactly!
b) Identify the logical function that the layout implements.


Figure 8.5: The layout for a 4-input static CMOS gate.


Figure 8.6: The layout for another 4-input standard cell.

Exercise 8.6: Task tests going from schematic to layout. Exam 2016-12-22 Task 1(b). Solution on page 84.

Draw the layout for the gate schematics in Figure 8.7a in the template provided in Fig. 8.7b. For simplicity we have assumed that all transistors have the same width although that may not be a good sizing. Hint: Remember that diffusion can be used to route $V_{\mathrm{DD}}$ or ground short distances.

Exercise 8.7: Task tests understanding of what LVS does and reading layout. Exam 2013-10-22 Task 1. Solution on page 85.

If we are confused by the error messages from the LVS tool, how should we go about finding the discrepancies between the layout and the schematic entry? In other words, find the errors in the layout shown in Figure 8.8a for an AO22 gate. There are four discrepancies to detect!

(a) A schematic of the gate taken from a patent application. $V_{DD}$ is at the top of the schematic although this is not clear from the notation.

(b) A layout template in which to lay out the gate.

Figure 8.7: A static CMOS gate with four logical-function outputs.

(a) Layout of the AO22 gate

Figure 8.8: An LVS problem - where in the layout are the discrepancies between the layout and the schematic?

## Chapter 9

## Sequential circuits

Exercise 9.1: Problem tests understanding of delay for ripple-carry adder together with flip-flops. Solution on page 86.

You are to use an adder made up of your ripple-carry cell from prelab 2 and a sum cell in a pipelined processor. For simplicity we do not analyze the sum cell in detail; rather we assume the sum cell has a propagation delay of 60 ps. Also assume that the sum cell is of the type that takes the generated $C_{\text {out }}$ from the same bit as one of its inputs, in addition to the input data bits $A$ and $B$ and $C_{\text {in }}$. (That is the sum cell shown in Figure 11.4 in Weste \& Harris.)

Reminder in case you do not have it handy: The delay of the 8-bit ripple circuit in prelab 2 was around 340 ps with $p_{\text {inv }}=0.8$. You have available flip-flops with the characteristics given in Table 9.1.

Table 9.1: Flip-flop timing characteristics

| Flip-flop timing parameter | Value $[\mathrm{ps}]$ |
| :--- | :---: |
| Setup time, $t_{\text {setup }}$ | 50 |
| clk-to-Q propagation delay, $t_{\mathrm{pcq}}$ | 50 |
| clk-to-Q contamination delay, $t_{\mathrm{ccq}}$ | 35 |
| Hold time, $t_{\text {hold }}$ | 10 |

a) Assuming the adder is the combinational circuit limiting the speed of your entire processor, what is the highest clock frequency with which you can clock it, if your adder has 16 bits?
b) Again assuming the adder is the combinational circuit limiting the delay of the processor, what is the highest clock frequency with which you can clock it, if your adder has 64 bits?
c) What if there is clock skew in your system? Assuming the maximum clock skew is 75 ps between any two flip-flops, how will the results in tasks a) and b) change?
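
All three tasks rest on the setup-time constraint, $T_c \geq t_{\mathrm{pcq}} + t_{\mathrm{pd}} + t_{\mathrm{setup}} + t_{\mathrm{skew}}$. The Python sketch below evaluates it with the flip-flop data from Table 9.1. The logic delay used in the example is the 8-bit ripple delay of around 340 ps quoted in the reminder above, taken on its own purely as an illustration; it is not the delay of the 16-bit or 64-bit adder asked for, which you need to work out yourself. The function names are our own.

```python
# Sketch of the setup-time constraint with the flip-flop data from Table 9.1.
# Names are illustrative; the example logic delay is NOT the answer to a) or b).

T_SETUP_PS = 50.0  # setup time, Table 9.1
T_PCQ_PS = 50.0    # clk-to-Q propagation delay, Table 9.1


def min_clock_period_ps(t_pd_logic_ps, t_skew_ps=0.0):
    """Smallest clock period that avoids a setup violation."""
    return T_PCQ_PS + t_pd_logic_ps + T_SETUP_PS + t_skew_ps


def max_clock_frequency_ghz(t_pd_logic_ps, t_skew_ps=0.0):
    """Corresponding maximum clock frequency (1000 ps per ns)."""
    return 1000.0 / min_clock_period_ps(t_pd_logic_ps, t_skew_ps)


example_logic_delay_ps = 340.0  # 8-bit ripple circuit from prelab 2 (illustration)

for skew_ps in (0.0, 75.0):
    period = min_clock_period_ps(example_logic_delay_ps, skew_ps)
    freq = max_clock_frequency_ghz(example_logic_delay_ps, skew_ps)
    print(f"skew = {skew_ps:4.0f} ps: T_c >= {period:.0f} ps, f_max = {freq:.2f} GHz")
```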

Exercise 9.2: Problem tests understanding of delay for ripple-carry adder together with flip-flop hold violations. Solution on page 86.

Return to the setup in the previous exercise. Go back to the ripple-carry adder design based on the carry chain you implemented in lab 2. Also assume that the sum cell has a contamination delay of 30 ps.
a) Estimate the contamination delay of the 16-bit ripple-carry adder designed from your ripple-carry chain. If necessary, you may assume that the inverses are also available from the flip-flops preceding the adder.
b) With these contamination delays, the flip-flop data in Table 9.1, the clock frequency calculated in Exercise 9.1 a), and no clock skew, determine if you have to worry about hold violations.
c) What if you have a clock skew of maximum 75 ps?

Exercise 9.3: Problem tests understanding of setup and hold violations and mitigation techniques. It is problem 4 from exam 2019-08-26. Solution on page 86.


Figure 9.1: Charlie's circuit which experiences a hold violation.

Your co-worker Charlie has designed the circuit shown in Figure 9.1. The design uses flip-flops with the timing characteristics shown in Table 9.2. All logic gates have the same timing characteristics: the propagation delay, $t_{\mathrm{pd}}$, is 40 ps and the contamination delay, $t_{\mathrm{cd}}$, is 25 ps.

Table 9.2: Flip-flop timing characteristics for Charlie's flip-flops

| Flip-flop timing parameter | Value $[\mathrm{ps}]$ |
| :--- | :---: |
| Setup time, $t_{\text {setup }}$ | 50 |
| clk-to-Q propagation delay, $t_{\mathrm{pcq}}$ | 80 |
| clk-to-Q contamination delay, $t_{\mathrm{ccq}}$ | 30 |
| Hold time, $t_{\text {hold }}$ | 60 |



Figure 9.2: Two proposed solutions to solve Charlie's hold-violation problem. (a) Alyssa's proposal. (b) Ben's proposal.
a) Charlie has determined that the circuit in Figure 9.1 will experience hold violations. Verify that conclusion by calculating the requirement for the hold violation.
b) Determine the maximum clock frequency at which Charlie's circuit can be run before it experiences a setup violation.
c) Your other co-workers Alyssa and Ben have each proposed a solution to Charlie's hold-violation problem. Their proposals are shown in Figure 9.2. The buffers they have added consist of two inverters. A buffer has the same delays as the other logic gates used in Charlie's circuit. It is now your task to determine if either of them, or both, will work to avoid the hold violation. If both solutions will work, which one is preferable? Motivate!
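
The hold-time constraint behind all three tasks is $t_{\mathrm{ccq}} + t_{\mathrm{cd}} \geq t_{\mathrm{hold}} + t_{\mathrm{skew}}$. The Python sketch below uses the data from Table 9.2 and the gate contamination delay of 25 ps to find how much contamination delay the logic between two flip-flops must provide, and how many gates in series that corresponds to. Since Figure 9.1 is not reproduced here, the sketch says nothing about which path in Charlie's circuit is the shortest; that remains for you to identify. The function names are our own.

```python
import math

# Sketch of the hold-time constraint with the data from Table 9.2.
T_CCQ_PS = 30.0      # clk-to-Q contamination delay
T_HOLD_PS = 60.0     # hold time
T_CD_GATE_PS = 25.0  # contamination delay of one logic gate (or buffer)


def required_logic_cd_ps(t_skew_ps=0.0):
    """Minimum contamination delay the logic must provide to avoid a hold violation."""
    return max(0.0, T_HOLD_PS + t_skew_ps - T_CCQ_PS)


def min_gates_in_shortest_path(t_skew_ps=0.0):
    """Smallest number of series gates whose total t_cd meets the requirement."""
    return math.ceil(required_logic_cd_ps(t_skew_ps) / T_CD_GATE_PS)


print(f"required logic t_cd: {required_logic_cd_ps():.0f} ps")
print(f"gates needed in the shortest path: {min_gates_in_shortest_path()}")
```

Any flip-flop-to-flip-flop path with fewer gates than this number is a candidate for a hold violation.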

Exercise 9.4: Problem tests understanding of delay for adders together with flip-flops and critical path. This problem is from the exam 2016-08-22. It has been modified slightly to fit the purpose in this chapter. Solution on page 87.

In this course we have designed many adders but no multipliers. In this problem you will investigate how to use adders to implement binary multiplication and the performance of such an approach.

In Figure 9.3 is an example of a 6-bit binary multiplication from the Weste and Harris textbook.


Figure 9.3: Example of a multiplication of two 6-bit binary numbers. From Weste and Harris textbook.

From the example it is clear that the partial products are just left-shifted versions of the multiplicand. Binary multiplication can thus be performed by repeatedly shifting the multiplicand to the left and adding it to the product. Figure 9.4 shows how a $2n$-bit adder can be used to perform binary multiplication of two $n$-bit binary numbers. To the left in the figure you see the datapath with an adder, two shifters and a register, and to the right the iterative control required to perform a multiplication.


Figure 9.4: An iterative multiplier that uses a $2n$-bit adder.
In this problem, your task is to investigate the performance of this iterative multiplication for different types of adders and number of bits, $n$.

In Table 9.3 are worst-case propagation delays for two types of adders, for 8 - and 16-bit additions.
As you know for ripple-carry adders the worst-case delay grows linearly with the number of bits, $n$. We have not yet dealt with prefix adders, but their delay grows linearly with $\log _{2}(n)$.

Table 9.3: Adder worst-case propagation delays

| Number of bits in adder $n$ | Ripple-carry adder $t_{\mathrm{pd}}(\mathrm{ps})$ | Prefix adder $t_{\mathrm{pd}}(\mathrm{ps})$ |
| :---: | :---: | :---: |
| 8 | 130 | 200 |
| 16 | 250 | 250 |
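
Because the table gives only two data points per adder type, delays for wider adders have to be extrapolated from the stated growth laws: linear in $n$ for the ripple-carry adder and linear in $\log_2(n)$ for the prefix adder. The Python sketch below fits a straight line through the two points of each column and evaluates it for some wider adders. It is only one way of doing the extrapolation; the helper names are our own.

```python
import math


def fit_line(x1, y1, x2, y2):
    """Return (slope, intercept) of the straight line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Two data points per adder type, taken from Table 9.3 (delays in ps).
RIPPLE = fit_line(8, 130.0, 16, 250.0)                        # linear in n
PREFIX = fit_line(math.log2(8), 200.0, math.log2(16), 250.0)  # linear in log2(n)


def ripple_delay_ps(n_bits):
    slope, intercept = RIPPLE
    return slope * n_bits + intercept


def prefix_delay_ps(n_bits):
    slope, intercept = PREFIX
    return slope * math.log2(n_bits) + intercept


for n_bits in (16, 32, 64):
    print(f"{n_bits:2d} bits: ripple {ripple_delay_ps(n_bits):.0f} ps, "
          f"prefix {prefix_delay_ps(n_bits):.0f} ps")
```

Keep in mind that an $n$-bit multiplication in Figure 9.4 uses a $2n$-bit adder, so the adder width to look up is twice the operand width.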

Assume that the ProductReg is made up of flip-flops with these characteristics: $t_{\mathrm{setup}}=20 \mathrm{ps}, t_{\mathrm{pcq}}=30 \mathrm{ps}$. For the shifter registers, assume there is a 30 ps delay from when the ShiftR and ShiftL signals are issued until the shifted output is available at their outputs.
For the control logic assume that each step takes one clock cycle. You may assume that the control signals that are the outputs from the control logic are perfectly synchronized with the clock.
a) Use the worst-case delay adder data in the table above to estimate the maximum clock frequency that can be used for the iterative 8-bit multiplier. With this clock frequency and assuming worst-case multiplier input data, how long would it take to complete one 8-bit multiplication?
b) What if we extend the iterative multiplier from task a) to multiply two 32-bit binary numbers? How will its worst-case delay change? Assume that you can generate wider versions of the two types of adders in the table above. Which type of adder would you select? Motivate! For the selected type of adder, estimate the maximum clock frequency with which one could clock the multiplier control logic and still ensure a correct result. How long would it then take to complete one 32-bit multiplication with the worst-case multiplier input data?

## BONUS QUESTION

c) The proposed multiplier is not that well designed. Suggest one substantial improvement that could be made to the datapath and estimate how much that improvement would increase the maximum clock frequency calculated in task b).
(4p)

Exercise 9.5: Problem tests understanding of critical path, setup and hold violations. It is problem 5 from exam 2016-10-29. Solution on page 88.

Assume that you are designing an adder for the minimalistic 3-bit ArmStrong processor. The adder is built from three full adders such that the carry-out signal of the first adder is the carry-in signal to the second adder and the carry out from the second adder is the carry in of the third adder, as shown in Figure 9.5. At the input and output of the adder are two registers made up of flip-flops.


Figure 9.5: ArmStrong 3-bit adder.
a) If there is no clock skew, what is the maximum operating frequency of the circuit? Assume the delays for typical CMOS-process parameters, given in the leftmost column of values in Table 9.4.
(2 p)
b) How much clock skew can the circuit tolerate before it might experience a hold violation? Again assume the delays for typical CMOS process parameters from Table 9.4.
c) Assume we had characterized the flip-flop and full-adder cells also for the fast-fast and slow-slow process corners and measured the delays shown in the two right-hand columns in Table 9.4. Describe how you would go about extending the results from tasks a) and b) with these additional data so that you can be sure that your adder works correctly also for these two extreme corners. Carry out your proposed calculations. Did you have to modify your results from a) and b)? If so, what are the updated results?

Table 9.4: ArmStrong timing characteristics

| Delay | Typical corner <br> measured value <br> $[\mathrm{ps}]$ | Fast-fast corner <br> measured value <br> $[\mathrm{ps}]$ | Slow-slow corner <br> measured value <br> $[\mathrm{ps}]$ |
| :--- | :---: | :---: | :---: |
| Full adders |  |  |  |
| $t_{\mathrm{pd}}, \mathrm{A}$ or $\mathrm{B} \rightarrow \mathrm{S}$ | 30 | 25 | 35 |
| $t_{\mathrm{cd}}, \mathrm{A}$ or $\mathrm{B} \rightarrow \mathrm{S}$ | 22 | 16 | 20 |
| $t_{\mathrm{pd}}, \mathrm{A}$ or $\mathrm{B} \rightarrow$ Cout | 25 | 20 | 30 |
| $t_{\mathrm{cd}}, \mathrm{A}$ or $\mathrm{B} \rightarrow$ Cout | 22 | 17 | 25 |
| $t_{\mathrm{pd}}, \mathrm{Cin} \rightarrow \mathrm{S}$ or Cout | 20 | 17 | 25 |
| $t_{\mathrm{cd}}, \mathrm{Cin} \rightarrow \mathrm{S}$ or Cout | 15 | 12 | 20 |
| Flip-flops |  |  |  |
| $t_{\mathrm{pcq}}$ | 35 | 28 | 40 |
| $t_{\mathrm{ccq}}$ | 21 | 16 | 24 |
| $t_{\mathrm{setup}}$ | 30 | 25 | 35 |
| $t_{\mathrm{hold}}$ |  |  | 20 |

Exercise 9.6: Problem tests understanding of critical path, setup and hold violations, process corners. It is problem 4 from exam 2021-01-07. On this exam the students could use all sources. Solution on page 89.


Figure 9.6: Three blocks of combinational logic, CL A, B and C, connected by three registers, RA, RB and RC.

Table 9.5: Timing parameters for the combinational logic blocks

| Timing parameter | TT <br> $[\mathrm{ps}]$ | SS corner <br> $[\mathrm{ps}]$ | FF corner <br> $[\mathrm{ps}]$ |
| :--- | :---: | :---: | :---: |
| Block A Propagation delay, $t_{\mathrm{pd}}$ | 100 | 120 | 90 |
| Block A Contamination delay, $t_{\mathrm{cd}}$ | 80 | 85 | 70 |
| Block B Propagation delay, $t_{\mathrm{pd}}$ | 30 | 40 | 15 |
| Block B Contamination delay, $t_{\mathrm{cd}}$ | 20 | 30 | 10 |
| Block C Propagation delay, $t_{\mathrm{pd}}$ | 120 | 140 | 100 |
| Block C Contamination delay, $t_{\mathrm{cd}}$ | 20 | 22 | 18 |

Table 9.6: Flip-flop timing characteristics for flip-flops of type R1

| Flip-flop timing parameter | TT <br> $[\mathrm{ps}]$ | SS corner <br> $[\mathrm{ps}]$ | FF corner <br> $[\mathrm{ps}]$ |
| :--- | :---: | :---: | :---: |
| Setup time, $t_{\text {setup }}$ | 40 | 45 | 35 |
| clk-to-Q propagation delay, $t_{\mathrm{pcq}}$ | 25 | 30 | 20 |
| clk-to-Q contamination delay, $t_{\mathrm{ccq}}$ | 15 | 20 | 10 |
| Hold time, $t_{\text {hold }}$ | 20 | 30 | 15 |

Consider the circuit in Figure 9.6 with three combinational logic blocks A, B and C connected with registers RA, RB and RC. Data for the combinational logic blocks A, B, and C and for registers of type R1 are found in Tables 9.5 and 9.6, respectively. Typical data are in the leftmost column, TT, while data for the slow (SS) and fast (FF) corners are given in the columns to the right.
a) With typical timing parameters, what is the maximum clock frequency at which the circuit can operate without experiencing a setup violation?
(2 p)
b) With the clock frequency you found in task a) and typical timing parameters, how much clock skew can the circuit tolerate without experiencing a hold violation?
c) In the next design step you tentatively decide to use a clock frequency of 3 GHz. Now you also need to analyze the circuit for the fast (FF) and slow (SS) corners. With this clock frequency, what is the maximum clock skew you can be sure will not cause either setup or hold violations in the circuit for any of the timing corners?
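
A corner analysis such as the one in task c) amounts to computing the setup and hold margins for every path at every corner and taking the smallest one as the tolerable skew. The Python sketch below shows the bookkeeping for one hypothetical register-to-register path that passes through Block B only, using the data from Tables 9.5 and 9.6. Since the figure is not reproduced here, this path is an assumption made for illustration; the real circuit may contain other paths that are more critical, so the printed number should not be read as the answer to task c). All names are our own.

```python
# Sketch of a corner analysis for ONE hypothetical path: register -> Block B -> register.
# Delays in ps, copied from Table 9.5 (Block B rows) and Table 9.6.

BLOCK_B = {
    "TT": {"t_pd": 30, "t_cd": 20},
    "SS": {"t_pd": 40, "t_cd": 30},
    "FF": {"t_pd": 15, "t_cd": 10},
}
FLOP_R1 = {
    "TT": {"t_setup": 40, "t_pcq": 25, "t_ccq": 15, "t_hold": 20},
    "SS": {"t_setup": 45, "t_pcq": 30, "t_ccq": 20, "t_hold": 30},
    "FF": {"t_setup": 35, "t_pcq": 20, "t_ccq": 10, "t_hold": 15},
}


def setup_margin_ps(period_ps, flop, logic):
    """Skew that can be tolerated before a setup violation occurs."""
    return period_ps - (flop["t_pcq"] + logic["t_pd"] + flop["t_setup"])


def hold_margin_ps(flop, logic):
    """Skew that can be tolerated before a hold violation occurs."""
    return flop["t_ccq"] + logic["t_cd"] - flop["t_hold"]


def max_skew_ps(period_ps):
    """Smallest margin over all corners, for this single path."""
    margins = []
    for corner in FLOP_R1:
        margins.append(setup_margin_ps(period_ps, FLOP_R1[corner], BLOCK_B[corner]))
        margins.append(hold_margin_ps(FLOP_R1[corner], BLOCK_B[corner]))
    return min(margins)


period_ps = 1000.0 / 3.0  # clock period at 3 GHz
for corner in ("TT", "SS", "FF"):
    setup = setup_margin_ps(period_ps, FLOP_R1[corner], BLOCK_B[corner])
    hold = hold_margin_ps(FLOP_R1[corner], BLOCK_B[corner])
    print(f"{corner}: setup margin {setup:.1f} ps, hold margin {hold} ps")
print(f"tolerable skew for this path: {max_skew_ps(period_ps)} ps")
```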

Exercise 9.7: Problem tests understanding of critical path, setup and hold violations also for input signals. It is problem 4 from exam 2021-08-23. On this exam the students could use all sources. Solution on page 89.


Figure 9.7: Three blocks of combinational logic, CL A, B and C, connected by three registers, RA, RB and RC.

Table 9.7: Timing parameters for the combinational logic blocks

| Timing parameter | Block A <br> $[\mathrm{ps}]$ | Block B <br> $[\mathrm{ps}]$ | Block C <br> $[\mathrm{ps}]$ |
| :--- | :---: | :---: | :---: |
| Propagation delay, $t_{\mathrm{pd}}$ | 30 | 40 | 50 |
| Contamination delay, $t_{\mathrm{cd}}$ | 10 | $?$ | 10 |

Consider the circuit in Figure 9.7 with three combinational logic blocks A, B and C connected with registers RA, RB and RC. Timing data for the combinational logic blocks A, B and C and for registers RA, RB and RC are found in Tables 9.7 and 9.8, respectively.
a) What is the smallest value for the contamination delay of Block B that will guarantee correct operation for all the registers in the circuit (that is, that there are no violations)?
(2 p)
b) What is the smallest value for the clock period that will guarantee that there are no violations in any of the registers?

Table 9.8: Flip-flop timing characteristics

| Flip-flop timing parameter | RA and RB <br> $[\mathrm{ps}]$ | RC <br> $[\mathrm{ps}]$ |
| :--- | :---: | :---: |
| Setup time, $t_{\text {setup }}$ | 10 | 10 |
| clk-to-Q propagation delay, $t_{\mathrm{pcq}}$ | 10 | 10 |
| clk-to-Q contamination delay, $t_{\mathrm{ccq}}$ | 0 | 0 |
| Hold time, $t_{\text {hold }}$ | 10 | 20 |

c) After which time (relative to the clock edge) is the input signal IN not allowed to switch, to ensure that there are no setup violations at any of the registers in the circuit?
d) When is IN allowed to change again (relative to the clock edge) to ensure there are no hold violations in any of the registers in the circuit?

## Chapter 10

## Power, energy and scaling

This chapter contains problems on power and energy. There is also a section on technology scaling in combination with delay, power, energy and area calculations, since the reason for technology scaling is often to decrease the power consumption while maintaining speed.

### 10.1 Power and energy

Exercise 10.1: Problem tests understanding of power. Problem is 5.2 in the textbook. Solution on page 90.

You are considering lowering $V_{\mathrm{DD}}$ to try to save power for a static CMOS gate. You will also scale the threshold voltages, $V_{\mathrm{T}}$, proportionally to maintain speed (= performance). Will dynamic power consumption go up or down? Will static power consumption go up or down?

Exercise 10.2: Problem tests basic knowledge on how to calculate static and dynamic power dissipation. Exam 2014-10-31 Problem 1. Slightly revised. Solution on page 90.

The power consumption of a CMOS inverter can be minimized through circuit optimizations. How should each parameter in Table 10.1 be changed to reduce the three different components of the inverter power consumption? For each parameter indicate I for increase, D for decrease or N for does not affect this type of power consumption.

Table 10.1: Table for power optimization.

| Type of power to minimize | Capacitive <br> load $C_{\mathrm{L}}$ | Supply <br> voltage <br> $V_{\mathrm{DD}}$ | Threshold <br> voltages <br> $V_{\mathrm{T}}$ | Transistor <br> widths <br> $W$ |
| :--- | :--- | :--- | :--- | :--- |
| Dynamic power consumption due to the charging <br> and discharging of the capacitive load $C_{\mathrm{L}}$ |  |  |  |  |
| Power consumption due to short-circuit current <br> during transition (assuming a fixed rise and fall <br> time at the inverter input) |  |  |  |  |
| Static power dissipation, that is the power due to <br> the transistor leakage currents |  |  |  |  |

Shown in Figure 10.1 are the static power entries for a 2-input NAND gate in the .lib file from a standard-cell library. When we inspect the power entries, we notice that one of the input combinations for this gate has a much lower leakage power than do the other three. Why is this? Explain!


Figure 10.1: Data from .lib file for 2-input NAND gate.

Will the ripple-carry gate that you have designed in labs 2 and 3 exhibit a similar leakage pattern, or not? That is, is there a best or worst combination of inputs when it comes to leakage? If so, for what input combination do we have this situation?

Exercise 10.4: Problem tests ability to calculate dynamic power dissipation and understanding of its origins. Exam 2015-01-05 Problem 3 c)-d). Solution on page 90.

Refer back to your solution to Exercise 5.14 or to its solution on page 69. Also, assume that the operating frequency is 200 MHz and that $\alpha$ is 0.25 for the signal that drives the pad. $V_{\mathrm{DD}}$ is 1 V .
c) For your design what is the dynamic power consumption?
d) From the usual formula used for the dynamic power consumption it seems that to minimize the power consumption during switching there should be no inverters in the box in Figure 5.7. Why is this conclusion incorrect?

Exercise 10.5: Problem tests ability to calculate static and dynamic power dissipation and knowledge of power and energy. It is adapted from two examples in the textbook. Exam 2015-10-29 Problem 3. Solution on page 91.

A digital system on a chip has 1 billion transistors. Of these 50 million are used in static CMOS logic gates and the rest are used in memories. The two parts of the chip are illustrated in Figure 10.2, which also includes relevant chip and process data.
a) Using data from Figure 10.2, estimate the power due to dynamic switching at a clock frequency of 1 GHz , if we neglect the effects of short-circuit current and wire capacitances.
b) Using data from the figure above, estimate the static power consumption for the chip.
c) What if the logic part can be redesigned so that fewer transistors than before require the low $V_{\mathrm{T}}$ : only $1 \%$ of the original number, while the total number of logic transistors has to be increased by $20 \%$. The activity factor stays the same. What are the effects of this redesign on the static and dynamic power consumption for the logic part?
d) If the measures above are not enough, one may have to use power gating to reduce the leakage further. What if we are to power-gate the entire logic part in the chip? We can tolerate only a $5 \%$ drop of $V_{\mathrm{DD}}$ due to the resistance in the power switch, otherwise the increase in delay will be too large. How wide would our switch have to be if the pMOS transistor ON resistance is $2 \mathrm{k} \Omega \cdot \mu \mathrm{m}$? Assume the dynamic power consumption of the logic calculated in task a).


Figure 10.2: Data for the system on chip.
e) How much energy would be required to switch the power-gating transistor on and off once? How long a time with the leakage current for the logic part only, as calculated in task b), does that energy correspond to? (2 p)

## BONUS QUESTION

f) When considering power gating, would it be better to start with the situation described in tasks a) and b) or the one described in task c)? Discuss the design considerations!

## VARIATION

g) What if the processor in problem 10.5 is running at $V_{DD}=1 \mathrm{~V}$ instead? Would the width of the required switch, calculated in 10.5 d), change? What about the energy and time calculated in 10.5 e)?

Exercise 10.6: Problem tests ability to calculate static and dynamic power and choose mitigation techniques. Exam 2019-10-31 Problem 5. Solution on page 92.

You are part of a team designing a custom processor for a specific application, called Big. Your design is now complete, but after simulations you realize that it consumes too much power. You do not have time to completely redesign the processor. Instead, you and your colleagues have the task to find other ways to reduce its power consumption.

Thankfully, the Big software team has been able to optimize the Big application software so that it now requires only a clock frequency of 800 MHz , rather than the initial 1 GHz . The Big application takes 1 s to execute at 800 MHz ; then the next execution round has to start immediately after one round is finished.
The processor data is given in Table 10.2. The CMOS process characteristics are given in Table 10.3.

Table 10.2: Data for processor design

| Parameter | Description | Logic <br> part | Memory <br> part |
| :--- | :--- | :--- | :--- |
| $N_{\text {trans }}$ | number of transistors | 150 M | 600 M |
| $W_{\text {avg }}$ | average transistor widths | 450 nm | 120 nm |
| $\alpha$ | average activity factor | 0.15 | 0.01 |
| $\%_{R V T}$ | percent of transistors of RVT type | $80 \%$ | $100 \%$ |
| $\%_{\text {LVT }}$ | percent of transistors of LVT type | $20 \%$ | $0 \%$ |

a) The first reduction method that you consider is to continue to run the processor at 1 GHz in order to run the application as fast as possible, and then turn the logic part off. This method requires that you implement a power switch. Let us assume that doing so is not that complex. What would be the energy consumption for the processor, for one execution round, if you use this mode of operation? Assume that the energy required for using the power switch is negligible.

Table 10.3: CMOS process and transistor parameters

| Parameter | Description | Value |  |
| :--- | :--- | :--- | :--- |
| $V_{\mathrm{DD}}$ | nominal supply voltage | 0.8 V |  |
| $V_{\mathrm{DDmin}}$ | minimum supply voltage | 0.4 V |  |
|  | Note: nMOS and pMOS transistors have | RVT type | LVT type |
|  | same current per width | Values | Values |
| $V_{\mathrm{TH}}$ | threshold voltages (pMOS \& nMOS) | 0.2 V | 0.2 V |
| $C_{G}$ | gate capacitance per width | $1 \mathrm{fF} / \mu \mathrm{m}$ | $1 \mathrm{fF} / \mu \mathrm{m}$ |
| $C_{D}$ | drain capacitance per width | $0.3 \mathrm{fF} / \mu \mathrm{m}$ | $0.3 \mathrm{fF} / \mu \mathrm{m}$ |
| $I_{\text {sub }}$ | subthreshold current per width | $500 \mathrm{nA} / \mu \mathrm{m}$ | $50 \mathrm{nA} / \mu \mathrm{m}$ |
| $I_{g}$ | gate leakage current per width | $5 \mathrm{nA} / \mu \mathrm{m}$ | $5 \mathrm{nA} / \mu \mathrm{m}$ |

b) Your colleague proposes that you could save more energy by turning the supply voltage down and running the processor at the lower clock frequency, 800 MHz. Calculate the lowest supply voltage that the processor can operate at when its clock frequency is 800 MHz. Also calculate the energy consumption for one execution round when the supply voltage is reduced this way. For simplicity assume that the leakage currents are independent of the supply voltage. Hint: $f_{c} \sim \frac{1}{R_{\text{eff}}}$.
c) Another application team is also considering using your processor. They are designing an application called Small. Small depends on external events, and must run at a clock frequency of 100 MHz for proper execution (they do not say anything about how long it takes for the execution to complete at this frequency). The Small team knows that your processor cannot detect these external events. The team now asks you for a recommendation for how to save power for their application. Should they run continuously at reduced supply voltage with the lowest possible clock frequency? Or should they run at the highest clock frequency and then power the logic part of the processor off, while using an external circuit to detect the external events and reactivate the processor? Motivate your recommendation. You do not need to make any detailed energy calculations.
(2 p)

Exercise 10.7: Problem tests static and dynamic power dissipation and their relation to $V_{\mathrm{DD}}$ and clock frequency. Exam 2016-12-22 Problem 4. Solution on page 94.

In this problem you will analyze the energy consumption for a video-rendering application run on a processor dedicated to this application. The processor can run at different supply voltages and has both a sleep mode and a hibernation mode. Your final task is to determine if the hibernation mode is useful for this particular application.

Application: The digital video in the video-rendering application has 25 frames per second. One frame is 640x480 pixels. Each pixel is represented by 24 bits. The number of operations needed per pixel is 8 and these operations take 10 clock cycles to complete on the PP processor. The computations for one frame have to be completed in the time allotted for that frame.

PP processor: The PP processor used for this application has some characteristics shown in Table 10.4. You also know that it is fabricated in a CMOS process where the threshold voltages are 0.3 V . You may assume that the quadratic current equations hold in this process.

Table 10.4: Data for the PP processor

| Supply voltage $V_{\mathrm{DD}}$ [V] | Maximum clock frequency [GHz] | Current due to dynamic power consumption @ max clock frequency and a realistic activity factor [mA] | Idle current @ room temperature @ max clock frequency and a low activity factor [mA] | Static current in sleep mode @ room temperature (clock signal turned off for logic, but clock generation maintained) [mA] | Static current in hibernation mode (clock generation stopped and internal supply voltages turned off) $[\mu \mathrm{A}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1.2 | 1.0 | 600 | 100 | 60 | 60 |
| 1.0 | ? | ? | 80 | 37.5 | 60 |
| 0.8 | ? | ? | 64 | 28 | 60 |

Sleep mode: The time it takes to enter sleep mode is $10 \mu \mathrm{~s}$ and it takes $20 \mu \mathrm{~s}$ for the processor to wake up from sleep mode. The energy required to switch the clocks off is $10 \mu \mathrm{~J}$.

Hibernation mode: The time it takes to enter hibernation mode is 1 ms and it takes 19 ms to wake up from hibernation mode. The energy required to turn off $V_{\mathrm{DD}}$ is $500 \mu \mathrm{~J}$.
a) Fill in the four empty cells in Table 10.4 above with reasonable values.
b) Considering only dynamic power, how much energy will be used for one frame of the video-rendering application in these two cases: 1.2 V and 0.8 V supply voltage?
(2 p)
c) How much energy will be dissipated due to static power consumption for one frame for the two cases: 1.2 and 0.8 V supply voltage? Assume that the processor immediately enters sleep mode when no computations are required for the video-rendering application.
d) What is your recommendation regarding the hibernation mode? Should it be used or not for the video-rendering application? Motivate your reply using data given in this problem and your results from tasks a)-c).
(2 p)

Exercise 10.8: Problem tests ability to calculate power efficiency. Exam 2021-01-07 Problem 3. Solution on page 95.

Even as a hardware engineer you can get assigned software-related tasks. Here your task is to determine the most energy-efficient way to execute the so-called POP application. The POP application needs to be executed once every 100 milliseconds on the LEG-2000 processor. The LEG-2000 has two types of cores: the L78 core and the L52 core. Data for the POP application and the two types of cores are given in Table 10.5.
a) Your initial task is to decide whether to use the L78 or the L52 core when there are no other applications to execute on the LEG-2000 processor. That is, what is the power usage for the L78 and L52 cores when executing the POP application according to the required timing?
b) To complicate things further there is an additional application which must be executed on the L78 core (because it needs some special hardware that is not available in the L52 core). This additional task requires 60 milliseconds for its execution and also needs to be executed every 100 milliseconds. The tasks are fully independent, and thereby they can be scheduled freely. Does this fact change your conclusion from task a)? Hint: Think about the processor's total power consumption.

Table 10.5: Data for the two different cores.

| Parameter | L78 core | L52 core |
| :--- | ---: | ---: |
| Clock frequency, $f_{\mathrm{c}}$ | 1 GHz | 200 MHz |
| Supply voltage, $V_{\text {dd }}$ | 1.2 V | 1 V |
| Number of clock pulses for execution | 12 M | 15 M |
| Current consumption in active mode, $I_{\text {active }}$ | 1 A | 200 mA |
| Current consumption in sleep mode, $I_{\text {sleep }}$ | 250 mA | 50 mA |
| Current consumption in power off mode, $I_{\text {power-off }}$ | 1 mA | 1 mA |
| Time required to go into sleep, $t_{\text {sleep }}$ | $500 \mu \mathrm{~s}$ | 1 ms |
| Time required to wake-up from sleep, $t_{\text {wake-up }}$ | $500 \mu \mathrm{~s}$ | 1 ms |
| Time required for entering power off mode, $t_{\text {power-off }}$ | 5 ms | 5 ms |
| Time required for power on core, $t_{\text {power-on }}$ | 15 ms | 15 ms |
| Energy consumed when entering sleep, $E_{\text {sleep }}$ | $200 \mu \mathrm{~J}$ | $50 \mu \mathrm{~J}$ |
| Energy consumed when waking up from sleep, $E_{\text {wake-up }}$ | $200 \mu \mathrm{~J}$ | $50 \mu \mathrm{~J}$ |
| Energy consumed when entering power off, $E_{\text {power-off }}$ | 2 mJ | $500 \mu \mathrm{~J}$ |
| Energy consumed when powering on core, $E_{\text {power-on }}$ | 5 mJ | 1 mJ |
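
As a starting point for the bookkeeping in this exercise, the sketch below computes the execution time and the energy per 100 ms period for one possible strategy: run the application, then sleep for the rest of the period. It is only an illustration under stated assumptions and not the official solution. The names `Core`, `exec_time` and `energy_active_then_sleep` are hypothetical, and the model assumes that the transition energies $E_{\text{sleep}}$ and $E_{\text{wake-up}}$ account for all consumption during the transition times. The power-off strategy can be evaluated in the same way.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Subset of the Table 10.5 data, in SI units (hypothetical container)."""
    f_clk: float      # clock frequency [Hz]
    v_dd: float       # supply voltage [V]
    n_cycles: float   # clock pulses needed for one execution
    i_active: float   # active current [A]
    i_sleep: float    # sleep current [A]
    t_sleep: float    # time to enter sleep [s]
    t_wake: float     # time to wake up from sleep [s]
    e_sleep: float    # energy to enter sleep [J]
    e_wake: float     # energy to wake up from sleep [J]


L78 = Core(f_clk=1e9, v_dd=1.2, n_cycles=12e6, i_active=1.0, i_sleep=0.25,
           t_sleep=500e-6, t_wake=500e-6, e_sleep=200e-6, e_wake=200e-6)
L52 = Core(f_clk=200e6, v_dd=1.0, n_cycles=15e6, i_active=0.2, i_sleep=0.05,
           t_sleep=1e-3, t_wake=1e-3, e_sleep=50e-6, e_wake=50e-6)


def exec_time(core: Core) -> float:
    """Execution time = number of clock pulses / clock frequency."""
    return core.n_cycles / core.f_clk


def energy_active_then_sleep(core: Core, period: float = 100e-3) -> float:
    """Energy per period when the core runs once and then sleeps.

    Assumption: the transition energies cover everything consumed while
    entering and leaving sleep mode.
    """
    t_exec = exec_time(core)
    t_idle = period - t_exec - core.t_sleep - core.t_wake
    if t_idle < 0:
        raise ValueError("schedule does not fit in the period")
    e_active = core.v_dd * core.i_active * t_exec
    e_idle = core.v_dd * core.i_sleep * t_idle
    return e_active + core.e_sleep + core.e_wake + e_idle


for name, core in (("L78", L78), ("L52", L52)):
    energy = energy_active_then_sleep(core)
    print(f"{name}: t_exec = {exec_time(core) * 1e3:.1f} ms, "
          f"E = {energy * 1e3:.2f} mJ, P_avg = {energy / 100e-3 * 1e3:.1f} mW")
```

Dividing the energy per period by the period gives the average power, which is the quantity asked for in task a).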

### 10.2 Technology scaling and other scalings too

In this section we have collected problems related to technology scaling, but also problems that deal with other scalings, so that there is a mix.

Exercise 10.9: Problem tests ability to calculate scaling dependence for secondary parameters. Solution on page 96.

Technology scaling is (or maybe, one now should say, was) performed to increase speed while lowering power consumption and increasing packing density, thus decreasing chip area, or enabling more functions on the same chip area. The classic scaling is the Dennard scaling. In Table 10.6 the scaling of the primary parameters is given in the top part of the table. Your task is to derive the scaling for each of the secondary parameters from the primary ones. It is best to do it in order since many of the derived parameters depend on each other.

Table 10.6: Table for Dennard scaling.

| Parameter | Sensitivity expression | Dennard scaling, scaling factor $S$ |
| :---: | :---: | :---: |
| Scaling parameters |  |  |
| $L$ : length |  | 1/S |
| $W$ : width |  | 1/S |
| $t_{\text {ox }}$ : gate oxide thickness |  | 1/S |
| $V_{\mathrm{DD}}$ : power supply voltage |  | 1/S |
| $V_{\mathrm{T}}$ : threshold voltage(s) |  | 1/S |
| $N_A$ : substrate doping |  | $S$ |
| Device characteristics |  |  |
| $\beta$ : current factor | $\frac{W}{L} \frac{1}{t_{\text{ox}}}$ |  |
| $I_{\text {DS }}$ : transistor current | $\beta\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2}$ |  |
| $R_{\text{eff}}$ : resistance | $\frac{V_{\text{DD}}}{I_{\text{DS}}}$ |  |
| $C$ : gate capacitance | $\frac{W L}{t_{\text {ox }}}$ |  |
| $\tau$ : gate delay | $R_{\text {eff }} C$ |  |
| $f$ : clock frequency | $\frac{1}{\tau}$ |  |
| $E$ : switching energy (per gate) | $C V_{\text {DD }}^{2}$ |  |
| $P$ : switching power (per gate) | $E f$ |  |
| $A$ : area (per gate) | WL |  |
| Switching power density | $\frac{P}{A}$ |  |
| Switching current density | $\frac{I_{\text {DS }}}{A}$ |  |

Exercise 10.10: EXTRA Problem tests ability to calculate scaling dependence for secondary parameters. Solution on page 96.

For quite a few years semiconductor manufacturers resisted scaling down the supply voltage and threshold voltages, since it would make interfacing with existing technologies harder. To see the effect of this type of scaling, repeat exercise 10.9 but with "1" as the entries for $V_{\mathrm{DD}}$ and $V_{\mathrm{T}}$.
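
One way to check your own table entries for exercises 10.9 and 10.10 is to track only the exponent of $S$ for each parameter, since every sensitivity expression in Table 10.6 is a product of powers. The sketch below does that. The function name `derive_exponents` and the dictionary keys are hypothetical, and the sketch assumes that $V_{\mathrm{DD}}$ and $V_{\mathrm{T}}$ scale by the same factor so that $(V_{\mathrm{DD}}-V_{\mathrm{T}})$ scales like $V_{\mathrm{DD}}$.

```python
def derive_exponents(primary):
    """Return the exponent n in S**n for each parameter of Table 10.6.

    `primary` maps L, W, tox, VDD and VT to their exponents,
    e.g. -1 means the parameter scales as 1/S.
    """
    # Assumption: both voltages scale alike, so (VDD - VT) scales like VDD.
    if primary["VDD"] != primary["VT"]:
        raise ValueError("sketch assumes VDD and VT scale by the same factor")
    e = dict(primary)
    e["beta"] = e["W"] - e["L"] - e["tox"]   # (W/L) * (1/tox)
    e["I_DS"] = e["beta"] + 2 * e["VDD"]     # beta * (VDD - VT)^2
    e["R_eff"] = e["VDD"] - e["I_DS"]        # VDD / I_DS
    e["C"] = e["W"] + e["L"] - e["tox"]      # W * L / tox
    e["tau"] = e["R_eff"] + e["C"]           # R_eff * C
    e["f"] = -e["tau"]                       # 1 / tau
    e["E"] = e["C"] + 2 * e["VDD"]           # C * VDD^2
    e["P"] = e["E"] + e["f"]                 # E * f
    e["A"] = e["W"] + e["L"]                 # W * L
    e["P/A"] = e["P"] - e["A"]               # switching power density
    e["I_DS/A"] = e["I_DS"] - e["A"]         # switching current density
    return e


DENNARD = {"L": -1, "W": -1, "tox": -1, "VDD": -1, "VT": -1}
CONSTANT_VOLTAGE = {"L": -1, "W": -1, "tox": -1, "VDD": 0, "VT": 0}

for label, primary in (("Dennard", DENNARD),
                       ("constant voltage", CONSTANT_VOLTAGE)):
    print(label)
    for name, exponent in derive_exponents(primary).items():
        print(f"  {name}: S^{exponent}")
```

Working with exponents keeps the derivation order visible: each line uses only quantities defined above it, just as the exercise suggests.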

Exercise 10.11: Problem tests ability to apply process scaling to delay and power. Exam 2011-08-24, task 4a. Solution on page 96.

The FO4 delay of the AMS $0.35 \mu \mathrm{~m}$ CMOS process running at 3.3 V is 125 ps. What would the FO4 delay be of a $0.13 \mu \mathrm{~m}$ CMOS process running at $V_{\mathrm{DD}}=1.8 \mathrm{~V}$?

Exercise 10.12: Problem tests ability to calculate static and dynamic power dissipation under the effect of technology scaling. Exam 2012-10-26 Problem 5. Solution on page 98.

The work horse product of the California-based microprocessor company MegaProcessor is a singlecore microprocessor manufactured in a 90 nm CMOS technology. The microprocessor operates at 3.8 GHz with a 1.2 V supply, a 100 W power dissipation, and a die area of $200 \mathrm{~mm}^{2}$. The company is now about to design a dual-core microprocessor in the same technology, by duplicating the single-core design.
a) What would be the frequency and supply voltage for the dual-core design if the same size of the heat sink is to be maintained, that is if $P_{\text{DUAL\_CORE}}=P_{\text{SINGLE\_CORE}}$? Assume that $100 \%$ of the power dissipation is due to dynamic power, and, for simplicity, assume that the frequency of operation is roughly linearly proportional to the supply voltage.
b) When the single core design from above is moved to the 65 nm technology node, what would be its size, power and frequency of operation with a 1 V power supply?
c) Now consider the following what-if situation: assume that $10 \%$ of the 100 W total power dissipated by the single-core in task a) is due to static (leakage) power and that $90 \%$ is due to dynamic switching power. Furthermore, in the 65 nm process the standard threshold voltage (svt) is almost 50 mV lower than in the 90 nm process, yielding a four-fold increase in leakage power at room temperature. What would be the total power dissipation of the single-core processor when transferred to the 65 nm process? How many percent would be leakage power?

Exercise 10.13: Problem tests ability to minimize dynamic power dissipation under constraint of constant delay. Exam 13-08-26 Problem 3. Solution on page 98.

One reason for going to multi-core computer systems is to reduce the total power dissipation by lowering the supply voltage and then compensating for speed losses by using more than one core working in parallel. A simplified assumption is that four cores at a fourth of full speed can do the same work as one core at full speed. From these simple assumptions and from the assumption that the delay is given by $0.7 \frac{C V_{\mathrm{DD}}}{I_{\mathrm{DSAT}}}$, find out how much one can reduce the power dissipation by using a quad-core design, while still getting the same job done. Assume that we use the same CMOS process and the same core, but that an additional $20 \%$ of the core capacitance must be added to the total chip capacitance for the control unit coordinating the four cores. Let us assume that $V_{\mathrm{T}}$ is $25 \%$ of the original supply voltage. Use the simple square-law current model for the MOSFET saturation current!

## Chapter 11

## Adders

## Need to add more adder problems.

Exercise 11.1: Problem tests understanding of block-propagate signal and its use to create faster adders.
Exam 2017-08-21 Problem 6. Solution on page 99.

(a) 8-bit adder.

(b) 4-bit adder.

Figure 11.1: Two adders. In (a) is an 8-bit adder and in (b) a 4-bit adder.

You have designed an 8-bit adder circuit with eight SUM outputs, one carry output and one block-propagate output, as shown in Figure 11.1 (a). The propagation delays expressed in unit delays are shown in Table 11.1. You have now been asked to design a 32-bit adder with as short a propagation delay as possible. Available to you are AO, OA, AND and OR gates which each have a delay of 1 unit delay.

Table 11.1: Adder propagation delays

| Propagation delay | 8-bit adder <br> [unit delays] | 4-bit adder <br> [unit delays] |
| :--- | :--- | :--- |
| From carry in to carry out | 9 | 5 |
| From any data bit to carry out | 9 | 5 |
| From any data bit to block propagate | 4 | 2 |
| From carry in to highest SUM output | 8 | 4 |
| From any data bit to highest SUM output | 8 | 4 |

a) Draw a diagram of how you would construct a fast 32-bit adder from your 8-bit adder. In addition to the 32 SUM bits your 32-bit adder should also output the block-propagate and block-generate signals for the entire adder.
b) Derive the propagation delay of your adder as drawn in task a). Assume that all inputs to your adder arrive simultaneously.
c) What if you also had designed a similar 4-bit adder, as shown in Figure 11.1 (b), with the propagation delays given in the last column of Table 11.1? Would it be better to use the 4-bit adder rather than the 8-bit adder in your 32-bit adder? Motivate your answer by drawing a diagram and calculating the delay.

Exercise 11.2: Problem tests understanding of prefix adder sum generation and critical path. Exam 2016-12-22 Problem 6. Solution on page 99.

Figure 11.2 shows the beginning of the design of an unknown prefix adder. As you can see from the figure, in this type of adder no dot-operator cell in the forward or backwards tree drives more than two other dot-operator cells. (The triangles in the diagram are buffers that you can ignore in this problem.)


Figure 11.2: An unknown prefix adder where the output of any dot-operator cell is connected to a maximum of two other dot-operator cell inputs.
a) Your task is to complete the adder design by adding the missing dot-operator cells, in the red box, so that all input carries needed for the SUM operations are available at the bottom. As an example, group carries C15:0, C11:0 and C7:0 (indicated with blue in Figure 11.2) are already available to form SUM16, SUM12 and SUM8, respectively. A fully correct solution should maintain the design principle that no cell in the tree drives more than two other dot-operator cells. Solutions that do not fulfill this principle, but are logically correct, will give partial credit. The figure is repeated twice at the end of the exam so that you can tear one off and turn it in with your solutions.
b) What is the critical path of the prefix tree you have drawn in task a)? Indicate it clearly in the complete tree that you hand in.

Exercise 11.3: Problem tests knowledge of delay and area for types of adders. Exam 2017-12-21 Problem 1c. Solution on page 99.

When using a synthesis tool to automatically map the addition operator, " + ", to hardware, the tool will select the type of adder to meet the timing constraint while minimizing the area. One can make an experiment by synthesizing the same adder with tighter and tighter timing constraints and see what type of adder the tool selects and how large an area the resulting adders occupy. Figure 11.3 shows the results of such an experiment carried out for both 32-bit and 64-bit adders. The synthesis tool used had these four types of adders available (listed in alphabetical order): (A) carry-lookahead adders, (B) carry-select adders, (C) prefix adders, and (D) ripple-carry adders. Match the four types of adders with the four labels in the graph of Figure 11.3. No motivation is required.

Exercise 11.4: Problem tests understanding of selection adders. Exam 2020-01-07 Problem 6. The students were allowed to use all sources at this exam. Solution on page 100.


Figure 11.3: The resulting normalized area when synthesising a 32-bit and a 64-bit adder with different timing constraints. The four types of adders listed in alphabetical order are (A) carry lookahead adders (B) carry-select adders (C) prefix adders and (D) ripple-carry adders. But which one is which?

You are an avid reader of tech blogs. You have come across a discussion about adders, more precisely about selection adders. The two entries below catch your attention.

Entry 9 (2020-10-23 19:17)
The carry-lookahead adder is faster than the carry-skip adder since multiplexers can be replaced with And-Or gates that have a shorter delay. This is possible since the logical functionality is equivalent in both these gates.

Entry 10 (2020-10-25 11:32)
This cannot be correct as a multiplexer is very similar to an AO221-gate while an AO21-gate is used in the carry-lookahead adder. Thereby the multiplexer and AO21-gate cannot be logically equivalent.

Table 11.2: Delay data for some logic gates.

| Logic element | Delay <br> [unit delays] |
| :--- | :---: |
| PG-generation, $t_{\mathrm{pg}}$ | 2 |
| AO-gate, $t_{\mathrm{AO}}$ | 3 |
| XOR-gate, $t_{\mathrm{xor}}$ | 2 |

a) Your task is here to write a reply that explains why it is possible to replace the multiplexer with an AO21-gate in the carry-lookahead adder.
b) The carry-lookahead adder (CLA) presents a good trade-off between area and delay in many cases. Based on the data in Table 11.2, calculate the longest delay for a 16-bit adder when using 4-bit groups. Repeat the calculation with 8-bit groups.
c) The group sizes in a CLA do not have to be identical. Construct a 16-bit CLA that uses two 4-bit groups and one 8-bit group. Arrange the groups for the shortest delay and explain why this arrangement gives a shorter delay than when only 8-bit groups are used.
(2 p)
d) The concept of using different group-sizes can be further generalized for a potentially shorter delay. Find the shortest possible delay for a 16-bit CLA with at most 6 groups. Present the resulting delay and the size of each group.

## Chapter 12

## Solutions

### 12.1 Introduction

### 12.2 Background material

### 12.3 Logic functions

Solution 3.1 Problem is on page 9.
a) The circuit is a 4-input NAND gate. For completeness we also show the Karnaugh map and discuss how to get from Karnaugh map to the schematic. The schematic and the corresponding Karnaugh map (with ones and zeros covered) are shown in Figure 12.1.

From the Karnaugh map in Figure 12.1c we directly find that $A \cdot B \cdot C \cdot D$ is the only product that gives the output 0. With the zeros covered it is easy to directly find the n-net. The n-net implements the inverse of the function; that is why we need to circle the zeros. And the n-transistors are non-inverting, so we can draw the n-net directly. In this case we have the product (AND) of all inputs, which is implemented by transistors in series:

$$
\bar{Z}=A \cdot B \cdot C \cdot D
$$

The p-net can then be found from the direct inversion of the logical expression for $\bar{Z}$, by inverting the n-net (series connection becomes parallel) or from the Karnaugh map when we cover all the ones. In this case the two first approaches are easier but we show the third one below for completeness.

Here we need four circles to cover the ones. Each circle covers eight ones, that is half of the map. Thus each circle corresponds to one of the four inputs and the other three are eliminated. And in this particular case in the sum-of-products formulation each product is one of the inputs inverted:

$$
\begin{equation*}
Z=\bar{A}+\bar{B}+\bar{C}+\bar{D} \tag{12.1}
\end{equation*}
$$

But this expression can be rewritten using De Morgan's theorem as:

$$
Z=\overline{A \cdot B \cdot C \cdot D}
$$

which is the function we had started with in this task.
Because p-transistors are inverting in themselves (the bubble) we still should connect $\mathrm{A}, \mathrm{B}, \mathrm{C}$, and D to the p-transistor gates when we draw the p-net that implements the sum of products (12.1).
b) The logic gate is an AND-OR-INVERT (AOI) $3+1$ gate. One solution with corresponding Karnaugh maps (with ones and zeros circled) is shown in Figure 12.2. We follow the same procedure as in task a). If we find the p-net from the Karnaugh map we get the sum-of-products expression:

$$
Z=\bar{A} \cdot \bar{D}+\bar{B} \cdot \bar{D}+\bar{C} \cdot \bar{D}
$$

We can reduce the number of transistors by simplifying the logic equation:

$$
Z=(\bar{A}+\bar{B}+\bar{C}) \cdot \bar{D}
$$

c) The logic gate is an OR-AND-INVERT (OAI) $3+1$ gate. One solution with corresponding Karnaugh maps (ones and zeros) is shown in Figure 12.3.
d) One solution with corresponding Karnaugh maps (with ones and zeros circled) is shown in Figure 12.3. A similar simplification as in task b) can be done here but in the n-net.
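
The algebraic steps in tasks a) and b) can be double-checked by brute force, since a four-input function has only 16 input combinations. The following sketch compares the sum-of-products forms with the gate functions. The helper names are hypothetical, and the AOI $3+1$ function is taken to be $\overline{A B C+D}$, which follows from the factored expression in task b) by De Morgan's theorem.

```python
from itertools import product


def nand4(a, b, c, d):
    return not (a and b and c and d)


def nand4_sop(a, b, c, d):
    # Sum of products read from the ones in the Karnaugh map, equation (12.1)
    return (not a) or (not b) or (not c) or (not d)


def aoi31(a, b, c, d):
    # AND-OR-INVERT 3+1: inverse of (A AND B AND C) OR D
    return not ((a and b and c) or d)


def aoi31_sop(a, b, c, d):
    # Sum of products from the Karnaugh map in task b)
    return ((not a) and (not d)) or ((not b) and (not d)) or ((not c) and (not d))


def aoi31_factored(a, b, c, d):
    # Factored form that saves transistors in the p-net
    return ((not a) or (not b) or (not c)) and (not d)


def equivalent(f, g, n_inputs=4):
    """True if f and g agree for every input combination."""
    return all(bool(f(*v)) == bool(g(*v))
               for v in product([False, True], repeat=n_inputs))


print(equivalent(nand4, nand4_sop))
print(equivalent(aoi31, aoi31_sop))
print(equivalent(aoi31_sop, aoi31_factored))
```

The same `equivalent` helper can be reused to check that an n-net and a p-net derived separately really are complementary.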

(a) One schematic for the gate.

(b) Ones covered with four circles.

(c) Zero(s) covered with one circle

Figure 12.1: NAND4 (a) gate and corresponding Karnaugh map with (b) ones and (c) zeros covered.


Figure 12.2: AND-OR-invert $3+1$ schematics (a) and corresponding Karnaugh map with (b) all ones and (c) all zeros covered.

Solution 3.2 Problem is on page 9.
a) The solution with gates requires two NAND2 gates and one NOR2 gate and one inverter (14 transistors) or four NAND2 gates (16 transistors). Static CMOS requires eight transistors and separate gates twelve transistors. See Figure 11.59 in Weste \& Harris for more details.
b) It is the same number of transistors here. The inputs are just connected differently.
c) It is possible to apply the solution with the four NAND2 gates from task a) and to implement the entire sum function with two levels of this function. That is eight NAND2 gates which is 32 transistors. There are also several other possible solutions.

A direct implementation of the three-input XOR gate requires 16 transistors. It is shown in Figure 11.3 (b) in Weste and Harris. In addition three inverters are required. So the number of transistors is 20.


Figure 12.3: OR-AND-invert $3+1$ schematics (a) and corresponding Karnaugh map with (b) all ones and (c) all zeros covered.

Solution 3.3 Problem is on page 10.

Start by covering the zeros to find the n-net. Remember to make the circles the maximum power-of-two size (1, 2, 4 or 8 for a four-input function). The Karnaugh map should really be a sphere so the circles may wrap over the edges when you draw them. The expression you get from the circles is a sum-of-products form. If the products have a common factor you can simplify (and save transistors in the circuit) by factoring these out.

The p-net can thereafter be found by circling the ones in the Karnaugh map, or by inverting the n-net (parallel becomes series and series becomes parallel) or by inverting the logical function you derive from the n-net. Don't forget the bubbles on the p-transistors. Double-check that each input has the same inversion in the n- and p-nets.

The solutions are given in Figure 12.4.

(a) NAND4 gate with two inputs inverted

(b) NAND2 gate.

(c) AND-OR-INVERT $2+2$ gate with one input inverted.

Figure 12.4: The resulting circuit schematics for problem 3.3. Note that there is more than one logically equivalent solution for each case.

Solution 3.4 Problem is on page 10.
a) Since the output of the compound cell is inverted, by the inverter in the cell, the logical function of the cell can be directly found from the n-net as

$$
Y=(A B+C)(D+E) .
$$

b) With the stated connections the n-net has ($A$ and $B$) in series with ($A$ or $B$). But since ($A$ or $B$) is always true when ($A$ and $B$) is true, we can move that connection point to the output node. We can express that same simplification logically, of course. We find that the logical expression for the output now is

$$
Y=(A B+C)(A+B)
$$

which can be simplified to

$$
Y=A B+C(A+B)
$$

because $A B(A+B)=A B$, that is, the factor $(A+B)$ is redundant in that term. With this modification the n-net looks exactly like the p-net in Figure 8.1.

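
The simplification in task b) is easy to confirm with a truth table. The sketch below (function names are hypothetical) checks that both expressions agree for all eight input combinations, and also that the result is the three-input majority function, which is true when at least two inputs are true.

```python
from itertools import product


def original(a, b, c):
    # Y = (AB + C)(A + B)
    return ((a and b) or c) and (a or b)


def simplified(a, b, c):
    # Y = AB + C(A + B)
    return (a and b) or (c and (a or b))


def majority(a, b, c):
    # True when at least two of the three inputs are true
    return (a + b + c) >= 2


for a, b, c in product([False, True], repeat=3):
    print(int(a), int(b), int(c), int(original(a, b, c)))
```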
Solution 3.5 Problem is on page 10.

The logical functions are inverter (Y1) and nand (Y2-Y4). The expressions are found in Figure 12.5.


Figure 12.5: Schematic and logic functions for Harris' multi-output gate.

Solution 3.6 Problem is on page 11.

There are several possible solutions here depending if you let the signals pass from MSB to LSB or from LSB to MSB and how you code the signals. In the solution given here the signals pass from LSB to MSB. It is a good idea to test the opposite direction too.
a) In our solution we use two signals called NEQ and AL. If the inputs to the cell, $A$ and $B$, are not equal NEQ is set to 1 . Furthermore, when $A$ is not the same as $B, A L$ is set to 1 if $A$ is 1 and to 0 if $B$ is 1 . If $A=B$ the cell should just pass the incoming values of NEQ and AL to the corresponding outputs of the cell, because if all more significant bits are equal the incoming values should prevail. The logical expressions for the outputs are thus:

$$
\begin{aligned}
\mathit{NEQ}_{o} & =A \bar{B}+\bar{A} B+\mathit{NEQ}_{i} \\
\mathit{AL}_{o} & =A \bar{B}+(A B+\bar{A} \bar{B}) \mathit{AL}_{i}
\end{aligned}
$$

b) We could try to find the gates directly from the logical functions. However, it may be helpful to draw the Karnaugh diagrams. They are shown in Figure 12.6. The corresponding gates are shown in Figure 12.7.
c) The setup can be seen in Figure 12.8. The value for the first ALi signal does not matter but the NEQi signal of the least significant cell must be set to 0. Only if all cells fulfill the condition that their bits are equal, does the 0 appear at the NEQ output of the most significant cell.
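
To see that the cell really works as an iterative logic array, one can simulate it bit by bit and compare the result with an ordinary integer comparison. The sketch below does this for all pairs of 8-bit words, with the signals passing from LSB to MSB as in the solution above. The function names are hypothetical, and the "bits are equal" condition in the AL expression is written as $A$ XNOR $B$.

```python
def comparator_cell(a, b, neq_in, al_in):
    """One ILA cell; all signals are 0 or 1."""
    neq_out = (a ^ b) | neq_in
    equal = 1 - (a ^ b)                      # A XNOR B
    al_out = (a & (1 - b)) | (equal & al_in)
    return neq_out, al_out


def compare(a_word, b_word, n_bits=8, al_init=0):
    """Chain n_bits cells from LSB to MSB and return (NEQ, AL)."""
    neq, al = 0, al_init   # NEQ_i of the LSB cell must be 0; AL_i is don't-care
    for i in range(n_bits):
        a = (a_word >> i) & 1
        b = (b_word >> i) & 1
        neq, al = comparator_cell(a, b, neq, al)
    return neq, al


mismatches = 0
for x in range(256):
    for y in range(256):
        neq, al = compare(x, y)
        if neq != int(x != y) or (neq and al != int(x > y)):
            mismatches += 1
print("mismatches:", mismatches)
```

Because a more significant cell overrides AL whenever its own bits differ, the final AL value is decided by the most significant differing bit position, which is what a magnitude comparison requires.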


Figure 12.6: Karnaugh maps for the comparator gates.

(a) AL cell

(b) NEQ cell

Figure 12.7: The resulting circuit schematics for comparator cell.


Figure 12.8: The 8 -bit comparator.

Solution 3.7 Problem is on page 11.

This ILA performs an 8-bit AND function of the results from the individual cells, so it does not matter in which direction the results are combined. In this solution we go from LSB to MSB, but the other way would work equally well.
a) The cell performs an XOR function of the A and B bits and combines it with the input from the previous cells. It is the same function as shown in Figure 12.7 b if we use a NEQ bit; that is a bit that is one when the bits are not equal.
b) We can combine the eight cells the same way as shown in Fig 12.8, but there is only one signal that is passed on from cell to cell.
c) We can simplify the circuit (and also speed it up) by getting rid of the inverters between the cells. Every other cell would then implement the inverted signal.

### 12.4 The MOS transistor

Solution 4.1 Problem is on page 13.


Figure 12.9: Regions of operation for a p-channel MOSFET. Note that all voltages are negative.
a) For the n-channel MOSFET we have the following conditions. For the transistor being ON:

$$
\begin{equation*}
V_{G S} \geq V_{T N} \tag{12.2}
\end{equation*}
$$

For the transistor being in saturation:

$$
\begin{equation*}
V_{D S} \geq V_{G S}-V_{T N} \tag{12.3}
\end{equation*}
$$

The borders between the regions are found when there is an equal sign in (12.2) and (12.3) instead of a greater-than-or-equal sign.
b) The corresponding conditions for the p-channel MOSFET are as follows below. For the transistor being ON it is

$$
\begin{equation*}
V_{G S} \leq V_{T P} \tag{12.4}
\end{equation*}
$$

Remember here that $V_{T P}$ is negative for the p-channel MOSFET. For the transistor being in saturation the condition is

$$
\begin{equation*}
V_{D S} \leq V_{G S}-V_{T P} \tag{12.5}
\end{equation*}
$$

For the p-channel MOSFET the borders between the regions are found when there is an equal sign in (12.4) and (12.5) instead of a less-than-or-equal sign. So the equations for the borders are the same for the n-channel and the p-channel MOSFETs. The difference is that all voltages are negative for the pMOS case. The resulting diagram of the regions of operation for the p-channel MOSFET is shown in Figure 12.9.
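
Conditions (12.2) to (12.5) translate directly into a small classifier, which can be handy when checking bias points by hand. This is a sketch only: the function names and the region labels "off", "linear" and "saturation" are hypothetical, and the example threshold voltages of $\pm 0.3$ V are chosen only for illustration.

```python
def nmos_region(v_gs, v_ds, v_tn):
    """Region of operation for an n-channel MOSFET, from (12.2) and (12.3)."""
    if v_gs < v_tn:
        return "off"
    if v_ds >= v_gs - v_tn:
        return "saturation"
    return "linear"


def pmos_region(v_gs, v_ds, v_tp):
    """Region of operation for a p-channel MOSFET, from (12.4) and (12.5).

    All voltages, including v_tp, are negative in normal operation.
    """
    if v_gs > v_tp:
        return "off"
    if v_ds <= v_gs - v_tp:
        return "saturation"
    return "linear"


print(nmos_region(v_gs=1.2, v_ds=1.2, v_tn=0.3))
print(pmos_region(v_gs=-1.2, v_ds=-0.1, v_tp=-0.3))
```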

Solution 4.2 Problem is on page 13.
a)

$$
\begin{equation*}
V_{G T}=V_{G S}-V_{T}=1.2 \mathrm{~V}-0.3 \mathrm{~V}=0.9 \mathrm{~V} \tag{12.6}
\end{equation*}
$$

b) Assuming that the quadratic equation for the saturation current holds, we find:

$$
\begin{equation*}
I_{D S A T, \max }=\frac{k_{N}}{2} V_{G T}^{2}=\frac{900 \mu \mathrm{~A} / \mathrm{V}^{2}}{2} 0.9^{2} \mathrm{~V}^{2}=364.5 \mu \mathrm{~A} \tag{12.7}
\end{equation*}
$$

c) The saturation voltage is the drain-source voltage for which we have

$$
\begin{equation*}
V_{D S A T}=V_{G S}-V_{T} . \tag{12.8}
\end{equation*}
$$

Thus in the situation in task b) we have

$$
\begin{equation*}
V_{D S A T}=1.2 \mathrm{~V}-0.3 \mathrm{~V}=0.9 \mathrm{~V} \tag{12.9}
\end{equation*}
$$
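
The three results above follow from the same effective gate voltage, so they can be reproduced in a few lines. The variable names below are hypothetical; the numerical values are the ones used in equations (12.6) to (12.9).

```python
K_N = 900e-6   # current factor k_N [A/V^2], value used in (12.7)
V_GS = 1.2     # gate-source voltage [V]
V_T = 0.3      # threshold voltage [V]

v_gt = V_GS - V_T               # effective gate voltage, (12.6)
i_dsat = 0.5 * K_N * v_gt ** 2  # quadratic saturation current, (12.7)
v_dsat = v_gt                   # saturation voltage, (12.8)

print(f"V_GT = {v_gt:.2f} V")
print(f"I_DSAT = {i_dsat * 1e6:.1f} uA")
print(f"V_DSAT = {v_dsat:.2f} V")
```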

Solution 4.3 Problem is on page 13.
a) The expression for the gate capacitance is

$$
\begin{equation*}
C=C_{\mathrm{ox}} \times L \times W \tag{12.10}
\end{equation*}
$$

One would assume that in a 65 nm process the transistor length is 65 nm . However, it turns out that the physical length is somewhat less so we have to use 60 nm as the transistor length. So in this case we have:

$$
\begin{equation*}
C=20 \mathrm{fF} / \mu \mathrm{m}^{2} \times 60 \mathrm{~nm} \times 1 \mathrm{~mm}=20 \mathrm{fF} / \mu \mathrm{m}^{2} \times 0.060 \mu \mathrm{~m} \times 1000 \mu \mathrm{~m}=1200 \mathrm{fF}=1.2 \mathrm{pF} \tag{12.11}
\end{equation*}
$$

b)

$$
\begin{equation*}
C=10 \mathrm{fF} / \mu \mathrm{m}^{2} \times 45 \mathrm{~nm} \times 5 \mu \mathrm{~m}=10 \mathrm{fF} / \mu \mathrm{m}^{2} \times 0.045 \mu \mathrm{~m} \times 5 \mu \mathrm{~m}=2.25 \mathrm{fF} \tag{12.12}
\end{equation*}
$$

c)

$$
\begin{equation*}
C=10 \mathrm{fF} / \mu \mathrm{m}^{2} \times 45 \mathrm{~nm} \times 280 \mathrm{~nm}=10 \mathrm{fF} / \mu \mathrm{m}^{2} \times 0.045 \mu \mathrm{~m} \times 0.28 \mu \mathrm{~m}=0.13 \mathrm{fF} \tag{12.13}
\end{equation*}
$$
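
Most mistakes in this kind of calculation come from mixing nm, µm and mm. A simple safeguard, sketched below with a hypothetical helper name, is to convert every length to µm first so that it matches the unit of $C_{\mathrm{ox}}$ in fF/µm².

```python
def gate_capacitance_fF(c_ox_fF_per_um2, length_um, width_um):
    """Gate capacitance C = C_ox * L * W, equation (12.10), in fF."""
    return c_ox_fF_per_um2 * length_um * width_um


# Lengths converted to micrometres before the call.
c_a = gate_capacitance_fF(20, 0.060, 1000)  # L = 60 nm, W = 1 mm
c_b = gate_capacitance_fF(10, 0.045, 5)     # L = 45 nm, W = 5 um
c_c = gate_capacitance_fF(10, 0.045, 0.28)  # L = 45 nm, W = 280 nm

print(f"a) {c_a:.0f} fF, b) {c_b:.2f} fF, c) {c_c:.2f} fF")
```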

Solution 4.4 Problem is on page 14.
a)

$$
\begin{gather*}
R_{\text {eff }}=\frac{1 \mathrm{~V}}{500 \mu \mathrm{~A}}=2 \mathrm{k} \Omega  \tag{12.14}\\
R_{\text {eff }}=\frac{1 \mathrm{~V}}{750 \mu \mathrm{~A}}=1.33 \mathrm{k} \Omega \tag{12.15}
\end{gather*}
$$

b)

$$
\begin{equation*}
R_{\mathrm{eff}}=\frac{2 \mathrm{k} \Omega \mu \mathrm{~m}}{5 \mu \mathrm{~m}}=400 \Omega \tag{12.16}
\end{equation*}
$$

c)

$$
\begin{equation*}
R_{\mathrm{eff}}=\frac{2 \mathrm{k} \Omega \mu \mathrm{~m}}{280 \mathrm{~nm}}=\frac{2 \mathrm{k} \Omega \mu \mathrm{~m}}{0.28 \mu \mathrm{~m}}=7.1 \mathrm{k} \Omega \tag{12.17}
\end{equation*}
$$

(a) \& (d): nMOS: medium $V_{GT}$, $I_{DS}$ saturated; pMOS: medium $V_{GT}$, $I_{DS}$ saturated => green region, $V_{\text{OUT}} \approx V_{\text{DD}}/2$

(b) \& (f): nMOS: high $V_{GT}$, $I_{DS}$ in the linear region; pMOS: low $V_{GT}$, $I_{DS}$ saturated => lower blue region, $V_{\text{OUT}} \approx V_{\text{DD}}/10$

Figure 12.10: Solution to problem 5.1

### 12.5 The CMOS inverter

Solution 5.1 Problem is on page 15.

The solution is shown in Figure 12.10.

Solution 5.2 Problem is on page 15.
a) In Figure 5.2 we see that there are three input voltages to the inverter, where $V_{\text{IN1}}<V_{\text{IN2}}<V_{\text{IN3}}$. The lowest input voltage, $V_{\text{IN1}}$, corresponds to the lowest nMOS current, which is the VGS1 curve in Figure 5.2. Similarly, $V_{\text{IN2}}$ corresponds to the VGS2 curve and $V_{\text{IN3}}$ to the VGS3 curve. For the pMOS transistor we have the opposite situation: the lowest input voltage, $V_{\text{IN1}}$, corresponds to the largest $|V_{GS}|$ and thus the highest current, and so on.
b) To find the output voltage one has to consider the voltage at which the two current curves that correspond to the same input voltage cross. Thus, we find that for $V_{\mathrm{IN} 1}$ the bias point is in region B , for $V_{\mathrm{IN} 2}$ in region C , and for $V_{\mathrm{IN} 3}$ in region D. See also Figure 12.10 where graph (i) corresponds to $V_{\mathrm{IN} 1}$, (g) to $V_{\mathrm{IN} 2}$, and (h) to $V_{\text {IN3 }}$.

Solution 5.3 Problem is on page 15.
a) In the red regions the current is zero (at least on the scale we draw it here). In the blue regions, the current is equal to the saturation current of the transistor that is in the saturation region (that is, the one with the lowest effective gate voltage, $V_{GT}$). In the green region the current is equal to the saturation current when the two transistors have the same effective gate voltage. See Figure 12.11.


Figure 12.11: Solution to problem 5.3 a)
b) The inverter switching voltage, $V_{\text{sw}}$, is defined as the input voltage for which it is true that

$$
\begin{equation*}
V_{\text {OUT }}=V_{\text {IN }} . \tag{12.18}
\end{equation*}
$$

That condition can only hold in the steep part of the VTC, that is the green area. The MOS current equation in saturation is:

$$
\begin{equation*}
I_{D S}=\frac{k}{2}\left(V_{G S}-V_{T}\right)^{2} \tag{12.19}
\end{equation*}
$$

This equation is the same for both nMOS and pMOS transistors, but with a minus sign for the pMOS transistor, which is because its current is due to holes rather than electrons. Kirchhoff's current law says that the sum of the currents (in or out) of the node is zero. In the output node we then get

$$
\begin{equation*}
I_{D S n}+I_{D S p}=0 . \tag{12.20}
\end{equation*}
$$

Thus, we arrive at:

$$
\begin{equation*}
I_{D S n}=-I_{D S p} \tag{12.21}
\end{equation*}
$$

which we rewrite as

$$
\begin{equation*}
\frac{k_{n}}{2}\left(V_{G S n}-V_{T N}\right)^{2}=\frac{k_{p}}{2}\left(V_{G S p}-V_{T P}\right)^{2} . \tag{12.22}
\end{equation*}
$$

To find $V_{\mathrm{sw}}$ we have to replace $V_{G}$ and $V_{S}$ with the actual voltages in the inverter circuit. For both the nMOS and the pMOS transistor we have $V_{G}=V_{\text{sw}}$. For the nMOS transistor we have $V_{S}=0$ and for the pMOS transistor $V_{S}=V_{\mathrm{DD}}$. Thus, we rewrite Eq. 12.22 as:

$$
\begin{equation*}
\frac{k_{n}}{2}\left(V_{\mathrm{sw}}-0-V_{T N}\right)^{2}=\frac{k_{p}}{2}\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}-V_{T P}\right)^{2} . \tag{12.23}
\end{equation*}
$$

In this task we have the additional simplifying condition that $k \equiv k_{n}=k_{p}$ and $V_{T} \equiv V_{T N}=-V_{T P}$. Thus, we have:

$$
\begin{equation*}
\frac{k}{2}\left(V_{\mathrm{sw}}-V_{T}\right)^{2}=\frac{k}{2}\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}+V_{T}\right)^{2} \tag{12.24}
\end{equation*}
$$

or

$$
\begin{equation*}
\left(V_{\mathrm{sw}}-V_{T}\right)^{2}=\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}+V_{T}\right)^{2} . \tag{12.25}
\end{equation*}
$$

In the next step one has to be careful, because the voltage that is squared for the nMOS transistor is positive while the voltage that is squared for the pMOS transistor is negative. So when we remove the squares we should keep the correct solutions of the two possibilities on each side. Finally we arrive at:

$$
\begin{equation*}
V_{\mathrm{sw}}-V_{T}=-\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}+V_{T}\right), \tag{12.26}
\end{equation*}
$$

which we rearrange to arrive at the solution:

$$
\begin{equation*}
V_{\mathrm{sw}}=\frac{V_{\mathrm{DD}}}{2} \tag{12.27}
\end{equation*}
$$

c) Let us reason about it. If the nMOS transistor gives four times as much current for the same effective gate voltage ( $V_{G T}$ ) then the point where the two transistors give the same current must happen at a lower input voltage than in b). Thus $V_{\mathrm{sw}}$ must be a bit lower than $\frac{V_{\mathrm{DD}}}{2}$. Because the currents are related to the square of the effective gate voltages the change in voltage will have to be related to the square root of the ratios between the current factors. We can write the current equation as:

$$
\begin{equation*}
\frac{k_{n}}{k_{p}} V_{G T n}^{2}=V_{G T p}^{2} \tag{12.28}
\end{equation*}
$$

From this equation we see that with a ratio of four between the current factors the ratio between the effective gate voltages has to be two. One way of seeing this is that of the available voltage range for effective gate voltages, $V_{\mathrm{DD}}-V_{T N}+V_{T P}$, one third is used by the nMOS transistor and two thirds by the pMOS transistor. Thus we have

$$
\begin{equation*}
V_{\mathrm{sw}}=V_{T N}+\frac{V_{\mathrm{DD}}-V_{T N}+V_{T P}}{3}=\frac{2 V_{T N}}{3}+\frac{V_{\mathrm{DD}}}{3}+\frac{V_{T P}}{3} \tag{12.29}
\end{equation*}
$$

A formal derivation is given under d) below.
d) To find $V_{\text {sw }}$ in the general case we return to Eq. 12.23 which we can rearrange to:

$$
\begin{equation*}
\frac{k_{n}}{k_{p}}\left(V_{\mathrm{sw}}-V_{T N}\right)^{2}=\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}-V_{T P}\right)^{2} . \tag{12.30}
\end{equation*}
$$

Again we are careful when removing the squares and thus arrive at:

$$
\begin{equation*}
\sqrt{\frac{k_{n}}{k_{p}}}\left(V_{\mathrm{sw}}-V_{T N}\right)=-\left(V_{\mathrm{sw}}-V_{\mathrm{DD}}-V_{T P}\right) . \tag{12.31}
\end{equation*}
$$

When we simplify we arrive at

$$
\begin{equation*}
V_{\mathrm{sw}}=\frac{V_{\mathrm{DD}}+V_{T P}+\sqrt{\frac{k_{n}}{k_{p}}} V_{T N}}{1+\sqrt{\frac{k_{n}}{k_{p}}}} \tag{12.32}
\end{equation*}
$$

Eq 12.32 is how the expression is most often written in textbooks. However, a slight rearrangement makes it much easier to remember and understand:

$$
\begin{equation*}
V_{\mathrm{sw}}=\frac{V_{\mathrm{DD}}+V_{T P}-V_{T N}}{1+\sqrt{\frac{k_{n}}{k_{p}}}}+V_{T N} \tag{12.33}
\end{equation*}
$$

From this formulation we see that our somewhat informal reasoning in task c) is confirmed by the formal derivation: with $k_{n}/k_{p}=4$ the denominator becomes 3, which reproduces Eq. 12.29.
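
To illustrate how Eq. 12.33 behaves, here is a minimal Python sketch. The supply and threshold voltages used in the example calls ($V_{\mathrm{DD}}=1.2$ V, $V_{TN}=0.3$ V, $V_{TP}=-0.3$ V) are assumptions chosen only for illustration, and the function name is our own.

```python
from math import sqrt


def switching_voltage(v_dd: float, v_tn: float, v_tp: float, k_n: float, k_p: float) -> float:
    """Inverter switching voltage according to Eq. 12.33.

    v_tp is the (negative) pMOS threshold voltage; k_n and k_p are the
    current factors of the nMOS and pMOS transistors.
    """
    ratio = sqrt(k_n / k_p)
    return (v_dd + v_tp - v_tn) / (1.0 + ratio) + v_tn


# Assumed example values, not taken from the problem text.
V_DD, V_TN, V_TP = 1.2, 0.3, -0.3

v_sw_symmetric = switching_voltage(V_DD, V_TN, V_TP, k_n=1.0, k_p=1.0)  # task b)
v_sw_strong_n = switching_voltage(V_DD, V_TN, V_TP, k_n=4.0, k_p=1.0)   # task c)

print(f"k_n = k_p:     V_sw = {v_sw_symmetric:.3f} V")
print(f"k_n = 4 * k_p: V_sw = {v_sw_strong_n:.3f} V")
```

With these assumed values the symmetric case lands at $V_{\mathrm{DD}}/2$, and the case with a four times stronger nMOS transistor lands one third of the way into the available effective-gate-voltage range, as argued in task c).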
Solution 5.4 Problem is on page 16.

## Solution will be added after prelab 1 is done.

Solution 5.5 Problem is on page 17.
a) X (red) is Y1, Y (green) is Y2, W (purple) is Y3 and the one that is not shown is Y4. With all inputs connected together we have four inverters with different p-to-n current ratios and thus different resistance ratios. The switching-voltage formula gives these results for the different ratios.
Y1: $R_N/R_P=2/1$, $x=k_N/k_P=2$, $V_{\text{sw}}=0.565$ V.
Y2: $R_N/R_P=1/2$, $x=k_N/k_P=0.5$, $V_{\text{sw}}=0.674$ V.
Y3: $R_N/R_P=2/9$, $x=k_N/k_P=2/9$, $V_{\text{sw}}=0.735$ V.
Y4: $R_N/R_P=1/8$, $x=k_N/k_P=0.125$, $V_{\text{sw}}=0.773$ V.
b) In Figure 12.13 below the two points that define the four voltages are shown. The resulting values are: for the high level, $M N H=V_{\text{OHmin}}-V_{\text{IHmin}}=1.19 \mathrm{~V}-0.832 \mathrm{~V}=0.358 \mathrm{~V}$; for the low level, $M N L=V_{\text{ILmax}}-V_{\text{OLmax}}=0.735 \mathrm{~V}-0.05 \mathrm{~V}=0.685 \mathrm{~V}$. As could be expected from the VTC, the two noise margins are far from equal, so this inverter is not that well designed.


Figure 12.12: Plot of the four inverter curves resulting when all four inputs are connected together in the multiinput gate in Figure 12.5.


Figure 12.13: VTC with the voltages needed to calculate the noise margins for output X indicated.

Solution 5.6 Problem is on page 17.

The expression for the discharging of a capacitor through a resistor is

$$
V_{c}=V_{o} e^{-t / R C}
$$

where $V_{c}$ is the voltage across the capacitance and $V_{o}$ is the starting voltage. Here we want to find t when $V_{c}=V_{o} / 2$. Thus we have

$$
\frac{V_{o}}{2}=V_{o} e^{-t / R C}
$$

which we can simplify to

$$
\frac{1}{2}=e^{-t / R C}
$$

This equation we can solve for $t$ by taking the logarithm of it:

$$
-\ln 2=\frac{-t}{R C}
$$

Thus, we find

$$
t=R C \ln 2
$$
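
The result can be checked numerically with a few lines of Python. The component values below ($R=2\,\mathrm{k}\Omega$, $C=1$ fF) are arbitrary assumptions; any positive values give the same conclusion.

```python
from math import exp, log


def half_swing_time(r_ohm: float, c_farad: float) -> float:
    """Time for an RC discharge to reach half the starting voltage: t = R*C*ln(2)."""
    return r_ohm * c_farad * log(2.0)


# Arbitrary example values (assumptions for illustration only).
R = 2e3      # ohm
C = 1e-15    # farad

t_half = half_swing_time(R, C)
# Insert t_half into V_c / V_o = exp(-t / RC) to confirm that the ratio is 1/2.
voltage_ratio = exp(-t_half / (R * C))

print(f"t = {t_half * 1e12:.3f} ps, V_c/V_o = {voltage_ratio:.3f}, ln 2 = {log(2.0):.3f}")
```

Since $\ln 2 \approx 0.69$, this is consistent with the factor 0.7 that appears in front of the RC products in the delay expressions of this chapter.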

Solution 5.7 Problem is on page 18.
a) An ideal inverter has no parasitic capacitance at its output.
b) This delay is called the fanout-of-one delay, FO1. For an ideal inverter it is:

$$
\begin{equation*}
\mathrm{FO} 1=0.7 R_{\mathrm{eff}} C_{\mathrm{IN}}=0.7 \frac{1.2 \mathrm{~V}}{500 \mu \mathrm{~A} / \mu \mathrm{m}} \times 3 \times 1.3 \mathrm{fF} / \mu \mathrm{m}=6.5 \mathrm{ps} . \tag{12.34}
\end{equation*}
$$

The factor 3 for the capacitance in the expression is due to the inverter input capacitance being three times as large as the gate capacitance of the nMOS transistor alone. Note that the delay is the same regardless of the size of the two inverters.

Solution 5.8 Problem is on page 18.
a) The fanout-of-four delay is given by this equation:

$$
\begin{equation*}
\mathrm{FO} 4=0.7 R C(4+p) . \tag{12.35}
\end{equation*}
$$

In the $0.35 \mu \mathrm{~m}$ process we find:

$$
\begin{equation*}
0.7 R C=0.7 \times 6 \mathrm{k} \Omega \mu \mathrm{~m} \times 6 \mathrm{fF} / \mu \mathrm{m}=25 \mathrm{ps} . \tag{12.36}
\end{equation*}
$$

With $p=1$ we thus arrive at $\mathrm{FO} 4=125 \mathrm{ps}$.
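
The same calculation can be sketched in Python. The helper name `fanout_delay_ps` is our own; the process values are those used in Eq. 12.36. Note that $\mathrm{k}\Omega \times \mathrm{fF} = \mathrm{ps}$, so no further unit conversion is needed.

```python
def fanout_delay_ps(r_kohm_um: float, c_fF_per_um: float, fanout: int, p: float) -> float:
    """Fanout-of-N delay in ps: 0.7 * R * C * (N + p), with R in kOhm*um and C in fF/um."""
    return 0.7 * r_kohm_um * c_fF_per_um * (fanout + p)


# Values for the 0.35 um process from Eq. 12.36, with p = 1.
fo4 = fanout_delay_ps(6.0, 6.0, fanout=4, p=1.0)
print(f"FO4 = {fo4:.1f} ps")
```

Without the intermediate rounding of $0.7RC$ to 25 ps the result is about 126 ps, which agrees with the value above to within the rounding.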
b)

Solution 5.9 Problem is on page 18.
a)

## Check which inverter we refer to here.

b) The resulting delay equation is then

$$
\begin{equation*}
t_{p d}=R^{\prime} C . \tag{12.37}
\end{equation*}
$$

Comment: When you only consider gate delays this simplification seems like a good idea (and it is the way it is done in Weste \& Harris) because it simplifies the delay equations. However, when we introduce wires with real physical resistances, it gets messy if one treats gates and wires differently. Therefore we do not do so in this course: we have to carry the 0.7 factor with us in delay calculations.

Solution 5.10 Problem is on page 19.
a) They must be equal, $k_{1}=k_{2}$ since $V_{s w}=V_{D D} / 2$.
b) It flips when both devices have the same gate voltage overdrive, i.e. $V_{I N}=V_{D D}-V_{B}=0.4 V_{D D}$.
c) The current through the PMOS load is given by:

$$
I_{D S}=\frac{k_{2}}{2}\left(V_{D D}-V_{B}+V_{T P}\right)^{2}=300 \mu \mathrm{~A} / \mathrm{V}^{2} \times(1-0.6-0.2)^{2} \times 1.2^{2} \mathrm{~V}^{2}=17 \mu \mathrm{~A}
$$

d) The region where both M1 and M2 are saturated is given by the saturation conditions, i.e.:
$V_{S w}-V_{T N}<V_{O U T}<V_{S w}-V_{T P}$, i.e. $0.3 V_{D D}<V_{O U T}<0.7 V_{D D}$.
e) $V_{S w}-V_{T N}<V_{O U T}<V_{B}-V_{T P}$, i.e. $0.2 V_{D D}<V_{O U T}<0.8 V_{D D}$.

Solution 5.11 Problem is on page 19.
a)
b)
c)

Solution 5.12 Problem is on page 20.

For simplicity we express the load capacitance that is to be driven as $X C=Y^{3} C$ where $C$ is the capacitance of the original inverter. Since we are only comparing delays, we will not include the 0.7 factor in our delay derivations here.

The first inverter has capacitance $C$ and resistance $R$ and we assume that its parasitic output capacitance is $p C$. The $R C$ product when this first inverter drives the load capacitance directly is

$$
R\left(p C+Y^{3} C\right)=R C\left(p+Y^{3}\right)
$$

With the buffer inserted we have three inverters. We scale each successive inverter by the same factor $Y$, which increases its capacitance by a factor $Y$ and decreases its resistance by a factor $Y$. Thus we arrive at the RC product

$$
R(p C+Y C)+\frac{R}{Y}\left(p Y C+Y^{2} C\right)+\frac{R}{Y^{2}}\left(p Y^{2} C+Y^{3} C\right)=3 R C(p+Y)
$$

So the delay is decreased if

$$
3(p+Y)<p+Y^{3} .
$$

We see that the exact value of $Y$, and thus of $X$, depends on how large the parasitic capacitance is. With no parasitic capacitance (which is totally unrealistic) the break-even point is $X=\sqrt{3}^{3} \approx 5.2$. With $p=1$ the answer is $X=8$.
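
The break-even point of the inequality above can also be found numerically. The sketch below uses plain bisection on $p+Y^{3}-3(p+Y)$; the function name and the search interval $1 \le Y \le 10$ are our own assumptions.

```python
def break_even_load(p: float, lo: float = 1.0, hi: float = 10.0, tol: float = 1e-12):
    """Return (Y, X = Y**3) where driving the load directly and via the buffer are equally fast.

    Solves p + Y**3 = 3 * (p + Y) by bisection. The difference is increasing
    for Y >= 1, so there is a single sign change in the assumed interval.
    """
    def difference(y: float) -> float:
        return (p + y ** 3) - 3.0 * (p + y)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if difference(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    y = 0.5 * (lo + hi)
    return y, y ** 3


for p in (0.0, 1.0):
    y, x = break_even_load(p)
    print(f"p = {p}: Y = {y:.3f}, X = {x:.2f}")
```

For loads larger than the returned $X$ the three-inverter chain is faster than the single inverter.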

Solution 5.13 Problem is on page 20.
a) It is possible to derive the optimal number of inverters, but in general it is not necessary, since we cannot use fractions of inverters. If we want to be close to a scaling factor of 4 , which we know is often the best choice, we should find the $N$ that makes $\sqrt[N]{1000}$ closest to 4 and then check $N-1$ and $N+1$. We find that $\sqrt[5]{1000} \approx 4$. So five inverters we believe to be the best choice. We should of course verify this conclusion. We assume that the parasitic delay of the inverter, pinv, is 1 . Then the normalized delay is

$$
D=N \sqrt[N]{1000}+N
$$

We check $D(4)$, $D(5)$ and $D(6)$. We find $D(4) \approx 26.5$, $D(5) \approx 24.9$ and $D(6) \approx 25.0$. So the delay difference is not that large, but $N=5$ gives the shortest delay.
b) With five inverters the optimum tapering factor is $\sqrt[5]{1000}=3.981 \approx 4$ as we had already computed in a).
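
The comparison in task a) is easy to reproduce with a short Python sketch. It assumes, as in the solution, a path effort of 1000 and $p_{\text{inv}}=1$; the function name and the range of $N$ that is scanned are our own choices.

```python
def buffer_delay(n_stages: int, path_effort: float = 1000.0, p_inv: float = 1.0) -> float:
    """Normalized delay D = N * F**(1/N) + N * p_inv of an N-stage inverter chain."""
    return n_stages * path_effort ** (1.0 / n_stages) + n_stages * p_inv


delays = {n: buffer_delay(n) for n in range(2, 9)}
best_n = min(delays, key=delays.get)

for n, d in delays.items():
    marker = "  <- shortest" if n == best_n else ""
    print(f"N = {n}: tapering factor = {1000.0 ** (1.0 / n):.3f}, D = {d:.2f}{marker}")
```

Scanning a few values of $N$ around the first guess, instead of deriving the optimum analytically, mirrors the reasoning in task a): only integer numbers of inverters are possible anyway.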

Solution 5.14 Problem is on page 20.
a) We know that for minimum delay each stage should have the same effort and that a stage effort, $f$, of 4 is good to minimize the delay. 1024 is $2^{10}$, which is also $4 \times 4 \times 4 \times 4 \times 4$, so 5 inverters in all and thus four inverters in the box is a good solution.
b) The delay is $p+g h$ or $p+f$. We have $f=4$ (see a) above) and $p_{\text{inv}}=0.5$ (from the problem statement). So the normalized delay, $d$, is $5 \times(4+0.5)=22.5$ and with $\tau=4 \mathrm{ps}$ we have a delay of $d \times \tau=22.5 \times 4 \mathrm{~ps}=90 \mathrm{ps}$.
e) BONUS QUESTION It would be better to remove one inverter than to add one inverter since the dynamic power will be lower while the delay would only be negligibly longer.

### 12.6 Delay for complex gates and paths

Solution 6.1 Problem is on page 21.
a) In this unusual CMOS process the pMOS transistor in the inverter has to be three times as wide as the nMOS transistor to give the same saturation current, that is, to have the same effective resistance. Such an inverter is shown in Figure 12.14(a).

Figure 12.14: (a) Inverter with 3-to-1 scaling for same resistance, (b) NAND2 gate with same scaling as inverter, (c) and (d), two scalings that make the worst-case nMOS and pMOS resistances the same.

Assuming that the oxide capacitance, $C_{\mathrm{ox}}$, is the same for nMOS and pMOS transistors (which we always assume), this means that when the inverter is sized to be electrically symmetric the pMOS transistor accounts for three fourths of the inverter input capacitance and the nMOS transistor for one fourth. It is then convenient to use:

$$
\begin{equation*}
C_{\text {inINV }}=3 C+1 C=4 C \tag{12.38}
\end{equation*}
$$

where $C$ is the capacitance for any type transistor with width $W$. The RC product for the inverter can then be written as

$$
\begin{equation*}
R C_{\mathrm{INV}}=R \times C_{\mathrm{inINV}}=R \times 4 C \tag{12.39}
\end{equation*}
$$

Note that we use $4 C$ just for our convenience. It is not necessary to do so.
The NAND2 gate is shown in Figure 12.14(b) with the same transistor widths and thus resistances as for the inverter in Figure 12.14(a). For the NAND2 gate in Figure 12.14(b) the worst-case resistance (that is, the highest resistance) for the n-net is $2 R$ while for the p-net it is $R$. Therefore, we must scale the nMOS transistors relative to the pMOS transistors. Figures 12.14(c) and (d) show two different ways of scaling. In Figure 12.14(c) the nMOS transistors are made wider so the worst-case resistance is $R$ also in the n-net. In Figure 12.14(d) the pMOS transistors are instead made narrower, which makes the worst-case resistance of the p-net also $2 R$. In both cases the p-to-n ratio is the same.

From Figure 12.14(c) we find

$$
\begin{equation*}
R C_{\mathrm{NAND} 2}=R_{\mathrm{NAND} 2} \times C_{\mathrm{inNAND} 2}=R \times(2 C+3 C)=5 R C \tag{12.40}
\end{equation*}
$$

From Figure 12.14(d) we find

$$
\begin{equation*}
R C_{\mathrm{NAND} 2}=R_{\mathrm{NAND} 2} \times C_{\mathrm{inNAND} 2}=2 R \times\left(C+\frac{3}{2} C\right)=5 R C \tag{12.41}
\end{equation*}
$$

So we see that the two scalings give the same RC product for the NAND2 gate.
The logical effort is defined as:

$$
\begin{equation*}
g=\frac{R_{g a t e} \times C_{\text {gateinput }}}{R_{\text {inv }} \times C_{\mathrm{inv}}} \tag{12.42}
\end{equation*}
$$

In this case we get $g_{\text{NAND2}}=5/4$, which holds for both inputs since they have the same transistor widths and thus the same gate capacitance. If the resistances are the same in the inverter and in the complex gate, Equation 12.42 simplifies to:

$$
\begin{equation*}
g=\frac{C_{\text {gateinput }}}{C_{\text {inv }}} \tag{12.43}
\end{equation*}
$$

The parasitic delay, $p$, is the part of the delay that does not depend on the load capacitance, that is the constant, or internal part. For a static CMOS gate that is the part of the delay that is due to the parasitic capacitances at the drains of the output transistors. The definition for $p$ is:

$$
\begin{equation*}
p=\frac{R_{\text {gate }} \times C_{\text {parasitic }}}{R_{\mathrm{inv}} \times C_{\mathrm{inv}}} \tag{12.44}
\end{equation*}
$$

In this case the problem states that $p_{\text{inv}}=0.5$, which means that the transistor output capacitances are half of the input capacitances. As for the logical effort, the equation can be simplified if the resistances are the same. Thus, we arrive at:

$$
\begin{equation*}
p=\frac{C_{\text {parasitic }}}{C_{i n v}}=\frac{p_{\text {inv }} C_{\text {gate -to-output }}}{C_{i n v}} \tag{12.45}
\end{equation*}
$$

where $C_{\text {gate-to-output }}$ is the gate capacitance of all transistors connected to the output node. For the NAND2 gate in this strange process we find

$$
\begin{equation*}
p_{\mathrm{NAND} 2}=\frac{0.5 \times(3+3+2) C}{4 C}=1 \tag{12.46}
\end{equation*}
$$

For the NOR2 gate we just give the answers:

$$
\begin{gather*}
g_{\text {NOR2}}=\frac{C+6 C}{4 C}=\frac{7}{4}  \tag{12.47}\\
p_{\text {NOR } 2}=\frac{0.5 \times(6+1+1) C}{4 C}=1 \tag{12.48}
\end{gather*}
$$

b) In task a) we found that the NAND2 gate in this strange process has $g_{N A N D 2}=5 / 4$ and $p_{N A N D 2}=1$. The normalized delay is defined as:

$$
\begin{equation*}
d=g \times h+p, \tag{12.49}
\end{equation*}
$$

where $h$ is the electrical effort which can only be known when a load is connected to the gate. The electrical effort is defined as:

$$
\begin{equation*}
h=\frac{C_{\text {load }}}{C_{\text {in }}} \tag{12.50}
\end{equation*}
$$

With the scaling stated in the problem, we have the inverter input capacitance, which is the capacitance loading the NAND2 gate, as $6 C+2 C=8 C$, whereas the input capacitance for both inputs of the NAND2 gate is $3 C+2 C=5 C$. Consequently the normalized delay for the NAND2 gate with these transistor sizes is

$$
\begin{equation*}
d_{N A N D 2}=\frac{5}{4} \times \frac{8 C}{5 C}+1=3 \tag{12.51}
\end{equation*}
$$

Note that the normalized delay is expressed in units of $\tau$, so one has to multiply with $\tau$ to find the delay in seconds. For this process we do not know the value of $\tau$.
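
A minimal Python sketch of Eqs. 12.49 and 12.50, using exact fractions so that no rounding enters. The function name `stage_delay` is our own; capacitances are given in units of $C$.

```python
from fractions import Fraction


def stage_delay(g: Fraction, c_load: int, c_in: int, p: Fraction) -> Fraction:
    """Normalized stage delay d = g * h + p with electrical effort h = C_load / C_in."""
    h = Fraction(c_load, c_in)
    return g * h + p


# NAND2 gate (g = 5/4, p = 1, C_in = 5C) loaded by an inverter with C_in = 8C.
d_nand2 = stage_delay(Fraction(5, 4), c_load=8, c_in=5, p=Fraction(1))
print(f"d_NAND2 = {d_nand2} (in units of tau)")
```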

Solution 6.2 Problem is on page 21.

We use the scaling that gives the same effective resistance as for the inverter, which has width 2 for the pMOS transistor and 1 for the nMOS transistor.
a) In this gate all nMOS transistors have width 2 . The pMOS transistors connected to $\mathrm{A}, \mathrm{B}$ and C have widths 6 , whereas the one connected to D has width 2 . Then we find, for $\mathrm{A}, \mathrm{B}$ and C inputs:

$$
\begin{equation*}
g_{A, B, C}=\frac{6 C+2 C}{3 C}=\frac{8}{3} \tag{12.52}
\end{equation*}
$$

For the D input:

$$
\begin{equation*}
g_{D}=\frac{2 C+2 C}{3 C}=\frac{4}{3} \tag{12.53}
\end{equation*}
$$

In the problem it was not specified what $p_{\text {inv }}$ is, so we express the parasitic delay in pinv:

$$
\begin{equation*}
p=p_{\mathrm{inv}} \frac{(6+2+2) C}{3 C}=\frac{10}{3} p_{\mathrm{inv}} \tag{12.54}
\end{equation*}
$$

b) For the second gate all pMOS transistors have width 4, the two nMOS transistors in series to the left have width 2 and all other nMOS transistors have width 3. For the $A_{2}$ and $B_{2}$ inputs:

$$
\begin{equation*}
g_{A_{2}, B_{2}}=\frac{2 C+3 C+4 C+4 C}{3 C}=\frac{13}{3} \tag{12.55}
\end{equation*}
$$

For the $A_{1}$ and $B_{1}$ inputs:

$$
\begin{equation*}
g_{A_{1}, B_{1}}=\frac{4 C+3 C}{3 C}=\frac{7}{3} \tag{12.56}
\end{equation*}
$$

In the problem it was not specified what $p_{\text {inv }}$ is, so we express the parasitic delay in pinv:

$$
\begin{equation*}
p=p_{\text {inv }} \frac{(4+4+4+2+3) C}{3 C}=\frac{17}{3} p_{\text {inv }} \tag{12.57}
\end{equation*}
$$

Solution 6.3 Problem is on page 21.

There are many possibilities. The requirement stated in the problem is that the worst-case n-net resistance and p-net resistance should be the same for each of the four gates, but there is no requirement that they should all be the same (that is probably impossible). One solution is to make all the nMOS transistors the same width, which we call W, and then scale the pMOS transistors to match. We assume that such an nMOS transistor has the resistance R. Then the n-net worst-case resistance is R for output $Y_{1}$ (inverter), 2R for output $Y_{2}$ (NAND2), 3R for output $Y_{3}$ (NAND3), and 4R for output $Y_{4}$ (NAND4). Assuming that the pMOS transistors are half as strong as the nMOS transistors, we must then use the widths 2W for transistor 161, W for the transistors numbered 15X, 2W/3 for the pMOS transistors numbered 14X and W/2 for the pMOS transistors numbered 13X.

Solution 6.4 Problem is on page 21.
a) For the NAND3 gate the logical effort, $g_{\mathrm{NAND} 3}$, is $5 / 3$ for all three inputs. With $p_{\text {inv }}=1$, the parasitic delay, $p_{\text {NAND }}$, is 3 .
The normalized delay for one stage can be written as:

$$
\begin{equation*}
d=g \times h+p . \tag{12.58}
\end{equation*}
$$

From Figure 6.2 we see that the electrical effort with branching for the 2-input NAND gates is:

$$
\begin{equation*}
h_{N A N D 2}=\frac{3 \times C_{\text {inNAND } 3}}{C_{\text {inNAND2 }}}=\frac{3 \times 8 C}{8 C}=3 . \tag{12.59}
\end{equation*}
$$

And for the 3-input NAND gates the electrical effort is:

$$
\begin{equation*}
h_{N A N D 3}=\frac{2 \times C_{i n N O R 2}}{C_{i n N A N D 3}}=\frac{2 \times 16 C}{8 C}=4 . \tag{12.60}
\end{equation*}
$$

And for the last stages, the NOR2 gates, we have:

$$
\begin{equation*}
h_{N O R 2}=\frac{C_{\mathrm{LOAD}}}{C_{\text {inNOR2 }}}=\frac{45 C}{16 C}=\frac{45}{16} \approx 2.8 \tag{12.61}
\end{equation*}
$$

To find the total normalized delay we can just sum up the normalized delay for each stage, which we now number, 1, 2, 3 from the input for simplicity:

$$
\begin{equation*}
d_{\mathrm{tot}}=d_{1}+d_{2}+d_{3}=\frac{4}{3} \times 3+2+\frac{5}{3} \times 4+3+\frac{5}{3} \times 2.8+2 \approx 22 \frac{1}{3} \tag{12.62}
\end{equation*}
$$

b) We already have the logical efforts and the parasitic delays for all gates from the solution to the previous problem. To find the best scaling we need to calculate the path effort:

$$
\begin{equation*}
F=G \times H \times B . \tag{12.63}
\end{equation*}
$$

We have the path logical effort:

$$
\begin{equation*}
G=g_{1} \times g_{2} \times g_{3}=\frac{4}{3} \times \frac{5}{3} \times \frac{5}{3}, \tag{12.64}
\end{equation*}
$$

the path electrical effort:

$$
\begin{equation*}
H=\frac{C_{\text {out-for-path }}}{C_{\text {in-for-path }}}=\frac{45 C}{8 C}=\frac{45}{8} \tag{12.65}
\end{equation*}
$$

and finally the path branching effort:

$$
\begin{equation*}
B=b_{1} \times b_{2}, \tag{12.66}
\end{equation*}
$$

where branching effort for each stage is defined as:

$$
\begin{equation*}
b_{i}=\frac{C_{\text {onpath }}+C_{\text {offpath }}}{C_{\text {onpath }}} . \tag{12.67}
\end{equation*}
$$

Note that one usually does not have a branching effort for the last stage since all capacitance is on the path for that stage. So in this problem we have

$$
\begin{equation*}
b_{1}=\frac{8 C+2 \times 8 C}{8 C}=3, \tag{12.68}
\end{equation*}
$$

and

$$
\begin{equation*}
b_{2}=\frac{16 C+16 C}{16 C}=2 . \tag{12.69}
\end{equation*}
$$

Thus, we arrive at

$$
\begin{equation*}
B=3 \times 2=6 . \tag{12.70}
\end{equation*}
$$

All in all we find

$$
\begin{equation*}
F=G \times H \times B=\frac{4}{3} \times \frac{5}{3} \times \frac{5}{3} \times \frac{45}{8} \times 6=5^{3} \tag{12.71}
\end{equation*}
$$

We know that the optimum is when the stage effort in each stage is the same, and here it is obvious that $f_{\text {opt }}=5$ since we have three stages. Once we have $f_{\text {opt }}$ we can immediately find the optimum delay as

$$
\begin{equation*}
D=N \times f_{\mathrm{opt}}+\sum p_{i} \tag{12.72}
\end{equation*}
$$

which in this problem is:

$$
\begin{equation*}
D=3 \times 5+7=22 \tag{12.73}
\end{equation*}
$$

The sizes (input capacitances) for the three stages can be found starting either from the input or from the output of the path. Here we choose the output. For the third stage we have with $f_{\text {opt }}=5$

$$
\begin{equation*}
5=\frac{5}{3} \times \frac{45 C}{C_{\mathrm{in} 3}}, \tag{12.74}
\end{equation*}
$$

which results in $C_{\mathrm{in} 3}=15 C$. Similarly for the second stage we have:

$$
\begin{equation*}
5=\frac{5}{3} \times \frac{2 \times 15 C}{C_{\mathrm{in} 2}}, \tag{12.75}
\end{equation*}
$$

which gives us $C_{\mathrm{in} 2}=10 C$. The first stage in a path is not scaled; however, it is good practice to check that the effort for that stage also becomes $f_{\text {opt }}$ when we put in the numbers. For our first stage we find:

$$
\begin{equation*}
f_{1}=\frac{4}{3} \times \frac{3 \times 10 C}{8 C}=5 \tag{12.76}
\end{equation*}
$$
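
The whole procedure in task b) (path effort, optimum stage effort, delay and backward sizing) can be summarized in a short Python sketch. The variable names are our own; the logical efforts, parasitic delays, branching factors and capacitances are the ones used above, with capacitances in units of $C$.

```python
from fractions import Fraction

# Stage data from input to output: NAND2, NAND3, NOR2.
g = [Fraction(4, 3), Fraction(5, 3), Fraction(5, 3)]   # logical efforts
p = [2, 3, 2]                                          # parasitic delays
b = [3, 2, 1]                                          # branching (none at the last stage)
c_in_path = Fraction(8)                                # input capacitance of the first stage
c_load = Fraction(45)                                  # load capacitance of the path

G = g[0] * g[1] * g[2]
B = b[0] * b[1] * b[2]
H = c_load / c_in_path
path_effort = G * B * H

f_opt = float(path_effort) ** (1.0 / len(g))
delay = len(g) * f_opt + sum(p)

# Size the stages from the output towards the input: C_in = g * b * C_out / f_opt.
sizes = []
c_out = float(c_load)
for g_i, b_i in zip(reversed(g), reversed(b)):
    c_in = float(g_i) * b_i * c_out / f_opt
    sizes.append(c_in)
    c_out = c_in
sizes.reverse()  # now ordered from first to last stage

print(f"F = {path_effort}, f_opt = {f_opt:.3f}, D = {delay:.2f}")
print("input capacitances (in C):", [round(s, 2) for s in sizes])
```

The first entry of `sizes` serves as the consistency check mentioned above: it should reproduce the given input capacitance of the first stage.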

Solution 6.5 Problem is on page 22.
a) Below is the detailed schematic of the decoder circuitry with the resulting sizes for task a).


This is a problem where path delay is applicable. The driver inverters each drive eight wires connected to eight of the nand gates (to half of the 16 words). The load capacitance for the z inverter is $32 \times 3 C=96 C$. The logical effort for the 4 -input nand gate is 2 (which one may have to arrive at from the circuit schematics, though not included here). So for the path we have $G=1 \times 2 \times 1, B=8 \times 1 \times 1$ and $H=\frac{96 C}{10 C}=9.6$. All in all the path effort, $F=G B H$ is $2 \times 8 \times 9.6=153.6$. The optimal stage effort is then $f_{\text {opt }}=\sqrt[3]{153.6}=5.36$. We can find the sizing by starting from the output or the input, but usually it is easier to start from the output. We use the relation $f_{\text {opt }}=g h$ for each gate and determine the input capacitance that makes this relation true. Inverter z should accordingly have an input capacitance $C_{\mathrm{inz}}=\frac{96 C}{5.36}=17.9 C$. The 4-input nand gate should have an input capacitance of $C_{\text {iny }}=\frac{2 \cdot 17.9 \mathrm{C}}{5.36}=6.67 \mathrm{C}$.

We should also check that the relationship holds for the first inverter: $C_{\text {inx }}=\frac{8 \cdot 6.67 C}{5.36}=9.95 C$. It is not exactly $10 C$ because there has been some rounding in the calculations, but it is close enough to convince us that we did not make any calculation mistake.
b) The normalized delay $D=3 f_{\text {opt }}+\sum p$. The inverter has a parasitic delay of $p_{i n v}=1$. But the part we do not know is the parasitic delay of the 4 -input nand gate. From the schematic we deduce that the parasitic delay for a scaled 4 -input nand gate is 4 (this calculation is not included here). Thus we arrive at $D=3 \times 5.36+1+4+1=22.08$. In the 65 nm process which has $\tau=5 \mathrm{ps}$ we would thus have a delay of 110 ps .
c) The difference from the case in task a) is that the word capacitance is doubled, which doubles $H$. In this case we have $F=G B H=2 \times 8 \times 19.2=307.2$. With four stages we get $f_{\text{opt}}=\sqrt[4]{307.2}=4.18$. The resulting delay with parasitics is then $D=4 \times 4.18+1+4+1+1=23.72$. With three stages we instead have $f_{\text{opt}}=\sqrt[3]{307.2}=6.74$ and a total delay of $D=3 \times 6.74+1+4+1=26.22$. So the answer is yes, the delay will be shorter with an extra inverter.
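
The comparison between three and four stages in task c) can be reproduced with the following Python sketch. The function name is our own; the path effort and the parasitic delays are those used in the solution. Small differences in the last digit compared with the numbers above come from rounding $f_{\text{opt}}$ before multiplying.

```python
def path_delay(path_effort: float, parasitics: list) -> tuple:
    """Return (f_opt, D) for a path with one stage per entry in `parasitics`."""
    n_stages = len(parasitics)
    f_opt = path_effort ** (1.0 / n_stages)
    return f_opt, n_stages * f_opt + sum(parasitics)


F_doubled = 2 * 8 * 19.2  # G * B * H with the doubled word capacitance

f3, d3 = path_delay(F_doubled, [1.0, 4.0, 1.0])        # inverter, NAND4, inverter
f4, d4 = path_delay(F_doubled, [1.0, 4.0, 1.0, 1.0])   # one extra inverter

print(f"3 stages: f_opt = {f3:.2f}, D = {d3:.2f}")
print(f"4 stages: f_opt = {f4:.2f}, D = {d4:.2f}")
print("extra inverter helps:", d4 < d3)
```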

### 12.7 Wire delay

Solution 7.1 Problem is on page 25.
a) Sum the currents into (or out of) each circuit node to find the nodal equations:

$$
\begin{align*}
& 0=\frac{v_{1}(t)-V_{S}}{R_{1}}+\frac{v_{1}(t)-v_{2}(t)}{R_{2}}+C_{1} \frac{d v_{1}(t)}{d t}  \tag{12.77}\\
& 0=\frac{v_{2}(t)-v_{1}(t)}{R_{2}}+C_{2} \frac{d v_{2}(t)}{d t} \tag{12.78}
\end{align*}
$$

where $V_{S}$ is the source voltage.
b)
c)

Solution 7.2 Problem is on page 25.
a) The wire capacitance is:

$$
\begin{equation*}
C_{w}=0.2 \mu \mathrm{~m} \times 25 \mu \mathrm{~m} \times 0.4 \mathrm{fF} / \mu \mathrm{m}^{2}=2 \mathrm{fF} \tag{12.79}
\end{equation*}
$$

The wire resistance is:

$$
\begin{equation*}
R_{w}=\frac{25 \mu \mathrm{~m}}{0.2 \mu \mathrm{~m}} \times 0.2 \Omega / \square=25 \Omega . \tag{12.80}
\end{equation*}
$$

b) For the wire that has a width of 100 nm we have

$$
\begin{equation*}
c=0.1 \mu \mathrm{~m} \times 0.4 \mathrm{fF} / \mu \mathrm{m}^{2}=0.04 \mathrm{fF} / \mu \mathrm{m}, \tag{12.81}
\end{equation*}
$$

and

$$
\begin{equation*}
r=\frac{1 \mu \mathrm{~m}}{0.1 \mu \mathrm{~m}} \times 0.2 \Omega / \square=2 \Omega / \mu \mathrm{m} \tag{12.82}
\end{equation*}
$$

The critical length for repeater insertion with $p_{i n v}=1$ is:

$$
\begin{equation*}
L_{\text {crit }}=2 \sqrt{\frac{R C}{r c}} \tag{12.83}
\end{equation*}
$$

which we can also write as

$$
\begin{equation*}
L_{\mathrm{crit}}=2 \sqrt{\frac{t_{\mathrm{rep}}}{r c}} \tag{12.84}
\end{equation*}
$$

Thus, we find:

$$
\begin{equation*}
L_{\text {crit }}=2 \sqrt{\frac{4.6 \mathrm{ps}}{0.08 \mathrm{fs} / \mu \mathrm{m}^{2}}}=2 \sqrt{\frac{4600 \mathrm{fs}}{0.08 \mathrm{fs} / \mu \mathrm{m}^{2}}}=480 \mu \mathrm{~m} \tag{12.85}
\end{equation*}
$$
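
A small Python sketch of Eq. 12.84 makes the unit handling explicit: with $r$ in $\Omega/\mu\mathrm{m}$ and $c$ in $\mathrm{fF}/\mu\mathrm{m}$ the product $rc$ comes out in $\mathrm{fs}/\mu\mathrm{m}^{2}$, so $t_{\text{rep}}$ must be given in fs. The function name is our own.

```python
from math import sqrt


def critical_length_um(t_rep_fs: float, r_ohm_per_um: float, c_fF_per_um: float) -> float:
    """Critical wire length L_crit = 2 * sqrt(t_rep / (r * c)) in micrometres."""
    rc_fs_per_um2 = r_ohm_per_um * c_fF_per_um  # ohm * fF = fs
    return 2.0 * sqrt(t_rep_fs / rc_fs_per_um2)


# Values from task b): t_rep = 4.6 ps = 4600 fs, r = 2 ohm/um, c = 0.04 fF/um.
l_crit = critical_length_um(4600.0, 2.0, 0.04)
print(f"L_crit = {l_crit:.0f} um")
```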

Solution 7.3 Problem is on page 25.
a) Answer: $R=1 \mathrm{k} \Omega$, $C=650 \mathrm{fF}$, calculated from $r=100 \Omega / \mathrm{mm}$ and $c=65 \mathrm{fF} / \mathrm{mm}$.
b) Answer:

$$
\begin{equation*}
2 \mathrm{k} \Omega \times(3.25+650+3.25) \mathrm{fF}+1 \mathrm{k} \Omega \times(325+3.25) \mathrm{fF}=1.64 \mathrm{~ns} \tag{12.86}
\end{equation*}
$$

c) Answer: $R_{\text {rep }} C_{\text {rep }}=6.5 \mathrm{ps}, r c=6.5 \mathrm{ps} / \mathrm{mm}^{2}$. Together they give $L_{\text {crit }}=2 \sqrt{\frac{R_{\text {rep }} C_{\text {rep }}}{r c}}=2 \sqrt{\frac{6.5}{6.5}}=2 \mathrm{~mm}$.

Solution 7.4 Problem is on page 26.
a) The sketch is shown below:


Note that the input of the inverter is not connected to the inverter output. Here we have not even drawn the capacitance at the inverter input since it does not influence the delay at all.
b) The RC product is:

$$
\begin{equation*}
\tau=R \times 10 C+4 R \times 5 C=30 R C \tag{12.87}
\end{equation*}
$$

Thus, the propagation delay is $0.7 \times 30 R C$.
c) The higher receiver input capacitance results in an RC product of:

$$
\begin{equation*}
\tau=R \times 11 C+4 R \times 6 C=35 R C \tag{12.88}
\end{equation*}
$$

Thus, there is an increase in the propagation delay of $0.7 \times 5 R C$.
d) A 2-input NAND gate has $g=4/3$ and $p=2$. That means that its parasitic capacitance is $3/2$ times its input capacitance (if $p_{\text{inv}}$ is 1, which is the case in this problem). Thus, in this case the parasitic capacitance of the NAND gate is $3C$. When we are driving wires we also need to calculate the resistance of the gate explicitly. We know from the definition that $g=\frac{(R C)_{\text{gate}}}{(R C)_{\text{inv}}}$. In this case we thus find:

$$
\begin{equation*}
\frac{R_{N A N D 2}}{R_{i n v}}=g \times \frac{C_{i n v}}{C_{N A N D 2}}=4 / 3 \times 1 / 2=2 / 3 . \tag{12.89}
\end{equation*}
$$

Thus, the new RC product is:

$$
\begin{equation*}
\frac{2}{3} R \times 12 C+4 R \times 5 C=8 R C+20 R C=28 R C \tag{12.90}
\end{equation*}
$$

In this case there is a decrease in the propagation delay of $0.7 \times 2 R C$.
e) In the Elmore delay model a branch contributes its entire capacitance multiplied with the resistance up to the point where the branch connects to the main path. In this case the added capacitance is $2 C$ and the resistance to the midpoint is $R+2 R$. Thus, the result of adding the branch is an increase in the propagation delay of $0.7 \times 6 R C$.
f) We denote the RC product for the inverter with $t_{\text {inv }}$ since it is a constant. Then we have $R=\frac{t_{\text {inv }}}{x C}$ and $C=\frac{x t_{\text {inv }}}{R}$ where $x$ is the size of the inverter. The general RC expression for the delay is then

$$
\begin{equation*}
\tau=\frac{t_{\mathrm{inv}}}{x C}\left(\frac{2 x t_{\mathrm{inv}}}{R}+C_{w}\right)+R_{w}\left(\frac{x t_{\mathrm{inv}}}{R}+\frac{C_{w}}{2}\right) \tag{12.91}
\end{equation*}
$$

which we can simplify to

$$
\begin{equation*}
\tau=\frac{2 t_{\mathrm{inv}}^{2}}{R C}+\frac{t_{\mathrm{inv}} C_{w}}{x C}+\frac{x t_{\mathrm{inv}} R_{w}}{R}+\frac{R_{w} C_{w}}{2} . \tag{12.92}
\end{equation*}
$$

Only the two middle terms depend on $x$. We find the minimum for

$$
\begin{equation*}
x_{\mathrm{opt}}=\sqrt{\frac{R C_{w}}{R_{w} C}} \tag{12.93}
\end{equation*}
$$

Note: In this kind of problem in an exam it would not be required to derive the expression for the optimum size if you remember it, but deriving it may actually be easier than memorizing it.
In this particular case we have

$$
\begin{equation*}
x_{\mathrm{opt}}=\sqrt{\frac{R \times 8 C}{4 R \times C}}=\sqrt{2} \tag{12.94}
\end{equation*}
$$

So the size of the inverter should be such that its resistance is $\frac{R}{\sqrt{2}}$ and its capacitance is $\sqrt{2} C$. The RC product for the wire and driver is then,

$$
\begin{equation*}
\tau=2 R C+4 \sqrt{2} R C+4 \sqrt{2} R C+16 R C \approx 29.3 R C, \tag{12.95}
\end{equation*}
$$

and the propagation delay is $0.7 \times 29.3 R C$.
So we did not gain much by optimizing the original setup.
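The RC products in parts b) to f) can be checked with the sketch below. It works in normalized units ($R=C=1$). The component values are assumptions inferred from Equations 12.87 and 12.94 (wire $4R$ and $8C$, receiver $C$, inverter parasitic capacitance $C$), and the helper names are hypothetical.

```python
import math
from fractions import Fraction as F

# Assumed normalised values (R = C = 1), inferred from Eqs. 12.87 and 12.94
R_W, C_W = F(4), F(8)

def rc_product(r_drv, c_par, c_load):
    """Driver + pi-model wire + receiver, in units of RC."""
    return r_drv * (c_par + C_W + c_load) + R_W * (C_W / 2 + c_load)

tau_b = rc_product(F(1), 1, 1)       # b) inverter driver
tau_c = rc_product(F(1), 1, 2)       # c) receiver input capacitance 2C
tau_d = rc_product(F(2, 3), 3, 1)    # d) NAND2 driver: 2R/3, parasitic 3C
branch_e = (1 + 2) * 2               # e) (R + 2R) * 2C added by the branch

def tau_scaled(x):
    """Equation 12.92 with t_inv = RC and R = C = 1."""
    return 2 + float(C_W) / x + x * float(R_W) + float(R_W * C_W) / 2

x_opt = math.sqrt(float(C_W / R_W))  # Equation 12.93
print(tau_b, tau_c, tau_d, branch_e, round(tau_scaled(x_opt), 1))
```

Evaluating `tau_scaled(1.0)` recovers the unscaled value $30RC$ from part b), which is a useful sanity check on Equation 12.92 before optimizing.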

Solution 7.5 Problem is on page 26.


Figure 12.15: Clock-distribution network with a driver driving three receivers over different wires with the three paths indicated. The input capacitance of the NAND gates is 2C.
a) In this circuit there are three paths to the clock gaters A, B and C (red, blue and green in Figure 12.15).

We need to model the wire segments in these paths using the pi model and calculate the wire delay using Elmore's model. However, the clock skew is the difference in delay. Those parts of the delays that are the same for all three paths we do not have to calculate. The Elmore delay can be calculated as the delay due to the main path plus the delay due to branches. In this circuit the main-path wire segments to the inputs $\mathrm{A}$, $\mathrm{B}$, and $\mathrm{C}$ all have resistance $R$ and capacitance $6C$, and the input capacitances are also all the same: $2C$. Thus, we do not have to calculate the main-path delay. However, the branch delays differ. There are three branches and each path incorporates two of them. Remember that in Elmore's model the entire capacitance of a branch should be multiplied with the resistance up to the point where the branch starts.

Branch RC delay due to the red branch: $b d_{A}=\left(R+\frac{R}{3}\right) \times 6 C=\frac{24}{3} R C$

Branch RC delay due to the blue and green branches: $b d_{B}=b d_{C}=\left(R+\frac{2 R}{3}\right) \times 4 C=\frac{20}{3} R C$

Total branch RC delay to node A: $b d_{B}+b d_{C}=2 \times \frac{20}{3} R C=\frac{40}{3} R C$

Total branch RC delay to nodes B and C, which is the same for both: $\left(\frac{24}{3}+\frac{20}{3}\right) R C=\frac{44}{3} R C$

So the clock skew is $0.7 \times \frac{4}{3} R C$ and the delays to clock gaters B and C are the longest ones, whereas the delay to gater A is shorter.
b) The difference in load capacitance between clock gaters B and C is what will cause clock skew between the outputs of gaters B and C since they are scaled the same. That difference is 2C. The delay to the gate inputs will not change.

Since there are no wires to account for at the outputs of the NAND gates, the delay can be calculated using logical effort:

$$
\begin{equation*}
\tau_{\text {skew }}=g_{\mathrm{NAND} 2} \times\left(h_{\mathrm{C}}-h_{\mathrm{B}}\right) \tag{12.96}
\end{equation*}
$$

where we were careful to take the difference in the order that makes the skew positive. Equation 12.96 evaluates to $4 / 3$ because we have:

$$
\begin{equation*}
h_{\mathrm{C}}-h_{\mathrm{B}}=\frac{8 C}{2 C}-\frac{6 C}{2 C}=\frac{2 C}{2 C}=1 \tag{12.97}
\end{equation*}
$$

We can also calculate R explicitly which is done below.
Gaters A, B and C are identical. The logical effort, $g$, is defined as "the ratio of the input capacitance to that of an inverter that can drive the same current" or, in other words, the same effective resistance. That is, $gC$ corresponds to $R$. Here we have that $C_{\text{NAND}}=2 C=g \frac{3}{2} C$ and thus $R_{\text{NAND}}=\frac{2}{3} R$. So the resulting clock skew is $0.7 \times \frac{2}{3} R \times 2 C=0.7 \times \frac{4}{3} R C$.
ADDITION: We did not have to calculate the main path delay, but it could be of interest to do so anyway for practice. To do so we need only one wire segment with resistance $R$ and capacitance 6 C . With the usual model for a driver that drives a receiver over a wire we find the RC product to be:

$$
\begin{equation*}
R \times(C+3 C)+2 R \times(3 C+2 C)=14 R C \tag{12.98}
\end{equation*}
$$

Thus, the resulting delay for the main path is $t_{p d}=0.7 \times 14 R C$.
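The branch-delay bookkeeping in part a) and the logical-effort skew in part b) are easy to get wrong by hand, so here is a small sketch that redoes them with exact fractions. All quantities are normalized ($R=C=1$ in part a, units of $\tau$ in part b); the variable names are our own.

```python
from fractions import Fraction as F

# Part a): branch RC products in units of RC
bd_red = (1 + F(1, 3)) * 6     # (R + R/3) * 6C
bd_blue = (1 + F(2, 3)) * 4    # (R + 2R/3) * 4C
bd_green = bd_blue

# Each path sees the two branches that are not on its own main path
branch_delay = {
    "A": bd_blue + bd_green,
    "B": bd_red + bd_green,
    "C": bd_red + bd_blue,
}
skew_a = max(branch_delay.values()) - min(branch_delay.values())

# Part b): skew between gater outputs B and C from logical effort
g_nand2 = F(4, 3)
skew_b = g_nand2 * (F(8, 2) - F(6, 2))

print(branch_delay, skew_a, skew_b)
```

Both skews come out as $4/3$, but note that `skew_a` is in units of $RC$ (to be multiplied by 0.7) while `skew_b` is a normalized delay in units of $\tau$.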

Solution 7.6 Problem is on page 26.

There are two parts in this circuit as is shown in the figure below:
For the second part we use a collapsed tree in this solution, but one can also use Elmore branches.
a) Elmore for the first part: $100 \Omega \times(72+100+100+36) \mathrm{fF}+800 \Omega \times(100+36) \mathrm{fF}=139.6 \mathrm{ps}$. Elmore for the second part: $200 \Omega \times(36+100+100+36) \mathrm{fF}+200 \Omega \times(100+36) \mathrm{fF}=3 \times 200 \Omega \times 136 \mathrm{fF}=81.6 \mathrm{ps}$. All in all the propagation delay becomes $t_{p d}=0.7 \times(139.6+81.6) \mathrm{ps}=\mathbf{155}\ \mathbf{ps}$.
b) We know that for the 2-input NAND gate $g_{\text{NAND2}}$ is $4/3$ and $p_{\text{NAND2}}$ is 2 (otherwise we could derive these numbers). The path effort from A to B is then $F=G \times H=4/3 \times 72/1.5=64$. For minimum delay all stage efforts should be the same. In this case all stage efforts, $f$, should be 4 since $4 \times 4 \times 4=64$. To find the inverter sizes we can start from the output or the input of the path. From the input we have $4/3 \times h_{\text{NAND}}=4 \Rightarrow h_{\text{NAND}}=3$ so the input capacitance of the first inverter in the buffer should be $\mathbf{4.5}\ \mathbf{fF}$. The input of the second inverter should be $4 \times 4.5 \mathrm{fF}=\mathbf{18}\ \mathbf{fF}$. These capacitances correspond to drive strengths $200/4=\mathbf{50X}$ for the second inverter and $50/4=\mathbf{12.5X}$ for the first inverter.

c) The resulting normalized delay, $d$, is $4+4+4+P$ where $P$ is the sum of the parasitic delays for the three gates. We have $P=2+1+1=4$. Here we use our prior knowledge that the 2 -input NAND gate has $p=2$, but we could also derive it, if we did not remember. Thus, we have $d=16$. To calculate the delay in seconds we also need $\tau$. But what is $\tau$ in this process? We had better check that too. It is $0.7 R C=0.7 \times 72 \mathrm{fF} \times 0.1 \mathrm{k} \Omega=5 \mathrm{ps}$. So, the propagation delay from $A$ to $B$ is $16 \times 5 \mathrm{ps}=\mathbf{8 0} \mathbf{p s}$.

COMMENT: Is the total delay from A to C now $80 \mathrm{ps}+155 \mathrm{ps}=235 \mathrm{ps}$ or is there some delay we have not accounted for, since we assumed infinite drive strength at point B in task a)? In our path delay calculation in b) we accounted for the electrical effort of the second buffer so we have accounted for that delay and therefore it is correct to assume that the total delay is the sum of the two delays.
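The calculations in a) to c) can be retraced with the following sketch. The Elmore sums are transcribed directly from part a); the sizing loop is a generic logical-effort calculation and the variable names are hypothetical.

```python
# Part a): Elmore sums, Ohm * fF = fs
first_fs = 100 * (72 + 100 + 100 + 36) + 800 * (100 + 36)
second_fs = 200 * (36 + 100 + 100 + 36) + 200 * (100 + 36)
t_wire_ps = 0.7 * (first_fs + second_fs) / 1000

# Part b): NAND2 followed by two inverters, 1.5 fF input, 72 fF load
g = [4 / 3, 1.0, 1.0]          # logical efforts
p = [2.0, 1.0, 1.0]            # parasitic delays
c_in_fF, c_load_fF = 1.5, 72.0
path_effort = g[0] * g[1] * g[2] * c_load_fF / c_in_fF
stage_effort = path_effort ** (1 / len(g))

# Work backwards from the load: C_in = g * C_out / f for each inverter
caps = [c_load_fF]
for g_i in reversed(g[1:]):
    caps.insert(0, caps[0] * g_i / stage_effort)
# caps is now [C_in of inverter 1, C_in of inverter 2, C_load]

# Part c): normalised delay and delay in ps (kOhm * fF = ps)
d_norm = len(g) * stage_effort + sum(p)
tau_ps = 0.7 * 0.1 * 72.0
t_logic_ps = d_norm * tau_ps

print(round(t_wire_ps, 1), [round(c, 2) for c in caps], round(t_logic_ps, 1))
```

With the unrounded $\tau=5.04$ ps the logic delay is about 80.6 ps; the solution above rounds $\tau$ to 5 ps and therefore reports 80 ps.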

Solution 7.7 Problem is on page 27.
a) The number of squares for one WL wire is its length/width $=128 / 0.1=1280$. Thus, the resistance of the WL is $R_{\mathrm{WL}}=1280 \times 0.1 \Omega / \square=128 \Omega$.
b) The capacitance of one WL is $C_{\mathrm{WL}}=\text{length} \times\left(C_{\mathrm{GND}}+2 \times C_{\text{INTERWIRE}}\right)+\#\text{cells} \times 2 \times C_{\mathrm{G}}$. In this case we have $C_{\mathrm{WL}}=128 \mu \mathrm{~m} \times(0.1 \mathrm{fF} / \mu \mathrm{m}+2 \times 0.02 \mathrm{fF} / \mu \mathrm{m})+128 \times(2 \times 0.1 \mathrm{fF})=128 \times 0.34 \mathrm{fF}=43.5 \mathrm{fF}$.
c) Here is the model:


The delay can be computed as $t_{d W L}=2 / 3 \times R_{W L} C_{W L}=2 / 3 \times 5.57 \mathrm{ps}=3.7 \mathrm{ps}$.
d) The energy is computed as $E_{\mathrm{WL}}=C_{\mathrm{WL}} V_{\mathrm{DD}}^{2}=43.5 \mathrm{fJ}$ since only one WL is charged for each reading of the memory.
e) Resistance: The length of the wire is halved, but its width is not changed since the wire already has the minimum width. Thus, we have half the number of squares which gives $R_{\mathrm{WL} 2}=R_{\mathrm{WL}} / 2=64 \Omega$.

Capacitance: The parallel M2 wires are now at approximately half the distance they were before. If we assume plate capacitances, the interwire capacitance doubles, so we have $C_{\text{INTERWIRE2}}=2 \times C_{\text{INTERWIRE}}$. $C_{\text{GND}}$, $C_{\mathrm{G}}$ and the number of cells remain the same. So in this case we get: $C_{\mathrm{WL} 2}=64 \mu \mathrm{~m} \times(0.1 \mathrm{fF} / \mu \mathrm{m}+2 \times 0.04 \mathrm{fF} / \mu \mathrm{m})+128 \times(2 \times 0.1 \mathrm{fF})=64 \times 0.58 \mathrm{fF}=37.1 \mathrm{fF}$. The shorter WL wires improved the resistance much more than the capacitance.

The new delay is $t_{\mathrm{dWL} 2}=2 / 3 \times 2.37 \mathrm{ps}=1.6 \mathrm{ps}$. The new energy $E_{\mathrm{WL} 2}$ is 37 fJ. So making the memory smaller more than halved the delay, but the energy required is almost the same. This result is due to the fact that the main improvement was the halved resistance whereas the capacitance remained almost the same.
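To see where the improvement comes from, the sketch below recomputes the word-line resistance, capacitance and RC product for both memory sizes. The function `wordline` and its parameter names are hypothetical; the numbers are those used in parts a), b) and e). Only the RC products are compared, since the delay is proportional to them.

```python
def wordline(length_um, c_gnd, c_inter, n_cells, c_gate, r_sheet, width_um):
    """Return (R in Ohm, C in fF) for one word line.

    c_gnd and c_inter are in fF/um, c_gate in fF, r_sheet in Ohm/square.
    """
    squares = length_um / width_um
    r = squares * r_sheet
    c = length_um * (c_gnd + 2 * c_inter) + n_cells * 2 * c_gate
    return r, c

r1, c1 = wordline(128, 0.1, 0.02, 128, 0.1, 0.1, 0.1)   # original array
r2, c2 = wordline(64, 0.1, 0.04, 128, 0.1, 0.1, 0.1)    # halved length, doubled interwire cap

rc1_ps = r1 * c1 / 1000   # Ohm * fF = fs
rc2_ps = r2 * c2 / 1000
print(f"R: {r1:.0f} -> {r2:.0f} Ohm, C: {c1:.1f} -> {c2:.1f} fF, "
      f"RC: {rc1_ps:.2f} -> {rc2_ps:.2f} ps")
```

The resistance is halved while the capacitance drops by only about 15 percent, so the RC product falls to roughly 43 percent of its original value.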

Solution 7.8 Problem is on page 28.

This problem can be solved by collapsing the H-tree; it is also possible to apply Elmore's formula directly and handle the branches explicitly. Here we use the collapsed tree. If you have not yet done lab 4 consult the instructions for prelab 4 to read more about how to collapse the tree. The resulting schematic is shown in the figure below:


Since the three resulting wire segments are identical we can merge them into one segment to further simplify the calculations (You may remember from lab 4, that in the Cadence simulations the number of segments a wire is divided into makes a difference in the delay calculations, but in the hand calculations it does not). The further simplified schematic is shown below:

a) We use Elmore's formula to find this expression for $t_{\mathrm{dFO} 4}^{\prime}$ (the delay without the 0.7 factor):

$$
\begin{equation*}
t_{\mathrm{dFO} 4}^{\prime}=R_{\mathrm{eff}} C_{\mathrm{D}}+R_{\mathrm{eff}} 24 C+R_{\mathrm{eff}} 4 C_{\mathrm{G}}+3 R 12 C+3 R 4 C_{\mathrm{G}} \tag{12.99}
\end{equation*}
$$

This expression we can simplify further using $C_{\mathrm{D}}=C_{\mathrm{G}}$ and rearranging:

$$
\begin{equation*}
t_{\mathrm{dFO} 4}^{\prime}=5 R_{\mathrm{eff}} C_{G}+24 R_{\mathrm{eff}} C+36 R C+12 R C_{\mathrm{G}} \tag{12.100}
\end{equation*}
$$

We have that the FO4 delay is $0.7 t_{\mathrm{dFO} 4}^{\prime}$. We identify $\tau^{\prime}=R_{\text{eff}} C_{\mathrm{G}}$, which we know is a process constant; we then see that the first term in Equation 12.100 corresponds to the usual FO4 delay.
b) To find the optimal $R_{\text {eff }}$ we need to set the derivative of $t_{\mathrm{dFO} 4}^{\prime}$ with respect to $R_{\text {eff }}$ equal to 0 and solve for $R_{\text {eff }}$. But first we must eliminate $C_{\mathrm{G}}$ from the expression for $t_{\mathrm{dFO} 4}^{\prime}$ since $R_{\mathrm{eff}}$ and $C_{\mathrm{G}}$ are connected through the relation $\tau^{\prime}=R_{\mathrm{eff}} C_{\mathrm{G}}$, where $\tau^{\prime}$ is a constant. When we eliminate $C_{\mathrm{G}}$ we get:

$$
\begin{equation*}
t_{\mathrm{dFO} 4}^{\prime}=5 \tau^{\prime}+24 R_{\mathrm{eff}} C+36 R C+12 \frac{R \tau^{\prime}}{R_{\mathrm{eff}}} \tag{12.101}
\end{equation*}
$$

It is now clear that there are two terms that depend on $R_{\text {eff }}$. The derivative is:

$$
\begin{equation*}
\frac{d t_{\mathrm{dFO} 4}^{\prime}}{d R_{\mathrm{eff}}}=24 C-\frac{12 R \tau^{\prime}}{R_{\mathrm{eff}}^{2}} \tag{12.102}
\end{equation*}
$$

When we set the derivative equal to 0 and solve for $R_{\text {eff }}$ we find:

$$
\begin{equation*}
R_{\mathrm{eff}}=\sqrt{\frac{R \tau^{\prime}}{2 C}} \tag{12.103}
\end{equation*}
$$
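The closed-form optimum in Equation 12.103 can be sanity-checked numerically. The sketch below evaluates Equation 12.101 around the predicted optimum. The numerical values of $R$, $C$ and $\tau^{\prime}$ are purely hypothetical and chosen only for illustration; they do not come from the problem.

```python
import math

def t_fo4_prime(r_eff, r_wire, c_wire, tau):
    """Equation 12.101: delay (without the 0.7 factor) as a function of R_eff."""
    return (5 * tau + 24 * r_eff * c_wire
            + 36 * r_wire * c_wire + 12 * r_wire * tau / r_eff)

def r_eff_opt(r_wire, c_wire, tau):
    """Equation 12.103."""
    return math.sqrt(r_wire * tau / (2 * c_wire))

# Hypothetical values: R = 50 Ohm, C = 10 fF, tau' = 5 ps
R, C, TAU = 50.0, 10e-15, 5e-12
r_opt = r_eff_opt(R, C, TAU)

for scale in (0.5, 0.9, 1.0, 1.1, 2.0):
    t = t_fo4_prime(scale * r_opt, R, C, TAU)
    print(f"R_eff = {scale * r_opt:7.1f} Ohm -> t' = {t * 1e12:.2f} ps")
```

The delay is smallest at `scale = 1.0` and changes only slowly near the optimum, which is typical for this kind of driver-sizing problem.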

Solution 7.9 Problem is on page 29.

| width | 2 | 1 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | um |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RS | 0.1 | 0.1 | 0.1 | 0.1 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | Ohm/square |
| wirec | 0.2 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.35 | 0.4 | 0.35 | fF/um |
| wirer | 0.05 | 0.1 | 0.1 | 0.2 | 0.15 | 0.3 | 0.15 | 0.3 | 0.15 | 0.3 | Ohm/um |
| Length (um) | 6788.23 | 4800.00 | 4898.98 | 3464.10 | 4156.92 | 2939.39 | 4849.74 | 3666.06 | 5542.56 | 4189.78 |  |
|  |  |  |  |  |  |  |  |  |  |  |  |
| L (in problem) | 6800 | 4800 | 4900 | 3450 | 4150 | 2950 | 4850 | 3675 | 5550 | 4200 | um |
| WE (calculated) | 64.222222 | 64 | 100.04167 | 99.1875 | 143.52083 | 145.04167 | 196.020833 | 196.957031 | 256.6875 | 257.25 |  |
| sqrt(WE) | 8.0138769 | 8 | 10.002083 | 9.9592921 | 11.980018 | 12.043325 | 14.000744 | 14.0341381 | 16.02147 | 16.03901493 |  |
|  |  |  |  |  |  |  |  |  |  |  |  |
| WE (rounded) | 64 | 64 | 100 | 100 | 144 | 144 | 196 | 196 | 256 | 256 |  |
| sqrt(WE) | 8 | 8 | 10 | 10 | 12 | 12 | 14 | 14 | 16 | 16 |  |
| Ropt (driver) = RW/sqrt(WE) | 42.5 | 60.0 | 49.0 | 69.0 | 51.9 | 73.8 | 52.0 | 78.8 | 52.0 | 78.8 | Ohm |
| Copt = 7200/Ropt | 169.4 | 120.0 | 146.9 | 104.3 | 138.8 | 97.6 | 138.6 | 91.4 | 138.4 | 91.4 | fF |
| mopt = sqrt(WE/4) | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |  |
| Dopt = 4*sqrt(WE) | 32 | 32 | 40 | 40 | 48 | 48 | 56 | 56 | 64 | 64 |  |
|  |  |  |  |  |  |  |  |  |  |  |  |
| Ewire | 1958.4 | 1382.4 | 2116.8 | 1490.4 | 2390.4 | 1699.2 | 2793.6 | 1852.2 | 3196.8 | 2116.8 | fJ |
| E one repeater | 488 | 346 | 423 | 301 | 400 | 281 | 399 | 263 | 399 | 263 | fJ |
| mopt-1 repeaters | 1464 | 1037 | 1693 | 1202 | 1999 | 1406 | 2394 | 1580 | 2790 | 1843 | fJ |
| mopt-1+1 repeaters | 1952 | 1382 | 2116 | 1503 | 2398 | 1687 | 2793 | 1843 | 3188 | 2107 | fJ |

Figure 12.16: Wire data for problem 7.9 and the corresponding solutions.


Figure 12.17: Wire delay with different numbers of repeaters. In problem 7.9 we had data sets for all these wire efforts except $\mathrm{WE}=81$.
a) Different students had different data; there were ten different data sets. See the Excel data in Figure 12.16. The equations are:

$$
\begin{aligned}
C_{W} & =L \times c \\
R_{W} & =R_{S H} \times \frac{L}{W}
\end{aligned}
$$

b) We have the equations:

$$
\begin{aligned}
W_{E} & =\frac{R_{W} C_{W}}{R C} \\
D_{\text {Eopt }} & =4 \sqrt{W_{E}}
\end{aligned}
$$

The wire values were selected to give wire effort close to $64,100,144,196$ and 256 resulting in the optimal Elmore delay being 32, 40, 48, 56 and 64.
c) We have

$$
R_{\mathrm{opt}}=\frac{R_{W}}{\sqrt{W_{E}}}
$$

and then we find the corresponding capacitance as:

$$
C_{\mathrm{opt}}=\frac{R C}{R_{\mathrm{opt}}}
$$

Resistance and capacitance values for the optimal repeaters are given in Figure 12.16.
d) The energy can be computed as

$$
E=V_{D D}^{2} \times C
$$

both for the wire and for the repeaters. For the repeaters we have to remember that the output capacitance is the same as the input capacitance. We should also remember that $m_{\text {opt }}$ is the number of segments, not the number of repeaters. However since we should also include the driver output capacitance and receiver input capacitance we arrive at

$$
E_{\text {repeaters }}=V_{D D}^{2} \times m_{\text {opt }} 2 C_{r e p}
$$

The resulting values are shown in Figure 12.16. We see that the wire energy and the repeater energy are quite close in all cases.
e) There are many different possible solutions for this task. In Fig. 12.17 we show a graph that gives an overview of the delays with different numbers of repeaters for different WE values. From the graph it is clear that there is very little improvement in delay for the higher numbers of repeaters. It seems that going down to around half the optimal number of repeaters would be good if one wants to save energy. Then the repeater energy is halved with almost no increase in delay.
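The equations in a) to d) can be collected into one small function. The sketch below reproduces the second data column of Figure 12.16. Three values are assumptions read off the figure rather than stated in the text: the inverter RC product of 7200 Ohm fF, the relation $m_{\text{opt}}=\sqrt{W_E}/2$, and $V_{DD}=1.2$ V (inferred from the tabulated energies). The function name is hypothetical.

```python
import math

RC_INV = 7200.0   # Ohm*fF, inverter RC product as used in Figure 12.16
V_DD = 1.2        # V, assumption inferred from the tabulated energies

def repeater_design(length_um, width_um, r_sheet, c_per_um):
    """Wire effort, optimal repeater and energies for one data set."""
    c_w = length_um * c_per_um                 # fF
    r_w = r_sheet * length_um / width_um       # Ohm
    we = r_w * c_w / RC_INV
    r_opt = r_w / math.sqrt(we)
    c_opt = RC_INV / r_opt
    m_opt = math.sqrt(we) / 2                  # assumed, matches the figure
    return {
        "WE": we,
        "D_opt": 4 * math.sqrt(we),
        "R_opt": r_opt,
        "C_opt": c_opt,
        "m_opt": m_opt,
        "E_wire_fJ": V_DD**2 * c_w,
        "E_rep_fJ": V_DD**2 * m_opt * 2 * c_opt,
    }

# Second data set: L = 4800 um, width 1 um, 0.1 Ohm/square, 0.2 fF/um
design = repeater_design(4800, 1.0, 0.1, 0.2)
for key, value in design.items():
    print(f"{key:10s} {value:8.1f}")
```

For this data set the wire energy and the total repeater energy come out equal, which illustrates the observation in d) that the two are quite close.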

### 12.8 Layout

Solution 8.1 Problem is on page 31.
a) In the p-net graph all vertices have order 2, so there are zero nodes with odd order. In the n-net graph there are two nodes with odd order: the output node and GND. So Euler paths are OK but we also have to check the order of the nodes. All paths in the n-net graph have to start with the D vertex. There are two possible orders that also work with the p-net, which is a ring: $(\mathrm{D}, \mathrm{C}, \mathrm{B}, \mathrm{A})$ and $(\mathrm{D}, \mathrm{A}, \mathrm{B}, \mathrm{C})$.
b) In the p-net there are two nodes with order three: VDD and the output node. In the n-net there are also two nodes with order three: GND and the node above the A2 and B2 transistors that are connected in parallel to GND. So for each of the nets it is possible to find an Euler path, but there are constraints on the start and end nodes for both paths.

Now we must check if it is also possible to use the same order in both graphs. The limiting cases are often the nodes where there are no options, that is the ones connected in series. There are more of those in the n-net where we find four in a row: A1, B1, B2, A2. In the n-net we could start at either end of this row since it is connected to the odd-order nodes at each end. However, when we check the p-net we find that there is only one possibility there and that is to start at VDD with A1. So we then conclude that there are two orders that work for both paths and they are: (A1, B1, B2, A2, A2, B2) and (A1, B1, B2, A2, B2, A2).

Solution 8.2 Problem is on page 31.


Figure 12.18: 5-input cell (a) schematic with Euler path and labeled internal nodes, (b) Corresponding layout.
a) Yes, it is possible to use continuous-line-of-diffusion layout. First check the number of connections to the nodes in both p-net and n-net. We find that in the p-net there are exactly two nodes with odd number of connections: VDD and p1. The diffusion line has to start and stop in these two nodes. Still there are several possibilities since the n-net does not have any restrictions on the nodes because all its nodes have even number of connections. One possibility is shown in Figure 12.18a.
b) One layout corresponding to the Euler paths in Figure 12.18a is shown in Figure 12.18b. There are of course several other solutions.

Solution 8.3 Problem is on page 32.

1. Layout with minimal number of diffusion areas is shown in Figure 12.19.
2. For our layout of the 4-input NAND gate, with worst-case path resistance $R$ in both $n$ and $p$ paths, there are two p-diffusion areas with width 2 connected to the output and $1 n$-diffusion area with width 4 . For a CMOS inverter with resistance $R$, there are diffusion areas of width $2+1$ connected to the output (corresponding to $p_{i n v}=1$ since that was given in the problem). So $p_{N A N D 4}=(2+2+4) / 3=8 / 3$ which is quite a bit less than 4 which is what we get from the schematic.
3. We need to find an expression for the normalized delay $d$ of the AND4 gate as $d=p_{\text{AND4}}+g_{\text{AND4}} h_{\text{AND4}}$ where $p_{\text{AND4}}$ is the part of the normalized delay that does not depend on the load capacitance and $h_{\text{AND4}}$ is $\frac{C_{L}}{C_{\text{inAND}}}$. For the entire gate the expression for the normalized delay is

$$
\begin{equation*}
d_{\mathrm{AND} 4}=g_{\mathrm{NAND} 4} h_{\mathrm{NAND} 4}+p_{\mathrm{NAND} 4}+h_{\mathrm{inv}}+p_{\mathrm{inv}} . \tag{12.104}
\end{equation*}
$$

In this expression the only part that depends on external load capacitance is $h_{\text {inv }}$. The other three terms in the expression will form $p_{\text {AND4 }}$ so we have:

$$
\begin{equation*}
p_{\mathrm{AND} 4}=g_{\mathrm{NAND} 4} h_{\mathrm{NAND} 4}+p_{\mathrm{NAND} 4}+p_{\mathrm{inv}} \tag{12.105}
\end{equation*}
$$

The electrical effort for the NAND4 gate, $h_{N A N D 4}$ is 1 , because its load capacitance is the same as its input capacitance. Thus, we find:

$$
\begin{equation*}
p_{\mathrm{AND} 4}=2 \times 1+\frac{8}{3}+1=5 \frac{2}{3} \tag{12.106}
\end{equation*}
$$

The input capacitance of the NAND4 gate is the same as for the inverter. Therefore, $h_{\text{inv}}=h_{\text{AND4}}$ and the logical effort of the AND4 gate is 1. In conclusion the solution is $p_{\text{AND4}}=5.67$ and $g_{\text{AND4}}=1$.
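The arithmetic in items 2 and 3 can be verified with exact fractions. In the sketch below the transistor widths (nMOS width 4 and pMOS width 2 per NAND4 input, inverter widths 2 and 1) are the usual sizing for equal worst-case resistance and are stated here as an assumption; the variable names are our own.

```python
from fractions import Fraction as F

c_inv_in = 2 + 1                     # inverter input capacitance (pMOS 2, nMOS 1)

# Parasitic delay from the layout: diffusion widths at the output node
p_inv = F(2 + 1, c_inv_in)
p_nand4 = F(2 + 2 + 4, c_inv_in)     # two p-diffusions of width 2, one n of width 4

# Logical effort of the NAND4: input capacitance relative to the inverter
g_nand4 = F(4 + 2, c_inv_in)         # assumed sizing: nMOS 4, pMOS 2 per input

h_nand4 = 1                          # NAND4 drives an equal input capacitance
p_and4 = g_nand4 * h_nand4 + p_nand4 + p_inv
print(p_nand4, g_nand4, p_and4, float(p_and4))
```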


Figure 12.19: One possible layout of the AND4 gate.

Solution 8.4 Problem is on page 33.
a) There are several solutions to this layout problem. In the figure below you see one possibility.

b) Below is the schematic with the order of the transistors shown. The logical function of the gate is $Y=\overline{A B C+D(A+B+C)}$. The reason the circuit is symmetrical is that the direct inverse of this function is $\bar{Y}=(A+B+C)(D+(A B C))=D(A+B+C)+A B C$.


Solution 8.5 Problem is on page 33.
a) For reference we number the nMOS transistors in the layout 1-7 from left to right and the pMOS transistors 8-14 from left to right. We name the output of the compound gate X and the output of the inverter Y. The corresponding transistor schematic is shown below; the number of each transistor is to the right of that transistor:

b) Solution:

$$
\begin{equation*}
Y=A B+(A+B) C D \tag{12.107}
\end{equation*}
$$

It is easiest to find the function from the n-net of the compound gate. Since its output, X , is inverted to form Y , the n -net gives the function for Y directly.

Solution 8.6 Problem is on page 34.

The layout is shown here:


Solution 8.7 Problem is on page 34.

The discrepancies are marked in the layout here:


The discrepancies are:

1. Missing contact from p-active to VDD metal-1.
2. Inverter input and output shorted.
3. Wrong order of inputs compared to schematics.
4. Accidentally misplaced metal- 1 wires.

### 12.9 Sequential circuits

Solution 9.1 Problem is on page 37.
a) The maximum clock frequency is found from the propagation delay. We must consider the first flip-flop, the longest path through the adder and the setup time for the second flip-flop. Thus, we have

$$
T_{c} \geq t_{pcq}+t_{\text{pdadder}16}+t_{\text{setup}}
$$

The carry has to propagate through all 16 carry cells and then through one sum cell before we have the last sum bit available (that is the critical path). So we find

$$
t_{\text {pdadder } 16}=2 \times 340 \mathrm{ps}+60 \mathrm{ps}=740 \mathrm{ps}
$$

Thus we find

$$
T_{c}=50 \mathrm{ps}+740 \mathrm{ps}+50 \mathrm{ps}=840 \mathrm{ps}
$$

Thus, the maximum clock frequency is 1.19 GHz .
b) We apply the same approach here too, but since the number of bits is increased so is the longest path. Instead of two instances of the 8-bit carry chain, we need eight. We thus find:

$$
t_{\text {pdadder } 64}=8 \times 340 \mathrm{ps}+60 \mathrm{ps}=2780 \mathrm{ps}
$$

And for the shortest clock period we have:

$$
T_{c}=50 \mathrm{ps}+2780 \mathrm{ps}+50 \mathrm{ps}=2880 \mathrm{ps}
$$

The corresponding maximum clock frequency is 347 MHz .
c) In the worst case the clock skew increases the delay between the flip-flops so we have to add that to the maximum time and thus increase the clock period which decreases the clock frequency. Thus the maximum clock frequency becomes slightly lower than without clock skew.
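The setup-time calculations in a) to c) follow one pattern, sketched below. The function names are hypothetical, and the 20 ps skew used for part c) is an arbitrary illustrative value since the solution gives no number.

```python
def adder_delay_ps(n_bits, t_carry8_ps=340, t_sum_ps=60):
    """Ripple-carry delay built from 8-bit carry chains plus one sum cell."""
    return (n_bits // 8) * t_carry8_ps + t_sum_ps

def min_period_ps(t_pcq_ps, t_pd_ps, t_setup_ps, t_skew_ps=0):
    """Setup constraint: T_c >= t_pcq + t_pd + t_setup (+ worst-case skew)."""
    return t_pcq_ps + t_pd_ps + t_setup_ps + t_skew_ps

for n in (16, 64):
    t_c = min_period_ps(50, adder_delay_ps(n), 50)
    print(f"{n}-bit: T_c = {t_c} ps, f_max = {1e3 / t_c:.3f} GHz")

# Part c): a hypothetical 20 ps clock skew lengthens the minimum period
t_c_skew = min_period_ps(50, adder_delay_ps(16), 50, t_skew_ps=20)
print(f"16-bit with 20 ps skew: f_max = {1e3 / t_c_skew:.3f} GHz")
```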

Solution 9.2 Problem is on page 37.

a) The shortest path is through only one sum cell, because the output of that cell can flip immediately when A and B change value, even if the carry has not been generated correctly yet. So the contamination delay for the 16 -bit adder is merely 30 ps . And it is the same for the 64 -bit adder, which is worth noticing.
b) The requirement for avoiding a hold violation is:

$$
t_{ccq}+t_{cd} \geq t_{\mathrm{hold}}
$$

In this case that translates to

$$
35 \mathrm{ps}+30 \mathrm{ps} \geq 10 \mathrm{ps}
$$

which is true. So we do not have to worry about hold violations.
c) With clock skew the worst case is when the clock skew decreases the contamination delay and the output changes even earlier. The requirement then becomes:

$$
t_{ccq}+t_{cd}-t_{\text{skew}} \geq t_{\text{hold}}
$$

In this case that translates to

$$
35 \mathrm{ps}+30 \mathrm{ps}-75 \mathrm{ps} \geq 10 \mathrm{ps}
$$

which is not true. Thus, we will have a hold violation with this clock skew.
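The hold checks in b) and c) amount to computing a slack, as in this minimal sketch. The function name `hold_slack_ps` is our own; a negative result means a hold violation.

```python
def hold_slack_ps(t_ccq_ps, t_cd_ps, t_hold_ps, t_skew_ps=0):
    """Hold slack: t_ccq + t_cd - t_skew - t_hold. Positive means no violation."""
    return t_ccq_ps + t_cd_ps - t_skew_ps - t_hold_ps

slack_no_skew = hold_slack_ps(35, 30, 10)                  # part b)
slack_with_skew = hold_slack_ps(35, 30, 10, t_skew_ps=75)  # part c)

print("no skew:   ", slack_no_skew, "ps")
print("75 ps skew:", slack_with_skew, "ps")
```

The slack without skew is also the largest clock skew the circuit can tolerate before the hold requirement fails.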

Solution 9.3 Problem is on page 38.
a) The requirement for avoiding a hold violation is:

$$
t_{ccq}+t_{cd} \geq t_{\mathrm{hold}}
$$

That is, the data that is output from one flip-flop is not allowed to change until the next flip-flop has locked. Note that the shortest path between the two registers has only one logic gate, the NOR gate. Thus, we have $t_{\text{hold}}=60 \mathrm{ps}$ while $t_{ccq}+t_{cd}$ is $30 \mathrm{ps}+25 \mathrm{ps}=55 \mathrm{ps}$, which is too short.
b) The maximum clock frequency can be found from the maximum delay between flip-flops. The requirement for the clock period is then

$$
T_{c} \geq t_{pcq}+t_{pd}+t_{\text{setup}}
$$

The longest path is through three logic gates, not only one. With equal sign we find the minimum clock period as:

$$
T_{c}=80 \mathrm{ps}+3 \times 40 \mathrm{ps}+50 \mathrm{ps}=250 \mathrm{ps}
$$

Consequently, the maximum possible clock frequency is 4 GHz .
c) Both solutions will work; they increase the shortest path by one logic gate and thus the contamination delay by $t_{cd}=25 \mathrm{ps}$, which is enough to avoid a hold violation. Ben's proposal in b) will decrease the maximum clock frequency because it also increases the propagation delay for the combinational logic between the flip-flops. Alyssa's proposal does not increase the propagation delay. Thus, Alyssa's proposal is preferable.

Solution 9.4 Problem is on page 39.
a) Both 16 -bit adders have a worst-case delay of 250 ps . The figure below shows the unrolled steps in the control logic of the multiplier in the worst-case situation where every step results in a write to the product register. From the figure it is clear that three steps are completed in 300 ps . Thus, one step takes 100 ps and the resulting clock frequency is $1 /(100 \mathrm{ps})=10 \mathrm{GHz}$. An 8-bit multiplication takes $1+8 \times 4=33$ steps which is then 3300 ps or 3.3 ns .

b) We will use the Sklansky adder because it will have a shorter worst-case delay. From the table, and our knowledge about prefix adders, we find that the expression for the worst-case delay is $50 \mathrm{ps}+\log_{2}(n) \times 50 \mathrm{ps}$. For $n=64$ we then have the worst-case delay as $50+6 \times 50=350 \mathrm{ps}$. The ripple-carry adder has much longer delay since its delay grows linearly with $n$. The figure below shows the unrolled steps in the control logic of the multiplier where it becomes clear that three steps are completed in 400 ps. Thus, one step takes 133 ps and the resulting clock frequency is $1/(133 \mathrm{ps})=7.5 \mathrm{GHz}$. A 32-bit multiplication with worst-case data lasts for $1+32 \times 4=129$ steps, each of which takes 133 ps. All in all this gives 17.2 ns.

c) See figure below from Hennessy and Patterson Computer Organization and Design. By shifting the product rather than the multiplicand we can use a 32 -bit adder rather than a 64 -bit adder. The 32 -bit Sklansky adder has a worst-case delay of 300 ps . The cycle time is then $350 / 3=117 \mathrm{ps}$. The time for one multiplication to complete with worst-case data is then $129 \times 117 \mathrm{ps}=15.1 \mathrm{~ns}$. So the gain in calculation delay is not that great. One saves space and power too though.


FIGURE 3.5 Refined version of the multiplication hardware. Compare with the first version in Figure 3.3. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide, with only the Product register left at 64 bits. Now the product is shifted right. The separate Multiplier register also disappeared. The multiplier is placed instead in the right half of the Product register. These changes are highlighted in color. (The Product register should really be 65 bits to hold the carry out of the adder, but it's shown here as 64 bits to highlight the evolution from Figure 3.3.)
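The three timing cases in a), b) and c) can be compared with the sketch below. It assumes, based on the numbers quoted in the solution (250 to 300 ps, 350 to 400 ps, 300 to 350 ps), that three control steps fit in the adder delay plus 50 ps; that 50 ps overhead and the function names are our own reading, not something stated explicitly above.

```python
import math

def sklansky_delay_ps(n_bits):
    """Worst-case prefix-adder delay: 50 ps + log2(n) * 50 ps."""
    return 50 + math.log2(n_bits) * 50

def multiply_time_ps(n_mult_bits, adder_bits, overhead_ps=50, steps_per_window=3):
    """Return (step time, total time) for a worst-case multiplication."""
    window = sklansky_delay_ps(adder_bits) + overhead_ps   # assumed overhead
    step = window / steps_per_window
    n_steps = 1 + 4 * n_mult_bits
    return step, n_steps * step

cases = {"a) 8-bit, 16-bit adder": (8, 16),
         "b) 32-bit, 64-bit adder": (32, 64),
         "c) 32-bit, 32-bit adder": (32, 32)}
for label, (n_mult, n_add) in cases.items():
    step, total = multiply_time_ps(n_mult, n_add)
    print(f"{label}: step {step:.0f} ps, total {total / 1000:.2f} ns")
```

Without intermediate rounding case c) gives 15.05 ns; the solution rounds the step time to 117 ps first and therefore reports 15.1 ns.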

Solution 9.5 Problem is on page 40.
a) The propagation delay through the entire adder is:
a. Full-adder 1: Max of propagation delays from $A, B$, and $C_{\text {in }}$ inputs to $C_{\text {out }}$ output
b. Full-adder 2: Propagation delay from $C_{\text {in }}$ to $C_{\text {out }}$
c. Full-adder 3: Max of propagation delays from $C_{\text{in}}$ to Sum and $C_{\text{out}}$

With numbers we get: $t_{\mathrm{pd}}=\max (25,20)+20+\max (20,20)=65 \mathrm{ps}$
The scheduling overhead is $t_{\text {sched }}=t_{p c q}+t_{\text {setup }}=35+30=65 \mathrm{ps}$.
All in all $T_{c}=t_{\mathrm{pd}}+t_{\text {sched }}=65+65=130 \mathrm{ps} \rightarrow f_{\text {clk }}=7.7 \mathrm{GHz}$.
b) The minimum time until any output changes at the output of the adder is $t_{ccq}$ plus the minimum of the contamination delays from inputs $A$, $B$, $C_{\text{in}}$ to the Sum output for the full adder. With numbers we get: $21 \mathrm{ps}+\min(22,15) \mathrm{ps}=36 \mathrm{ps}$. The change at the adder output is not allowed to happen within the hold time because then we have a hold violation. We have $t_{\text{hold}}=10 \mathrm{ps}$. Thus, the maximum possible clock skew is:

$$
\begin{align*}
& T_{\text{skew}} \leq t_{ccq}+t_{cd, C_{\text{in}}, C_{\text{out}}}-t_{\text{hold}}  \tag{12.108}\\
& T_{\text {skew }} \leq 21+15-10=26 \mathrm{ps} \tag{12.109}
\end{align*}
$$

c) Description: When we have the slow-slow and fast-fast corners the calculation for maximum clock frequency has to be repeated for the slow-slow corner only because all delays will be shorter for fast-fast corner. However, a hold violation can happen for any condition, so we have to check both corners when calculating the maximum allowed clock skew.

Calculation: For an update of the solution for task a) we arrive at these values from the slow-slow column in the table:

$$
\begin{equation*}
t_{\mathrm{pd}}=\max (30,25)+25+\max (25,25)=80 \mathrm{ps} \tag{12.110}
\end{equation*}
$$

The scheduling overhead in the slow-slow corner is $t_{\text{sched}}=t_{pcq}+t_{\text{setup}}=40+35=75 \mathrm{ps}$. All in all we find: $T_{c}=t_{\mathrm{pd}}+t_{\text{sched}}=80+75=155 \mathrm{ps} \rightarrow f_{\text{clk}}=6.45 \mathrm{GHz}$. For the solution in b) we have to check the requirement for both corners. In both cases we have $t_{cd, C_{\mathrm{in}}, C_{\mathrm{out}}}<t_{cd, A B, C_{\text{out}}}$ so the requirement can still be expressed as $T_{\text{skew}} \leq t_{ccq}+t_{cd, C_{\text{in}}, C_{\text{out}}}-t_{\text{hold}}$ for both corners:

$$
\begin{gather*}
\text { Fast-fast }: T_{\text {skew }} \leq 16+12-5=23 \mathrm{ps}  \tag{12.111}\\
\text { Slow-slow }: T_{\text {skew }} \leq 24+20-20=24 \mathrm{ps} \tag{12.112}
\end{gather*}
$$

All in all, taking the additional corners into account, the maximum clock frequency is 6.45 GHz and the maximum allowed clock skew is 23 ps.
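
The corner check in task c) can be re-traced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the numbers are the corner values quoted in the solution above (all in ps).

```python
# Timing checks for Solution 9.5 c); all times in picoseconds.

def min_clock_period(t_pd, t_pcq, t_setup):
    """Setup constraint: T_c >= t_pcq + t_pd + t_setup."""
    return t_pcq + t_pd + t_setup


def max_clock_skew(t_ccq, t_cd, t_hold):
    """Hold constraint: t_skew <= t_ccq + t_cd - t_hold."""
    return t_ccq + t_cd - t_hold


# The slow-slow corner sets the clock period.
t_c_ss = min_clock_period(t_pd=80, t_pcq=40, t_setup=35)
f_max_ghz = 1000.0 / t_c_ss  # a period of 1000 ps corresponds to 1 GHz

# The hold check is repeated per corner; the smallest margin is the limit.
skew_limit = {
    "FF": max_clock_skew(t_ccq=16, t_cd=12, t_hold=5),
    "SS": max_clock_skew(t_ccq=24, t_cd=20, t_hold=20),
}
limiting_corner = min(skew_limit, key=skew_limit.get)

print(f"T_c = {t_c_ss} ps, f_max = {f_max_ghz:.2f} GHz")
print(f"max skew = {skew_limit[limiting_corner]} ps ({limiting_corner} corner)")
```

Keeping the setup and hold checks as separate functions mirrors the reasoning in the text: the period is set by the slowest corner, while the skew limit is the minimum over all corners.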

Solution 9.6 Problem is on page 41.
a) The maximum clock frequency is determined by the maximum of all propagation delays between pairs of registers, since all registers have the same timing data. With TT parameters we find A to B: 100 ps, B to C: 30 ps, and C to B: $120 \mathrm{ps}+100 \mathrm{ps}=220 \mathrm{ps}$. So the C-to-B path has the longest delay. We also have to add the scheduling overhead for the registers, which is $t_{p c q}+t_{\text {setup }}=25 \mathrm{ps}+40 \mathrm{ps}=65 \mathrm{ps}$. Thus, we find $T_{c}=285 \mathrm{ps}$ and a maximum clock frequency of 3.5 GHz.
b) The requirement that has to be fulfilled has nothing to do with the clock frequency since it is a race condition:

$$
\begin{equation*}
t_{c c q}+t_{c d} \geq t_{\text {hold }} \tag{12.113}
\end{equation*}
$$

And the slack (or the margin) in the system can be expressed as:

$$
\begin{equation*}
t_{\text {slack }}=t_{c c q}+t_{c d}-t_{\text {hold }} \tag{12.114}
\end{equation*}
$$

The minimum of the slack for all paths between the registers sets the allowed clock skew. And that in turn is set by the minimum contamination delay. In this case that is for blocks $C$ and $B$. There we find the slack to be 15 ps which is the maximum clock skew the system can tolerate.
c) The system will not fulfill the setup requirement in the SS corner at the clock frequency 3 GHz (this was not intended). So any solution that shows that will get full points for this part.

However, with the quite low clock frequency that was originally given in the problem, 300 MHz, it is clearly the hold requirement that limits the clock skew. The calculation is similar to the one in task b) and has to be done for both the SS and the FF corners (and also for the TT corner, but that was already done in task b)). The maximum clock skew was found to be 5 ps for these data, in the FF corner.

Solution 9.7 Problem is on page 42.
a) The contamination delay plays a role only for hold violations. Since CL B is connected between RB and RC they are the two registers of interest. So the path from the clock edge, through RB, through CL B, to the input of RC has to have a delay of at least the hold time for RC. That is

$$
\begin{equation*}
t_{\mathrm{holdRC}} \leq t_{\mathrm{ccqRB}}+t_{\mathrm{cdB}} \tag{12.115}
\end{equation*}
$$

And with numbers we find the condition to be

$$
\begin{equation*}
t_{\mathrm{cdB}} \geq 10 \mathrm{ps} \tag{12.116}
\end{equation*}
$$

b) In this case it is the setup condition we have to investigate. The path with the longest propagation delay is the one through blocks C and A. The condition is

$$
\begin{equation*}
T_{\mathrm{clk}} \geq t_{\mathrm{pcqRC}}+t_{\mathrm{pdC}}+t_{\mathrm{pdA}}+t_{\mathrm{setupB}} \tag{12.117}
\end{equation*}
$$

With numbers we find:

$$
\begin{equation*}
T_{\mathrm{clk}} \geq 10 \mathrm{ps}+50 \mathrm{ps}+30 \mathrm{ps}+10 \mathrm{ps}=100 \mathrm{ps} \tag{12.118}
\end{equation*}
$$

c) Here it is again the setup condition that is of interest. We have to find the path from IN to any register with the longest delay. That path goes through CL C and CL A to RB. To the total propagation delay through these two blocks we have to add the setup time for RB. Thus the setup condition for IN can be expressed as

$$
\begin{equation*}
t_{\mathrm{setupIN}} \geq t_{\mathrm{pdC}}+t_{\mathrm{pdA}}+t_{\text {setupB }} \tag{12.119}
\end{equation*}
$$

and with values we find that we have

$$
\begin{equation*}
t_{\mathrm{setupIN}} \geq 50 \mathrm{ps}+30 \mathrm{ps}+10 \mathrm{ps}=90 \mathrm{ps} \tag{12.120}
\end{equation*}
$$

d) In this case it is the hold condition that is of interest. The path with the shortest contamination delay is the one through CL C to RA.

$$
\begin{equation*}
t_{\mathrm{holdIN}}=t_{\mathrm{cdC}}-t_{\mathrm{holdA}} \tag{12.121}
\end{equation*}
$$

so with numbers we have:

$$
\begin{equation*}
t_{\mathrm{holdIN}}=0 \tag{12.122}
\end{equation*}
$$

This means that IN is allowed to change exactly when the clock edge comes or after that.
If we combine the results from tasks b), c) and d), we find that if we select $T_{\mathrm{clk}}$ at its smallest possible value, 100 ps, then IN is only allowed to transition from the clock edge and for 10 ps after that, that is during $1 / 10$ of the clock period. That does not seem to be a good design. One has to be careful with the conditions on the input signals too!
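
The closing remark can be checked numerically. The sketch below (the helper name is hypothetical) takes the clock period, the setup time and the hold time for IN from Eqs. 12.118, 12.120 and 12.122 and returns the interval after each clock edge in which IN may change.

```python
def input_change_window(t_clk, t_setup_in, t_hold_in):
    """Return (start, end) in ps after a clock edge where IN may change.

    IN must be stable from t_setup_in before the next clock edge
    until t_hold_in after the current clock edge.
    """
    return t_hold_in, t_clk - t_setup_in


t_clk = 10 + 50 + 30 + 10   # Eq. 12.118, in ps
t_setup_in = 50 + 30 + 10   # Eq. 12.120, in ps
t_hold_in = 0               # Eq. 12.122, in ps

start, end = input_change_window(t_clk, t_setup_in, t_hold_in)
fraction = (end - start) / t_clk
print(f"IN may change from {start} ps to {end} ps after the edge "
      f"({fraction:.0%} of the clock period)")
```

Increasing `t_clk` above its minimum widens the window, which is one way to relax the requirement on the input signal.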

### 12.10 Power, energy and scaling

Solution 10.1 Problem is on page 45.

For dynamic power the expression is:

$$
\begin{equation*}
P_{\mathrm{dyn}}=\alpha f C_{\mathrm{L}} V_{\mathrm{DD}}^{2} \tag{12.123}
\end{equation*}
$$

Thus, decreasing $V_{\mathrm{DD}}$ decreases power (with quadratic dependence). For the static power we assume that the subthreshold current dominates. The expression for the power is

$$
\begin{equation*}
P_{\mathrm{sub}}=V_{\mathrm{DD}} \times I_{\mathrm{sub}} \tag{12.124}
\end{equation*}
$$

Here, $V_{\mathrm{DD}}$ decreases too, but the subthreshold current increases more due to the decrease in threshold voltage. Remember that a 100 mV decrease in threshold voltage causes a tenfold increase in the current. So the static power increases.
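
To get a feeling for the two trends, the sketch below compares the relative change in dynamic and static power. The supply voltages (1.2 V to 1.0 V) and the 100 mV threshold reduction are hypothetical example values, not taken from the problem, and the model only uses the rule of thumb above (one decade of leakage per 100 mV of threshold voltage).

```python
def dynamic_power_ratio(vdd_old, vdd_new):
    """P_dyn is proportional to V_DD^2 when alpha, f and C_L are unchanged."""
    return (vdd_new / vdd_old) ** 2


def static_power_ratio(vdd_old, vdd_new, vt_drop_mv, mv_per_decade=100.0):
    """P_sub = V_DD * I_sub, with I_sub growing tenfold per mv_per_decade."""
    leakage_ratio = 10.0 ** (vt_drop_mv / mv_per_decade)
    return (vdd_new / vdd_old) * leakage_ratio


# Hypothetical example: V_DD from 1.2 V to 1.0 V, V_T lowered by 100 mV.
dyn = dynamic_power_ratio(1.2, 1.0)
stat = static_power_ratio(1.2, 1.0, vt_drop_mv=100.0)
print(f"dynamic power x{dyn:.2f}, static power x{stat:.2f}")
```

With these assumed numbers the dynamic power drops while the static power grows by almost an order of magnitude, which is the qualitative conclusion of the solution.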

Solution 10.2 Problem is on page 45.

For dynamic power the expression is:

$$
\begin{equation*}
P_{\mathrm{dyn}}=\alpha f C_{\mathrm{L}} V_{\mathrm{DD}}^{2} \tag{12.125}
\end{equation*}
$$

Thus, for dynamic power we have $V_{\mathrm{DD}}: \mathbf{D}, V_{\mathrm{T}}: \mathbf{N}, C_{\mathrm{L}}: \mathbf{D}$ and Width: $\mathbf{N}$.
For the short-circuit power we assume that $P_{\mathrm{SC}}=V_{\mathrm{DD}} \times I_{\mathrm{SC}}$ and that the regular drain-current equation in the saturation region gives the short-circuit current. This current flows when both transistors are on during the transition, so we can roughly assume that the input voltage is $V_{\mathrm{DD}} / 2$ if we want to have a detailed expression for it:

$$
\begin{equation*}
P_{\mathrm{SC}}=V_{\mathrm{DD}} I_{\mathrm{SC}} \approx V_{\mathrm{DD}} k^{\prime} \frac{W}{L}\left(\frac{V_{\mathrm{DD}}}{2}-V_{\mathrm{T}}\right)^{2} \tag{12.126}
\end{equation*}
$$

Thus, for short-circuit power we have $V_{\mathrm{DD}}: \mathbf{D}, V_{\mathrm{T}}: \mathbf{I}, C_{\mathrm{L}}: \mathbf{N}$ and Width: $\mathbf{D}$.
Similarly, for the static current the power is $P_{\text {sub }}=V_{\mathrm{DD}} \times I_{\text {sub }}$. The subthreshold (leakage) current depends exponentially on the gate-to-source voltage: the further $V_{\mathrm{GS}}$ is below the threshold voltage $V_{\mathrm{T}}$ when the transistor is fully off, the lower the subthreshold current. So a higher threshold voltage gives a lower subthreshold current when $V_{\mathrm{GS}}=0 \mathrm{~V}$. Again, the current is also proportional to the transistor width.

Thus, for static power we also have $V_{\mathrm{DD}}: \mathbf{D}, V_{\mathrm{T}}: \mathbf{I}, C_{\mathrm{L}}: \mathbf{N}$ and Width: $\mathbf{D}$.

Solution 10.3 Problem is on page 45.

The reason for the difference is the stack effect. Both nMOS transistors are off when both A and B are "0". That is why the first input entry in the table has the lowest current.

The best combinations for the carry cell are when all inputs are "1" or all inputs are "0". In those cases all pMOS transistors or all nMOS transistors are OFF and we have the benefit of the stack effect.

Solution 10.4 Problem is on page 46.
c) The dynamic power consumption due to recharging of the capacitances is $P_{\mathrm{dyn}}=\alpha f C_{\mathrm{tot}} V^{2}$ where $\alpha$ and $f$ are given in the problem statement. $V$ in the equation is $V_{\mathrm{DD}}$. The remaining factor is $C_{\text {tot }}$, the total switched capacitance, which we have to determine. $C_{\text {in }}$ for the inverter is not given in the problem but we can express the dynamic power in terms of $C_{\text {in }}$. How many times $C_{\text {in }}$ is the switched capacitance? It is

$$
\begin{equation*}
C_{\text {tot }}=C_{\text {in }}\left(1+4.5+4.5 f+4.5 f^{2}+4.5 f^{3}+4.5 f^{4}\right) \tag{12.127}
\end{equation*}
$$

if we also count the input capacitance of the first inverter. In units of $C_{\text {in }}$, we can also write the capacitance as

$$
\begin{equation*}
C_{\text {tot }}=C_{\text {parasitictot }}+C_{\text {inputtot }}=(1 / 2+2+8+32+128)+(1+4+16+64+256+1024)=1535.5 . \tag{12.128}
\end{equation*}
$$

Either way the dynamic power is $0.25 \times 200 \mathrm{MHz} \times 1 \mathrm{~V}^{2} \times 1535.5 \times C_{\text {in }}$ where $C_{\text {in }}$ should be a few fF. Even though we do not have the exact number for the input capacitance, we can check that the numbers are reasonable using our previous knowledge of the 65 nm process. If we assume that $C_{\text {in }}$ for the first inverter is a little less than 4 fF, we can approximate $1535.5 \times C_{\text {in }}$ with $6000 \mathrm{fF}=6 \mathrm{pF}$. With this capacitance we arrive at

$$
\begin{equation*}
P_{\mathrm{dyn}}=50 \mathrm{MHz} \times 1 \mathrm{~V}^{2} \times 6 \mathrm{pF}=300 \mu \mathrm{~W} \tag{12.129}
\end{equation*}
$$

which seems an entirely reasonable result.
d) It is incorrect because we neglect to take into account that if we have very large fanout for a gate the switching will be very slow at the gate output because its current is too small to charge the capacitance quickly. Then the n-net and p-net transistors in the gate will both be conducting at the same time and quite a large short-circuit current will flow during the switching.

Solution 10.5 Problem is on page 46.
a) (This task is also an example in Weste \& Harris) The capacitance for the 50 million transistors in the logic part is

$$
\begin{equation*}
50 \times 10^{6} \times 0.3 \mu \mathrm{~m} \times 1.8 \mathrm{fF} / \mu \mathrm{m}=27 \mathrm{nF} \tag{12.130}
\end{equation*}
$$

and for the 950 million transistors in the memories:

$$
\begin{equation*}
950 \times 10^{6} \times 0.1 \mu \mathrm{~m} \times 1.8 \mathrm{fF} / \mu \mathrm{m}=171 \mathrm{nF} . \tag{12.131}
\end{equation*}
$$

The power consumption for the logic part is then $0.1 \times 1 \mathrm{GHz} \times 1.44 \mathrm{~V}^{2} \times 27 \mathrm{nF}=3.88 \mathrm{~W}$. The dynamic power for the memories is: $0.02 \times 1 \mathrm{GHz} \times 1.44 \mathrm{~V}^{2} \times 171 \mathrm{nF}=4.92 \mathrm{~W}$. All in all the dynamic power for the chip is $\mathbf{8 . 8 ~ W}$.
b) (This task is also an example in Weste \& Harris) The static power is due to the subthreshold leakage and gate leakage. We assume that half of all transistors are on and half are off. Only off transistors contribute subthreshold leakage current and only on transistors contribute gate leakage current. For the memory part we have the total leakage current:

$$
\begin{equation*}
475 \times 10^{6} \times 0.1 \mu \mathrm{~m} \times(10 \mathrm{nA} / \mu \mathrm{m}+5 \mathrm{nA} / \mu \mathrm{m})=\mathbf{712.5} \mathbf{~ m A} \tag{12.132}
\end{equation*}
$$

For the logic part we have 25 million transistors that are on, and 25 million that are off. Of the off ones $5 \%$ have the low VT. So the leakage current is:

$$
\begin{equation*}
25 \times 10^{6} \times 0.3 \mu \mathrm{~m} \times(0.95 \times 10 \mathrm{nA} / \mu \mathrm{m}+0.05 \times 100 \mathrm{nA} / \mu \mathrm{m}+5 \mathrm{nA} / \mu \mathrm{m})=\mathbf{1 4 6 . 2 5} \mathbf{~ m A} . \tag{12.133}
\end{equation*}
$$

All in all the leakage current is 858.75 mA and the power (since $P=U \times I$ with $U=1.2 \mathrm{~V}$) is $\mathbf{1.03} \mathbf{~ W}$.
c) The dynamic power for the logic part would increase by $20 \%$ since the capacitance increases by $20 \%$ and all the other factors stay the same. The leakage current would be

$$
\begin{equation*}
0.3 \mu \mathrm{~m} \times 10^{6} \times((0.99 \times 25+5) \times 10 \mathrm{nA} / \mu \mathrm{m}+0.01 \times 25 \times 100 \mathrm{nA} / \mu \mathrm{m}+30 \times 5 \mathrm{nA} / \mu \mathrm{m})=\mathbf{1 4 1 . 8} \mathbf{~ m A} . \tag{12.134}
\end{equation*}
$$

So it would not save any dynamic power because the added 5 million transistors have more leakage than what we save by having fewer low VT transistors.
d) The dynamic power of 3.88 W from a) corresponds to 3.23 A of current. When this current is drawn through the power-gate switch there should be no more than 60 mV of voltage drop across it. Using Ohm's law, we find the maximum resistance of $R=0.06 \mathrm{~V} / 3.23 \mathrm{~A}=0.0186 \Omega$. So the transistor has to be very wide! $W=2000 \Omega \mu \mathrm{~m} / 0.0186 \Omega=107526 \mu \mathrm{~m}=\mathbf{1 0 8} \mathbf{~ m m}$. So the transistor is around 11 cm wide! (In practice it can be a bit less wide since the transistor resistance at low $V_{\mathrm{DS}}$ is smaller than $R$ ).
e) The capacitance of the switch is $W \times 1 \mathrm{fF} / \mu \mathrm{m}$; that results in 107526 fF or 107.5 pF. The energy is then 155 pJ (since $E=C V_{\mathrm{DD}}^{2}$ and $V_{\mathrm{DD}}$ is 1.2 V). The static power for the logic part from b) is $146 \mathrm{~mA} \times 1.2 \mathrm{~V}=175.2 \mathrm{~mW}$. Power is energy per time. So how long does it take for the total energy due to leakage to equal $E_{\text {sw }}$? We get $E_{\text {sw }}=E_{\text {leak }}=P_{\text {leak }} \times t$ so $t=E_{\text {sw }} / P_{\text {leak }}=155 \mathrm{~pJ} / 175.2 \mathrm{~mJ} \mathrm{~s}^{-1}$ (since Watts are Joules/second). We finally arrive at $t \approx \mathbf{0.88} \mathbf{~ ns}$.
f) BONUS QUESTION Obviously the case in c) is not good since both the dynamic power and the leakage is higher than in $b$ ), but in the general case it is a question about determining when to spend all the energy to turn the power off. One has to be rather good at predicting the down-time to spend the energy required. So it may be better to spend more on the dynamic power if the static power can be reduced without having to turn the power off, since it is very costly to do so.
g) We have to redo the calculations for tasks a)-b) and d)-e) above but with $V_{\mathrm{DD}}=1 \mathrm{~V}$.
a) Capacitance remains the same, of course. So the power consumption for the logic part is then $0.1 \times$ $1 \mathrm{GHz} \times 1 \mathrm{~V}^{2} \times 27 \mathrm{nF}=2.7 \mathrm{~W}$. The dynamic power for the memories is: $0.02 \times 1 \mathrm{GHz} \times 1 \mathrm{~V}^{2} \times 171 \mathrm{nF}=$ 3.42 W . All in all the dynamic power for the chip is 6.12 W .
b) We have the same leakage current as before, if we assume that the numbers in the problem statement hold also at the lower $V_{\mathrm{DD}}$. Summing the memory and logic contributions from task b), the leakage current is still $712.5 \mathrm{~mA}+146.25 \mathrm{~mA}=858.75 \mathrm{~mA}$ and the power is (since $P=U \times I$) $\mathbf{859} \mathbf{~ m W}$.
d) The dynamic power of 2.7 W from a) corresponds to 2.7 A of current. When this current is drawn through the power-gate switch there should be no more than 50 mV of voltage drop across it. Using Ohm's law, we find the maximum resistance of $R=0.05 \mathrm{~V} / 2.7 \mathrm{~A}=0.0186 \Omega$. So the transistor has to be very wide! $W=2000 \Omega \mu \mathrm{~m} / 0.0186 \Omega=107526 \mu \mathrm{~m}=\mathbf{1 0 8} \mathbf{~ m m}$. So the transistor is around 11 cm wide! (In practice it can be a bit less wide since the transistor resistance at low $V_{\mathrm{DS}}$ is smaller than R ).
e) The capacitance of the switch is $W \times 1 \mathrm{fF} / \mu \mathrm{m}$; that still results in 107526 fF or 107.5 pF. The energy is however lower: 107.5 pJ (since $E=C V_{\mathrm{DD}}^{2}$ and $V_{\mathrm{DD}}$ is 1.0 V). The static power for the logic part from b) is now $146 \mathrm{~mA} \times 1.0 \mathrm{~V}=146 \mathrm{~mW}$. Power is energy per time. So how long does it take for the total energy due to leakage to equal $E_{\text {sw }}$? We get $E_{\text {sw }}=E_{\text {leak }}=P_{\text {leak }} \times t$ so $t=E_{\text {sw }} / P_{\text {leak }}=107.5 \mathrm{~pJ} / 146 \mathrm{~mJ} \mathrm{~s}^{-1}$ (since Watts are Joules/second). We finally arrive at $t=\mathbf{0.74} \mathbf{~ ns}$. So it is a bit shorter than at 1.2 V, even though the allowed voltage drop is smaller here.
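
The dynamic-power numbers in tasks a) and g) follow the same pattern, so they can be reproduced with one small helper. This is a sketch with made-up function names; the transistor counts, widths, capacitance per width and activity factors are the ones used in task a).

```python
def capacitance_f(n_transistors, width_um, cap_ff_per_um=1.8):
    """Total gate capacitance in farad (1 fF = 1e-15 F)."""
    return n_transistors * width_um * cap_ff_per_um * 1e-15


def dynamic_power_w(alpha, f_hz, cap_f, vdd):
    """P_dyn = alpha * f * C * V_DD^2."""
    return alpha * f_hz * cap_f * vdd ** 2


C_LOGIC = capacitance_f(50e6, 0.3)   # about 27 nF
C_MEM = capacitance_f(950e6, 0.1)    # about 171 nF


def chip_dynamic_power_w(vdd, f_hz=1e9):
    """Logic (alpha = 0.1) plus memories (alpha = 0.02)."""
    return (dynamic_power_w(0.1, f_hz, C_LOGIC, vdd)
            + dynamic_power_w(0.02, f_hz, C_MEM, vdd))


for vdd in (1.2, 1.0):
    print(f"V_DD = {vdd} V: P_dyn = {chip_dynamic_power_w(vdd):.2f} W")
```

Because only $V_{\mathrm{DD}}$ changes between the two cases, the ratio of the two results is simply $(1.0 / 1.2)^{2}$.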

Solution 10.6 Problem is on page 47.


Figure 12.20: Execution diagram for task 10.6 a).
a) We start by calculating the total capacitances in the logic and memory parts of the processor:

$$
\begin{aligned}
C_{L} & =(1+0.3) \mathrm{fF} / \mu \mathrm{m} \times 0.45 \mu \mathrm{~m} \times 150 \mathrm{M}=87.75 \mathrm{nF} \\
C_{M} & =(1+0.3) \mathrm{fF} / \mu \mathrm{m} \times 0.12 \mu \mathrm{~m} \times 600 \mathrm{M}=93.6 \mathrm{nF}
\end{aligned}
$$

The execution diagram is shown in Figure 12.20. We should calculate the energy consumption for one round, that is one second. During execution there is both dynamic and static power, while during idle time there is only static power.

For the logic part we find the dynamic energy as:

$$
\begin{aligned}
E_{L \mathrm{dyn}} & =P_{L \mathrm{dyn}} \times t \\
& =\alpha_{L} f_{c} C_{L} V_{\mathrm{DD}}^{2} t
\end{aligned}
$$

So we find for the two execution cases that the dynamic energy for the logic part is the same, which is what we expect, since it is the same number of clock cycles at the same supply voltage:

$$
\begin{aligned}
E_{L \text { dyn1G }} & =0.15 \times 1 \mathrm{GHz} \times 87.75 \mathrm{nF} \times 0.64 V^{2} \times 0.8 \mathrm{~s}=6.74 \mathrm{~J} \\
E_{L \mathrm{dyn} 800 \mathrm{M}} & =0.15 \times 0.8 \mathrm{GHz} \times 87.75 \mathrm{nF} \times 0.64 V^{2} \times 1 \mathrm{~s}=6.74 \mathrm{~J}
\end{aligned}
$$

For the static energy we assume that half of the transistors are on and half are off. The ones that are on have gate leakage and the ones that are off have subthreshold leakage. We start by calculating the static current for the logic:

$$
I_{L \mathrm{stat}}=75 \mathrm{M} \times 0.45 \mu \mathrm{~m} \times(5 \mathrm{nA} / \mu \mathrm{m}+0.2 \times 500 \mathrm{nA} / \mu \mathrm{m}+0.8 \times 50 \mathrm{nA} / \mu \mathrm{m})=4.89 \mathrm{~A}
$$

To find the power we have to multiply the static current by the supply voltage. And to find the energy we have to multiply by the time. Thus, we have:

$$
\begin{aligned}
E_{\text {Lstat1G }} & =4.89 \mathrm{~A} \times 0.8 \mathrm{~V} \times 0.8 \mathrm{~s}=3.13 \mathrm{~J} \\
E_{\text {Lstat800M }} & =4.89 \mathrm{~A} \times 0.8 \mathrm{~V} \times 1 \mathrm{~s}=3.91 \mathrm{~J}
\end{aligned}
$$

We cannot turn off the supply voltage for the memory, because then we would lose all data and state. Thus, the static energy for the memory will be the same for both cases. What happens with the dynamic energy depends on if we assume that we turn off the clocks to the memory, even though we cannot turn off the supply voltage. This approach seems reasonable so that is what we assume. So therefore the dynamic energy for the memory also remains the same for the two clock frequencies:

$$
\begin{aligned}
E_{M \mathrm{dyn} 1 \mathrm{G}} & =0.01 \times 1 \mathrm{GHz} \times 93.6 \mathrm{nF} \times 0.64 V^{2} \times 0.8 \mathrm{~s}=0.479 \mathrm{~J} \\
E_{M \text { dyn } 800 \mathrm{M}} & =0.01 \times 0.8 \mathrm{GHz} \times 93.6 \mathrm{nF} \times 0.64 V^{2} \times 1 \mathrm{~s}=0.479 \mathrm{~J}
\end{aligned}
$$

The static current for the memory is with the intended LVT type:

$$
I_{M \text { stat }}=300 \mathrm{M} \times 0.12 \mu \mathrm{~m} \times(5 \mathrm{nA} / \mu \mathrm{m}+50 \mathrm{nA} / \mu \mathrm{m})=1.98 \mathrm{~A}
$$

Thus, we get the static energy as:

$$
E_{M \text { stat }}=1.98 \mathrm{~A} \times 0.8 \mathrm{~V} \times 1 \mathrm{~s}=1.58 \mathrm{~J}
$$

Assuming we can turn off the power supply and the clocks for the logic and the clocks for the memory at no cost, we would save 0.78 J in static energy. The total energy is then 11.94 J.
b) We now need to find out how much we can reduce the supply voltage when we have a longer available time for execution in each clock cycle due to the lower clock frequency ( 800 MHz rather than 1 GHz ). Note that we cannot assume that the minimum $V_{\mathrm{DD}}$ is feasible to use. We know that we have:

$$
f_{c} \sim \frac{1}{R_{\mathrm{eff}}} \sim \frac{1}{\frac{V_{\mathrm{DD}}}{I_{D S}}} \sim \frac{I_{D S}}{V_{\mathrm{DD}}} \sim \frac{\left(V_{\mathrm{DD}}-V_{T H}\right)^{2}}{V_{\mathrm{DD}}}
$$

We do not know what $\tau$ is at 1 GHz but we only need the ratio, so that is not a problem. At 1 GHz we have

$$
f_{\text {high }} \sim \frac{(0.8-0.2)^{2}}{0.8}=0.45
$$

Thus we have this equation for finding the reduced $V_{\mathrm{DD}}$ :

$$
\frac{800}{1000}=\frac{\frac{\left(V_{\mathrm{DD}}-0.2\right)^{2}}{V_{\mathrm{DD}}}}{0.45}
$$
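
The equation above is quadratic in $V_{\mathrm{DD}}$: multiplying out gives $V_{\mathrm{DD}}^{2}-\left(2 V_{TH}+k\right) V_{\mathrm{DD}}+V_{TH}^{2}=0$ with $k=0.8 \times 0.45$. A minimal sketch of the solution step, assuming the quadratic current model used here (the function name is only illustrative):

```python
import math


def vdd_for_speed(speed_ratio, vdd_ref, v_th):
    """Solve (V - v_th)^2 / V = speed_ratio * (vdd_ref - v_th)^2 / vdd_ref for V.

    Only the larger root is returned; the smaller root lies below v_th,
    where the quadratic current model does not apply.
    """
    k = speed_ratio * (vdd_ref - v_th) ** 2 / vdd_ref
    b = 2.0 * v_th + k
    discriminant = b * b - 4.0 * v_th * v_th
    return (b + math.sqrt(discriminant)) / 2.0


vdd_800 = vdd_for_speed(speed_ratio=800 / 1000, vdd_ref=0.8, v_th=0.2)
print(f"V_DD for 800 MHz: {vdd_800:.3f} V")
```

As a sanity check, a speed ratio of 1 returns the reference supply voltage of 0.8 V.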

So we have to solve a quadratic equation to find $V_{\mathrm{DD}}=0.7 \mathrm{~V}$. With this value we recalculate the energy consumption at the lower frequency. The static current is the same, so we just have to multiply with the new supply voltage. For the logic part we find:

$$
\begin{aligned}
E_{L \mathrm{dyn} 800 \mathrm{M}} & =0.15 \times 0.8 \mathrm{GHz} \times 87.75 \mathrm{nF} \times 0.49 \mathrm{~V}^{2} \times 1 \mathrm{~s}=5.16 \mathrm{~J} \\
E_{L s t a t 800 \mathrm{M}} & =4.89 \mathrm{~A} \times 0.7 \mathrm{~V} \times 1 \mathrm{~s}=3.42 \mathrm{~J}
\end{aligned}
$$

And for the memory part:

$$
\begin{aligned}
& E_{M \mathrm{dyn} 800 \mathrm{M}}=0.01 \times 0.8 \mathrm{GHz} \times 93.6 \mathrm{nF} \times 0.49 \mathrm{~V}^{2} \times 1 \mathrm{~s}=0.367 \mathrm{~J} \\
& E_{M \text { stat } 800 \mathrm{M}}=1.98 \mathrm{~A} \times 0.7 \mathrm{~V} \times 1 \mathrm{~s}=1.386 \mathrm{~J}
\end{aligned}
$$

All in all we find that the total energy has been reduced to 10.3 J .
c) Not knowing when the events will appear makes it tricky to evaluate the effect. The cost of external event detection circuitry is unknown and it would most likely be complex to implement, reducing the benefit of the power gating. Reducing the clock frequency and power supply would be a safer approach. You are not required to calculate those numbers but we will later add them for completeness.
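
The totals from tasks a) and b) can be re-derived in a few lines. The sketch below uses the capacitances, static currents and activity factors calculated above; the function name and the way the terms are grouped are illustrative assumptions. As in the solution, the logic is power-gated when idle while the memory keeps its supply but has its clock turned off.

```python
C_LOGIC, C_MEM = 87.75e-9, 93.6e-9   # F
I_LOGIC, I_MEM = 4.89, 1.98          # A, static currents
ALPHA_LOGIC, ALPHA_MEM = 0.15, 0.01


def total_energy_j(f_hz, vdd, t_active_s, period_s=1.0):
    """Energy for one period of period_s seconds.

    Dynamic energy and logic leakage are only spent while active;
    memory leakage is spent during the whole period.
    """
    switched_cap = ALPHA_LOGIC * C_LOGIC + ALPHA_MEM * C_MEM
    e_dyn = switched_cap * f_hz * vdd ** 2 * t_active_s
    e_stat = I_LOGIC * vdd * t_active_s + I_MEM * vdd * period_s
    return e_dyn + e_stat


e_a = total_energy_j(f_hz=1e9, vdd=0.8, t_active_s=0.8)    # task a)
e_b = total_energy_j(f_hz=0.8e9, vdd=0.7, t_active_s=1.0)  # task b)
print(f"a) {e_a:.2f} J, b) {e_b:.2f} J")
```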

## Add calculations of energy savings at 100 MHz

Solution 10.7 Problem is on page 48.

## Preliminaries

Application: The digital video in the video-rendering application has 25 frames per second. Thus the maximum time for the calculations for one frame is 40 ms . One frame is $640 \times 480$ pixels, that is 307200 pixels.

PP processor: The PP processor requires 3072000 clock cycles for the calculations for one frame.
a) The first task is to fill in the empty cells in Table 10.4. To find the maximum clock frequency for lower $V_{\mathrm{DD}}$ we need to know how the delay (that is $\tau$) in the process scales with $V_{\mathrm{DD}}$. We know that $\tau=0.7 R C$. In this expression only $R$ changes with $V_{\mathrm{DD}}$. We know that $R=\frac{V_{\mathrm{DD}}}{I_{\mathrm{DSAT}}}$ and that $I_{\mathrm{DSAT}} \sim\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2}$ if we assume that the quadratic current equations hold as stated in the problem. Thus, the ratio of the max clock frequencies at two supply voltages can be written as:

$$
\begin{equation*}
\frac{f_{\mathrm{clk} 2}}{f_{\mathrm{clk} 1}} \sim \frac{\tau_{1}}{\tau_{2}} \sim \frac{\frac{V_{\mathrm{DD} 1}}{I_{\mathrm{DSAT} 1}}}{\frac{V_{\mathrm{DD} 2}}{I_{\mathrm{DSAT} 2}}} \sim \frac{\frac{V_{\mathrm{DD} 1}}{\left(V_{\mathrm{DD} 1}-V_{\mathrm{T}}\right)^{2}}}{\frac{V_{\mathrm{DD} 2}}{\left(V_{\mathrm{DD} 2}-V_{\mathrm{T}}\right)^{2}}} \tag{12.135}
\end{equation*}
$$

For our three supply voltages $\frac{V_{\mathrm{DD}}}{\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2}}$ evaluates to: for $1.2 \mathrm{~V}: 1.48$, for $1.0 \mathrm{~V}: 2.04$, and for $0.8 \mathrm{~V}: 3.2$.
Thus, we have $\frac{f_{\text {clk } 1.0}}{f_{\text {clk } 1.2}}=\frac{1.48}{2.04}=0.73$ and $\frac{f_{\text {clk } 0.8}}{f_{\text {clk } 1.2}}=\frac{1.48}{3.2}=0.46$.
The current corresponding to the dynamic power consumption at the maximum clock frequency is

$$
\begin{equation*}
f_{\mathrm{clkmax}} \alpha C V_{\mathrm{DD}} \tag{12.136}
\end{equation*}
$$

(It has to be multiplied by the voltage to get the dynamic power.) So it scales with the maximum clock frequency (calculated above) and the supply voltage (the activity factor and the capacitance do not change) as

$$
\begin{equation*}
\frac{I_{\mathrm{dyn} 2}}{I_{\mathrm{dyn} 1}} \sim \frac{f_{\mathrm{clk} 2} V_{\mathrm{DD} 2}}{f_{\mathrm{clk} 1} V_{\mathrm{DD} 1}} \tag{12.137}
\end{equation*}
$$

Thus we have $\frac{I_{\text {dyn } 1.0}}{I_{\text {dyn } 1.2}}=0.73 \frac{1.0}{1.2}=0.608$ and $\frac{I_{\text {dyn } 0.8}}{I_{\text {dyn } 1.2}}=0.46 \frac{0.8}{1.2}=0.306$. The resulting current values are shown in Table 12.1.
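
The two scaling rules (Eqs. 12.135 and 12.137) can be wrapped in a short script. Note the assumption: the threshold voltage $V_{\mathrm{T}}=0.3 \mathrm{~V}$ is not quoted in this solution but inferred from the tabulated values 1.48, 2.04 and 3.2 above. The script works with unrounded ratios, so its results differ slightly from the hand calculation, which rounds the frequency ratios first.

```python
V_T = 0.3  # V; inferred from the values above, not quoted from the problem


def delay_factor(vdd, vt=V_T):
    """tau is proportional to V_DD / (V_DD - V_T)^2 in the quadratic model."""
    return vdd / (vdd - vt) ** 2


def scaling_vs_reference(vdd, vdd_ref=1.2):
    """Return (f_max ratio, dynamic-current ratio) relative to vdd_ref."""
    f_ratio = delay_factor(vdd_ref) / delay_factor(vdd)
    i_ratio = f_ratio * vdd / vdd_ref
    return f_ratio, i_ratio


for vdd in (1.0, 0.8):
    f_ratio, i_ratio = scaling_vs_reference(vdd)
    print(f"V_DD = {vdd} V: f ratio = {f_ratio:.3f}, I_dyn ratio = {i_ratio:.3f}")
```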

Sleep mode: The time it takes to enter sleep mode is $10 \mu \mathrm{s}$ and it takes $20 \mu \mathrm{s}$ for the processor to wake up from sleep mode. The energy required to switch the clocks off is $10 \mu \mathrm{~J}$.

Hibernation mode: The time it takes to enter hibernation mode is 1 ms and it takes 19 ms to wake up from hibernation mode. The energy required to turn off $V_{\mathrm{DD}}$ is $500 \mu \mathrm{~J}$.

Table 12.1: Data for the PP processor

| Supply voltage $V_{\mathrm{DD}}$ $[\mathrm{V}]$ | Maximum clock frequency $[\mathrm{GHz}]$ | Current due to dynamic power consumption @ max clock frequency and a realistic activity factor $[\mathrm{mA}]$ | Idle current @ room temperature @ max clock frequency and a low activity factor $[\mathrm{mA}]$ | Static current in sleep mode @ room temperature (clock signal turned off for logic, but clock generation maintained) $[\mathrm{mA}]$ | Static current in hibernation mode (clock generation stopped and internal supply voltages turned off) $[\mu \mathrm{A}]$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1.2 | 1.0 | 600 | 100 |  |  |
| 1.0 | $\mathbf{0.73}$ | $\mathbf{365}$ | 80 | 60 |  |
| 0.8 | $\mathbf{0.46}$ | $\mathbf{184}$ | 64 | 37.5 | 60 |

b) It takes the same number of cycles to do the calculations at both voltages, but the cycle time differs. At 1 GHz the cycle time is 1 ns . At 460 MHz the cycle time is 2.17 ns . The energy is power times time. Thus, we arrive at: $E_{\mathrm{dyn}}=I_{\mathrm{dyn}} \times V_{\mathrm{DD}} \times t_{\text {cycle }} \times N_{\text {cycles }}$. And we have for 1.2 V and 0.8 V :

$$
\begin{aligned}
& E_{\text {dyn } 1.2}=600 \mathrm{~mA} \times 1.2 \mathrm{~V} \times 1 \mathrm{~ns} \times 3.07 \times 10^{6}=2210 \mu \mathrm{~J} \\
& E_{\text {dyn } 0.8}=184 \mathrm{~mA} \times 0.8 \mathrm{~V} \times 2.17 \mathrm{~ns} \times 3.07 \times 10^{6}=981 \mu \mathrm{~J}
\end{aligned}
$$

so there is a large reduction of the dynamic energy at the price of a more than doubled computation time.
c) The time in sleep mode is the time left after the computations for a frame are done. At 1.2 V the necessary computations take 3 ms and at 0.8 V they take 8 ms (we neglect the time it takes to switch to sleep mode since it is so short in comparison). The remaining times for a frame are $40-3=37 \mathrm{~ms}$ and $40-8=32 \mathrm{~ms}$ respectively. Thus, we have:

$$
\begin{aligned}
& E_{\text {sleep } 1.2}=60 \mathrm{~mA} \times 1.2 \mathrm{~V} \times 37 \mathrm{~ms}=2664 \mu \mathrm{~J} \\
& E_{\text {sleep } 0.8}=28 \mathrm{~mA} \times 0.8 \mathrm{~V} \times 32 \mathrm{~ms}=717 \mu \mathrm{~J}
\end{aligned}
$$

d) Since we have so much time available for the computations for one frame we should run at the lowest possible supply voltage, in this case 0.8 V , to save energy. At that supply voltage the energy required to switch off the power supply is larger than what we could save during the 12 ms when the power supply would be off. So the answer is NO. We should not use hibernation mode for this application.
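
The sleep energies in task c) and the argument in task d) can be checked as follows. This is a sketch under one stated assumption: the hibernation current is taken as zero, which gives an upper bound on what hibernation could save compared with sleeping during the same 12 ms. All other numbers are the ones used above.

```python
FRAME_MS = 40.0  # time budget for one video frame


def sleep_energy_uj(i_sleep_ma, vdd, t_compute_ms):
    """Energy spent in sleep mode for the rest of the frame (mA * V * ms = uJ)."""
    return i_sleep_ma * vdd * (FRAME_MS - t_compute_ms)


e_sleep_12 = sleep_energy_uj(i_sleep_ma=60.0, vdd=1.2, t_compute_ms=3.0)
e_sleep_08 = sleep_energy_uj(i_sleep_ma=28.0, vdd=0.8, t_compute_ms=8.0)

# Hibernation at 0.8 V: 1 ms to enter and 19 ms to wake up leave 12 ms off.
t_off_ms = FRAME_MS - 8.0 - (1.0 + 19.0)
max_saving_uj = 28.0 * 0.8 * t_off_ms   # upper bound, hibernation current = 0
hibernation_pays_off = max_saving_uj > 500.0  # 500 uJ to turn off V_DD

print(f"sleep: {e_sleep_12:.0f} uJ at 1.2 V, {e_sleep_08:.0f} uJ at 0.8 V")
print(f"hibernation saves at most {max_saving_uj:.0f} uJ: worth it? "
      f"{hibernation_pays_off}")
```
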
Solution 10.8

a) We want to compare the energy used during each 100 ms execution for the two processors. On the fast core, L78, the execution of the POP application takes:

$$
\begin{equation*}
\frac{12 \mathrm{M}}{1 \mathrm{GHz}}=12 \mathrm{~ms} \tag{12.138}
\end{equation*}
$$

whereas on the slower L52 core the same execution takes

$$
\begin{equation*}
\frac{15 \mathrm{M}}{200 \mathrm{MHz}}=75 \mathrm{~ms} \tag{12.139}
\end{equation*}
$$

First we assume that we will put both cores in power-off mode. It takes 5 ms to put either core to sleep and 15 ms to wake it up, so 20 ms is lost there for both cores. For the L78 there are then

$$
\begin{equation*}
100 \mathrm{~ms}-12 \mathrm{~ms}-20 \mathrm{~ms}=68 \mathrm{~ms} \tag{12.140}
\end{equation*}
$$

in power-off mode, while for the L52 we have:

$$
\begin{equation*}
100 \mathrm{~ms}-75 \mathrm{~ms}-20 \mathrm{~ms}=5 \mathrm{~ms} \tag{12.141}
\end{equation*}
$$

in power-off mode, which is so little that it may not be useful. Now we can calculate the energy for both cores if power-off mode is used.

$$
\begin{align*}
E_{L 78 p o} & =12 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 1 \mathrm{~A}+2 \mathrm{~mJ}+68 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 1 \mathrm{~mA}+5 \mathrm{~mJ}  \tag{12.142}\\
& =14.4 \mathrm{~mJ}+2 \mathrm{~mJ}+81.6 \mu \mathrm{~J}+5 \mathrm{~mJ} \approx 21.5 \mathrm{~mJ} \tag{12.143}
\end{align*}
$$

Similarly, for the L52 core we get:

$$
\begin{align*}
E_{L 52 p o} & =75 \mathrm{~ms} \times 1 \mathrm{~V} \times 200 \mathrm{~mA}+500 \mu \mathrm{~J}+5 \mathrm{~ms} \times 1 \mathrm{~V} \times 1 \mathrm{~mA}+1 \mathrm{~mJ}  \tag{12.144}\\
& =15 \mathrm{~mJ}+0.5 \mathrm{~mJ}+5 \mu \mathrm{~J}+1 \mathrm{~mJ} \approx 16.5 \mathrm{~mJ} \tag{12.145}
\end{align*}
$$
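
The two power-off energies share the same structure: run phase, idle phase and the cost of the mode transitions. A small sketch, with a hypothetical helper name and the numbers from Eqs. 12.138 to 12.145 (the two transition energies of each core are simply summed):

```python
def period_energy_mj(t_run_ms, vdd, i_run_ma, t_idle_ms, i_idle_ma,
                     e_transitions_mj):
    """Energy for one period: run phase + idle phase + mode transitions."""
    e_run = t_run_ms * vdd * i_run_ma * 1e-3    # ms * V * mA = uJ, then to mJ
    e_idle = t_idle_ms * vdd * i_idle_ma * 1e-3
    return e_run + e_idle + e_transitions_mj


PERIOD_MS = 100.0
T_TRANSITION_MS = 5.0 + 15.0   # power down plus wake up

t_l78 = 12e6 / 1e9 * 1e3       # 12 M cycles at 1 GHz, in ms
t_l52 = 15e6 / 200e6 * 1e3     # 15 M cycles at 200 MHz, in ms

e_l78 = period_energy_mj(t_l78, 1.2, 1000.0,
                         PERIOD_MS - t_l78 - T_TRANSITION_MS, 1.0, 2.0 + 5.0)
e_l52 = period_energy_mj(t_l52, 1.0, 200.0,
                         PERIOD_MS - t_l52 - T_TRANSITION_MS, 1.0, 0.5 + 1.0)
print(f"L78: {e_l78:.2f} mJ, L52: {e_l52:.2f} mJ")
```

The idle terms are tiny in both cases, so the comparison is dominated by the run energy and the transition costs.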

For the L52 we should consider sleep mode instead. We then get

$$
\begin{align*}
E_{L 52 s l} & =75 \mathrm{~ms} \times 1 \mathrm{~V} \times 200 \mathrm{~mA}+50 \mu \mathrm{~J}+23 \mathrm{~ms} \times 1 \mathrm{~V} \times 50 \mathrm{~mA}+50 \mu \mathrm{~J}  \tag{12.146}\\
& =15 \mathrm{~mJ}+0.1 \mathrm{~mJ}+1.15 \mathrm{~mJ} \approx 16.25 \mathrm{~mJ} \tag{12.147}
\end{align*}
$$

So the result is almost the same for sleep mode and power-off mode for the L52 core. But that core should be used, regardless.
b) Running the other application on the L78 processor with the power-off mode uses

$$
\begin{align*}
E_{L 78 p o} & =60 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 1 \mathrm{~A}+2 \mathrm{~mJ}+20 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 1 \mathrm{~mA}+5 \mathrm{~mJ}  \tag{12.148}\\
& =72 \mathrm{~mJ}+2 \mathrm{~mJ}+24 \mu \mathrm{~J}+5 \mathrm{~mJ} \approx 79 \mathrm{~mJ} \tag{12.149}
\end{align*}
$$

And we know from task a) that executing the POP task on the L52 core in sleep mode requires $15 \mathrm{~mJ}+0.1 \mathrm{~mJ}+1.15 \mathrm{~mJ}=16.25 \mathrm{~mJ}$. So in total 95.25 mJ if both cores are used.

Let's consider running the POP task on the L78 processor together with the other task that takes 60 ms. With the power-off mode used in the L78 processor, we add 14.4 mJ of execution energy (12 ms at 1.2 V and 1 A) to the 79 mJ above and spend nothing on the L52 core, so the total energy decreases to about 93.4 mJ.

But for the L78 core then the time in power off mode is reduced to merely

$$
\begin{equation*}
100 \mathrm{~ms}-72 \mathrm{~ms}-20 \mathrm{~ms}=8 \mathrm{~ms} \tag{12.150}
\end{equation*}
$$

which may not be enough for it to be useful to enter power-off mode.
If we instead consider sleep mode when running both applications on the L78, the time in sleep mode becomes:

$$
\begin{equation*}
100 \mathrm{~ms}-72 \mathrm{~ms}-1 \mathrm{~ms}=23 \mathrm{~ms} \tag{12.151}
\end{equation*}
$$

The energy for the L78 processor is then

$$
\begin{align*}
E_{L 78 s p} & =72 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 1 \mathrm{~A}+0.2 \mathrm{~mJ}+23 \mathrm{~ms} \times 1.2 \mathrm{~V} \times 250 \mathrm{~mA}+0.05 \mathrm{~mJ}  \tag{12.152}\\
& =86.4 \mathrm{~mJ}+0.25 \mathrm{~mJ}+6.9 \mathrm{~mJ} \approx 93.55 \mathrm{~mJ} \tag{12.153}
\end{align*}
$$

So there is not a huge difference, but power-off is a little better. So the most energy-efficient solution is to run both applications on the L78 core, schedule them together and still use the power-off mode.

Solution 10.9 Problem is on page 50.

See solution in Table 12.2.

Solution 10.10 Problem is on page 50.

See solution in Table 12.2.

Solution 10.11 Problem is on page 50.

Table 12.2: Table for Dennard scaling and constant voltage scaling.

| Parameter | Sensitivity expression | Dennard, scaling factor $S$ | Constant voltage, scaling factor $S$ |
| :---: | :---: | :---: | :---: |
| Scaling parameters |  |  |  |
| $L$ : length |  | 1/S | 1/S |
| $W$ : width |  | 1/S | 1/S |
| $t_{\text {ox }}$ : gate oxide thickness |  | 1/S | 1/S |
| $V_{\mathrm{DD}}$ : power supply voltage |  | 1/S | 1 |
| $V_{\mathrm{T}}$ : threshold voltage(s) |  | 1/S | 1 |
| $N_{A}$ : substrate doping |  | $S$ | $S$ |
| Device characteristics |  |  |  |
| $\beta$ : current factor | $\frac{W}{L} \frac{1}{t_{\text {ox }}}$ | $S$ | $S$ |
| $I_{\text {DS }}$ : transistor current | $\beta\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2}$ | 1/S | $S$ |
| $R_{\text {eff }}$ : resistance | $\frac{V_{\text {DD }}}{I_{\text {DS }}}$ | 1 | 1/S |
| $C$ : gate capacitance | $\frac{W L}{t_{\text {ox }}}$ | 1/S | $1 / S$ |
| $\tau$ : gate delay | $R_{\text {eff }} C$ | 1/S | $1 / S^{2}$ |
| $f$ : clock frequency | $\frac{1}{\tau}$ | $S$ | $S^{2}$ |
| $E$ : switching energy (per gate) | $C V_{\text {DD }}^{2}$ | $1 / S^{3}$ | $1 / S$ |
| $P$ : switching power (per gate) | $E f$ | $1 / S^{2}$ | $S$ |
| $A$ : area (per gate) | WL | $1 / S^{2}$ | $1 / S^{2}$ |
| Switching power density | $\frac{P}{A}$ | 1 | $S^{3}$ |
| Switching current density | $\frac{I_{\text {DS }}}{A}$ | $S$ | $S^{3}$ |
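The derived rows of Table 12.2 can be checked mechanically by tracking only the exponent of $S$ for each quantity. The sketch below is an illustration, not part of the original solution. The dictionary keys and the helper `derive` are names assumed here, and it relies on $V_{\mathrm{T}}$ scaling like $V_{\mathrm{DD}}$ (as in both columns of the table), so that $V_{\mathrm{DD}}-V_{\mathrm{T}}$ scales like $V_{\mathrm{DD}}$.

```python
def derive(base):
    """Return the exponent of S for each derived row, given the scaled parameters."""
    e = dict(base)
    e["beta"] = e["W"] - e["L"] - e["tox"]   # (W/L) * (1/tox)
    e["I"] = e["beta"] + 2 * e["V"]          # beta * (VDD - VT)^2
    e["R"] = e["V"] - e["I"]                 # VDD / I_DS
    e["C"] = e["W"] + e["L"] - e["tox"]      # W * L / tox
    e["tau"] = e["R"] + e["C"]               # R_eff * C
    e["f"] = -e["tau"]                       # 1 / tau
    e["E"] = e["C"] + 2 * e["V"]             # C * VDD^2
    e["P"] = e["E"] + e["f"]                 # E * f
    e["A"] = e["W"] + e["L"]                 # W * L
    e["P/A"] = e["P"] - e["A"]               # switching power density
    e["I/A"] = e["I"] - e["A"]               # switching current density
    return e

# Exponents of S for the scaling parameters: 1/S is -1, unscaled is 0.
dennard = derive({"W": -1, "L": -1, "tox": -1, "V": -1})
const_v = derive({"W": -1, "L": -1, "tox": -1, "V": 0})

for name in ("beta", "I", "R", "C", "tau", "f", "E", "P", "A", "P/A", "I/A"):
    print(f"{name:>4}: Dennard S^{dennard[name]:+d}, constant voltage S^{const_v[name]:+d}")
```

An exponent of $-1$ corresponds to a table entry of $1/S$, and an exponent of $0$ to an entry of $1$.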

The FO4 delay is $\left(4+p_{\mathrm{inv}}\right) \tau$. With no major changes in the transistor parasitics it should scale just as $\tau$. If nothing else is said we assume Dennard scaling, where "all" parameters are scaled the same; see Table 12.2. That table already gives the scaling of $\tau$, but we can rederive it. We have

$$
\begin{equation*}
\tau=R C=\frac{V_{D D} C}{I_{D S A T}}=\frac{V_{D D} C}{\frac{W}{L} \mu C_{\mathrm{OX}}\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2}} \tag{12.154}
\end{equation*}
$$

The only tricky part is $\mu C_{\mathrm{OX}}$ where we have to remember that it scales as $\frac{1}{t_{\mathrm{OX}}}$, the inverse of the gate-oxide thickness. Now we can derive the scaling:

$$
\begin{equation*}
\tau \sim \frac{\frac{1}{S} \times \frac{S}{S^{2}}}{\frac{S}{S} \times S \times \frac{1}{S^{2}}}=\frac{1}{S} \tag{12.155}
\end{equation*}
$$

The scaling of the transistor lengths is $S=\frac{0.35}{0.13}=2.7$, so with full Dennard scaling the new FO4 delay should be $\frac{125}{2.7}=46 \mathrm{ps}$. Here, however, $V_{\mathrm{DD}}$ is not scaled down by the full factor, which would have resulted in a $V_{\mathrm{DD}}$ of 1.22 V. The ratio for $V_{\mathrm{DD}}$ is only $K=\frac{3.3}{1.8}=1.83$. We can derive the scaling for $\tau$ when the voltages are not scaled by the same factor as the transistor dimensions:

$$
\begin{equation*}
\tau \sim \frac{\frac{1}{K} \times \frac{S}{S^{2}}}{\frac{S}{S} \times S \times \frac{1}{K^{2}}}=\frac{K}{S^{2}} \tag{12.156}
\end{equation*}
$$

As a result we expect the FO4 delay in the $0.13 \mu \mathrm{~m}$ process to be

$$
\begin{equation*}
\mathrm{FO4}_{0.13}=125 \mathrm{ps} \times \frac{1.83}{2.7^{2}} \approx 32 \mathrm{ps} \tag{12.157}
\end{equation*}
$$
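The two FO4 estimates can be reproduced with a few lines of Python. This is a sketch under the scaling rule $\tau \sim K / S^{2}$ of equation (12.156); the function name `fo4_scaled` is an assumption made here. Setting $K=S$ recovers the Dennard result $\tau \sim 1 / S$.

```python
def fo4_scaled(fo4_old_ps, s, k):
    """Scale an FO4 delay with tau ~ K / S^2, as in equation (12.156)."""
    return fo4_old_ps * k / s**2

s = 0.35 / 0.13   # dimension scaling factor S
k = 3.3 / 1.8     # supply-voltage scaling factor K

fo4_dennard = fo4_scaled(125, s, s)   # K = S: full Dennard scaling
fo4_actual = fo4_scaled(125, s, k)    # supply only scaled from 3.3 V to 1.8 V

print(f"{fo4_dennard:.0f} ps, {fo4_actual:.0f} ps")   # prints "46 ps, 32 ps"
```

Because the supply voltage is scaled less than the dimensions, the transistors are driven harder and the gate ends up faster than the Dennard prediction.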

Solution 10.12 Problem is on page 50.
a) The specification $f=\alpha V_{\mathrm{DD}}$ means we have $P_{\mathrm{dyn}} \propto C V_{\mathrm{DD}}^{3}$. We also know that the dual-core will have twice the capacitance of the single core. So $P_{\text {DUAL-CORE }}=P_{\text {SINGLE-CORE }}$ means

$$
\begin{equation*}
2 C\left(x V_{\mathrm{DD}}\right)^{3}=C V_{\mathrm{DD}}^{3} \tag{12.158}
\end{equation*}
$$

where $x$ is the scale factor. Hence we get $x=\frac{1}{\sqrt[3]{2}} \approx \frac{1}{1.26}$. Thus, the new supply voltage has to be $V_{\mathrm{DD}}=\frac{1.2}{1.26} \approx 0.95 \mathrm{~V}$.
b) The scaling is from 0.90 to $0.65 \mu \mathrm{~m}$, that is, the dimensions shrink by a factor of approximately $\frac{1}{\sqrt{2}}$ (so $S \approx \sqrt{2}$ in the notation of Table 12.2). Thus the area will be half of the old one, i.e. $A_{\text {new }}=\frac{A_{\text {old }}}{2}=100 \mathrm{~mm}^{2}$.

If frequency scales as $V_{\mathrm{DD}}$ it becomes $3.8 / 1.2 \approx 3.2 \mathrm{GHz}$. Power: the capacitance is scaled to half, so the new power is $P_{\text {dyn }}^{\text {new }}=\frac{P_{\text {dyn }}^{\text {old }}}{2}\left(\frac{1}{1.2}\right)^{3}=\frac{100}{2}\left(\frac{1}{1.2}\right)^{3} \approx 30 \mathrm{~W}$.
c) We now assume 10 W of static power and 90 W of dynamic power to begin with. 10 W of static power at $V_{\mathrm{DD}}=1.2 \mathrm{~V}$ corresponds to a leakage current of 8.3 A. A four-fold increase of the static leakage current would mean a 33 A leakage current, and hence 33 W of static power at $V_{\mathrm{DD}}=1 \mathrm{~V}$. The dynamic power was originally 90 W, and becomes $0.30 \times 90=27 \mathrm{~W}$ in the new technology, when we use the reduction factor we already calculated in task b). The new total power dissipation is 60 W, split roughly 55 % static to 45 % dynamic. Discouragingly enough, more static power than dynamic power is dissipated in this what-if scenario. Not good. Something has to be done to reduce leakage!
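The arithmetic of tasks a) to c) can be collected in one short script. This is a sketch, not part of the original solution. The variable names are chosen here, and the numeric inputs (1.2 V, 3.8 GHz, 100 W, a new supply of 1.0 V) are the ones that appear in the text above.

```python
# Values quoted in the solution text.
VDD_OLD, F_OLD_GHZ, P_DYN_OLD = 1.2, 3.8, 100.0

# a) Dual core at unchanged dynamic power: 2*C*(x*VDD)^3 = C*VDD^3.
x = (1 / 2) ** (1 / 3)
vdd_dual = x * VDD_OLD

# b) New process: capacitance halved, supply lowered to 1.0 V, f proportional to VDD.
v_ratio = 1.0 / VDD_OLD
f_new_ghz = F_OLD_GHZ * v_ratio
p_dyn_new = P_DYN_OLD / 2 * v_ratio**3

# c) 10 W static + 90 W dynamic to begin with; the leakage current grows four-fold.
i_leak_old = 10.0 / VDD_OLD
p_static_new = 4 * i_leak_old * 1.0   # new static power at VDD = 1.0 V
p_dyn_c = 0.30 * 90.0                 # rounded reduction factor from task b), as in the text
p_total = p_static_new + p_dyn_c

print(f"a) VDD = {vdd_dual:.2f} V")
print(f"b) f = {f_new_ghz:.1f} GHz, P_dyn = {p_dyn_new:.1f} W")
print(f"c) static = {p_static_new:.1f} W, total = {p_total:.1f} W, "
      f"static share = {p_static_new / p_total:.0%}")
```

Note that the unrounded dynamic power in b) is slightly below the 30 W used in the text, so the totals in c) are approximate to within a watt or so.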

Solution 10.13 Problem is on page 51.

The propagation delay in a processor is given by

$$
\begin{equation*}
t_{d}=0.7 R C=0.7 \frac{V_{\mathrm{DD}} C}{I_{\mathrm{DSAT}}} \tag{12.159}
\end{equation*}
$$

where the saturation current is given by the square-law model:

$$
\begin{equation*}
I_{\mathrm{DSAT}}=\frac{k}{2}\left(V_{\mathrm{DD}}-V_{\mathrm{T}}\right)^{2} \tag{12.160}
\end{equation*}
$$

With four cores we can allow the delay to increase by a factor of four. This increase in turn allows us to decrease the supply voltage, $V_{\mathrm{DD}}$. To find the maximum allowed scale factor $x$ we equate four times the expression for the original $t_{d}$ with the expression for the new $t_{d}$. Remembering that $V_{\mathrm{T}}$ does not scale, we get:

$$
\begin{equation*}
4 \times \frac{1}{(1-0.25)^{2}}=\frac{x}{(x-0.25)^{2}} \tag{12.161}
\end{equation*}
$$

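Rearranging gives the quadratic equation $x^{2}-0.6406 x+0.0625=0$. Only its larger root lies above the normalized threshold voltage of 0.25, which yields

$$
\begin{equation*}
x=0.3203+\sqrt{0.3203^{2}-0.0625}=0.52 \approx \frac{1}{2} \tag{12.162}
\end{equation*}
$$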

So the new $V_{\mathrm{DD}}$ is approximately half of the original one. Dynamic power dissipation is given by:

$$
\begin{equation*}
P_{d y n}=\alpha f C V_{\mathrm{DD}}^{2} \tag{12.163}
\end{equation*}
$$

If we assume that the activity factor $\alpha$ is the same in both cases, the power dissipation for the multicore solution is:

$$
\begin{equation*}
P_{d y n m c}=\alpha\left(\frac{f}{4}\right) 4.2 C\left(\frac{V_{\mathrm{DD}}}{2}\right)^{2}=\frac{4.2}{16} P_{d y n} \approx 0.26 P_{d y n}, \tag{12.164}
\end{equation*}
$$

where we have increased the capacitance another $0.2 C$ to account for the necessary wiring. So in conclusion the quad-core power dissipation is only a quarter of the single-core power dissipation. At least in the ideal world!
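As a numerical cross-check, equation (12.161) can be solved by bisection and the result inserted into the power expression. The sketch below is illustrative only. The function names are assumptions made here, and the threshold voltage is taken as 0.25 of the original $V_{\mathrm{DD}}$, as in (12.161).

```python
VT = 0.25  # threshold voltage, normalised to the original VDD as in (12.161)

def delay(x):
    """Delay, up to a constant factor, for a supply of x times the original VDD."""
    return x / (x - VT) ** 2

target = 4 * delay(1.0)  # four cores tolerate four times the original delay

# delay(x) decreases monotonically for x > VT, so bisection finds the unique root.
lo, hi = VT + 1e-6, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if delay(mid) > target:
        lo = mid   # still too slow: the supply must be higher
    else:
        hi = mid
x = (lo + hi) / 2

def power_ratio(x, cores=4, cap_factor=4.2):
    """P_multicore / P_single for alpha * (f/cores) * cap_factor*C * (x*VDD)^2."""
    return cap_factor / cores * x**2

print(f"x = {x:.2f}")
print(f"power ratio with x = 1/2: {power_ratio(0.5):.2f}")
print(f"power ratio with the unrounded x: {power_ratio(x):.2f}")
```

The root is close to one half, which supports the rounding used in (12.164). With the unrounded root the saving is slightly smaller, but the conclusion that the quad-core solution dissipates roughly a quarter of the single-core power still holds.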

### 12.11 Adders

Solution 11.1 Problem is on page 53.
a) Here is the solution:

b) As hinted in the solution under a) the critical path is from $c_{\text {in }}$ to $c_{\text {out }}$ of the first 8-bit adder, through the 3 PG cells and from $c_{\text {in }}$ to SUM31 of the last 8 -bit adder. The delay to SUM31 is then: $9+3+8=20$ unit delays. We should also check that the delays to the P and G outputs are shorter. They are both $9+3+1=14$ unit delays.
c) A similar solution with 4-bit adders would have shorter delays to generate the first carry out and the last SUM output, but twice the number of PG cells. The delay to the last SUM bit would then be: $5+7+4=16$ unit delays. So in this case it would pay off to use a shorter adder.
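The trade-off between block length and number of PG cells can be tabulated with a small script. It is only a sketch: the delay model is inferred from the two sums quoted above ($9+3+8$ and $5+7+4$), namely $n+1$ unit delays from $c_{\text{in}}$ to $c_{\text{out}}$ of the first $n$-bit adder, one unit delay per PG cell, and $n$ unit delays from $c_{\text{in}}$ to the top SUM bit of the last adder. The function name `sum31_delay` is an assumption made here.

```python
def sum31_delay(block_bits, total_bits=32):
    """Unit delays to SUM31 under the delay model assumed in the lead-in."""
    blocks = total_bits // block_bits
    first_carry = block_bits + 1   # c_in -> c_out of the first adder
    pg_cells = blocks - 1          # one PG cell for each remaining block
    last_sum = block_bits          # c_in -> top SUM bit of the last adder
    return first_carry + pg_cells + last_sum

for n in (2, 4, 8, 16):
    print(f"{n:2d}-bit blocks: {sum31_delay(n)} unit delays")
```

Under this assumed model, very short blocks lose again because the chain of PG cells grows, so the block length cannot be reduced indefinitely.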

Solution 11.2 Problem is on page 54.

The solution is shown in Figure 12.21.


Figure 12.21: The unknown adder completed and with the critical path drawn.

Solution 11.3 Problem is on page 54.

For the solution, see Figure 12.22. The diagram is from Weste and Harris.
It is possible to reason about which adder is which, even without all details. The four types are, in alphabetical order:

A. Carry-lookahead adder

B. Carry-select adder

C. Prefix adder

D. Ripple-carry adder

Ripple-carry adders use only one full-adder cell per bit so their areas are small. Their delays are linear in the number of bits. We also know that they are the slowest of all types of adders. From the diagram we thus deduce that ripple-carry adders have to be type 4.

Prefix adders are known for using a lot of hardware, but some types of prefix adders use more hardware than others. The reduction in delay is then quite small for the amount of hardware added (corresponding to the delay of one or two AND-OR cells). The fastest prefix adders have a delay that is mainly proportional to $\log _{2}(N)$. (I say mainly because there is also a constant part in the expression for the delay.) So from the diagram it is quite clear that prefix adders have to be type 1, since they are the fastest ones and the difference in delay between 32 and 64 bits is small, while the added amount of hardware is the largest.

The less obvious part is probably to match the two types in the middle: carry-lookahead adders and carry-select adders. Type 3 is quite a bit slower than type 2, especially for 64-bit data. You may remember that carry-lookahead adders use block-propagate and block-generate signals, as do prefix adders, but that they do not create the carries in a tree structure, while carry-select adders use multiplexers to select between the carry outputs of two ripple-carry blocks. So carry-lookahead should be closest in delay to prefix adders, while carry-select should be closer to ripple-carry in delay and have about twice the hardware of a simple ripple-carry adder. So type 2 has to be the carry-lookahead adders, while type 3 is the carry-select adders.


Figure 12.22: Solution to task 1c). The types of adders in the synthesis experiment. The diagram is taken from Weste and Harris.

Solution 11.4 Problem is on page 54.
a) It is possible to replace the multiplexer with the AO21 gate in the carry-lookahead adder because the block generate and block propagate signals are never one at the same time.
b) In general the delay for the CLA adder with $k$ groups of $n$ bits each is

$$
\begin{equation*}
t_{\mathrm{CLA}}=t_{p g}+t_{p g(n)}+[(n-1)+(k-1)] t_{a o}+t_{x o r}, \tag{12.165}
\end{equation*}
$$

which is equation 11.14 in Weste \& Harris. Here $t_{p g(n)}$ is the time for computing the valency-$n$ block generate signal. With only AO gates available, $n-1$ such gates are needed for this computation, so we can rewrite this equation as:

$$
\begin{equation*}
t_{\mathrm{CLA}}=t_{p g}+[2(n-1)+(k-1)] t_{a o}+t_{x o r}, \tag{12.166}
\end{equation*}
$$

So one $(n-1)$ is due to the computation of the block-G signal and the other $(n-1)$ is due to calculating the last carry.

So here we first have $k=4$ and $n=4$:

$$
\begin{equation*}
t_{\text {CLA4:4 }}=t_{p g}+[2 \times(4-1)+(4-1)] t_{a o}+t_{x o r} \tag{12.167}
\end{equation*}
$$

And second, $k=2$ and $n=8$ :

$$
\begin{equation*}
t_{\mathrm{CLA} 8: 2}=t_{p g}+[2 \times(8-1)+(2-1)] t_{a o}+t_{x o r} \tag{12.168}
\end{equation*}
$$

We see that the first setup has the lower delay. With the numbers in the problem we find $t_{\mathrm{CLA} 4: 4}=31$ u.d. and $t_{\mathrm{CLA} 8: 2}=49$ u.d.
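Equation (12.166) is easy to evaluate for other uniform partitions of the 16 bits. In the sketch below the gate delays are not taken from the problem statement, which is not reproduced here. They are inferred from the two results quoted above (31 and 49 u.d.), which are consistent with $t_{a o}=3$ u.d. and $t_{p g}+t_{x o r}=4$ u.d. Treat both constants, and the function name `t_cla`, as assumptions.

```python
T_AO = 3            # assumed AO delay in u.d., inferred from the quoted results
T_PG_PLUS_XOR = 4   # assumed t_pg + t_xor in u.d., inferred from the quoted results

def t_cla(n, k):
    """Delay of a CLA with k groups of n bits each, following equation (12.166)."""
    return T_PG_PLUS_XOR + (2 * (n - 1) + (k - 1)) * T_AO

# All uniform partitions of 16 bits into k groups of n bits.
for n in (2, 4, 8, 16):
    k = 16 // n
    print(f"n = {n:2d}, k = {k}: {t_cla(n, k)} u.d.")
```

The bracket $2(n-1)+(k-1)$ grows twice as fast with the block size $n$ as with the number of blocks $k$, which is why two blocks of eight bits come out slower than four blocks of four.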
c)

Add a figure here to make the argument for task c) clearer.

In either case, when the 8-bit block is first or last, the delay will be:

$$
\begin{equation*}
t_{\mathrm{CLA} 448}=t_{p g}+[(4-1)+(8-1)+(3-1)] t_{a o}+t_{x o r} \tag{12.169}
\end{equation*}
$$

So then the delay becomes $t_{\mathrm{CLA} 448}=40$ u.d.

However, with the 8-bit block in the middle we have to analyze the delay in detail. Then we have $(4-1) t_{a o}$ at the beginning and $(4-1) t_{a o}$ at the end, with $(8-3) t_{a o}$ for the middle block in between:

$$
\begin{equation*}
t_{\mathrm{CLA} 484}=t_{p g}+[(4-1)+(8-3)+(4-1)] t_{a o}+t_{x o r} \tag{12.170}
\end{equation*}
$$

So then the delay becomes $t_{\mathrm{CLA} 484}=37$ u.d. We will add a figure to show this more clearly.

Add a figure here to make the argument for task d) clearer.

d) Scaling up the block size by one bit per block is the best you can do, since that extra delay will be "hidden": it is in parallel with the next block. That way the initial $n$ part will be as small as possible while the number of blocks, $k$, is also as low as possible. But then you want to decrease the block size again, so that the last block is also small, making the last $n$ small. One possibility with six blocks is 234432, but that is 18 bits! This simple thought experiment makes it probable that six blocks will not be the best solution with 16 bits. So we propose the same solution cut down to five blocks, 23443, which should give us the number of $t_{a o}$ delays as $(2-1)+(5-1)+(3-1)=7$ and the total delay as $t_{\mathrm{CLA} 23443}=25$ u.d.
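The AO counts for blocks of unequal size in tasks b) to d) can be reproduced with a simple arrival-time model. This is a sketch under assumptions, not the method of the original solution: a block of $n$ bits has its block generate ready after $n-1$ AO delays, each block carry-out needs one more AO after both its carry-in and its block generate are ready, and the last block ripples $n-1$ AO delays to its final carry. The gate delays ($t_{a o}=3$ u.d., $t_{p g}+t_{x o r}=4$ u.d.) are inferred from the results quoted in this solution, and the function names are chosen here.

```python
T_AO = 3            # assumed AO delay in u.d., inferred from the quoted results
T_PG_PLUS_XOR = 4   # assumed t_pg + t_xor in u.d., inferred from the quoted results

def ao_delays(blocks):
    """AO delays on the critical path for the given block sizes, first block first."""
    carry = 0  # time, in AO delays, at which the carry into the current block is ready
    for n in blocks[:-1]:
        block_g = n - 1                  # block generate needs n-1 AO gates
        carry = max(carry, block_g) + 1  # one more AO forms the block carry-out
    return carry + (blocks[-1] - 1)      # ripple to the last carry in the final block

def t_cla(blocks):
    """Total delay in u.d. under the assumed gate delays."""
    return T_PG_PLUS_XOR + ao_delays(blocks) * T_AO

for blocks in ([4, 4, 4, 4], [8, 8], [4, 4, 8], [8, 4, 4], [4, 8, 4], [2, 3, 4, 4, 3]):
    print(blocks, ao_delays(blocks), "AO delays,", t_cla(blocks), "u.d.")
```

The `max` in the loop is where the "hidden" delay shows up: a block generate that is ready before the carry arrives adds nothing to the critical path.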

Appendices

## Appendix A

## Templates and graphs to draw on

We placed all the large templates and graphs that you can draw on yourself in this appendix, so that if you want to print only these pages, that can easily be achieved without printing everything else. And conversely, if you want to print the exercises you do not get many pages just with the templates and graphs.


Figure A.1: Voltage transfer curves (VTC) for the three outputs X, Y and W and the derivative of the VTC for output X in larger scale.

## Bibliography


[^0]:    Exercise 10.3: Problem tests understanding of subthreshold leakage. Solution on page 90.

