Nvidia Turing GPU Diagnosing Guide

From Repair Wiki
Device: RTX 2080Ti, RTX 2080, RTX 2070, RTX 2060, RTX 2060 Super, RTX 2080 Super, RTX 2070 Super
Affects part(s): Whole board
Needs equipment: Multimeter
Difficulty: Easy
Type: Explanatory, Diagnostics

This guide applies to most Turing cards, from the RTX 2060 to the RTX 2080Ti. Some vendors may design different PCBs or use different components, but the general working principles should be the same for all of them unless specified. This guide uses a reference RTX 2080 as an example.

Have any questions? Need help with a specific GPU problem? Post to /r/GPURepair!

The Card Layout

RTX 2080 Reference board layout. (Figure 1)

*PCB Image courtesy of TechPowerUp*

Before doing anything, it's a good idea to inspect the card for physical damage, especially on cards that have no backplate. They can easily lose components on the back due to poor handling.

After making sure there is no physical damage to the card itself, use a multimeter to check the resistances of the voltage rails.

Step 1: Base Voltage Rails (12V, 3.3V)

What are the base voltage rails for GPUs? They are the voltages supplied to the card by the motherboard and the external power connector(s).

12V rails

The card is supplied with 12V through the PCIe slot and the additional 6-pin or 8-pin connector(s).

Start by measuring the resistance to ground of the 12V rail coming from the PCIe slot (first 3 pins, or the inductor marked with green in Figure 1).

After that, measure the inductor for each external power connector. Some cards have multiple external power connectors, each with its own inductor, and you have to measure each of them individually.

The resistance varies from card to card and the exact value doesn't matter, but it should be in the kilohm range (thousands of Ω) or higher.

3.3V rail

The card gets 3.3V from the PCIe slot only: on the front, from the 4th pin counting left from the PCIe key notch, and on the back, from the 2nd and 3rd pins counting from the notch again. You can measure the rail either at those pins or at the inductor marked in Figure 1.

If you get less than 50Ω on one or more base rails, the card has a short. The computer might not turn on in that case because the power supply is protecting itself with OCP (over-current protection). Solution: Check out this page dedicated to Base Voltage Rail Short on Turing GPUs.

Otherwise, if you have no short then you can continue troubleshooting.
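
The short check in Step 1 can be written down as a small helper. This is a minimal sketch: only the 50Ω threshold comes from this guide, while the function name, rail labels, and example readings are made up for illustration.

```python
# Threshold from the guide: a base rail reading below 50 ohms counts as a short.
SHORT_THRESHOLD_OHMS = 50.0


def find_shorted_rails(readings_ohms):
    """Return the names of rails whose resistance to ground is below the threshold."""
    return [
        rail
        for rail, ohms in readings_ohms.items()
        if ohms < SHORT_THRESHOLD_OHMS
    ]


# Hypothetical multimeter readings in ohms (made up for illustration).
readings = {
    "12V (PCIe slot)": 4200.0,
    "12V (8-pin)": 3.1,
    "3.3V": 650.0,
}

print(find_shorted_rails(readings))  # ['12V (8-pin)']
```

An empty list means no base rail reads as shorted, so you can move on to Step 2.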

Step 2: Minor Voltage Rails (5V, 5V USBC, 1.8V, VCore, VMem, and PEX)

Minor voltage rails are the ones created by the card itself using the base rails through either Linear Voltage Regulators or Step Down Buck Converters.

Check the resistance at the output of each of those rails and compare it with the values in Figure 1. On 1000-series and newer cards, VCore has such a low resistance to ground that the reading is not useful. It is more helpful to measure VCore against the 12V rails instead of GND.

5V USBC is a new rail that is not present on all Turing cards. It powers the USBC port and is not necessary for the card's operation.

If one or more of those rails reads lower than expected, head to their pages linked below.

Otherwise, continue with the guide.

Step 3: Powering on the card

Assuming you have no shorts anywhere, you can plug the card into the motherboard and start testing. Alternatively, you can test the card with a lab bench power supply and a riser. This is safer for the motherboard, gives you more freedom to move the card around, and shows you the card's current draw if there is a short.

Switch your multimeter to DC voltage mode and measure the base rails first. If they are present, continue to the minor rails.

Minor rails turn on in sequence. If one doesn't start, the ones after it will not turn on.

Power Sequence

The order in which they turn on in most Turing GPUs is as follows: 5V → 1.8V → VCore → VMem/PEX.

For example, if 5V does not turn on, nothing after it in the chain will turn on either. This is why the fans do not spin if you have a problem with 5V or 1.8V.
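
Because the rails come up in order, the first missing rail in the chain is the one to investigate. The sketch below encodes the sequence from this guide and returns the earliest stage that failed. The function name and the dictionary of measurements are hypothetical, and treating VMem and PEX as one shared stage is an assumption based on the "VMem/PEX" notation above.

```python
# Power-up order from the guide: 5V -> 1.8V -> VCore -> VMem/PEX.
# VMem and PEX are modeled as one shared stage (an assumption).
POWER_STAGES = [("5V",), ("1.8V",), ("VCore",), ("VMem", "PEX")]


def first_failed_stage(present):
    """Return the rails missing at the earliest incomplete stage, or [] if all are up."""
    for stage in POWER_STAGES:
        missing = [rail for rail in stage if not present.get(rail, False)]
        if missing:
            return missing
    return []


# Hypothetical result of probing each rail in DC voltage mode.
measured = {"5V": True, "1.8V": False, "VCore": False, "VMem": False, "PEX": False}

print(first_failed_stage(measured))  # ['1.8V']
```

Here VCore, VMem, and PEX are also absent, but only 1.8V is reported because the later rails cannot start until it does.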

If you're missing one of them, check their respective page:

Step 4: No Video Out

Everything is present but still no video out? The fault is in the memory, the BIOS, the GPU chip itself, or in some cases the straps.

Memory problems

If you've reached this point, the most likely culprit is the memory. You can confirm this by installing the card in the motherboard, connecting it to a monitor, and powering on. After a minute or so, the monitor's backlight should turn on but show no image.

That behavior means the card initialized but detected a memory failure. See the Nvidia Memory Testing Guide for how to identify the faulty memory chips.

BIOS problems

If the memory is okay, or the card is not even detected in MATS, then the problem is very likely the BIOS. Check: BIOS Problems on Turing GPUs.

Straps

Configuration straps act like switches that configure certain settings for the card, for example memory type, memory capacity, and enabling or disabling some functions.

Rarely, a strap resistor can become faulty, either changing in value or going open circuit, and prevent the card from working properly. Check their values with the resistors removed from the circuit.

Location of the strap resistors on a reference RTX 2080. (Figure 2)
Schematic view of the straps and their functions. (Figure 3)

Crystal Oscillator

Crystal oscillators, often marked with a Y followed by a number, sometimes fail, which prevents the card from booting.

Location of the crystal oscillator clock generator on a reference RTX 2080. (Figure 4)

In most if not all Turing GPUs, the frequency of the oscillator is 27 MHz. To test it, you need an oscilloscope or a multimeter with a frequency (Hz) function that can measure above 27 MHz.
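
Comparing a measured frequency against the 27 MHz nominal value is a simple tolerance check. In the sketch below, only the 27 MHz figure comes from this guide. The ±100 ppm tolerance is an illustrative assumption, not a specification for these boards, and the function name is hypothetical.

```python
# Nominal oscillator frequency from the guide.
NOMINAL_HZ = 27_000_000


def oscillator_ok(measured_hz, tolerance_ppm=100.0):
    """Return True if the reading is within tolerance_ppm of 27 MHz.

    The default tolerance is an assumed value for illustration only.
    """
    allowed_hz = NOMINAL_HZ * tolerance_ppm / 1_000_000
    return abs(measured_hz - NOMINAL_HZ) <= allowed_hz


print(oscillator_ok(27_000_500))  # True: 500 Hz off, within the 2700 Hz window
print(oscillator_ok(0))           # False: no oscillation measured
```

Widen the tolerance if your instrument is less accurate than the assumed window. A reading of zero or one far from 27 MHz points to a failed oscillator.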

Dead PCIe data lanes

If the card has been used for mining, there is a chance that the miner inserted the riser backwards, which can fry the first PCIe data lane inside the core. This video explains it further and shows a potential fix.

Faulty GPU Core

If everything else is working as it should but there is still no video out, then unfortunately you have a faulty GPU core. The best use for that card is as spare parts. A GPU chip by itself is very hard to find and expensive, and replacing it is a very advanced procedure that requires a BGA rework station, which puts it out of reach for many people.

Step 5: GPU outputs a picture

If the card does output a picture but is not working properly, here are the common problems and their potential fixes.

Artifacting

Artifacting is most often caused by memory problems. Check the Nvidia Memory Testing Guide.

If you do not get memory errors even after a test of 100 MB or more in MATS, then the core is very likely the issue.

Error 43

Just like artifacting, error 43 can be caused by faulty memory or core but also BIOS and straps.

Start by making sure the memory is fine as shown in the guide above. Then check that the BIOS is not corrupted or modded (flash the original BIOS from either the TPU library or the manufacturer's site) and check the BIOS circuit as shown here: BIOS Problems with Turing GPUs

If the problem persists after that, check the strap resistors. They can get knocked off or change in value, either of which will trigger error 43.

If everything is fine but the error persists then the core itself is faulty.
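
The Error 43 procedure is a process of elimination in a fixed order: memory, then BIOS, then straps, with the core as the remaining suspect. A minimal sketch of that order follows. The function name and the labels for each check are hypothetical.

```python
# Order of checks from the guide; the core is blamed only when all of these pass.
ERROR_43_CHECKS = ["memory", "BIOS", "straps"]


def next_error43_suspect(passed_checks):
    """Return the next item to check, or 'GPU core' once every check has passed."""
    for check in ERROR_43_CHECKS:
        if check not in passed_checks:
            return check
    return "GPU core"


print(next_error43_suspect(set()))                         # memory
print(next_error43_suspect({"memory"}))                    # BIOS
print(next_error43_suspect({"memory", "BIOS", "straps"}))  # GPU core
```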