Coding a BIP39 Microlibrary in Java

Alan Evans
8 min read · Oct 8, 2017


BIP39 is a protocol that describes how to create a human-readable mnemonic sentence, and how to convert that mnemonic into a seed that can then produce an HD crypto wallet (BIP32/BIP44).

For example:

mnemonic:
talk smoke guess belt become ritual
powder lyrics annual tomorrow relief witness
passphrase: m3d1um

Creates the BIP39 seed:

da8fefd74e5ce5cd644aa4f73ef265f80e95e622331039cd33b223f069282347f071740d29bec6aed7e25159bcda9589566dd23152269a49b64490a95f684c34

Memorizing the mnemonic and the optional passphrase, writing them down without error, and communicating them is far easier than doing the same with the seed.

BIP39 mnemonics are used in many software wallets today, and in top hardware wallets like Trezor and Ledger.

You can read the full BIP39 description here: https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki

But as a software developer, I understand something best when I can see the code. There are many libraries out there, and they are all quite complex because, of course, they do more than just BIP39. In this article I’m going to walk through creating a BIP39 Microlibrary.

The code I will provide is heavily commented and not refactored out into many well-named, single-responsibility methods. This is not my usual style, and not what you will see in the published git repository; this is just the first draft of these functions.

BIP39’s public interface is three functions:

createMnemonic(entropy, wordlist) -> mnemonic
getSeed(mnemonic, passphrase) -> seed
verifyMnemonic(mnemonic, wordlist) -> validity

This is an ideal Microlibrary footprint. As it turns out, the getSeed function has nothing in common with the other two (save for some shared test vectors), and this could be two Microlibraries, but I don’t want to split the BIP.

createMnemonic(entropy, wordlist) -> mnemonic

First, the spec gives some pre-checks: the entropy length in bits must be divisible by 32 and within the inclusive range 128–256 bits.

final int ent = entropy.length * 8;
if (ent < 128)
    throw new RuntimeException("Entropy too low, 128-256 bits allowed");
if (ent > 256)
    throw new RuntimeException("Entropy too high, 128-256 bits allowed");
if (ent % 32 > 0)
    throw new RuntimeException("Number of entropy bits must be divisible by 32");

Then the spec defines cs, the number of checksum bits to take from the hash, and ms, the resulting mnemonic length in words.

//checksum length (bits)
final int cs = ent / 32;
//mnemonic length (words)
final int ms = (ent + cs) / 11;
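
For example, 128 bits of entropy gives a 4-bit checksum and a 12-word mnemonic, while 256 bits gives an 8-bit checksum and 24 words.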

We need to append the first cs bits of hash to our entropy byte array.

//make a copy with space for one more
final byte[] entropyWithChecksum = Arrays.copyOf(entropy, entropy.length + 1);
//hash the original
MessageDigest digest = MessageDigest.getInstance("SHA-256");
final byte[] hash = digest.digest(entropy);
//append the first byte of the hash to the copy
entropyWithChecksum[entropy.length] = hash[0];

But notice I am not going to bother trimming the checksum down; I simply append the first whole byte. This is because we are going to read 11 bits at a time, and the leftover fraction of a byte at the end will be ignored anyway.

We now simply have to process this byte array 11 bits at a time, which is quite a daunting prospect.

For this I imagined a moving window of bits over the byte array: it has an offset to the start of the window and a length of 11 bits. The window may cover 2 or 3 bytes of the array, depending on the offset.

static int next11Bits(byte[] bytes, int offset) {
    //what's the index of the first byte
    final int skip = offset / 8;

    //how many bits will we need to drop from
    // the end of a 3 byte word?
    final int lowerBitsToRemove = (3 * 8 - 11) - (offset % 8);

    //get the first byte, the 0xff trick is
    // due to Java not having unsigned types
    final int b1 = (int) bytes[skip] & 0xff;

    //second byte
    final int b2 = (int) bytes[skip + 1] & 0xff;

    //third byte, but only if we need it,
    // or we might go out of bounds
    final int b3 = lowerBitsToRemove < 8
            ? ((int) bytes[skip + 2] & 0xff)
            : 0;

    //build up a 3 byte word from the three bytes
    final int firstThreeBytes = b1 << 16 | b2 << 8 | b3;

    //this mask is fixed, it's to keep the last 11 bits
    final int mask = (1 << 11) - 1;

    //drop the last n bits and apply the
    // mask to lose the upper bits
    return firstThreeBytes >> lowerBitsToRemove & mask;
}

Stripping the comments and inlining a little, it looks like this:

static int next11Bits(byte[] bytes, int offset) {
    final int skip = offset / 8;
    final int lowerBitsToRemove = (3 * 8 - 11) - (offset % 8);
    return (((int) bytes[skip] & 0xff) << 16 |
            ((int) bytes[skip + 1] & 0xff) << 8 |
            (lowerBitsToRemove < 8
                    ? ((int) bytes[skip + 2] & 0xff)
                    : 0)) >> lowerBitsToRemove & (1 << 11) - 1;
}

It’s ugly, but it’s in a well-named method and well covered by tests, so I can safely refactor it some more later if I feel it needs it.
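
To convince myself it works, here are a couple of hand-checked examples (my own, not from the spec):

next11Bits(new byte[]{(byte) 0x80, 0}, 0);
//= 1024, only the very first bit is set, so the window reads 0b10000000000

next11Bits(new byte[]{0x07, (byte) 0xFF}, 5);
//= 2047, bits 5..15 are all set, so the window reads 0b11111111111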

So, we now have entropyWithChecksum and a way of reading it 11 bits at a time; all that’s left is to look each index up in a wordlist and build up our mnemonic.

final StringBuilder sb = new StringBuilder();

for (int i = 0; i < ent + cs; i += 11) {
    final int wordIndex = next11Bits(entropyWithChecksum, i);
    sb.append(wordList.getWord(wordIndex));
    sb.append(" ");
}

//drop the last space
sb.setLength(sb.length() - 1);

return sb.toString();

Different languages use different space characters, so we also need to pass that in. Let’s introduce an interface for WordList:

public interface WordList {
    String getWord(final int index);
    char getSpace();
}

Altogether:

public static String createMnemonic(
        final byte[] entropy,
        final WordList wordList) throws NoSuchAlgorithmException {
    //prechecks
    final int ent = entropy.length * 8;
    if (ent < 128)
        throw new RuntimeException("Entropy too low, 128-256 bits allowed");
    if (ent > 256)
        throw new RuntimeException("Entropy too high, 128-256 bits allowed");
    if (ent % 32 > 0)
        throw new RuntimeException("Number of entropy bits must be divisible by 32");

    //checksum length (bits)
    final int cs = ent / 32;
    //mnemonic length (words)
    final int ms = (ent + cs) / 11;

    //make a copy with space for one more
    final byte[] entropyWithChecksum = Arrays.copyOf(entropy, entropy.length + 1);
    //hash the original
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] hash = digest.digest(entropy);
    //append the first byte of the hash to the copy
    entropyWithChecksum[entropy.length] = hash[0];

    final StringBuilder sb = new StringBuilder();

    for (int i = 0; i < ent + cs; i += 11) {
        final int wordIndex = next11Bits(entropyWithChecksum, i);
        sb.append(wordList.getWord(wordIndex));
        sb.append(wordList.getSpace());
    }
    //drop the last space
    sb.setLength(sb.length() - 1);

    return sb.toString();
}

This works for all English and Japanese test vectors linked to from the BIP39 document.
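
To see it in use, here’s a sketch that creates a fresh 12-word mnemonic from 128 bits of entropy (ENGLISH stands in for whichever WordList implementation you have):

//createMnemonic throws NoSuchAlgorithmException, handle or declare it
final byte[] entropy = new byte[128 / 8];
new java.security.SecureRandom().nextBytes(entropy);
final String mnemonic = createMnemonic(entropy, ENGLISH);
//12 space-separated words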

getSeed(mnemonic, passphrase) -> seed

Now we move on to deriving the seed from the mnemonic and passphrase. You might be surprised, given the title of this section, not to see a word list as an input to this function. That is because we are not going backwards to find the entropy; instead we are hashing the mnemonic, with an optional passphrase as salt.

To create a binary seed from the mnemonic, we use the PBKDF2 function with a mnemonic sentence (in UTF-8 NFKD) used as the password and the string “mnemonic” + passphrase (again in UTF-8 NFKD) used as the salt. The iteration count is set to 2048 and HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512 bits (= 64 bytes).

That’s all there is to it this time.

The first step is to normalize both strings.

mnemonic = Normalizer.normalize(mnemonic, Normalizer.Form.NFKD);
passphrase = Normalizer.normalize(passphrase, Normalizer.Form.NFKD);

NFKD is Compatibility Decomposition; roughly speaking, this means we simplify glyphs, with the goal of getting down to a canonical form rather than preserving the exact characters. For example, the half-width Japanese ka ｶ maps to the regular full-width カ.
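
You can check this with Java’s Normalizer directly (a one-line example of my own):

//the half-width ｶ (U+FF76) decomposes to the full-width カ (U+30AB)
Normalizer.normalize("ｶ", Normalizer.Form.NFKD).equals("カ"); //true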

Now we need to get the UTF-8 bytes for the salt, which is prefixed with the string "mnemonic".

byte[] salt = ("mnemonic" + passphrase).getBytes("UTF-8");

For the mnemonic itself, I’m going to keep Java’s native UTF-16 chars.

char[] chars = mnemonic.toCharArray();

Now we need to put these through a PBKDF2 function with the HMAC-SHA512 hashing algorithm, 2048 iterations, and a key length of 512 bits.

PBEKeySpec spec = new PBEKeySpec(chars, salt, 2048, 512);
SecretKeyFactory secretKeyFactory =
        SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512");
byte[] seed = secretKeyFactory.generateSecret(spec).getEncoded();

Altogether, it’s a very short function.

public static byte[] getSeed(String mnemonic, String passphrase) throws Exception {
    mnemonic = Normalizer.normalize(mnemonic, Normalizer.Form.NFKD);
    passphrase = Normalizer.normalize(passphrase, Normalizer.Form.NFKD);

    final char[] chars = mnemonic.toCharArray();
    final byte[] salt = ("mnemonic" + passphrase).getBytes("UTF-8");
    final PBEKeySpec spec = new PBEKeySpec(chars, salt, 2048, 512);
    final SecretKeyFactory secretKeyFactory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512");
    return secretKeyFactory.generateSecret(spec).getEncoded();
}

This now works with all English and Japanese test vectors linked to from the BIP39 document.
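
As a sanity check, feeding in the example from the top of the article reproduces the expected seed (the hex-encoding loop is mine, purely for display):

final byte[] seed = getSeed(
        "talk smoke guess belt become ritual powder lyrics annual tomorrow relief witness",
        "m3d1um");
final StringBuilder hex = new StringBuilder();
for (final byte b : seed) hex.append(String.format("%02x", b));
//hex.toString() is the da8fefd7... seed shown earlier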

We left the mnemonic in Java’s native UTF-16 format, but the PBKDF2 function must be treating it as UTF-8 internally, or it simply would not work. If I change just the salt to UTF-16, even the English vectors fail.

verifyMnemonic(mnemonic, wordlist) -> validity

Although using a mnemonic not generated by the algorithm described in “Generating the mnemonic” section is possible, this is not advised and software must compute a checksum for the mnemonic sentence using a wordlist and issue a warning if it is invalid.

So as a last step, to make this library complete, I’m going to provide a function that will check a mnemonic is valid against a word list.

This means mapping the words back to entropyWithChecksum, stripping the last byte, recalculating the checksum, and comparing between 4 and 8 bits of the last byte, depending on the length.

private boolean validate(String mnemonic, WordList wordList) {
    //build a map to look up word indexes
    Map<String, Integer> map = new HashMap<>(1 << 11);
    for (int i = 0; i < 1 << 11; i++) {
        map.put(wordList.getWord(i), i);
    }

    //split the mnemonic
    String[] words = mnemonic.split(String.valueOf(wordList.getSpace()));

    //reverse calculate some of the variables from mnemonic generation: ms, ent, cs
    final int ms = words.length;

    final int entPlusCs = ms * 11;
    final int ent = (entPlusCs * 32) / 33;
    final int cs = ent / 32;
    if (entPlusCs != ent + cs)
        throw new RuntimeException("Not a correct number of words");
    byte[] entropyWithChecksum = new byte[(entPlusCs + 7) / 8];

    //look up the words
    int[] wordIndexes = new int[ms];
    for (int i = 0; i < ms; i++) {
        String word = words[i];
        Integer index = map.get(word);
        if (index == null) throw new RuntimeException("Word not found in word list \"" + word + "\"");
        wordIndexes[i] = index;
    }

    //build
    for (int i = 0, bi = 0; i < ms; i++, bi += 11) {
        writeNext11(entropyWithChecksum, wordIndexes[i], bi);
    }

    //strip the last byte
    byte[] entropy = Arrays.copyOf(entropyWithChecksum, entropyWithChecksum.length - 1);
    byte lastByte = entropyWithChecksum[entropyWithChecksum.length - 1];

    //recalculate hash
    byte sha = firstByteOfSha256(entropy);

    //we only want to compare the first cs bits
    byte mask = (byte) ~((1 << (8 - cs)) - 1);

    //if the first cs bits are the same, it's valid
    return ((sha ^ lastByte) & mask) == 0;
}

private void writeNext11(byte[] bytes, int value, int offset) {
    int skip = offset / 8;
    int bitSkip = offset % 8;
    {//byte 0
        byte firstValue = bytes[skip];
        byte toWrite = (byte) (value >> (3 + bitSkip));
        bytes[skip] = (byte) (firstValue | toWrite);
    }

    {//byte 1
        byte valueInByte = bytes[skip + 1];
        final int i = 5 - bitSkip;
        byte toWrite = (byte) (i > 0 ? (value << i) : (value >> -i));
        bytes[skip + 1] = (byte) (valueInByte | toWrite);
    }

    if (bitSkip >= 6) {//byte 2
        byte valueInByte = bytes[skip + 2];
        byte toWrite = (byte) (value << 13 - bitSkip);
        bytes[skip + 2] = (byte) (valueInByte | toWrite);
    }
}
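
One helper remains that I haven’t shown: firstByteOfSha256. A minimal version that matches the call in validate would be:

private static byte firstByteOfSha256(final byte[] bytes) {
    try {
        //same SHA-256 hash as in createMnemonic, but we only need the first byte
        return MessageDigest.getInstance("SHA-256").digest(bytes)[0];
    } catch (final NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    }
}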

This is all available under GPL-3.0 in the repository here: https://github.com/NovaCrypto/BIP39

It’s all highly tested and souped up for memory security.
