Andrew Timberlake

Hi, I’m Andrew, a programmer and entrepreneur from South Africa, founder of Sitesure for monitoring websites, APIs, and background jobs.
Thanks for visiting and reading.


Why code_change wouldn’t work on my GenServer

I had a GenServer that I wanted to change the state of during a hot upgrade release, so I dutifully reached for code_change/3 as per the documentation, but no matter how hard I tried, I couldn’t get it to work.
I read and re-read all the documentation I could find on releases and hot upgrades and tried and tried again but my callback was never called.

I quite like Dave Thomas’ method of splitting the API from the server implementation so my code looked something like this:

defmodule MyStore do
  def child_spec(opts) do
    %{
      id: MyStore.Server,
      start: {MyStore, :start_link, [opts]},
      type: :worker,
      restart: :permanent,
      shutdown: 500
    }
  end

  def start_link(args \\ nil, opts \\ []) do
    GenServer.start_link(MyStore.Server, args, opts)
  end

  def put(pid, key, value) do
    GenServer.call(pid, {:put, key, value})
  end

  def get(pid, key) do
    GenServer.call(pid, {:get, key})
  end

  defmodule Server do
    use GenServer
    require Logger

    @impl true
    def init(_opts) do
      {:ok, []}
    end

    @impl true
    def handle_call({:put, key, value}, _from, server_state) do
      server_state = [{key, value} | server_state]
      {:reply, :ok, server_state}
    end

    def handle_call({:get, key}, _from, server_state) do
      {:reply, Keyword.get(server_state, key), server_state}
    end

    @vsn "1"
    @impl true
    def code_change(from_vsn, server_state, _extra) do
      Logger.info("code_change from: #{inspect(from_vsn)}")
      {:ok, server_state}
    end
  end
end

A very simple and contrived example of a store running on a GenServer, with the obvious flaw that the state is a keyword list rather than the more natural map. So the idea is to change the state from a keyword list to a map via a hot upgrade.
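The state change itself can be tried in isolation. The snippet below is a minimal sketch with made-up keys and values (nothing here comes from a real release). It assumes the version "1" behaviour shown above, where each put prepends to the list, and it points out one subtlety: Map.new/1 lets later entries win, so the list is reversed first to keep the most recent value for a repeated key.

```elixir
# State as version "1" would build it: each put prepends, so the newest
# value for a key sits nearest the head of the list. (Illustrative data.)
old_state = [{:colour, "blue"}, {:size, 10}, {:colour, "red"}]

# Keyword.get/2 returns the first match, i.e. the most recent put.
"blue" = Keyword.get(old_state, :colour)

# Map.new/1 lets later entries overwrite earlier ones, so converting the
# list as-is would bring back the oldest value for a repeated key.
%{colour: "red", size: 10} = Map.new(old_state)

# Reversing first keeps the value the old server would have returned.
new_state = old_state |> Enum.reverse() |> Map.new()
IO.inspect(new_state)
```

For a store where every key is only ever put once, the reversal makes no difference.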

Adding the following code_change/3 clause ahead of the original implementation, along with updating the server callbacks to use a map, should do the trick.

  defmodule Server do
    use GenServer
    require Logger

    @impl true
    def init(_opts) do
      {:ok, %{}}
    end

    @impl true
    def handle_call({:put, key, value}, _from, server_state) do
      server_state = Map.put(server_state, key, value)
      {:reply, :ok, server_state}
    end

    def handle_call({:get, key}, _from, server_state) do
      {:reply, Map.get(server_state, key), server_state}
    end

    @vsn "2"
    @impl true
    # Ignoring downgrading for this example
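    def code_change("1", server_state, _extra) do
      Logger.info("code_change from: #{inspect(server_state)}")
      # Version "1" prepended on put, so the newest value for a key comes first.
      # Reverse so that Map.new/1, where later entries win, keeps the newest.
      {:ok, server_state |> Enum.reverse() |> Map.new()}
    end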

    def code_change(from_vsn, server_state, _extra) do
      Logger.info("code_change from: #{inspect(from_vsn)}")
      {:ok, server_state}
    end
  end

All good. So have you found out what’s wrong yet? Neither had I.
So far as I can tell, there is nothing wrong with my code. The problem isn’t even visible here; it only becomes apparent when you look at the supervisor and how Erlang finds the processes it’s going to run code_change/3 against.
During an application upgrade, the release handler works through the supervision tree and suspends the processes that need updating. It then runs code_change/3 on the module for each process, resumes the processes, and finalises the release.
The appup file for the example above would look something like this:

{"2",
 [{"1", [{update, 'Elixir.MyStore.Server', {advanced, []}}]}],
 [{"1", [{update, 'Elixir.MyStore.Server', {advanced, []}}]}]
}.

That looks fine. We want the upgrade to run MyStore.Server.code_change/3.
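For a single process, the suspend, change, resume sequence can be imitated by hand with the :sys module. This is a rough sketch, not the release handler itself: Demo.Server is a hypothetical module, the "1" version string is passed in manually, and no new code is loaded in between as it would be in a real upgrade.

```elixir
# Hypothetical server whose state starts out as a keyword list.
defmodule Demo.Server do
  use GenServer

  @impl true
  def init(_args), do: {:ok, [{:a, 1}]}

  @impl true
  def handle_call(:state, _from, state), do: {:reply, state, state}

  @impl true
  def code_change("1", state, _extra) when is_list(state), do: {:ok, Map.new(state)}
  def code_change(_vsn, state, _extra), do: {:ok, state}
end

{:ok, pid} = GenServer.start_link(Demo.Server, nil)
before_change = GenServer.call(pid, :state)

# Suspend the process, ask it to run code_change/3, then resume it.
:ok = :sys.suspend(pid)
:ok = :sys.change_code(pid, Demo.Server, "1", [])
:ok = :sys.resume(pid)

after_change = GenServer.call(pid, :state)
IO.inspect({before_change, after_change})
```

The important detail for what follows is that something has to decide which processes get this treatment for a given module.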

When the store is started under a dynamic supervisor, the response from which_children/1 is:

[{:undefined, #PID<0.161.0>, :worker, [MyStore]}]

This is the same result that Erlang gets when it retrieves all supervised processes in get_supervised_procs/0, which is “…the magic function. It finds all process in the system and which modules they execute as a call_back or process module.”
{:undefined, #PID<0.161.0>, :worker, [MyStore]} is included in the results of :release_handler_1.get_supervised_procs() (which I was super happy to find was an exported function, thank you Erlang), and there we have the problem: Erlang thinks that MyStore, not MyStore.Server, is the module being executed as the call_back or process module.
Because MyStore is not listed as changing in the appup file, code_change/3 is never called on it. And because MyStore.Server isn’t listed as the module of any running process, code_change/3 isn’t called on that module either. The process is left with its state unchanged, and the next call to it finds state in the wrong shape and crashes 💣.
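The module list reported by the supervisor can be reproduced outside a release. The script below is a minimal sketch using hypothetical Demo.Store and Demo.Store.Server modules that mirror the API/server split above. It relies on the supervisor defaulting the modules list to the module named in the :start tuple when the child spec does not say otherwise.

```elixir
# Hypothetical stand-ins for MyStore and MyStore.Server, trimmed to the
# parts that affect how the supervisor labels the child.
defmodule Demo.Store do
  def child_spec(opts) do
    %{id: Demo.Store.Server, start: {Demo.Store, :start_link, [opts]}}
  end

  # The API module starts the process, but the callbacks live in Server.
  def start_link(args), do: GenServer.start_link(Demo.Store.Server, args)

  defmodule Server do
    use GenServer

    @impl true
    def init(_args), do: {:ok, []}
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, _} = DynamicSupervisor.start_child(sup, {Demo.Store, []})

# The fourth element is the modules list for the child.
[{:undefined, _, :worker, modules}] = DynamicSupervisor.which_children(sup)
IO.inspect(modules)
```

The reported module is the one that appears in the :start tuple, not the one passed to GenServer.start_link/3.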

After a lot of code spelunking I had identified the problem, and the solution is quite a simple change: move start_link/2 into MyStore.Server and update the child_spec accordingly.

defmodule MyStore do
  def child_spec(opts) do
    %{
      id: MyStore.Server,
      start: {MyStore.Server, :start_link, [opts]},
      type: :worker,
      restart: :permanent,
      shutdown: 500
    }
  end

  #...

  defmodule Server do
    use GenServer
    require Logger

    def start_link(args \\ nil, opts \\ []) do
      GenServer.start_link(__MODULE__, args, opts)
    end

    #...
  end
end

Now the output of :release_handler_1.get_supervised_procs() looks like this:

[#...
{:undefined, #PID<0.161.0>, :worker, [MyStore.Server]}]

and code_change/3 is correctly called 🎉.
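A related option, offered as an untested assumption rather than as part of the fix above: Elixir child specs accept an optional :modules key, and that list is what the supervisor reports for the child. Setting it explicitly should let start_link stay in the API module. Demo.TaggedStore is a hypothetical name, and the sketch only checks what the supervisor reports, not a full hot upgrade.

```elixir
defmodule Demo.TaggedStore do
  def child_spec(opts) do
    %{
      id: Demo.TaggedStore.Server,
      start: {Demo.TaggedStore, :start_link, [opts]},
      # Name the callback module explicitly instead of relying on the
      # default, which is the module in the :start tuple.
      modules: [Demo.TaggedStore.Server]
    }
  end

  def start_link(args), do: GenServer.start_link(Demo.TaggedStore.Server, args)

  defmodule Server do
    use GenServer

    @impl true
    def init(_args), do: {:ok, %{}}
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, _} = DynamicSupervisor.start_child(sup, {Demo.TaggedStore, []})

[{:undefined, _, :worker, modules}] = DynamicSupervisor.which_children(sup)
IO.inspect(modules)
```

Moving start_link into the server module, as done above, has the advantage that the default is already correct and there is nothing extra to keep in sync.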

I always appreciate gaining a deeper understanding of how the underlying toolset of a system works and I hope that when you are searching for “why code_change isn’t called on my GenServer” you’ll get this helpful result ;-)

3 Jul 2023