Mbed-os: ODIN_EVK_W2_UBLOX fails to reconnect when Nanostack is used

Created on 17 Sep 2019  ·  18 Comments  ·  Source: ARMmbed/mbed-os

Description

When compiled with Nanostack set up in mbed_app.json:

   "target_overrides": {
        "*": {
            ...
            "nsapi.default-stack": "NANOSTACK"

(and all other configs set up for correct WiFi operation)
this minimal example does not work as expected:

#include "mbed.h"

int main (void) {
    NetworkInterface *net = NetworkInterface::get_default_instance();

    nsapi_error_t result = net->connect();
    printf("connect result: %d\n\r", result);

    result = net->disconnect();
    printf("disconnect result: %d\n\r", result);

    result = net->connect();
    printf("connect result: %d\n\r", result);
}

The output is:

connect result: 0
disconnect result: 0
connect result: -3004

The second connect should succeed and return 0 (NSAPI_ERROR_OK), but it returns -3004 (NSAPI_ERROR_NO_CONNECTION) instead.
It works fine if I compile with LWIP, regardless of whether IPv6 or IPv4 is enabled.
I checked that it doesn't matter whether the network is secured or not: the second connect always fails unless a whole new NetworkInterface is created (for example in a test suite teardown).

I debugged the interface all the way down to OdinWiFiInterface::wlan_connect and the parameters seem to be set up correctly. I cannot go any deeper than that because the code is provided as a library.

@ARMmbed/team-ublox , could you please take a look at this issue?

Issue request type


[ ] Question
[ ] Enhancement
[x] Bug

CLOSED ublox bug

All 18 comments

@michalpasztamobica internal jira: https://jira.u-blox.net/browse/TE_OM2-310

It's not a problem with OdinWiFiInterface but rather with Nanostack. A quick analysis by a teammate reveals that it is the wait on its connect_semaphore that is expiring, not wifi_connect().

if (blocking) {

    bool acquired = connect_semaphore.try_acquire_for(30000);

    if (!acquired) {
        return NSAPI_ERROR_DHCP_FAILURE; // sort of...
    }
}

Hi @aqib-ublox . Thanks a lot for a quick analysis.
Looking at the code, the Nanostack::EthernetInterface::bringup() gets called? Perhaps that's the issue - we want WiFi to connect, not the Ethernet?

If I run with Ethernet connection:

 "target_overrides": {
        "*": {
            "target.network-default-interface-type": "ETHERNET",
            "nsapi.default-stack": "NANOSTACK"

I can reconnect without a problem. Only WiFi is failing...

> Hi @aqib-ublox . Thanks a lot for a quick analysis.
> Looking at the code, the Nanostack::EthernetInterface::bringup() gets called? Perhaps that's the issue - we want WiFi to connect, not the Ethernet?

Have you investigated this any further?

@aqib-ublox , I have not, sorry, I will try to get down to this next week.

Hi @michalpasztamobica

I am sharing our findings so far with you.

The program shows different results for us. As you stated, on your side it fails to connect only the second time, but we are not able to make it connect even the first time, over either WiFi or Ethernet.

Upon debugging, we found that the failure is caused by an unsuccessful DHCP bringup: connect_semaphore cannot be acquired inside Nanostack::EthernetInterface::bringup(). The function responsible for releasing this semaphore is Nanostack::Interface::network_handler(), which keeps waiting for the status to change from MESH_BOOTSTRAP_STARTED to MESH_CONNECTED, but it never changes. Further investigation showed that the IPv6 negotiation remains in the IPV6_ROUTER_SOLICATION state inside protocol_ipv6.c until it runs out of tries. It is expected to receive the IP address from an ICMPv6 Router Advertisement and set the global_address_available flag to true, but in this case the flag always remains false. If global_address_available were set to true, MESH_BOOTSTRAP_STARTED would change to MESH_CONNECTED, releasing the connect_semaphore.

On the other hand, the router (AP) shows the device as connected, because the AP advertised the address and it was received and processed by the device.

We ran the same example on a different target, NUCLEO_F429ZI, with Ethernet as the interface type, and the behavior is the same as above.

Attached is a screenshot of the call stack for this debugging session, for your reference.

[screenshot: call-stack]

I believe we are not on the same page here; maybe we have different configurations set up. Can you please check whether the following configurations are correct, or share yours?

/mbed_app.json:

...
    "target_overrides": {
        "*": {
            "target.network-default-interface-type": "WIFI",
            "platform.stdio-convert-newlines": true, 
            "ble_button_pin_name": "BUTTON1",
            "nsapi.default-stack": "NANOSTACK"
        }
    }
....

/mbed-os/features/netsocket/mbed_lib.json:

{
    "name": "nsapi",
    "config": {
        "present": 1,
        "default-stack": {
            "help" : "Default stack to be used, valid values: LWIP, NANOSTACK.",
            "value" : "LWIP"
        },
        "default-wifi-ssid" : {
            "help" : "Default Wi-Fi SSID.",
            "value": "\"<ssid>\""
        },
        "default-wifi-password" : {
            "help" : "Password for the default Wi-Fi network.",
            "value": "\"<pass>\""
        },
        "default-wifi-security" : {
            "help" : "Wi-Fi security protocol, valid values are WEP, WPA, WPA2, WPA/WPA2.",
            "value" : "WPA2"
        },
        "default-cellular-plmn" : {
            "help" : "Default Public Land Mobile Network for cellular connection.",
            "value": null
        },
        "default-cellular-sim-pin" : {
            "help" : "PIN for the default SIM card.",
            "value": null
        },
        "default-cellular-apn" : {
            "help" : "Default cellular Access Point Name.",
            "value": null
        },
        "default-cellular-username" : {
            "help" : "Username for the default cellular network.",
            "value": null
        },
        "default-cellular-password" : {
            "help" : "Password for the default cellular network.",
            "value": null
        },
        "default-mesh-type": {
            "help": "Configuration type for MeshInterface::get_default_instance(). [LOWPAN/THREAD/WISUN]",
            "value": "THREAD"
        },
        "dns-response-wait-time": {
            "help": "How long the DNS translator waits for a reply from a server in milliseconds",
            "value": 10000
        },
        "dns-total-attempts": {
            "help": "Number of total DNS query attempts that the DNS translator makes",
            "value": 10
        },
        "dns-retries": {
            "help": "Number of DNS query retries that the DNS translator makes per server, before moving on to the next server. Total retries/attempts is always limited by dns-total-attempts.",
            "value": 1
        },
        "dns-cache-size": {
            "help": "Number of cached host name resolutions",
            "value": 3
        },
        "socket-stats-enabled": {
            "help": "Enable network socket statistics",
            "value": false
        },
        "socket-stats-max-count": {
            "help": "Maximum number of socket statistics cached",
            "value": 10
        }
    },
    "target_overrides": {
        "KW24D": {
            "nsapi.default-mesh-type": "LOWPAN"
        },
        "NCS36510": {
            "nsapi.default-mesh-type": "LOWPAN"
        },
        "TB_SENSE_12": {
            "nsapi.default-mesh-type": "LOWPAN"
        }
    }
}

Hi @hamza-ubx . Thanks a lot for this insightful comment!
I have the same setup as you do (mbed_lib.json is untouched).
My mbed_app.json only differs slightly:

    "target_overrides": {
        "*": {
            "target.network-default-interface-type": "WIFI",
            "nsapi.default-wifi-ssid": "MBED_CONF_APP_WIFI_SECURE_SSID",
            "nsapi.default-wifi-password": "MBED_CONF_APP_WIFI_PASSWORD",
            "nsapi.default-wifi-security": "WPA2",
            "nsapi.default-stack": "NANOSTACK"
        },

On my end I managed to reconnect multiple times if I enabled a debug message in OdinWiFiInterface::handle_wlan_status_connected.
It may be some kind of race condition; perhaps we need to wait for the disconnect to complete successfully before we reconnect. Investigation is ongoing and I'll also consider your remarks to find more clues.

It seems that with the debug print in place (which delays the handler execution), on reconnection I am getting the following status callbacks:
MESH_CONNECTED_GLOBAL
MESH_BOOTSTRAP_STARTED

But if I remove the printf, the following statuses are reported:
MESH_BOOTSTRAP_STARTED
MESH_BOOTSTRAP_START_FAILED
MESH_DISCONNECTED

MESH_CONNECTED_GLOBAL arrives when _connect_status is NSAPI_STATUS_DISCONNECTED, so it will not trigger the semaphore release. Instead it will change _connect_status to NSAPI_STATUS_CONNECTING.
MESH_BOOTSTRAP_STARTED arrives when we're already in NSAPI_CONNECTED_GLOBAL, so again the semaphore is not released.
I think this algorithm is the root cause, although I must say the order of events in this case looks a bit suspicious...
@hamza-ubx , if you see anything wrong with those events, let me know. I will study this further, too. I am wondering whether BOOTSTRAP should perhaps come before CONNECTED?
If not, I think the best solution would be to release the semaphore on MESH_CONNECTED_GLOBAL, regardless of whether we are connecting or not.
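
To make the proposed change concrete, here is a minimal standalone sketch of the release rule as it is described in this comment. It is not the mbed-os source: the enum and function names are hypothetical stand-ins for the MESH_* events and NSAPI_STATUS_* values quoted above, and the model looks only at MESH_CONNECTED_GLOBAL.

```cpp
// Hypothetical stand-ins for the MESH_* events and NSAPI_STATUS_* values
// mentioned in this thread (not the real mbed-os declarations).
enum class Event { BootstrapStarted, BootstrapStartFailed, Disconnected, ConnectedGlobal };
enum class Status { Disconnected, Connecting, GlobalUp };

// Rule as described above: a MESH_CONNECTED_GLOBAL event releases the
// connect semaphore only while a connect attempt is in progress.
constexpr bool releases_current(Event event, Status current)
{
    return event == Event::ConnectedGlobal && current == Status::Connecting;
}

// Proposed change: MESH_CONNECTED_GLOBAL releases the semaphore
// regardless of the current connection status.
constexpr bool releases_proposed(Event event, Status /*current*/)
{
    return event == Event::ConnectedGlobal;
}
```

Under the first rule, a MESH_CONNECTED_GLOBAL that arrives while the status is still DISCONNECTED is ignored, which is the failing case reported here; under the second rule it would release the semaphore. Whether an unconditional release has side effects elsewhere is not covered by this sketch.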

@michalpasztamobica I think you are right about the handler execution being delayed by the debug statement. There needs to be some transitional state to handle the race condition between disconnect and connect.

We, however, have not received the MESH_CONNECTED_GLOBAL callback. The status stays at MESH_BOOTSTRAP_STARTED until DHCP fails. I will investigate this further and let you know.

Judging by the names of the states and the way the algorithm works, it also seems to me that MESH_BOOTSTRAP_STARTED should come before MESH_CONNECTED_GLOBAL. However, the order of the values in the mesh_connection_status_t enum makes it look as if MESH_BOOTSTRAP_STARTED is something that happens after MESH_CONNECTED_GLOBAL.

typedef enum {
    MESH_CONNECTED = 0,             /*<! connected to network */
    MESH_CONNECTED_LOCAL,           /*<! connected to network, got local IP */
    MESH_CONNECTED_GLOBAL,          /*<! connected to network, got global IP */
    MESH_DISCONNECTED,              /*<! disconnected from network */
    MESH_BOOTSTRAP_START_FAILED,    /*<! error during bootstrap start */
    MESH_BOOTSTRAP_FAILED,          /*<! error in bootstrap */
    MESH_BOOTSTRAP_STARTED          /*<! bootstrap started */
} mesh_connection_status_t;

But again, the way the algorithm works shows that MESH_BOOTSTRAP_STARTED is just a transitional state and does nothing else.

I amended my previous comment, as I figured out that I mixed up the names of events slightly...
Currently this is what I am getting in my sample app:

incoming status event value   current _connect_status value
MESH_BOOTSTRAP_STARTED        NSAPI_STATUS_DISCONNECTED
MESH_CONNECTED_GLOBAL         NSAPI_STATUS_CONNECTING
connect result: 0
MESH_BOOTSTRAP_STARTED        NSAPI_STATUS_GLOBAL_UP
MESH_DISCONNECTED             NSAPI_STATUS_CONNECTING
disconnect result: 0
MESH_BOOTSTRAP_STARTED        NSAPI_STATUS_DISCONNECTED
MESH_BOOTSTRAP_START_FAILED   NSAPI_STATUS_CONNECTING
MESH_DISCONNECTED             NSAPI_STATUS_DISCONNECTED
MESH_CONNECTED_GLOBAL         NSAPI_STATUS_DISCONNECTED
MESH_BOOTSTRAP_STARTED        NSAPI_STATUS_GLOBAL_UP
connect result: -3004

What worries me is that, after a failed bootstrap, MESH_CONNECTED_GLOBAL and MESH_BOOTSTRAP_STARTED arrive in reversed order.
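
The table can be replayed with a small standalone model. This is only a sketch under assumptions: the transition function is inferred from the two columns of the table (it is not taken from the mbed-os source), and all type and function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the MESH_* events and NSAPI_STATUS_* values in the table.
enum class Event { BootstrapStarted, BootstrapStartFailed, Disconnected, ConnectedGlobal };
enum class Status { Disconnected, Connecting, GlobalUp };

// Transition inferred from the table: the status shown on each row equals
// next_status() applied to the event of the previous row.
constexpr Status next_status(Event event)
{
    return event == Event::BootstrapStarted ? Status::Connecting
         : event == Event::ConnectedGlobal  ? Status::GlobalUp
                                            : Status::Disconnected;
}

// Replays a sequence of events and counts the MESH_CONNECTED_GLOBAL events
// that arrive while the status is CONNECTING.
template <std::size_t N>
constexpr int connects_seen_while_connecting(const Event (&events)[N], Status start)
{
    Status status = start;
    int count = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (events[i] == Event::ConnectedGlobal && status == Status::Connecting) {
            ++count;
        }
        status = next_status(events[i]);
    }
    return count;
}

// Event order from the table for the first connect and for the reconnect.
constexpr Event first_connect[] = { Event::BootstrapStarted, Event::ConnectedGlobal };
constexpr Event reconnect[] = { Event::BootstrapStarted, Event::BootstrapStartFailed,
                                Event::Disconnected, Event::ConnectedGlobal,
                                Event::BootstrapStarted };
```

Replaying first_connect from DISCONNECTED gives one MESH_CONNECTED_GLOBAL seen while CONNECTING, whereas replaying reconnect gives none, because the only MESH_CONNECTED_GLOBAL in that sequence arrives after the failed bootstrap has already put the status back to DISCONNECTED.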

I couldn't find any proper documentation on the state machine operation. This is the best I could find. @mikter , can you point us to any document explaining which state should come after which?

@mikter , @SeppoTakalo , @AnttiKauppila , if you could confirm that our understanding of the expected order of events is correct, or point me to relevant documentation, that would be great.
@hamza-ubx , did you have time to look into the driver's code and check why events come in different order on reconnection?

@hamza-ubx Do you have IPv6 network available?

What you reported sounds like there is no IPv6 router, so the semaphore will never be released, as Nanostack will never get an IPv6 address.

Please note that Nanostack supports only IPv6, as it is meant for mesh networks. Ethernet, however, is used as a backbone network in border routers.

@SeppoTakalo , I can reproduce the issue on our RAAS configuration. What is more, the first connect passes fine. I can also see that all other tests (including tcp/udp/tls) pass fine with Nanostack.
Looking at the printouts, I think the issue is just the reversed order of events coming from the driver (MESH_CONNECTED_GLOBAL comes before MESH_BOOTSTRAP_STARTED).

We need to clarify whether this ordering is acceptable, in which case the mesh interface needs to be amended, or not acceptable, in which case the driver's behavior needs adjustment.

@SeppoTakalo Yes, we have already figured out that IPv6 was disabled on our local network for some reason. We are setting up the network with IPv6 support and will continue the testing.

@michalpasztamobica Hopefully, once we test with an IPv6-enabled network, we will be able to reproduce the sequence of events you are receiving. We will further investigate the driver's behavior and keep you posted.

Hi @michalpasztamobica

Can you kindly test with this private branch and see if you are still getting the states in reversed order: ublox_odin_driver_os_5_v3.7.1_rc3

We have fixed some issues with the driver state machine, but we are not certain whether this issue has been fixed as well. While we wait for the IPv6 network to be set up, kindly let us know about your test results.

Hi @hamza-ubx , I checked out the branch which you posted and ran my sample application, but unfortunately on reconnection I got the same result as before - MESH_BOOTSTRAP_STARTED -> MESH_BOOTSTRAP_START_FAILED -> MESH_DISCONNECTED -> MESH_CONNECTED_GLOBAL and then MESH_BOOTSTRAP_STARTED.
It's okay to fail the bootstrap and go into disconnected, but CONNECTED_GLOBAL coming before BOOTSTRAP_STARTED confuses the mesh-api state machine :(

@0xc0170 , @MarceloSalazar , do you think this task can be closed as irrelevant, now that Ublox has been dropped from mbed-os-6?
