Modelling of hydrological processes at basin-to-continental scales can be classified into two general approaches: land surface models (LSMs) used in numerical weather prediction and climate simulation, and rainfall-runoff models (RRMs) used for water resources management and flood forecasting. Whether the models use complex physically-based equations or relatively simple conceptual parameterisations, there is inevitably some level of uncertainty in model predictions. Improving the performance of such models has been hindered by a lack of suitable spatially extensive observations. These are required both to facilitate model development and to quantify model uncertainty and the relative performance of different models and modelling approaches. A breakthrough may lie in satellite remote sensing, which provides spatially and temporally extensive observations that can complement the limited data available from stream gauging and point-scale monitoring sites. Recently a fundamentally new type of satellite observation has become available for hydrological analyses - observation of the time-varying component of the earth's gravity field from GRACE (the Gravity Recovery and Climate Experiment). Early studies of GRACE have used hydrological model output to evaluate these new observations. This paper presents the first detailed study aimed instead at exploring the potential for GRACE to provide a valuable new assessment of large-scale hydrological models. Centred on the Murray-Darling Basin (MDB) in Australia, four different model simulations of water storage changes for 2002-2003 are compared with the newly available observations from GRACE. Launched in March 2002, GRACE precisely measures changes in the earth's gravity field arising from the redistribution of mass that occurs throughout a given region over a time span of approximately one month.
Following the a priori removal of tidal, atmospheric and oceanic effects, the observed changes over land are mainly attributed to variability in the total terrestrial water storage - the integration of soil moisture, ground water, surface water and snow/ice. The unique analysis of the MDB presented in this paper uses the first 15 gravity field solutions derived from GRACE over the period of April 2002 to December 2003. Results indicate that greater model complexity does not necessarily lead to better performance. All of the models assessed in this study often over- or under-predict the monthly mean basin-wide change in water storage relative to GRACE observations. On an annual basis the models all tend to under-predict the amplitude of water storage evolution, with the LSM VB95 and the RRM SIMHYD (using default parameters) exhibiting the most damped responses. In particular, the Bureau of Meteorology's coupled VB95 model seems to be excessively damped as a result of the implementation of a screen-level nudging scheme for soil moisture. These results provide valuable insight into potential sources of error in the land surface component of this operational weather prediction model, as well as into the feasibility of a recently proposed approach to forecasting soil moisture deficit over Australia. Despite encouraging results in diagnosing more significant model shortcomings, we conclude that a rigorous assessment of most conventional large-scale models is difficult at present. This is due to the present lack of an independent validation of GRACE observation error and the fact that models often fail to account for all processes affecting the total water storage (e.g. LSMs often neglect ground water storage and the effects of streamflow alteration on storage). SIMHYD does include ground water storage, and in a second version of this model we implicitly account for the impact of runoff alteration on storage by way of streamflow calibration.
Comparison with measurements from the Murrumbidgee Catchment suggests the model is reasonable, yet it still under-predicts the total storage variability observed by GRACE over the MDB. These results highlight the potential of GRACE to assess model performance and different approaches to model development.
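
The kind of comparison described above can be illustrated with a minimal sketch. It is not the method used in the paper. It assumes that a model's total storage is the month-by-month sum of its component stores, that anomalies are taken about the time mean, and that damping is judged by a simple peak-to-peak amplitude ratio. All function names and all numerical values (in mm) are hypothetical and invented for illustration; they are not results from the study.

```python
from statistics import mean


def total_storage(components):
    """Sum the component stores month by month (assumed definition)."""
    return [sum(values) for values in zip(*components.values())]


def storage_anomaly(series):
    """Remove the time mean so the series describes storage change."""
    m = mean(series)
    return [x - m for x in series]


def amplitude_ratio(model, observed):
    """Peak-to-peak range of the model divided by that of the observations.

    A value below 1 indicates a damped model response.
    """
    return (max(model) - min(model)) / (max(observed) - min(observed))


# Hypothetical monthly stores in mm, invented for illustration only.
components = {
    "soil_moisture": [120.0, 110.0, 95.0, 90.0, 100.0, 115.0],
    "ground_water": [300.0, 298.0, 295.0, 293.0, 294.0, 297.0],
    "surface_water": [20.0, 18.0, 15.0, 14.0, 16.0, 19.0],
    "snow_ice": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
# Hypothetical observed storage anomaly in mm, standing in for GRACE.
observed_anomaly = [30.0, 15.0, -20.0, -35.0, -10.0, 20.0]

model_anomaly = storage_anomaly(total_storage(components))
ratio = amplitude_ratio(model_anomaly, observed_anomaly)
print(f"model/observed amplitude ratio: {ratio:.2f}")
```

A model that omits a store, for example by dropping the hypothetical ground_water entry, would give a different ratio, which is one reason the abstract cautions that missing processes complicate a rigorous assessment.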