INTRODUCTION
Several years after an acute poliomyelitis infection, patients may experience new symptoms such as pain, muscle weakness, muscle fatigue as well as general fatigue, which are commonly referred to as late effects of polio (LEoP) or postpolio syndrome [
1]. Among these symptoms, fatigue is often reported as the most disabling [
2-
6] and chronic challenge [
7]. Fatigue has been defined as “an overwhelming sense of tiredness, lack of energy and feeling of exhaustion” [
8]. As fatigue is negatively associated with mobility [
4], quality of life [
9,
10] and life satisfaction [
5], it is important to evaluate fatigue and to plan appropriate interventions that reduce its impact.
As fatigue is mainly a subjective experience, self-report rating scales are used to assess it. The three scales most commonly used to assess fatigue in persons with LEoP are the Fatigue Severity Scale (FSS) [
11], the Fatigue Impact Scale (FIS) [
12], and the Multidimensional Fatigue Inventory (MFI-20) [
13]. To facilitate accurate assessments, the self-report rating scales need to be psychometrically sound. The validity and reliability of FSS, FIS, and MFI-20 have been studied in persons with LEoP [
14-
19], but comprehensive analyses of other psychometric properties of the scales are unavailable. Factors such as the number of missing item responses, the score distribution, any skewness, and floor and ceiling effects are important to evaluate. Moreover, these scales are considered to assess a similar underlying construct, i.e., fatigue. However, no study has explored their convergent validity in terms of the relationship between the scales. To enhance our understanding and support the choice of scales in clinical research, further evaluations of their psychometric properties are required.
The aim of this study was to evaluate the psychometric properties of the FSS, FIS, and MFI-20 in persons with LEoP. More specifically, we explored the data completeness, scaling assumptions, targeting, reliability, and convergent validity. Our hypothesis was that all scales are psychometrically sound and have high convergent validity.
RESULTS
Of the 77 potential participants who received the first postal survey (t1), 14 did not respond and 2 explicitly declined to participate. Thus, a total of 61 persons (54% women; age, 68±5 years; range, 55–75 years) responded at t1 (response rate 79%). Their age at the acute poliomyelitis infection was 5±3 years (min–max, 1–14 years), and the number of years before the onset of LEoP was 44±8 years (min–max, 30–60 years).
A majority used lower limb orthotics (61%) and outdoor mobility devices (52%), and 75% were able to walk more than 100 m. Their SIPP score was 26±7 (min–max, 15–46), which indicates that they were moderately bothered by various LEoP-related impairments. The majority of the participants (n=46; 75%) reported comorbidities, e.g., cardiovascular disorders (n=31), diabetes (n=9), gastrointestinal disorders (n=7) and sleep apnea (n=7, including 5 using a night-time ventilator). Most participants (n=49; 80%) were treated with medication, mostly for hypertension (n=31), musculoskeletal pain (n=20), sleep disturbances (n=12), depression (n=4) and thyroid disease (n=4).
Of the 61 participants who responded to the survey at t1, 56 also responded to the second survey (t2) and thereby constitute the sample for the test-retest reliability analysis.
Data completeness
Data completeness for FSS was 100%: there were no missing item responses, and FSS total scores could be computed for all 61 participants. The rate of missing item responses for FIS was 0.5%, and FIS total scores could be computed for 51 participants. The MFI-20 showed a 0.4% missing item response rate, and total scores could be computed for 58 participants. Data completeness of the three fatigue rating scales is presented in
Tables 1–
3.
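The two completeness figures reported above (the percentage of missing item responses and the number of respondents with a computable total score) can be illustrated with a minimal sketch. It assumes that a total score is computed only when every item is answered, i.e., no imputation. The function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence, Tuple


def data_completeness(
    responses: Sequence[Sequence[Optional[int]]],
) -> Tuple[float, int]:
    """Return (percent of missing item responses, respondents with a total score).

    Assumption: a total score exists only for respondents with no missing item.
    """
    n_cells = sum(len(row) for row in responses)
    n_missing = sum(1 for row in responses for value in row if value is None)
    n_with_total = sum(
        1 for row in responses if all(value is not None for value in row)
    )
    return 100.0 * n_missing / n_cells, n_with_total


# Hypothetical toy data: 4 respondents, 5 items; None marks a skipped item.
toy = [
    [1, 2, 3, 2, 1],
    [0, None, 2, 2, 3],
    [4, 4, 3, 4, 4],
    [1, 1, None, 0, 2],
]
missing_pct, n_totals = data_completeness(toy)
print(f"missing item responses: {missing_pct:.1f}%, total scores: {n_totals}")
```

The sketch also shows why a long scale can lose many total scores at a low missing-response rate: a single skipped item is enough to exclude a respondent.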
Scaling assumptions
All three fatigue rating scales showed corrected item-total correlations exceeding 0.40; item means and SDs for the three scales are presented in
Tables 1–
3.
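As an illustration of the criterion above, a corrected item-total correlation can be sketched as the correlation between each item and the sum of the remaining items. The sketch assumes complete data and a Pearson coefficient; the function names and toy data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses: Sequence[Sequence[float]]) -> List[float]:
    """Correlate each item with the total of all OTHER items (the 'correction')."""
    n_items = len(responses[0])
    correlations = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total without item j
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical toy data: 4 respondents, 3 items.
toy = [[1, 2, 1], [3, 3, 2], [5, 4, 4], [6, 7, 6]]
for number, r in enumerate(corrected_item_total(toy), start=1):
    print(f"item {number}: r = {r:.2f}, meets >0.40: {r > 0.40}")
```

Removing the item from the total avoids inflating the correlation by correlating the item with itself.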
The mean scores and SDs for FSS were roughly parallel for all items; the mean scores ranged from 3.7 to 5.1 and SDs ranged from 1.7 to 2.3.
The SDs for FIS remained roughly parallel for all items, whereas the mean scores varied more across items. SDs ranged from 1.0 to 1.3 and mean scores ranged from 0.8 to 2.3. Twelve items (#6, 9, 12, 18, 21, 26, 28, 30, 32–34, and 39) had more respondents who selected the lower response options (indicating less fatigue), resulting in a mean score that was >20% lower than the average item mean score (i.e., 1.4). Ten items (#1, 3, 13, 14, 17, 23, 24, 31, 37, and 38) had more respondents who selected the higher response options (indicating more fatigue), resulting in a mean score that was >20% higher than the average item mean score.
The SDs for MFI-20 remained roughly parallel for all items, whereas the mean scores varied more across items. SDs ranged from 1.1 to 1.4 and mean scores ranged from 2.1 to 3.9. Four items (#4, 7, 9, and 18) had more respondents who selected the lower response options (indicating less fatigue), resulting in a mean score that was >20% lower than the average item mean score (i.e., 3.0). Three items (#8, 16, and 20) had more respondents who selected the higher response options (indicating more fatigue), resulting in a mean score that was >20% higher than the average item mean score.
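The 20% rule used above can be sketched as follows. The sketch assumes that ">20% lower" means an item mean below 0.8 times the average item mean and ">20% higher" means an item mean above 1.2 times that average; with an average item mean of 3.0, the cut-offs would be 2.4 and 3.6. The function name and the item means in the example are hypothetical.

```python
from typing import List, Sequence, Tuple


def flag_deviating_items(
    item_means: Sequence[float], tolerance: float = 0.20
) -> Tuple[float, List[int], List[int]]:
    """Return (average item mean, item numbers flagged low, item numbers flagged high).

    Assumption: an item is flagged when its mean deviates from the average
    item mean by more than `tolerance` (as a proportion of that average).
    Item numbers are 1-based.
    """
    average = sum(item_means) / len(item_means)
    low = [i + 1 for i, m in enumerate(item_means) if m < average * (1 - tolerance)]
    high = [i + 1 for i, m in enumerate(item_means) if m > average * (1 + tolerance)]
    return average, low, high


# Hypothetical item means for a 5-item scale.
example_means = [3.0, 2.1, 3.9, 3.0, 3.0]
average, low_items, high_items = flag_deviating_items(example_means)
print(f"average item mean: {average:.1f}, low: {low_items}, high: {high_items}")
```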
Targeting
FSS and FIS total scale scores ranged across almost their full possible scoring ranges. The MFI-20 total scores ranged from 27.3 to 93.0 (possible scoring range, 20–100), which suggests that only 82% of the possible scoring range was used. Mean scores for the three fatigue rating scales were fairly close to the scale midpoints (within 1 SD). Floor and ceiling effects were substantially below 20% and skewness was less than ±1 for all three scales (
Table 4).
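The targeting indicators above can be sketched in a few lines. The range calculation reproduces the 82% reported for the MFI-20 transformed score. The skewness function uses the population moment coefficient, which is an assumption: statistical packages often apply a small-sample adjustment. Function names and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def range_coverage(obs_min: float, obs_max: float,
                   possible_min: float, possible_max: float) -> float:
    """Percent of the possible scoring range spanned by the observed scores."""
    return 100.0 * (obs_max - obs_min) / (possible_max - possible_min)


def floor_ceiling(scores: Sequence[float],
                  possible_min: float, possible_max: float) -> Tuple[float, float]:
    """Percent of respondents at the lowest and at the highest possible score."""
    n = len(scores)
    floor = 100.0 * sum(1 for s in scores if s == possible_min) / n
    ceiling = 100.0 * sum(1 for s in scores if s == possible_max) / n
    return floor, ceiling


def skewness(scores: Sequence[float]) -> float:
    """Population moment coefficient of skewness (no small-sample adjustment)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    return m3 / m2 ** 1.5


# MFI-20 transformed score: observed 27.3 to 93.0 on a possible range of 20 to 100.
coverage = range_coverage(27.3, 93.0, 20, 100)
print(f"range used: {coverage:.0f}%")

# Hypothetical toy total scores on a 20-100 scale.
toy_scores = [20, 35, 50, 62, 71, 100, 100, 48]
print(floor_ceiling(toy_scores, 20, 100), round(skewness(toy_scores), 2))
```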
Reliability
Cronbach’s α was 0.96 for FSS, 0.99 for FIS, and 0.95 for MFI-20. The results of the test-retest reliability analyses of the three fatigue rating scales are presented in
Table 5. All three scales obtained ICC values ≥0.80 and one scale (FIS) yielded an ICC value of 0.90. SEM% ranged from 7% to 10% and SDD% ranged from 20% to 28%; both were highest (i.e., worst) for FSS. The 95% CI around đ included 0 for all three fatigue scales.
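For readers who want to see how an internal consistency coefficient of this kind is obtained, the following minimal sketch computes Cronbach's α from an item-response matrix using the standard variance formula. It assumes complete data (no missing items); the function name and toy data are hypothetical and do not come from the study.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    Rows are respondents, columns are items; all responses must be present.
    """
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: 5 respondents, 4 items on a 1-7 response scale.
toy = [
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [6, 5, 6, 7],
    [1, 2, 1, 2],
    [7, 6, 7, 6],
]
alpha = cronbach_alpha(toy)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because the sum of item variances grows more slowly than the total variance when items are positively correlated, adding further correlated items tends to push α upward.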
Convergent validity
There were significant correlations (rs) between all three fatigue rating scales. The correlation between FSS and FIS was 0.80. The correlations with FSS and FIS total scores for the MFI-20 raw score were 0.79 and 0.80 (p<0.001); the corresponding correlations for the MFI-20 transformed score were 0.47 and 0.49, respectively.
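A rank correlation of the kind reported above can be sketched as a Pearson correlation computed on ranks, with tied values receiving their average rank. This is a generic illustration of Spearman's coefficient, not the authors' analysis code; the function names and total scores are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores on two scales for six respondents.
scale_a = [12, 30, 45, 45, 51, 60]
scale_b = [40, 55, 90, 70, 100, 120]
print(f"rs = {spearman(scale_a, scale_b):.2f}")
```

Because only ranks enter the calculation, any strictly increasing re-expression of a scale score leaves the coefficient unchanged.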
DISCUSSION
Understanding the psychometric properties of self-rating scales is a basic, albeit important, starting point when selecting a scale in clinical research. Over the past decade, various strategies have been used to evaluate the psychometric properties of self-rating scales. This study is, to the best of our knowledge, the first that includes a comprehensive psychometric evaluation and head-to-head comparison of three fatigue rating scales (FSS, FIS, and MFI-20) in individuals with LEoP.
In summary, our results show that all rating scales displayed acceptable psychometric properties in terms of data completeness, scaling assumptions, targeting and reliability, and high convergent validity. Previous studies have explored various validity and/or reliability aspects of FSS [
14-
18], FIS [
14,
15] and MFI-20 [
19] in subjects with LEoP by using traditional psychometrics or Rasch analysis. However, the lack of studies evaluating other psychometric aspects limits an adequate comparison of our results with previous studies.
Data completeness was excellent for the three scales, without any missing item responses in FSS and only 0.4%–0.5% missing item responses in MFI-20 and FIS. However, because of the large number of items in FIS (n=40) and the decision not to impute missing responses, 10 participants (16%) lacked a FIS total score owing to one or more missing item responses. The developer of FIS states that imputation can be used in cases with less than 10% missing responses [
36]. However, imputation rests on assumptions about how a participant would have responded to the missing items, which may pose a different challenge than the items that were answered (and which commonly form the basis for the imputation) [
27], rendering imputation unreliable. Thus, scales with fewer items may be preferable to more extensive scales, as scale length may affect the number of dropouts. In addition, the time needed to respond is another factor in the selection of a fatigue rating scale.
The items of FSS were roughly parallel in terms of mean scores and SDs, whereas FIS and MFI-20 contained several items that were rated lower or higher (i.e., indicating less or more fatigue) than the other items. In fact, 12 of the 40 FIS items were rated lower and another 10 items were rated higher than the other items. In MFI-20, 4 of the 20 items were rated lower and another 3 items were rated higher than the other items. Items within a rating scale are supposed to be “roughly parallel” for total scores to be legitimate [
27,
28]. However, no guideline is available that defines the limits of parallel items. Item SDs were roughly parallel in all three scales and corrected item-total correlations fulfilled the criterion >0.4, which supports the use of total scores. Moreover, a previous Rasch analysis of MFI-20 has confirmed its unidimensionality and the use of a total score [
19]. Conversely, a previous Rasch analysis of FSS concluded that a simplified version of the scale (without the first item and with 3 response categories instead of the original 7) is more psychometrically sound than the original scale [
18]. Taken together, further studies of these commonly used fatigue scales are required in order to fully establish their construct validity.
All three rating scales appear to be well targeted, with very small floor and ceiling effects, indicating that the scales can be used to detect changes in fatigue levels in individuals with LEoP. The transformed (Rasch-analyzed) MFI-20 score did not span the full range of possible scale scores (scoring range, 20–100; actual scoring range, 27.3–93.0), which implies that 18% of the scoring range was not used by any participant. However, the corresponding raw scores ranged from 21 to 99 [
19], indicating that our sample covered almost the full possible scoring range.
Reliability coefficients were acceptable for the three scales with Cronbach’s α well above the recommended limit of 0.7 [
32] consistent with previous studies of Cronbach’s α for FSS and FIS [
14,
16-
18]. It is also in agreement with a previous Rasch analysis of MFI-20, which reported the scale’s person separation index, considered to be equivalent to Cronbach’s α [
19]. FIS yielded the highest Cronbach’s α (=0.99), suggesting redundant items as Cronbach’s α is strongly affected by the length of a rating scale [
37]. However, a previous study reported a Cronbach’s α of 0.82 for FIS [
14], which contradicts this observation.
All three rating scales yielded ICC values >0.70, indicating very good test-retest reliability [
21], and can therefore be used to assess fatigue at a group level [
32]. Only FIS yielded an ICC of 0.90, which is the lower limit of a rating scale for individual comparisons [
32]. Previous studies have reported an ICC value of 0.91 for FIS [
14] and ICC values for FSS ranging from 0.80 to 0.97 [
14,
16,
17]. To the best of our knowledge, no previous study has reported ICC values for MFI-20.
FIS and MFI-20 yielded identical SEM% and SDD%, which implies that a change in scale score in either FIS or MFI-20 greater than 7% of the possible scoring range indicates a real change (above measurement error) on a group level. Correspondingly, a change in the scale score of more than 20% of the possible scoring range indicates a real change (above measurement error) in an individual. The corresponding values for FSS are 10% at the group level (i.e., SEM%) and 28% in an individual (i.e., SDD%).
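The relation between the two error indices can be made explicit with a short sketch. It assumes the conventional definition of the smallest detectable difference, SDD = 1.96 × √2 × SEM, where √2 accounts for measurement error on both test occasions; whether this exact formula was applied here is an assumption. The function name is hypothetical.

```python
from math import sqrt


def sdd_from_sem(sem: float, z: float = 1.96) -> float:
    """Smallest detectable difference from the standard error of measurement.

    Assumed conventional formula: SDD = z * sqrt(2) * SEM, with z = 1.96
    for a 95% confidence level.
    """
    return z * sqrt(2) * sem


# Rounded SEM% values as reported for the three scales.
for label, sem_pct in [("FIS and MFI-20", 7.0), ("FSS", 10.0)]:
    print(f"{label}: SEM% = {sem_pct:.0f}, SDD% = {sdd_from_sem(sem_pct):.1f}")
```

Under this assumption, the rounded SEM% values give SDD% values of roughly 19% and 28%; the small gap to the reported 20% would be expected if the published figures were derived from unrounded SEM values.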
No systematic differences were detected between the two test occasions in any of the rating scales (the 95% CI around đ included 0 for all three rating scales), indicating the absence of learning effects.
The correlations between the three scales were high (rs=0.79–0.80) based on the raw scores, indicating high convergent validity. However, when using the transformed scores for MFI-20 the correlations were lower, most likely as a result of the lower score distribution between the measures. FSS and FIS are aimed at evaluating the impact of fatigue on daily living [
11,
12]. MFI-20 is intended to assess fatigue ‘as experienced by patients’ [
13]. These constructs appear similar, and our findings suggest that they can be used interchangeably.
Several psychometric properties were similar among the three fatigue rating scales. The differences between them were mainly related to the number of items in the scales: a higher number of items was accompanied by an increased number of missing responses and signs of item redundancy. Thus, the number of items in a fatigue rating scale is a central factor in determining the choice of scale in clinical investigations of persons with LEoP.
In clinical practice and in previous studies, the total cumulative scores of the three fatigue rating scales have been used, even though all three are ordinal scales. Future studies evaluating the construct validity and the unidimensionality of the FIS, using the Rasch method, are needed to determine if the extra 31 items in the FIS are necessary compared with scales carrying fewer items.
Our results are in many ways similar to studies of other neurological conditions. A recent systematic review summarized the psychometric properties (validity and reliability) and clinical utility (ability to detect change) of several fatigue rating scales [
38]. The scales were evaluated among people with multiple sclerosis, spinal cord injury, acquired brain injury and Parkinson disease. Overall, the FSS and FIS showed good to excellent reliability (internal consistency and/or test-retest reliability), and acceptable validity and scaling structure with no floor and ceiling effects. The authors suggested that a fatigue measure effective in one condition is not necessarily appropriate for use with another [
38]. Therefore, a comprehensive evaluation of the psychometric properties of rating scales for specific conditions is required.
The head-to-head comparison of three commonly used fatigue rating scales using a comprehensive set of analyses is one of the strengths of the study. Furthermore, the high response rate yielded a ‘good sample size’ for all the analyses, according to the general recommendations [
20,
21]. The study sample included subjects who were in general moderately bothered by LEoP-related impairments, and the results might vary in persons with a more severe disability. Thus, the inferences of the study should be restricted to patients with moderate LEoP.
The results of this head-to-head comparison suggest that the FSS, FIS, and MFI-20 exhibit sound psychometric properties in terms of data completeness, scaling assumptions, targeting, reliability, and high convergent validity. These results support our hypothesis and indicate that these three scales can be used to assess fatigue in persons with LEoP. However, a scale with fewer items, such as FSS, can be completed more quickly than a scale with many items, and the risk of missing responses is reduced. Given the similarities and differences between these three scales, the choice of fatigue rating scale in clinical research depends on the research question and the study design.